MA40189 - Solution Sheet Four
Simon Shaw, [email protected]
http://people.bath.ac.uk/masss/ma40189.html
2016/17 Semester II
1. Let X1 , . . . , Xn be exchangeable so that the Xi are conditionally independent
given a parameter θ.
(a) Let Xi | θ ∼ Bern(θ).
i. Show that f (xi | θ) belongs to the 1-parameter exponential family
and for X = (X1 , . . . , Xn ) state the sufficient statistic for learning
about θ.
Notice that we can write
\[
f(x_i \mid \theta) \;=\; \theta^{x_i}(1-\theta)^{1-x_i}
\;=\; \exp\left\{\left(\log\frac{\theta}{1-\theta}\right)x_i + \log(1-\theta)\right\}
\]
so that f(xi | θ) belongs to the 1-parameter exponential family with φ1(θ) = log(θ/(1 − θ)), u1(xi) = xi, g(θ) = log(1 − θ) and h(xi) = 0. Notice that, from Proposition 1 (see Lecture 11), tn = [n, Σ_{i=1}^n Xi] is a sufficient statistic.
ii. By viewing the likelihood as a function of θ, which generic family
of distributions (over θ) is the likelihood a kernel of?
The likelihood, without expressing it in the explicit exponential family form, is
\[
f(x \mid \theta) \;=\; \theta^{n\bar{x}}(1-\theta)^{n-n\bar{x}}
\]
which, viewed as a function of θ, we immediately recognise as a Beta kernel (in particular, a Beta(nx̄ + 1, n − nx̄ + 1)).
iii. By first finding the corresponding posterior distribution for θ given
x = (x1 , . . . , xn ), show that this family of distributions is conjugate
with respect to the likelihood f (x | θ).
Taking θ ∼ Beta(α, β) we have that
\[
f(\theta \mid x) \;\propto\; \theta^{n\bar{x}}(1-\theta)^{n-n\bar{x}} \times \theta^{\alpha-1}(1-\theta)^{\beta-1}
\;=\; \theta^{\alpha+n\bar{x}-1}(1-\theta)^{\beta+n-n\bar{x}-1}
\]
so that θ | x ∼ Beta(α + nx̄, β + n − nx̄). Thus, the prior and the posterior are in the same family, giving conjugacy.
Deriving the results directly from exponential family representation
Expressed in the 1-parameter exponential family form the likelihood is
\[
f(x \mid \theta) \;=\; \exp\left\{\left(\log\frac{\theta}{1-\theta}\right)\sum_{i=1}^{n}x_i + n\log(1-\theta)\right\}
\]
from which we immediately observe the sufficient statistic tn = [n, Σ_{i=1}^n xi].
Viewing f(x | θ) as a function of θ, the natural conjugate prior is a member of the 2-parameter exponential family of the form
\[
f(\theta) \;=\; \exp\left\{a\log\frac{\theta}{1-\theta} + d\log(1-\theta) + c(a,d)\right\}
\]
where c(a, d) is the normalising constant. Hence,
\[
f(\theta) \;\propto\; \exp\left\{a\log\frac{\theta}{1-\theta} + d\log(1-\theta)\right\} \;=\; \theta^{a}(1-\theta)^{d-a}
\]
which we recognise as a kernel of a Beta distribution. The convention is to label the hyperparameters as α and β so that we put α = α(a, d) = a + 1 and β = β(a, d) = d − a + 1 (equivalently, a = a(α, β) = α − 1, d = d(α, β) = α + β − 2). The conjugate prior distribution is θ ∼ Beta(α, β).
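As a quick numerical illustration (not part of the original sheet; the data and hyperparameters below are illustrative assumptions), the following Python sketch checks the Beta-Bernoulli update by comparing the normalised product of likelihood and prior against the Beta(α + nx̄, β + n − nx̄) density.

import numpy as np
from scipy import stats
from scipy.integrate import trapezoid

rng = np.random.default_rng(0)
alpha, beta = 2.0, 3.0                    # assumed prior hyperparameters
x = rng.binomial(1, 0.7, size=50)         # simulated Bernoulli data
n, s = len(x), int(x.sum())               # sufficient statistic t_n = [n, sum x_i]

posterior = stats.beta(alpha + s, beta + n - s)   # Beta(alpha + n*xbar, beta + n - n*xbar)

# The normalised likelihood-times-prior should match the Beta posterior density.
theta = np.linspace(0.001, 0.999, 2001)
unnorm = theta**s * (1 - theta)**(n - s) * stats.beta(alpha, beta).pdf(theta)
assert np.allclose(unnorm / trapezoid(unnorm, theta), posterior.pdf(theta), atol=1e-3)
print("posterior parameters:", posterior.args)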
(b) Let Xi | θ ∼ N (µ, θ) with µ known.
i. Show that f (xi | θ) belongs to the 1-parameter exponential family
and for X = (X1 , . . . , Xn ) state the sufficient statistic for learning
about θ.
Writing the normal density as an exponential family (parameter θ, as µ is a known constant) we have
\[
f(x_i \mid \theta) \;=\; \exp\left\{-\frac{1}{2\theta}(x_i-\mu)^2 - \frac{1}{2}\log\theta - \log\sqrt{2\pi}\right\}
\]
so that f(xi | θ) belongs to the 1-parameter exponential family. The sufficient statistic is tn = [n, Σ_{i=1}^n (xi − µ)^2]. Note that, expressed explicitly as a 1-parameter exponential family, the likelihood for x = (x1, . . . , xn) is
\[
f(x \mid \theta) \;=\; \exp\left\{-\frac{1}{2\theta}\sum_{i=1}^{n}(x_i-\mu)^2 - \frac{n}{2}\log\theta - n\log\sqrt{2\pi}\right\}
\]
so that the natural conjugate prior has the form
\[
f(\theta) \;=\; \exp\left\{-a\,\frac{1}{\theta} - d\log\theta + c(a,d)\right\}
\;\propto\; \theta^{-d}\exp\left\{-\frac{a}{\theta}\right\}
\]
which we recognise as a kernel of an Inverse-Gamma distribution.
ii. By viewing the likelihood as a function of θ, which generic family
of distributions (over θ) is the likelihood a kernel of?
In conventional form,
\[
f(x \mid \theta) \;\propto\; \theta^{-\frac{n}{2}}\exp\left\{-\frac{1}{2\theta}\sum_{i=1}^{n}(x_i-\mu)^2\right\}
\]
which, viewing f(x | θ) as a function of θ, we recognise as a kernel of an Inverse-Gamma distribution (in particular, an Inv-gamma((n − 2)/2, (1/2) Σ_{i=1}^n (xi − µ)^2)).
iii. By first finding the corresponding posterior distribution for θ given
x = (x1 , . . . , xn ), show that this family of distributions is conjugate
with respect to the likelihood f (x | θ).
Taking θ ∼ Inv-gamma(α, β) we have
\begin{align*}
f(\theta \mid x) &\propto \theta^{-\frac{n}{2}}\exp\left\{-\frac{1}{2\theta}\sum_{i=1}^{n}(x_i-\mu)^2\right\} \times \theta^{-(\alpha+1)}\exp\left\{-\frac{\beta}{\theta}\right\} \\
&= \theta^{-(\alpha+\frac{n}{2}+1)}\exp\left\{-\left(\beta + \frac{1}{2}\sum_{i=1}^{n}(x_i-\mu)^2\right)\frac{1}{\theta}\right\}
\end{align*}
which we recognise as a kernel of an Inverse-Gamma distribution so that θ | x ∼ Inv-gamma(α + n/2, β + (1/2) Σ_{i=1}^n (xi − µ)^2). Hence, the prior and posterior are in the same family, giving conjugacy.
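The same grid-based check works here; the sketch below (illustrative data and hyperparameters, using scipy's invgamma with shape α and scale β, which matches the Inv-gamma(α, β) parameterisation above) confirms the Inv-gamma(α + n/2, β + (1/2)Σ(xi − µ)^2) posterior.

import numpy as np
from scipy import stats
from scipy.integrate import trapezoid

rng = np.random.default_rng(1)
mu, theta_true = 5.0, 4.0                  # mu known; theta is the variance
alpha, beta = 3.0, 2.0                     # assumed prior hyperparameters
x = rng.normal(mu, np.sqrt(theta_true), size=40)
n, ss = len(x), np.sum((x - mu)**2)        # sufficient statistic [n, sum (x_i - mu)^2]

posterior = stats.invgamma(alpha + n / 2, scale=beta + ss / 2)

theta = np.linspace(0.5, 20, 2000)
unnorm = theta**(-n / 2) * np.exp(-ss / (2 * theta)) * stats.invgamma(alpha, scale=beta).pdf(theta)
assert np.allclose(unnorm / trapezoid(unnorm, theta), posterior.pdf(theta), atol=1e-3)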
(c) Let Xi | θ ∼ Maxwell(θ), the Maxwell distribution with parameter θ so that
\[
f(x_i \mid \theta) \;=\; \left(\frac{2}{\pi}\right)^{\frac{1}{2}}\theta^{\frac{3}{2}}x_i^2\exp\left\{-\frac{\theta x_i^2}{2}\right\}, \quad x_i > 0
\]
and E(Xi | θ) = 2√(2/(πθ)), Var(Xi | θ) = (3π − 8)/(πθ).
i. Show that f (xi | θ) belongs to the 1-parameter exponential family
and for X = (X1 , . . . , Xn ) state the sufficient statistic for learning
about θ.
Writing the Maxwell density in exponential family form we have
\[
f(x_i \mid \theta) \;=\; \exp\left\{-\theta\,\frac{x_i^2}{2} + \frac{3}{2}\log\theta + \log x_i^2 + \frac{1}{2}\log\frac{2}{\pi}\right\}
\]
so that f(xi | θ) belongs to the 1-parameter exponential family. The sufficient statistic is tn = [n, Σ_{i=1}^n xi^2]. Note that, expressed explicitly as a 1-parameter exponential family, the likelihood for x = (x1, . . . , xn) is
\[
f(x \mid \theta) \;=\; \exp\left\{-\theta\sum_{i=1}^{n}\frac{x_i^2}{2} + \frac{3n}{2}\log\theta + \sum_{i=1}^{n}\log x_i^2 + \frac{n}{2}\log\frac{2}{\pi}\right\}
\]
so that the natural conjugate prior has the form
\[
f(\theta) \;=\; \exp\left\{-a\theta + d\log\theta + c(a,d)\right\} \;\propto\; \theta^{d}e^{-a\theta}
\]
which we recognise as a kernel of a Gamma distribution.
ii. By viewing the likelihood as a function of θ, which generic family
of distributions (over θ) is the likelihood a kernel of?
In conventional form,
\[
f(x \mid \theta) \;=\; \left(\frac{2}{\pi}\right)^{\frac{n}{2}}\theta^{\frac{3n}{2}}\left(\prod_{i=1}^{n}x_i^2\right)\exp\left\{-\frac{\sum_{i=1}^{n}x_i^2}{2}\,\theta\right\}
\;\propto\; \theta^{\frac{3n}{2}}\exp\left\{-\frac{\sum_{i=1}^{n}x_i^2}{2}\,\theta\right\}
\]
which, viewing f(x | θ) as a function of θ, we recognise as a kernel of a Gamma distribution (in particular, Gamma((3n + 2)/2, (1/2) Σ_{i=1}^n xi^2)).
iii. By first finding the corresponding posterior distribution for θ given
x = (x1 , . . . , xn ), show that this family of distributions is conjugate
with respect to the likelihood f (x | θ).
Taking θ ∼ Gamma(α, β) we have
\begin{align*}
f(\theta \mid x) &\propto \theta^{\frac{3n}{2}}\exp\left\{-\frac{\sum_{i=1}^{n}x_i^2}{2}\,\theta\right\} \times \theta^{\alpha-1}e^{-\beta\theta} \\
&= \theta^{\alpha+\frac{3n}{2}-1}\exp\left\{-\left(\beta + \frac{1}{2}\sum_{i=1}^{n}x_i^2\right)\theta\right\}
\end{align*}
which, of course, is a kernel of a Gamma distribution so that θ | x ∼ Gamma(α + 3n/2, β + (1/2) Σ_{i=1}^n xi^2). The prior and the posterior are in the same family, giving conjugacy.
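This update can also be checked numerically. In the sketch below (illustrative values, not from the sheet) note that scipy's maxwell uses a scale parameter a, related to the parameterisation here by θ = 1/a²; scipy's gamma takes scale = 1/rate, so Gamma(α, β) becomes stats.gamma(alpha, scale=1/beta).

import numpy as np
from scipy import stats
from scipy.integrate import trapezoid

rng = np.random.default_rng(2)
theta_true = 2.0
alpha, beta = 2.0, 1.0                     # assumed prior hyperparameters
x = stats.maxwell.rvs(scale=1 / np.sqrt(theta_true), size=30, random_state=rng)
n, ss = len(x), np.sum(x**2)               # sufficient statistic [n, sum x_i^2]

posterior = stats.gamma(alpha + 3 * n / 2, scale=1 / (beta + ss / 2))

theta = np.linspace(0.2, 6, 2000)
unnorm = theta**(3 * n / 2) * np.exp(-ss * theta / 2) * stats.gamma(alpha, scale=1 / beta).pdf(theta)
assert np.allclose(unnorm / trapezoid(unnorm, theta), posterior.pdf(theta), atol=1e-3)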
2. Let X1 , . . . , Xn be exchangeable so that the Xi are conditionally independent given a parameter θ. Suppose that Xi | θ is geometrically distributed with probability mass function
\[
f(x_i \mid \theta) \;=\; (1-\theta)^{x_i-1}\theta, \quad x_i = 1, 2, \ldots.
\]
(a) Show that f (x | θ), where x = (x1 , . . . , xn ), belongs to the 1-parameter
exponential family. Hence, or otherwise, find the conjugate prior distribution and corresponding posterior distribution for θ.
As the Xi are exchangeable then
\[
f(x \mid \theta) \;=\; \prod_{i=1}^{n}f(x_i \mid \theta) \;=\; \prod_{i=1}^{n}(1-\theta)^{x_i-1}\theta
\;=\; (1-\theta)^{n\bar{x}-n}\theta^{n}
\;=\; \exp\left\{(n\bar{x}-n)\log(1-\theta) + n\log\theta\right\}
\]
and so belongs to the 1-parameter exponential family. The conjugate prior is of the form
\[
f(\theta) \;\propto\; \exp\left\{a\log(1-\theta) + b\log\theta\right\} \;=\; \theta^{b}(1-\theta)^{a}
\]
which is a kernel of a Beta distribution. Letting α = b + 1, β = a + 1 then we have θ ∼ Beta(α, β). Hence,
\[
f(\theta \mid x) \;\propto\; f(x \mid \theta)f(\theta) \;\propto\; \theta^{n}(1-\theta)^{n\bar{x}-n}\theta^{\alpha-1}(1-\theta)^{\beta-1}
\]
which is a kernel of a Beta(α + n, β + nx̄ − n) so that θ | x ∼ Beta(α + n, β + nx̄ − n).
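The update has a simple closed form; the helper below is hypothetical (not from the sheet) and just restates it in Python.

def geometric_beta_update(alpha, beta, x):
    """Posterior hyperparameters for theta | x ~ Beta(alpha + n, beta + n*xbar - n)."""
    n, s = len(x), sum(x)
    return alpha + n, beta + s - n

# Example: a uniform Beta(1, 1) prior and three geometric observations.
print(geometric_beta_update(1, 1, [3, 1, 5]))   # -> (4, 7)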
(b) Show that the posterior mean for θ can be written as a weighted average of the prior mean of θ and the maximum likelihood estimate, x̄^{-1}.
We have
\begin{align*}
E(\theta \mid X) &= \frac{\alpha+n}{(\alpha+n)+(\beta+n\bar{x}-n)} \;=\; \frac{\alpha+n}{\alpha+\beta+n\bar{x}} \\
&= \left(\frac{\alpha+\beta}{\alpha+\beta+n\bar{x}}\right)\frac{\alpha}{\alpha+\beta} + \left(\frac{n\bar{x}}{\alpha+\beta+n\bar{x}}\right)\frac{1}{\bar{x}} \\
&= \lambda E(\theta) + (1-\lambda)\bar{x}^{-1}
\end{align*}
where λ = (α + β)/(α + β + nx̄).
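A one-off arithmetic check of this identity, with made-up values of α, β, n and x̄ (the identity is exact, so agreement is to machine precision):

alpha, beta, n, xbar = 2.0, 5.0, 20, 3.5
post_mean = (alpha + n) / (alpha + beta + n * xbar)
lam = (alpha + beta) / (alpha + beta + n * xbar)
assert abs(post_mean - (lam * alpha / (alpha + beta) + (1 - lam) / xbar)) < 1e-12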
(c) Suppose now that the prior for θ is instead given by the probability
density function
\[
f(\theta) \;=\; \frac{1}{2B(\alpha+1,\beta)}\theta^{\alpha}(1-\theta)^{\beta-1} + \frac{1}{2B(\alpha,\beta+1)}\theta^{\alpha-1}(1-\theta)^{\beta},
\]
where B(α, β) denotes the Beta function evaluated at α and β. Show
that the posterior probability density function can be written as
\[
f(\theta \mid x) \;=\; \lambda f_1(\theta) + (1-\lambda)f_2(\theta)
\]
where
\[
\lambda \;=\; \frac{(\alpha+n)\beta}{(\alpha+n)\beta + \left(\beta - n + \sum_{i=1}^{n}x_i\right)\alpha}
\]
and f1 (θ) and f2 (θ) are probability density functions.
We have
\begin{align*}
f(\theta \mid x) &\propto f(x \mid \theta)f(\theta) \\
&\propto \theta^{n}(1-\theta)^{n\bar{x}-n}\left\{\frac{\theta^{\alpha}(1-\theta)^{\beta-1}}{B(\alpha+1,\beta)} + \frac{\theta^{\alpha-1}(1-\theta)^{\beta}}{B(\alpha,\beta+1)}\right\} \\
&= \frac{\theta^{\alpha_1}(1-\theta)^{\beta_1-1}}{B(\alpha+1,\beta)} + \frac{\theta^{\alpha_1-1}(1-\theta)^{\beta_1}}{B(\alpha,\beta+1)}
\end{align*}
where α1 = α + n and β1 = β + nx̄ − n. Finding the constant of proportionality we observe that θ^{α1}(1 − θ)^{β1−1} is a kernel of a Beta(α1 + 1, β1) and θ^{α1−1}(1 − θ)^{β1} is a kernel of a Beta(α1, β1 + 1). So,
\[
f(\theta \mid x) \;=\; c\left\{\frac{B(\alpha_1+1,\beta_1)}{B(\alpha+1,\beta)}f_1(\theta) + \frac{B(\alpha_1,\beta_1+1)}{B(\alpha,\beta+1)}f_2(\theta)\right\}
\]
where f1(θ) is the density function of Beta(α1 + 1, β1) and f2(θ) the density function of Beta(α1, β1 + 1). Hence,
\[
c^{-1} \;=\; \frac{B(\alpha_1+1,\beta_1)}{B(\alpha+1,\beta)} + \frac{B(\alpha_1,\beta_1+1)}{B(\alpha,\beta+1)}
\]
so that f(θ | x) = λf1(θ) + (1 − λ)f2(θ) with
\begin{align*}
\lambda &= \frac{\frac{B(\alpha_1+1,\beta_1)}{B(\alpha+1,\beta)}}{\frac{B(\alpha_1+1,\beta_1)}{B(\alpha+1,\beta)} + \frac{B(\alpha_1,\beta_1+1)}{B(\alpha,\beta+1)}}
\;=\; \frac{\frac{\alpha_1(\alpha+\beta)B(\alpha_1,\beta_1)}{\alpha(\alpha_1+\beta_1)B(\alpha,\beta)}}{\frac{\alpha_1(\alpha+\beta)B(\alpha_1,\beta_1)}{\alpha(\alpha_1+\beta_1)B(\alpha,\beta)} + \frac{\beta_1(\alpha+\beta)B(\alpha_1,\beta_1)}{\beta(\alpha_1+\beta_1)B(\alpha,\beta)}} \\
&= \frac{\alpha_1\beta}{\alpha_1\beta + \beta_1\alpha}
\;=\; \frac{(\alpha+n)\beta}{(\alpha+n)\beta + \left(\beta + \sum_{i=1}^{n}x_i - n\right)\alpha},
\end{align*}
using the identities B(α + 1, β) = {α/(α + β)}B(α, β) and B(α, β + 1) = {β/(α + β)}B(α, β).
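The reduction of the Beta-function ratios to this closed form can be checked numerically; the sketch below (illustrative prior and data) evaluates λ both ways, using scipy's betaln for the log Beta function.

import numpy as np
from scipy.special import betaln

alpha, beta, x = 2.0, 3.0, [4, 1, 2, 6]    # illustrative prior and data
n, s = len(x), sum(x)
a1, b1 = alpha + n, beta + s - n           # alpha_1 and beta_1 as defined above

w1 = np.exp(betaln(a1 + 1, b1) - betaln(alpha + 1, beta))
w2 = np.exp(betaln(a1, b1 + 1) - betaln(alpha, beta + 1))
lam_beta = w1 / (w1 + w2)                  # lambda from the Beta-function ratios
lam_closed = (alpha + n) * beta / ((alpha + n) * beta + (beta + s - n) * alpha)
assert abs(lam_beta - lam_closed) < 1e-10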
3. Let X1 , . . . , Xn be exchangeable so that the Xi are conditionally independent given a parameter θ. Suppose that Xi | θ is distributed as a double-exponential distribution with probability density function
\[
f(x_i \mid \theta) \;=\; \frac{1}{2\theta}\exp\left\{-\frac{|x_i|}{\theta}\right\}, \quad -\infty < x_i < \infty
\]
for θ > 0.
(a) Find the conjugate prior distribution and corresponding posterior distribution for θ following observation of x = (x1 , . . . , xn ).
We have
\[
f(x \mid \theta) \;=\; \prod_{i=1}^{n}\frac{1}{2\theta}\exp\left\{-\frac{|x_i|}{\theta}\right\}
\;\propto\; \frac{1}{\theta^{n}}\exp\left\{-\frac{1}{\theta}\sum_{i=1}^{n}|x_i|\right\}
\]
which, when viewed as a function of θ, is a kernel of an Inv-gamma(n − 1, Σ_{i=1}^n |xi|). We thus take θ ∼ Inv-gamma(α, β) as the prior so that
\begin{align*}
f(\theta \mid x) &\propto \frac{1}{\theta^{n}}\exp\left\{-\frac{1}{\theta}\sum_{i=1}^{n}|x_i|\right\}\frac{1}{\theta^{\alpha+1}}\exp\left\{-\frac{\beta}{\theta}\right\} \\
&= \frac{1}{\theta^{\alpha+n+1}}\exp\left\{-\frac{1}{\theta}\left(\beta + \sum_{i=1}^{n}|x_i|\right)\right\}
\end{align*}
which is a kernel of an Inv-gamma(α + n, β + Σ_{i=1}^n |xi|). Thus, with respect to X | θ, the prior and posterior are in the same family, showing conjugacy, with θ | x ∼ Inv-gamma(α + n, β + Σ_{i=1}^n |xi|).
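Again the update is a one-liner in code; the helper below is hypothetical and simply restates the result.

def double_exponential_update(alpha, beta, x):
    """Posterior hyperparameters: theta | x ~ Inv-gamma(alpha + n, beta + sum |x_i|)."""
    return alpha + len(x), beta + sum(abs(xi) for xi in x)

print(double_exponential_update(2.0, 1.0, [-1.5, 0.3, 2.2]))   # -> (5.0, 5.0)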
(b) Consider the transformation φ = θ^{-1}. Find the posterior distribution of φ | x.
We have φ = g(θ) where g(θ) = θ^{-1} so that θ = g^{-1}(φ) = φ^{-1}. Transforming fθ(θ | x) to fφ(φ | x) we have
\begin{align*}
f_{\phi}(\phi \mid x) &= f_{\theta}(g^{-1}(\phi) \mid x)\left|\frac{\partial\theta}{\partial\phi}\right| \\
&\propto \phi^{\alpha+n+1}\exp\left\{-\phi\left(\beta + \sum_{i=1}^{n}|x_i|\right)\right\}\frac{1}{\phi^{2}} \\
&= \phi^{\alpha+n-1}\exp\left\{-\phi\left(\beta + \sum_{i=1}^{n}|x_i|\right)\right\}
\end{align*}
which is a kernel of a Gamma(α + n, β + Σ_{i=1}^n |xi|) distribution. That is, φ | x ∼ Gamma(α + n, β + Σ_{i=1}^n |xi|). The result highlights the relationship between the Gamma and Inverse-Gamma distributions shown in question 3(b)(i) of Question Sheet Two.
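This Gamma/Inverse-Gamma relationship is easy to confirm by simulation; in the sketch below the values of a and b are illustrative stand-ins for α + n and β + Σ|xi|, and reciprocals of Inverse-Gamma draws are compared against the Gamma(a, b) distribution with a Kolmogorov-Smirnov test.

import numpy as np
from scipy import stats

a, b = 5.0, 7.0                            # stand-ins for alpha + n and beta + sum |x_i|
rng = np.random.default_rng(3)
theta = stats.invgamma.rvs(a, scale=b, size=100_000, random_state=rng)
phi = 1.0 / theta                          # the transformation phi = 1/theta
ks = stats.kstest(phi, stats.gamma(a, scale=1 / b).cdf)
print(ks.pvalue)                           # a large p-value is consistent with Gamma(a, b)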
4. Let X1 , . . . , Xn be a finite subset of a sequence of infinitely exchangeable random quantities with joint density function
\[
f(x_1, \ldots, x_n) \;=\; n!\left(1 + \sum_{i=1}^{n}x_i\right)^{-(n+1)}.
\]
Show that they can be represented as conditionally independent and exponentially distributed.
Using de Finetti's Representation Theorem (Theorem 2 of the on-line notes), the joint distribution has an integral representation of the form
\[
f(x_1, \ldots, x_n) \;=\; \int_{\theta}\left\{\prod_{i=1}^{n}f(x_i \mid \theta)\right\}f(\theta)\,d\theta.
\]
If Xi | θ ∼ Exp(θ) then
\[
\prod_{i=1}^{n}f(x_i \mid \theta) \;=\; \prod_{i=1}^{n}\theta\exp(-\theta x_i) \;=\; \theta^{n}\exp\left(-\theta\sum_{i=1}^{n}x_i\right).
\]
Notice that, viewed as a function of θ, this looks like a kernel of a Gamma(n + 1, Σ_{i=1}^n xi). The result holds if we can find an f(θ) such that
\[
n!\left(1 + \sum_{i=1}^{n}x_i\right)^{-(n+1)} \;=\; \int_{\theta}\theta^{n}\exp\left(-\theta\sum_{i=1}^{n}x_i\right)f(\theta)\,d\theta.
\]
The left hand side looks like the normalising constant of a Gamma(n + 1, 1 + Σ_{i=1}^n xi) (as n! = Γ(n + 1)) and if f(θ) = exp(−θ) then the integrand on the right hand side is a kernel of a Gamma(n + 1, 1 + Σ_{i=1}^n xi). So, if θ ∼ Gamma(1, 1) then f(θ) = exp(−θ) and we have the desired representation.
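A Monte Carlo check of the representation, for an arbitrary (illustrative) data vector x: averaging θ^n exp(−θΣxi) over draws θ ∼ Exp(1) should recover n!(1 + Σxi)^{−(n+1)}.

import numpy as np
from math import factorial

rng = np.random.default_rng(4)
x = np.array([0.4, 1.2, 0.7])              # illustrative data
n, s = len(x), x.sum()

theta = rng.exponential(1.0, size=1_000_000)   # draws from f(theta) = exp(-theta)
mc = np.mean(theta**n * np.exp(-theta * s))    # Monte Carlo estimate of the integral
exact = factorial(n) * (1 + s)**(-(n + 1))
print(mc, exact)                               # should agree to two or three decimals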
5. Let X1 , . . . , Xn be exchangeable so that the Xi are conditionally independent given a parameter θ. Suppose that Xi | θ is distributed as a Poisson
distribution with mean θ.
(a) Show that, with respect to this Poisson likelihood, the gamma family
of distributions is conjugate.
We have
\[
f(x \mid \theta) \;=\; \prod_{i=1}^{n}P(X_i = x_i \mid \theta)
\;\propto\; \prod_{i=1}^{n}\theta^{x_i}\exp\{-\theta\}
\;=\; \theta^{n\bar{x}}\exp\{-n\theta\}.
\]
As θ ∼ Gamma(α, β) then
\[
f(\theta \mid x) \;\propto\; f(x \mid \theta)f(\theta)
\;\propto\; \theta^{n\bar{x}}\exp\{-n\theta\}\,\theta^{\alpha-1}\exp\{-\beta\theta\}
\;=\; \theta^{\alpha+n\bar{x}-1}\exp\{-(\beta+n)\theta\}
\]
which is a kernel of a Gamma(α + nx̄, β + n) distribution. Hence, the prior and posterior are in the same family, giving conjugacy.
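As before, the conjugate update can be verified numerically; in the sketch below the Poisson data and Gamma hyperparameters are illustrative, and scipy's gamma is parameterised with scale = 1/β.

import numpy as np
from scipy import stats
from scipy.integrate import trapezoid

rng = np.random.default_rng(5)
alpha, beta = 2.0, 1.0                     # assumed prior hyperparameters
x = rng.poisson(3.0, size=25)              # simulated Poisson data
n, s = len(x), int(x.sum())                # n and n*xbar

posterior = stats.gamma(alpha + s, scale=1 / (beta + n))

theta = np.linspace(0.5, 8.0, 2000)
unnorm = theta**s * np.exp(-n * theta) * stats.gamma(alpha, scale=1 / beta).pdf(theta)
assert np.allclose(unnorm / trapezoid(unnorm, theta), posterior.pdf(theta), atol=1e-3)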
(b) Interpret the posterior mean of θ paying particular attention to the
cases when we may have weak prior information and strong prior information.
We have
\[
E(\theta \mid X) \;=\; \frac{\alpha+n\bar{x}}{\beta+n}
\;=\; \left(\frac{\beta}{\beta+n}\right)\frac{\alpha}{\beta} + \left(\frac{n}{\beta+n}\right)\bar{x}
\;=\; \lambda\,\frac{\alpha}{\beta} + (1-\lambda)\bar{x}
\]
where λ = β/(β + n). Hence, the posterior mean is a weighted average of the prior mean, α/β, and the data mean, x̄, which is also the maximum likelihood estimate.
Weak prior information corresponds to a large variance of θ which can be viewed as small β (β is the inverse scale parameter). In this case, more weight is attached to x̄ than α/β in the posterior mean.
Strong prior information corresponds to a small variance of θ which can be viewed as large β (once again, β is the inverse scale parameter). In this case, more weight is attached to α/β than x̄ in the posterior mean, which thus favours the prior mean.
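The following sketch illustrates this with made-up numbers: the prior mean α/β is held at 1 against x̄ = 4, and as β grows the weight λ pulls the posterior mean from the data mean towards the prior mean.

n, xbar = 10, 4.0
for alpha, beta in [(0.5, 0.5), (5.0, 5.0), (50.0, 50.0)]:   # prior mean 1 throughout
    lam = beta / (beta + n)
    post_mean = (alpha + n * xbar) / (beta + n)
    print(f"beta={beta:5.1f}  lam={lam:.3f}  posterior mean={post_mean:.3f}")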
(c) Suppose now that the prior for θ is given hierarchically. Given λ, θ is judged to follow an exponential distribution with mean 1/λ and λ is given the improper distribution f(λ) ∝ 1 for λ > 0. Show that
\[
f(\lambda \mid x) \;\propto\; \frac{\lambda}{(n+\lambda)^{n\bar{x}+1}}
\]
where x̄ = (1/n) Σ_{i=1}^n xi.
θ | λ ∼ Exp(λ) so f(θ | λ) = λ exp{−λθ}. Hence,
\begin{align*}
f(\lambda, \theta \mid x) &\propto f(x \mid \theta, \lambda)f(\theta, \lambda)
\;=\; f(x \mid \theta)f(\theta \mid \lambda)f(\lambda) \\
&\propto \theta^{n\bar{x}}\exp\{-n\theta\}\left(\lambda\exp\{-\lambda\theta\}\right)
\;=\; \lambda\theta^{n\bar{x}}\exp\{-(n+\lambda)\theta\}.
\end{align*}
Thus, integrating out θ,
\[
f(\lambda \mid x) \;\propto\; \int_{0}^{\infty}\lambda\theta^{n\bar{x}}\exp\{-(n+\lambda)\theta\}\,d\theta
\;=\; \lambda\int_{0}^{\infty}\theta^{n\bar{x}}\exp\{-(n+\lambda)\theta\}\,d\theta.
\]
As the integrand is a kernel of a Gamma(nx̄ + 1, n + λ) distribution we thus have
\[
f(\lambda \mid x) \;\propto\; \frac{\lambda\,\Gamma(n\bar{x}+1)}{(n+\lambda)^{n\bar{x}+1}}
\;\propto\; \frac{\lambda}{(n+\lambda)^{n\bar{x}+1}}.
\]
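The integration step can be checked numerically: for several values of λ, the ratio of the numerically integrated joint density to λ/(n + λ)^{nx̄+1} should be constant (equal to Γ(nx̄ + 1)). The data summary below is illustrative.

import numpy as np
from scipy.integrate import quad

n, s = 6, 14                               # s = n * xbar = sum of the data (illustrative)

def marginal(lam):
    # integrate lam * theta^s * exp(-(n + lam) * theta) over theta in (0, inf)
    return quad(lambda t: lam * t**s * np.exp(-(n + lam) * t), 0, np.inf)[0]

lams = np.array([0.5, 1.0, 2.0, 5.0])
numeric = np.array([marginal(lv) for lv in lams])
closed = lams / (n + lams)**(s + 1)
ratio = numeric / closed                   # should be constant, equal to Gamma(s + 1) = s!
assert np.allclose(ratio, ratio[0])
print(ratio)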