Introduction to Bayesian Inference
Deepayan Sarkar
Based on notes by Mohan Delampady
IIAP Astrostatistics School, July, 2013
Outline
1 Statistical Inference
2 Frequentist Statistics
3 Conditioning on Data
4 The Bayesian Recipe
5 Inference for Binomial proportion
6 Inference With Normals/Gaussians
7 Bayesian Computations
8 Empirical Bayes Methods for High Dimensional Problems
9 Formal Methods for Model Selection
10 Bayesian Model Selection
11 Model Selection or Model Averaging?
12 References
1 Statistical Inference
What is Statistical Inference?
It is an inverse problem as in ‘Toy Example’:
Example 1 (Toy). Suppose a million candidate stars are examined for the presence of planetary systems associated with them. If 272 'successes' are noticed, how likely is it that the success rate is 1%, 0.1%, 0.01%, · · · for the entire universe?
Probability models for observed data involve direct probabilities:
Example 2. An astronomical study involved 100 galaxies of which
20 are Seyfert galaxies and the rest are starburst galaxies. To
illustrate generalization of certain conclusions, say 10 of these 100
galaxies are randomly drawn. How many galaxies drawn will be
Seyfert galaxies?
This is exactly like an artificial problem involving an urn having 100
marbles of which 20 are red and the rest blue. 10 marbles are
drawn at random with replacement (repeatedly, one by one, after
replacing the one previously drawn and mixing the marbles well).
How many marbles drawn will be red?
Data and Models
X = number of Seyfert galaxies (red marbles) in the sample (out of sample size n = 10)

    P(X = k | θ) = (n choose k) θ^k (1 − θ)^(n−k),   k = 0, 1, . . . , n    (1)

In (1), θ is the proportion of Seyfert galaxies (red marbles) in the urn, which is also the probability of drawing a Seyfert galaxy at each draw. In Example 2, θ = 20/100 = 0.2 and n = 10. So,

    P(X = 0 | θ = 0.2) = 0.8^10,
    P(X = 1 | θ = 0.2) = 10 × 0.2 × 0.8^9, and so on.
In practice, as in ‘Toy Example’, θ is unknown and inference about
it is the question to solve.
In the Seyfert/starburst galaxy example, if θ is not known and 3
galaxies out of 10 turned out to be Seyfert, one could ask:
how likely is θ = 0.1, or 0.2 or 0.3 or . . .?
Thus inference about θ is an inverse problem:
Causes (parameters) ←− Effects (observations)
How does this inversion work?
The direct probability model P (X = k | θ) provides a likelihood
function for the unknown parameter θ when data X = x is
observed:
l(θ | x) = f(x | θ) (= P(X = x | θ) when X is a discrete random variable), regarded as a function of θ for given x.
Interpretation: f(x | θ) says how likely x is under different θ, or under the model P(· | θ); so if x is observed, then P(X = x | θ) = f(x | θ) = l(θ | x) should be able to indicate how likely the different θ values, or models P(· | θ), are for that x.
As a function of x for fixed θ, P(X = x | θ) is a probability mass function or density; but as a function of θ for fixed x it has no such meaning, and is just a measure of likelihood.
After an experiment is conducted and seeing data x , the only
entity available to convey the information about θ obtained from
the experiment is l (θ | x ).
For the Urn Example we have l(θ | X = 3) ∝ θ^3 (1 − θ)^7:

> curve(dbinom(3, prob = x, size = 10), from = 0, to = 1,
+       xlab = expression(theta), ylab = "", las = 1,
+       main = expression(L(theta) %prop% theta^3 * (1-theta)^7))
[Figure: the likelihood L(θ) ∝ θ^3 (1 − θ)^7 plotted over 0 ≤ θ ≤ 1; it peaks at θ = 0.3.]
Maximum Likelihood Estimation (MLE): If l (θ | x ) measures
the likelihood of different θ (or the corresponding models P (. | θ)),
just find that θ = θ̂ which maximizes the likelihood.
For model (1)
θ̂ = θ̂(x ) = x /n = sample proportion of successes .
This is only an estimate. How good is it? What is the possible
error in estimation?
2 Frequentist Statistics
Frequentist Approach
Consider repeating this experiment again and again.
One can look at all possible sample data
X ∼ Bin(n, θ) → θ̂ = X /n.
Utilize the long-run average behaviour of the MLE, i.e., treat θ̂ as a random quantity (a function of X):

    E(θ̂) = E(X/n) = θ
    Var(θ̂) = Var(X/n) = θ(1 − θ)/n

This gives the "standard error" √(θ̂(1 − θ̂)/n).
For large n, can use Law of Large Numbers and the Central
Limit Theorem
Confidence Statements
Specifically, for large n, approximately

    (θ̂ − θ)/√(θ(1 − θ)/n) ∼ N(0, 1),   or   (θ̂ − θ)/√(θ̂(1 − θ̂)/n) ∼ N(0, 1).    (2)

From (2), an approximate 95% confidence interval for θ (when n is large) is

    θ̂ ± 2√(θ̂(1 − θ̂)/n).
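For the urn data (x = 3 successes in n = 10 draws), this interval is easy to compute directly; a minimal R sketch:

> n <- 10; x <- 3
> theta.hat <- x / n
> se <- sqrt(theta.hat * (1 - theta.hat) / n)
> theta.hat + c(-2, 2) * se   # roughly (0.01, 0.59)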
What Does This Mean?
Simply, if we sample again and again, in about 19 cases out of 20 this random interval

    ( θ̂(X) − 2√(θ̂(X)(1 − θ̂(X))/n),  θ̂(X) + 2√(θ̂(X)(1 − θ̂(X))/n) )

will contain the true unknown value of θ.
Fine, but what can we say about the one interval that we can construct for the given sample or data x?
Nothing; either θ is inside

    ( 0.3 − 2√(0.3 × 0.7/10),  0.3 + 2√(0.3 × 0.7/10) )

or it is outside.
If θ is treated as a fixed unknown constant, conditioning on the given data X = x is meaningless.
3 Conditioning on Data
Conditioning on Data
What other approach is possible, then?
How does one condition on data?
How does one talk about probability of a model or a
hypothesis?
Example 3 (not from physics but medicine). Consider a blood test for a certain disease; the result is positive (x = 1) or negative (x = 0). Suppose θ1 denotes 'disease present' and θ2 'disease not present'.
The test is not confirmatory. Instead, the probability distribution of X for different θ is:

          x = 0   x = 1
    θ1     0.2     0.8
    θ2     0.7     0.3

What does it say?
Test is +ve 80% of the time if 'disease present'.
Test is −ve 70% of the time if 'disease not present'.
If for a particular patient the test result comes out to be ‘positive’,
what should the doctor conclude?
What is the Question?
What is to be answered is ‘what are the chances that the disease is
present given that the test is positive?’ i.e., P (θ = θ1 |X = 1).
What we have is P (X = 1|θ = θ1 ) and P (X = 1|θ = θ2 ).
We have the ‘wrong’ conditional probabilities. They need to be
‘reversed’. But how?
4 The Bayesian Recipe
The Bayesian Recipe
Recall Bayes Theorem: If A and B are two events,

    P(A|B) = P(A and B) / P(B),

assuming P(B) > 0. Therefore, P(A and B) = P(A|B)P(B), and by symmetry P(A and B) = P(B|A)P(A). Consequently, if P(B|A) is given and P(A|B) is desired, note

    P(A|B) = P(A and B) / P(B) = P(B|A)P(A) / P(B).

But how can we get P(B)? The rule of total probability says

    P(B) = P(B and Ω) = P(B and A) + P(B and A^c)
         = P(B|A)P(A) + P(B|A^c)(1 − P(A)),

so

    P(A|B) = P(B|A)P(A) / [P(B|A)P(A) + P(B|A^c)(1 − P(A))].    (3)
Bayes Theorem allows one to invert a certain conditional
probability to get a certain other conditional probability. How does
this help us?
In our example we want P(θ = θ1 | X = 1). From (3),

    P(θ = θ1 | X = 1)
      = P(X = 1 | θ = θ1)P(θ = θ1) / [P(X = 1 | θ1)P(θ = θ1) + P(X = 1 | θ2)P(θ = θ2)].    (4)

So, all we need is P(θ = θ1), which is simply the probability that a randomly chosen person has this disease, or just the 'prevalence' of this disease in the concerned population.
The doctor most likely has this information. But this is not part of
the experimental data.
This is pre-experimental information or prior information. If we
have this, and are willing to incorporate it in the analysis, we get
the post-experimental information or posterior information in the
form of P (θ|X = x ).
In our example, if we take P(θ = θ1) = 0.05 or 5%, we get

    P(θ = θ1 | X = 1) = (0.8 × 0.05) / (0.8 × 0.05 + 0.3 × 0.95) = 0.04/0.325 = 0.123,

which is only 12.3%, and P(θ = θ2 | X = 1) = 0.877 or 87.7%.
Formula (4), which shows how to 'invert' the given conditional probabilities P(X = x | θ) into the conditional probabilities of interest P(θ | X = x), is an instance of Bayes Theorem. Hence this approach, called the Theory of Inverse Probability at the time of Bayes and Laplace (late eighteenth century) and even by Jeffreys, is known these days as Bayesian inference.
Ingredients of Bayesian inference:
likelihood function, l(θ|x); θ can be a parameter vector
prior probability, π(θ)
Combining the two, one gets the posterior probability density or mass function

    π(θ | x) = π(θ)l(θ|x) / Σ_j π(θj)l(θj|x)   if θ is discrete;
    π(θ | x) = π(θ)l(θ|x) / ∫ π(u)l(u|x) du    if θ is continuous.    (5)
A more relevant example
http://xkcd.com/1132/
Frequentist answer?
Null hypothesis: sun has not exploded.
p-value = ?
Bayesian answer?
5 Inference for Binomial proportion
Inference for Binomial proportion
Example 2 contd. Suppose we have no special information
available on θ. Then assume θ is uniformly distributed on the
interval (0, 1). i.e., the prior density is π(θ) = 1, 0 < θ < 1.
This is a choice of non-informative or vague or reference prior.
Often, Bayesian inference from such a prior coincides with classical
inference.
The posterior density of θ given x is then

    π(θ|x) = π(θ)l(θ|x) / ∫ π(u)l(u|x) du
           = [(n + 1)!/(x!(n − x)!)] θ^x (1 − θ)^(n−x),   0 < θ < 1.

As a function of θ, this is the same as the likelihood function l(θ|x) ∝ θ^x (1 − θ)^(n−x), and so maximizing the posterior probability density will give the same estimate as the maximum likelihood estimate!
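Under the uniform prior this posterior is Beta(x + 1, n − x + 1), so posterior summaries are available directly in R; a minimal sketch for the urn data (x = 3, n = 10):

> x <- 3; n <- 10
> # posterior is Beta(x + 1, n - x + 1) under the uniform prior
> (x + 1) / (n + 2)                          # posterior mean
> x / n                                      # posterior mode = MLE
> qbeta(c(0.025, 0.975), x + 1, n - x + 1)   # central 95% credible interval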
Influence of the Prior
If we had some knowledge about θ which can be summarized in
the form of a Beta prior distribution with parameters α and γ, the
posterior will also be Beta with parameters x + α and n − x + γ.
Such priors which result in posteriors from the same ‘family’ are
called ‘natural conjugate priors’.
[Figure: Beta prior densities P(θ) for Beta(1.5, 1.5), Beta(2, 2), and Beta(2.5, 2.5), plotted over 0 ≤ θ ≤ 1.]
Robustness?
If the answer depends on choice of prior, which prior should we choose?
Objective Bayesian Approach:
Invariant priors (Jeffreys)
Reference priors (Bernardo, Jeffreys)
Maximum entropy priors (Jaynes)
Subjective Bayesian Approach:
Expert opinion, previous studies, etc.
In Example 2, what π(θ|x ) says is that
The uncertainty in θ can now be described in terms of an
actual probability distribution (posterior distribution).
The MLE θ̂ = x /n happens to be the value where the
posterior is maximum (traditionally called the mode of the
distribution).
θ̂ can thus be interpreted as the most probable value of the
unknown parameter θ conditional on the sample data x .
Also called the ‘maximum a posteriori estimate (MAP)’,
or the ‘highest posterior density estimate (HPD)’,
or simply ‘posterior mode’.
Of course we do not need to mimic the MLE.
In fact the more common Bayes estimate is the posterior mean (which minimizes the posterior dispersion):

    E[(θ − θ̂B)² | x] = min_a E[(θ − a)² | x],  attained at θ̂B = E(θ|x).

If we choose θ̂B as the estimate of θ, we get a natural measure of variability of this estimate in the form of the posterior variance E[(θ − E(θ|x))² | x].
Therefore the posterior standard deviation is a natural measure of estimation error, i.e., our estimate is θ̂B ± √E[(θ − E(θ|x))² | x].
In fact, we can say much more. For any interval around θ̂ we can
compute the (posterior) probability of it containing the true
parameter θ. In other words, a statement such as
P (θ̂B − k1 ≤ θ ≤ θ̂B + k2 |x ) = 0.95
is perfectly meaningful.
All these inferences are conditional on the given data.
In Example 2, if the prior is a Beta distribution with parameters α and γ, then θ|x will have a Beta(x + α, n − x + γ) distribution, so the Bayes estimate of θ will be

    θ̂B = (x + α)/(n + α + γ)
        = [n/(n + α + γ)] (x/n) + [(α + γ)/(n + α + γ)] (α/(α + γ)).

This is a convex combination of the sample mean and the prior mean, with the weights depending upon the sample size and the strength of the prior information as measured by the values of α and γ.
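To see the shrinkage numerically, here is a small sketch (again for x = 3, n = 10) comparing posterior means under a few Beta priors; the prior parameter values are illustrative only:

> x <- 3; n <- 10
> post.mean <- function(alpha, gamma) (x + alpha) / (n + alpha + gamma)
> post.mean(1, 1)     # uniform prior: 4/12 = 0.333
> post.mean(2, 2)     # mild prior centred at 0.5
> post.mean(20, 20)   # strong prior pulls the estimate towards 0.5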
Bayesian inference relies on the conditional probability language to
revise one’s knowledge.
In the above example, prior to the collection of sample data one
had some (vague, perhaps) information on θ.
Then came the sample data.
Combining the model density of this data with the prior density
one gets the posterior density, the conditional density of θ given
the data.
From now on until further data is available, this posterior
distribution of θ is the only relevant information as far as θ is
concerned.
6 Inference With Normals/Gaussians
Inference With Normals/Gaussians
Gaussian PDF

    f(x|µ, σ²) = (1/√(2πσ²)) e^(−(x−µ)²/(2σ²)),   −∞ < x < ∞.

Common abbreviated notation: X ∼ N(µ, σ²).
Parameters:

    µ = E(X) ≡ ∫ x f(x|µ, σ²) dx,
    σ² = E(X − µ)² ≡ ∫ (x − µ)² f(x|µ, σ²) dx.    (6)
Inference About a Normal Mean
Example 4. Fit a normal/Gaussian model to the ‘globular cluster
luminosity functions’ data. The set-up is as follows.
Our data consist of n measurements, Xi = µ + εi. Suppose the noise contributions are independent, and εi ∼ N(0, σ²). Denoting the random sample (x1, . . . , xn) by x,

    f(x|µ, σ²) = ∏_i f(xi|µ, σ²)
               = ∏_i (1/√(2πσ²)) e^(−(xi−µ)²/(2σ²))
               = (2πσ²)^(−n/2) e^(−(1/(2σ²)) Σ_{i=1}^n (xi−µ)²)
               = (2πσ²)^(−n/2) e^(−(1/(2σ²)) [Σ_{i=1}^n (xi−x̄)² + n(x̄−µ)²]).

Note (X̄, s² = Σ_{i=1}^n (Xi − X̄)²/(n − 1)) is sufficient for the parameters (µ, σ²). This is a very substantial data compression.
Inference About a Normal Mean, σ² known
Not useful, but easy to understand.

    l(µ|x) ∝ f(x|µ, σ²) ∝ e^(−n(µ−x̄)²/(2σ²)),

so that X̄ is sufficient. Also,

    X̄ | µ ∼ N(µ, σ²/n).
If an informative prior µ ∼ N(µ0, τ²) is chosen for µ,

    π(µ|x) ∝ l(µ|x)π(µ)
           ∝ e^(−(1/2)[n(µ−x̄)²/σ² + (µ−µ0)²/τ²])
           ∝ e^(−[(τ² + σ²/n)/(2τ²σ²/n)] {µ − [(τ²σ²/n)/(τ² + σ²/n)](µ0/τ² + nx̄/σ²)}²),

i.e., µ|x ∼ N(µ̂, δ²), with

    µ̂ = [(τ²σ²/n)/(τ² + σ²/n)] (µ0/τ² + nx̄/σ²)
       = [τ²/(τ² + σ²/n)] x̄ + [(σ²/n)/(τ² + σ²/n)] µ0.
µ̂ is the Bayes estimate of µ, which is just a weighted average of the sample mean x̄ and the prior mean µ0.
δ² is the posterior variance of µ, and

    δ² = (τ²σ²/n)/(τ² + σ²/n) = (σ²/n) × τ²/(τ² + σ²/n).

Therefore
µ̂ ± δ is our estimate for µ, and
µ̂ ± 2δ is a 95% HPD (Bayesian) credible interval for µ.
What happens as τ² → ∞, or as the prior becomes more and more flat?

    µ̂ → x̄,   δ → σ/√n.

So Jeffreys' prior π(µ) = C reproduces frequentist inference.
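A minimal numerical sketch of these formulas (the data and prior values below are invented for illustration):

> x <- c(14.2, 14.7, 14.4, 14.9, 14.5)    # hypothetical measurements
> n <- length(x); xbar <- mean(x)
> sigma2 <- 0.25                          # assumed known error variance
> mu0 <- 14; tau2 <- 1                    # prior mean and variance
> post.var <- (tau2 * sigma2 / n) / (tau2 + sigma2 / n)
> post.mean <- post.var * (mu0 / tau2 + n * xbar / sigma2)
> c(post.mean, sqrt(post.var))            # Bayes estimate mu-hat and delta
> post.mean + c(-2, 2) * sqrt(post.var)   # 95% HPD credible interval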
Inference About a Normal Mean, σ² unknown
Our observations X1, . . . , Xn are a random sample from a Gaussian population with both mean µ and variance σ² unknown.
We are only interested in µ.
How do we get rid of the nuisance parameter σ 2 ?
Bayesian inference uses posterior distribution which is a probability
distribution, so σ 2 should be integrated out from the joint
posterior distribution of µ and σ 2 to get the marginal posterior
distribution of µ.
    l(µ, σ²|x) = (2πσ²)^(−n/2) e^(−(1/(2σ²)) [Σ_{i=1}^n (xi−x̄)² + n(µ−x̄)²]).

Start with π(µ, σ²) and get

    π(µ, σ²|x) ∝ π(µ, σ²) l(µ, σ²|x),

and then get

    π(µ|x) = ∫₀^∞ π(µ, σ²|x) dσ².
Use Jeffreys' prior π(µ, σ²) ∝ 1/σ²: a flat prior for µ, which is a location or translation parameter, and an independent flat prior for log(σ), which is again a location parameter, being the log of a scale parameter. Then

    π(µ, σ²|x) ∝ (1/σ²) l(µ, σ²|x),

    π(µ|x) ∝ ∫₀^∞ (σ²)^(−(n/2+1)) e^(−(1/(2σ²))[Σ_{i=1}^n (xi−x̄)² + n(µ−x̄)²]) dσ²
           ∝ [(n − 1)s² + n(µ − x̄)²]^(−n/2)
           ∝ [1 + (1/(n−1)) n(µ − x̄)²/s²]^(−n/2)
           ∝ density of Student's t_{n−1} at √n(µ − x̄)/s.

That is,

    √n(µ − x̄)/s | data ∼ t_{n−1},

so

    P(x̄ − t_{n−1}(0.975) s/√n ≤ µ ≤ x̄ + t_{n−1}(0.975) s/√n | data) = 95%,

i.e., the Jeffreys' translation-scale invariant prior reproduces frequentist inference.
What if there are some constraints on µ, such as −A ≤ µ ≤ B or µ > 0? We will get a truncated t_{n−1} instead, but the procedure will go through with minimal change.
Example 4 contd. (GCL Data) n = 360, x̄ = 14.46, s = 1.19.

    √360 (µ − 14.46)/1.19 | data ∼ t₃₅₉,

so µ | data ∼ N(14.46, 0.063²) approximately.
The estimate for the mean GCL is 14.46 ± 0.063, and the 95% HPD credible interval is (14.33, 14.59).
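These numbers are easy to check in R (using the exact t quantile rather than the ±2 rule):

> n <- 360; xbar <- 14.46; s <- 1.19
> se <- s / sqrt(n)                               # about 0.063
> xbar + c(-1, 1) * qt(0.975, df = n - 1) * se    # close to (14.33, 14.59)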
Comparing two Normal Means
Example 5. Check whether the mean distance indicators in the
two populations of LMC datasets are different.
http://www.iiap.res.in/astrostat/School10/datasets/LMC_distance.html
> x <- c(18.70, 18.55, 18.55, 18.575, 18.4, 18.42, 18.45,
+        18.59, 18.471, 18.54, 18.54, 18.64, 18.58)
> y <- c(18.45, 18.45, 18.44, 18.30, 18.38, 18.50, 18.55,
+        18.52, 18.40, 18.45, 18.55, 18.69)
Model as follows:
X1 , . . . Xn1 is a random sample from N (µ1 , σ12 ).
Y1 , . . . Yn2 is a random sample from N (µ2 , σ22 ) (independent).
Unknown parameters: (µ1 , µ2 , σ12 , σ22 )
Quantity of interest: η = µ1 − µ2
Nuisance parameters: σ12 and σ22
Case 1. σ1² = σ2². Then the sufficient statistic for (µ1, µ2, σ²) is

    ( X̄, Ȳ, s² = (1/(n1 + n2 − 2)) [Σ_{i=1}^{n1} (Xi − X̄)² + Σ_{j=1}^{n2} (Yj − Ȳ)²] ).

It can be shown that

    X̄ | µ1, µ2, σ² ∼ N(µ1, σ²/n1),
    Ȳ | µ1, µ2, σ² ∼ N(µ2, σ²/n2),
    (n1 + n2 − 2)s² | µ1, µ2, σ² ∼ σ²χ²_{n1+n2−2},

and these three are independently distributed. Hence

    X̄ − Ȳ | µ1, µ2, σ² ∼ N(η, σ²(1/n1 + 1/n2)),   η = µ1 − µ2.
Use Jeffreys' location-scale invariant prior π(µ1, µ2, σ²) ∝ 1/σ². Then

    η | σ², x, y ∼ N(x̄ − ȳ, σ²(1/n1 + 1/n2)),  and
    π(η, σ² | x, y) ∝ π(η | σ², x, y) π(σ² | s²).    (7)

Integrate out σ² from (7) as in the previous example to get

    [η − (x̄ − ȳ)] / [s √(1/n1 + 1/n2)] | x, y ∼ t_{n1+n2−2}.

The 95% HPD credible interval for η = µ1 − µ2 is

    x̄ − ȳ ± t_{n1+n2−2}(0.975) s √(1/n1 + 1/n2),

same as the frequentist t-interval.
Example 5 contd. We have x̄ = 18.539, ȳ = 18.473, n1 = 13, n2 = 12 and s² = 0.0085. η̂ = x̄ − ȳ = 0.066, s √(1/n1 + 1/n2) = 0.037, t₂₃(0.975) = 2.069.
95% HPD credible interval for η = µ1 − µ2:
(0.066 − 2.069 × 0.037, 0.066 + 2.069 × 0.037) = (−0.011, 0.142).
Case 2. σ1² and σ2² are not known to be equal.
From the one-sample normal example, note that
(X̄, s_X² = (1/(n1−1)) Σ_{i=1}^{n1} (Xi − X̄)²) is sufficient for (µ1, σ1²), and
(Ȳ, s_Y² = (1/(n2−1)) Σ_{j=1}^{n2} (Yj − Ȳ)²) is sufficient for (µ2, σ2²).
Making inference on η = µ1 − µ2 when σ1² and σ2² are not assumed to be equal is called the Behrens-Fisher problem, for which the frequentist solution is not very straightforward, but the Bayes solution is.
It is a well-known result that

    X̄ | µ1, σ1² ∼ N(µ1, σ1²/n1),
    (n1 − 1)s_X² | µ1, σ1² ∼ σ1² χ²_{n1−1},

and these are independently distributed. Similarly,

    Ȳ | µ2, σ2² ∼ N(µ2, σ2²/n2),
    (n2 − 1)s_Y² | µ2, σ2² ∼ σ2² χ²_{n2−1},

and these are independently distributed. The X and Y samples are independent.
Use Jeffreys' prior

    π(µ1, µ2, σ1², σ2²) ∝ (1/σ1²) × (1/σ2²).

Calculations similar to those in the one-sample case give:

    √n1 (µ1 − x̄)/s_X | data ∼ t_{n1−1},
    √n2 (µ2 − ȳ)/s_Y | data ∼ t_{n2−1},    (8)

and these two are independent.
Posterior distribution of η = µ1 − µ2 given the data is non-standard
(difference of two independent t variables) but not difficult to get.
Use Monte-Carlo Sampling: Simply generate (µ1 , µ2 ) repeatedly
from (8) and construct a histogram for η = µ1 − µ2 .
Example 5 (LMC) contd.
> xbar <- mean(x)
> ybar <- mean(y)
> n1 <- length(x)
> n2 <- length(y)
> sx <- sqrt(var(x))
> sy <- sqrt(var(y))
> mu1.sim <- xbar + sx * rt(100000, df = n1 - 1) / sqrt(n1)
> mu2.sim <- ybar + sy * rt(100000, df = n2 - 1) / sqrt(n2)
> plot(density(mu1.sim - mu2.sim))
> s <- sqrt((1/n1 + 1/n2) * ((n1-1)*sx^2 + (n2-1)*sy^2) / (n1+n2-2))
> curve(dt((x - (xbar-ybar)) / s, df = n1 + n2 - 2) / s,
+       add = TRUE, col = "red")
[Figure: density.default(x = mu1.sim − mu2.sim); Monte Carlo posterior density of η = µ1 − µ2 (N = 100000), with the equal-variance Student-t posterior overlaid in red.]
Posterior mean of η = µ1 − µ2 is

    η̂ = E(µ1 − µ2 | data) = 0.0654.    (9)

95% HPD credible interval for η = µ1 − µ2 is

    (−0.011, 0.142)   equal variance;
    (−0.014, 0.147)   unequal variance.    (10)
7 Bayesian Computations
Bayesian Computations
Bayesian analysis requires computation of expectations and
quantiles of probability distributions (posterior distributions).
Most often posterior distributions will not be standard distributions.
Then posterior quantities of inferential interest cannot be
computed in closed form. Special techniques are needed.
Example M1. Suppose X1, X2, . . . , Xk are observed numbers of a certain type of star in k similar regions. Model them as independent Poisson counts: Xi ∼ Poisson(θi). The θi are a priori considered related. Let νi = log(θi) be the i-th element of ν, and suppose

    ν ∼ N_k( µ1, τ²[(1 − ρ)I_k + ρ11′] ),

where 1 is the k-vector with all elements equal to 1, and µ, τ² and ρ are known constants. Then

    f(x | ν) = exp(−Σ_{i=1}^k {e^{νi} − νi xi}) / ∏_{i=1}^k xi!,

    π(ν) ∝ exp( −(1/(2τ²)) (ν − µ1)′ [(1 − ρ)I_k + ρ11′]^{−1} (ν − µ1) ),

    π(ν | x) ∝ exp( −Σ_{i=1}^k {e^{νi} − νi xi} − (ν − µ1)′ [(1 − ρ)I_k + ρ11′]^{−1} (ν − µ1)/(2τ²) ).
To obtain the posterior mean of θj, compute

    E^π(θj | x) = E^π(exp(νj) | x) = ∫_{R^k} exp(νj) g(ν | x) dν / ∫_{R^k} g(ν | x) dν,

where

    g(ν | x) = exp( −Σ_{i=1}^k {e^{νi} − νi xi} − (ν − µ1)′ [(1 − ρ)I_k + ρ11′]^{−1} (ν − µ1)/(2τ²) ).
This is a ratio of two k -dimensional integrals, and as k grows, the
integrals become less and less easy to work with. Numerical
integration techniques fail to be an efficient technique in this case.
This problem, known as the curse of dimensionality, is due to the
fact that the size of the part of the space that is not relevant for
the computation of the integral grows very fast with the
dimension. Consequently, the error in approximation associated
with this numerical method increases as the power of the
dimension k , making the technique inefficient.
The recent popularity of Bayesian approach to statistical
applications is mainly due to advances in statistical computing.
These include the E-M algorithm and the Markov chain Monte
Carlo (MCMC) sampling techniques.
Monte Carlo Sampling
Consider an expectation that is not available in closed form. To estimate a population mean, gather a large sample from this population and consider the corresponding sample mean. The Law of Large Numbers guarantees that the estimate will be good provided the sample is large enough. Specifically, let f be a probability density function (or a mass function) and suppose the quantity of interest is a finite expectation of the form

    E_f h(X) = ∫_X h(x) f(x) dx    (11)

(or the corresponding sum in the discrete case). If i.i.d. observations X₁, X₂, . . . can be generated from the density f, then

    h̄_m = (1/m) Σ_{i=1}^m h(X_i)    (12)

converges in probability to E_f h(X). This justifies using h̄_m as an approximation for E_f h(X) for large m.
To provide a measure of accuracy or the extent of error in the approximation, compute the standard error. If Var_f h(X) is finite, then Var_f(h̄_m) = Var_f h(X)/m. Further, Var_f h(X) = E_f h²(X) − (E_f h(X))² can be estimated by

    s_m² = (1/m) Σ_{i=1}^m (h(X_i) − h̄_m)²,

and hence the standard error of h̄_m can be estimated by

    s_m/√m = [ (1/m²) Σ_{i=1}^m (h(X_i) − h̄_m)² ]^(1/2).
Confidence intervals for E_f h(X): Using the CLT,

    √m (h̄_m − E_f h(X)) / s_m → N(0, 1)  as m → ∞,

so (h̄_m − z_{α/2} s_m/√m, h̄_m + z_{α/2} s_m/√m) can be used as an approximate 100(1 − α)% confidence interval for E_f h(X), with z_{α/2} denoting the 100(1 − α/2)% quantile of the standard normal.
What Does This Say?
If we want to approximate the posterior mean, try to generate i.i.d.
observations from the posterior distribution and consider the mean
of this sample. This is rarely useful because most often the
posterior distribution will be a non-standard distribution which may
not easily allow sampling from it. What are some other
possibilities?
Example M2. Suppose X is N(θ, σ²) with known σ² and a Cauchy(µ, τ) prior on θ is considered appropriate. Then

    π(θ | x) ∝ exp(−(θ − x)²/(2σ²)) [τ² + (θ − µ)²]^{−1},

and hence the posterior mean is

    E^π(θ | x) = ∫_{−∞}^∞ θ exp(−(θ−x)²/(2σ²)) [τ² + (θ−µ)²]^{−1} dθ
                 / ∫_{−∞}^∞ exp(−(θ−x)²/(2σ²)) [τ² + (θ−µ)²]^{−1} dθ
               = ∫_{−∞}^∞ θ (1/σ)φ((θ−x)/σ) [τ² + (θ−µ)²]^{−1} dθ
                 / ∫_{−∞}^∞ (1/σ)φ((θ−x)/σ) [τ² + (θ−µ)²]^{−1} dθ,

where φ denotes the density of the standard normal.
E^π(θ | x) is the ratio of the expectation of h(θ) = θ/(τ² + (θ − µ)²) to that of h(θ) = 1/(τ² + (θ − µ)²), both expectations being with respect to the N(x, σ²) distribution. Therefore, we simply sample θ1, θ2, . . . , θm from N(x, σ²) and use

    Ê^π(θ | x) = Σ_{i=1}^m θi [τ² + (θi − µ)²]^{−1} / Σ_{i=1}^m [τ² + (θi − µ)²]^{−1}

as our Monte Carlo estimate of E^π(θ | x). Note that (11) and (12) are applied separately to both the numerator and denominator, but using the same sample of θ's. It is unwise to assume that the problem has been completely solved. The sample of θ's generated from N(x, σ²) will tend to concentrate around x, whereas to satisfactorily account for the contribution of the Cauchy prior to the posterior mean, a significant portion of the θ's should come from the tails of the posterior distribution.
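In R, this ratio estimator is only a few lines (the values of x, σ, µ, τ below are made up for illustration):

> x <- 2; sigma <- 1; mu <- 0; tau <- 1    # hypothetical data and prior values
> m <- 100000
> theta <- rnorm(m, mean = x, sd = sigma)  # sample from N(x, sigma^2)
> w <- 1 / (tau^2 + (theta - mu)^2)        # Cauchy prior kernel as weight
> sum(theta * w) / sum(w)                  # Monte Carlo estimate of E(theta | x)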
Why not express the posterior mean in the form

    E^π(θ | x) = ∫_{−∞}^∞ θ exp(−(θ−x)²/(2σ²)) π(θ) dθ / ∫_{−∞}^∞ exp(−(θ−x)²/(2σ²)) π(θ) dθ,

and then sample θ's from Cauchy(µ, τ) and use the approximation

    Ê^π(θ | x) = Σ_{i=1}^m θi exp(−(θi−x)²/(2σ²)) / Σ_{i=1}^m exp(−(θi−x)²/(2σ²)) ?

However, this is also not satisfactory, because the tails of the posterior distribution are not as heavy as those of the Cauchy prior, and there will be excess sampling from the tails relative to the center. So the convergence of the approximation will be slower, resulting in a larger error in approximation (for a fixed m).
Ideally, therefore, sampling should be from the posterior
distribution itself. With this view in mind, a variation of the above
theme, called Monte Carlo importance sampling has been
developed.
Consider (11) again. Suppose that it is difficult or expensive to sample directly from f, but there exists a probability density u that is very close to f from which it is easy to sample. Then we can rewrite (11) as

    E_f h(X) = ∫_X h(x) f(x) dx = ∫_X h(x) [f(x)/u(x)] u(x) dx
             = ∫_X {h(x)w(x)} u(x) dx = E_u{h(X)w(X)},

where w(x) = f(x)/u(x). Now apply (12) with f replaced by u and h replaced by hw. In other words, generate i.i.d. observations X₁, X₂, . . . from the density u and compute the estimate

    (1/m) Σ_{i=1}^m h(X_i) w(X_i).
The sampling density u is called the importance function.
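A sketch of importance sampling for Example M2, with a heavier-tailed Student-t importance function centred at x (this particular choice of u is an assumption made here for illustration, not prescribed in the notes):

> x <- 2; sigma <- 1; mu <- 0; tau <- 1       # hypothetical values as before
> m <- 100000
> theta <- x + rt(m, df = 3)                  # draws from the importance density u
> u <- dt(theta - x, df = 3)                  # u evaluated at the draws
> kernel <- dnorm(theta, x, sigma) / (tau^2 + (theta - mu)^2)  # unnormalized posterior
> w <- kernel / u                             # importance weights
> sum(theta * w) / sum(w)                     # self-normalized estimate of E(theta | x)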
Markov Chain Monte Carlo Methods
A severe drawback of the standard Monte Carlo sampling/
importance sampling: complete determination of the functional
form of the posterior density is needed for implementation.
Situations where posterior distributions are incompletely specified
or are specified indirectly cannot be handled: joint posterior
distribution of the vector of parameters is specified in terms of
several conditional and marginal distributions, but not directly.
This covers a large range of Bayesian analysis because a lot of
Bayesian modeling is hierarchical so that the joint posterior is
difficult to calculate but the conditional posteriors given
parameters at different levels of hierarchy are easier to write down
(and hence sample from).
Markov Chains in MCMC
A sequence of random variables {Xn}_{n≥0} is a Markov chain if for any n, given the current value Xn, the past {Xj : j ≤ n − 1} and the future {Xj : j ≥ n + 1} are independent. In other words,

    P(A ∩ B | Xn) = P(A | Xn) P(B | Xn),    (13)

where A and B are events defined respectively in terms of the past and the future.
Important subclass: Markov chains with time homogeneous or
stationary transition probabilities: the probability distribution of
Xn+1 given Xn = x , and the past, Xj : j ≤ n − 1 depends only on
x and does not depend on the values of Xj : j ≤ n − 1 or n.
If the set S of values {Xn } can take, known as the state space, is
countable, this reduces to specifying the transition probability
matrix P ≡ ((pij )) where for any two values i , j in S , pij is the
probability that Xn+1 = j given Xn = i , i.e., of moving from state
i to state j in one time unit.
For state space S that is not countable, specify a transition kernel
or transition function P (x , ·) where P (x , A) is the probability of
moving from x into A in one step, i.e., P (Xn+1 ∈ A | Xn = x ).
Given the transition probability and the probability distribution of the initial value X0, one can construct the joint probability distribution of {Xj : 0 ≤ j ≤ n} for any finite n, i.e.,

    P(X0 = i0, X1 = i1, . . . , X_{n−1} = i_{n−1}, Xn = i_n)
      = P(Xn = i_n | X0 = i0, . . . , X_{n−1} = i_{n−1}) × P(X0 = i0, X1 = i1, . . . , X_{n−1} = i_{n−1})
      = p_{i_{n−1} i_n} P(X0 = i0, . . . , X_{n−1} = i_{n−1})
      = P(X0 = i0) p_{i0 i1} p_{i1 i2} · · · p_{i_{n−1} i_n}.
A probability distribution π is called stationary or invariant for a transition probability P, or the associated Markov chain {Xn}, if whenever the probability distribution of X0 is π, the same is true for Xn for all n ≥ 1. Thus in the countable state space case, a probability distribution π = {πi : i ∈ S} is stationary for a transition probability matrix P if for each j in S,

    P(X1 = j) = Σ_i P(X1 = j | X0 = i) P(X0 = i) = Σ_i πi p_{ij} = P(X0 = j) = πj.    (14)

In vector notation this says π = (π1, π2, . . .) is a left eigenvector of the matrix P with eigenvalue 1:

    π = πP.    (15)
Similarly, if S is a continuum, a probability distribution π with density p(x) is stationary for the transition kernel P(·, ·) if

    π(A) = ∫_A p(x) dx = ∫_S P(x, A) p(x) dx

for all A ⊂ S.
A Markov chain {Xn} with a countable state space S and transition probability matrix P ≡ ((p_{ij})) is said to be irreducible if for any two states i and j the probability of the Markov chain visiting j starting from i is positive, i.e., for some n ≥ 1, p_{ij}^(n) ≡ P(Xn = j | X0 = i) > 0.
A similar notion of irreducibility, known as Harris or Doeblin irreducibility, exists for the general state space case also.
Theorem (Law of Large Numbers for Markov Chains). Let {Xn}_{n≥0} be a Markov chain with a countable state space S and a transition probability matrix P. Suppose it is irreducible and has a stationary probability distribution π ≡ (πi : i ∈ S) as defined in (14). Then, for any bounded function h : S → R and for any initial distribution of X0,

    (1/n) Σ_{i=0}^{n−1} h(Xi) → Σ_j h(j) πj    (16)

in probability as n → ∞.
A similar law of large numbers (LLN) holds when the state space S
is not countable. The limit value in (16) will be the integral of h
with respect to the stationary distribution π. A sufficient condition
for the validity of this LLN is that the Markov chain {Xn } be
Harris irreducible and have a stationary distribution π.
How is this Useful?
A probability distribution π on a set S is given. We want to compute the "integral of h with respect to π", which reduces to Σ_j h(j) πj in the countable case.
Look for an irreducible Markov chain {Xn} with state space S and stationary distribution π. Starting from some initial value X0, run the Markov chain {Xj} for a period of time, say 0, 1, 2, . . . , n − 1, and consider as an estimate

    µ_n = (1/n) Σ_{j=0}^{n−1} h(Xj).    (17)

By the LLN (16), µ_n will be close to Σ_j h(j) πj for large n.
This technique is called Markov chain Monte Carlo (MCMC).
To approximate π(A) ≡ Σ_{j∈A} πj for some A ⊂ S, simply consider

    π_n(A) ≡ (1/n) Σ_{j=0}^{n−1} I_A(Xj) → π(A),

where I_A(Xj) = 1 if Xj ∈ A and 0 otherwise.
An irreducible Markov chain {Xn} with a countable state space S is called aperiodic if for some i ∈ S the greatest common divisor g.c.d.{n : p_{ii}^(n) > 0} = 1. Then, in addition to the LLN (16), the following result on the convergence of P(Xn = j) holds:

    Σ_j | P(Xn = j) − πj | → 0    (18)

as n → ∞, for any initial distribution of X0. In other words, for large n the probability distribution of Xn will be close to π. There exists a result similar to (18) for the general state space case also.
This suggests that instead of doing one run of length n, one could do N independent runs each of length m, so that n = Nm, and then from the i-th run use only the m-th observation, say X_{m,i}, and consider the estimate

    µ̃_{N,m} ≡ (1/N) Σ_{i=1}^N h(X_{m,i}).    (19)
Metropolis-Hastings Algorithm
Very general MCMC method with wide applications. The idea is not to directly simulate from the given target density (which may be computationally difficult), but to simulate an easy Markov chain that has this target density as the stationary distribution.
Let π be the target probability distribution on S, a finite or countable set. Let Q ≡ ((q_{ij})) be a transition probability matrix such that for each i, it is computationally easy to generate a sample from the distribution {q_{ij} : j ∈ S}. Generate a Markov chain {Xn} as follows. If Xn = i, first sample from the distribution {q_{ij} : j ∈ S} and denote that observation Yn. Then, choose X_{n+1} from the two values Xn and Yn according to

    P(X_{n+1} = Yn | Xn, Yn) = ρ(Xn, Yn) = 1 − P(X_{n+1} = Xn | Xn, Yn),

where the "acceptance probability" ρ(·, ·) is given by

    ρ(i, j) = min{ πj q_{ji} / (πi q_{ij}), 1 }   for all (i, j) such that πi q_{ij} > 0.

{Xn} is a Markov chain with transition probability matrix P = ((p_{ij})) given by

    p_{ij} = q_{ij} ρ_{ij}          for j ≠ i;
    p_{ii} = 1 − Σ_{k≠i} p_{ik}.    (20)

Q is called the "proposal transition probability" and ρ the "acceptance probability". A significant feature of this transition mechanism P is that P and π satisfy

    πi p_{ij} = πj p_{ji}   for all i, j.    (21)

This implies that for any j,

    Σ_i πi p_{ij} = πj Σ_i p_{ji} = πj,    (22)

i.e., π is a stationary probability distribution for P.
Suppose S is irreducible with respect to Q and πi > 0 for all i in S. It can then be shown that P is irreducible, and because it has a stationary distribution π, the LLN (16) is available. This algorithm is thus a very flexible and useful one. The choice of Q is subject only to the condition that S is irreducible with respect to Q. A sufficient condition for the aperiodicity of P is that p_{ii} > 0 for some i, or equivalently,

    Σ_{j≠i} q_{ij} ρ_{ij} < 1.

A sufficient condition for this is that there exists a pair (i, j) such that πi q_{ij} > 0 and πj q_{ji} < πi q_{ij}.
Recall that if P is aperiodic, then both the LLN (16) and (18) hold.
If S is not finite or countable but is a continuum and the target distribution π(·) has a density p(·), then one proceeds as follows. Let Q be a transition function such that for each x, Q(x, ·) has a density q(x, y). Then proceed as in the discrete case but set the "acceptance probability" ρ(x, y) to be

    ρ(x, y) = min{ p(y)q(y, x) / (p(x)q(x, y)), 1 }

for all (x, y) such that p(x)q(x, y) > 0.
A particularly useful feature of the above algorithm is that it is enough to know p(·) up to a multiplicative constant, as the "acceptance probability" ρ(·, ·) needs only the ratios p(y)/p(x) or πj/πi.
This assures us that in Bayesian applications it is not necessary to
have the normalizing constant of the posterior density available for
computation of the posterior quantities of interest.
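As a concrete illustration, here is a minimal random-walk Metropolis-Hastings sketch in R for the posterior of Example M2 (normal likelihood, Cauchy prior). The proposal scale and the values of x, σ, µ, τ are assumptions for illustration; with a symmetric normal proposal, q(y, x) = q(x, y), so the acceptance ratio reduces to p(y)/p(x), and only the unnormalized posterior is needed:

> x <- 2; sigma <- 1; mu <- 0; tau <- 1
> p.kernel <- function(theta)          # unnormalized posterior density
+     exp(-(theta - x)^2 / (2 * sigma^2)) / (tau^2 + (theta - mu)^2)
> n <- 50000
> chain <- numeric(n)
> chain[1] <- x                        # start at the observed value
> for (t in 1:(n - 1)) {
+     y <- chain[t] + rnorm(1)         # symmetric random-walk proposal
+     rho <- min(p.kernel(y) / p.kernel(chain[t]), 1)
+     chain[t + 1] <- if (runif(1) < rho) y else chain[t]
+ }
> mean(chain)                          # MCMC estimate of E(theta | x)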
Gibbs Sampling
Most of the new problems that Bayesians are asked to solve are
high-dimensional: e.g. micro-arrays, image processing. Bayesian
analysis of such problems involve target (posterior) distributions
that are high-dimensional multivariate distributions.
In image processing, typically one has an N × N square grid of pixels with N = 256, and each pixel has k ≥ 2 possible values. Each configuration has (256)² components, and the state space S has k^((256)²) configurations. How does one simulate a random configuration from a target distribution over such a large S?
Gibbs sampler is a technique especially suitable for generating an
irreducible aperiodic Markov chain that has as its stationary
distribution a target distribution in a high-dimensional space
having some special structure.
The most interesting aspect of this technique: to run this Markov
chain, it suffices to generate observations from univariate
distributions.
The Gibbs sampler in the context of a bivariate probability
distribution can be described as follows. Let π be a target
probability distribution of a bivariate random vector (X , Y ). For
each x , let P (x , ·) be the conditional probability distribution of Y
given X = x . Similarly, let Q(y, ·) be the conditional probability
distribution of X given Y = y. Note that for each x , P (x , ·) is a
univariate distribution, and for each y, Q(y, ·) is also a univariate
distribution. Now generate a bivariate Markov chain
Zn = (Xn , Yn ) as follows:
Start with some X0 = x0 . Generate an observation Y0 from the
distribution P (x0 , ·). Then generate an observation X1 from
Q(Y0 , ·). Next generate an observation Y1 from P (X1 , ·) and so
on. At stage n if Zn = (Xn , Yn ) is known, then generate Xn+1
from Q(Yn , ·) and Yn+1 from P (Xn+1 , ·).
If π is a discrete distribution concentrated on {(xi, yj) : 1 ≤ i ≤ K, 1 ≤ j ≤ L} and if πij = π(xi, yj), then P(xi, yj) = πij/πi· and Q(yj, xi) = πij/π·j, where πi· = Σ_j πij and π·j = Σ_i πij. Thus the transition probability matrix R = ((r_{(ij),(kℓ)})) for the {Zn} chain is given by

    r_{(ij),(kℓ)} = Q(yj, xk) P(xk, yℓ) = (π_{kj}/π_{·j}) (π_{kℓ}/π_{k·}).

Verify that this chain is irreducible, aperiodic, and has π as its stationary distribution. Thus the LLN (16) and (18) hold in this case.
Thus for large n, Zn can be viewed as a sample from a distribution that is close to π, and one can approximate Σ_{i,j} h(i, j) πij by (1/n) Σ_{i=1}^n h(Xi, Yi).
Illustration: Consider sampling from

    (X, Y) ∼ N₂( (0, 0), [[1, ρ], [ρ, 1]] ).

The conditional distribution of X given Y = y and that of Y given X = x are

    X | Y = y ∼ N(ρy, 1 − ρ²)  and  Y | X = x ∼ N(ρx, 1 − ρ²).    (23)

Using this property, Gibbs sampling proceeds as follows. Generate (Xn, Yn), n = 0, 1, 2, . . ., by starting from an arbitrary value x0 for X0, and repeat the following steps for i = 0, 1, . . . , n:
1. Given xi for X, draw a random deviate from N(ρxi, 1 − ρ²) and denote it by Yi.
2. Given yi for Y, draw a random deviate from N(ρyi, 1 − ρ²) and denote it by X_{i+1}.
The theory of Gibbs sampling tells us that if n is large, then (xn, yn) is a random draw from a distribution that is close to N₂((0, 0), [[1, ρ], [ρ, 1]]).
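A direct R transcription of this two-step scheme (a minimal sketch; ρ = 0.8 is an arbitrary illustrative value):

> rho <- 0.8; n <- 10000
> xs <- numeric(n); ys <- numeric(n)
> xs[1] <- 0                          # arbitrary starting value
> ys[1] <- rnorm(1, rho * xs[1], sqrt(1 - rho^2))
> for (i in 1:(n - 1)) {
+     xs[i + 1] <- rnorm(1, rho * ys[i], sqrt(1 - rho^2))      # X | Y = y
+     ys[i + 1] <- rnorm(1, rho * xs[i + 1], sqrt(1 - rho^2))  # Y | X = x
+ }
> cor(xs, ys)                         # should be close to rho for large n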
Multivariate extension: π is a probability distribution of a k-dimensional random vector (X1, X2, . . . , Xk). If u = (u1, u2, . . . , uk) is any k-vector, let u_{−i} = (u1, u2, . . . , u_{i−1}, u_{i+1}, . . . , uk) be the (k − 1)-dimensional vector resulting from dropping the i-th component ui. Let πi(· | x_{−i}) denote the univariate conditional distribution of Xi given X_{−i} ≡ (X1, X2, . . . , X_{i−1}, X_{i+1}, . . . , Xk) = x_{−i}. Starting with some initial value X₀ = (x01, x02, . . . , x0k), generate X₁ = (X11, X12, . . . , X1k) sequentially, by generating X11 according to the univariate distribution π1(· | x₀,₋₁), then generating X12 according to π2(· | (X11, x03, x04, . . . , x0k)), and so on.
The most important feature to recognize here is that all the
univariate conditional distributions, Xi | X −i = x −i , known as full
conditionals should easily allow sampling from them. This is the
case in most hierarchical Bayes problems. Thus, the Gibbs sampler
is particularly well adapted for Bayesian computations with
hierarchical priors.
Rao-Blackwellization
The variance reduction idea of the famous Rao-Blackwell theorem
in the presence of auxiliary information can be used to provide
improved estimators when MCMC procedures are adopted.
Theorem (Rao-Blackwell) Let δ(X1 , X2 , . . . , Xn ) be an estimator
of θ with finite variance. Suppose that T is sufficient for θ, and let
δ ∗ (T ), defined by δ ∗ (t) = E (δ(X1 , X2 , . . . , Xn ) | T = t), be the
conditional expectation of δ(X1 , X2 , . . . , Xn ) given T = t. Then
E (δ ∗ (T ) − θ)2 ≤ E (δ(X1 , X2 , . . . , Xn ) − θ)2 .
The inequality is strict unless δ = δ ∗ , or equivalently, δ is already a
function of T .
By the property of iterated conditional expectation,
E (δ ∗ (T )) = E [E (δ(X1 , X2 , . . . , Xn ) | T )] = E (δ(X1 , X2 , . . . , Xn )).
Therefore, to compare the mean squared errors (MSE) of the two
estimators, compare their variances only. Now,
Var(δ(X1 , X2 , . . . , Xn )) = Var [E (δ | T )] + E [Var(δ | T )]
= Var(δ ∗ ) + E [Var(δ | T )] > Var(δ ∗ ),
unless Var(δ | T ) = 0, which is the case only if δ is a function of
T.
The Rao–Blackwell theorem involves two key steps: variance
reduction by conditioning and conditioning by a sufficient statistic.
The first step is based on the analysis of variance formula: For any
two random variables S and T , because
Var(S ) = Var(E (S | T )) + E (Var(S | T )),
one can reduce the variance of a random variable S by taking
conditional expectation given some auxiliary information T . This
can be exploited in MCMC.
(Xj , Yj ), j = 1, 2, . . . , N : a single run of the Gibbs sampler
algorithm with a target distribution of a bivariate random vector
(X , Y ). Let h(X ) be a function of the X component of (X , Y )
and let its mean value be µ. Goal is to estimate µ. A first estimate
is the sample mean of the h(Xj ), j = 1, 2, . . . , N . From the
MCMC theory, as N → ∞, this estimate will converge to µ in
probability. The computation of variance of this estimator is not
easy due to the (Markovian) dependence of the sequence
{Xj , j = 1, 2, . . . , N }. Suppose we make n independent runs of
Gibbs sampler and generate
(Xij , Yij ), j = 1, 2, . . . , N ; i = 1, 2, . . . , n. Suppose that N is
sufficiently large so that (XiN , YiN ) can be regarded as a sample
from the limiting target distribution of the Gibbs sampling scheme.
Thus (XiN , YiN ), i = 1, 2, . . . , n form a random sample from the
target distribution. Consider a second estimate of µ—the sample
mean of h(XiN ), i = 1, 2, . . . , n.
This estimator ignores part of the MCMC data but has the
advantage that the variables h(XiN ), i = 1, 2, . . . , n are
independent and hence the variance of their mean is of order n −1 .
Now applying the variance reduction idea of the Rao-Blackwell
theorem by using the auxiliary information YiN , i = 1, 2, . . . , n,
one can improve this estimator as follows:
Let k(y) = E(h(X) | Y = y). Then for each i, k(Y_{iN}) has a smaller variance than h(X_{iN}), and hence the following third estimator,

    (1/n) Σ_{i=1}^n k(Y_{iN}),

has a smaller variance than the second one. A crucial fact to keep in mind here is that the exact functional form of k(y) must be available for implementing this improvement.
(Example M2 continued.) X | θ ∼ N(θ, σ²) with known σ², and θ ∼ Cauchy(µ, τ). We want to simulate θ from the posterior distribution, but sampling directly is difficult.
Gibbs sampling: the Cauchy is a scale mixture of normal densities, with the scale parameter having a Gamma distribution:

    π(θ) ∝ [τ² + (θ − µ)²]^{−1}
         ∝ ∫₀^∞ (λ/(2πτ²))^(1/2) exp(−(λ/(2τ²))(θ − µ)²) λ^(1/2−1) exp(−λ/2) dλ,

so that π(θ) may be considered the marginal prior density from the joint prior density of (θ, λ), where

    θ | λ ∼ N(µ, τ²/λ)  and  λ ∼ Gamma(1/2, 1/2).

This implicit hierarchical prior structure implies that π(θ | x) is the marginal density from π(θ, λ | x).
The full conditionals of π(θ, λ | x) are standard distributions:

    θ | λ, x ∼ N( [τ²/(τ² + λσ²)] x + [λσ²/(τ² + λσ²)] µ,  τ²σ²/(τ² + λσ²) ),    (24)

    λ | θ, x ∼ λ | θ ∼ Exponential( (τ² + (θ − µ)²)/(2τ²) ).    (25)

Thus, the Gibbs sampler will use (24) and (25) to generate (θ, λ) from π(θ, λ | x).
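In R, the resulting Gibbs sampler is a short loop (the values of x, σ, µ, τ are the same hypothetical ones used earlier):

> x <- 2; sigma <- 1; mu <- 0; tau <- 1
> n <- 20000
> theta <- numeric(n); lambda <- 1     # initial value for lambda
> for (t in 1:n) {
+     v <- tau^2 * sigma^2 / (tau^2 + lambda * sigma^2)
+     m <- (tau^2 * x + lambda * sigma^2 * mu) / (tau^2 + lambda * sigma^2)
+     theta[t] <- rnorm(1, m, sqrt(v))                  # draw theta from (24)
+     lambda <- rexp(1, rate = (tau^2 + (theta[t] - mu)^2) / (2 * tau^2))  # draw lambda from (25)
+ }
> mean(theta)                          # Gibbs estimate of E(theta | x)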
Example M5. X = number of defectives in the daily production of a product. (X | Y, θ) ∼ Binomial(Y, θ), where Y, a day's production, is Poisson with known mean λ, and θ is the probability that any product is defective. The difficulty is that Y is not observable, and inference has to be made on the basis of X only.
Prior: θ ∼ Beta(α, γ), with known α and γ, independent of Y. Bayesian analysis here is not difficult because the posterior distribution of θ | X = x can be obtained as follows. First, X | θ ∼ Poisson(λθ). Next, θ ∼ Beta(α, γ). Therefore,

    π(θ | X = x) ∝ exp(−λθ) θ^(x+α−1) (1 − θ)^(γ−1),   0 < θ < 1.    (26)

This is not a standard distribution, and hence posterior quantities cannot be obtained in closed form. Instead of focusing on θ | X directly, view it as a marginal component of (Y, θ | X). Check that the full conditionals of this are given by

    Y | X = x, θ ∼ x + Poisson(λ(1 − θ)),  and
    θ | X = x, Y = y ∼ Beta(α + x, γ + y − x),

both of which are standard distributions.
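These two conditionals translate directly into a Gibbs sampler; a minimal sketch, using the parameter values quoted later for this example (x = 1, α = 1, γ = 49, λ = 100):

> x <- 1; alpha <- 1; gamma <- 49; lambda <- 100
> n <- 20000
> theta <- numeric(n); th <- 0.01              # starting value for theta
> for (t in 1:n) {
+     y <- x + rpois(1, lambda * (1 - th))     # draw Y | x, theta
+     th <- rbeta(1, alpha + x, gamma + y - x) # draw theta | x, y
+     theta[t] <- th
+ }
> mean(theta)                                  # estimate of E(theta | x)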
Example M5 continued. It is actually possible here to sample from the posterior distribution using the accept-reject Monte Carlo method:
Let g(x)/K be the target density, where K is the possibly unknown normalizing constant of the unnormalized density g. Suppose h(x) is a density that can be simulated by a known method and is close to g, and suppose there exists a known constant c > 0 such that g(x) < c h(x) for all x. Then, to simulate from the target density, the following two steps suffice.
Step 1. Generate Y ∼ h and U ∼ U(0, 1);
Step 2. Accept X = Y if U ≤ g(Y)/{c h(Y)}; return to Step 1 otherwise.
The optimal choice for c is sup_x {g(x)/h(x)}.
In Example M5, from (26),

    g(θ) = exp(−λθ) θ^(x+α−1) (1 − θ)^(γ−1) I{0 ≤ θ ≤ 1},

so that h(θ) may be chosen to be the density of Beta(x + α, γ). Then, with the above-mentioned choice for c, if θ ∼ Beta(x + α, γ) is generated in Step 1, its 'acceptance probability' in Step 2 is simply exp(−λθ).
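The accept-reject step can be vectorized in R; a minimal sketch with the same illustrative values:

> x <- 1; alpha <- 1; gamma <- 49; lambda <- 100
> m <- 10^6
> cand <- rbeta(m, x + alpha, gamma)       # Step 1: proposals from Beta(x + alpha, gamma)
> keep <- runif(m) <= exp(-lambda * cand)  # Step 2: accept with probability exp(-lambda * theta)
> theta <- cand[keep]
> length(theta) / m                        # acceptance rate
> mean(theta)                              # estimate of E(theta | x)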
Even though this method works here, let us see how the
Metropolis-Hastings algorithm can be applied.
The required Markov chain is generated by taking the transition density q(z, y) = q(y | z) = h(y), independently of z. Then the acceptance probability is

    ρ(z, y) = min{ g(y)h(z) / (g(z)h(y)), 1 } = min{ exp(−λ(y − z)), 1 }.

The steps involved in this "independent" M-H algorithm are:
Start at t = 0 with a value x0 in the support of the target distribution; in this case, 0 < x0 < 1. Given xt, generate the next value in the chain as given below.
(a) Draw Yt from Beta(x + α, γ).
(b) Let x_{t+1} = Yt with probability ρt, and x_{t+1} = xt otherwise, where ρt = min{exp(−λ(Yt − xt)), 1}.
(c) Set t = t + 1 and go to step (a).
Run this chain until t = n, a suitably chosen large integer. In our example, for x = 1, α = 1, γ = 49 and λ = 100, we simulated such a Markov chain. The resulting frequency histogram is shown in the figure below, with the true posterior density superimposed on it.
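A sketch of this independent M-H chain in R (same parameter values):

> x <- 1; alpha <- 1; gamma <- 49; lambda <- 100
> n <- 50000
> chain <- numeric(n)
> chain[1] <- 0.01                        # starting value in (0, 1)
> for (t in 1:(n - 1)) {
+     yt <- rbeta(1, x + alpha, gamma)    # (a) proposal, independent of chain[t]
+     rho <- min(exp(-lambda * (yt - chain[t])), 1)
+     chain[t + 1] <- if (runif(1) < rho) yt else chain[t]   # (b)
+ }
> hist(chain, freq = FALSE, breaks = 50)  # compare with the true posterior density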
Figure: M-H frequency histogram and true posterior density.
8 Empirical Bayes Methods for High Dimensional Problems
Empirical Bayes Methods for High Dimensional Problems
This is becoming popular again, this time for 'high dimensional' problems. Astronomers routinely estimate characteristics of millions of similar astronomical objects – distance, radial velocity, whatever. Consider the data:

    X₁ = (X₁₁, X₁₂, . . . , X₁ₙ),  X₂ = (X₂₁, X₂₂, . . . , X₂ₙ),  · · · ,  X_p = (X_{p1}, X_{p2}, . . . , X_{pn}).

X_j represents n repeated independent observations on the j-th object, j = 1, 2, . . . , p. The important point is that n is small, 2, 5, or 10, whereas p is large, such as a million.
Suppose X_{j1}, . . . , X_{jn} measure µ_j with variability σ².
Problem: Maximum likelihood can give wrong estimates
Take n = 2 and suppose

    (X_{j1}, X_{j2}) ∼ N₂( (µ_j, µ_j), [[σ², 0], [0, σ²]] ),   j = 1, 2, . . . , p,

i.e., we measure µ_j with 2 independent measurements, each coming with a N(0, σ²) error added to it; we do this for a very large number p of objects. What is the MLE of σ²?
    l(µ₁, . . . , µ_p; σ² | x₁, . . . , x_p) = f(x₁, . . . , x_p | µ₁, . . . , µ_p; σ²)
      = ∏_{j=1}^p ∏_{i=1}^2 f(x_{ji} | µ_j, σ²)
      = (2πσ²)^(−p) exp(−(1/(2σ²)) Σ_{j=1}^p Σ_{i=1}^2 (x_{ji} − µ_j)²)
      = (2πσ²)^(−p) exp(−(1/(2σ²)) Σ_{j=1}^p [Σ_{i=1}^2 (x_{ji} − x̄_j)² + 2(x̄_j − µ_j)²]).

The MLEs are µ̂_j = x̄_j = (x_{j1} + x_{j2})/2 and

    σ̂² = (1/(2p)) Σ_{j=1}^p Σ_{i=1}^2 (x_{ji} − x̄_j)²
        = (1/(2p)) Σ_{j=1}^p [ (x_{j1} − (x_{j1}+x_{j2})/2)² + (x_{j2} − (x_{j1}+x_{j2})/2)² ]
        = (1/(2p)) Σ_{j=1}^p (x_{j1} − x_{j2})²/2
        = (1/(4p)) Σ_{j=1}^p (x_{j1} − x_{j2})².
Since X_{j1} − X_{j2} ∼ N(0, 2σ²), j = 1, 2, . . . ,

    (1/p) Σ_{j=1}^p (X_{j1} − X_{j2})² →_P 2σ²  as p → ∞,  so that

    σ̂² = (1/(4p)) Σ_{j=1}^p (X_{j1} − X_{j2})² →_P σ²/2,  and not σ².

Good estimates for σ² do exist, for example,

    (1/(2p)) Σ_{j=1}^p (X_{j1} − X_{j2})² →_P σ².
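A quick simulation illustrating the inconsistency (σ = 1 here, so σ² = 1; the corrected estimator recovers it while the MLE settles near 1/2):

> p <- 10^5; sigma <- 1
> mu <- rnorm(p, mean = 5, sd = 2)      # arbitrary true means, one per object
> x1 <- mu + rnorm(p, sd = sigma)
> x2 <- mu + rnorm(p, sd = sigma)
> sum((x1 - x2)^2) / (4 * p)            # MLE: converges to sigma^2 / 2
> sum((x1 - x2)^2) / (2 * p)            # corrected estimator: converges to sigma^2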
What is going wrong here?
This is not a small p, large n problem, but a small n, large p
problem. i.e. a high dimensional problem, so needs care!
As p → ∞, there are too many parameters to estimate and the
likelihood function is unable to see where information lies, so tries
to distribute it everywhere.
What is the way out? Go Bayesian!
There is a lot of information available on σ² (note Σ_{j=1}^p (X_{j1} − X_{j2})² ∼ 2σ²χ²_p), but very little on the individual µ_j. However, if the µ_j are 'similar', there is a lot of information on where they come from, because we get to see p samples, p large.
Suppose we are interested in µ_j. How can we use the above information? Model as follows:
information? Model as follows:
X̄_j | µ_j, σ² ∼ N(µ_j, σ²/2), j = 1, . . . , p, independent observations.
σ² may be assumed known, since a reliable estimate σ̂² = (1/(2p)) Σ_{j=1}^p (X_{j1} − X_{j2})² is available. Express the information that the µ_j are 'similar' in the form:
µ_j, j = 1, . . . , p is a random sample (collection) from N(η, τ²).
Where do we get η and τ², the prior mean and prior variance?
Marginally (or in the predictive sense), X̄_j, j = 1, . . . , p is a random sample from N(η, τ² + σ²/2). Use this random sample:
estimate η by η̂ = (1/p) Σ_j X̄_j, and τ² by

    τ̂² = [ (1/(p−1)) Σ_{j=1}^p (X̄_j − η̂)² − σ²/2 ]₊.

Now one could pretend that the prior for (µ₁, . . . , µ_p) is N(η̂, τ̂²) and compute the Bayes estimates for µ_j:

    E(µ_j | X₁, . . . , X_p) = (1 − B̂) X̄_j + B̂ η̂,   where B̂ = (σ²/2) / (σ²/2 + τ̂²).

If instead of 2 observations each sample has n observations, replace 2 by n. This is called Empirical Bayes since the prior is estimated using data. There is also a fully Bayesian counterpart called Hierarchical Bayes.
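A small simulation sketch of this Empirical Bayes recipe (all parameter values invented for illustration; σ² is treated as known, as suggested above):

> p <- 10^4; sigma <- 1; eta <- 5; tau <- 0.5
> mu <- rnorm(p, eta, tau)                     # 'similar' true means
> xbar <- mu + rnorm(p, sd = sigma / sqrt(2))  # X-bar_j | mu_j ~ N(mu_j, sigma^2/2)
> eta.hat <- mean(xbar)
> tau2.hat <- max(var(xbar) - sigma^2 / 2, 0)  # positive-part estimate of tau^2
> B.hat <- (sigma^2 / 2) / (sigma^2 / 2 + tau2.hat)
> mu.eb <- (1 - B.hat) * xbar + B.hat * eta.hat
> mean((mu.eb - mu)^2) / mean((xbar - mu)^2)   # shrinkage reduces risk (ratio < 1)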
9 Formal Methods for Model Selection
Formal Methods for Model Selection
What is the best model for Gamma-ray burst afterglow?
Consider a simpler, abstract problem instead.
Suppose X having density f(x | θ) is observed, with θ being an unknown element of the parameter space Θ. We are interested in comparing two models M0 and M1:

    M0: X has density f(x | θ) where θ ∈ Θ0;
    M1: X has density f(x | θ) where θ ∈ Θ1.    (27)

Simplify even further, and assume we want to test

    M0: θ = θ0  versus  M1: θ ≠ θ0.    (28)

Frequentist: A (classical) significance test is derived. It is based on a test statistic T(X), large values of which are deemed to provide evidence against the null hypothesis M0. If data X = x is observed, with corresponding t = T(x), the P-value is

    α = P_{θ0}(T(X) ≥ T(x)).
Example 6. Consider a random sample X1, . . . , Xn from N(θ, σ²), where σ² is known. Then X̄ is sufficient for θ and it has the N(θ, σ²/n) distribution. Noting that T = T(X̄) = |√n(X̄ − θ0)/σ| is a natural test statistic to test (28), one obtains the usual P-value as α = 2[1 − Φ(t)], where t = |√n(x̄ − θ0)/σ| and Φ is the standard normal cumulative distribution function.
What is a P-value and what does it say? P-value is the probability
under a (simple) null hypothesis of obtaining a value of a test
statistic that is at least as extreme as that observed in the sample
data.
To compute a P-value we take the observed value of the test
statistic to the reference distribution and check if it is likely or
unlikely under M0 .
χ² Goodness-of-fit test
Example 7. Rutherford and Geiger (1910) gave the following observed numbers of intervals of 1/8 minute in which 0, 1, . . . α-particles were ejected by a specimen. Check if the Poisson fits well.

    Number    0    1    2    3    4    5    6    7    8    9   10   11   12 or more
    Obs.     57  203  383  525  532  408  273  139   45   27   10    4    2
    Exp.     54  211  407  525  508  393  254  140   68   29   11    4    1

Test statistic:

    T = Σ_{i=1}^k (Oi − Ei)²/Ei ∼ χ²_{k−2}  approximately for large n,

where k is the number of cells, Oi is the observed and Ei the expected count (estimated) for the i-th cell.
Estimated Poisson intensity rate = (total number of particles ejected)/(total number of intervals) = 10097/2608 = 3.87. Here k = 13.
P-value = P(T ≥ 14.03) ≈ 0.21 (under χ²₁₁).
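The statistic and P-value can be reproduced in R from the table above:

> obs <- c(57, 203, 383, 525, 532, 408, 273, 139, 45, 27, 10, 4, 2)
> expected <- c(54, 211, 407, 525, 508, 393, 254, 140, 68, 29, 11, 4, 1)
> stat <- sum((obs - expected)^2 / expected)
> stat                                        # approximately 14.03
> pchisq(stat, df = 11, lower.tail = FALSE)   # P-value, roughly 0.2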
Likelihood Ratio Criterion
The standard likelihood ratio criterion for comparing M0 and M1 is

    λn = f(x | θ̂0) / f(x | θ̂) = max_{θ∈Θ0} f(x | θ) / max_{θ∈Θ0∪Θ1} f(x | θ).    (29)

Here 0 < λn ≤ 1, and large values of λn provide evidence for M0. Reject M0 for small values.
Use λn (or a function of λn) as a test statistic if its distribution under M0 can be derived. Otherwise, use the large sample result:

    −2 log(λn) →_L χ²_{p1−p0}  as n → ∞,

under M0, where p0 and p1 are the dimensions of Θ0 and Θ0 ∪ Θ1.
10 Bayesian Model Selection
Bayesian Model Selection
How does the Bayesian approach work?
X ∼ f(x | θ) and we want to test

    M0: θ ∈ Θ0  versus  M1: θ ∈ Θ1.    (30)

If Θ0 and Θ1 are of the same dimension (e.g., M0: θ ≤ 0 and M1: θ > 0), choose a prior density that assigns positive prior probability to Θ0 and Θ1. Then calculate the posterior probabilities P{Θ0 | x}, P{Θ1 | x} as well as the posterior odds ratio, namely P{Θ0 | x}/P{Θ1 | x}.
Find a threshold like 1/9 or 1/19, etc. to decide what constitutes evidence against M0.
Alternatively, let π0 and 1 − π0 be the prior probabilities of Θ0 and Θ1. Let gi(θ) be the prior p.d.f. of θ under Θi (or Mi), so that ∫_{Θi} gi(θ) dθ = 1.
The prior in the previous approach is nothing but

π(θ) = π0 g0(θ) I{θ ∈ Θ0} + (1 − π0) g1(θ) I{θ ∈ Θ1}.

We need no longer require that Θ0 and Θ1 are of the same dimension; sharp null hypotheses are also covered. Proceed as before and report posterior probabilities or posterior odds. To compute these posterior quantities, note that the marginal density of X under the prior π can be expressed as

m_π(x) = ∫_Θ f(x | θ) π(θ) dθ
       = π0 ∫_{Θ0} f(x | θ) g0(θ) dθ + (1 − π0) ∫_{Θ1} f(x | θ) g1(θ) dθ,

and hence the posterior density of θ given the data X = x is

π(θ | x) = f(x | θ) π(θ) / m_π(x)
         = π0 f(x | θ) g0(θ) / m_π(x)         if θ ∈ Θ0,
           (1 − π0) f(x | θ) g1(θ) / m_π(x)   if θ ∈ Θ1.
It follows then that

P^π(M0 | x) = P^π(Θ0 | x) = (π0 / m_π(x)) ∫_{Θ0} f(x | θ) g0(θ) dθ
  = π0 ∫_{Θ0} f(x | θ) g0(θ) dθ / [ π0 ∫_{Θ0} f(x | θ) g0(θ) dθ + (1 − π0) ∫_{Θ1} f(x | θ) g1(θ) dθ ],

P^π(M1 | x) = P^π(Θ1 | x) = ((1 − π0) / m_π(x)) ∫_{Θ1} f(x | θ) g1(θ) dθ
  = (1 − π0) ∫_{Θ1} f(x | θ) g1(θ) dθ / [ π0 ∫_{Θ0} f(x | θ) g0(θ) dθ + (1 − π0) ∫_{Θ1} f(x | θ) g1(θ) dθ ].
One may also report the Bayes factor, which does not depend on π0. The Bayes factor of M0 relative to M1 is defined as

BF01 = [ P(Θ0 | x) / P(Θ1 | x) ] / [ P(Θ0) / P(Θ1) ] = ∫_{Θ0} f(x | θ) g0(θ) dθ / ∫_{Θ1} f(x | θ) g1(θ) dθ. (31)

Note: BF10 = 1/BF01.
Posterior odds ratio of M0 relative to M1:

P(Θ0 | x) / P(Θ1 | x) = [ π0 / (1 − π0) ] BF01.

In particular, the posterior odds ratio of M0 relative to M1 equals BF01 if π0 = 1/2.
The smaller the value of BF01 , the stronger the evidence
against M0 .
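A numerical sketch of these formulas in R for a simple case; the binomial data and the uniform priors g0, g1 are illustrative assumptions, not taken from the text.

## BF01 and posterior P(M0 | x) via (31), for hypothetical binomial data
## (x = 7 successes in n = 10 trials), testing M0: theta <= 0.5 versus
## M1: theta > 0.5 with uniform priors on each piece and pi0 = 1/2.
x <- 7; n <- 10
lik <- function(theta) dbinom(x, size = n, prob = theta)
g0 <- function(theta) dunif(theta, 0, 0.5)   # integrates to 1 over Theta0
g1 <- function(theta) dunif(theta, 0.5, 1)   # integrates to 1 over Theta1
m0 <- integrate(function(t) lik(t) * g0(t), 0, 0.5)$value
m1 <- integrate(function(t) lik(t) * g1(t), 0.5, 1)$value
BF01  <- m0 / m1
post0 <- m0 / (m0 + m1)   # P(M0 | x) when pi0 = 1/2
c(BF01 = BF01, post0 = post0)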
Testing as a model selection problem using the Bayes factor is illustrated below with Jeffreys's test.
Jeffreys Test for Normal Mean, σ² Unknown
Let X1, X2, . . . , Xn be a random sample from N(µ, σ²). We want to test

M0: µ = µ0 versus M1: µ ≠ µ0,

where µ0 is some specified number.
The parameter σ² is common to the two models and µ occurs only in M1. Take the prior g0(σ) = 1/σ for σ under M0. Under M1, take the same prior for σ and add a conditional prior for µ given σ, namely

g1(µ | σ) = (1/σ) g2(µ/σ),

where g2(·) is a p.d.f. Jeffreys suggested taking g2 to be Cauchy, so

g0(σ) = 1/σ under M0,
g1(µ, σ) = (1/σ) g1(µ | σ) = (1/σ) · 1 / [σ π (1 + µ²/σ²)] under M1.
One may now find the Bayes factor BF01 using (31).
Example 8. Einstein's theory of gravitation predicts the amount by which light is deflected by a gravitating body. Eddington's expedition in 1919 (and other groups in 1922 and 1929) provided 4 observations: x1 = 1.98, x2 = 1.61, x3 = 1.18, x4 = 2.24 (all in seconds of arc, as measures of angular deflection). Suppose they are normally distributed around their predicted value µ; then X1, . . . , X4 are independent and identically distributed as N(µ, σ²). Einstein's prediction is µ = 1.75. Test M0: µ = 1.75 versus M1: µ ≠ 1.75, where σ² is unknown.
Use the conventional priors of Jeffreys to calculate the Bayes
factor.
BF01 = 2.98.
The calculations with the given data lend some support to
Einstein’s prediction. However, the evidence in the data isn’t very
strong.
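A sketch of this computation in R, assuming g2 is a Cauchy centred at µ0 with scale σ (our reading of Jeffreys's suggestion); the nested numerical integrals are one possible implementation, not necessarily how the quoted value was obtained.

## Numerical BF01 for Example 8 under Jeffreys's conventional priors.
x <- c(1.98, 1.61, 1.18, 2.24); mu0 <- 1.75
lik <- function(mu, sigma) prod(dnorm(x, mu, sigma))
## m0(x): integrate lik(mu0, sigma) * (1/sigma) over sigma
m0 <- integrate(function(s) sapply(s, function(si) lik(mu0, si) / si),
                0, Inf)$value
## m1(x): integrate lik(mu, sigma) * Cauchy(mu | mu0, sigma) * (1/sigma)
inner <- function(si)
    integrate(function(mu) sapply(mu, function(m) lik(m, si)) *
                  dcauchy(mu, location = mu0, scale = si),
              -Inf, Inf)$value / si
m1 <- integrate(function(s) sapply(s, inner), 0, Inf)$value
m0 / m1   # should be close to the value 2.98 quoted above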
BIC
When we compare two models M0 : θ ∈ Θ0 and M1 : θ ∈ Θ1 ,
what does the Bayes factor
BF01 = ∫_{Θ0} f(x | θ) g0(θ) dθ / ∫_{Θ1} f(x | θ) g1(θ) dθ = m0(x) / m1(x)
measure?
m0 (x ) measures how well M0 fits the data x whereas m1 (x )
measures how well M1 fits the same data, so BF01 is the relative
strength of the two models in the predictive sense. This can be
difficult to compute for complicated models, so any good
approximation is welcome.
Approximate marginal density m(x ) of X for large sample size n:
m(x) = ∫ π(θ) f(x | θ) dθ = ?
Laplace’s Method
m(x) = ∫ π(θ) f(x | θ) dθ = ∫ π(θ) ∏_{i=1}^{n} f(xᵢ | θ) dθ
     = ∫ π(θ) exp( Σ_{i=1}^{n} log f(xᵢ | θ) ) dθ = ∫ π(θ) exp(n h(θ)) dθ,

where h(θ) = (1/n) Σ_{i=1}^{n} log f(xᵢ | θ).
Consider any integral of the form

I = ∫_{−∞}^{∞} q(θ) e^{n h(θ)} dθ,

where q and h are smooth functions of θ with h having a unique maximum at θ̂.
If h has a unique sharp maximum at θ̂, then most of the contribution to the integral I comes from a small neighborhood (θ̂ − δ, θ̂ + δ) of θ̂. Study the behavior of I as n → ∞: as n → ∞, we have

I ∼ I1 = ∫_{θ̂−δ}^{θ̂+δ} q(θ) e^{n h(θ)} dθ.
Laplace's method involves Taylor series expansion of q and h about θ̂ (note that h′(θ̂) = 0 at the maximum):

I ∼ ∫_{θ̂−δ}^{θ̂+δ} [ q(θ̂) + (θ − θ̂) q′(θ̂) + ½ (θ − θ̂)² q″(θ̂) + · · · ]
      × exp[ n h(θ̂) + n h′(θ̂)(θ − θ̂) + (n/2) h″(θ̂)(θ − θ̂)² + · · · ] dθ
  ∼ e^{n h(θ̂)} q(θ̂) ∫_{θ̂−δ}^{θ̂+δ} [ 1 + (θ − θ̂) q′(θ̂)/q(θ̂) + ½ (θ − θ̂)² q″(θ̂)/q(θ̂) ]
      × exp[ (n/2) h″(θ̂)(θ − θ̂)² ] dθ.

Assume c = −h″(θ̂) > 0 and use the change of variable t = √(nc) (θ − θ̂):

I ∼ e^{n h(θ̂)} q(θ̂) (1/√(nc)) ∫_{−δ√(nc)}^{δ√(nc)} [ 1 + (t/√(nc)) q′(θ̂)/q(θ̂) + (t²/(2nc)) q″(θ̂)/q(θ̂) ] e^{−t²/2} dt
  ∼ e^{n h(θ̂)} (√(2π)/√(nc)) q(θ̂) [ 1 + q″(θ̂) / (2 n c q(θ̂)) ]
  = e^{n h(θ̂)} (√(2π)/√(nc)) q(θ̂) (1 + O(n⁻¹)). (32)

Apply (32) to m(x) = ∫ π(θ) f(x | θ) dθ = ∫ π(θ) exp(n h(θ)) dθ, with q = π, and ignore terms that stay bounded as n → ∞:

log m(x) ≈ n h(θ̂) − ½ log n = log f(x | θ̂) − ½ log n.
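The accuracy of (32) can be checked in R on an integral with a known closed form: the binomial marginal likelihood under a uniform prior, where q = π = 1 and the exact answer is a Beta function.

## Laplace approximation (32) for m = Int_0^1 theta^s (1-theta)^(n-s) dtheta,
## which equals Beta(s + 1, n - s + 1) exactly.
n <- 50; s <- 15
h <- function(theta) (s/n) * log(theta) + ((n - s)/n) * log(1 - theta)
theta_hat <- s / n                            # maximizer of h
c_hat <- 1 / (theta_hat * (1 - theta_hat))    # c = -h''(theta_hat)
laplace <- exp(n * h(theta_hat)) * sqrt(2 * pi / (n * c_hat))  # q = pi = 1
exact <- beta(s + 1, n - s + 1)
c(laplace = laplace, exact = exact)           # agree up to O(1/n)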
What Happens When θ is p > 1 Dimensional?
Simply replace (32) by its p-dimensional counterpart:

I = e^{n h(θ̂)} (2π)^{p/2} n^{−p/2} det(Δ_h(θ̂))^{−1/2} q(θ̂) (1 + O(n⁻¹)),

where Δ_h(θ) denotes the Hessian of −h, i.e.,

Δ_h(θ) = −( ∂² h(θ) / ∂θᵢ ∂θⱼ )_{p×p}.

Now apply this to m(x) = ∫ · · · ∫ π(θ) f(x | θ) dθ = ∫ · · · ∫ π(θ) exp(n h(θ)) dθ, with q = π, and ignore terms that stay bounded. Then

log m(x) ≈ n h(θ̂) − (p/2) log n = log f(x | θ̂) − (p/2) log n.
Schwarz (1978) proposed a criterion, known as the BIC, based on
(32) ignoring the terms that stay bounded as the sample size
n → ∞ (and general dimension p for θ):
BIC = log f (x | θ̂) − (p/2) log n
This serves as an approximation to the logarithm of the integrated
likelihood of the model and is free from the choice of prior.
2 log BF01 is a commonly used evidential measure of the support provided by the data x for M0 relative to M1. Under the above approximation we have

2 log(BF01) ≈ 2 log[ f(x | θ̂0) / f(x | θ̂1) ] − (p0 − p1) log n. (33)

This is the approximate Bayes factor based on the Bayesian information criterion (BIC) due to Schwarz (1978). The term (p0 − p1) log n can be considered a penalty for using a more complex model.
AIC
Recall the likelihood ratio criterion: λn = f(x | θ̂0) / f(x | θ̂). Note that

P(M0 is rejected | M0) = P(λn < c) ≈ P(χ²_{p1−p0} > −2 log c) > 0,

so, from a frequentist point of view, a criterion based solely on the likelihood ratio does not converge to a sure answer under M0.
Akaike (1983) suggested a penalized likelihood criterion:

2 log[ f(x | θ̂0) / f(x | θ̂1) ] − 2 (p0 − p1), (34)

which is based on the Akaike information criterion (AIC), namely

AIC = 2 log f(x | θ̂) − 2p
for a model f (x | θ). The penalty for using a complex model is
not as drastic as that in BIC.
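A small R sketch contrasting (33) and (34) on the normal-mean test (M0: θ = 0 with σ = 1 known, so p0 = 0 and p1 = 1); the data are simulated for illustration.

## BIC- and AIC-based versions of 2 log BF01, equations (33) and (34).
set.seed(7)
n <- 100
x <- rnorm(n, mean = 0.25, sd = 1)
l0 <- sum(dnorm(x, 0,       1, log = TRUE))  # log f(x | theta0_hat)
l1 <- sum(dnorm(x, mean(x), 1, log = TRUE))  # log f(x | theta1_hat)
bic_version <- 2 * (l0 - l1) - (0 - 1) * log(n)  # eq. (33): penalty log n
aic_version <- 2 * (l0 - l1) - 2 * (0 - 1)       # eq. (34): penalty 2
c(BIC = bic_version, AIC = aic_version)

Note that R's built-in AIC() and BIC() functions use the −2 log L + penalty sign convention, so their values differ in sign from the definitions above.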
Model Selection or Model Averaging?
Example 9. Velocities (km/second) of 82 galaxies in six
well-separated conic sections of the Corona Borealis region. How
many clusters?
Consider a mixture of normals:

f(x | θ) = ∏_{i=1}^{n} f(xᵢ | θ) = ∏_{i=1}^{n} Σ_{j=1}^{k} pⱼ φ(xᵢ | µⱼ, σⱼ²),

where k is the number of mixture components and pⱼ is the weight given to the j-th component, N(µⱼ, σⱼ²).
Models to consider:

Mk: X has density Σ_{j=1}^{k} pⱼ φ(x | µⱼ, σⱼ²), k = 1, 2, . . . ,

i.e., Mk is a k-component normal mixture.
A Bayesian model selection procedure computes m(x | Mk) = ∫ π(θk) f(x | θk) dθk for each k of interest and picks the one which gives the largest value.
Example 9 contd. Chib (1995), JASA:

k   σⱼ²                log(m(x | Mk))
2   σⱼ² = σ²           −240.464
3   σⱼ² = σ²           −228.620
3   σⱼ² unrestricted   −224.138

The 3-component normal mixture model with unequal variances seems best.
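For illustration, the quoted log marginal likelihoods can be converted into posterior model probabilities in R, assuming equal prior probability 1/3 on each model; these are also the weights that model averaging, discussed next, would use.

## Posterior model probabilities from Chib's (1995) log marginal
## likelihoods, with equal prior model probabilities.
log_m <- c(k2 = -240.464, k3_eqvar = -228.620, k3_unrestr = -224.138)
w <- exp(log_m - max(log_m))   # subtract max for numerical stability
w <- w / sum(w)                # P(M_k | x)
round(w, 4)                    # unrestricted 3-component model dominates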
From the Bayesian point of view, a natural approach to model uncertainty is to include all models Mk under consideration for future decisions,
i.e., bypass the model-choice step entirely.
This is unsuitable for scientific inference, where selection of a model is a must.
It is suitable for prediction purposes, since the underestimation of uncertainty resulting from choosing a single model Mk̂ is eliminated.
We have Θ = ∪k Θk,

f(y | θ) = fk(y | θk) if θ ∈ Θk, and
π(θ) = pk gk(θk) if θ ∈ Θk,

where pk = P^π(Mk) is the prior probability of Mk and gk integrates to 1 over Θk. Therefore, given the sample x = (x1, . . . , xn),

π(θ | x) = f(x | θ) π(θ) / m(x)
         = Σk (pk / m(x)) fk(x | θk) gk(θk) I_{Θk}(θk)
         = Σk P(Mk | x) gk(θk | x) I_{Θk}(θk).
The predictive density m(y | x) given the sample x = (x1, . . . , xn) is what is needed. This is given by

m(y | x) = ∫_Θ f(y | θ) π(θ | x) dθ
         = Σk P(Mk | x) ∫_{Θk} fk(y | θk) gk(θk | x) dθk
         = Σk P(Mk | x) mk(y | x),

which is clearly obtained by averaging over all models.
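A purely illustrative R sketch of the averaging formula: the weights and the per-model predictive densities below are made up; in practice mk(y | x) would come from each fitted model and the weights from the posterior model probabilities, as computed above.

## Model-averaged predictive density as a mixture of per-model predictives.
w <- c(0.70, 0.30)                           # P(M_k | x), hypothetical
m_k <- list(function(y) dnorm(y, 20, 2),     # m_1(y | x), hypothetical
            function(y) dnorm(y, 23, 3))     # m_2(y | x), hypothetical
m_avg <- function(y) w[1] * m_k[[1]](y) + w[2] * m_k[[2]](y)
curve(m_avg(x), from = 10, to = 35)          # model-averaged predictive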
Minimum Description Length
Model fitting is like describing the data in a compact form. A
model is better if it can provide a more compact description, or if
it can compress the data more, or if it can be transmitted with
fewer bits. Given a set of models to describe a data set, the best
model is the one which provides the shortest description length.
In general one needs log2 (n) bits to transmit n, but patterns can
reduce the description length.
100 · · · 0: 1 followed by a million 0’s
1010 · · · 10: pair 10 repeated a million times
If data x is known to arise from a probability density p, then the
optimal code length (in an average sense) is given by − log p(x ).
The optimal code length of − log p(x ) is valid only in the discrete
case. What happens in the continuous case? Discretize x and
denote it by [x ] = [x ]δ where δ denotes the precision. This means
we consider

P([x] − δ/2 ≤ X ≤ [x] + δ/2) = ∫_{[x]−δ/2}^{[x]+δ/2} p(u) du ≈ δ p(x)
instead of p(x ) itself as far as coding of x is considered when x is
one-dimensional. In the r -dimensional case, replace the density
p(x ) by the probability of the r -dimensional cube of side δ
containing x , namely p([x ])δ r ≈ p(x )δ r , so that the optimal code
length changes to − log p(x ) − r log δ.
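A one-line R illustration of the discretized code length (in bits, hence base-2 logs); the precision δ, the observation, and the N(0, 1) model are all hypothetical choices.

## Code length of a continuous observation at precision delta:
## -log2 p(x) - log2 delta bits, per the approximation above.
delta <- 0.01               # discretization precision (illustrative)
x <- 1.3                    # a hypothetical observation
p <- dnorm(x)               # density under an assumed N(0, 1) model
-log2(p * delta)            # = -log2 p(x) - log2 delta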
MDL for Estimation or Model Fitting
Consider data x ≡ x n = (x1 , x2 , . . . , xn ), and suppose
F = {f (x n | θ) : θ ∈ Θ}
is the collection of models of interest. Further, let π(θ) be a prior
density for θ. Given a value of θ (or a model), the optimal code
length for describing x n is − log f (x n | θ), but since θ is unknown,
its description requires a further − log π(θ) bits on average.
Therefore the optimal code length is obtained upon minimizing
DL(θ) = − log π(θ) − log f(xⁿ | θ), (35)
so that MDL amounts to seeking that model which minimizes the
sum of
(i) the length, in bits, of the description of the model, and
(ii) the length, in bits, of data when encoded with the help of the
model.
The posterior density of θ given the data xⁿ is

π(θ | xⁿ) = f(xⁿ | θ) π(θ) / m(xⁿ), (36)

where m(xⁿ) is the marginal or predictive density. Minimizing

DL(θ) = − log π(θ) − log f(xⁿ | θ) = − log{ f(xⁿ | θ) π(θ) }

over θ is equivalent to maximizing π(θ | xⁿ). Thus MDL for estimation or model fitting is equivalent to finding the highest posterior density (HPD) estimate of θ.
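A minimal R sketch of MDL estimation as posterior-mode finding, with an assumed N(0, 10²) prior and N(θ, 1) likelihood; both modelling choices and the data are illustrative.

## MDL point estimation = posterior mode: minimize (35) numerically.
set.seed(3)
x <- rnorm(30, mean = 1.5, sd = 1)
DL <- function(theta)
    -dnorm(theta, 0, 10, log = TRUE) - sum(dnorm(x, theta, 1, log = TRUE))
optimize(DL, interval = c(-10, 10))$minimum   # MAP/HPD estimate of theta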
Consider the case of F having model parameters of different
dimensions. Consider the continuous case and discretization.
Denote the k-dimensional θ by θᵏ = (θ1, θ2, . . . , θk). Then

DL(θᵏ) = − log{ π([θᵏ]_{δπ}) δπᵏ } − log{ f([xⁿ]_{δf} | [θᵏ]_{δπ}) δfⁿ }
       = − log π([θᵏ]_{δπ}) − k log δπ − log f([xⁿ]_{δf} | [θᵏ]_{δπ}) − n log δf
       ≈ − log π(θᵏ) − k log δπ − log f(xⁿ | θᵏ) − n log δf.

Note that the term −n log δf is common across all models, so it can be ignored. However, the term −k log δπ reflects the dimension of θ in the model; it varies across models and is influential. According to Rissanen, δπ = 1/√n is optimal, in which case

DL(θᵏ) ≈ − log f(xⁿ | θᵏ) − log π(θᵏ) + (k/2) log n + constant. (37)
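A sketch of (37) used for model choice in R, comparing polynomial regression models of increasing degree; treating − log π(θᵏ) and the constant as common across models is a simplifying assumption, and the data are simulated.

## Rank models by approximate description length (37).
set.seed(11)
n <- 100
x <- runif(n); y <- 1 + 2 * x + rnorm(n, sd = 0.3)   # true model is linear
dl <- sapply(1:4, function(k) {
    fit <- lm(y ~ poly(x, k))
    ## k slope coefficients + intercept + sigma = k + 2 parameters
    -as.numeric(logLik(fit)) + (k + 2) / 2 * log(n)
})
which.min(dl)   # degree with the shortest description length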