Computer intensive statistical methods
Lecture 5
Jonas Wallin
Chalmers, Gothenburg
[email protected]
The agenda of the day
• Repetition
• Bayesian statistics basic ideas
• Priors
• Bayesian Hypothesis testing
The main idea is to build ”realistic” models.
Self-normalized IS
Assume f(x) is known only up to a normalizing constant c_f > 0, i.e. f(x) = z(x)/c_f, where we can evaluate z(x) = c_f f(x) but not f(x). Then, as before,
$$
\tau = \mathbb{E}_f(h(X)) = \int_{\chi} h(x) f(x)\,dx
= \frac{\int_{f(x)>0} h(x)\, c_f f(x)\,dx}{\int_{f(x)>0} c_f f(x)\,dx}
= \frac{\int_{g(x)>0} h(x)\,\frac{c_f f(x)}{g(x)}\, g(x)\,dx}{\int_{g(x)>0} \frac{c_f f(x)}{g(x)}\, g(x)\,dx}
= \frac{\int_{g(x)>0} h(x)\,\omega_u(x)\, g(x)\,dx}{\int_{g(x)>0} \omega_u(x)\, g(x)\,dx}
= \frac{\mathbb{E}_g(h(X)\,\omega_u(X))}{\mathbb{E}_g(\omega_u(X))},
$$
where we are able to evaluate
$$
\omega_u : \{x \in \chi : g(x) > 0\} \ni x \mapsto \frac{z(x)}{g(x)}.
$$
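A minimal sketch of the self-normalized IS estimator in Python; the unnormalized target, the proposal and the function h below are toy choices for illustration only (the target is an unnormalized N(2, 1), so the true value is tau = 2):

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)

def z(x):
    # z(x) = c_f * f(x); here f is N(2, 1) and c_f = sqrt(2*pi) is "unknown" to us
    return np.exp(-0.5 * (x - 2.0) ** 2)

def h(x):
    return x

g_loc, g_scale = 0.0, 3.0                          # proposal g = N(0, 3^2), covers the target

N = 100_000
X = rng.normal(g_loc, g_scale, size=N)
w = z(X) / norm.pdf(X, loc=g_loc, scale=g_scale)   # omega_u(x) = z(x) / g(x)
tau_hat = np.sum(w * h(X)) / np.sum(w)             # self-normalized IS estimate
print(tau_hat)                                     # close to 2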
Control variates
Let Y be a real-valued random variable, referred to as a control variate, such that
(i) E(Y) = m is known, and
(ii) h(X) − Y can be simulated at the same complexity as h(X).
Then for any α ∈ R, let
$$
Z = h(X) + \alpha(Y - m),
$$
so that
$$
\mathbb{E}(Z) = \mathbb{E}(h(X) + \alpha(Y - m)) = \underbrace{\mathbb{E}(h(X))}_{=\tau} + \alpha\,\underbrace{(\mathbb{E}(Y) - m)}_{=0} = \tau.
$$
Control variates (cont.)
If h(X) and Y have covariance C(h(X), Y), it holds that
$$
\mathbb{V}(Z) = \mathbb{V}(h(X) + \alpha Y) = \mathbb{C}(h(X) + \alpha Y,\, h(X) + \alpha Y)
= \mathbb{V}(h(X)) + 2\alpha\,\mathbb{C}(h(X), Y) + \alpha^2\,\mathbb{V}(Y).
$$
Differentiating w.r.t. α and minimizing yields
$$
0 = 2\,\mathbb{C}(h(X), Y) + 2\alpha\,\mathbb{V}(Y)
\quad\Leftrightarrow\quad
\alpha = \alpha^* = -\frac{\mathbb{C}(h(X), Y)}{\mathbb{V}(Y)},
$$
which provides the optimal coefficient α* in terms of variance.
Control variates reconsidered
A problem with the control variate approach is that the optimal α, i.e.
$$
\alpha^* = -\frac{\mathbb{C}(h(X), Y)}{\mathbb{V}(Y)},
$$
depends on quantities that are typically unknown. So instead we use an estimate of each term,
$$
\hat\alpha^* = -\frac{C_N}{V_N},
$$
where
$$
C_N \stackrel{\text{def.}}{=} \frac{1}{N}\sum_{i=1}^{N} h(X_i)(Y_i - m),
\qquad
V_N \stackrel{\text{def.}}{=} \frac{1}{N}\sum_{i=1}^{N} (Y_i - m)^2.
$$
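A minimal sketch of a control-variate estimator with the coefficient estimated from the same sample; the toy problem is an assumption for illustration: τ = E[exp(U)] = e − 1 with U ∼ Unif(0, 1) and control variate Y = U, whose mean m = 1/2 is known:

import numpy as np

rng = np.random.default_rng(1)
N = 100_000
U = rng.uniform(size=N)

hX = np.exp(U)                        # h(X_i)
Y, m = U, 0.5                         # control variate and its known mean

C_N = np.mean(hX * (Y - m))           # estimates C(h(X), Y)
V_N = np.mean((Y - m) ** 2)           # estimates V(Y)
alpha = -C_N / V_N

Z = hX + alpha * (Y - m)              # control-variate corrected samples
print(np.mean(hX), np.mean(Z))        # both estimate e - 1 ≈ 1.718
print(np.var(hX), np.var(Z))          # the variance of Z is much smaller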
Control variates multivariate
There is a multivariate extension of control variates: let Y ∈ R^d be a vector-valued random variable such that
E(Y) = m.
Now for any α ∈ R^d,
$$
Z = h(X) + \alpha^{T}(Y - m)
$$
has E[Z] = τ. Further,
$$
\mathbb{V}(Z) = \mathbb{V}(h(X)) + 2\alpha^{T}\,\mathbb{C}(h(X), Y) + \alpha^{T}\,\mathbb{V}(Y)\,\alpha.
$$
Thus the optimal vector is α* = −V(Y)^{-1} C(h(X), Y).
Control variates multivariate example
Let's estimate π again, using the fact that
$$
P(U_1^2 + U_2^2 \le 1) = \frac{\pi}{4}.
$$
Using three control variates
$$
Y_1 = \mathbb{I}(U_1 + U_2 \le 1),\qquad
Y_2 = \mathbb{I}(U_1 + U_2 \le \sqrt{2}),\qquad
Y_3 = (U_1 + U_2 - 1)\,\mathbb{I}(1 < U_1 + U_2 \le \sqrt{2}).
$$
All expectations are explicit:
$$
\mathbb{E}[Y_1] = \frac{1}{2},\qquad
\mathbb{E}[Y_2] = 1 - \frac{(2 - \sqrt{2})^2}{2},\qquad
\mathbb{E}[Y_3] = \frac{(\sqrt{2} - 1)^2}{2} - \frac{(\sqrt{2} - 1)^3}{3}.
$$
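A minimal sketch of this estimator in Python, forming the optimal α* = −V(Y)^{-1} C(h(X), Y) from the previous slide with empirical covariances:

import numpy as np

rng = np.random.default_rng(1)
N = 100_000
U1, U2 = rng.uniform(size=N), rng.uniform(size=N)
S = U1 + U2

h = (U1**2 + U2**2 <= 1).astype(float)             # E[h] = pi / 4
Y = np.column_stack([
    (S <= 1).astype(float),                        # Y1
    (S <= np.sqrt(2)).astype(float),               # Y2
    (S - 1) * ((S > 1) & (S <= np.sqrt(2))),       # Y3
])
m = np.array([
    0.5,
    1 - (2 - np.sqrt(2))**2 / 2,
    (np.sqrt(2) - 1)**2 / 2 - (np.sqrt(2) - 1)**3 / 3,
])

C = np.cov(Y.T, h)[:3, 3]          # estimates C(h(X), Y), a length-3 vector
V = np.cov(Y.T)                    # estimates V(Y), a 3x3 matrix
alpha = -np.linalg.solve(V, C)     # alpha* = -V(Y)^{-1} C(h(X), Y)

Z = h + (Y - m) @ alpha
print(4 * np.mean(h), 4 * np.mean(Z))   # plain MC vs control-variate estimate of pi
print(np.var(h), np.var(Z))             # variance reduction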
Control variates multivariate example
Variance reduction due to the various control variates, measured by the relative variance 1 − α_*^T V(Y) α_* / V(h(X)):

Control variates:   1      2      3      1,2    1,3    2,3    1,2,3
Relative variance:  0.727  0.242  0.999  0.222  0.62   0.181  0.175
What is a Statistical model
• I define a statistical model as any model where the data is a random variable.
• Recall from the first course in statistics:
$$
y_i = \beta_0 + x_i \beta_1 + \epsilon_i, \qquad \epsilon_i \sim N(0, \sigma^2).
$$
This model gives a likelihood f(Y|Θ) given a set of parameters Θ = {σ, β_0, β_1}.
• To get "the best" (most likely) model, we set the parameters Θ̂ such that the likelihood of the data Y is maximized (typically found numerically); a small sketch follows below.
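A minimal sketch of the last point: a numerical maximum-likelihood fit of this linear model, using simulated data (an assumption for illustration):

import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(1)
x = rng.uniform(size=50)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=50)    # true (beta0, beta1, sigma)

def neg_loglik(theta):
    b0, b1, log_sigma = theta
    return -np.sum(norm.logpdf(y, loc=b0 + b1 * x, scale=np.exp(log_sigma)))

fit = minimize(neg_loglik, x0=np.zeros(3))            # numerical maximization
b0_hat, b1_hat, sigma_hat = fit.x[0], fit.x[1], np.exp(fit.x[2])
print(b0_hat, b1_hat, sigma_hat)                      # close to (1, 2, 0.5)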
What is a Bayesian Statistical model
• In a Bayesian model the parameters are not fixed but are instead treated as random variables.
• First one defines a distribution (prior) π(Θ) on the parameters.
• Then one assumes a model (or a likelihood) given a set of parameters, f(Y|Θ).
• Together this creates a joint distribution
f(Y, Θ) = f(Y|Θ)π(Θ).
• Instead of trying to find "the best" model, one studies the posterior distribution of the parameters, f(Θ|Y).
Bayes’ Formula
To combine the prior and the likelihood we use:
Bayes' Formula
$$
f(\Theta|y) = \frac{f(y|\Theta)\,\pi(\Theta)}{f(y)}
= \frac{f(y|\Theta)\,\pi(\Theta)}{\int_{\chi} f(y|\Theta')\,\pi(\Theta')\,d\Theta'}
$$
f(Θ|y) is called the posterior, or "a posteriori", distribution. Sometimes f(y) is called the prior predictive distribution.
Often, only the proportionality relation
$$
f(\Theta|y) \propto f(\Theta, y) = f(y|\Theta)\,\pi(\Theta)
$$
is needed, when seen as a function of Θ.
An example
• Suppose a hospital has around 200 beds occupied each day,
and we want to know the underlying risk that a patient will be
infected by MRSA (methicillin-resistant Staphylococcus
aureus).
• Looking back at the first six months of the year, we count
y = 20 infections in 40000 bed-days.
• Let θ be the risk of infection per 10000 bed-days.
• A reasonable model is that y ∼ Po(4θ).
An example (Classical modeling)
• ML estimate θ̂ = y/4 = 20/4 = 5.
• An approximate confidence interval based on a Normal approximation:
$$
\hat\theta \pm \lambda_{0.975}\,\sqrt{\frac{\hat\theta}{4}}.
$$
An example (Bayesian modeling)
• However, other evidence about the underlying risk may exist, such as the previous year's rates or rates in similar hospitals.
• Suppose this other information, on its own, suggests plausible values of θ of around 10 per 10000, with 95% of the support for θ lying between 5 and 17.
• This can be expressed as a prior on θ,
θ ∼ Γ(a, b),    a = 10, b = 1.
• The posterior distribution is now
$$
f(\theta|y) \propto \theta^{y} e^{-4\theta}\,\theta^{a-1} e^{-b\theta} \propto \theta^{y+a-1} e^{-\theta(4+b)}.
$$
The Gamma prior is conjugate for the Poisson likelihood (more about this later).
An example (Bayesian modeling) cont
• The posterior is θ|y ∼ Γ(y + a, 4 + b).
• If we want a point estimate of θ, the maximum a posteriori estimator can be used:
$$
\hat\theta_{\mathrm{MAP}} = \arg\max_{\theta} f(\theta|y).
$$
• A credible or posterior probability interval can be found using
the quantiles of the posterior distribution.
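A minimal sketch of these computations for the MRSA example (a = 10, b = 1, y = 20 infections over 4 × 10000 bed-days), compared with the classical estimate and interval:

import numpy as np
from scipy.stats import gamma, norm

a, b = 10.0, 1.0                                  # prior parameters
y, t = 20, 4.0                                    # data: 20 infections in 4 * 10000 bed-days

posterior = gamma(a + y, scale=1.0 / (b + t))     # theta | y ~ Gamma(y + a, 4 + b)
theta_map = (a + y - 1) / (b + t)                 # mode of the Gamma posterior
cred_int = posterior.ppf([0.025, 0.975])          # 95% equal-tailed credible interval

theta_ml = y / t                                  # classical ML estimate
ci = theta_ml + norm.ppf([0.025, 0.975]) * np.sqrt(theta_ml / t)

print(theta_map, cred_int)                        # ~5.8, roughly (4.1, 8.3)
print(theta_ml, ci)                               # 5.0, roughly (2.8, 7.2)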
An example (Bayesian modeling) cont
Let's study what the different distributions look like for the parameter θ: the prior, the likelihood, and the posterior distribution.
[Figure: prior, likelihood and posterior densities for θ, plotted over 0 to 20.]
An example (predictions) cont
So what is the difference in prediction: ML vs Bayesian?
[Figure: predictive distributions for the number of infections z, plotted over 0 to 50, ML plug-in vs Bayesian.]
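A minimal sketch of the two predictive distributions, under the assumption (for illustration) that we predict the number of infections z in a further 4 × 10000 bed-days: the ML plug-in predictive is Poisson(4 θ̂), while the Bayesian posterior predictive, a Poisson-Gamma mixture, is negative binomial:

from scipy.stats import poisson, nbinom

a, b, y, t = 10.0, 1.0, 20, 4.0
alpha, beta = a + y, b + t                        # posterior Gamma(alpha, rate beta)
theta_hat = y / t

ml_pred = poisson(t * theta_hat)                  # plug-in: Poisson(20)
bayes_pred = nbinom(alpha, beta / (beta + t))     # Poisson-Gamma mixture

print(ml_pred.mean(), ml_pred.std())              # 20.0, ~4.5
print(bayes_pred.mean(), bayes_pred.std())        # 24.0, ~6.6 (wider, reflects uncertainty in theta)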
An example next year
Now suppose that during the following year we observe another 12000 bed-days and we get 72 infections. What happens to the posterior distribution of θ?
[Figure: the updated posterior distribution for θ, plotted over 0 to 20.]
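A minimal sketch of the update: the old posterior acts as the prior for the new data, so counts and exposures (in units of 10000 bed-days) simply accumulate:

from scipy.stats import gamma

a, b = 10.0, 1.0
a_post = a + 20 + 72               # shape: prior + all observed counts
b_post = b + 4.0 + 1.2             # rate: prior + total exposure

posterior = gamma(a_post, scale=1.0 / b_post)
print(posterior.mean(), posterior.ppf([0.025, 0.975]))   # posterior shifts towards ~16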
Prior distributions
We will spend a few slides presenting various forms of priors and their interpretation.
• The main criticism of the Bayesian method is that different priors generate different posterior distributions. From a philosophical point of view you can never reject a prior, since it represents the prior knowledge you had, which is subjective.
• The influence of a prior can be negligible, moderate or enormous. The important thing is to know the effect of the prior.
• Often priors are chosen so that they
  • are non-informative, or
  • generate a computationally convenient posterior density.
Prior distributions (cont.)
The prior distribution is always a choice. For example, suppose we have a binomial likelihood (Bin(n, p)) and assume a uniform prior on p, that is, we assume any value of the probability is equally likely.
• The implied distribution of the random variable p^i is then:
[Figure: density of p^i for i = 2, 3, 4, 5 when p is uniform on (0, 1).]
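A minimal sketch illustrating the point: under a uniform prior on p, the implied prior on p^i puts more and more mass near 0 as i grows, so it is far from non-informative about p^i:

import numpy as np

rng = np.random.default_rng(1)
p = rng.uniform(size=100_000)        # samples from the uniform prior on p

for i in (2, 3, 4, 5):
    q = p ** i
    # fraction of prior mass that p**i places below 0.1
    print(i, np.mean(q < 0.1))       # grows with i: ~0.32, ~0.46, ~0.56, ~0.63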
Conjugate priors
One of the most common classes of priors is the so-called conjugate prior.
Definition (Conjugate prior)
A family F of probability distributions on Θ is said to be conjugate
(or closed under sampling) for a likelihood function f (x|Θ) if, for
every π ∈ F, the posterior distribution f (Θ|x) also belongs to F.
• This definition is mainly of use for families that are in the
same class of parametric distributions.
• The main motivation is typically computational convenience.
Conjugate priors cont.
Conjugate priors for Θ for some common likelihoods. All
parameters except θ are assumed fixed and data (y1 , . . . , yn ) are
conditionally independent given Θ = θ.
Likelihood        Prior           Posterior
Bin(n, θ)         Beta(α, β)      Beta(α + y, β + n − y)
Ge(θ)             Beta(α, β)      Beta(α + n, β + Σ_{i=1}^n y_i − n)
NegBin(n, θ)      Beta(α, β)      Beta(α + n, β + y − n)
Gamma(k, θ)       Gamma(α, β)     Gamma(α + nk, β + Σ_{i=1}^n y_i)
Po(θ)             Gamma(α, β)     Gamma(α + Σ_{i=1}^n y_i, β + n)
N(µ, θ^{-1})      Gamma(α, β)     Gamma(α + n/2, β + Σ_{i=1}^n (y_i − µ)^2 / 2)
N(θ, σ^2)         N(m, s^2)       N((m/s^2 + nȳ/σ^2) / (1/s^2 + n/σ^2), 1 / (1/s^2 + n/σ^2))
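A minimal numerical check of one row of the table (Po(θ) likelihood with a Gamma(α, β) prior), comparing the conjugate-form posterior with the posterior normalized on a grid; the data are simulated for illustration:

import numpy as np
from scipy.stats import gamma, poisson

rng = np.random.default_rng(1)
alpha, beta = 2.0, 1.0                        # prior Gamma(alpha, rate beta)
y = rng.poisson(lam=3.0, size=25)             # simulated data

theta = np.linspace(1e-3, 10.0, 2000)
log_lik = np.sum(poisson.logpmf(y[:, None], theta[None, :]), axis=0)
unnorm = gamma.pdf(theta, alpha, scale=1.0 / beta) * np.exp(log_lik)
grid_post = unnorm / (unnorm.sum() * (theta[1] - theta[0]))     # normalize on the grid

conj_post = gamma.pdf(theta, alpha + y.sum(), scale=1.0 / (beta + len(y)))
print(np.max(np.abs(grid_post - conj_post)))                    # small (grid error only)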
Improper prior
Suppose we want "agreement" between classical statistics and Bayesian statistics, so that all information about the parameter comes from the likelihood.
The only choice is
π(Θ) ∝ 1.
This is an improper prior.
Improper prior
An improper prior is a prior that is not a proper density (it does not integrate to a finite value).
• Examples:
  • π(Θ) ∝ 1 when the parameter space is R^d.
  • π(Θ) ∝ e^Θ when the parameter space is R^d.
  • π(θ) ∝ |θ|^{-1} when the parameter space is R.
• They might still generate a proper posterior with respect to some likelihoods (a worked instance follows below).
• For more complex models this is hard to verify.
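A minimal worked instance of the first example (flat prior, Gaussian likelihood with known σ²), showing that the posterior is proper:
$$
\pi(\theta) \propto 1, \qquad y_i \mid \theta \overset{\text{i.i.d.}}{\sim} N(\theta, \sigma^2),\ \sigma^2 \text{ known},
$$
$$
f(\theta \mid y) \propto \prod_{i=1}^{n} \exp\!\left(-\frac{(y_i - \theta)^2}{2\sigma^2}\right)
\propto \exp\!\left(-\frac{n(\theta - \bar{y})^2}{2\sigma^2}\right),
$$
so θ | y ∼ N(ȳ, σ²/n): the posterior is a proper density even though the prior is not, and the posterior mean equals the ML estimate ȳ.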