Motivating example: Spelling correction
Problem: Someone types ’radom’.
Question: What did they mean to type? Random?
Ingredients:
- Data x: the observed word, ’radom’.
- Parameter of interest θ: the word the person meant to type.
Comments: To solve this we need
- background information on which words are usually typed, and
- an idea of how words are typically mistyped.
Example adapted from Bayesian Data Analysis by Gelman et al.
Bayesian Idea
Data model: Conditional on θ, the data x are distributed according to a pf/pdf π(x|θ):
    π(x|θ) ∝ L(θ; x)    ← the likelihood
Prior: Prior knowledge about θ (i.e. before collecting data) is summarised by a pf/pdf:
    π(θ)    ← the prior
Posterior: The updated knowledge about θ after collecting data, i.e. the conditional distribution of θ given the data x:
    π(θ|x) = π(x|θ)π(θ)/π(x) ∝ π(x|θ)π(θ)
“posterior = likelihood × prior”
Example: Prior
Without any other prior knowledge, Google provides the following prior probabilities (for three candidate words):

θ        π(θ)
random   7.60 × 10⁻⁵
radon    6.05 × 10⁻⁶
radom    3.12 × 10⁻⁷

Comments:
- The relatively high probability for the word radom is surprising!
- Radom is the name of a Polish air show and a nickname for a Polish handgun...
- In the context of writing a scientific report these prior probabilities would look different...?
Example: Likelihood
Google provides the following conditional probabilities:

θ        π(x = ’radom’|θ)
random   0.00193
radon    0.000143
radom    0.975

Comments:
- This is not a probability distribution!
- If one in fact intends to write ’radom’, this actually happens in 97.5% of cases.
- If one intends to write either ’random’ or ’radon’, it is rarely misspelled as ’radom’.
Example: Posterior
Combining the prior and likelihood we obtain the posterior probabilities:

π(θ|x) = π(x|θ)π(θ)/π(x) ∝ π(x|θ)π(θ)

θ        π(x = ’radom’|θ)π(θ)   π(θ|x = ’radom’)
random   1.47 × 10⁻⁷            0.325
radon    8.65 × 10⁻¹⁰           0.002
radom    3.04 × 10⁻⁷            0.673

Conclusion:
With the given prior and likelihood the word ’radom’ is twice as likely as ’random’.
Criticism:
- The posterior probability for ’radom’ seems too high.
- Is the likelihood or the prior to blame?
- The likelihood is probably OK in this case.
- The prior depends on context, and hence might be “wrong”.
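As a sketch (not on the original slides), the posterior column can be reproduced in a few lines of Python from the prior and likelihood tables above:

# Prior probabilities and likelihoods for the three candidate words,
# taken from the tables above.
prior = {"random": 7.60e-5, "radon": 6.05e-6, "radom": 3.12e-7}
likelihood = {"random": 0.00193, "radon": 0.000143, "radom": 0.975}

# Unnormalised posterior: likelihood × prior.
unnorm = {w: likelihood[w] * prior[w] for w in prior}

# Normalise over the three candidates.
total = sum(unnorm.values())
posterior = {w: u / total for w, u in unnorm.items()}

for w, p in posterior.items():
    print(f"{w:7s} {p:.3f}")   # random 0.325, radon 0.002, radom 0.673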
Another example
Heights of some Copenhageners in 1995
[Figure: histogram of the observed heights, ranging roughly from 140 cm to 210 cm.]
Assume: Heights are normal, X ∼ N(µ, τ), where τ denotes the precision (inverse variance).
For now: Assume the precision τ is known.
Bayesian Idea: Illustration
Data model: X ∼ N(µ, 0.01) (i.e. pop. sd = 10).
Prior: We believe that the population mean is most likely between 160 cm and 200 cm: π(µ) = N(180, 0.01).
Posterior: After observing a number of heights (n = 10, x̄ = 169), our knowledge about µ is updated, as summarised by the posterior.
[Figure: prior and posterior densities for µ, plotted over 140-220 cm.]
Normal example: One (!) observation
Data model: X ∼ N (µ, τ )
Assume: The precision τ is known.
Interest: The unknown mean µ.
Prior: The prior for µ is specified as µ ∼ N(µ0, τ0).
[Figure: prior density for µ, centred at µ0 with standard deviation √(1/τ0).]
Normal example: Data density
Data: One observation, X, from a normal distribution:
π(x|µ) = √(τ/(2π)) exp(−½τ(x − µ)²)
       = √(τ/(2π)) exp(−½τx² − ½τµ² + τµx)
       ∝ exp(−½τx² + τµx)
Notice the “pattern” inside the exponential.
Normal example: Posterior density
Posterior ∝ Likelihood × Prior:
π(µ|x) ∝ π(x|µ)π(µ)
       = √(τ/(2π)) exp(−½τ(x − µ)²) · √(τ0/(2π)) exp(−½τ0(µ − µ0)²)
       ∝ exp(−½τµ² + τxµ − ½τ0µ² + τ0µ0µ)
       = exp(−½(τ + τ0)µ² + (τx + τ0µ0)µ)
       ∝ N((τx + τ0µ0)/(τ + τ0), τ + τ0)
Notice: The prior for µ was normal, and now the posterior for µ is also normal!
Normal example: Posterior mean & variance
The posterior: π(µ|x) = N((τx + τ0µ0)/(τ + τ0), τ + τ0).
Posterior expectation:
E[µ|x] = (τx + τ0µ0)/(τ + τ0) = (τ/(τ + τ0)) x + (τ0/(τ + τ0)) µ0 (= µ1),
a weighted average of the prior mean µ0 and the observation x.
Posterior variance:
Var[µ|x] = 1/(τ + τ0) (= 1/τ1)
[Figure: the posterior mean µ1 lies between the prior mean µ0 and the observation x.]
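A minimal sketch of this one-observation update in Python; the numbers are hypothetical, borrowing the prior N(180, 0.01) and data precision τ = 0.01 from the height example:

import math

# Prior N(mu0, tau0), data precision tau, one observation x.
mu0, tau0, tau, x = 180.0, 0.01, 0.01, 169.0

tau1 = tau + tau0                    # posterior precision
mu1 = (tau * x + tau0 * mu0) / tau1  # weighted average of x and mu0

print(mu1, math.sqrt(1 / tau1))      # mean 174.5, posterior sd ≈ 7.1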
Posterior as prior: updating beliefs
General setup: We are interested in the parameter θ.
Data model: π(x|θ)
Prior: π(θ)
Data: First observation x1 ∼ π(x1|θ)
Posterior: π(θ|x1) ∝ π(x1|θ)π(θ)
Assume we have a second independent observation x2 ∼ π(x2|θ).
Posterior:
π(θ|x1, x2) ∝ π(x1, x2|θ)π(θ)
            = π(x1|θ)π(x2|θ)π(θ)
            ∝ π(x2|θ) π(θ|x1),
where π(x2|θ) acts as the likelihood and π(θ|x1) as the prior.
Notice: The posterior after observing x1 is the prior before observing x2.
Independent normal case
Posterior mean and precision after one observation x1:
µ1 = (x1τ + µ0τ0)/(τ + τ0) and τ1 = τ + τ0.
Next, µ1 and τ1 are the prior mean and precision before observing x2.
Hence, the posterior mean and precision after observing (independent) x1 and x2 are
µ2 = E[µ|x1, x2] = (x2τ + µ1τ1)/(τ + τ1) = (x2τ + x1τ + µ0τ0)/(τ + τ + τ0) = ((x1 + x2)τ + µ0τ0)/(2τ + τ0),
τ2 = 2τ + τ0.
This can easily be generalised, as the sketch below illustrates.
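A small Python check (with hypothetical numbers) that sequential updating, using the posterior after x1 as the prior before x2, matches the batch formula above:

def normal_update(x, tau, mu0, tau0):
    """One-step normal update: returns posterior mean and precision."""
    tau1 = tau + tau0
    return (tau * x + tau0 * mu0) / tau1, tau1

mu0, tau0, tau = 180.0, 0.01, 0.01   # hypothetical prior and data precision
x1, x2 = 169.0, 175.0                # hypothetical observations

# Sequential: the posterior after x1 becomes the prior before x2.
mu1, tau1 = normal_update(x1, tau, mu0, tau0)
mu2, tau2 = normal_update(x2, tau, mu1, tau1)

# Batch formula from above: µ2 = ((x1 + x2)τ + µ0τ0) / (2τ + τ0).
mu2_batch = ((x1 + x2) * tau + mu0 * tau0) / (2 * tau + tau0)
print(mu2, mu2_batch, tau2)          # identical means; precision 2τ + τ0 = 0.03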
Many independent normal observations
Assume:
- X1, X2, ..., Xn iid ∼ N(µ, τ),
- the precision τ is known,
- prior π(µ) = N(µ0, τ0).
The posterior is then
π(µ|x1, x2, ..., xn) = N((τ Σᵢ xᵢ + τ0µ0)/(nτ + τ0), nτ + τ0),
as computed by the sketch below.
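A sketch of the general n-observation update as a reusable function; the data and prior values are hypothetical:

def posterior_mu(xs, tau, mu0, tau0):
    """Posterior N(mu_n, tau_n) for µ given iid observations xs with known
    precision tau and prior N(mu0, tau0)."""
    n = len(xs)
    tau_n = n * tau + tau0
    mu_n = (tau * sum(xs) + tau0 * mu0) / tau_n
    return mu_n, tau_n

# Hypothetical heights, prior N(180, 0.01), data precision τ = 0.01:
print(posterior_mu([169, 175, 182, 171, 168], tau=0.01, mu0=180, tau0=0.01))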
Posterior mean: Sanity check
The posterior is
π(µ|x1, x2, ..., xn) = N((τ Σᵢ xᵢ + τ0µ0)/(nτ + τ0), nτ + τ0).
Does the posterior mean seem “sane”?
E[µ|x1, ..., xn] = µn = (τ Σᵢ xᵢ + τ0µ0)/(nτ + τ0)
                 = (τ n (1/n) Σᵢ xᵢ + τ0µ0)/(nτ + τ0)
                 = (nτ/(nτ + τ0)) x̄ + (τ0/(nτ + τ0)) µ0
- A weighted average of the sample average x̄ and the prior mean µ0.
- For large n we have µn ≈ x̄, so the choice of µ0 is of little importance.
- The precision τn = nτ + τ0 grows with n: our knowledge becomes ever more precise.
Heights in Copenhagen: Prior
[Figure: prior density for µ, N(µ0 = 180, τ0 = 0.01), plotted over 150-200 cm.]
Heights in Copenhagen: Posterior
[Figure: prior N(µ0 = 180, τ0 = 0.01) and posterior after one observation (n = 1, τ = 0.01), with the data average marked.]
Heights in Copenhagen: Posterior
[Figure: prior N(µ0 = 180, τ0 = 0.01) and posterior after ten observations (n = 10, τ = 0.01), with the data average marked.]
Heights in Copenhagen: Posterior
[Figure: prior N(µ0 = 180, τ0 = 0.01) and posterior after 100 observations (n = 100, τ = 0.01), with the data average marked.]
Heights in Copenhagen: Posterior
[Figure: prior N(µ0 = 180, τ0 = 0.01) and posterior after 2627 observations (n = 2627, τ = 0.01), with the data average marked.]
How to summarise the posterior π(θ|x)?
The posterior is usually summarised using one or more of the following:
- A plot of the posterior density π(θ|x). See the previous slides.
- The posterior mean and variance/precision.
- A Central Posterior Interval (CPI). See the next slide.
- The Maximum A Posteriori (MAP) estimate,
  MAP(θ) = argmax_θ π(θ|x).
Central Posterior Interval
The CPI is an interval estimate, also known as a credibility interval.
A 95% CPI for a parameter θ is the shortest (connected) interval which contains θ with 95% posterior probability.
In the case of the normal example, we have
P(µn − 1.96 √(1/τn) ≤ µ ≤ µn + 1.96 √(1/τn)) = 0.95.
Hence, a 95% CPI for µ is
µn ± 1.96 √(1/τn).
[Figure: posterior density with the endpoints µn − 1.96 √(1/τn) and µn + 1.96 √(1/τn) marked around µn.]
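A sketch of the CPI computation for the normal case; the inputs are the height example's posterior after n = 10 observations (x̄ = 169 gives µn = 170 and τn = nτ + τ0 = 0.11):

import math

def cpi95(mu_n, tau_n):
    """95% central posterior interval for a N(mu_n, tau_n) posterior."""
    half = 1.96 * math.sqrt(1 / tau_n)
    return mu_n - half, mu_n + half

print(cpi95(170.0, 0.11))   # ≈ (164.1, 175.9)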
CPI compared to confidence interval
The classical 95% confidence interval for µ is
95% CI: x̄ ± 1.96 √(1/(nτ)).
For the CPI: Assume the prior precision is τ0 = 0, i.e. infinite prior variance. Then µn = x̄ and τn = nτ, i.e.
95% CPI: x̄ ± 1.96 √(1/(nτ)).
Same interval. Different interpretations.
Conjugate priors
In the normal example: Both prior and posterior were normal! Very
convenient!
We say that the normal distribution is conjugate.
Definition: Conjugate priors
Let π(x|θ) be the data model.
A class Π of prior distributions for θ is said to be conjugate for π(x|θ) if
π(θ|x) ∝ π(x|θ)π(θ) ∈ Π
whenever π(θ) ∈ Π. I.e. prior and posterior are in the same class of
distributions.
Notice: Π should be a class of “natural” distributions for this to be
useful.
Improper priors
If we have no prior knowledge we may be tempted to use a “flat” prior, i.e.
π(θ) ∝ k.
If θ ∈ R this is an example of an improper prior, as
∫_{−∞}^{∞} π(θ) dθ = ∫_{−∞}^{∞} k dθ = ∞.
This is problematic, but OK, if the posterior is proper, i.e. if
∫ π(θ|x) dθ ∝ ∫ π(x|θ)π(θ) dθ < ∞.
Notice: If π(θ) ∝ 1 then the MAP estimator equals the maximum likelihood estimator.
Normal example: Unknown precision, known mean
Data model: X1, X2, ..., Xn iid ∼ N(µ, τ):
π(x|τ) = (τ/(2π))^{n/2} exp(−½τ Σᵢ (xᵢ − µ)²)
Prior: Gamma distribution, π(τ) = Gamma(α, β), with density
π(τ) = (1/(β^α Γ(α))) τ^{α−1} exp(−τ/β),
where α is the shape parameter and β the scale parameter.
Properties of the gamma distribution:
E[τ] = αβ and Var[τ] = αβ².
Gamma distribution
[Figure: gamma densities for (α, β) = (1, 2), (2, 2), (5, 0.5) and (9, 1), each plotted over 0-15.]
Normal example: Posterior precision
Data model: X1, X2, ..., Xn iid ∼ N(µ, τ).
Prior: π(τ) = Gamma(α, β).
Posterior:
π(τ|x) = Gamma(n/2 + α, (½ Σᵢ (xᵢ − µ)² + 1/β)^{−1})
Posterior mean and variance:
E[τ|x] = (n/2 + α) / (½ Σᵢ (xᵢ − µ)² + 1/β)
Var[τ|x] = (n/2 + α) / (½ Σᵢ (xᵢ − µ)² + 1/β)²
For large n we have
E[τ|x] ≈ 1/σ̂²,
where σ̂² is the usual ML variance estimate. A sketch of this update follows below.
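A sketch of this precision update in Python; the data, the known mean and the prior parameters are hypothetical:

def posterior_tau(xs, mu, alpha, beta):
    """Posterior Gamma(shape, scale) for the precision τ, given data xs with
    known mean mu and a Gamma(alpha, beta) prior (scale parameterisation)."""
    n = len(xs)
    shape = n / 2 + alpha
    scale = 1 / (0.5 * sum((x - mu) ** 2 for x in xs) + 1 / beta)
    return shape, scale

# Hypothetical data with known mean µ = 170 and a vague Gamma(0.1, 0.1) prior:
xs = [169, 175, 182, 171, 168]
shape, scale = posterior_tau(xs, mu=170, alpha=0.1, beta=0.1)
print(shape * scale)   # posterior mean of τ; compare with 1/σ̂² = 1/35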
Known mean: Priors and posteriors
[Figure: Gamma(α = 0.1, β = 0.1) prior and the resulting posteriors for τ with n = 10, 25, 100 and 2627 observations; the posteriors concentrate around σ̂⁻².]
Binomial example
Data model: Binomial, X ∼ B(n, p), with n known:
π(x|p) = (n choose x) p^x (1 − p)^{n−x}, 0 ≤ p ≤ 1.
Prior: Beta distribution, π(p) = Be(α, β), α, β > 0, where we have to specify α and β.
The Beta distribution has density
π(p) = (Γ(α + β) / (Γ(α)Γ(β))) p^{α−1} (1 − p)^{β−1} for 0 ≤ p ≤ 1, and 0 otherwise.
If α = β = 1 then π(p) = 1 for 0 ≤ p ≤ 1, i.e. the uniform distribution.
Beta distribution: Examples
[Figure: beta densities for (α, β) = (0.5, 0.5), (1, 1), (2, 4) and (2, 0.5).]
Mean: E[p] = α/(α + β)
Variance: Var[p] = αβ/((α + β)²(α + β + 1))
Binomial example — cont.
Data model: X ∼ B(n, p)
Prior: π(p) = Be(α, β), that is
π(p) = (Γ(α + β) / (Γ(α)Γ(β))) p^{α−1} (1 − p)^{β−1} for 0 ≤ p ≤ 1, and 0 otherwise.
Posterior:
π(p|x) ∝ π(x|p)π(p)
       = (n choose x) p^x (1 − p)^{n−x} · (Γ(α + β) / (Γ(α)Γ(β))) p^{α−1} (1 − p)^{β−1}
       ∝ p^{x+α−1} (1 − p)^{n−x+β−1}
       = Be(x + α, n − x + β).
Posterior mean & Variance
Posterior:
π(p|x) = Be(x + α, n − x + β).
Posterior mean:
E[p|x] = (x + α) / ((x + α) + (n − x + β)) = (x + α) / (α + β + n).
If x, n ≫ α, β then E[p|x] ≈ x/n.
Posterior variance:
Var[p|x] = (x + α)(n − x + β) / ((x + α + n − x + β)² (x + α + n − x + β + 1))
         = (x + α)(n − x + β) / ((α + β + n)² (α + β + n + 1)) = O(1/n).
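A sketch of these posterior summaries; the data (x = 7 of n = 10) and the uniform Be(1, 1) prior are hypothetical:

def beta_posterior(x, n, alpha, beta):
    """Parameters, mean and variance of the Be(x+α, n−x+β) posterior."""
    a, b = x + alpha, n - x + beta
    mean = a / (a + b)
    var = a * b / ((a + b) ** 2 * (a + b + 1))
    return a, b, mean, var

print(beta_posterior(x=7, n=10, alpha=1, beta=1))
# Be(8, 4): mean ≈ 0.667 (close to x/n = 0.7), variance ≈ 0.017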
Example: Placenta Previa (PP)
Question: Is the sex ratio different for PP births compared to normal births?
Prior knowledge: 51.5% of newborns are boys.
Data: Of n = 980 cases of PP, x = 543 were boys (543/980 = 55.4%).
Data model: X ∼ B(n, p).
Prior: π(p) = Be(α, β).
Posterior:
π(p|x) = Be(x + α, n − x + β) = Be(543 + α, 437 + β)
How to choose α and β, and what difference does it make? The sketch below compares a few choices.
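One way to see what difference the prior makes is to compute the posterior for several choices; a sketch using scipy (assumed available), with the uniform prior and two of the (α, β) pairs from the figure on the next slide:

from scipy.stats import beta

x, n = 543, 980   # Placenta Previa data: 543 boys of 980 births

for a0, b0 in [(1, 1), (9.7, 10.3), (97, 103)]:
    post = beta(x + a0, n - x + b0)   # posterior Be(x + α, n − x + β)
    print(f"prior Be({a0}, {b0}): posterior mean = {post.mean():.3f}, "
          f"P(p > 0.515 | x) = {post.sf(0.515):.3f}")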
Placenta Previa: Beta priors and posteriors
[Figure: beta priors and the resulting posteriors for (α, β) = (1, 1), (2.425, 2.575), (4.85, 5.15), (9.7, 10.3), (48.5, 51.5) and (97, 103).]