Bayesian Methods
Nicholas Ruozzi
University of Texas at Dallas
based on the slides of Vibhav Gogate
Binary Variables
• Coin flipping: heads = 1, tails = 0, with bias θ
  P(x = 1 | θ) = θ
• Bernoulli Distribution
  Bern(x | θ) = θ^x (1 − θ)^(1−x)
  E[x] = θ
  var[x] = θ(1 − θ)
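A quick numerical check of these moments (a Python sketch added here, not part of the original slides; the bias value is arbitrary):

import numpy as np

rng = np.random.default_rng(0)
theta = 0.3                                    # assumed bias, chosen arbitrarily
flips = rng.binomial(1, theta, size=100_000)   # heads = 1, tails = 0

print("empirical mean:", flips.mean(), "vs theta =", theta)
print("empirical var: ", flips.var(), "vs theta*(1-theta) =", theta * (1 - theta))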
Binary Variables
• n coin flips: x_1, …, x_n
  P(Σ_i x_i = m | n, θ) = C(n, m) θ^m (1 − θ)^(n−m)
• Binomial Distribution
  Bin(m | n, θ) = C(n, m) θ^m (1 − θ)^(n−m)
  E[Σ_i x_i] = n θ
  var[Σ_i x_i] = n θ (1 − θ)
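The binomial pmf above, written out as a short sketch (helper names are my own; the parameter values are illustrative only):

from math import comb

def binom_pmf(m: int, n: int, theta: float) -> float:
    """Probability of exactly m heads in n flips with bias theta."""
    return comb(n, m) * theta**m * (1 - theta)**(n - m)

n, theta = 10, 0.25
pmf = [binom_pmf(m, n, theta) for m in range(n + 1)]
mean = sum(m * p for m, p in enumerate(pmf))
var = sum((m - mean) ** 2 * p for m, p in enumerate(pmf))
print(mean, "should equal n*theta =", n * theta)
print(var, "should equal n*theta*(1-theta) =", n * theta * (1 - theta))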
Binomial Distribution
[Plot of the binomial distribution Bin(m | n, θ)]
Estimating the Bias of a Coin
• Suppose that we have a coin, and we would like to figure out the probability that it will flip up heads
  – How should we estimate the bias?
  – [Figure: a sequence of observed coin flips, 3 heads out of 5]
  – With these coin flips, our estimate of the bias is 3/5
• Why is this a good estimate of the bias?
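As a minimal illustration of that estimate (the flip sequence below is a stand-in with 3 heads in 5 flips, not the actual sequence from the slide's figure):

flips = [1, 0, 1, 1, 0]                  # heads = 1, tails = 0
bias_estimate = sum(flips) / len(flips)  # fraction of heads
print(bias_estimate)                     # 0.6 = 3/5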
Coin Flipping β Binomial Distribution
• P(Heads) = θ, P(Tails) = 1 − θ
• Flips are i.i.d.
  – Independent events
  – Identically distributed (each flip is a Bernoulli trial with bias θ; the total number of heads is Binomial)
• Our training data consists of α_H heads and α_T tails
  P(D | θ) = θ^{α_H} (1 − θ)^{α_T}
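A small sketch of this likelihood in code (variable names are my own; D is summarized by the counts α_H and α_T):

import numpy as np

def log_likelihood(theta: float, a_h: int, a_t: int) -> float:
    """log P(D | theta) = a_h * log(theta) + a_t * log(1 - theta)."""
    return a_h * np.log(theta) + a_t * np.log(1 - theta)

print(np.exp(log_likelihood(0.6, 3, 2)))   # P(D | theta = 0.6) for 3 heads, 2 tails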
Maximum Likelihood Estimation (MLE)
• Data: Observed set of α_H heads and α_T tails
• Hypothesis: Coin flips follow a binomial distribution
• Learning: Find the "best" θ
• MLE: Choose θ to maximize the probability of D given θ
  θ_MLE = argmax_θ P(D | θ)
First Parameter Learning Algorithm
  θ_MLE = argmax_θ P(D | θ) = argmax_θ [ α_H log θ + α_T log(1 − θ) ]
Set derivative to zero, and solve!
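Writing out the step the slide asks for (a standard derivation reconstructed here, not copied from the original slide):

\frac{d}{d\theta}\Big[\alpha_H \log\theta + \alpha_T \log(1-\theta)\Big]
  = \frac{\alpha_H}{\theta} - \frac{\alpha_T}{1-\theta} = 0
\;\Rightarrow\; \alpha_H (1-\theta) = \alpha_T\,\theta
\;\Rightarrow\; \theta_{MLE} = \frac{\alpha_H}{\alpha_H + \alpha_T}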
Coin Flip MLE
θ_MLE = α_H / (α_H + α_T)
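A quick check of the closed form against a brute-force maximization of the log-likelihood (illustrative counts, not from the slides):

import numpy as np

a_h, a_t = 3, 2                                   # 3 heads, 2 tails
thetas = np.linspace(1e-6, 1 - 1e-6, 10_000)       # grid over (0, 1)
log_lik = a_h * np.log(thetas) + a_t * np.log(1 - thetas)
print("numeric argmax:", thetas[np.argmax(log_lik)])
print("closed form:   ", a_h / (a_h + a_t))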
Priors
• Suppose we have 5 coin flips, all of which are heads
  – What is our estimate of the bias? MLE would give θ_MLE = 1
  – This event occurs with probability (1/2)^5 = 1/32 for a fair coin
  – Are we willing to commit to such a strong conclusion with such little evidence?
Priors
• Priors are a Bayesian mechanism that allows us to take into account "prior" knowledge about our belief in the outcome
• Rather than estimating a single θ, consider a distribution over possible values of θ given the data
  – Update our prior after seeing data
[Figure: a prior distribution over θ ("our best guess in the absence of any data") is updated, after observing flips, e.g. {tails, tails}, into a posterior distribution ("our estimate after we see some data")]
Bayesian Learning
Apply Bayes rule:
  P(θ | D) = P(D | θ) P(θ) / P(D)
  (posterior = data likelihood × prior / normalization)
• Or equivalently: P(θ | D) ∝ P(D | θ) P(θ)
• For uniform priors, this reduces to the MLE objective:
  P(θ) ∝ 1  ⟹  P(θ | D) ∝ P(D | θ)
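One way to see this concretely is a grid approximation of P(θ | D) ∝ P(D | θ) P(θ) (a sketch, assuming a uniform prior and the {tails, tails} example from the earlier slide):

import numpy as np

a_h, a_t = 0, 2                          # observe {tails, tails}
thetas = np.linspace(0, 1, 1001)         # discretize theta
prior = np.ones_like(thetas)             # uniform prior over [0, 1]
likelihood = thetas**a_h * (1 - thetas)**a_t
posterior = likelihood * prior
posterior /= posterior.sum()             # normalize over the grid
print("posterior mode:", thetas[np.argmax(posterior)])   # 0 here, same as the MLE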
Picking Priors
• How do we pick a good prior distribution?
  – Could represent expert domain knowledge
  – Statisticians choose them to make the posterior distribution "nice" (conjugate priors)
• What is a good prior for the bias in the coin flipping problem?
  – Truncated Gaussian (tough to work with)
  – Beta distribution (works well for binary random variables)
Coin Flips with Beta Distribution
Likelihood function:  P(D | θ) = θ^{α_H} (1 − θ)^{α_T}
Prior:  Beta(θ | β_H, β_T) ∝ θ^{β_H − 1} (1 − θ)^{β_T − 1}
Posterior:  P(θ | D) ∝ P(D | θ) P(θ) = θ^{α_H + β_H − 1} (1 − θ)^{α_T + β_T − 1}
  i.e., the posterior is again a Beta distribution: Beta(θ | α_H + β_H, α_T + β_T)
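Because the posterior keeps the Beta form, the update can be done in closed form; a small sketch assuming SciPy is available (the prior and data values are the ones used on the later Priors slide):

from scipy.stats import beta

beta_h, beta_t = 2, 2                    # Beta(2,2) prior pseudo-counts
a_h, a_t = 5, 0                          # observed data: 5 heads, 0 tails
posterior = beta(a_h + beta_h, a_t + beta_t)   # Beta(alpha_H + beta_H, alpha_T + beta_T)
print("posterior mean:", posterior.mean())
print("posterior mode:", (a_h + beta_h - 1) / (a_h + beta_h + a_t + beta_t - 2))  # = MAP = 6/7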
MAP Estimation
• Choosing θ to maximize the posterior distribution is called maximum a posteriori (MAP) estimation
  θ_MAP = argmax_θ P(θ | D)
• The only difference between θ_MLE and θ_MAP is that one assumes a uniform prior (MLE) and the other allows an arbitrary prior
Priors
• Suppose we have 5 coin flips, all of which are heads
  – MLE would give θ_MLE = 1
  – MAP with a Beta(2,2) prior gives θ_MAP = 6/7 ≈ 0.857
  – As we see more data, the effect of the prior diminishes
• θ_MAP = (α_H + β_H − 1) / (α_H + β_H + α_T + β_T − 2) ≈ α_H / (α_H + α_T) for a large number of observations
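The same formula as a small helper function (names are my own), showing both the 6/7 example and the large-sample behavior:

def theta_map(a_h: int, a_t: int, b_h: int = 2, b_t: int = 2) -> float:
    """MAP estimate of the bias with a Beta(b_h, b_t) prior."""
    return (a_h + b_h - 1) / (a_h + b_h + a_t + b_t - 2)

print(theta_map(5, 0))      # 6/7 ≈ 0.857 for 5 heads with a Beta(2,2) prior
print(theta_map(5000, 0))   # ≈ 1: with lots of data the prior washes out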
Sample Complexity
• How many coin flips do we need in order to guarantee that our learned parameter does not differ too much from the true parameter (with high probability)?
• Can use Chernoff bound (again!)
  – Suppose x_1, …, x_n are i.i.d. random variables taking values in {0, 1} such that E_p[x_i] = y. For ε > 0,
    P( | y − (1/n) Σ_i x_i | ≥ ε ) ≤ 2 e^{−2nε^2}
  – For the coin flipping problem with x_1, …, x_n i.i.d. coin flips and ε > 0, the sample mean (1/n) Σ_i x_i is exactly θ_MLE, so
    P( | θ_true − θ_MLE | ≥ ε ) ≤ 2 e^{−2nε^2}
  – To make this failure probability at most δ, require
    δ ≥ 2 e^{−2nε^2}  ⟹  n ≥ (1 / (2ε^2)) ln(2/δ)
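Plugging numbers into the last bound (a small helper added here, not from the slides):

from math import ceil, log

def flips_needed(eps: float, delta: float) -> int:
    """Smallest n with n >= ln(2/delta) / (2*eps^2), so that
    |theta_true - theta_MLE| < eps with probability at least 1 - delta."""
    return ceil(log(2 / delta) / (2 * eps**2))

print(flips_needed(0.1, 0.05))   # eps = 0.1, delta = 0.05  ->  185 flips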