
Bayesian Methods
Nicholas Ruozzi
University of Texas at Dallas
based on the slides of Vibhav Gogate
Binary Variables
β€’ Coin flipping: heads $= 1$, tails $= 0$, with bias $\mu$:
$$p(X = 1 \mid \mu) = \mu$$
β€’ Bernoulli Distribution
$$\mathrm{Bern}(x \mid \mu) = \mu^x (1 - \mu)^{1-x}$$
$$E[X] = \mu, \qquad \mathrm{var}(X) = \mu (1 - \mu)$$
Binary Variables
β€’ $N$ coin flips: $X_1, \dots, X_N$
$$p\left( \sum_i X_i = m \,\middle|\, N, \mu \right) = \binom{N}{m} \mu^m (1 - \mu)^{N-m}$$
β€’ Binomial Distribution
$$\mathrm{Bin}(m \mid N, \mu) = \binom{N}{m} \mu^m (1 - \mu)^{N-m}$$
$$E\left[ \sum_i X_i \right] = N\mu, \qquad \mathrm{var}\left( \sum_i X_i \right) = N\mu(1 - \mu)$$
Binomial Distribution
[Figure: plot of the binomial distribution $\mathrm{Bin}(m \mid N, \mu)$]
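As a quick sanity check on these formulas, here is a minimal Python sketch (the values of $N$, $\mu$, and the sample size are illustrative assumptions, not from the slides) that simulates Bernoulli flips, sums them into binomial draws, and compares the empirical mean and variance to $N\mu$ and $N\mu(1-\mu)$:

```python
import numpy as np
from scipy.stats import binom

# Illustrative check of the binomial moment formulas by simulation.
N, mu = 10, 0.3                      # assumed values for illustration
rng = np.random.default_rng(0)

# Each row is N Bernoulli(mu) flips; the row sum is a Binomial(N, mu) draw.
flips = rng.random((100_000, N)) < mu
counts = flips.sum(axis=1)

print(counts.mean(), N * mu)             # both ~ 3.0
print(counts.var(), N * mu * (1 - mu))   # both ~ 2.1
print(binom.pmf(3, N, mu))               # closed-form P(sum_i X_i = 3)
```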
Estimating the Bias of a Coin
β€’ Suppose that we have a coin, and we would like to figure out the probability that it will come up heads
– How should we estimate the bias?
[Figure: an observed sequence of coin flips, 3 heads and 2 tails]
– With these coin flips, our estimate of the bias is: 3/5
β€’ Why is this a good estimate of the bias?
Coin Flipping – Binomial Distribution
β€’ 𝑃(π»π‘’π‘Žπ‘‘π‘ ) = πœƒ, 𝑃(π‘‡π‘Žπ‘–π‘™π‘ ) = 1 βˆ’ πœƒ
β€’ Flips are i.i.d.
– Independent events
– Identically distributed according to Binomial distribution
β€’ Our training data consists of 𝛼𝐻 heads and 𝛼 𝑇 tails
𝑝 𝐷 πœƒ = πœƒ 𝛼𝐻 β‹… 1 βˆ’ πœƒ
8
𝛼𝑇
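The likelihood above is easy to compute directly; the sketch below (the counts of 3 heads and 2 tails are an assumption for illustration) evaluates $p(D \mid \theta)$ on a grid and previews where it peaks:

```python
import numpy as np

# p(D | theta) = theta^aH * (1 - theta)^aT for aH heads and aT tails.
def likelihood(theta, a_H, a_T):
    return theta**a_H * (1 - theta)**a_T

thetas = np.linspace(0, 1, 101)
L = likelihood(thetas, a_H=3, a_T=2)   # illustrative counts
print(thetas[np.argmax(L)])            # ~0.6 = 3/5, previewing the MLE
```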
Maximum Likelihood Estimation (MLE)
β€’ Data: Observed set of $\alpha_H$ heads and $\alpha_T$ tails
β€’ Hypothesis: Coin flips follow a binomial distribution
β€’ Learning: Find the β€œbest” $\theta$
β€’ MLE: Choose $\theta$ to maximize the probability of $D$ given $\theta$:
$$\hat{\theta}_{MLE} = \arg\max_\theta \, p(D \mid \theta)$$
First Parameter Learning Algorithm
β€’ Set derivative to zero, and solve!
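The algebra here is standard: maximize the log-likelihood, which has the same maximizer as the likelihood,
$$\log p(D \mid \theta) = \alpha_H \log \theta + \alpha_T \log(1 - \theta)$$
$$\frac{d}{d\theta} \log p(D \mid \theta) = \frac{\alpha_H}{\theta} - \frac{\alpha_T}{1 - \theta} = 0 \;\Rightarrow\; \alpha_H (1 - \theta) = \alpha_T \, \theta$$
Solving for $\theta$ gives the closed form below.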
Coin Flip MLE
πœƒΰ· π‘€πΏπΈ =
12
𝛼𝐻
𝛼𝐻 + 𝛼 𝑇
Priors
β€’ Suppose we have 5 coin flips, all of which are heads
– MLE would give $\hat{\theta}_{MLE} = 1$
– This event occurs with probability $\left( \frac{1}{2} \right)^5 = \frac{1}{32}$ for a fair coin
– Are we willing to commit to such a strong conclusion with such little evidence?
Priors
β€’ Priors are a Bayesian mechanism that allows us to take into account β€œprior” knowledge about our belief in the outcome
β€’ Rather than estimating a single $\theta$, consider a distribution over possible values of $\theta$ given the data
– Update our prior after seeing data
[Figure: the prior over $\theta$, our best guess in the absence of any data, is updated into the posterior, our estimate after we see some data, by observing flips, e.g., {tails, tails}]
Bayesian Learning
β€’ Apply Bayes' rule:
$$\underbrace{p(\theta \mid D)}_{\text{posterior}} = \frac{\overbrace{p(D \mid \theta)}^{\text{data likelihood}} \; \overbrace{p(\theta)}^{\text{prior}}}{\underbrace{p(D)}_{\text{normalization}}}$$
β€’ Or equivalently: $p(\theta \mid D) \propto p(D \mid \theta) \, p(\theta)$
β€’ For uniform priors this reduces to the MLE objective:
$$p(\theta) \propto 1 \;\Rightarrow\; p(\theta \mid D) \propto p(D \mid \theta)$$
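A minimal sketch of this update, computed numerically on a discretized grid of $\theta$ values (the observed counts and the uniform prior are assumptions for illustration):

```python
import numpy as np

# Posterior on a grid: p(theta | D) is proportional to p(D | theta) * p(theta).
thetas = np.linspace(0.001, 0.999, 999)
prior = np.ones_like(thetas)                  # uniform prior: p(theta) ∝ 1
like = thetas**3 * (1 - thetas)**2            # 3 heads, 2 tails (assumed)
post = like * prior
post /= post.sum() * (thetas[1] - thetas[0])  # normalize (plays role of p(D))

print(thetas[np.argmax(post)])  # ~0.6: with a uniform prior, MAP = MLE
```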
Picking Priors
β€’ How do we pick a good prior distribution?
– Could represent expert domain knowledge
– Statisticians choose them to make the posterior distribution β€œnice” (conjugate priors)
β€’ What is a good prior for the bias in the coin flipping problem?
– Truncated Gaussian (tough to work with)
– Beta distribution (works well for binary random variables)
Coin Flips with Beta Distribution
Likelihood function: $p(D \mid \theta) = \theta^{\alpha_H} (1 - \theta)^{\alpha_T}$
Prior: $p(\theta) = \mathrm{Beta}(\theta \mid \beta_H, \beta_T) \propto \theta^{\beta_H - 1} (1 - \theta)^{\beta_T - 1}$
Posterior:
$$p(\theta \mid D) \propto p(D \mid \theta) \, p(\theta) = \theta^{\alpha_H + \beta_H - 1} (1 - \theta)^{\alpha_T + \beta_T - 1}$$
That is, the posterior is again a Beta distribution, $\mathrm{Beta}(\theta \mid \alpha_H + \beta_H, \alpha_T + \beta_T)$: the Beta prior is conjugate to this likelihood
MAP Estimation
β€’ Choosing $\theta$ to maximize the posterior distribution is called maximum a posteriori (MAP) estimation
$$\hat{\theta}_{MAP} = \arg\max_\theta \, p(\theta \mid D)$$
β€’ The only difference between $\hat{\theta}_{MLE}$ and $\hat{\theta}_{MAP}$ is that one assumes a uniform prior (MLE) and the other allows an arbitrary prior
Priors
β€’ Suppose we have 5 coin flips, all of which are heads
– MLE would give $\hat{\theta}_{MLE} = 1$
– MAP with a $\mathrm{Beta}(2,2)$ prior gives $\hat{\theta}_{MAP} = \frac{6}{7} \approx 0.857$
– As we see more data, the effect of the prior diminishes:
$$\hat{\theta}_{MAP} = \frac{\alpha_H + \beta_H - 1}{\alpha_H + \beta_H + \alpha_T + \beta_T - 2} \approx \frac{\alpha_H}{\alpha_H + \alpha_T} \quad \text{for a large number of observations}$$
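A short sketch of this diminishing effect, comparing the closed-form MLE and MAP estimates as the sample grows (the all-heads counts are illustrative, extending the example above):

```python
# MLE vs. MAP under a Beta(2,2) prior as the number of observations grows.
def mle(a_H, a_T):
    return a_H / (a_H + a_T)

def map_est(a_H, a_T, b_H=2, b_T=2):
    return (a_H + b_H - 1) / (a_H + b_H + a_T + b_T - 2)

for n in [5, 50, 500]:
    a_H, a_T = n, 0                      # all heads, as in the example above
    print(n, mle(a_H, a_T), round(map_est(a_H, a_T), 4))
# MLE is always 1.0; MAP goes 0.8571 -> 0.9808 -> 0.998
```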
Sample Complexity
β€’ How many coin flips do we need in order to guarantee that our learned parameter does not differ too much from the true parameter (with high probability)?
β€’ Can use the Chernoff bound (again!)
– Suppose $Y_1, \dots, Y_N$ are i.i.d. random variables taking values in $\{0, 1\}$ such that $E_p[Y_i] = y$. For $\epsilon > 0$,
$$p\left( \left| y - \frac{1}{N} \sum_i Y_i \right| \geq \epsilon \right) \leq 2 e^{-2 N \epsilon^2}$$
– For the coin flipping problem with $X_1, \dots, X_N$ i.i.d. coin flips and $\epsilon > 0$, the sample mean $\frac{1}{N} \sum_i X_i$ is exactly $\hat{\theta}_{MLE}$, so
$$p\left( \left| \theta_{true} - \hat{\theta}_{MLE} \right| \geq \epsilon \right) \leq 2 e^{-2 N \epsilon^2}$$
– To guarantee a failure probability of at most $\delta$, set
$$\delta \geq 2 e^{-2 N \epsilon^2} \;\Rightarrow\; N \geq \frac{1}{2 \epsilon^2} \ln \frac{2}{\delta}$$
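Inverting the bound this way gives a concrete recipe for choosing $N$; a minimal sketch (the $\epsilon$ and $\delta$ values are illustrative assumptions):

```python
import math

# Number of flips N ensuring |theta_true - theta_MLE| < epsilon
# with probability at least 1 - delta, from N >= ln(2/delta) / (2 eps^2).
def flips_needed(epsilon, delta):
    return math.ceil(math.log(2 / delta) / (2 * epsilon**2))

print(flips_needed(0.1, 0.05))   # 185 flips for eps = 0.1, delta = 0.05
print(flips_needed(0.01, 0.05))  # 18445 flips for eps = 0.01
```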