Learning with Probabilities
CS194-10 Fall 2011 Lecture 15
Outline
♦ Bayesian learning
– eliminates arbitrary loss functions and regularizers
– facilitates incorporation of prior knowledge
– quantifies hypothesis and prediction uncertainty
– gives optimal predictions
♦ Maximum a posteriori and maximum likelihood learning
♦ Maximum likelihood parameter learning
Full Bayesian learning
View learning as Bayesian updating of a probability distribution
over the hypothesis space
H is the hypothesis variable, with values h_1, h_2, . . . and prior P(H)

The ith observation x_i gives the outcome of the random variable X_i;
the training data are X = x_1, . . . , x_N

Given the data so far, each hypothesis has a posterior probability:

    P(h_k | X) = α P(X | h_k) P(h_k)

where P(X | h_k) is called the likelihood

Predictions use a likelihood-weighted average over the hypotheses:

    P(X_{N+1} | X) = Σ_k P(X_{N+1} | X, h_k) P(h_k | X) = Σ_k P(X_{N+1} | h_k) P(h_k | X)
No need to pick one best-guess hypothesis!
Example
Suppose there are five kinds of bags of candies:
10% are h1: 100% cherry candies
20% are h2: 75% cherry candies + 25% lime candies
40% are h3: 50% cherry candies + 50% lime candies
20% are h4: 25% cherry candies + 75% lime candies
10% are h5: 100% lime candies
Then we observe candies drawn from some bag (in the running example below, the first five draws all turn out to be limes).
What kind of bag is it? What flavour will the next candy be?
Posterior probability of hypotheses
P(h_k | X) = α P(X | h_k) P(h_k)

P(h1 | 5 limes) = α P(5 limes | h1) P(h1) = α × 0.0^5 × 0.1 = 0
P(h2 | 5 limes) = α P(5 limes | h2) P(h2) = α × 0.25^5 × 0.2 = 0.000195 α
P(h3 | 5 limes) = α P(5 limes | h3) P(h3) = α × 0.5^5 × 0.4 = 0.0125 α
P(h4 | 5 limes) = α P(5 limes | h4) P(h4) = α × 0.75^5 × 0.2 = 0.0475 α
P(h5 | 5 limes) = α P(5 limes | h5) P(h5) = α × 1.0^5 × 0.1 = 0.1 α

α = 1/(0 + 0.000195 + 0.0125 + 0.0475 + 0.1) = 6.2424

P(h1 | 5 limes) = 0
P(h2 | 5 limes) = 0.00122
P(h3 | 5 limes) = 0.07803
P(h4 | 5 limes) = 0.29650
P(h5 | 5 limes) = 0.62424
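To make the arithmetic concrete, here is a minimal Python sketch (my own code, not part of the original slides; variable names are illustrative) that applies the posterior formula P(h_k | X) = α P(X | h_k) P(h_k) to the candy example.

```python
# Posterior over candy-bag hypotheses after observing 5 lime candies.
priors = {"h1": 0.1, "h2": 0.2, "h3": 0.4, "h4": 0.2, "h5": 0.1}
p_lime = {"h1": 0.0, "h2": 0.25, "h3": 0.5, "h4": 0.75, "h5": 1.0}

n_limes = 5
# Likelihood of 5 i.i.d. lime draws under each hypothesis: P(X | h_k) = p_lime^5
unnormalized = {h: (p_lime[h] ** n_limes) * priors[h] for h in priors}
alpha = 1.0 / sum(unnormalized.values())        # normalizing constant
posterior = {h: alpha * v for h, v in unnormalized.items()}

for h in sorted(posterior):
    print(h, round(posterior[h], 5))
# h1..h5: 0, 0.00122, 0.07805, 0.29634, 0.62439 -- the slide's figures differ
# slightly in the last digits because it rounds intermediate values.
```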
Posterior probability of hypotheses
[Plot: posterior probability of each hypothesis, P(h1 | d) through P(h5 | d), on the y-axis (0 to 1) versus the number of samples in d on the x-axis (0 to 10).]
Prediction probability
P(X_{N+1} | X) = Σ_k P(X_{N+1} | X, h_k) P(h_k | X) = Σ_k P(X_{N+1} | h_k) P(h_k | X)

P(lime on 6 | 5 limes)
  = P(lime on 6 | h1) P(h1 | 5 limes) + P(lime on 6 | h2) P(h2 | 5 limes)
  + P(lime on 6 | h3) P(h3 | 5 limes) + P(lime on 6 | h4) P(h4 | 5 limes)
  + P(lime on 6 | h5) P(h5 | 5 limes)
  = 0 × 0 + 0.25 × 0.00122 + 0.5 × 0.07803 + 0.75 × 0.29650 + 1.0 × 0.62424
  ≈ 0.886
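A short continuation of the earlier sketch (again my own code): the prediction is just the likelihood-weighted average of each hypothesis's prediction, using the posteriors computed on the previous slide.

```python
# P(lime on 6 | 5 limes) = sum_k P(lime | h_k) * P(h_k | 5 limes)
p_lime = [0.0, 0.25, 0.5, 0.75, 1.0]                    # P(lime | h_k) for h1..h5
posterior = [0.0, 0.00122, 0.07803, 0.29650, 0.62424]   # from the previous slide
prediction = sum(p * q for p, q in zip(p_lime, posterior))
print(round(prediction, 3))   # ~= 0.886
```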
Prediction probability
[Plot: P(next candy is lime | d) on the y-axis (0.4 to 1) versus the number of samples in d on the x-axis (0 to 10).]
Learning from positive examples only
Example from Tenenbaum via Murphy, Ch.3:
Given examples of some unknown class, a predefined subset of {1, . . . , 100},
output a hypothesis as to what the class is
E.g., {16, 8, 2, 64}
Viewed as a Boolean classification problem with only positive examples, the simplest consistent hypothesis is “everything.”
[This is the basis for Chomsky’s “Poverty of the Stimulus” argument
purporting to prove that humans must have innate grammatical knowledge]
Bayesian counterargument
Assuming numbers are sampled uniformly from the class:
P({16, 8, 2, 64} | powers of 2) = 7^-4 ≈ 4.2 × 10^-4
P({16, 8, 2, 64} | everything) = 100^-4 = 10^-8
This difference far outweighs any reasonable simplicity-based prior
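A hedged sketch of this “size principle” argument (my own illustration; the 1000:1 prior in favour of “everything” is an arbitrary choice, not from the slides): under uniform sampling, each likelihood is |class|^-n, and the likelihood ratio overwhelms the prior.

```python
# Likelihood of drawing {16, 8, 2, 64} uniformly (with replacement) from each class.
n_examples = 4
size_powers_of_2 = 7       # {1, 2, 4, 8, 16, 32, 64} within {1, ..., 100}
size_everything = 100

lik_powers = size_powers_of_2 ** -n_examples       # 7^-4 ~= 4.2e-4
lik_everything = size_everything ** -n_examples    # 100^-4 = 1e-8

# Even a prior favouring "everything" 1000:1 cannot overcome the likelihood ratio.
prior_powers, prior_everything = 0.001, 0.999
unnormalized = [lik_powers * prior_powers, lik_everything * prior_everything]
alpha = 1.0 / sum(unnormalized)
print([round(alpha * p, 4) for p in unnormalized])   # ~= [0.9766, 0.0234]
```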
Bayes vs. Humans
MAP approximation
Summing over the hypothesis space is often intractable
(e.g., 18,446,744,073,709,551,616 Boolean functions of 6 attributes)
Maximum a posteriori (MAP) learning: choose h_MAP maximizing P(h_k | X)

I.e., maximize P(X | h_k) P(h_k)
or minimize −log P(X | h_k) − log P(h_k)
. . . or, in information theory terms, minimize
bits to encode data given hypothesis + bits to encode hypothesis
This is the basic idea of minimum description length (MDL) learning
In science experiments the “inputs” are fixed and a deterministic hypothesis h predicts the “outputs” exactly
⇒ P(X | h_k) is 1 if h_k is consistent with the data, 0 otherwise
⇒ MAP = simplest consistent hypothesis
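As a small illustration (my own code, not the slides'), MAP learning replaces the sum over hypotheses with an argmax of log P(X | h_k) + log P(h_k); for the candy example after 5 limes it commits to h5.

```python
import math

# MAP over the candy-bag hypotheses after observing 5 limes.
priors = {"h1": 0.1, "h2": 0.2, "h3": 0.4, "h4": 0.2, "h5": 0.1}
p_lime = {"h1": 0.0, "h2": 0.25, "h3": 0.5, "h4": 0.75, "h5": 1.0}
n_limes = 5

def log_posterior(h):
    # log P(X | h) + log P(h), up to the constant log alpha;
    # -inf when the likelihood is zero (h1 cannot produce a lime).
    if p_lime[h] == 0.0:
        return float("-inf")
    return n_limes * math.log(p_lime[h]) + math.log(priors[h])

h_map = max(priors, key=log_posterior)
print(h_map)   # h5
```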
ML approximation
For large data sets, the prior becomes irrelevant
Maximum likelihood (ML) learning: choose h_ML maximizing P(X | h_k)
I.e., simply get the best fit to the data; identical to MAP for a uniform prior
(which is reasonable if all hypotheses are of the same complexity)
ML is the “standard” (non-Bayesian) statistical learning method
A simple generative model: Bernoulli
A generative model is a probability model from which the probability of any
observable data set can be derived
[Usually contrasted with a discriminative or conditional model, which gives
only the probability for the “output” given the observable “inputs”]
E.g., the Bernoulli[θ] model:
    P(X_i = 1) = θ;  P(X_i = 0) = 1 − θ
    or P(X_i = x_i) = θ^{x_i} (1 − θ)^{1 − x_i}
Suppose we get a bag of candy from a new manufacturer;
fraction θ of cherry candies
Any θ is possible: continuum of hypotheses hθ
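A tiny sketch (my own; the value of θ is illustrative) of what “generative” means here: the Bernoulli[θ] model can both generate synthetic observations and assign a probability to any observation.

```python
import numpy as np

theta = 0.7                       # hypothetical fraction of cherry candies
rng = np.random.default_rng(0)

# Generate observable data: x_i = 1 for cherry, 0 for lime.
x = rng.binomial(1, theta, size=10)

# Probability of each observation under the model: theta^x * (1 - theta)^(1 - x)
p = theta ** x * (1 - theta) ** (1 - x)
print(x)
print(p)
```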
ML estimation of Bernoulli model
Suppose we unwrap N candies, c cherries and ℓ = N − c limes

These are i.i.d. (independent, identically distributed) observations, so

    P(X | h_θ) = ∏_{i=1}^{N} P(x_i | h_θ) = θ^{Σ_i x_i} (1 − θ)^{N − Σ_i x_i} = θ^c (1 − θ)^ℓ

Maximize this w.r.t. θ — which is easier for the log-likelihood:

    L(X | h_θ) = log P(X | h_θ) = Σ_{i=1}^{N} log P(x_i | h_θ) = c log θ + ℓ log(1 − θ)

    dL(X | h_θ)/dθ = c/θ − ℓ/(1 − θ) = 0   ⇒   θ = c/(c + ℓ) = c/N
Seems sensible, but causes problems with 0 counts!
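A short sketch (my own code) of the closed-form estimate θ̂ = c/N on sampled data, plus the zero-count issue; the add-one smoothing shown at the end is a common remedy that is not covered on this slide.

```python
import numpy as np

rng = np.random.default_rng(1)
true_theta = 0.6
x = rng.binomial(1, true_theta, size=20)   # 1 = cherry, 0 = lime

c = x.sum()               # number of cherries
N = len(x)
theta_ml = c / N          # maximizes c*log(theta) + (N - c)*log(1 - theta)
print(theta_ml)

# Zero-count problem: with a tiny sample, theta_ml can be exactly 0 or 1,
# assigning zero probability to a flavour that is merely unobserved so far.
# A common remedy (an assumption here, not from the slide) is add-one smoothing:
theta_smoothed = (c + 1) / (N + 2)
print(theta_smoothed)
```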
Naive Bayes models
Generative model for discrete (often Boolean) classification problems:
– Each example has discrete class variable Yi
– Each example has discrete or continuous attributes Xij , j = 1, . . . , D
– Attributes are conditionally independent given the class value:
[Network diagram: class node Y_i with prior P(Y_i = 1) = θ; attribute nodes X_{i,1}, . . . , X_{i,D}, each with CPT P(X_{ij} = 1 | Y_i = 0) = θ_{0,j} and P(X_{ij} = 1 | Y_i = 1) = θ_{1,j}.]

    P(y_i, x_{i,1}, . . . , x_{i,D}) = P(y_i) ∏_{j=1}^{D} P(x_{ij} | y_i)
                                     = θ^{y_i} (1 − θ)^{1 − y_i} ∏_{j=1}^{D} θ_{y_i,j}^{x_{ij}} (1 − θ_{y_i,j})^{1 − x_{ij}}
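A sketch of sampling from this generative model (my own code; the parameter values are illustrative, not from the slides): draw the class, then draw each attribute independently given the class.

```python
import numpy as np

rng = np.random.default_rng(2)
D, N = 3, 5
theta = 0.4                                   # P(Y_i = 1)
theta_yj = np.array([[0.2, 0.7, 0.5],         # P(X_ij = 1 | Y_i = 0), j = 1..D
                     [0.9, 0.1, 0.5]])        # P(X_ij = 1 | Y_i = 1)

y = rng.binomial(1, theta, size=N)            # class labels
X = rng.binomial(1, theta_yj[y])              # attributes, independent given the class
print(y)
print(X)
```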
ML estimation of Naive Bayes models
Likelihood is

    P(X | h_θ) = ∏_{i=1}^{N} θ^{y_i} (1 − θ)^{1 − y_i} ∏_{j=1}^{D} θ_{y_i,j}^{x_{ij}} (1 − θ_{y_i,j})^{1 − x_{ij}}

Log likelihood is

    L = log P(X | h_θ)
      = Σ_{i=1}^{N} [ y_i log θ + (1 − y_i) log(1 − θ) + Σ_{j=1}^{D} x_{ij} log θ_{y_i,j} + (1 − x_{ij}) log(1 − θ_{y_i,j}) ]

This has parameters in separate terms, so the derivatives are decoupled:

    ∂L/∂θ = Σ_{i=1}^{N} [ y_i/θ − (1 − y_i)/(1 − θ) ] = N_1/θ − (N − N_1)/(1 − θ)

    ∂L/∂θ_{y,j} = Σ_{i: y_i = y} [ x_{ij}/θ_{y,j} − (1 − x_{ij})/(1 − θ_{y,j}) ] = N_{y,j}/θ_{y,j} − (N_y − N_{y,j})/(1 − θ_{y,j})

where N_y = number of examples with class label y
and N_{y,j} = number of examples with class label y and value 1 for X_{ij}
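Because the derivatives decouple, the ML estimates are just per-class frequencies. A minimal sketch (my own helper function, not an official API):

```python
import numpy as np

def fit_naive_bayes(X, y):
    """ML estimates for the Boolean naive Bayes model.

    X: (N, D) array of 0/1 attributes; y: (N,) array of 0/1 class labels
    (both classes are assumed to appear in y).
    Returns theta = P(Y = 1) and theta_yj, where theta_yj[v, j] = P(X_j = 1 | Y = v).
    """
    theta = y.mean()                              # N_1 / N
    theta_yj = np.vstack([X[y == v].mean(axis=0)  # N_{v,j} / N_v for v = 0, 1
                          for v in (0, 1)])
    return theta, theta_yj

# Tiny usage example with made-up data; training is just counting, O(ND) time.
y = np.array([1, 0, 1, 1, 0])
X = np.array([[1, 0], [0, 1], [1, 1], [1, 0], [0, 0]])
print(fit_naive_bayes(X, y))
```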
ML estimation contd.
Setting the derivatives to zero:

    θ = N_1/N
    θ_{y,j} = N_{y,j}/N_y

as before

I.e., count the fraction of each class with the jth attribute set to 1
⇒ O(ND) time to train the model

Example: 1000 cherry and lime candies, wrapped in red or green wrappers
by the Surprise Candy Company
    400 cherry, of which 300 have red wrappers and 100 green wrappers
    600 lime, of which 120 have red wrappers and 480 green wrappers

    θ = P(Flavor = cherry) = 400/1000 = 0.40
    θ_{1,1} = P(Wrapper = red | Flavor = cherry) = 300/400 = 0.75
    θ_{0,1} = P(Wrapper = red | Flavor = lime) = 120/600 = 0.20
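Plugging the counts into those formulas is a one-liner each; a trivial check in Python (my own code):

```python
N = 1000
N_cherry, N_cherry_red = 400, 300
N_lime, N_lime_red = 600, 120

theta = N_cherry / N                 # P(Flavor = cherry)         = 0.40
theta_11 = N_cherry_red / N_cherry   # P(Wrapper = red | cherry)  = 0.75
theta_01 = N_lime_red / N_lime       # P(Wrapper = red | lime)    = 0.20
print(theta, theta_11, theta_01)
```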
Classifying a new example
    P(Y = 1 | x_1, . . . , x_D) = α P(x_1, . . . , x_D | Y = 1) P(Y = 1)
                                = α θ ∏_{j=1}^{D} θ_{1,j}^{x_j} (1 − θ_{1,j})^{1 − x_j}

    log P(Y = 1 | x_1, . . . , x_D)
      = log α + log θ + Σ_{j=1}^{D} [ x_j log θ_{1,j} + (1 − x_j) log(1 − θ_{1,j}) ]
      = log α + log θ + Σ_{j=1}^{D} log(1 − θ_{1,j}) + Σ_{j=1}^{D} x_j log(θ_{1,j}/(1 − θ_{1,j}))
The set of points where
P (Y = 1 | x1, . . . , xD ) = P (Y = 0 | x1, . . . , xD ) = 0.5
is a linear separator! (But location is sensitive to class prior.)
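A sketch (my own code) of classification by comparing the two class log-posteriors, which is a linear function of x because the α terms cancel; the parameters are the candy/wrapper estimates from the previous slide, with Y = 1 meaning cherry and x_1 = 1 meaning a red wrapper.

```python
import math

import numpy as np

theta = 0.40                   # P(Y = 1)
theta_1 = np.array([0.75])     # P(x_j = 1 | Y = 1)
theta_0 = np.array([0.20])     # P(x_j = 1 | Y = 0)

def log_odds(x):
    """log P(Y=1 | x) - log P(Y=0 | x): linear in x, so the decision boundary is a hyperplane."""
    bias = (math.log(theta) - math.log(1 - theta)
            + np.sum(np.log(1 - theta_1) - np.log(1 - theta_0)))
    weights = np.log(theta_1 / (1 - theta_1)) - np.log(theta_0 / (1 - theta_0))
    return bias + weights @ x

x = np.array([1])              # a red-wrapped candy
print(log_odds(x))             # ~= 0.916 > 0, so predict cherry
```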
Summary
Full Bayesian learning gives best possible predictions but is intractable
MAP learning balances complexity with accuracy on training data
Maximum likelihood assumes uniform prior, OK for large data sets
1. Choose a parameterized family of models to describe the data
   – requires substantial insight and sometimes new models
2. Write down the likelihood of the data as a function of the parameters
   – may require summing over hidden variables, i.e., inference
3. Write down the derivative of the log likelihood w.r.t. each parameter
4. Find the parameter values such that the derivatives are zero
   – may be hard/impossible; modern optimization techniques help
Naive Bayes is a simple generative model with a very fast training method
that finds a linear separator in input feature space
and provides probabilistic predictions