Lecture 3: Pattern Recognition and Machine Learning (II)
Bayes’ Theorem and Parameter Estimation
Kai Yu
SpeechLab
Department of Computer Science & Engineering
Shanghai Jiao Tong University
Autumn 2014
Table of Contents
- Bayes' Theorem
- Parameter Estimation
Recap: Probability Theory
The quantity P(X = x) can be estimated as the number of times x has occurred over the total number of events:
$$P(X = x) = \frac{C(X = x)}{\sum_{\bar{x}} C(X = \bar{x})}$$
where C(X = x) is the count of the event X = x.
- Mean:
  $$E[x] = \mu = \sum_{x \in X} x\,P(x)$$
- Variance:
  $$E[(x - \mu)^2] = \sigma^2 = \sum_{x} (x - \mu)^2 P(x)$$
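As a quick illustration (not part of the original slides), the sketch below estimates P(X = x), the mean, and the variance from a list of discrete observations; the sample values are made up for the example.

```python
from collections import Counter

# Hypothetical discrete observations of X (made-up data for illustration).
samples = [1, 2, 2, 3, 3, 3, 4]

counts = Counter(samples)                # C(X = x)
total = sum(counts.values())             # sum of counts over all events
prob = {x: c / total for x, c in counts.items()}   # P(X = x)

mean = sum(x * p for x, p in prob.items())               # E[x]
var = sum((x - mean) ** 2 * p for x, p in prob.items())  # E[(x - mu)^2]

print(prob, mean, var)
```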
Recap: Independence
- Exclusiveness: two events never occur simultaneously.
- Independence: the probability of X happening does not depend on Y, and vice versa:
  $$P(X, Y) = P(X)P(Y)$$
The mean and variance are calculated as:
$$\mu_X = E[X], \qquad \sigma_X^2 = E[(X - \mu_X)^2]$$
Now consider Z = nX or Z = X + Y.
Q: What are $\mu_Z$ and $\sigma_Z^2$?
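A small simulation (added here as an illustration, with assumed distributions for X and Y) can be used to check empirically how the mean and variance behave under Z = nX and, for independent X and Y, Z = X + Y.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(1.0, 2.0, size=100_000)    # assumed distribution for X
Y = rng.normal(-0.5, 1.5, size=100_000)   # assumed distribution for Y, independent of X
n = 3

Z1 = n * X    # scaling: the mean scales by n, the variance by n^2
Z2 = X + Y    # independent sum: means add, variances add

print(Z1.mean(), Z1.var())   # ~ n*mu_X, n^2*sigma_X^2
print(Z2.mean(), Z2.var())   # ~ mu_X + mu_Y, sigma_X^2 + sigma_Y^2
```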
Recap: Probability of Multiple Random Variables
- Joint probability: P(X, Y) is the probability of both X and Y happening.
- Conditional probability: P(X|Y) is the probability of X happening given the event Y.
- Marginalization: given P(X, Y), calculate the probability of X by summing over all cases of Y.
Q: What’s Posterior and Prior probability?
The Biased Coin Example
Let X be the outcome of tossing a biased coin, with two possible outcomes: Head (H) and Tail (T). Now consider two glasses, one filled with water (F) and the other empty (E). Let Y denote which of the two glasses the coin falls into, and suppose the following is observed:
The probability of the coin falling into the empty glass E and showing a head H is
$$P(X = H, Y = E) = 0.20$$
Q: How do we compute P(X = H|Y = E) or P(Y = F|X = T)?
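One way to answer this is to start from a full joint table P(X, Y). The slide only gives P(X = H, Y = E) = 0.20, so the remaining entries below are hypothetical, chosen only so the table sums to one.

```python
# Hypothetical joint distribution P(X, Y); only P(H, E) = 0.20 comes from the slide.
joint = {
    ('H', 'E'): 0.20, ('H', 'F'): 0.25,
    ('T', 'E'): 0.30, ('T', 'F'): 0.25,
}

def marginal_Y(y):
    # P(Y = y) by marginalizing over X
    return sum(p for (x, yy), p in joint.items() if yy == y)

def marginal_X(x):
    # P(X = x) by marginalizing over Y
    return sum(p for (xx, y), p in joint.items() if xx == x)

# Conditional probabilities via P(A|B) = P(A, B) / P(B)
p_H_given_E = joint[('H', 'E')] / marginal_Y('E')
p_F_given_T = joint[('T', 'F')] / marginal_X('T')

print(p_H_given_E, p_F_given_T)
```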
Recap: Bayes’ Theorem
Discrete distribution
Using the definition of conditional probability, the joint probability can be written as
$$P(X, Y) = P(X|Y)P(Y) = P(Y|X)P(X)$$
Bayes' theorem is a rearrangement of the above:
$$P(Y|X) = \frac{P(X|Y)P(Y)}{P(X)}$$
Regard X as the outcome of a random experiment:
- P(Y|X): posterior probability of Y given X
- P(X|Y): conditional probability of X given Y
- P(Y): prior probability of Y
- $P(X) = \sum_Y P(X|Y)P(Y)$: marginal probability or evidence of X
Bayes’ Theorem
Likelihood and interpretation of probability
Likelihood in Bayes’ theorem:
- Continuous distribution (e.g. a 2-dimensional image):
  $$p(y|x) = \frac{p(x|y)p(y)}{p(x)}$$
- Mixed distribution (e.g. speech recognition):
  $$P(Y|x) = \frac{p(x|Y)P(Y)}{p(x)}$$
Interpretations of probability:
- Bayesian: probability measures a degree of belief; Bayes' theorem links the degree of belief in a proposition before and after accounting for evidence.
- Frequentist: probability measures a proportion of outcomes over many experiments; P(X|Y) is the proportion of outcomes with property X among outcomes with property Y.
Bayes’ Theorem
Monty Hall Problem
Suppose you are given the choice of three doors: behind one door is a car; behind the others, goats. You pick a door, say No. 3, and the host, who knows what is behind the doors, opens another door, say No. 1, which has a goat. He then asks you:
Do you want to switch to door No. 2?
Analysis of Monty Hall Problem
Random experiment: the host chooses a door and opens it.
- Before the experiment:
  $$P(D_1 = \text{car}) = P(D_2 = \text{car}) = P(D_3 = \text{car}) = \frac{1}{3}$$
- After the experiment:
  $$P(D_2 = \text{car}) = \sum_{D_3 \in \{\text{car}, \text{goat}\}} P(D_2 = \text{car}\,|\,D_3)\,P(D_3)$$
  With $P(D_3 = \text{goat}) = \frac{2}{3}$:
  $$P(D_2 = \text{car}) = 0 + 1 \times \frac{2}{3} = \frac{2}{3}$$
Q: Why does the host’s indication matter?
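A brute-force simulation (added here as an illustration) agrees with the 2/3 result: switching wins exactly when the initially picked door hides a goat.

```python
import random

def play(switch, trials=100_000):
    wins = 0
    for _ in range(trials):
        car = random.randrange(3)
        pick = random.randrange(3)
        # Host opens a door that is neither the pick nor the car.
        # (If two goat doors are available, taking the first one does not change the win rate.)
        host = next(d for d in range(3) if d != pick and d != car)
        if switch:
            pick = next(d for d in range(3) if d != pick and d != host)
        wins += (pick == car)
    return wins / trials

print(play(switch=False))  # ~1/3
print(play(switch=True))   # ~2/3
```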
Bayes’ Decision Rule for Classification
Overall aim of the classifier: “How to build a pattern classifier that
minimizes the average probability of error?”
Consider such a system:
- feature vector x
- K classes: $w_1, w_2, \ldots, w_K$
- a set of K priors: $P(w_1), P(w_2), \ldots, P(w_K)$
- a set of class-conditional p.d.f.'s: $p(x|w_k)$, $k = 1, 2, \ldots, K$
The posterior probability of each class can then be calculated:
$$P(w_j|x) = \frac{p(x|w_j)P(w_j)}{\sum_{k=1}^{K} p(x|w_k)P(w_k)}, \quad j = 1, 2, \ldots, K$$
Bayes' decision rule:
$$\hat{w} = \arg\max_{w_j} P(w_j|x)$$
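A minimal sketch of the rule, assuming Gaussian class-conditional densities with made-up parameters: the classifier evaluates p(x|wk)P(wk) for every class and returns the arg max (the evidence p(x) is a common factor and can be dropped).

```python
from scipy.stats import norm

# Hypothetical 1-D example with two classes.
priors = {'w1': 0.6, 'w2': 0.4}
likelihoods = {'w1': norm(loc=0.0, scale=1.0),   # p(x|w1)
               'w2': norm(loc=2.0, scale=1.5)}   # p(x|w2)

def classify(x):
    # Bayes' decision rule: pick the class with the largest posterior,
    # i.e. the largest p(x|w) * P(w).
    return max(priors, key=lambda w: likelihoods[w].pdf(x) * priors[w])

print(classify(0.3), classify(2.5))
```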
Bayes’ Decision Rule for Classification
Error analysis for binary classification
A binary classification rule divides the complete space into two regions: decide class $w_1$ in region $R_1$ and $w_2$ in $R_2$. The probability of error is then:
$$\begin{aligned}
P(\text{error}) &= P(x \in R_2, w_1) + P(x \in R_1, w_2) \\
&= P(x \in R_2|w_1)P(w_1) + P(x \in R_1|w_2)P(w_2) \\
&= \int_{R_2} p(x|w_1)P(w_1)\,dx + \int_{R_1} p(x|w_2)P(w_2)\,dx
\end{aligned}$$
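For two 1-D Gaussian classes and a fixed decision threshold, the error integral above can be evaluated numerically; the parameters and threshold below are assumptions chosen only for illustration.

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

# Assumed class-conditional densities and priors.
p1, p2 = 0.5, 0.5
g1, g2 = norm(0.0, 1.0), norm(2.0, 1.0)
t = 1.0   # decide w1 for x < t (region R1), w2 for x >= t (region R2)

err_1, _ = quad(lambda x: g1.pdf(x) * p1, t, np.inf)     # x in R2 but true class w1
err_2, _ = quad(lambda x: g2.pdf(x) * p2, -np.inf, t)    # x in R1 but true class w2
print(err_1 + err_2)
```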
Parameter Estimation in Supervised Training of Probability Distributions
Three key elements:
- Data: assumed to be generated from the underlying distribution
- Model: structure and parameter set
  - Gaussian
  - Gaussian Mixture Model
- Criterion
  - Maximum likelihood
  - Maximum a posteriori
  - Discriminative

Assumptions:
- The parametric form of the distribution is known
- Samples are independently and identically distributed (i.i.d.)
Maximum Likelihood Estimation (MLE)
With the i.i.d. assumption, the likelihood of the model parameters θ with respect to the set of samples $X = \{x_1, x_2, \ldots, x_N\}$ is
$$p(X|\theta) = \prod_{n=1}^{N} p(x_n|\theta)$$
Assumption: θ is deterministic though unknown.
Goal: choose the parameter set $\hat{\theta}$ that maximizes $p(X|\theta)$.
In practice it is easier to work with the log of the likelihood function:
$$L(\theta) = \log p(X|\theta) = \sum_{n=1}^{N} \log p(x_n|\theta)$$
Maximizing the Likelihood
A typical approach is to find the θ where the gradient is 0:
$$\nabla_\theta L(\theta) = \sum_{n=1}^{N} \nabla_\theta \log p(x_n|\theta) = \sum_{n=1}^{N} \frac{\partial \log p(x_n|\theta)}{\partial \theta} = 0$$
Recap: Single Gaussian Distribution
$$N(x|\mu, \Sigma) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left\{ -\frac{1}{2} (x - \mu)^\top \Sigma^{-1} (x - \mu) \right\}$$
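The sketch below evaluates this density directly with NumPy for an assumed mean and covariance; it can be cross-checked against scipy.stats.multivariate_normal.

```python
import numpy as np

def gaussian_pdf(x, mu, Sigma):
    """Multivariate Gaussian density N(x | mu, Sigma)."""
    d = len(mu)
    diff = x - mu
    norm_const = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma))
    quad_form = diff @ np.linalg.inv(Sigma) @ diff
    return np.exp(-0.5 * quad_form) / norm_const

# Assumed parameters for illustration.
mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.3], [0.3, 1.0]])
print(gaussian_pdf(np.array([0.5, 0.5]), mu, Sigma))
```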
Recap: Properties of Single Gaussian Distribution
[Figure: one-dimensional Gaussian p.d.f.s, e.g. p(x) with mean 0.5 and sigma 2, and the conditional p(x|y = 0.7) with mean 0.6 and sigma 1.5]
- The marginal distribution $p(x_i)$ of any component is Gaussian.
- The joint marginal distribution of any subset $p(x_i, x_j, \ldots)$ is Gaussian.
- The conditional distribution $p(x_i|x_j)$ is Gaussian.
- If x is Gaussian and y = Ax + b, then y is Gaussian with mean $A\mu_x + b$ and covariance $A\Sigma_x A^\top$.
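The last property can be checked empirically; the sketch below samples x from an assumed Gaussian, applies y = Ax + b, and compares the sample mean and covariance of y with the predicted values.

```python
import numpy as np

rng = np.random.default_rng(0)
mu_x = np.array([1.0, -1.0])
Sigma_x = np.array([[2.0, 0.5], [0.5, 1.0]])
A = np.array([[1.0, 2.0], [0.0, 1.0]])
b = np.array([0.5, -0.5])

x = rng.multivariate_normal(mu_x, Sigma_x, size=200_000)
y = x @ A.T + b   # apply the affine transform to every sample

print(y.mean(axis=0), A @ mu_x + b)    # sample mean vs. A mu_x + b
print(np.cov(y.T), A @ Sigma_x @ A.T)  # sample covariance vs. A Sigma_x A^T
```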
Gaussian Parameter Estimation
$$p(X|\theta) = \prod_{n=1}^{N} N(x_n|\mu, \sigma^2)$$
MLE for a Single-Dimensional Gaussian
Independent update of mean and variance
Consider single-dimensional observations (d = 1):
$$L(\theta) = \sum_{n=1}^{N} \log p(x_n|\mu, \sigma^2) = \sum_{n=1}^{N} \left\{ -\frac{1}{2}\log(2\pi\sigma^2) - \frac{(x_n - \mu)^2}{2\sigma^2} \right\}$$
$$\nabla_\mu L(\theta) = \frac{\partial L}{\partial \mu} = \frac{1}{\sigma^2}\sum_{n=1}^{N} (x_n - \mu) = 0 \quad\Rightarrow\quad \hat{\mu} = \frac{1}{N}\sum_{n=1}^{N} x_n$$
$$\nabla_{\sigma^2} L(\theta) = \frac{\partial L}{\partial \sigma^2} = \sum_{n=1}^{N} \left( \frac{(x_n - \hat{\mu})^2}{2\sigma^4} - \frac{1}{2\sigma^2} \right) = 0 \quad\Rightarrow\quad \hat{\sigma}^2 = \frac{1}{N}\sum_{n=1}^{N} (x_n - \hat{\mu})^2$$
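A minimal NumPy sketch of these estimators, using synthetic data with assumed true parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=1.5, scale=2.0, size=10_000)  # assumed true mu = 1.5, sigma = 2

mu_hat = x.mean()                        # (1/N) sum x_n
sigma2_hat = ((x - mu_hat) ** 2).mean()  # (1/N) sum (x_n - mu_hat)^2 (ML estimate, biased)

print(mu_hat, sigma2_hat)
```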
Vector Calculus
Useful formulas of vector calculus:
$$\frac{\partial}{\partial A} \log|A| = (A^{-1})^\top \qquad \frac{\partial}{\partial A} (x^\top A y) = x y^\top$$
$$\frac{\partial}{\partial x} (x^\top a) = a \qquad \frac{\partial}{\partial x} (x^\top A x) = (A + A^\top) x$$
Note:
$$|A| = \sum_{i=1}^{n} (-1)^{i+j} a_{ij} M_{ij} \qquad A^{-1} = \frac{1}{|A|} A^{*}$$
where $M_{ij}$ is the (i, j) minor and $A^{*}$ is the adjugate of A.
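These identities can be sanity-checked numerically; the sketch below compares the formula for the gradient of $x^\top A x$ against a finite-difference gradient for a random A and x.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4))
x = rng.normal(size=4)

analytic = (A + A.T) @ x   # claimed gradient of x^T A x with respect to x

eps = 1e-6
numeric = np.zeros_like(x)
f = lambda v: v @ A @ v    # the scalar function x^T A x
for i in range(len(x)):
    e = np.zeros_like(x); e[i] = eps
    numeric[i] = (f(x + e) - f(x - e)) / (2 * eps)   # central difference

print(np.max(np.abs(analytic - numeric)))  # should be close to 0
```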
MLE for a Multivariate Gaussian
Vector form
$$L(\theta) = \sum_{n=1}^{N} \log N(x_n|\mu, \Sigma) = -\frac{1}{2}\sum_{n=1}^{N} \left\{ d\log(2\pi) + \log|\Sigma| + (x_n - \mu)^\top \Sigma^{-1} (x_n - \mu) \right\}$$
$$\hat{\mu} = \frac{1}{N}\sum_{n=1}^{N} x_n \qquad \hat{\Sigma} = \frac{1}{N}\sum_{n=1}^{N} (x_n - \hat{\mu})(x_n - \hat{\mu})^\top$$
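The corresponding NumPy sketch on synthetic data with assumed true parameters; note the ML covariance uses the 1/N normalization.

```python
import numpy as np

rng = np.random.default_rng(0)
mu_true = np.array([1.0, -2.0])
Sigma_true = np.array([[2.0, 0.4], [0.4, 1.0]])
X = rng.multivariate_normal(mu_true, Sigma_true, size=50_000)

mu_hat = X.mean(axis=0)              # (1/N) sum x_n
diff = X - mu_hat
Sigma_hat = diff.T @ diff / len(X)   # (1/N) sum (x_n - mu_hat)(x_n - mu_hat)^T

print(mu_hat, Sigma_hat)
```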
Maximizing the Likelihood
Another type of case
Data:
$$X = \{x_1, \cdots, x_N\}, \quad x_i > 0$$
Model:
$$p(x|\theta) = \begin{cases} \dfrac{1}{\theta_2 - \theta_1} & \theta_1 < x < \theta_2 \\ 0 & \text{otherwise} \end{cases}$$
Criterion:
$$\hat{\theta} = \arg\max_\theta \log p(X|\theta)$$
Q: How to maximize the likelihood?
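One way to see the behaviour (a worked illustration added here, not an answer given in the slides) is to note that the likelihood equals $(\theta_2 - \theta_1)^{-N}$ whenever all samples lie inside $(\theta_1, \theta_2)$ and is zero otherwise, so the log-likelihood grows as the interval is tightened around the data. The sketch below shows this numerically with made-up data.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(2.0, 5.0, size=1000)   # made-up data, all x_i > 0

def log_lik(theta1, theta2):
    # log p(X|theta) for the uniform model; -inf if any sample falls outside (theta1, theta2)
    if theta1 < x.min() and x.max() < theta2:
        return -len(x) * np.log(theta2 - theta1)
    return -np.inf

# The log-likelihood increases as the interval tightens around the data:
print(log_lik(1.0, 6.0), log_lik(1.9, 5.1), log_lik(x.min() - 1e-9, x.max() + 1e-9))
```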
Maximum a Posteriori Estimation
- Maximum likelihood estimation (MLE): deterministic θ
  $$L_{\text{ml}}(\theta) = \log p(X|\theta)$$
- Maximum a posteriori (MAP): random θ
  $$L_{\text{map}}(\theta) = \log p(\theta|X) = \log p(X|\theta) + \log p(\theta) - \log p(X)$$
  (the last term is constant with respect to θ and can be dropped)
- p(θ) incorporates prior knowledge about the parameter θ
- Useful when only limited data are available
MAP Estimation for Gaussian
Data:
$$X = \{x_1, x_2, \ldots, x_N\}$$
Gaussian model:
$$N(x|\mu, \Sigma) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left\{ -\frac{1}{2} (x - \mu)^\top \Sigma^{-1} (x - \mu) \right\}$$
A prior exists for µ:
$$p(\mu) = N(\mu|\mu_0, \Sigma/\tau) = \frac{1}{(2\pi)^{d/2} |\Sigma/\tau|^{1/2}} \exp\left\{ -\frac{\tau}{2} (\mu - \mu_0)^\top \Sigma^{-1} (\mu - \mu_0) \right\}$$
- Σ is known
- $\mu_0$ and τ are hyper-parameters
Solution for MAP Estimation
Criterion:
$$L_{\text{map}}(\mu) = \log N(\mu|\mu_0, \Sigma/\tau) + \sum_{n=1}^{N} \log N(x_n|\mu, \Sigma)$$
$$= -\frac{1}{2}\left\{ \tau(\mu - \mu_0)^\top \Sigma^{-1} (\mu - \mu_0) + \sum_{n=1}^{N} (x_n - \mu)^\top \Sigma^{-1} (x_n - \mu) \right\} + \text{const}$$
Solution:
$$\hat{\mu} = \frac{\sum_{n=1}^{N} x_n + \tau\mu_0}{N + \tau}$$
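A minimal sketch of the MAP mean, with synthetic data and assumed hyper-parameters µ0 and τ; as N grows the estimate moves from the prior mean toward the ML mean.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=1.0, size=20)   # small synthetic data set
mu0, tau = 0.0, 5.0                           # assumed prior mean and prior strength

mu_map = (x.sum() + tau * mu0) / (len(x) + tau)   # MAP estimate of the mean
mu_ml = x.mean()                                   # ML estimate for comparison
print(mu_ml, mu_map)   # MAP is pulled toward mu0 relative to ML
```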