Parameter Estimation:
Maximum Likelihood Estimation
Chapter 3 (Duda et al.) – Sections 3.1-3.2
CS479/679 Pattern Recognition
Dr. George Bebis
Parameter Estimation
• Bayesian Decision Theory allows us to design an
optimal classifier, provided that we have first estimated
P(ωi) and p(x|ωi) (a small numeric sketch of the rule follows below):

$$P(\omega_j \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid \omega_j)\, P(\omega_j)}{p(\mathbf{x})}$$

• Estimating the priors P(ωi) is usually not very difficult.
• Estimating the class-conditional densities p(x|ωi) can be
more difficult:
– The dimensionality of the feature space is often large.
– The number of samples is often too small.
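As a concrete illustration of Bayes' rule above — a minimal sketch with two classes and made-up 1-D Gaussian class-conditionals and priors (all numbers are hypothetical):

```python
import numpy as np
from scipy.stats import norm

# Hypothetical two-class, 1-D setup (all numbers are made up)
priors = np.array([0.6, 0.4])                      # P(w1), P(w2)

def likelihoods(x):
    """Class-conditional densities p(x|w1), p(x|w2)."""
    return np.array([norm.pdf(x, loc=0.0, scale=1.0),
                     norm.pdf(x, loc=2.0, scale=1.0)])

x = 1.0
joint = likelihoods(x) * priors                    # p(x|wj) P(wj)
posteriors = joint / joint.sum()                   # divide by evidence p(x)
print(posteriors)                                  # P(wj|x), sums to 1
```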
Parameter Estimation (cont’d)
• We will make the following assumptions:
– A set of training samples D = {x1, x2, ..., xn} is given, where the
samples were drawn according to p(x|ωj).
– p(x|ωj) has some known parametric form, e.g.,

$$p(\mathbf{x} \mid \omega_j) \sim N(\boldsymbol{\mu}_j, \Sigma_j)$$

also denoted as p(x|θ), where θ = (μj, Σj).
• Parameter estimation problem:
Given D, find the best possible θ
Main Methods in
Parameter Estimation
• Maximum Likelihood (ML)
• Bayesian Estimation (BE)
Main Methods in Parameter Estimation
• Maximum Likelihood (ML)
– Best estimate is obtained by maximizing the probability
of obtaining the samples D ={x1, x2, ...., xn} actually
observed:
$$p(\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n \mid \theta) = p(D \mid \theta)$$

$$\hat{\theta} = \arg\max_{\theta}\, p(D \mid \theta)$$

– ML assumes that θ is fixed and makes a point estimate:

$$p(\mathbf{x} \mid \theta) \approx p(\mathbf{x} \mid \hat{\theta})$$
Main Methods in Parameter Estimation
(cont’d)
• Bayesian Estimation (BE)
– Assumes that θ is a set of random variables with
some known a priori distribution p(θ).
– Estimates a whole distribution rather than making a point
estimate (as ML does); a numeric sketch follows below:

$$p(\mathbf{x} \mid D) = \int p(\mathbf{x} \mid \theta)\, p(\theta \mid D)\, d\theta$$

Note: the BE solution p(x|D) might not be of the
parametric form assumed (e.g., p(x|θ)).
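A minimal numeric sketch of this integral, assuming a 1-D Gaussian likelihood with unknown mean and a made-up Gaussian prior, with the posterior p(θ|D) and the integral both approximated on a grid:

```python
import numpy as np
from scipy.stats import norm

data = np.array([0.8, 1.2, 0.9, 1.5])        # hypothetical samples
thetas = np.linspace(-3.0, 5.0, 801)         # grid over the unknown mean
dt = thetas[1] - thetas[0]

# p(theta|D) is proportional to p(D|theta) p(theta); prior is a made-up N(0, 2^2)
log_lik = norm.logpdf(data[:, None], thetas, 1.0).sum(axis=0)
log_post = log_lik + norm.logpdf(thetas, 0.0, 2.0)
post = np.exp(log_post - log_post.max())
post /= (post * dt).sum()                    # normalize on the grid

# Predictive density p(x|D) = integral of p(x|theta) p(theta|D) dtheta
x = 1.0
p_x_given_D = (norm.pdf(x, thetas, 1.0) * post * dt).sum()
print(p_x_given_D)
```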
ML Estimation - Assumptions
• Consider c classes and c training data sets (i.e.,
one for each class):
D1, D2, ...,Dc
• Samples in Dj are drawn independently according
to p(x|ωj).
• Problem: given D1, D2, ..., Dc and a model for
p(x|ωj) ~ p(x|θj), estimate:

θ1, θ2, …, θc
ML Estimation - Problem Formulation
• If the samples in Dj provide no information about θi (i ≠ j),
we can solve c independent problems (i.e., one for each
class).
• The ML estimate for D = {x1, x2, ..., xn} is the value θ̂ that
maximizes p(D|θ) (i.e., that best supports the training data):

$$\hat{\theta} = \arg\max_{\theta}\, p(D \mid \theta)$$

– Using the independence assumption, we can factorize p(D|θ):

$$p(D \mid \theta) = p(\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n \mid \theta) = \prod_{k=1}^{n} p(\mathbf{x}_k \mid \theta)$$
ML Estimation - Solution
• How can we find the maximum of p(D|θ)?

$$\nabla_{\theta}\, p(D \mid \theta) = 0$$

where $\nabla_{\theta} = \left[ \frac{\partial}{\partial \theta_1}, \ldots, \frac{\partial}{\partial \theta_p} \right]^t$ (gradient).
ML Estimation Using Log-Likelihood
• Taking the log for simplicity:
$$p(D \mid \theta) = p(\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n \mid \theta) = \prod_{k=1}^{n} p(\mathbf{x}_k \mid \theta)$$

$$\ln p(D \mid \theta) = \sum_{k=1}^{n} \ln p(\mathbf{x}_k \mid \theta) \quad \text{(log-likelihood)}$$

• Maximize ln p(D|θ):

$$\hat{\theta} = \arg\max_{\theta}\, \ln p(D \mid \theta)$$

$$\nabla_{\theta} \ln p(D \mid \theta) = 0 \quad \text{or} \quad \sum_{k=1}^{n} \nabla_{\theta} \ln p(\mathbf{x}_k \mid \theta) = 0$$
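A minimal numeric sketch of this recipe, assuming 1-D Gaussian samples with unknown mean and known unit variance (data values are made up); it maximizes the log-likelihood over a grid instead of solving the gradient equation analytically:

```python
import numpy as np
from scipy.stats import norm

data = np.array([1.3, 0.7, 2.1, 1.0, 1.6])   # hypothetical samples
mus = np.linspace(-2.0, 4.0, 601)            # candidate values of theta = mu

# ln p(D|theta) = sum_k ln p(x_k|theta)
log_lik = norm.logpdf(data[:, None], mus, 1.0).sum(axis=0)

theta_hat = mus[np.argmax(log_lik)]          # grid-search argmax
print(theta_hat, data.mean())                # nearly identical values
```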
Example
[Figure: training data drawn from a Gaussian with unknown mean and
known variance, together with plots of the likelihood p(D|θ) and the
log-likelihood ln p(D|θ); both peak at the same value θ̂ (the sample mean).]
ML for Multivariate Gaussian Density:
Case of Unknown θ=μ
• Assume p(x|μ) ~ N(μ, Σ), where Σ is known:

$$\ln p(\mathbf{x}_k \mid \boldsymbol{\mu}) = -\frac{1}{2}(\mathbf{x}_k - \boldsymbol{\mu})^t \Sigma^{-1} (\mathbf{x}_k - \boldsymbol{\mu}) - \frac{d}{2}\ln 2\pi - \frac{1}{2}\ln|\Sigma|$$

• Computing the gradient, we have:

$$\nabla_{\boldsymbol{\mu}} \ln p(D \mid \boldsymbol{\mu}) = \sum_{k=1}^{n} \nabla_{\boldsymbol{\mu}} \ln p(\mathbf{x}_k \mid \boldsymbol{\mu}) = \sum_{k=1}^{n} \Sigma^{-1} (\mathbf{x}_k - \boldsymbol{\mu})$$
ML for Multivariate Gaussian Density:
Case of Unknown θ=μ (cont’d)
• Setting ∇μ ln p(D|μ) = 0, we have:

$$\sum_{k=1}^{n} \Sigma^{-1}(\mathbf{x}_k - \boldsymbol{\mu}) = 0 \quad \text{or} \quad \sum_{k=1}^{n} \mathbf{x}_k - n\boldsymbol{\mu} = 0$$

• The solution μ̂ is given by:

$$\hat{\boldsymbol{\mu}} = \frac{1}{n} \sum_{k=1}^{n} \mathbf{x}_k$$

The ML estimate is simply the “sample mean” (see the sketch below).
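A minimal numpy sketch of this result, with samples drawn from an assumed "true" 2-D Gaussian; the sample mean recovers the true mean as n grows:

```python
import numpy as np

rng = np.random.default_rng(0)
true_mu = np.array([1.0, -2.0])              # assumed "true" mean
Sigma = np.array([[1.0, 0.3],
                  [0.3, 2.0]])               # known covariance

X = rng.multivariate_normal(true_mu, Sigma, size=5000)  # training set D
mu_hat = X.mean(axis=0)                      # ML estimate = sample mean
print(mu_hat)                                # close to [1.0, -2.0]
```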
Special Case: Maximum A-Posteriori
Estimator (MAP)
• Assume that θ is a random variable with known p(θ).
Consider:

$$p(\theta \mid D) = \frac{p(D \mid \theta)\, p(\theta)}{p(D)}$$

• Maximize p(θ|D), or equivalently p(D|θ)p(θ), or ln[p(D|θ)p(θ)]:

$$\left[ \prod_{k=1}^{n} p(\mathbf{x}_k \mid \theta) \right] p(\theta) \quad \text{or} \quad \sum_{k=1}^{n} \ln p(\mathbf{x}_k \mid \theta) + \ln p(\theta)$$
Special Case: Maximum A-Posteriori
Estimator (MAP) (cont’d)
• What happens when p(θ) is uniform? Then ln p(θ) is constant
and does not affect the argmax:

$$\sum_{k=1}^{n} \ln p(\mathbf{x}_k \mid \theta) + \ln p(\theta) \;\Rightarrow\; \sum_{k=1}^{n} \ln p(\mathbf{x}_k \mid \theta)$$

MAP is equivalent to ML (a small numeric sketch follows below).
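A minimal sketch of MAP on a grid, reusing the hypothetical 1-D Gaussian setup from the ML sketch above with a made-up Gaussian prior on the mean; as the prior flattens, the MAP estimate approaches the ML estimate:

```python
import numpy as np
from scipy.stats import norm

data = np.array([1.3, 0.7, 2.1, 1.0, 1.6])   # hypothetical samples
mus = np.linspace(-2.0, 4.0, 601)

log_lik = norm.logpdf(data[:, None], mus, 1.0).sum(axis=0)

for prior_sd in (0.5, 100.0):                # tight prior vs nearly flat prior
    log_post = log_lik + norm.logpdf(mus, 0.0, prior_sd)  # + ln p(theta)
    print(prior_sd, mus[np.argmax(log_post)])
# With prior_sd = 100 the MAP estimate ~ ML estimate (the sample mean).
```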
MAP for Multivariate Gaussian Density:
Case of Unknown θ=μ
• Assume p(x|μ) ~ N(μ, diag(σμ²))
and
p(μ) ~ N(μ0, diag(σμ0²))
(both μ0 and σμ0 are known)

• Maximize ln p(μ|D) = ln[p(D|μ)p(μ)]:

$$\sum_{k=1}^{n} \ln p(\mathbf{x}_k \mid \boldsymbol{\mu}) + \ln p(\boldsymbol{\mu})$$

$$\nabla_{\boldsymbol{\mu}} \left( \sum_{k=1}^{n} \ln p(\mathbf{x}_k \mid \boldsymbol{\mu}) + \ln p(\boldsymbol{\mu}) \right) = 0$$
MAP for Multivariate Gaussian Density:
Case of Unknown θ=μ (cont’d)
• Setting the gradient to zero:

$$\sum_{k=1}^{n} \frac{1}{\sigma_{\mu}^{2}}(\mathbf{x}_k - \boldsymbol{\mu}) - \frac{1}{\sigma_{\mu_0}^{2}}(\boldsymbol{\mu} - \boldsymbol{\mu}_0) = 0$$

the solution μ̂ is given by:

$$\hat{\boldsymbol{\mu}} = \frac{\dfrac{\sigma_{\mu_0}^{2}}{\sigma_{\mu}^{2}} \sum_{k=1}^{n} \mathbf{x}_k + \boldsymbol{\mu}_0}{\dfrac{\sigma_{\mu_0}^{2}}{\sigma_{\mu}^{2}}\, n + 1}$$

• If σμ0²/σμ² ≫ 1, then

$$\hat{\boldsymbol{\mu}} = \frac{1}{n} \sum_{k=1}^{n} \mathbf{x}_k$$

i.e., MAP reduces to the ML estimate.

• What happens when σμ0² → 0?  μ̂ = μ0, i.e., the prior dominates
(see the sketch below).
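A minimal 1-D sketch of this closed form, with made-up numbers, showing how the estimate shrinks toward μ0 as the prior narrows and approaches the sample mean as it widens:

```python
import numpy as np

data = np.array([1.3, 0.7, 2.1, 1.0, 1.6])   # hypothetical samples
mu0, sigma = 0.0, 1.0                        # prior mean, known data std

def map_mean(x, mu0, sigma, sigma0):
    """Closed-form MAP estimate of a Gaussian mean (1-D)."""
    r = sigma0**2 / sigma**2                 # variance ratio
    return (r * x.sum() + mu0) / (r * len(x) + 1)

for sigma0 in (0.01, 1.0, 100.0):            # narrow -> wide prior
    print(sigma0, map_mean(data, mu0, sigma, sigma0))
# Narrow prior: estimate ~ mu0; wide prior: estimate ~ sample mean (1.34).
```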
ML for Univariate Gaussian Density:
Case of Unknown θ=(μ,σ2)
• Assume p(x|θ) ~ N(μ, σ²), with θ = (θ1, θ2) = (μ, σ²):

$$\ln p(x_k \mid \theta) = -\frac{1}{2}\ln 2\pi\sigma^2 - \frac{1}{2\sigma^2}(x_k - \mu)^2$$

or

$$\ln p(x_k \mid \theta) = -\frac{1}{2}\ln 2\pi\theta_2 - \frac{1}{2\theta_2}(x_k - \theta_1)^2$$

• Computing the gradient:

$$\nabla_{\theta} \ln p(x_k \mid \theta) = \begin{bmatrix} \dfrac{\partial}{\partial \theta_1}\ln p(x_k \mid \theta) \\[6pt] \dfrac{\partial}{\partial \theta_2}\ln p(x_k \mid \theta) \end{bmatrix} = \begin{bmatrix} \dfrac{1}{\theta_2}(x_k - \theta_1) \\[6pt] -\dfrac{1}{2\theta_2} + \dfrac{(x_k - \theta_1)^2}{2\theta_2^{2}} \end{bmatrix}$$
ML for Univariate Gaussian Density:
Case of Unknown θ=(μ,σ2) (cont’d)
• Setting ∇θ ln p(D|θ) = 0 gives the two equations:

$$\sum_{k=1}^{n} \frac{1}{\hat{\theta}_2}(x_k - \hat{\theta}_1) = 0 \qquad -\sum_{k=1}^{n} \frac{1}{2\hat{\theta}_2} + \sum_{k=1}^{n} \frac{(x_k - \hat{\theta}_1)^2}{2\hat{\theta}_2^{2}} = 0$$

• The solutions are given by:

$$\hat{\mu} = \frac{1}{n}\sum_{k=1}^{n} x_k \quad \text{(sample mean)} \qquad \hat{\sigma}^2 = \frac{1}{n}\sum_{k=1}^{n}(x_k - \hat{\mu})^2 \quad \text{(sample variance)}$$
ML for Multivariate Gaussian Density:
Case of Unknown θ=(μ,Σ)
• In the general case (i.e., multivariate Gaussian) the
solutions are:
$$\hat{\boldsymbol{\mu}} = \frac{1}{n}\sum_{k=1}^{n} \mathbf{x}_k \quad \text{(sample mean)}$$

$$\hat{\Sigma} = \frac{1}{n}\sum_{k=1}^{n} (\mathbf{x}_k - \hat{\boldsymbol{\mu}})(\mathbf{x}_k - \hat{\boldsymbol{\mu}})^t \quad \text{(sample covariance)}$$

A small numeric sketch follows below.
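A minimal numpy sketch of these estimates on made-up 2-D samples; note the 1/n normalization, which matches the ML formula (equivalently np.cov with bias=True):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.multivariate_normal([0.0, 3.0],
                            [[2.0, 0.5],
                             [0.5, 1.0]], size=1000)   # made-up data

mu_hat = X.mean(axis=0)                          # (1/n) sum of x_k
centered = X - mu_hat
Sigma_hat = centered.T @ centered / len(X)       # (1/n) sum of outer products
print(mu_hat)
print(Sigma_hat)                                 # equals np.cov(X.T, bias=True)
```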
Biased and Unbiased Estimates
• An estimate θ̂ is unbiased when

$$E[\hat{\theta}] = \theta$$

• The ML estimate μ̂ is unbiased, i.e.,

$$E[\hat{\boldsymbol{\mu}}] = \boldsymbol{\mu}$$

• The ML estimates σ̂² and Σ̂ are biased:

$$E[\hat{\sigma}^2] = \frac{n-1}{n}\sigma^2 \qquad E[\hat{\Sigma}] = \frac{n-1}{n}\Sigma$$
Biased and Unbiased Estimates (cont’d)
• The following are unbiased estimates of σ² and Σ:

$$\hat{\sigma}^2 = \frac{1}{n-1}\sum_{k=1}^{n}(x_k - \hat{\mu})^2$$

$$\hat{\Sigma} = \frac{1}{n-1}\sum_{k=1}^{n}(\mathbf{x}_k - \hat{\boldsymbol{\mu}})(\mathbf{x}_k - \hat{\boldsymbol{\mu}})^t$$
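A minimal sketch contrasting the two normalizations via numpy's ddof argument (the sample values are made up):

```python
import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])  # made-up samples

var_ml = x.var(ddof=0)         # ML estimate: divides by n (biased)
var_unbiased = x.var(ddof=1)   # divides by n-1 (unbiased)
print(var_ml, var_unbiased)    # 4.0 vs ~4.571
```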
Comments
• ML estimation is simpler than alternative methods (e.g.,
Bayesian estimation).
• ML provides increasingly accurate estimates as the
number of training samples grows.
• If the model for p(x|θ) is correct, and the independence
assumption among samples holds, ML will work well.