Parameter Estimation
Lecturer: 虞台文
Contents

- Introduction
- Maximum-Likelihood Estimation
- Bayesian Estimation
Introduction

Bayes Rule:

$$P(\omega_i \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid \omega_i)\, P(\omega_i)}{p(\mathbf{x})}, \qquad p(\mathbf{x}) = \sum_{j=1}^{c} p(\mathbf{x} \mid \omega_j)\, P(\omega_j)$$

We want to estimate the parameters of the class-conditional densities when their parametric form is known, e.g., $X \sim N(\boldsymbol{\mu}, \boldsymbol{\Sigma})$.
Methods

- The Method of Moments (not discussed in this course)
- Maximum-Likelihood Estimation: assume the parameters are fixed but unknown
- Bayesian Estimation: assume the parameters are random variables
- Sufficient Statistics (not discussed in this course)
Maximum-Likelihood Estimation

Samples

$\mathcal{D} = \{\mathcal{D}_1, \mathcal{D}_2, \ldots, \mathcal{D}_c\}$

The samples in $\mathcal{D}_j$ are drawn independently according to the probability law $p(\mathbf{x} \mid \omega_j)$.

Assume that $p(\mathbf{x} \mid \omega_j)$ has a known parametric form with parameter vector $\boldsymbol{\theta}_j = (\theta_1, \theta_2, \ldots, \theta_m)^T$, e.g., $p(\mathbf{x} \mid \omega_j) \sim N(\boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)$.

[Figure: sample sets $\mathcal{D}_1$, $\mathcal{D}_2$, $\mathcal{D}_3$ drawn from classes $\omega_1$, $\omega_2$, $\omega_3$.]
Goal

Use $\mathcal{D}_j$ to estimate the unknown parameter vector $\boldsymbol{\theta}_j = (\theta_1, \theta_2, \ldots, \theta_m)^T$; the estimated version will be denoted by $\hat{\boldsymbol{\theta}}_j$, so that $p(\mathbf{x} \mid \omega_j) = p(\mathbf{x} \mid \boldsymbol{\theta}_j) \approx p(\mathbf{x} \mid \hat{\boldsymbol{\theta}}_j)$.

[Figure: each class sample set $\mathcal{D}_j$ yields an estimate $\hat{\boldsymbol{\theta}}_j$.]
Problem Formulation

Because each class is considered individually, the class subscript $j$ used before will be dropped. The problem now becomes:

Given a sample set $\mathcal{D}$ whose elements are drawn independently from a population with a known parametric form, say $p(\mathbf{x} \mid \boldsymbol{\theta})$, choose the $\hat{\boldsymbol{\theta}}$ that makes $\mathcal{D}$ most likely to occur.
Criterion of ML

$\mathcal{D} = \{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n\}$

By the independence assumption, we have

$$p(\mathcal{D} \mid \boldsymbol{\theta}) = p(\mathbf{x}_1 \mid \boldsymbol{\theta})\, p(\mathbf{x}_2 \mid \boldsymbol{\theta}) \cdots p(\mathbf{x}_n \mid \boldsymbol{\theta}) = \prod_{k=1}^{n} p(\mathbf{x}_k \mid \boldsymbol{\theta})$$

Likelihood function:

$$L(\boldsymbol{\theta} \mid \mathcal{D}) = p(\mathcal{D} \mid \boldsymbol{\theta}) = \prod_{k=1}^{n} p(\mathbf{x}_k \mid \boldsymbol{\theta})$$

The MLE is the $\hat{\boldsymbol{\theta}}$ that maximizes $L(\boldsymbol{\theta} \mid \mathcal{D})$.
Criterion of ML

Often, we instead maximize the log-likelihood function

$$l(\boldsymbol{\theta} \mid \mathcal{D}) = \ln L(\boldsymbol{\theta} \mid \mathcal{D}) = \sum_{k=1}^{n} \ln p(\mathbf{x}_k \mid \boldsymbol{\theta})$$

$$\hat{\boldsymbol{\theta}} = \arg\max_{\boldsymbol{\theta}} L(\boldsymbol{\theta} \mid \mathcal{D}) = \arg\max_{\boldsymbol{\theta}} l(\boldsymbol{\theta} \mid \mathcal{D})$$

Example: $X \sim N(\mu, \sigma^2)$. How do we find this maximizer?
Differential Approach if Possible

Find the extreme values using the method of differential calculus. Let $f(\boldsymbol{\theta})$ be a continuous function, where $\boldsymbol{\theta} = (\theta_1, \theta_2, \ldots, \theta_n)^T$.

Gradient operator:

$$\nabla_{\boldsymbol{\theta}} = \left( \frac{\partial}{\partial \theta_1}, \frac{\partial}{\partial \theta_2}, \ldots, \frac{\partial}{\partial \theta_n} \right)^T$$

Find the extreme values by solving

$$\nabla_{\boldsymbol{\theta}} f = \mathbf{0}$$
Preliminary

Let $f(\mathbf{x}) = \mathbf{x}^T A \mathbf{x}$, where $\mathbf{x} = (x_1, x_2, \ldots, x_n)^T$ and $A = [a_{ij}]_{n \times n}$. What is $\nabla_{\mathbf{x}} f$?

$$df = f(\mathbf{x} + d\mathbf{x}) - f(\mathbf{x}) = \frac{\partial f}{\partial x_1}\, dx_1 + \frac{\partial f}{\partial x_2}\, dx_2 + \cdots + \frac{\partial f}{\partial x_n}\, dx_n = \left( \frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \ldots, \frac{\partial f}{\partial x_n} \right) \begin{pmatrix} dx_1 \\ dx_2 \\ \vdots \\ dx_n \end{pmatrix} = (\nabla_{\mathbf{x}} f)^T d\mathbf{x}$$
Preliminary

Let $f(\mathbf{x}) = \mathbf{x}^T A \mathbf{x}$. Then

$$df = f(\mathbf{x} + d\mathbf{x}) - f(\mathbf{x}) = (\mathbf{x} + d\mathbf{x})^T A (\mathbf{x} + d\mathbf{x}) - \mathbf{x}^T A \mathbf{x} = (d\mathbf{x})^T A \mathbf{x} + \mathbf{x}^T A\, d\mathbf{x} + \underbrace{(d\mathbf{x})^T A\, d\mathbf{x}}_{\to\, 0} = (\mathbf{x}^T A^T + \mathbf{x}^T A)\, d\mathbf{x} = (\nabla_{\mathbf{x}} f)^T d\mathbf{x}$$

Hence

$$\nabla_{\mathbf{x}} f = (A + A^T)\, \mathbf{x}, \qquad \nabla_{\mathbf{x}} f = 2A\mathbf{x} \ \text{ if } A \text{ is symmetric.}$$
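As a quick numeric sanity check, the gradient identity above can be verified with central differences. The matrix `A` and point `x` below are arbitrary illustrative values, not from the slides:

```python
# Numeric check of the identity grad_x(x^T A x) = (A + A^T) x.

def quad(A, x):
    """f(x) = x^T A x."""
    n = len(x)
    return sum(x[i] * A[i][j] * x[j] for i in range(n) for j in range(n))

def numeric_grad(A, x, h=1e-6):
    """Central-difference approximation of the gradient of quad at x."""
    g = []
    for i in range(len(x)):
        xp, xm = list(x), list(x)
        xp[i] += h
        xm[i] -= h
        g.append((quad(A, xp) - quad(A, xm)) / (2 * h))
    return g

def analytic_grad(A, x):
    """(A + A^T) x, the closed form derived above."""
    n = len(x)
    return [sum((A[i][j] + A[j][i]) * x[j] for j in range(n)) for i in range(n)]

A = [[1.0, 2.0], [0.5, 3.0]]   # deliberately non-symmetric
x = [0.7, -1.2]
print(numeric_grad(A, x))      # matches analytic_grad(A, x) to ~1e-6
print(analytic_grad(A, x))
```

Using a non-symmetric `A` shows why the general form is $(A + A^T)\mathbf{x}$ rather than $2A\mathbf{x}$.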
The Gaussian Population

$$X \sim N(\boldsymbol{\mu}, \boldsymbol{\Sigma}), \qquad p(\mathbf{x} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \frac{1}{(2\pi)^{d/2} |\boldsymbol{\Sigma}|^{1/2}} \exp\left[ -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}) \right]$$

Two cases:
1. Unknown $\boldsymbol{\mu}$
2. Unknown $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$
The Gaussian Population: Unknown μ

$\mathcal{D} = \{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n\}$

$$L(\boldsymbol{\mu} \mid \mathcal{D}) = p(\mathcal{D} \mid \boldsymbol{\mu}) = \prod_{k=1}^{n} p(\mathbf{x}_k \mid \boldsymbol{\mu}) = \frac{1}{(2\pi)^{nd/2} |\boldsymbol{\Sigma}|^{n/2}} \exp\left[ -\frac{1}{2} \sum_{k=1}^{n} (\mathbf{x}_k - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{x}_k - \boldsymbol{\mu}) \right]$$

$$l(\boldsymbol{\mu} \mid \mathcal{D}) = \ln L(\boldsymbol{\mu} \mid \mathcal{D}) = -\ln\left[ (2\pi)^{nd/2} |\boldsymbol{\Sigma}|^{n/2} \right] - \frac{1}{2} \sum_{k=1}^{n} (\mathbf{x}_k - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{x}_k - \boldsymbol{\mu})$$
The Gaussian Population: Unknown μ

Set

$$\nabla_{\boldsymbol{\mu}}\, l(\boldsymbol{\mu} \mid \mathcal{D}) = \boldsymbol{\Sigma}^{-1} \sum_{k=1}^{n} (\mathbf{x}_k - \boldsymbol{\mu}) = \mathbf{0}$$

$$\Rightarrow \quad \hat{\boldsymbol{\mu}} = \frac{1}{n} \sum_{k=1}^{n} \mathbf{x}_k \qquad \text{(the sample mean)}$$
The Gaussian Population: Unknown μ and σ²

Consider the univariate normal case $X \sim N(\mu, \sigma^2)$:

$$p(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[ -\frac{(x - \mu)^2}{2\sigma^2} \right]$$

Let $\boldsymbol{\theta} = (\theta_1, \theta_2)^T = (\mu, \sigma^2)^T$ and $\mathcal{D} = \{x_1, x_2, \ldots, x_n\}$.

$$L(\boldsymbol{\theta} \mid \mathcal{D}) = p(\mathcal{D} \mid \boldsymbol{\theta}) = \prod_{k=1}^{n} p(x_k \mid \boldsymbol{\theta}) = \frac{1}{(2\pi)^{n/2}\, \theta_2^{n/2}} \exp\left[ -\frac{1}{2\theta_2} \sum_{k=1}^{n} (x_k - \theta_1)^2 \right]$$

$$l(\boldsymbol{\theta} \mid \mathcal{D}) = \ln L(\boldsymbol{\theta} \mid \mathcal{D}) = -\frac{n}{2} \ln(2\pi) - \frac{n}{2} \ln \theta_2 - \frac{1}{2\theta_2} \sum_{k=1}^{n} (x_k - \theta_1)^2$$
The Gaussian Population: Unknown μ and σ²

Set $\nabla_{\boldsymbol{\theta}}\, l(\boldsymbol{\theta} \mid \mathcal{D}) = \mathbf{0}$:

$$\frac{\partial l}{\partial \theta_1} = \frac{1}{\theta_2} \sum_{k=1}^{n} (x_k - \theta_1) = 0, \qquad \frac{\partial l}{\partial \theta_2} = -\frac{n}{2\theta_2} + \frac{1}{2\theta_2^2} \sum_{k=1}^{n} (x_k - \theta_1)^2 = 0$$

Solving:

$$\hat{\mu} = \hat{\theta}_1 = \frac{1}{n} \sum_{k=1}^{n} x_k \qquad \text{(unbiased)}$$

$$\hat{\sigma}^2 = \hat{\theta}_2 = \frac{1}{n} \sum_{k=1}^{n} (x_k - \hat{\mu})^2 \qquad \text{(biased)}$$
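A minimal sketch of these two estimators on synthetic data; the true parameters ($\mu = 5$, $\sigma = 2$), sample size, and seed are illustrative choices, not from the slides:

```python
import math
import random

random.seed(0)
mu_true, sigma_true = 5.0, 2.0
xs = [random.gauss(mu_true, sigma_true) for _ in range(10_000)]

n = len(xs)
mu_hat = sum(xs) / n                              # MLE of the mean (sample mean)
var_hat = sum((x - mu_hat) ** 2 for x in xs) / n  # MLE of the variance (divides by n)

print(mu_hat, math.sqrt(var_hat))   # both close to the true values 5.0 and 2.0
```

With $n = 10{,}000$ samples the $1/n$ vs $1/(n-1)$ distinction is negligible; the bias only matters for small $n$.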
The Gaussian Population: Unknown μ and Σ

For the multivariate normal case $X \sim N(\boldsymbol{\mu}, \boldsymbol{\Sigma})$, the MLEs of $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$ are:

$$\hat{\boldsymbol{\mu}} = \frac{1}{n} \sum_{k=1}^{n} \mathbf{x}_k \qquad \text{(unbiased)}$$

$$\hat{\boldsymbol{\Sigma}} = \frac{1}{n} \sum_{k=1}^{n} (\mathbf{x}_k - \hat{\boldsymbol{\mu}})(\mathbf{x}_k - \hat{\boldsymbol{\mu}})^T \qquad \text{(biased)}$$
Unbiasedness

Unbiased estimator (absolutely unbiased):

$$E[\hat{\boldsymbol{\theta}}] = \boldsymbol{\theta}$$

Consistent estimator (asymptotically unbiased):

$$\lim_{n \to \infty} E[\hat{\boldsymbol{\theta}}] = \boldsymbol{\theta}$$
MLE for Normal Population

$X \sim N(\boldsymbol{\mu}, \boldsymbol{\Sigma})$

Sample mean:

$$\hat{\boldsymbol{\mu}} = \frac{1}{n} \sum_{k=1}^{n} \mathbf{x}_k, \qquad E[\hat{\boldsymbol{\mu}}] = \boldsymbol{\mu}$$

MLE of the covariance:

$$\hat{\boldsymbol{\Sigma}} = \frac{1}{n} \sum_{k=1}^{n} (\mathbf{x}_k - \hat{\boldsymbol{\mu}})(\mathbf{x}_k - \hat{\boldsymbol{\mu}})^T, \qquad E[\hat{\boldsymbol{\Sigma}}] = \frac{n-1}{n} \boldsymbol{\Sigma} \neq \boldsymbol{\Sigma}$$

Sample covariance matrix:

$$\mathbf{C} = \frac{1}{n-1} \sum_{k=1}^{n} (\mathbf{x}_k - \hat{\boldsymbol{\mu}})(\mathbf{x}_k - \hat{\boldsymbol{\mu}})^T, \qquad E[\mathbf{C}] = \boldsymbol{\Sigma}$$
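The bias of $\hat{\boldsymbol{\Sigma}}$ can be seen empirically by averaging both estimators over many small samples. The univariate sketch below uses illustrative numbers ($n = 5$, $\sigma^2 = 4$); the MLE average should land near $\frac{n-1}{n}\sigma^2 = 3.2$ while the sample variance lands near $\sigma^2 = 4$:

```python
import random

random.seed(1)
n, trials = 5, 100_000
# True distribution: N(0, 4), i.e. sigma = 2.

avg_mle, avg_c = 0.0, 0.0
for _ in range(trials):
    xs = [random.gauss(0.0, 2.0) for _ in range(n)]
    m = sum(xs) / n
    ss = sum((x - m) ** 2 for x in xs)
    avg_mle += ss / n        # MLE estimator (divides by n)
    avg_c += ss / (n - 1)    # sample variance C (divides by n - 1)

avg_mle /= trials
avg_c /= trials
print(avg_mle)   # close to (n-1)/n * sigma^2 = 3.2
print(avg_c)     # close to sigma^2 = 4.0
```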
Bayesian Estimation

Comparison

- MLE (Maximum-Likelihood Estimation): finds the fixed but unknown parameters of a population.
- Bayesian Estimation: considers the parameters of a population to be random variables.
Heart of Bayesian Classification

Ultimate goal: evaluate the posterior probabilities

$$P(\omega_i \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid \omega_i)\, P(\omega_i)}{p(\mathbf{x})}$$

What can we do if the prior probabilities and class-conditional densities are unknown?
Helpful Knowledge

- Functional form for the unknown densities, e.g., normal, exponential, …
- Ranges for the values of the unknown parameters, e.g., uniformly distributed over a range
- Training samples, sampled according to the states of nature: $\mathcal{D} = \{\mathcal{D}_1, \mathcal{D}_2, \ldots, \mathcal{D}_c\}$
Posterior Probabilities from Samples

$$P(\omega_i \mid \mathbf{x}, \mathcal{D}) = \frac{P(\mathbf{x} \mid \omega_i, \mathcal{D})\, P(\omega_i \mid \mathcal{D})}{\sum_{j=1}^{c} P(\mathbf{x} \mid \omega_j, \mathcal{D})\, P(\omega_j \mid \mathcal{D})}$$

Assuming $P(\mathbf{x} \mid \omega_i, \mathcal{D}) = P(\mathbf{x} \mid \omega_i, \mathcal{D}_i)$ and $P(\omega_i \mid \mathcal{D}) = P(\omega_i)$, we obtain

$$P(\omega_i \mid \mathbf{x}, \mathcal{D}) = \frac{P(\mathbf{x} \mid \omega_i, \mathcal{D}_i)\, P(\omega_i)}{\sum_{j=1}^{c} P(\mathbf{x} \mid \omega_j, \mathcal{D}_j)\, P(\omega_j)}$$
Posterior Probabilities from Samples

Each class can therefore be considered independently:

$$P(\omega_i \mid \mathbf{x}, \mathcal{D}) = \frac{P(\mathbf{x} \mid \omega_i, \mathcal{D}_i)\, P(\omega_i)}{\sum_{j=1}^{c} P(\mathbf{x} \mid \omega_j, \mathcal{D}_j)\, P(\omega_j)}$$
Problem Formulation

Let $\mathcal{D}$ be a set of samples drawn independently according to the fixed but unknown distribution $p(\mathbf{x})$. We want to determine

$$p(\mathbf{x} \mid \mathcal{D}) = \, ?$$

This is the central problem of Bayesian learning.
Parameter Distribution

Assume $p(\mathbf{x})$ is unknown, but we know it has a fixed form with parameter vector $\boldsymbol{\theta}$; that is, $p(\mathbf{x} \mid \boldsymbol{\theta})$ is completely known. Assume $\boldsymbol{\theta}$ is a random vector whose prior $p(\boldsymbol{\theta})$ is known.

$$p(\mathbf{x} \mid \mathcal{D}) = \int p(\mathbf{x}, \boldsymbol{\theta} \mid \mathcal{D})\, d\boldsymbol{\theta}$$
Class-Conditional Density Estimation

$$p(\mathbf{x}, \boldsymbol{\theta} \mid \mathcal{D}) = p(\mathbf{x} \mid \boldsymbol{\theta}, \mathcal{D})\, p(\boldsymbol{\theta} \mid \mathcal{D}) = p(\mathbf{x} \mid \boldsymbol{\theta})\, p(\boldsymbol{\theta} \mid \mathcal{D})$$

$$p(\mathbf{x} \mid \mathcal{D}) = \int p(\mathbf{x} \mid \boldsymbol{\theta})\, p(\boldsymbol{\theta} \mid \mathcal{D})\, d\boldsymbol{\theta}$$

Here the form of $p(\mathbf{x} \mid \boldsymbol{\theta})$ is assumed known, while $p(\boldsymbol{\theta} \mid \mathcal{D})$ is the posterior density we want to estimate.
Class-Conditional Density Estimation

If $p(\boldsymbol{\theta} \mid \mathcal{D})$ has a sharp peak at $\hat{\boldsymbol{\theta}}$, then

$$p(\mathbf{x} \mid \mathcal{D}) = \int p(\mathbf{x} \mid \boldsymbol{\theta})\, p(\boldsymbol{\theta} \mid \mathcal{D})\, d\boldsymbol{\theta} \approx p(\mathbf{x} \mid \hat{\boldsymbol{\theta}})$$

How do we find $p(\boldsymbol{\theta} \mid \mathcal{D})$?
Class-Conditional Density Estimation

$$p(\boldsymbol{\theta} \mid \mathcal{D}) = \frac{p(\mathcal{D} \mid \boldsymbol{\theta})\, p(\boldsymbol{\theta})}{\int p(\mathcal{D} \mid \boldsymbol{\theta})\, p(\boldsymbol{\theta})\, d\boldsymbol{\theta}} \propto \prod_{k=1}^{n} p(\mathbf{x}_k \mid \boldsymbol{\theta})\, p(\boldsymbol{\theta})$$
The Univariate Gaussian: Unknown μ

$X \sim N(\mu, \sigma^2)$, with the distribution form known:

$$p(x \mid \mu) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[ -\frac{1}{2} \left( \frac{x - \mu}{\sigma} \right)^2 \right]$$

Assume $\mu$ is normally distributed, $\mu \sim N(\mu_0, \sigma_0^2)$:

$$p(\mu) = \frac{1}{\sqrt{2\pi}\,\sigma_0} \exp\left[ -\frac{1}{2} \left( \frac{\mu - \mu_0}{\sigma_0} \right)^2 \right]$$
The Univariate Gaussian: Unknown μ

$$p(\mu \mid \mathcal{D}) \propto \prod_{k=1}^{n} p(x_k \mid \mu)\, p(\mu) = \prod_{k=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[ -\frac{1}{2} \left( \frac{x_k - \mu}{\sigma} \right)^2 \right] \cdot \frac{1}{\sqrt{2\pi}\,\sigma_0} \exp\left[ -\frac{1}{2} \left( \frac{\mu - \mu_0}{\sigma_0} \right)^2 \right]$$

$$\propto \exp\left\{ -\frac{1}{2} \left[ \sum_{k=1}^{n} \left( \frac{x_k - \mu}{\sigma} \right)^2 + \left( \frac{\mu - \mu_0}{\sigma_0} \right)^2 \right] \right\}$$

$$\propto \exp\left\{ -\frac{1}{2} \left[ \left( \frac{n}{\sigma^2} + \frac{1}{\sigma_0^2} \right) \mu^2 - 2 \left( \frac{1}{\sigma^2} \sum_{k=1}^{n} x_k + \frac{\mu_0}{\sigma_0^2} \right) \mu \right] \right\}$$
The Univariate Gaussian: Unknown μ

Since the exponent is quadratic in $\mu$, the posterior is again normal, $p(\mu \mid \mathcal{D}) \sim N(\mu_n, \sigma_n^2)$:

$$p(\mu \mid \mathcal{D}) = \frac{1}{\sqrt{2\pi}\,\sigma_n} \exp\left[ -\frac{1}{2} \left( \frac{\mu - \mu_n}{\sigma_n} \right)^2 \right] \propto \exp\left[ -\frac{1}{2}\, \frac{\mu^2 - 2\mu_n \mu}{\sigma_n^2} \right]$$

Comparing coefficients with the expression above identifies $\mu_n$ and $\sigma_n^2$.
The Univariate Gaussian: Unknown μ

$p(\mu \mid \mathcal{D}) \sim N(\mu_n, \sigma_n^2)$ with

$$\mu_n = \frac{n \sigma_0^2}{n \sigma_0^2 + \sigma^2}\, \hat{\mu}_n + \frac{\sigma^2}{n \sigma_0^2 + \sigma^2}\, \mu_0, \qquad \sigma_n^2 = \frac{\sigma_0^2 \sigma^2}{n \sigma_0^2 + \sigma^2}$$

where $\hat{\mu}_n = \frac{1}{n} \sum_{k=1}^{n} x_k$. As the sample size grows,

$$\lim_{n \to \infty} \mu_n = \hat{\mu}_n, \qquad \lim_{n \to \infty} \sigma_n^2 = 0$$
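The closed-form update for $\mu_n$ and $\sigma_n^2$ is easy to sketch in code. The data, the prior ($\mu_0 = 0$, $\sigma_0^2 = 10$), and the known $\sigma^2$ below are illustrative values, not from the slides:

```python
import random

def posterior(xs, sigma2, mu0, sigma0_2):
    """Posterior N(mu_n, sigma_n^2) for the mean of a Gaussian with known
    variance sigma2, under the prior mu ~ N(mu0, sigma0_2)."""
    n = len(xs)
    mu_hat = sum(xs) / n
    denom = n * sigma0_2 + sigma2
    mu_n = (n * sigma0_2 * mu_hat + sigma2 * mu0) / denom
    sigma_n2 = (sigma0_2 * sigma2) / denom
    return mu_n, sigma_n2

random.seed(2)
xs = [random.gauss(3.0, 1.0) for _ in range(100)]   # true mean 3, known sigma = 1
mu_n, s_n2 = posterior(xs, sigma2=1.0, mu0=0.0, sigma0_2=10.0)
print(mu_n, s_n2)   # mu_n pulled toward the sample mean; sigma_n^2 shrinks like sigma^2/n
```

With $n = 100$ and a broad prior, $\mu_n$ is dominated by the sample mean, matching the limits stated above.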
The Univariate Gaussian: p(x|D)

$$p(x \mid \mathcal{D}) = \int p(x \mid \mu)\, p(\mu \mid \mathcal{D})\, d\mu$$

with $p(x \mid \mu) \sim N(\mu, \sigma^2)$ and $p(\mu \mid \mathcal{D}) \sim N(\mu_n, \sigma_n^2)$, where $\mu_n$ and $\sigma_n^2$ are as given above.
The Univariate Gaussian: p(x|D)

$$p(x \mid \mathcal{D}) = \int \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[ -\frac{1}{2} \left( \frac{x - \mu}{\sigma} \right)^2 \right] \frac{1}{\sqrt{2\pi}\,\sigma_n} \exp\left[ -\frac{1}{2} \left( \frac{\mu - \mu_n}{\sigma_n} \right)^2 \right] d\mu$$

$$= \frac{1}{2\pi \sigma \sigma_n} \exp\left[ -\frac{1}{2}\, \frac{(x - \mu_n)^2}{\sigma^2 + \sigma_n^2} \right] \int \exp\left[ -\frac{1}{2}\, \frac{\sigma^2 + \sigma_n^2}{\sigma^2 \sigma_n^2} \left( \mu - \frac{\sigma_n^2 x + \sigma^2 \mu_n}{\sigma^2 + \sigma_n^2} \right)^2 \right] d\mu$$

The remaining integral is a Gaussian integral, equal to $\sqrt{2\pi}\, \sigma \sigma_n / \sqrt{\sigma^2 + \sigma_n^2}$, so

$$p(x \mid \mathcal{D}) \sim N(\mu_n,\ \sigma^2 + \sigma_n^2)$$
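As a sanity check that the integral really collapses to $N(\mu_n, \sigma^2 + \sigma_n^2)$, one can integrate $p(x \mid \mu)\, p(\mu \mid \mathcal{D})$ numerically and compare against the closed form. All numeric values below are illustrative:

```python
import math

def normal_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# Illustrative posterior parameters and query point (not from the slides):
mu_n, sigma_n2, sigma2 = 1.5, 0.3, 1.0
x = 2.2

# Riemann sum of  integral p(x|mu) p(mu|D) dmu  over a wide mu-range;
# the integrand decays to ~0 at both endpoints, so the sum converges fast.
lo, hi, steps = mu_n - 10, mu_n + 10, 20_000
h = (hi - lo) / steps
integral = sum(
    normal_pdf(x, lo + i * h, sigma2) * normal_pdf(lo + i * h, mu_n, sigma_n2)
    for i in range(steps + 1)
) * h

closed_form = normal_pdf(x, mu_n, sigma2 + sigma_n2)
print(integral, closed_form)   # the two values agree
```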
The Multivariate Gaussian: Unknown μ

$X \sim N(\boldsymbol{\mu}, \boldsymbol{\Sigma})$, with the distribution form known:

$$p(\mathbf{x} \mid \boldsymbol{\mu}) = \frac{1}{(2\pi)^{d/2} |\boldsymbol{\Sigma}|^{1/2}} \exp\left[ -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}) \right]$$

Assume $\boldsymbol{\mu}$ is normally distributed, $\boldsymbol{\mu} \sim N(\boldsymbol{\mu}_0, \boldsymbol{\Sigma}_0)$:

$$p(\boldsymbol{\mu}) = \frac{1}{(2\pi)^{d/2} |\boldsymbol{\Sigma}_0|^{1/2}} \exp\left[ -\frac{1}{2} (\boldsymbol{\mu} - \boldsymbol{\mu}_0)^T \boldsymbol{\Sigma}_0^{-1} (\boldsymbol{\mu} - \boldsymbol{\mu}_0) \right]$$

Then

$$p(\boldsymbol{\theta} \mid \mathcal{D}) \propto \prod_{k=1}^{n} p(\mathbf{x}_k \mid \boldsymbol{\theta})\, p(\boldsymbol{\theta}) \propto \exp\left\{ -\frac{1}{2} \left[ \boldsymbol{\mu}^T \left( n \boldsymbol{\Sigma}^{-1} + \boldsymbol{\Sigma}_0^{-1} \right) \boldsymbol{\mu} - 2 \boldsymbol{\mu}^T \left( \boldsymbol{\Sigma}^{-1} \sum_{k=1}^{n} \mathbf{x}_k + \boldsymbol{\Sigma}_0^{-1} \boldsymbol{\mu}_0 \right) \right] \right\}$$

$$\propto \exp\left[ -\frac{1}{2} (\boldsymbol{\mu} - \boldsymbol{\mu}_n)^T \boldsymbol{\Sigma}_n^{-1} (\boldsymbol{\mu} - \boldsymbol{\mu}_n) \right]$$
The Multivariate Gaussian: Unknown μ

$$p(\boldsymbol{\theta} \mid \mathcal{D}) \propto \exp\left[ -\frac{1}{2} (\boldsymbol{\mu} - \boldsymbol{\mu}_n)^T \boldsymbol{\Sigma}_n^{-1} (\boldsymbol{\mu} - \boldsymbol{\mu}_n) \right] \sim N(\boldsymbol{\mu}_n, \boldsymbol{\Sigma}_n)$$

where, with $\hat{\boldsymbol{\mu}}_n = \frac{1}{n} \sum_{k=1}^{n} \mathbf{x}_k$,

$$\boldsymbol{\mu}_n = \boldsymbol{\Sigma}_0 \left( \boldsymbol{\Sigma}_0 + \tfrac{1}{n} \boldsymbol{\Sigma} \right)^{-1} \hat{\boldsymbol{\mu}}_n + \tfrac{1}{n} \boldsymbol{\Sigma} \left( \boldsymbol{\Sigma}_0 + \tfrac{1}{n} \boldsymbol{\Sigma} \right)^{-1} \boldsymbol{\mu}_0$$

$$\boldsymbol{\Sigma}_n = \boldsymbol{\Sigma}_0 \left( \boldsymbol{\Sigma}_0 + \tfrac{1}{n} \boldsymbol{\Sigma} \right)^{-1} \tfrac{1}{n} \boldsymbol{\Sigma}$$

Finally,

$$p(\mathbf{x} \mid \mathcal{D}) \sim N(\boldsymbol{\mu}_n,\ \boldsymbol{\Sigma} + \boldsymbol{\Sigma}_n)$$
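A minimal sketch of the multivariate update in $d = 2$, with hand-rolled 2×2 linear algebra; the values of $\boldsymbol{\Sigma}$, $\boldsymbol{\Sigma}_0$, $\boldsymbol{\mu}_0$, and the sample mean are illustrative assumptions:

```python
# Computes mu_n = Sigma0 (Sigma0 + Sigma/n)^-1 mu_hat + (Sigma/n)(Sigma0 + Sigma/n)^-1 mu0
# and Sigma_n = Sigma0 (Sigma0 + Sigma/n)^-1 (Sigma/n), for d = 2.

def mat_mul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)] for i in range(2)]

def mat_vec(A, v):
    return [sum(A[i][k] * v[k] for k in range(2)) for i in range(2)]

def inv2(A):
    (a, b), (c, d) = A
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

n = 50
Sigma  = [[1.0, 0.3], [0.3, 2.0]]   # known population covariance (illustrative)
Sigma0 = [[5.0, 0.0], [0.0, 5.0]]   # prior covariance of mu (broad)
mu0    = [0.0, 0.0]                 # prior mean
mu_hat = [2.1, -0.7]                # sample mean of the n draws (illustrative)

Sn_over = [[Sigma[i][j] / n for j in range(2)] for i in range(2)]   # Sigma/n
K = inv2([[Sigma0[i][j] + Sn_over[i][j] for j in range(2)] for i in range(2)])

mu_n = [a + b for a, b in zip(mat_vec(mat_mul(Sigma0, K), mu_hat),
                              mat_vec(mat_mul(Sn_over, K), mu0))]
Sigma_n = mat_mul(mat_mul(Sigma0, K), Sn_over)   # posterior covariance of mu

print(mu_n)   # close to mu_hat, because the broad prior contributes little at n = 50
```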
General Theory

1. $p(\mathbf{x} \mid \boldsymbol{\theta})$: the form of the class-conditional density is known.
2. $p(\boldsymbol{\theta})$: knowledge about the parameter distribution is available.
3. $\mathcal{D} = \{\mathbf{x}_1, \ldots, \mathbf{x}_n\}$: samples are randomly drawn according to the unknown probability density $p(\mathbf{x})$.

$$p(\boldsymbol{\theta} \mid \mathcal{D}) = \frac{p(\mathcal{D} \mid \boldsymbol{\theta})\, p(\boldsymbol{\theta})}{\int p(\mathcal{D} \mid \boldsymbol{\theta})\, p(\boldsymbol{\theta})\, d\boldsymbol{\theta}}$$
General Theory

Under the same assumptions, with $\mathcal{D} = \{\mathbf{x}_1, \ldots, \mathbf{x}_n\}$:

$$p(\mathbf{x} \mid \mathcal{D}) = \int p(\mathbf{x} \mid \boldsymbol{\theta})\, p(\boldsymbol{\theta} \mid \mathcal{D})\, d\boldsymbol{\theta}, \qquad p(\boldsymbol{\theta} \mid \mathcal{D}) \propto p(\mathcal{D} \mid \boldsymbol{\theta})\, p(\boldsymbol{\theta}), \qquad p(\mathcal{D} \mid \boldsymbol{\theta}) = \prod_{k=1}^{n} p(\mathbf{x}_k \mid \boldsymbol{\theta})$$
Incremental Learning

Writing $\mathcal{D}^n = \{\mathbf{x}_1, \ldots, \mathbf{x}_n\}$, the likelihood factors as

$$p(\mathcal{D}^n \mid \boldsymbol{\theta}) = p(\mathbf{x}_n \mid \boldsymbol{\theta})\, p(\mathcal{D}^{n-1} \mid \boldsymbol{\theta})$$

so the posterior can be updated recursively:

$$p(\boldsymbol{\theta} \mid \mathcal{D}^n) \propto p(\mathbf{x}_n \mid \boldsymbol{\theta})\, p(\boldsymbol{\theta} \mid \mathcal{D}^{n-1})$$
Example

1. $p(x \mid \theta) = 1/\theta$ for $0 \le x \le \theta$, and $0$ otherwise.
2. $p(\theta) = 1/10$ for $0 \le \theta \le 10$, and $0$ otherwise.
3. $\mathcal{D} = \{4, 7, 2, 8\}$

$$p(\theta \mid \mathcal{D}^0) = 1/10, \quad 0 \le \theta \le 10$$
$$p(\theta \mid \mathcal{D}^1) \propto p(x_1 \mid \theta)\, p(\theta \mid \mathcal{D}^0) \propto 1/\theta, \quad 4 \le \theta \le 10$$
$$p(\theta \mid \mathcal{D}^2) \propto p(x_2 \mid \theta)\, p(\theta \mid \mathcal{D}^1) \propto 1/\theta^2, \quad 7 \le \theta \le 10$$
$$p(\theta \mid \mathcal{D}^3) \propto p(x_3 \mid \theta)\, p(\theta \mid \mathcal{D}^2) \propto 1/\theta^3, \quad 7 \le \theta \le 10$$
$$p(\theta \mid \mathcal{D}^4) \propto p(x_4 \mid \theta)\, p(\theta \mid \mathcal{D}^3) \propto 1/\theta^4, \quad 8 \le \theta \le 10$$

[Figure: the successively sharper posteriors $p(\theta \mid \mathcal{D}^n)$ plotted over $0 \le \theta \le 10$.]
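The recursive run above can be reproduced on a discretized $\theta$-grid. After all four updates, the (unnormalized) posterior is $\propto 1/\theta^4$ on $[8, 10]$ and peaks at $\max(\mathcal{D}) = 8$:

```python
# Recursive Bayesian update for p(x|theta) = 1/theta on [0, theta],
# prior theta ~ U(0, 10), using the slides' data D = {4, 7, 2, 8}.

D = [4, 7, 2, 8]
thetas = [i / 100 for i in range(1, 1001)]            # grid over (0, 10]
post = [1 / 10 for _ in thetas]                       # p(theta | D^0) = prior

for x in D:
    # p(theta | D^k) is proportional to p(x_k | theta) * p(theta | D^{k-1});
    # the likelihood is 1/theta when theta >= x_k and 0 otherwise.
    post = [p * (1 / t if t >= x else 0) for p, t in zip(post, thetas)]

peak = max(range(len(post)), key=lambda i: post[i])
print(thetas[peak])   # 8.0 -- the posterior peaks at max(D)
```

The posterior is kept unnormalized here, which matches the proportionality in the slides; normalizing would only rescale the curve.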