Parameter Estimation

Lecturer: 虞台文
Contents
- Introduction
- Maximum-Likelihood Estimation
- Bayesian Estimation
Parameter Estimation
Introduction
Bayesian Rule

P(\omega_i \mid x) = \frac{p(x \mid \omega_i)\, P(\omega_i)}{p(x)},
\qquad
p(x) = \sum_{j=1}^{c} p(x \mid \omega_j)\, P(\omega_j)
We want to estimate the parameters of the class-conditional densities when their parametric form is known, e.g.,

X ~ N(\mu, \Sigma)
Methods
- The Method of Moments
  - Not discussed in this course
- Maximum-Likelihood Estimation
  - Assumes the parameters are fixed but unknown
- Bayesian Estimation
  - Assumes the parameters are random variables
- Sufficient Statistics
  - Not discussed in this course
Parameter Estimation
Maximum-Likelihood Estimation
Samples

D = \{D_1, D_2, \ldots, D_c\}

The samples in D_j are drawn independently according to the probability law p(x | \omega_j).

Assume that p(x | \omega_j) has a known parametric form with parameter vector \theta_j = (\theta_1, \theta_2, \ldots, \theta_m)^T,
e.g., p(x | \omega_j) ~ N(\mu_j, \Sigma_j).

[Figure: sample sets D_1, D_2, D_3 drawn from classes \omega_1, \omega_2, \omega_3.]
Goal

Use D_j to estimate the unknown parameter vector \theta_j = (\theta_1, \theta_2, \ldots, \theta_m)^T,
so that p(x | \omega_j) can be written as p(x | \theta_j).

The estimated version will be denoted by \hat{\theta}_j.

D = \{D_1, D_2, \ldots, D_c\}

[Figure: each sample set D_j yields its own estimate \hat{\theta}_j.]
Problem Formulation

Because each class is considered individually, the subscript used before will be dropped.

Now the problem is:
Given a sample set D, whose elements are drawn independently from a population possessing a known parametric form, say p(x | \theta), choose the \hat{\theta} that makes D most likely to have occurred.
Criterion of ML

D = \{x_1, x_2, \ldots, x_n\}

By the independence assumption, we have

p(D \mid \theta) = p(x_1 \mid \theta)\, p(x_2 \mid \theta) \cdots p(x_n \mid \theta) = \prod_{k=1}^{n} p(x_k \mid \theta)

Likelihood function:

L(\theta \mid D) = p(D \mid \theta) = \prod_{k=1}^{n} p(x_k \mid \theta)

The MLE is the \hat{\theta} that maximizes L(\theta \mid D).
Criterion of ML

Often, we resort to maximizing the log-likelihood function

l(\theta \mid D) = \ln L(\theta \mid D) = \sum_{k=1}^{n} \ln p(x_k \mid \theta)

Both criteria give the same maximizer:

\hat{\theta} = \arg\max_{\theta} L(\theta \mid D) = \arg\max_{\theta} l(\theta \mid D)

How?
Criterion of ML

Example: X ~ N(\mu, \sigma^2)

\hat{\theta} = \arg\max_{\theta} L(\theta \mid D) = \arg\max_{\theta} l(\theta \mid D)

How?
Differential Approach if Possible

Find the extreme values using the methods of differential calculus.

Let f(\theta) be a continuous function, where \theta = (\theta_1, \theta_2, \ldots, \theta_n)^T.

Gradient operator:

\nabla_{\theta} = \left( \frac{\partial}{\partial \theta_1}, \frac{\partial}{\partial \theta_2}, \ldots, \frac{\partial}{\partial \theta_n} \right)^T

Find the extreme values by solving

\nabla_{\theta} f = 0
Preliminary

df = (\nabla_x f)^T dx

Let f(x) = x^T A x, where x = (x_1, x_2, \ldots, x_n)^T and A = [a_{ij}]_{n \times n}.

\nabla_x f = ?

df = f(x + dx) - f(x)
= \frac{\partial f}{\partial x_1}\, dx_1 + \frac{\partial f}{\partial x_2}\, dx_2 + \cdots + \frac{\partial f}{\partial x_n}\, dx_n
= \left( \frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \ldots, \frac{\partial f}{\partial x_n} \right)
\begin{pmatrix} dx_1 \\ dx_2 \\ \vdots \\ dx_n \end{pmatrix}
= (\nabla_x f)^T dx
Preliminary

Let f(x) = x^T A x. Then

df = f(x + dx) - f(x)
= (x + dx)^T A (x + dx) - x^T A x
= (dx)^T A x + x^T A\, dx + (dx)^T A\, dx
\approx (x^T A^T + x^T A)\, dx    (the second-order term (dx)^T A\, dx \to 0)

Since df = (\nabla_x f)^T dx, we identify (\nabla_x f)^T = x^T A^T + x^T A, i.e.,

\nabla_x f = (A + A^T)\, x, \qquad \nabla_x f = 2 A x \text{ if } A \text{ is symmetric.}
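A quick numerical sanity check of this identity (a sketch, not part of the lecture): compare a finite-difference gradient of f(x) = x^T A x with (A + A^T) x for a random non-symmetric A.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))   # generic, non-symmetric matrix
x = rng.standard_normal(4)

f = lambda v: v @ A @ v           # f(x) = x^T A x

# Central finite-difference approximation of the gradient.
eps = 1e-6
num_grad = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps)
                     for e in np.eye(4)])

analytic = (A + A.T) @ x          # the identity derived above
print(np.allclose(num_grad, analytic, atol=1e-5))   # True
```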
The Gaussian Population

X ~ N(\mu, \Sigma)

p(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\!\left( -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right)

Two cases:
1. Unknown \mu
2. Unknown \mu and \Sigma
The Gaussian Population: Unknown μ

D = \{x_1, x_2, \ldots, x_n\}

L(\mu \mid D) = p(D \mid \mu) = \prod_{k=1}^{n} p(x_k \mid \mu)
= \frac{1}{(2\pi)^{nd/2} |\Sigma|^{n/2}} \exp\!\left( -\frac{1}{2} \sum_{k=1}^{n} (x_k - \mu)^T \Sigma^{-1} (x_k - \mu) \right)

l(\mu \mid D) = \ln L(\mu \mid D)
= -\ln\!\left( (2\pi)^{nd/2} |\Sigma|^{n/2} \right) - \frac{1}{2} \sum_{k=1}^{n} (x_k - \mu)^T \Sigma^{-1} (x_k - \mu)
The Gaussian Population: Unknown μ

Set

\nabla_{\mu}\, l(\mu \mid D) = \Sigma^{-1} \sum_{k=1}^{n} (x_k - \mu) = 0

Solving for μ gives the sample mean:

\hat{\mu} = \frac{1}{n} \sum_{k=1}^{n} x_k
The Gaussian Population: Unknown μ and σ

Consider the univariate normal case X ~ N(\mu, \sigma^2):

p(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left( -\frac{(x - \mu)^2}{2\sigma^2} \right)

\theta = (\theta_1, \theta_2)^T = (\mu, \sigma^2)^T, \qquad D = \{x_1, x_2, \ldots, x_n\}

L(\theta \mid D) = p(D \mid \theta) = \prod_{k=1}^{n} p(x_k \mid \theta)
= \frac{1}{(2\pi)^{n/2}\, \theta_2^{\,n/2}} \exp\!\left( -\frac{1}{2\theta_2} \sum_{k=1}^{n} (x_k - \theta_1)^2 \right)

l(\theta \mid D) = \ln L(\theta \mid D)
= -\frac{n}{2} \ln(2\pi) - \frac{n}{2} \ln \theta_2 - \frac{1}{2\theta_2} \sum_{k=1}^{n} (x_k - \theta_1)^2
The Gaussian Population: Unknown μ and σ

For the univariate normal case X ~ N(\mu, \sigma^2), set

\nabla_{\theta}\, l(\theta \mid D) =
\begin{pmatrix}
\dfrac{1}{\theta_2} \displaystyle\sum_{k=1}^{n} (x_k - \theta_1) \\[2mm]
-\dfrac{n}{2\theta_2} + \dfrac{1}{2\theta_2^{\,2}} \displaystyle\sum_{k=1}^{n} (x_k - \theta_1)^2
\end{pmatrix} = 0

Solving gives

\hat{\mu} = \hat{\theta}_1 = \frac{1}{n} \sum_{k=1}^{n} x_k \qquad \text{(unbiased)}

\hat{\sigma}^2 = \hat{\theta}_2 = \frac{1}{n} \sum_{k=1}^{n} (x_k - \hat{\mu})^2 \qquad \text{(biased)}
The Gaussian Population: Unknown μ and Σ

For the multivariate normal case X ~ N(\mu, \Sigma), the MLEs of \mu and \Sigma are:

\hat{\mu} = \frac{1}{n} \sum_{k=1}^{n} x_k \qquad \text{(unbiased)}

\hat{\Sigma} = \frac{1}{n} \sum_{k=1}^{n} (x_k - \hat{\mu})(x_k - \hat{\mu})^T \qquad \text{(biased)}
Unbiasedness

Unbiased estimator (absolutely unbiased):

E[\hat{\theta}] = \theta

Consistent estimator (asymptotically unbiased):

\lim_{n \to \infty} E[\hat{\theta}] = \theta
MLE for Normal Population

X ~ N(\mu, \Sigma)

Sample mean:

\hat{\mu} = \frac{1}{n} \sum_{k=1}^{n} x_k, \qquad E[\hat{\mu}] = \mu

MLE of the covariance:

\hat{\Sigma} = \frac{1}{n} \sum_{k=1}^{n} (x_k - \hat{\mu})(x_k - \hat{\mu})^T, \qquad
E[\hat{\Sigma}] = \frac{n-1}{n}\, \Sigma \neq \Sigma

Sample covariance matrix:

C = \frac{1}{n-1} \sum_{k=1}^{n} (x_k - \hat{\mu})(x_k - \hat{\mu})^T, \qquad E[C] = \Sigma
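The bias statement can be checked empirically. The sketch below (synthetic data with assumed parameters, not from the lecture) averages the MLE Σ̂ and the sample covariance C over many repeated samples of size n; the first average comes out near ((n-1)/n) Σ, the second near Σ.

```python
import numpy as np

rng = np.random.default_rng(1)
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])        # true covariance, assumed for the experiment
mu = np.zeros(2)
n, trials = 5, 20000

mle_avg = np.zeros((2, 2))
c_avg = np.zeros((2, 2))
for _ in range(trials):
    X = rng.multivariate_normal(mu, Sigma, size=n)
    d = X - X.mean(axis=0)
    mle_avg += d.T @ d / n            # biased MLE of Sigma
    c_avg += d.T @ d / (n - 1)        # sample covariance matrix C
mle_avg /= trials
c_avg /= trials

print(mle_avg)                        # approximately ((n-1)/n) * Sigma
print(c_avg)                          # approximately Sigma
```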
Parameter Estimation
Bayesian Estimation
Comparison

- MLE (Maximum-Likelihood Estimation)
  - Finds the fixed but unknown parameters of a population.
- Bayesian Estimation
  - Considers the parameters of a population to be random variables.
Heart of Bayesian Classification

Ultimate goal: evaluate the posterior probabilities

P(\omega_i \mid x) = \frac{p(x \mid \omega_i)\, P(\omega_i)}{p(x)}

What can we do if the prior probabilities and class-conditional densities are unknown?
Helpful Knowledge

- Functional form for the unknown densities
  - e.g., normal, exponential, ...
- Ranges for the values of the unknown parameters
  - e.g., uniformly distributed over a range
- Training samples
  - Sampled according to the states of nature: D = \{D_1, D_2, \ldots, D_c\}
Posterior Probabilities from Sample

D = \{D_1, D_2, \ldots, D_c\}

P(\omega_i \mid x, D) = \frac{P(x \mid \omega_i, D)\, P(\omega_i \mid D)}{\sum_{j=1}^{c} P(x \mid \omega_j, D)\, P(\omega_j \mid D)}

Since P(x \mid \omega_i, D) = P(x \mid \omega_i, D_i) and P(\omega_i \mid D) = P(\omega_i), this becomes

P(\omega_i \mid x, D) = \frac{P(x \mid \omega_i, D_i)\, P(\omega_i)}{\sum_{j=1}^{c} P(x \mid \omega_j, D_j)\, P(\omega_j)}
Posterior Probabilities from Sample

Each class can be considered independently:

P(\omega_i \mid x, D) = \frac{P(x \mid \omega_i, D_i)\, P(\omega_i)}{\sum_{j=1}^{c} P(x \mid \omega_j, D_j)\, P(\omega_j)}
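To make the formula concrete, here is a small sketch (the per-class samples, priors, and the Gaussian fit are all assumptions for illustration, not from the lecture) that estimates p(x | ω_j, D_j) from each class's own sample set and plugs the results into the expression above:

```python
import numpy as np
from scipy.stats import norm

# Hypothetical 1-D training sets, one per class, and assumed priors.
D = {0: np.array([1.0, 1.5, 0.8, 1.2]),
     1: np.array([3.0, 3.4, 2.8, 3.2])}
priors = {0: 0.5, 1: 0.5}

# Estimate each class-conditional density from its own D_j (here: a Gaussian fit).
params = {j: (d.mean(), d.std()) for j, d in D.items()}

def posterior(x):
    """P(omega_i | x, D) = p(x|omega_i, D_i) P(omega_i) / sum_j p(x|omega_j, D_j) P(omega_j)."""
    joint = {j: norm.pdf(x, loc=m, scale=s) * priors[j] for j, (m, s) in params.items()}
    z = sum(joint.values())
    return {j: v / z for j, v in joint.items()}

print(posterior(2.0))   # x = 2.0 lies between the two classes
```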
Problem Formulation

Let D be a set of samples drawn independently according to the fixed but unknown distribution p(x).

We want to determine p(x \mid D).

This is the central problem of Bayesian learning.
Parameter Distribution

Assume p(x) is unknown, but we know that it has a fixed form with parameter vector \theta.

\Longrightarrow p(x \mid \theta) is completely known.

Assume \theta is a random vector, and p(\theta) is known a priori.

p(x \mid D) = \int p(x, \theta \mid D)\, d\theta
Class-Conditional Density Estimation

p(x, \theta \mid D) = p(x \mid \theta, D)\, p(\theta \mid D) = p(x \mid \theta)\, p(\theta \mid D)

p(x \mid D) = \int p(x \mid \theta)\, p(\theta \mid D)\, d\theta
Class-Conditional Density Estimation

p(x \mid D) = \int p(x \mid \theta)\, p(\theta \mid D)\, d\theta

Here p(x \mid \theta) has the assumed known form of the distribution, and p(\theta \mid D) is the posterior density we want to estimate.

p(\theta \mid D) = ?
Class-Conditional Density Estimation

p(x \mid D) = \int p(x \mid \theta)\, p(\theta \mid D)\, d\theta

If p(\theta \mid D) has a sharp peak at \hat{\theta}, then

p(x \mid D) \approx p(x \mid \hat{\theta})

[Figure: a posterior p(θ | D) sharply peaked at θ̂.]
Class-Conditional Density Estimation

p(\theta \mid D) = \frac{p(D \mid \theta)\, p(\theta)}{\int p(D \mid \theta)\, p(\theta)\, d\theta}
\;\propto\; \prod_{k=1}^{n} p(x_k \mid \theta)\, p(\theta)
The Univariate Gaussian: Unknown μ

X ~ N(\mu, \sigma^2)   (the distribution form is known)

p(x \mid \mu) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left( -\frac{1}{2} \left( \frac{x - \mu}{\sigma} \right)^2 \right)

\mu ~ N(\mu_0, \sigma_0^2)   (assume μ is normally distributed)

p(\mu) = \frac{1}{\sqrt{2\pi}\,\sigma_0} \exp\!\left( -\frac{1}{2} \left( \frac{\mu - \mu_0}{\sigma_0} \right)^2 \right)
The Univariate Gaussian: Unknown μ

p(\mu \mid D) \propto \prod_{k=1}^{n} p(x_k \mid \mu)\, p(\mu)

= \prod_{k=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left( -\frac{1}{2} \left( \frac{x_k - \mu}{\sigma} \right)^2 \right)
\cdot \frac{1}{\sqrt{2\pi}\,\sigma_0} \exp\!\left( -\frac{1}{2} \left( \frac{\mu - \mu_0}{\sigma_0} \right)^2 \right)

\propto \exp\!\left( -\frac{1}{2} \left[ \sum_{k=1}^{n} \left( \frac{x_k - \mu}{\sigma} \right)^2 + \left( \frac{\mu - \mu_0}{\sigma_0} \right)^2 \right] \right)

\propto \exp\!\left( -\frac{1}{2} \left[ \left( \frac{n}{\sigma^2} + \frac{1}{\sigma_0^2} \right) \mu^2
- 2 \left( \frac{1}{\sigma^2} \sum_{k=1}^{n} x_k + \frac{\mu_0}{\sigma_0^2} \right) \mu \right] \right)
The Univariate Gaussian: Unknown μ

Since p(\mu \mid D) is the exponential of a quadratic function of μ, it is again normal: p(\mu \mid D) ~ N(\mu_n, \sigma_n^2), i.e.,

p(\mu \mid D) = \frac{1}{\sqrt{2\pi}\,\sigma_n} \exp\!\left( -\frac{1}{2} \left( \frac{\mu - \mu_n}{\sigma_n} \right)^2 \right)
\;\propto\; \exp\!\left( -\frac{1}{2\sigma_n^2} \left( \mu^2 - 2\mu_n \mu + \mu_n^2 \right) \right)

Comparison with

p(\mu \mid D) \propto \exp\!\left( -\frac{1}{2} \left[ \left( \frac{n}{\sigma^2} + \frac{1}{\sigma_0^2} \right) \mu^2
- 2 \left( \frac{1}{\sigma^2} \sum_{k=1}^{n} x_k + \frac{\mu_0}{\sigma_0^2} \right) \mu \right] \right)

identifies \mu_n and \sigma_n^2.
The Univariate Gaussian: Unknown μ

p(\mu \mid D) ~ N(\mu_n, \sigma_n^2), with

\mu_n = \frac{n\sigma_0^2}{n\sigma_0^2 + \sigma^2}\, \hat{\mu}_n + \frac{\sigma^2}{n\sigma_0^2 + \sigma^2}\, \mu_0,
\qquad
\sigma_n^2 = \frac{\sigma_0^2 \sigma^2}{n\sigma_0^2 + \sigma^2},
\qquad
\hat{\mu}_n = \frac{1}{n} \sum_{k=1}^{n} x_k

As n \to \infty:

\lim_{n \to \infty} \mu_n = \hat{\mu}_n, \qquad \lim_{n \to \infty} \sigma_n^2 = 0
The Univariate Gaussian: p(x|D)

p(x \mid D) = \int p(x \mid \theta)\, p(\theta \mid D)\, d\theta

with p(x \mid \mu) ~ N(\mu, \sigma^2) and p(\mu \mid D) ~ N(\mu_n, \sigma_n^2):

p(x \mid \mu) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left( -\frac{1}{2} \left( \frac{x - \mu}{\sigma} \right)^2 \right),
\qquad
p(\mu \mid D) = \frac{1}{\sqrt{2\pi}\,\sigma_n} \exp\!\left( -\frac{1}{2} \left( \frac{\mu - \mu_n}{\sigma_n} \right)^2 \right)

\mu_n = \frac{n\sigma_0^2}{n\sigma_0^2 + \sigma^2}\, \hat{\mu}_n + \frac{\sigma^2}{n\sigma_0^2 + \sigma^2}\, \mu_0,
\qquad
\sigma_n^2 = \frac{\sigma_0^2 \sigma^2}{n\sigma_0^2 + \sigma^2}
The Univariate Gaussian: p(x|D)

p(x \mid D) = \int p(x \mid \mu)\, p(\mu \mid D)\, d\mu

= \int \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left( -\frac{1}{2} \left( \frac{x - \mu}{\sigma} \right)^2 \right)
\frac{1}{\sqrt{2\pi}\,\sigma_n} \exp\!\left( -\frac{1}{2} \left( \frac{\mu - \mu_n}{\sigma_n} \right)^2 \right) d\mu

= \frac{1}{2\pi\sigma\sigma_n} \exp\!\left( -\frac{1}{2} \frac{(x - \mu_n)^2}{\sigma^2 + \sigma_n^2} \right)
\int \exp\!\left( -\frac{1}{2} \frac{\sigma^2 + \sigma_n^2}{\sigma^2 \sigma_n^2}
\left( \mu - \frac{\sigma_n^2 x + \sigma^2 \mu_n}{\sigma^2 + \sigma_n^2} \right)^2 \right) d\mu
The Univariate Gaussian: p(x|D)

p(x \mid D) = \frac{1}{2\pi\sigma\sigma_n} \exp\!\left( -\frac{1}{2} \frac{(x - \mu_n)^2}{\sigma^2 + \sigma_n^2} \right)
\int \exp\!\left( -\frac{1}{2} \frac{\sigma^2 + \sigma_n^2}{\sigma^2 \sigma_n^2}
\left( \mu - \frac{\sigma_n^2 x + \sigma^2 \mu_n}{\sigma^2 + \sigma_n^2} \right)^2 \right) d\mu = ?

The remaining integral is a Gaussian integral over μ whose value does not depend on x, so it only contributes a normalizing constant. Therefore

p(x \mid D) ~ N(\mu_n, \sigma^2 + \sigma_n^2)
The Multivariate Gaussian: Unknown μ

X ~ N(\mu, \Sigma)   (the distribution form is known)

p(x \mid \mu) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\!\left( -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right)

\mu ~ N(\mu_0, \Sigma_0)   (assume μ is normally distributed)

p(\mu) = \frac{1}{(2\pi)^{d/2} |\Sigma_0|^{1/2}} \exp\!\left( -\frac{1}{2} (\mu - \mu_0)^T \Sigma_0^{-1} (\mu - \mu_0) \right)
The Multivariate Gaussian: Unknown μ

With p(x \mid \mu) and p(\mu) as above,

p(\theta \mid D) \propto \prod_{k=1}^{n} p(x_k \mid \theta)\, p(\theta)

\propto \exp\!\left( -\frac{1}{2} \left[ \mu^T \left( n\Sigma^{-1} + \Sigma_0^{-1} \right) \mu
- 2 \mu^T \left( \Sigma^{-1} \sum_{k=1}^{n} x_k + \Sigma_0^{-1} \mu_0 \right) \right] \right)

\propto \exp\!\left( -\frac{1}{2} (\mu - \mu_n)^T \Sigma_n^{-1} (\mu - \mu_n) \right)
The Multivariate Gaussian: Unknown μ

p(\theta \mid D) \propto \exp\!\left( -\frac{1}{2} (\mu - \mu_n)^T \Sigma_n^{-1} (\mu - \mu_n) \right)
\;\sim\; N(\mu_n, \Sigma_n)

\mu_n = \Sigma_0 \left( \Sigma_0 + \tfrac{1}{n}\Sigma \right)^{-1} \hat{\mu}_n
+ \tfrac{1}{n}\Sigma \left( \Sigma_0 + \tfrac{1}{n}\Sigma \right)^{-1} \mu_0

\Sigma_n = \Sigma_0 \left( \Sigma_0 + \tfrac{1}{n}\Sigma \right)^{-1} \tfrac{1}{n}\Sigma

where \hat{\mu}_n = \frac{1}{n} \sum_{k=1}^{n} x_k, and

p(x \mid D) ~ N(\mu_n, \Sigma + \Sigma_n)
General Theory

1. p(x \mid \theta): the form of the class-conditional density is known.
2. p(\theta): knowledge about the parameter distribution is available.
3. D = \{x_1, \ldots, x_n\}: samples are drawn independently according to the unknown probability density p(x).

p(\theta \mid D) = \frac{p(D \mid \theta)\, p(\theta)}{\int p(D \mid \theta)\, p(\theta)\, d\theta}
General Theory

1. p(x \mid \theta): the form of the class-conditional density is known.
2. p(\theta): knowledge about the parameter distribution is available.
3. D = \{x_1, \ldots, x_n\}: samples are drawn independently according to the unknown probability density p(x).

p(x \mid D) = \int p(x \mid \theta)\, p(\theta \mid D)\, d\theta

p(\theta \mid D) \propto p(D \mid \theta)\, p(\theta),
\qquad
p(D \mid \theta) = \prod_{k=1}^{n} p(x_k \mid \theta)
Incremental Learning

Write D^n = \{x_1, \ldots, x_n\}. Then

p(x \mid D^n) = \int p(x \mid \theta)\, p(\theta \mid D^n)\, d\theta,
\qquad
p(\theta \mid D^n) \propto p(D^n \mid \theta)\, p(\theta),
\qquad
p(D^n \mid \theta) = \prod_{k=1}^{n} p(x_k \mid \theta)

Recursive form:

p(D^n \mid \theta) = p(x_n \mid \theta)\, p(D^{n-1} \mid \theta)
\quad\Longrightarrow\quad
p(\theta \mid D^n) \propto p(x_n \mid \theta)\, p(\theta \mid D^{n-1})
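The recursion can be run directly on a discretized parameter grid. The sketch below (a Gaussian likelihood with known variance and made-up observations, chosen only for illustration) renormalizes p(θ | D^n) after each new sample:

```python
import numpy as np
from scipy.stats import norm

# Discretize the parameter (here: the unknown mean of a N(theta, 1) model).
theta = np.linspace(-10.0, 10.0, 2001)
dtheta = theta[1] - theta[0]
posterior = np.full_like(theta, 1.0 / (theta[-1] - theta[0]))   # flat prior p(theta | D^0)

samples = [4.2, 5.1, 3.8, 4.9]                                   # hypothetical observations

for x_n in samples:
    # p(theta | D^n)  proportional to  p(x_n | theta) * p(theta | D^{n-1})
    posterior = norm.pdf(x_n, loc=theta, scale=1.0) * posterior
    posterior /= posterior.sum() * dtheta                         # renormalize on the grid

print(theta[np.argmax(posterior)])   # the posterior mode approaches the sample mean
```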
1 / 
1. p( x |  )  
 0
3. D  {4,7,2,8}
0    10
p( | D 1 )  p( x1 |  ) p( | D 0 )  1 /  ,
4    10
p( | D 2 )  p( x2 |  ) p( | D 1 )  1 /  2 ,
7    10
p( | D 3 )  p( x3 |  ) p( | D 2 )  1 /  3 ,
7    10
p( | D 4 )  p( x4 |  ) p( | D 4 )  1 /  4 ,
8    10
p(D n | θ)  p(x n | θ) p(D n -1 | θ)
p(θ | D )  p( xn | θ) p(D
n
otherwise
1 / 10 0    10
2. p( )  
otherwise
 0
Example
p( | D 0 )  1 / 10,
0  x 
n -1
| θ)
1 / 
1. p( x |  )  
 0
otherwise
1 / 10 0    10
2. p( )  
otherwise
 0
Example
p( | D 0 )  1 / 10,
0  x 
3. D  {4,7,2,8}
0    10
p( | D 1 )  p( x1 |  ) p( | D 0 )  1 /  ,
4    10
p( | D 2 )  p( x2 |  ) p( | D 1 )  1 /  2 ,
7    10
p( | D 3 )  p( x3 |  ) p( | D 2 )  1 /  3 ,
7    10
p( | D 4 )  p( x4 |  ) p( | D 4 )  1 /  4 ,
p(|Dn)
4
8    10
3
2
1
0
2
4
6
8
10

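The same recursion, coded for this uniform example (the grid and printout are a sketch; the data are the ones on the slide):

```python
import numpy as np

theta = np.linspace(0.01, 10.0, 1000)        # grid over the parameter range (0, 10]
p = np.full_like(theta, 1.0 / 10.0)          # p(theta | D^0) = 1/10 on [0, 10]

D = [4, 7, 2, 8]
for k, x in enumerate(D, start=1):
    lik = np.where(theta >= x, 1.0 / theta, 0.0)   # p(x | theta) = 1/theta for 0 <= x <= theta
    p = lik * p                                     # unnormalized p(theta | D^k)
    support = theta[p > 0]
    print(f"p(theta | D^{k}) is proportional to 1/theta^{k} on [{support.min():.1f}, 10]")
```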