ECE 8443 – Pattern Recognition
ECE 8527 – Introduction to Machine Learning and Pattern Recognition
LECTURE 04: MAXIMUM LIKELIHOOD ESTIMATION
• Objectives:
Discrete Features
Maximum Likelihood
Bias in ML Estimates
Bayesian Estimation
Example
• Resources:
D.H.S.: Chapter 3 (Part 1)
D.H.S.: Chapter 3 (Part 2)
J.O.S.: Tutorial
Nebula: Links
BGSU: Example
A.W.M.: Tutorial
A.W.M.: Links
S.P.: Primer
CSRN: Unbiased
A.W.M.: Bias
Wiki: ML
M.Y.: ML Tutorial
J.O.S.: Bayesian Est.
J.H.: Euro Coin
Discrete Features
• For problems where features are discrete, integrals of densities are replaced by sums over probabilities:
$\int p(\mathbf{x}|\omega_j)\,d\mathbf{x} \;\rightarrow\; \sum_{\mathbf{x}} P(\mathbf{x}|\omega_j)$
• Bayes formula involves probabilities (not densities):
$P(\omega_j|\mathbf{x}) = \frac{P(\mathbf{x}|\omega_j)\,P(\omega_j)}{P(\mathbf{x})}$
where
$P(\mathbf{x}) = \sum_{j=1}^{c} P(\mathbf{x}|\omega_j)\,P(\omega_j)$
• Bayes rule remains the same:
$\alpha^* = \arg\min_{i} R(\alpha_i\,|\,\mathbf{x})$
• The maximum entropy distribution is a uniform distribution:
$P(x = x_i) = \frac{1}{N}$
ECE 8527: Lecture 04, Slide 1
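To make the discrete form of Bayes formula concrete, here is a minimal numpy sketch. The class-conditional tables, priors, and observed value below are made-up illustrative numbers, not values from the lecture:

```python
import numpy as np

# Hypothetical class-conditional probabilities P(x|w_j) for a feature x that
# takes one of 4 discrete values, and priors P(w_j) for c = 2 classes.
P_x_given_w = np.array([[0.1, 0.2, 0.3, 0.4],   # P(x|w_1)
                        [0.4, 0.3, 0.2, 0.1]])  # P(x|w_2)
P_w = np.array([0.6, 0.4])                      # P(w_1), P(w_2)

x = 2  # index of the observed discrete feature value

# P(x) = sum_j P(x|w_j) P(w_j)  -- a sum over probabilities, not an integral
P_x = np.sum(P_x_given_w[:, x] * P_w)

# Bayes formula with probabilities: P(w_j|x) = P(x|w_j) P(w_j) / P(x)
posterior = P_x_given_w[:, x] * P_w / P_x
print(posterior, posterior.sum())   # posteriors sum to 1
```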
Discriminant Functions For Discrete Features
• Consider independent binary features:
$\mathbf{x} = (x_1, \ldots, x_d)^t, \quad x_i \in \{0, 1\}$
• Assuming conditional independence, with $p_i = P(x_i = 1|\omega_1)$ and $q_i = P(x_i = 1|\omega_2)$:
$P(\mathbf{x}|\omega_1) = \prod_{i=1}^{d}p_i^{x_i}(1-p_i)^{1-x_i}$
$P(\mathbf{x}|\omega_2) = \prod_{i=1}^{d}q_i^{x_i}(1-q_i)^{1-x_i}$
• The likelihood ratio is:
$\frac{P(\mathbf{x}|\omega_1)}{P(\mathbf{x}|\omega_2)} = \prod_{i=1}^{d}\frac{p_i^{x_i}(1-p_i)^{1-x_i}}{q_i^{x_i}(1-q_i)^{1-x_i}}$
• Taking the log of this ratio and adding the log prior ratio gives a linear discriminant function:
$g(\mathbf{x}) = \sum_{i=1}^{d}w_i x_i + w_0, \quad\text{where } w_i = \ln\frac{p_i(1-q_i)}{q_i(1-p_i)} \text{ and } w_0 = \sum_{i=1}^{d}\ln\frac{1-p_i}{1-q_i} + \ln\frac{P(\omega_1)}{P(\omega_2)}$
ECE 8527: Lecture 04, Slide 2
Introduction to Maximum Likelihood Estimation
• In Chapter 2, we learned how to design an optimal classifier if we knew the
prior probabilities, P(ωi), and class-conditional densities, p(x|ωi).
• What can we do if we do not have this information?
• What limitations do we face?
• There are two common approaches to parameter estimation: maximum
likelihood and Bayesian estimation.
• Maximum Likelihood: treat the parameters as quantities whose values are
fixed but unknown.
• Bayes: treat the parameters as random variables having some known prior
distribution. Observing samples converts this prior into a posterior density.
• Bayesian Learning: additional samples sharpen the a posteriori density,
causing it to peak near the true values of the parameters.
ECE 8527: Lecture 04, Slide 3
General Principle
• I.I.D.: c data sets, D1, ..., Dc, where Dj is drawn independently according to p(x|ωj).
• Assume p(x|ωj) has a known parametric form and is completely determined
by the parameter vector θj (e.g., p(x|ωj) ~ N(μj, Σj), where θj consists of the
components of μj and Σj: θj = [μ1, ..., μd, σ11, σ12, ..., σdd]).
• p(x|ωj) has an explicit dependence on θj: p(x|ωj, θj)
• Use training samples to estimate θ1, θ2,..., θc
• Functional independence: assume Di gives no useful information
about θj for i≠j.
• This simplifies our notation to a set D of training samples (x1, ..., xn) drawn
independently from p(x|θ) that we use to estimate θ.
• Because the samples were drawn independently:
$p(D|\theta) = \prod_{k=1}^{n}p(\mathbf{x}_k|\theta)$
ECE 8527: Lecture 04, Slide 4
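Because the likelihood of an i.i.d. data set is a product of many densities, it underflows quickly in floating point; the log likelihood turns the product into a sum. A minimal numpy sketch, assuming a univariate Gaussian form for p(x|θ) with made-up parameter values:

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Univariate Gaussian density p(x|theta), theta = (mu, sigma)."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

rng = np.random.default_rng(0)
D = rng.normal(loc=2.0, scale=1.0, size=1000)   # training samples x_1, ..., x_n

mu, sigma = 2.0, 1.0                            # one candidate theta
likelihood = np.prod(gaussian_pdf(D, mu, sigma))         # p(D|theta): underflows to 0.0
log_likelihood = np.sum(np.log(gaussian_pdf(D, mu, sigma)))  # sum of log densities
print(likelihood, log_likelihood)
```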
Example of ML Estimation
• p(D|θ) is called the likelihood of θ with respect to the data.
• The value of θ that maximizes this likelihood, denoted $\hat{\theta}$, is the
maximum likelihood estimate (ML) of θ.
• Given several training points:
• Top: candidate source distributions are shown. Which distribution is the ML estimate?
• Middle: an estimate of the likelihood of the data as a function of θ (the mean).
• Bottom: the corresponding log likelihood.
ECE 8527: Lecture 04, Slide 5
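The idea behind the figure can be reproduced numerically: score the same data under several candidate values of θ and keep the one with the highest (log) likelihood. A sketch assuming a univariate Gaussian with known variance; the data and candidate grid are made up:

```python
import numpy as np

rng = np.random.default_rng(1)
D = rng.normal(loc=5.0, scale=2.0, size=50)      # training points

sigma = 2.0                                      # assume the variance is known
candidates = np.linspace(0.0, 10.0, 201)         # candidate values of theta (the mean)

# log p(D|theta) for each candidate mean
log_lik = np.array([np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                           - (D - mu)**2 / (2 * sigma**2)) for mu in candidates])

theta_hat = candidates[np.argmax(log_lik)]
print(theta_hat, D.mean())   # the grid maximizer sits at (about) the sample mean
```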
General Mathematics
Let $\theta = (\theta_1, \theta_2, \ldots, \theta_p)^t$.
Let $\nabla_\theta = \left[\frac{\partial}{\partial\theta_1}, \ldots, \frac{\partial}{\partial\theta_p}\right]^t$.
Define the log likelihood: $l(\theta) \equiv \ln p(D|\theta) = \ln\prod_{k=1}^{n}p(\mathbf{x}_k|\theta) = \sum_{k=1}^{n}\ln p(\mathbf{x}_k|\theta)$
$\hat{\theta} = \arg\max_{\theta} l(\theta)$
• The ML estimate is found by solving this equation:
$\nabla_\theta l = \nabla_\theta\left[\sum_{k=1}^{n}\ln p(\mathbf{x}_k|\theta)\right] = \sum_{k=1}^{n}\nabla_\theta \ln p(\mathbf{x}_k|\theta) = 0$
• The solution to this equation can be a global maximum, a local maximum, or even an inflection point.
• Under what conditions is it a global maximum?
ECE 8527: Lecture 04, Slide 6
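When ∇_θ l = 0 has no convenient closed-form solution, the log likelihood can be maximized numerically (equivalently, the negative log likelihood minimized). A sketch using scipy's general-purpose optimizer on a Gaussian with both mean and variance unknown; the log-σ parametrization is just one convenient choice to keep σ positive, and the data are simulated:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
D = rng.normal(loc=3.0, scale=1.5, size=200)

def neg_log_likelihood(theta):
    # theta = (mu, log sigma); negative Gaussian log likelihood of the data set D
    mu, log_sigma = theta
    sigma = np.exp(log_sigma)
    return np.sum(0.5 * np.log(2 * np.pi * sigma**2) + (D - mu)**2 / (2 * sigma**2))

res = minimize(neg_log_likelihood, x0=np.array([0.0, 0.0]), method="Nelder-Mead")
mu_hat, sigma_hat = res.x[0], np.exp(res.x[1])

# The numerical maximizer closely matches the closed-form ML estimates.
print(mu_hat, D.mean())
print(sigma_hat**2, D.var())   # np.var uses 1/n by default, i.e., the ML estimate
```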
Maximum A Posteriori Estimation
• A class of estimators – maximum a posteriori (MAP) – maximize $l(\theta)\,p(\theta)$,
where $p(\theta)$ describes the prior probability of different parameter values.
• An ML estimator is a MAP estimator for uniform priors.
• A MAP estimator finds the peak, or mode, of a posterior density.
• MAP estimators are not transformation invariant (if we perform a nonlinear
transformation of the input data, the estimator is no longer optimum in the
new space). This observation will be useful later in the course.
ECE 8527: Lecture 04, Slide 7
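A small grid-based sketch of the MAP idea: maximize l(θ) + ln p(θ) rather than l(θ) alone. The Gaussian prior below is a made-up choice; with it the MAP estimate is pulled toward the prior mean, while a uniform prior recovers the ML estimate, as stated above.

```python
import numpy as np

rng = np.random.default_rng(3)
D = rng.normal(loc=4.0, scale=1.0, size=10)       # a small data set
sigma = 1.0                                       # known data variance

grid = np.linspace(-2.0, 8.0, 2001)               # candidate values of the mean
# log likelihood up to an additive constant (terms without the mean dropped)
log_lik = np.array([np.sum(-(D - m)**2 / (2 * sigma**2)) for m in grid])

# Hypothetical Gaussian prior on the mean: p(theta) ~ N(0, 1)
log_prior = -grid**2 / 2.0

theta_ml  = grid[np.argmax(log_lik)]              # uniform prior -> ML estimate
theta_map = grid[np.argmax(log_lik + log_prior)]  # MAP: mode of the posterior
print(theta_ml, theta_map)                        # MAP is shrunk toward the prior mean 0
```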
Gaussian Case: Unknown Mean
• Consider the case where only the mean, θ = μ, is unknown:
$\sum_{k=1}^{n}\nabla_\mu \ln p(\mathbf{x}_k|\mu) = 0$
$\ln p(\mathbf{x}_k|\mu) = \ln\left[\frac{1}{(2\pi)^{d/2}|\Sigma|^{1/2}}\exp\left(-\frac{1}{2}(\mathbf{x}_k-\mu)^t\Sigma^{-1}(\mathbf{x}_k-\mu)\right)\right]$
$= -\frac{1}{2}\ln\left[(2\pi)^d|\Sigma|\right] - \frac{1}{2}(\mathbf{x}_k-\mu)^t\Sigma^{-1}(\mathbf{x}_k-\mu)$
which implies:
$\nabla_\mu \ln p(\mathbf{x}_k|\mu) = \Sigma^{-1}(\mathbf{x}_k-\mu)$
because:
$\nabla_\mu\left[-\frac{1}{2}\ln\left[(2\pi)^d|\Sigma|\right] - \frac{1}{2}(\mathbf{x}_k-\mu)^t\Sigma^{-1}(\mathbf{x}_k-\mu)\right] = \Sigma^{-1}(\mathbf{x}_k-\mu)$
ECE 8527: Lecture 04, Slide 8
Gaussian Case: Unknown Mean
• Substituting into the expression for the total likelihood:
$\nabla_\mu l = \sum_{k=1}^{n}\nabla_\mu \ln p(\mathbf{x}_k|\mu) = \sum_{k=1}^{n}\Sigma^{-1}(\mathbf{x}_k-\mu) = 0$
• Rearranging terms:
$\sum_{k=1}^{n}\Sigma^{-1}(\mathbf{x}_k-\hat{\mu}) = 0$
$\sum_{k=1}^{n}(\mathbf{x}_k-\hat{\mu}) = 0$
$\sum_{k=1}^{n}\mathbf{x}_k - \sum_{k=1}^{n}\hat{\mu} = 0$
$\sum_{k=1}^{n}\mathbf{x}_k - n\hat{\mu} = 0$
$\hat{\mu} = \frac{1}{n}\sum_{k=1}^{n}\mathbf{x}_k$
• Significance???
ECE 8527: Lecture 04, Slide 9
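A quick multivariate check of the result above: at μ̂ equal to the sample mean, the summed gradient Σ_k Σ⁻¹(x_k − μ̂) vanishes. The true mean and covariance below are made-up values used only to generate data:

```python
import numpy as np

rng = np.random.default_rng(4)
true_mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])                          # assumed known covariance
X = rng.multivariate_normal(true_mu, Sigma, size=500)   # rows are samples x_k

mu_hat = X.mean(axis=0)                                 # ML estimate: the sample mean

Sigma_inv = np.linalg.inv(Sigma)
grad_total = ((X - mu_hat) @ Sigma_inv).sum(axis=0)     # sum_k Sigma^{-1}(x_k - mu_hat)
print(grad_total)                                       # ~ [0, 0] up to floating-point error
print(mu_hat, true_mu)
```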
Gaussian Case: Unknown Mean and Variance
• Let $\theta = (\theta_1, \theta_2)^t = (\mu, \sigma^2)^t$. The log likelihood of a SINGLE point is:
$\ln p(x_k|\theta) = -\frac{1}{2}\ln(2\pi\theta_2) - \frac{1}{2\theta_2}(x_k-\theta_1)^2$
$\nabla_\theta l = \nabla_\theta \ln p(x_k|\theta) = \begin{bmatrix} \frac{1}{\theta_2}(x_k-\theta_1) \\ -\frac{1}{2\theta_2} + \frac{(x_k-\theta_1)^2}{2\theta_2^2} \end{bmatrix}$
• The full likelihood leads to:
$\sum_{k=1}^{n}\frac{1}{\hat{\theta}_2}(x_k-\hat{\theta}_1) = 0 \qquad\text{and}\qquad -\sum_{k=1}^{n}\frac{1}{2\hat{\theta}_2} + \sum_{k=1}^{n}\frac{(x_k-\hat{\theta}_1)^2}{2\hat{\theta}_2^2} = 0$
ECE 8527: Lecture 04, Slide 10
Gaussian Case: Unknown Mean and Variance
• This leads to these equations:
$\hat{\theta}_1 = \hat{\mu} = \frac{1}{n}\sum_{k=1}^{n}x_k \qquad\qquad \hat{\theta}_2 = \hat{\sigma}^2 = \frac{1}{n}\sum_{k=1}^{n}(x_k-\hat{\mu})^2$
• In the multivariate case:
$\hat{\mu} = \frac{1}{n}\sum_{k=1}^{n}\mathbf{x}_k \qquad\qquad \hat{\Sigma} = \frac{1}{n}\sum_{k=1}^{n}(\mathbf{x}_k-\hat{\mu})(\mathbf{x}_k-\hat{\mu})^t$
• The true covariance is the expected value of the matrix $(\mathbf{x}-\mu)(\mathbf{x}-\mu)^t$, so the ML estimate simply replaces the expectation with a sample average, which is a familiar result.
ECE 8527: Lecture 04, Slide 11
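The multivariate estimates above are one line each in numpy; note the 1/n normalization, which differs from np.cov's default 1/(n-1). The distribution parameters here are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(5)
mu_true = np.array([0.0, 3.0])
Sigma_true = np.array([[1.0, 0.3],
                       [0.3, 2.0]])
X = rng.multivariate_normal(mu_true, Sigma_true, size=1000)   # rows are samples

mu_hat = X.mean(axis=0)                      # (1/n) sum_k x_k
diff = X - mu_hat
Sigma_hat = (diff.T @ diff) / X.shape[0]     # (1/n) sum_k (x_k - mu_hat)(x_k - mu_hat)^t

# np.cov with bias=True uses the same 1/n normalization (the ML estimate);
# the default (bias=False) gives the unbiased 1/(n-1) version.
assert np.allclose(Sigma_hat, np.cov(X, rowvar=False, bias=True))
print(mu_hat, Sigma_hat)
```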
Convergence of the Mean
• Does the maximum likelihood estimate of the variance converge to the true
value of the variance? Let’s start with a few simple results we will need later.
• Expected value of the ML estimate of the mean:
1 n
E[ ˆ ]  E[  xi ]
n i 1

n
1
 E[ xi ]
n i 1
1 n
   
n i 1
ECE 8527: Lecture 04, Slide 12
Variance of the ML Estimate of the Mean
$\mathrm{var}[\hat{\mu}] = E[\hat{\mu}^2] - (E[\hat{\mu}])^2$
$= E[\hat{\mu}^2] - \mu^2$
$= E\left[\left(\frac{1}{n}\sum_{i=1}^{n}x_i\right)\left(\frac{1}{n}\sum_{j=1}^{n}x_j\right)\right] - \mu^2$
$= \frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n}E[x_i x_j] - \mu^2$
• The expected value of $x_i x_j$, $E[x_i x_j]$, will be $\mu^2$ for $i \neq j$, since the two
random variables are independent, and $\mu^2 + \sigma^2$ otherwise.
• That is, the expected value of $x_i^2$ will be $\mu^2 + \sigma^2$.
• Hence, in the summation above, we have $n^2 - n$ terms with expected value $\mu^2$
and $n$ terms with expected value $\mu^2 + \sigma^2$.
• Thus,
$\mathrm{var}[\hat{\mu}] = \frac{1}{n^2}\left[(n^2-n)\mu^2 + n(\mu^2+\sigma^2)\right] - \mu^2 = \frac{\sigma^2}{n}$
which implies:
$E[\hat{\mu}^2] = \mathrm{var}[\hat{\mu}] + (E[\hat{\mu}])^2 = \frac{\sigma^2}{n} + \mu^2$
• We see that the variance of the estimate goes to zero as n goes to infinity, so
our estimate converges to the true mean (the error goes to zero).
ECE 8527: Lecture 04, Slide 13
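A Monte Carlo sketch of var[μ̂] = σ²/n: simulate many data sets of size n, compute the sample mean of each, and compare the empirical variance of those means to σ²/n. The sizes and parameters below are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(6)
mu, sigma, n, trials = 0.0, 2.0, 25, 100_000

# Each row is one data set of size n; each sample mean is one draw of mu_hat.
X = rng.normal(mu, sigma, size=(trials, n))
mu_hats = X.mean(axis=1)

print(mu_hats.var())    # empirical var[mu_hat]
print(sigma**2 / n)     # theoretical sigma^2 / n = 0.16
```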
Variance Relationships
• We will need one more result:
$\sigma^2 = E[(x-\mu)^2] = E[x^2] - 2\mu E[x] + E[\mu^2]$
$= E[x^2] - 2\mu^2 + \mu^2$
$= E[x^2] - \mu^2$
Note that this implies:
$E[x_i^2] = \sigma^2 + \mu^2$
• Now we can combine these results. Recall our expression for the ML
estimate of the variance:
$\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i-\hat{\mu})^2 \quad\Rightarrow\quad E[\hat{\sigma}^2] = E\left[\frac{1}{n}\sum_{i=1}^{n}(x_i-\hat{\mu})^2\right]$
ECE 8527: Lecture 04, Slide 14
Covariance Expansion
• Expand the covariance and simplify:
$E[\hat{\sigma}^2] = E\left[\frac{1}{n}\sum_{i=1}^{n}(x_i-\hat{\mu})^2\right] = E\left[\frac{1}{n}\sum_{i=1}^{n}\left(x_i^2 - 2x_i\hat{\mu} + \hat{\mu}^2\right)\right]$
$= \frac{1}{n}\sum_{i=1}^{n}\left(E[x_i^2] - 2E[x_i\hat{\mu}] + E[\hat{\mu}^2]\right)$
$= \frac{1}{n}\sum_{i=1}^{n}\left((\mu^2+\sigma^2) - 2E[x_i\hat{\mu}] + \left(\mu^2 + \frac{\sigma^2}{n}\right)\right)$
• One more intermediate term to derive:
$E[x_i\hat{\mu}] = E\left[x_i\cdot\frac{1}{n}\sum_{j=1}^{n}x_j\right] = \frac{1}{n}\sum_{j=1}^{n}E[x_i x_j] = \frac{1}{n}\left(\sum_{j\neq i}E[x_i x_j] + E[x_i x_i]\right)$
$= \frac{1}{n}\left((n-1)\mu^2 + \mu^2 + \sigma^2\right) = \frac{1}{n}\left(n\mu^2 + \sigma^2\right) = \mu^2 + \frac{\sigma^2}{n}$
ECE 8527: Lecture 04, Slide 15
Biased Variance Estimate
• Substitute our previously derived expression for the second term:
$E[\hat{\sigma}^2] = \frac{1}{n}\sum_{i=1}^{n}\left((\mu^2+\sigma^2) - 2E[x_i\hat{\mu}] + \left(\mu^2 + \frac{\sigma^2}{n}\right)\right)$
$= \frac{1}{n}\sum_{i=1}^{n}\left((\mu^2+\sigma^2) - 2\left(\mu^2 + \frac{\sigma^2}{n}\right) + \left(\mu^2 + \frac{\sigma^2}{n}\right)\right)$
$= \frac{1}{n}\sum_{i=1}^{n}\left(\mu^2+\sigma^2 - 2\mu^2 - \frac{2\sigma^2}{n} + \mu^2 + \frac{\sigma^2}{n}\right)$
$= \frac{1}{n}\sum_{i=1}^{n}\left(\sigma^2 - \frac{\sigma^2}{n}\right) = \frac{1}{n}\sum_{i=1}^{n}\sigma^2\left(1 - \frac{1}{n}\right) = \frac{1}{n}\sum_{i=1}^{n}\frac{(n-1)}{n}\sigma^2$
$= \frac{(n-1)}{n}\sigma^2$
ECE 8527: Lecture 04, Slide 16
Expectation Simplification
• Therefore, the ML estimate is biased:
$E[\hat{\sigma}^2] = E\left[\frac{1}{n}\sum_{i=1}^{n}(x_i-\hat{\mu})^2\right] = \frac{n-1}{n}\,\sigma^2$
However, the ML estimate converges: its bias and mean-squared error go to zero as $n \rightarrow \infty$.
• An unbiased estimator is:
$C = \frac{1}{n-1}\sum_{i=1}^{n}(\mathbf{x}_i-\hat{\mu})(\mathbf{x}_i-\hat{\mu})^t$
• These are related by:
$\hat{\Sigma} = \frac{n-1}{n}\,C$
which is asymptotically unbiased. See Burl, AJWills and AWM for excellent
examples and explanations of the details of this derivation.
ECE 8527: Lecture 04, Slide 17
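A Monte Carlo sketch of the bias result: averaging the ML variance estimate over many simulated data sets gives roughly ((n-1)/n)σ², while the 1/(n-1) estimator averages to roughly σ². The parameters below are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(7)
mu, sigma, n, trials = 0.0, 3.0, 10, 200_000

X = rng.normal(mu, sigma, size=(trials, n))   # each row is one data set of size n
var_ml = X.var(axis=1, ddof=0)                # ML estimate: 1/n normalization
var_unbiased = X.var(axis=1, ddof=1)          # unbiased estimate: 1/(n-1)

print(var_ml.mean(), (n - 1) / n * sigma**2)  # ~ 8.1 for sigma^2 = 9, n = 10
print(var_unbiased.mean(), sigma**2)          # ~ 9.0
```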
Summary
• Discriminant functions for discrete features are completely analogous to the
continuous case (end of Chapter 2).
• To develop an optimal classifier, we need reliable estimates of the statistics of
the features.
• In Maximum Likelihood (ML) estimation, we treat the parameters as having
unknown but fixed values.
• ML estimation justified many well-known results for estimating parameters
(e.g., computing the mean by averaging the observations).
• Biased and unbiased estimators.
• Convergence of the mean and variance estimates.
ECE 8527: Lecture 04, Slide 18