Maximum-Likelihood and Bayesian Parameter Estimation (part 2)

Pattern Classification

All materials in these slides were taken from Pattern Classification (2nd ed.) by R. O. Duda, P. E. Hart and D. G. Stork, John Wiley & Sons, 2000, with the permission of the authors and the publisher.
Chapter 3: Maximum-Likelihood and Bayesian Parameter Estimation (part 2)

• Bayesian Estimation (BE)
• Bayesian Parameter Estimation: Gaussian Case
• Bayesian Parameter Estimation: General Theory
• Problems of Dimensionality
• Computational Complexity
• Component Analysis and Discriminants
• Hidden Markov Models
Bayesian Estimation (Bayesian learning applied to pattern classification problems)

• In MLE, θ was assumed to be fixed
• In BE, θ is a random variable
• The computation of the posterior probabilities P(ω_i | x) lies at the heart of Bayesian classification
• Goal: compute P(ω_i | x, D)

Given the sample D, Bayes' formula can be written

P(\omega_i \mid \mathbf{x}, D) = \frac{p(\mathbf{x} \mid \omega_i, D)\, P(\omega_i \mid D)}{\sum_{j=1}^{c} p(\mathbf{x} \mid \omega_j, D)\, P(\omega_j \mid D)}
To demonstrate the preceding equation, use:

P(\mathbf{x}, D \mid \omega_i) = P(\mathbf{x} \mid D, \omega_i)\, P(D \mid \omega_i) \quad \text{(from the definition of conditional probability)}

P(\mathbf{x} \mid D) = \sum_j P(\mathbf{x}, \omega_j \mid D) \quad \text{(from the law of total probability)}

P(\omega_i) = P(\omega_i \mid D) \quad \text{(the training sample provides this!)}

We assume that samples in different classes are independent. Thus:

P(\omega_i \mid \mathbf{x}, D) = \frac{p(\mathbf{x} \mid \omega_i, D_i)\, P(\omega_i)}{\sum_{j=1}^{c} p(\mathbf{x} \mid \omega_j, D_j)\, P(\omega_j)}
Bayesian Parameter Estimation: Gaussian Case

Goal: estimate μ using the a-posteriori density p(μ | D)

• The univariate case, p(μ | D): μ is the only unknown parameter

p(x \mid \mu) \sim N(\mu, \sigma^2), \qquad p(\mu) \sim N(\mu_0, \sigma_0^2)

(μ_0 and σ_0 are known!)
By Bayes' formula:

p(\mu \mid D) = \frac{p(D \mid \mu)\, p(\mu)}{\int p(D \mid \mu)\, p(\mu)\, d\mu} = \alpha \prod_{k=1}^{n} p(x_k \mid \mu)\, p(\mu)

But we know

p(x_k \mid \mu) \sim N(\mu, \sigma^2) \quad \text{and} \quad p(\mu) \sim N(\mu_0, \sigma_0^2)

Plugging in their Gaussian expressions and extracting the factors not depending on μ yields (eq. 29, page 93):

p(\mu \mid D) = \alpha' \exp\!\left[ -\frac{1}{2}\left( \left(\frac{n}{\sigma^2} + \frac{1}{\sigma_0^2}\right)\mu^2 - 2\left(\frac{1}{\sigma^2}\sum_{k=1}^{n} x_k + \frac{\mu_0}{\sigma_0^2}\right)\mu \right) \right]
Observation: p(μ | D) is an exponential of a quadratic function of μ,

p(\mu \mid D) = \alpha' \exp\!\left[ -\frac{1}{2}\left( \left(\frac{n}{\sigma^2} + \frac{1}{\sigma_0^2}\right)\mu^2 - 2\left(\frac{1}{\sigma^2}\sum_{k=1}^{n} x_k + \frac{\mu_0}{\sigma_0^2}\right)\mu \right) \right]

so

p(\mu \mid D) \sim N(\mu_n, \sigma_n^2)

It is again normal! It is called a reproducing density.
Identifying coefficients in the expression above with those of the generic Gaussian

p(\mu \mid D) = \frac{1}{\sqrt{2\pi}\,\sigma_n} \exp\!\left[ -\frac{1}{2}\left(\frac{\mu - \mu_n}{\sigma_n}\right)^2 \right]

yields expressions for μ_n and σ_n²:

\frac{1}{\sigma_n^2} = \frac{n}{\sigma^2} + \frac{1}{\sigma_0^2} \qquad \text{and} \qquad \frac{\mu_n}{\sigma_n^2} = \frac{n}{\sigma^2}\,\hat{\mu}_n + \frac{\mu_0}{\sigma_0^2}

where \hat{\mu}_n = \frac{1}{n}\sum_{k=1}^{n} x_k is the sample mean.
Solving for μ_n and σ_n² yields:

\mu_n = \left(\frac{n\sigma_0^2}{n\sigma_0^2 + \sigma^2}\right)\hat{\mu}_n + \frac{\sigma^2}{n\sigma_0^2 + \sigma^2}\,\mu_0 \qquad \text{and} \qquad \sigma_n^2 = \frac{\sigma_0^2\,\sigma^2}{n\sigma_0^2 + \sigma^2}

From these equations we see that as n increases:
• the variance σ_n² decreases monotonically
• the estimate p(μ | D) becomes more and more peaked
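As a concrete illustration, here is a minimal NumPy sketch of these two update equations; the sample values, σ², μ_0 and σ_0² below are made-up for illustration, not taken from the slides.

```python
import numpy as np

def posterior_params(x, sigma2, mu0, sigma0_2):
    """Posterior N(mu_n, sigma_n^2) for the mean of a Gaussian with known
    variance sigma2, given samples x and the prior N(mu0, sigma0_2)."""
    n = len(x)
    mu_hat = np.mean(x)                                # sample mean (ML estimate of the mean)
    mu_n = (n * sigma0_2 / (n * sigma0_2 + sigma2)) * mu_hat \
         + (sigma2 / (n * sigma0_2 + sigma2)) * mu0
    sigma_n2 = sigma0_2 * sigma2 / (n * sigma0_2 + sigma2)
    return mu_n, sigma_n2

# Illustrative (made-up) numbers: as n grows, sigma_n^2 shrinks and mu_n moves toward the sample mean.
rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.0, size=50)
print(posterior_params(x, sigma2=1.0, mu0=0.0, sigma0_2=4.0))
```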

The univariate case: p(x | D)

• p(μ | D) has been computed (in the preceding discussion)
• p(x | D) remains to be computed:

p(x \mid D) = \int p(x \mid \mu)\, p(\mu \mid D)\, d\mu \quad \text{is Gaussian}

It provides:

p(x \mid D) \sim N(\mu_n, \sigma^2 + \sigma_n^2)

We know μ_n and σ_n² and how to compute them (this is the desired class-conditional density P(x | D_j, ω_j)).

Therefore, using P(x | D_j, ω_j) together with P(ω_j) and Bayes' formula, we obtain the Bayesian classification rule:

\max_j P(\omega_j \mid \mathbf{x}, D) \;\Longleftrightarrow\; \max_j \big[\, p(\mathbf{x} \mid \omega_j, D_j)\, P(\omega_j) \,\big]
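To make the resulting rule concrete, here is a small sketch for two classes with equal priors and known σ²; the training sets, prior parameters and helper names (gauss_pdf, predictive) are my own illustrative choices.

```python
import numpy as np

def gauss_pdf(x, mu, var):
    """Density of N(mu, var) evaluated at x."""
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

def predictive(x, data, sigma2, mu0, sigma0_2):
    """p(x | D) = N(mu_n, sigma^2 + sigma_n^2) for the univariate Gaussian case."""
    n = len(data)
    mu_hat = np.mean(data)
    mu_n = (n * sigma0_2 * mu_hat + sigma2 * mu0) / (n * sigma0_2 + sigma2)
    sigma_n2 = sigma0_2 * sigma2 / (n * sigma0_2 + sigma2)
    return gauss_pdf(x, mu_n, sigma2 + sigma_n2)

# Bayesian classification rule: pick the class maximizing p(x | omega_j, D_j) P(omega_j).
rng = np.random.default_rng(1)
D = {0: rng.normal(-1.0, 1.0, 30), 1: rng.normal(1.0, 1.0, 30)}   # made-up training sets
priors = {0: 0.5, 1: 0.5}
x = 0.3
decision = max(D, key=lambda j: predictive(x, D[j], sigma2=1.0, mu0=0.0, sigma0_2=4.0) * priors[j])
print("decide class", decision)
```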
3.5 Bayesian Parameter Estimation: General Theory

The p(x | D) computation can be applied to any situation in which the unknown density can be parameterized. The basic assumptions are:
• The form of p(x | θ) is assumed known, but the value of θ is not known exactly
• Our knowledge about θ is assumed to be contained in a known prior density p(θ)
• The rest of our knowledge about θ is contained in a set D of n samples x1, x2, …, xn drawn independently according to the unknown p(x)
The basic problem is: "Compute the posterior density p(θ | D)", then "Derive p(x | D)", where

p(\mathbf{x} \mid D) = \int p(\mathbf{x} \mid \boldsymbol{\theta})\, p(\boldsymbol{\theta} \mid D)\, d\boldsymbol{\theta}

Using Bayes' formula, we have:

p(\boldsymbol{\theta} \mid D) = \frac{p(D \mid \boldsymbol{\theta})\, p(\boldsymbol{\theta})}{\int p(D \mid \boldsymbol{\theta})\, p(\boldsymbol{\theta})\, d\boldsymbol{\theta}}

and by the independence assumption:

p(D \mid \boldsymbol{\theta}) = \prod_{k=1}^{n} p(\mathbf{x}_k \mid \boldsymbol{\theta})
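For intuition, here is a minimal sketch of this recipe on a discretized parameter grid. The Bernoulli likelihood, the flat prior and the made-up samples are my own illustrative choices, not the Gaussian case treated in the slides.

```python
import numpy as np

# Grid approximation of p(theta | D) proportional to p(D | theta) p(theta), for a Bernoulli p(x | theta).
theta = np.linspace(0.001, 0.999, 999)      # discretized parameter values
prior = np.ones_like(theta)                 # flat prior p(theta)
prior /= prior.sum()

D = np.array([1, 0, 1, 1, 0, 1, 1, 1])      # made-up binary samples
likelihood = np.prod(theta[None, :] ** D[:, None] * (1 - theta[None, :]) ** (1 - D[:, None]), axis=0)

posterior = likelihood * prior
posterior /= posterior.sum()                # normalization plays the role of the integral in Bayes' formula

# Predictive density p(x = 1 | D) = integral of p(x = 1 | theta) p(theta | D) d(theta)
print("p(x=1 | D) ~", np.sum(theta * posterior))
```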
Convergence (from notes)
Problems of Dimensionality

• Problems involving 50 or 100 features are common (usually binary valued)
• Note: microarray data might entail ~20,000 real-valued features
• Classification accuracy depends on:
  • dimensionality
  • amount of training data
  • discrete vs. continuous features
Case of two-class multivariate normal densities with the same covariance:

P(\mathbf{x} \mid \omega_j) \sim N(\boldsymbol{\mu}_j, \boldsymbol{\Sigma}), \quad j = 1, 2

• Statistically independent features
• If the priors are equal, then the Bayes error is

P(\text{error}) = \frac{1}{\sqrt{2\pi}} \int_{r/2}^{\infty} e^{-u^2/2}\, du

where

r^2 = (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)^t\, \boldsymbol{\Sigma}^{-1} (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)

is the squared Mahalanobis distance, and

\lim_{r \to \infty} P(\text{error}) = 0
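Numerically, this error is just the upper tail of a standard normal beyond r/2, so it can be evaluated with the complementary error function. A small sketch follows; the means and covariance are invented for illustration.

```python
import numpy as np
from math import erfc, sqrt

def bayes_error(mu1, mu2, cov):
    """P(error) for two equal-prior Gaussians with a common covariance:
    the standard-normal tail beyond r/2, with r the Mahalanobis distance."""
    diff = np.asarray(mu1) - np.asarray(mu2)
    r2 = diff @ np.linalg.inv(cov) @ diff          # squared Mahalanobis distance
    r = np.sqrt(r2)
    return 0.5 * erfc(r / (2 * sqrt(2)))           # = 1/sqrt(2*pi) * integral_{r/2}^inf e^{-u^2/2} du

# Made-up example: the error shrinks as the class means move apart.
print(bayes_error([0.0, 0.0], [2.0, 1.0], np.eye(2)))
```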
If the features are conditionally independent, then:

\boldsymbol{\Sigma} = \mathrm{diag}(\sigma_1^2, \sigma_2^2, \ldots, \sigma_d^2) \qquad \text{and} \qquad r^2 = \sum_{i=1}^{d} \left( \frac{\mu_{i1} - \mu_{i2}}{\sigma_i} \right)^2

Do we remember what conditional independence is?

Example for binary features: let p_i = Pr[x_i = 1 | ω_1]; then, by conditional independence, P(x | ω_1) is the product of the per-feature factors p_i^{x_i}(1 − p_i)^{1 − x_i}.
• The most useful features are the ones for which the difference between the means is large relative to the standard deviation
• This doesn't require independence
• Adding independent features helps increase r and thus reduce the error
• Caution: adding features increases the cost and complexity of both the feature extractor and the classifier
• It has frequently been observed in practice that, beyond a certain point, the inclusion of additional features leads to worse rather than better performance:
  • we have the wrong model, or
  • we don't have enough training data to support the additional dimensions

Computational Complexity

• Our design methodology is affected by computational difficulty
• "big oh" notation: f(x) = O(h(x)), read "big oh of h(x)", if there exist constants c_0 and x_0 such that |f(x)| ≤ c_0 |h(x)| for all x > x_0
  (an upper bound: f(x) grows no worse than h(x) for sufficiently large x!)

Example: f(x) = 2 + 3x + 4x², g(x) = x², and f(x) = O(x²)
• "big oh" is not unique: f(x) = O(x²), f(x) = O(x³), f(x) = O(x⁴)
• "big theta" notation: f(x) = Θ(h(x)) if there exist constants x_0, c_1, c_2 such that for all x > x_0:
  0 ≤ c_1 h(x) ≤ f(x) ≤ c_2 h(x)

For the example above, f(x) = Θ(x²) but f(x) ≠ Θ(x³)

Complexity of the ML Estimation

• Gaussian case in d dimensions, with n training samples for each of c classes
• For each category, we have to compute the discriminant function

g(\mathbf{x}) = -\frac{1}{2}(\mathbf{x} - \hat{\boldsymbol{\mu}})^t\, \hat{\boldsymbol{\Sigma}}^{-1} (\mathbf{x} - \hat{\boldsymbol{\mu}}) - \frac{d}{2}\ln 2\pi - \frac{1}{2}\ln|\hat{\boldsymbol{\Sigma}}| + \ln P(\omega)

with costs O(d·n) for \hat{\boldsymbol{\mu}}, O(d²·n) for \hat{\boldsymbol{\Sigma}}, O(n) for P(ω), and O(1) for the constant term.

Total = O(d²·n)
Total for c classes = O(c·d²·n) ≈ O(d²·n)

• The cost increases when d and n are large!
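A rough sketch of the corresponding computation is given below; the training data are made-up, the function names are mine, and the complexity notes in the comments simply mirror the annotations above.

```python
import numpy as np

def train_gaussian(X, prior):
    """ML estimates for one class: the mean costs O(d n), the covariance O(d^2 n)."""
    mu = X.mean(axis=0)                              # O(d n)
    Sigma = np.cov(X, rowvar=False, bias=True)       # O(d^2 n)  (ML covariance estimate)
    return mu, np.linalg.inv(Sigma), np.linalg.slogdet(Sigma)[1], np.log(prior)

def g(x, mu, Sigma_inv, logdet, logprior):
    """Discriminant g(x); evaluating it is O(d^2) per test point."""
    d = len(mu)
    diff = x - mu
    return -0.5 * diff @ Sigma_inv @ diff - 0.5 * d * np.log(2 * np.pi) - 0.5 * logdet + logprior

# Made-up data: n = 200 samples in d = 3 dimensions for one class with prior 0.5.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
params = train_gaussian(X, prior=0.5)
print(g(np.zeros(3), *params))
```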
Overfitting

• Dimensionality of the model vs. size of the training data
• Issue: not enough data to support the model
• Possible solutions:
  • Reduce the model dimensionality
  • Make (possibly incorrect) assumptions to better estimate Σ
Overfitting

• Ways to estimate Σ better:
  • use data pooled from all classes (beware of normalization issues)
  • use the pseudo-Bayesian form λΣ0 + (1 − λ)Σn
  • "doctor" Σ by thresholding its entries (reduces chance correlations)
  • assume statistical independence (zero all off-diagonal elements)
Shrinkage

• Shrinkage: a weighted combination of the common and the individual covariances,

\boldsymbol{\Sigma}_i(\alpha) = \frac{(1-\alpha)\, n_i \boldsymbol{\Sigma}_i + \alpha\, n \boldsymbol{\Sigma}}{(1-\alpha)\, n_i + \alpha\, n} \quad \text{for } 0 < \alpha < 1

• We can also shrink the estimated common covariance toward the identity matrix:

\boldsymbol{\Sigma}(\beta) = (1-\beta)\, \boldsymbol{\Sigma} + \beta \mathbf{I} \quad \text{for } 0 < \beta < 1
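A minimal sketch of both shrinkage formulas; the example covariances and sample counts are invented for illustration.

```python
import numpy as np

def shrink_to_common(Sigma_i, n_i, Sigma, n, alpha):
    """Sigma_i(alpha) = [(1-alpha) n_i Sigma_i + alpha n Sigma] / [(1-alpha) n_i + alpha n]."""
    return ((1 - alpha) * n_i * Sigma_i + alpha * n * Sigma) / ((1 - alpha) * n_i + alpha * n)

def shrink_to_identity(Sigma, beta):
    """Sigma(beta) = (1-beta) Sigma + beta I."""
    return (1 - beta) * Sigma + beta * np.eye(Sigma.shape[0])

# Made-up 2x2 example: a noisy class-specific estimate shrunk toward a pooled estimate.
Sigma_i = np.array([[2.0, 0.8], [0.8, 1.0]])   # class-specific estimate from n_i samples
Sigma   = np.array([[1.5, 0.2], [0.2, 1.2]])   # pooled (common) estimate from n samples
print(shrink_to_common(Sigma_i, n_i=20, Sigma=Sigma, n=200, alpha=0.3))
print(shrink_to_identity(Sigma, beta=0.1))
```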

Component Analysis and Discriminants

• Combine features in order to reduce the dimension of the feature space
• Linear combinations are simple to compute and tractable
• Project high-dimensional data onto a lower-dimensional space
• Two classical approaches for finding the "optimal" linear transformation:
  • PCA (Principal Component Analysis): the projection that best represents the data in a least-squares sense
  • MDA (Multiple Discriminant Analysis): the projection that best separates the data in a least-squares sense
PCA (from notes)
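Since the PCA derivation itself is left to the notes, here is a minimal sketch of the projection step described on the previous slide; the data, the function name and the choice of k are illustrative assumptions.

```python
import numpy as np

def pca_project(X, k):
    """Project the rows of X onto the k eigenvectors of the scatter matrix with the
    largest eigenvalues (the least-squares-best linear representation of the data)."""
    Xc = X - X.mean(axis=0)                        # center the data
    scatter = Xc.T @ Xc                            # scatter matrix (proportional to the covariance)
    eigvals, eigvecs = np.linalg.eigh(scatter)     # eigenvalues in ascending order
    W = eigvecs[:, -k:][:, ::-1]                   # top-k principal directions
    return Xc @ W                                  # lower-dimensional representation

# Made-up 3-D data projected to 2-D.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3)) @ np.diag([3.0, 1.0, 0.2])
print(pca_project(X, k=2).shape)    # (100, 2)
```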

Hidden Markov Models: Markov Chains

• Goal: make a sequence of decisions
• Processes that unfold in time: the state at time t is influenced by the state at time t − 1
• Applications: speech recognition, gesture recognition, parts-of-speech tagging, DNA sequencing, …
• Any temporal process without memory:
  ω^T = {ω(1), ω(2), ω(3), …, ω(T)} is a sequence of states;
  we might have, for example, ω^6 = {ω1, ω4, ω2, ω2, ω1, ω4}
• The system can revisit a state at different steps, and not every state need be visited
First-order Markov models

• The production of any sequence is described by the transition probabilities

P(\omega_j(t+1) \mid \omega_i(t)) = a_{ij}
θ = (a_{ij}, ω^T)

P(\omega^T \mid \theta) = a_{14}\, a_{42}\, a_{22}\, a_{21}\, a_{14} \cdot P(\omega(1) = \omega_i)

Example: speech recognition ("production of spoken words")
• The production of the word "pattern" is represented by the phonemes /p/ /a/ /tt/ /er/ /n/, followed by a silent state
• Transitions from /p/ to /a/, /a/ to /tt/, /tt/ to /er/, /er/ to /n/, and /n/ to the silent state
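As a sketch of this computation, the snippet below multiplies transition probabilities along a state sequence for a first-order Markov chain; the transition matrix, initial distribution and sequence are invented, not the phoneme model from the slide.

```python
import numpy as np

def sequence_prob(seq, A, initial):
    """P(omega^T | theta) = P(omega(1)) * product over t of a_{seq[t-1], seq[t]}."""
    p = initial[seq[0]]
    for prev, nxt in zip(seq[:-1], seq[1:]):
        p *= A[prev, nxt]                  # transition probability a_ij
    return p

# Made-up 4-state chain; states are indexed 0..3 and each row of A sums to 1.
A = np.array([[0.1, 0.4, 0.2, 0.3],
              [0.3, 0.3, 0.2, 0.2],
              [0.2, 0.2, 0.5, 0.1],
              [0.4, 0.1, 0.1, 0.4]])
initial = np.array([0.25, 0.25, 0.25, 0.25])
print(sequence_prob([0, 3, 1, 1, 0, 3], A, initial))
```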