
Discriminative Training
and
Machine Learning Approaches
Chih-Pin Liao
Machine Learning Lab, Dept. of CSIE, NCKU
Discriminative Training
Our Concerns
- Feature extraction and HMM modeling should be jointly performed. A common objective function should be considered.
- To alleviate model confusion and improve recognition performance, we should estimate the HMM using a discriminative criterion built on statistical theory.
- Model parameters should be calculated rapidly, without applying a descent algorithm.
Minimum Classification Error (MCE)
- MCE is a popular discriminative training algorithm, developed for speech recognition and extended to other pattern recognition applications.
- Rather than maximizing the likelihood of the observed data, MCE aims to directly minimize the classification error.
- A gradient descent algorithm was used to estimate the HMM parameters.
MCE Training Procedure
- Procedure for training discriminative models $\{\Lambda_j\}$ using observations $X$:
- Discriminant function
$$ g(X, \Lambda_j) = \log P(X \mid \Lambda_j) $$
- Anti-discriminant function
$$ G(X, \Lambda_j) = \log P(X \mid \bar{\Lambda}_j) = \frac{1}{C-1} \sum_{c \neq j} \log P(X \mid \Lambda_c) $$
- Misclassification measure
$$ d(X, \Lambda_j) = -g(X, \Lambda_j) + G(X, \Lambda_j) $$
Expected Loss
- The loss function is calculated by mapping $d(X, \Lambda_j)$ into a range between zero and one through a sigmoid function:
$$ \ell(X, \Lambda_j) = \frac{1}{1 + \exp(-d(X, \Lambda_j))} $$
- Minimize the expected loss, or classification error, to find the discriminative model:
$$ \hat{\Lambda} = \arg\min_{\Lambda} E_X[\ell(X, \Lambda)] = \arg\min_{\Lambda} E_X\Big[ \sum_{j=1}^{C} \ell(X, \Lambda_j)\, \mathbb{1}(X \in \text{class } j) \Big] $$
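As a small illustration of the quantities above, the sketch below computes the discriminant, anti-discriminant, misclassification measure, and sigmoid loss from a vector of per-class log-likelihoods. The function and variable names are illustrative only and do not come from the slides; the competing classes are combined by a simple average, matching the anti-discriminant defined on the previous slide.

```python
import numpy as np

def mce_loss(log_likelihoods, target):
    """Sigmoid MCE loss for one observation sequence X (names are illustrative).

    log_likelihoods : array of log P(X | Lambda_j) for the C class models
    target          : index j of the correct class
    """
    g = log_likelihoods[target]                 # discriminant g(X, Lambda_j)
    others = np.delete(log_likelihoods, target)
    G = others.mean()                           # anti-discriminant: average of competing log-likelihoods
    d = -g + G                                  # misclassification measure d(X, Lambda_j)
    return 1.0 / (1.0 + np.exp(-d))             # sigmoid loss in (0, 1)

# toy usage: the correct class has the highest log-likelihood, so the loss is small
print(mce_loss(np.array([-120.0, -150.0, -160.0]), target=0))
```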
Hypothesis Test
Likelihood Ratio Test
- A new training criterion was derived from hypothesis testing theory.
- We test a null hypothesis against an alternative hypothesis.
- The optimal solution is obtained by a likelihood ratio test, according to the Neyman-Pearson lemma:
$$ LR = \frac{P(X \mid H_0)}{P(X \mid H_1)} $$
- A higher likelihood ratio implies stronger confidence towards accepting the null hypothesis.
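A minimal sketch of this decision rule follows, working in the log domain for numerical stability. The threshold tau is an assumed knob; the slide itself only states that a larger ratio favors the null hypothesis.

```python
def log_likelihood_ratio(log_p_h0, log_p_h1):
    """log LR = log P(X | H0) - log P(X | H1); the log domain avoids underflow."""
    return log_p_h0 - log_p_h1

# accept H0 when the ratio exceeds a threshold tau; tau is an assumed knob,
# not something specified on the slide
tau = 0.0
llr = log_likelihood_ratio(-100.0, -110.0)
print("accept H0" if llr > tau else "accept H1")
```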
Hypotheses in HMM Training
- Null and alternative hypotheses:
  H0: observations X are drawn from target HMM state j
  H1: observations X are not drawn from target HMM state j
- We develop discriminative HMM parameters for the target state against the non-target states.
- The problem turns into verifying the goodness of the data alignment to the corresponding HMM states.
Maximum Confidence
Hidden Markov Model
Maximum Confidence HMM
- The MCHMM is estimated by maximizing the log-likelihood ratio, or confidence measure,
$$ \Lambda_{MC} = \arg\max_{\Lambda} LLR(X \mid \Lambda) = \arg\max_{\Lambda} \big[ \log P(X \mid \Lambda) - \log P(X \mid \bar{\Lambda}) \big] $$
- where the parameter set consists of the HMM parameters and a transformation matrix
$$ \Lambda = \{\omega_{jk}, \mu_{jk}, \Sigma_{jk}, W\} $$
Hybrid Parameter Estimation
- The expectation-maximization (EM) algorithm is applied to tackle the missing-data problem for maximum confidence estimation.
- E-step:
$$ Q(\Lambda' \mid \Lambda) = E_S\big[\, LLR(X, S \mid \Lambda') \mid X, \Lambda \,\big] = \sum_S P(S \mid X, \Lambda)\, LLR(X, S \mid \Lambda') $$
$$ = \sum_{t=1}^{T} \sum_{j=1}^{C} P(s_t = j \mid X, \Lambda) \Big[ \log P(x_t \mid \Lambda'_j) - \frac{1}{C-1} \sum_{c \neq j} \log P(x_t \mid \Lambda'_c) \Big] $$
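The following sketch evaluates this auxiliary function given state posteriors and per-frame log-likelihoods that are assumed to be computed elsewhere (e.g., by a forward-backward pass); the array layouts and names are illustrative.

```python
import numpy as np

def auxiliary_llr(post, frame_loglik):
    """Posterior-weighted LLR auxiliary function of the E-step (array layouts are assumptions).

    post         : (T, C) array of P(s_t = j | X, Lambda) from the forward-backward pass
    frame_loglik : (T, C) array of log P(x_t | Lambda'_j) per frame and state
    """
    C = frame_loglik.shape[1]
    total = frame_loglik.sum(axis=1, keepdims=True)
    competing = (total - frame_loglik) / (C - 1)     # average log-likelihood of the other states
    return float(np.sum(post * (frame_loglik - competing)))

# toy usage with random numbers, just to show the shapes involved
rng = np.random.default_rng(0)
print(auxiliary_llr(rng.dirichlet(np.ones(3), size=5), rng.normal(size=(5, 3))))
```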
Expectation Function
- Written out for Gaussian mixture state densities of the transformed features, the expectation function becomes (here $\gamma_t(j,k)$ denotes the posterior probability of state $j$ and mixture component $k$ at frame $t$):
$$ Q(\Lambda' \mid \Lambda) = \sum_{t=1}^{T} \sum_{j=1}^{C} \sum_{k=1}^{K} \gamma_t(j,k) \Big\{ \Big[ \log \omega'_{jk} + \log|W'| - \tfrac{d}{2}\log 2\pi - \tfrac{1}{2}\log|\Sigma'_{jk}| - \tfrac{1}{2}(W' x_t - \mu'_{jk})^T (\Sigma'_{jk})^{-1} (W' x_t - \mu'_{jk}) \Big] $$
$$ \qquad - \frac{1}{C-1} \sum_{c \neq j} \Big[ \log \omega'_{ck} + \log|W'| - \tfrac{d}{2}\log 2\pi - \tfrac{1}{2}\log|\Sigma'_{ck}| - \tfrac{1}{2}(W' x_t - \mu'_{ck})^T (\Sigma'_{ck})^{-1} (W' x_t - \mu'_{ck}) \Big] \Big\} $$
$$ = Q_\omega(\{\omega'_{jk}\} \mid \{\omega_{jk}\}) + Q_g(\{\mu'_{jk}, \Sigma'_{jk}, W'\} \mid \{\mu_{jk}, \Sigma_{jk}, W\}) $$
MC Estimates of HMM Parameters
$$ \omega_{jk} = \frac{\displaystyle \sum_{t=1}^{T} \gamma_t(j,k) - \frac{1}{C-1} \sum_{t=1}^{T} \sum_{c \neq j} \gamma_t(c,k)}{\displaystyle \sum_{t=1}^{T} \sum_{k=1}^{K} \gamma_t(j,k) - \frac{1}{C-1} \sum_{t=1}^{T} \sum_{k=1}^{K} \sum_{c \neq j} \gamma_t(c,k)} $$

$$ \mu_{jk} = W \cdot \frac{\displaystyle \sum_{t=1}^{T} \gamma_t(j,k)\, x_t - \frac{1}{C-1} \sum_{t=1}^{T} \sum_{c \neq j} \gamma_t(c,k)\, x_t}{\displaystyle \sum_{t=1}^{T} \gamma_t(j,k) - \frac{1}{C-1} \sum_{t=1}^{T} \sum_{c \neq j} \gamma_t(c,k)} $$
MC Estimates of HMM Parameters
$$ \Sigma_{jk} = \frac{\displaystyle W \Big[ \sum_{t=1}^{T} \gamma_t(j,k)\,(x_t - \tilde{\mu}_{jk})(x_t - \tilde{\mu}_{jk})^T - \frac{1}{C-1} \sum_{t=1}^{T} \sum_{c \neq j} \gamma_t(c,k)\,(x_t - \tilde{\mu}_{ck})(x_t - \tilde{\mu}_{ck})^T \Big] W^T}{\displaystyle \sum_{t=1}^{T} \gamma_t(j,k) - \frac{1}{C-1} \sum_{t=1}^{T} \sum_{c \neq j} \gamma_t(c,k)} $$

- where $\tilde{\mu}_{jk}$ denotes the corresponding mean in the original (untransformed) feature space, i.e., $\mu_{jk} = W \tilde{\mu}_{jk}$.
MC Estimate of Transformation Matrix
- The transformation matrix is estimated by generalized probabilistic descent (GPD) updates
$$ W^{(i+1)} = W^{(i)} - \epsilon\, \frac{\partial Q_g(W^{(i)} \mid W)}{\partial W^{(i)}} $$
- where the gradient $\partial Q_g(W^{(i)} \mid W) / \partial W^{(i)}$ accumulates, over all states $j = 1,\dots,C$ and mixture components $k = 1,\dots,K$, terms involving the current $W^{(i)}$ and the covariances $\Sigma_{jk}$.
(Flowchart: overall MC training procedure.) Training features are extracted from face images and uniformly segmented to estimate the initial HMM parameters. The transformation matrix W is initialized and repeatedly updated by the GPD rule $W^{(t+1)} = W^{(t)} - \epsilon\, \partial Q(W' \mid W)/\partial W$ until W converges; features are then extracted with the estimated W and the HMM parameters are transformed with W. Viterbi decoding, HMM parameter re-estimation, and re-estimation of W with the GPD algorithm are repeated until overall convergence, yielding the MC-based HMM parameters.
MC Classification Rule
- Let Y denote an input test image. We apply the same criterion to identify the most likely category corresponding to Y:
$$ c_{MC} = \arg\max_{c} LLR(Y \mid \Lambda_c) $$
Summary
- A new maximum confidence HMM framework was proposed.
- The hypothesis testing principle was used to build the training criterion.
- Discriminative feature extraction and HMM modeling were performed under the same criterion.

Reference: Jen-Tzung Chien and Chih-Pin Liao, "Maximum Confidence Hidden Markov Modeling for Face Recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 30, no. 4, pp. 606-616, April 2008.
Machine Learning Approaches
Introduction
- Conditional random fields (CRFs)
  - relax the usual conditional independence assumption of the likelihood model
  - enforce the homogeneity of the labeling variables conditioned on the observation
- Owing to the weak assumptions of the CRF model and its discriminative nature, a CRF
  - allows arbitrary relationships among the data
  - may require fewer resources to train its parameters
- CRF models have shown better performance than hidden Markov models (HMMs) and maximum entropy Markov models (MEMMs) in
  - language and text processing problems
  - object recognition problems
  - image and video segmentation
  - tracking problems in video sequences
Generative & Discriminative Model
Two Classes of Models
- Generative model (HMM)
  - model the distribution of states
$$ \hat{S} = \arg\max_{S} p(S \mid X), \qquad P(S \mid X) \propto P(X \mid S) $$
  (Graphical structure of the HMM over states $s_{t-1}, s_t, s_{t+1}$ and observations $x_{t-1}, x_t, x_{t+1}$.)
- Direct model (MEMM and CRF)
  - model the posterior probability directly
  (Graphical structures of the MEMM and CRF over the same states and observations.)
Comparisons of Two Kinds of Model
- Generative model – HMM
  - uses a Bayes rule approximation
  - assumes that observations are independent
  - multiple overlapping features are not modeled
  - the model is estimated through the recursive Viterbi algorithm
$$ \alpha_{t+1}(s) = \max_{s' \in S} \big[ \alpha_t(s')\, P(s \mid s')\, P(x_{t+1} \mid s) \big] $$
- Direct model – MEMM and CRF
  - direct modeling of the posterior probability
  - dependencies among observations are flexibly modeled
  - the model is estimated through the recursive Viterbi algorithm
$$ \alpha_{t+1}(s) = \max_{s' \in S} \big[ \alpha_t(s')\, P(s \mid s', x_{t+1}) \big] $$
Hidden Markov Model &
Maximum Entropy Markov Model
HMM for Human Motion Recognition
(Graphical structure of the HMM over states $s_{t-1}, s_t, s_{t+1}$ and observations $x_{t-1}, x_t, x_{t+1}$.)
- The HMM is defined by
  - the transition probability $p(s_t \mid s_{t-1})$
  - the observation probability $p(x_t \mid s_t)$
Maximum Entropy Markov Model
(Graphical structure of the MEMM over states $s_{t-1}, s_t, s_{t+1}$ and observations $x_{t-1}, x_t, x_{t+1}$.)
- The MEMM is defined by
  - $p(s_t \mid s_{t-1}, x_t)$, which replaces the transition and observation probabilities of the HMM
Maximum Entropy Criterion
- Definition of the feature functions
$$ f_{\langle b, s \rangle}(c_t, s_t) = \begin{cases} 1, & \text{if } b(c_t) = 1 \text{ and } s_t = s \\ 0, & \text{otherwise} \end{cases} $$
- where $c_t = \{x_t, s_{t-1}\}$
- Constrained optimization problem
$$ \forall f_i: \quad E[f_i] = \tilde{E}[f_i] $$
- where
$$ E[f_i] = \sum_{c \in \mathcal{C},\, s \in V} p(s \mid c)\, \tilde{p}(c)\, f_i(c, s) \quad \text{(model expectation)} $$
$$ \tilde{E}[f_i] = \sum_{c \in \mathcal{C},\, s \in V} \tilde{p}(c, s)\, f_i(c, s) = \frac{1}{N} \sum_{j=1}^{N} f_i(c_j, s_j) \quad \text{(empirical expectation)} $$
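The constraint equates the empirical and model expectations of each feature function. The sketch below computes both for binary feature functions, approximating the context distribution by the observed contexts; all names are illustrative assumptions.

```python
import numpy as np

def feature_expectations(contexts, states, p_model, features):
    """Empirical and model expectations of binary feature functions (names are illustrative).

    contexts : list of observed contexts c_j = (x_j, previous state)
    states   : list of observed states s_j, aligned with contexts
    p_model  : callable p_model(s, c) returning the model probability p(s | c)
    features : list of callables f_i(c, s) returning 0 or 1
    """
    N = len(contexts)
    state_set = sorted(set(states))
    empirical = np.array([sum(f(c, s) for c, s in zip(contexts, states)) / N
                          for f in features])
    model = np.array([sum(p_model(s, c) * f(c, s)
                          for c in contexts for s in state_set) / N
                      for f in features])
    return empirical, model

# toy usage: a single feature that fires when the state equals "A"
ctx = ["c1", "c2", "c1"]
sts = ["A", "B", "A"]
uniform = lambda s, c: 0.5                  # a dummy model that ignores the context
fires_on_A = lambda c, s: 1.0 if s == "A" else 0.0
print(feature_expectations(ctx, sts, uniform, [fires_on_A]))
```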
Solution of MEMM
- Lagrange multipliers are used for the constrained optimization
$$ \Lambda\big(p(s \mid c), \lambda\big) = H\big(p(s \mid c)\big) + \sum_i \lambda_i \big( E[f_i] - \tilde{E}[f_i] \big) $$
- where $\{\lambda_i\}$ are the model parameters and
$$ H\big(p(s \mid c)\big) = - \sum_{c \in \mathcal{C},\, s \in V} \tilde{p}(c)\, p(s \mid c) \log p(s \mid c) $$
- The solution is obtained as
$$ p(s \mid c) = \frac{1}{Z(c)} \exp\Big( \sum_i \lambda_i f_i(c, s) \Big) = \frac{\exp\big( \sum_i \lambda_i f_i(c, s) \big)}{\sum_{s' \in S} \exp\big( \sum_j \lambda_j f_j(c, s') \big)} $$
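A small sketch of this exponential-form solution: the score over the feature functions is computed for every candidate state and normalized by Z(c) via a softmax. Names and signatures are illustrative.

```python
import numpy as np

def memm_posterior(lam, feats, c, states):
    """The maximum entropy solution p(s | c) from this slide (names are illustrative).

    lam    : (F,) weight vector {lambda_i}
    feats  : list of callables f_i(c, s)
    c      : the context {x_t, s_{t-1}}
    states : candidate states s
    """
    scores = np.array([sum(l * f(c, s) for l, f in zip(lam, feats)) for s in states])
    w = np.exp(scores - scores.max())       # numerically stable softmax
    return w / w.sum()                      # the division plays the role of 1 / Z(c)

# toy usage with a single indicator feature
feats = [lambda c, s: 1.0 if s == "A" else 0.0]
print(memm_posterior(np.array([0.8]), feats, c=("x1", "B"), states=["A", "B"]))
```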
GIS Algorithm
- Optimize the maximum mutual information (MMI) criterion
- Step 1: Calculate the empirical expectation
$$ \tilde{E}[f_i] = \frac{1}{N} \sum_{j=1}^{N} f_i(c_j, s_j) $$
- Step 2: Start from an initial value $\lambda_i^{(0)} = 1$
- Step 3: Calculate the model expectation
$$ E[f_i] = \frac{1}{N} \sum_{c \in \mathcal{C},\, s \in V} p(s \mid c)\, f_i(c, s) $$
- Step 4: Update the model parameters
$$ \lambda_i^{(\mathrm{new})} = \lambda_i^{(\mathrm{current})} + \epsilon \cdot \log\!\Big( \frac{\tilde{E}[f_i]}{E[f_i]^{(\mathrm{current})}} \Big) $$
- Repeat steps 3 and 4 until convergence.
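A compact sketch of steps 1-4. The model-expectation computation is abstracted behind a callable, and the toy example uses a single indicator feature in a two-state model; this setup is assumed for illustration rather than being the slide's exact recipe.

```python
import numpy as np

def gis_updates(empirical, model_expectation_fn, n_iters=50, eps=1.0):
    """Iterative scaling sketch following steps 1-4 of this slide (names are illustrative).

    empirical            : (F,) array of empirical expectations from step 1
    model_expectation_fn : callable mapping the current weights to the (F,) array of
                           model expectations (step 3)
    """
    lam = np.ones_like(empirical)                    # step 2: lambda_i^(0) = 1
    for _ in range(n_iters):
        model = model_expectation_fn(lam)            # step 3
        lam = lam + eps * np.log(empirical / model)  # step 4
    return lam

# toy usage: one binary feature that fires for state "A" in a two-state model, where the
# model expectation is simply the softmax probability assigned to "A"
prob_A = lambda lam: np.array([np.exp(lam[0]) / (np.exp(lam[0]) + 1.0)])
print(gis_updates(np.array([0.7]), prob_A))          # converges near log(0.7 / 0.3)
```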
Conditional Random Field
Conditional Random Field
- Definition: Let $G = (V, E)$ be a graph such that $S = (S_v),\, v \in V$, so that $S$ is indexed by the vertices of $G$. When conditioned on $X$, the random variables $S_v$ obey the Markov property
$$ p(S_v \mid X, S_w, w \neq v) = p(S_v \mid X, S_w, w \sim v) $$
- Then $(X, S)$ is a conditional random field.
(Linear-chain CRF over states $s_{t-1}, s_t, s_{t+1}$ and observations $x_{t-1}, x_t, x_{t+1}$.)
CRF Model Parameters
- The undirected graphical structure can be used to factorize $p(S \mid X)$ into a normalized product of potential functions.
- Considering the graph as a linear-chain structure,
$$ p(S \mid X, \theta) \propto \exp\Big( \sum_{e \in E,\, i} \lambda_i f_i(e, S_e, X) + \sum_{v \in V,\, j} \mu_j g_j(v, S_v, X) \Big) $$
- Model parameter set: $\theta = \{\lambda_1, \lambda_2, \dots;\ \mu_1, \mu_2, \dots\}$
- Feature function set: $\{f_1, f_2, \dots;\ g_1, g_2, \dots\}$
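The sketch below evaluates the unnormalized log score of one labeling under this linear-chain factorization, with one list of edge feature functions and one list of vertex feature functions; the function name, signatures, and toy features are illustrative assumptions.

```python
import numpy as np

def chain_score(S, X, edge_feats, node_feats, lam, mu):
    """Unnormalized log score of a labeling S under a linear-chain CRF (names are illustrative).

    edge_feats : callables f_i(s_prev, s_curr, X) defined on edges
    node_feats : callables g_j(t, s_t, X) defined on vertices
    lam, mu    : weights for the edge and vertex feature functions
    """
    score = 0.0
    for t in range(len(S)):
        score += sum(m * g(t, S[t], X) for m, g in zip(mu, node_feats))
        if t > 0:
            score += sum(l * f(S[t - 1], S[t], X) for l, f in zip(lam, edge_feats))
    return score                              # p(S | X, theta) is proportional to exp(score)

# toy usage: one transition feature and one emission feature
f_same = [lambda sp, sc, X: 1.0 if sp == sc else 0.0]
g_sign = [lambda t, s, X: 1.0 if s == (X[t] > 0) else 0.0]
print(chain_score([True, True, False], [0.3, 1.2, -0.5], f_same, g_sign,
                  np.array([0.5]), np.array([1.0])))
```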
CRF Parameter Estimation
- We can rewrite and maximize the posterior probability
$$ p(S \mid X, \theta) = \frac{1}{Z(X)} \exp\Big( \sum_k \theta_k F_k(S, X) \Big) $$
- where $\{\theta_1, \theta_2, \dots\} = \{\lambda_1, \lambda_2, \dots;\ \mu_1, \mu_2, \dots\}$ and $\{F_1, F_2, \dots\} = \{f_1, f_2, \dots;\ g_1, g_2, \dots\}$
- The log posterior probability over the training samples is
$$ L(\theta) = \sum_k \Big[ \log \frac{1}{Z(X^{(k)})} + \sum_j \theta_j F_j(S^{(k)}, X^{(k)}) \Big] $$
Parameter Updating by GIS Algorithm
- Differentiating the log posterior probability with respect to the parameter $\theta_j$,
$$ \frac{\partial L(\theta)}{\partial \theta_j} = E_{\tilde{p}(S, X)}\big[ F_j(S, X) \big] - \sum_k E_{p(S \mid X^{(k)}, \theta)}\big[ F_j(S, X^{(k)}) \big] $$
- Setting this derivative to zero yields the constraint of the maximum entropy model.
- This estimation has no closed-form solution; we can use the GIS algorithm.
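The following sketch computes this gradient literally, as the empirical feature counts minus the model-expected counts, by brute-force enumeration of all label sequences (so it is only feasible for tiny toy chains); in practice the model expectation would be computed with dynamic programming. All names are illustrative.

```python
import itertools
import numpy as np

def crf_gradient(x_seqs, s_seqs, labels, features, theta):
    """Gradient of the CRF log posterior as empirical minus model feature expectations.

    x_seqs, s_seqs : lists of observation sequences and their label sequences
    labels         : the label alphabet
    features       : callables F_k(S, X) returning a global feature count
    theta          : (K,) weight vector
    """
    grad = np.zeros_like(theta)
    for X, S in zip(x_seqs, s_seqs):
        grad += np.array([F(S, X) for F in features])             # empirical term
        all_S = list(itertools.product(labels, repeat=len(X)))
        scores = np.array([[F(Sp, X) for F in features] for Sp in all_S])
        log_w = scores @ theta
        p = np.exp(log_w - log_w.max())
        p /= p.sum()                                              # p(S' | X, theta)
        grad -= p @ scores                                        # model expectation term
    return grad

# toy usage: one feature counting how often the label matches the sign of the observation
count_match = [lambda S, X: float(sum(s == (x > 0) for s, x in zip(S, X)))]
print(crf_gradient([[0.5, -1.0, 2.0]], [[True, False, True]],
                   [False, True], count_match, np.zeros(1)))
```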
Difference

                                             CRF                                      MEMM
  Objective function                         Max. posterior probability with          Max. entropy under constraints
                                             a Gibbs distribution
  Complexity of calculating the              Full: O(|s|^N); DP: O(|s|^2 N);          O(|s| N)
  normalization term                         N-best: O(k); top one: O(1)
  Inference in the model                     p(S | X)                                 p(s_t | s_{t-1}, x_t)

Similarity
  - Feature function: state & observation, state & state
  - Parameter: weight of each feature function
  - Distribution: Gibbs distribution
Summary and Future works
- We construct a complex CRF with cycles for better modeling of contextual dependency; a graphical model algorithm is applied.
- In the future, a variational inference algorithm will be developed to improve the calculation of the conditional probability.
- The posterior probability can then be calculated directly by an approximating approach.

Reference: Chih-Pin Liao and Jen-Tzung Chien, "Graphical Modeling of Conditional Random Fields for Human Motion Recognition," in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2008, pp. 1969-1972.
Thanks for your attention
and
Discussion