Discriminative Training
and
Machine Learning Approaches
Chih-Pin Liao
Machine Learning Lab, Dept. of CSIE, NCKU
Discriminative Training
Our Concerns
Feature extraction and HMM modeling should be jointly performed.
A common objective function should be considered.
To alleviate model confusion and improve recognition performance, we should estimate HMMs using a discriminative criterion built from statistical theory.
Model parameters should be calculated rapidly, without applying a descent algorithm.
Minimum Classification Error (MCE)
MCE is a popular discriminative training algorithm developed for speech recognition and extended to other pattern recognition (PR) applications.
Rather than maximizing the likelihood of the observed data, MCE aims to directly minimize classification errors.
A gradient descent algorithm is used to estimate the HMM parameters.
MCE Training Procedure
Procedure for training discriminative models $\{\lambda_j\}$ using observations $X$:
Discriminant function
$g(X, \lambda_j) = \log P(X \mid \lambda_j)$
Anti-discriminant function
$G(X, \lambda_j) = \log\Big[\frac{1}{C-1}\sum_{c \neq j} P(X \mid \lambda_c)\Big]$
Misclassification measure
$d(X, \lambda_j) = -g(X, \lambda_j) + G(X, \lambda_j)$
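As a concrete illustration, here is a minimal Python sketch of these three quantities, assuming the per-class log-likelihoods $\log P(X \mid \lambda_c)$ have already been computed by the class HMMs (the function name and the numbers in the example are illustrative, not from the slides):

```python
import numpy as np

def mce_misclassification(log_likelihoods, j):
    """Misclassification measure d(X, lambda_j) = -g(X, lambda_j) + G(X, lambda_j).

    log_likelihoods : log P(X | lambda_c) for each of the C classes.
    j               : index of the target class.
    """
    C = len(log_likelihoods)
    g = log_likelihoods[j]                       # discriminant: log P(X | lambda_j)
    rivals = np.delete(log_likelihoods, j)       # competing classes c != j
    m = rivals.max()                             # log-sum-exp for numerical stability
    G = m + np.log(np.exp(rivals - m).sum()) - np.log(C - 1)
    return -g + G

# example with three classes, target class j = 0
print(mce_misclassification(np.array([-10.0, -14.0, -15.0]), j=0))  # negative -> likely correct
```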
Expected Loss
The loss function is obtained by mapping $d(X, \lambda_j)$ into a range between zero and one through a sigmoid function:
$\ell(X, \lambda_j) = \dfrac{1}{1 + \exp(-d(X, \lambda_j))}$
Minimize the expected loss, or classification error, to find the discriminative model:
$\hat{\Lambda} = \arg\min_{\Lambda} E_X[\ell(X, \Lambda)] = \arg\min_{\Lambda} E_X\Big[\sum_{j=1}^{C} \ell(X, \lambda_j)\, 1(X \in C_j)\Big]$
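A small sketch of the smoothed loss and its empirical average over a batch of misclassification measures; the slope parameter gamma is an assumed generalization, since the slide uses the plain sigmoid (gamma = 1), and the batch values are made up:

```python
import numpy as np

def mce_loss(d, gamma=1.0):
    """Smooth 0-1 loss l(X, lambda_j) = 1 / (1 + exp(-gamma * d(X, lambda_j)))."""
    return 1.0 / (1.0 + np.exp(-gamma * d))

# empirical estimate of the expected loss E_X[l(X, Lambda)]
d_batch = np.array([-3.2, 0.5, -1.1, 2.4])   # one d(X, lambda_j) per labelled observation
print(mce_loss(d_batch).mean())
```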
Hypothesis Test
Likelihood Ratio Test
A new training criterion is derived from hypothesis testing theory.
We test a null hypothesis against an alternative hypothesis.
The optimal solution is obtained by a likelihood ratio test according to the Neyman-Pearson lemma:
$LR = \dfrac{P(X \mid H_0)}{P(X \mid H_1)}$
A higher likelihood ratio implies stronger confidence towards accepting the null hypothesis.
Hypotheses in HMM Training
Null and alternative hypotheses:
$H_0$: observations $X$ are from the target HMM state $j$
$H_1$: observations $X$ are not from the target HMM state $j$
We develop discriminative HMM parameters for the target state against the non-target states.
The problem turns into verifying the goodness of the data alignment to the corresponding HMM states.
Maximum Confidence
Hidden Markov Model
Maximum Confidence HMM
The MCHMM is estimated by maximizing the log likelihood ratio, or confidence measure,
$\lambda_{MC} = \arg\max_{\lambda} LLR(X \mid \lambda) = \arg\max_{\lambda}\big[\log P(X \mid \lambda) - \log P(X \mid \bar{\lambda})\big]$
where $\bar{\lambda}$ denotes the competing (anti) model and the parameter set consists of the HMM parameters and the transformation matrix
$\lambda = \{\omega_{jk}, \mu_{jk}, \Sigma_{jk}, W\}$
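A hedged sketch of the confidence measure, assuming the per-class HMM log-likelihoods $\log P(X \mid \lambda_c)$ are supplied (e.g. by a forward pass that is not shown here); the anti-model likelihood is approximated by averaging the competing classes, matching the anti-discriminant function above:

```python
import numpy as np

def log_likelihood_ratio(log_lik, target):
    """LLR(X | lambda_target) = log P(X | lambda_target)
       - log[(1/(C-1)) * sum_{c != target} P(X | lambda_c)]."""
    C = len(log_lik)
    rivals = np.delete(log_lik, target)
    m = rivals.max()                              # log-sum-exp for numerical stability
    anti = m + np.log(np.exp(rivals - m).sum()) - np.log(C - 1)
    return log_lik[target] - anti

# illustrative log-likelihood values for three classes
print(log_likelihood_ratio(np.array([-120.3, -118.9, -131.0]), target=1))
```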
Hybrid Parameter Estimation
The expectation-maximization (EM) algorithm is applied to tackle the missing-data problem in maximum confidence estimation.
E-step
$Q(\lambda \mid \lambda') = E_S[LLR(X, S \mid \lambda) \mid X, \lambda'] = \sum_S P(S \mid X, \lambda')\, LLR(X, S \mid \lambda)$
$= \sum_{t=1}^{T}\sum_{j=1}^{C} P(s_t = j \mid X, \lambda')\Big[\log P(x_t \mid \lambda_j) - \log\Big(\frac{1}{C-1}\sum_{c \neq j} P(x_t \mid \lambda_c)\Big)\Big]$
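The E-step can be written compactly once the state posteriors and frame log-likelihoods are available; the sketch below assumes both are precomputed arrays and is only meant to mirror the sum above:

```python
import numpy as np

def q_function(gamma, frame_loglik):
    """Q(lambda | lambda') = sum_t sum_j gamma[t, j] *
       (log P(x_t | lambda_j) - log[(1/(C-1)) sum_{c != j} P(x_t | lambda_c)]).

    gamma        : (T, C) posteriors P(s_t = j | X, lambda') from the old model.
    frame_loglik : (T, C) frame log-likelihoods log P(x_t | lambda_j) under the new model.
    """
    T, C = frame_loglik.shape
    m = frame_loglik.max(axis=1, keepdims=True)
    total = np.exp(frame_loglik - m).sum(axis=1, keepdims=True)
    # log of the averaged competing likelihoods, for every (t, j)
    competing = m + np.log((total - np.exp(frame_loglik - m)) / (C - 1))
    return float(np.sum(gamma * (frame_loglik - competing)))
```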
Expectation Function
$Q(\lambda \mid \lambda') = \sum_{t=1}^{T}\sum_{j=1}^{C}\sum_{k=1}^{K} \gamma_t(j,k)\Big\{\Big[\log\omega_{jk} + \log|W| - \frac{d}{2}\log 2\pi - \frac{1}{2}\log|\Sigma_{jk}| - \frac{1}{2}(W x_t - \mu_{jk})^T \Sigma_{jk}^{-1}(W x_t - \mu_{jk})\Big]$
$\qquad - \frac{1}{C-1}\sum_{c \neq j}\Big[\log\omega_{ck} + \log|W| - \frac{d}{2}\log 2\pi - \frac{1}{2}\log|\Sigma_{ck}| - \frac{1}{2}(W x_t - \mu_{ck})^T \Sigma_{ck}^{-1}(W x_t - \mu_{ck})\Big]\Big\}$
$= Q_\omega(\{\omega_{jk}\} \mid \{\omega'_{jk}\}) + Q_g(\{\mu_{jk}, \Sigma_{jk}, W\} \mid \{\mu'_{jk}, \Sigma'_{jk}, W'\})$
MC Estimates of HMM Parameters
$\hat{\omega}_{jk} = \dfrac{\sum_{t=1}^{T}\Big[\gamma_t(j,k) - \frac{1}{C-1}\sum_{c \neq j}\gamma_t(c,k)\Big]}{\sum_{t=1}^{T}\sum_{k=1}^{K}\Big[\gamma_t(j,k) - \frac{1}{C-1}\sum_{c \neq j}\gamma_t(c,k)\Big]}$

$\hat{\mu}_{jk} = W'\,\dfrac{\sum_{t=1}^{T}\gamma_t(j,k)\, x_t - \frac{1}{C-1}\sum_{t=1}^{T}\sum_{c \neq j}\gamma_t(c,k)\, x_t}{\sum_{t=1}^{T}\gamma_t(j,k) - \frac{1}{C-1}\sum_{t=1}^{T}\sum_{c \neq j}\gamma_t(c,k)}$
MC Estimates of HMM Parameters
$\hat{\Sigma}_{jk} = W'\,\dfrac{\sum_{t=1}^{T}\gamma_t(j,k)(x_t - \mu_{jk})(x_t - \mu_{jk})^T - \frac{1}{C-1}\sum_{t=1}^{T}\sum_{c \neq j}\gamma_t(c,k)(x_t - \mu_{ck})(x_t - \mu_{ck})^T}{\sum_{t=1}^{T}\gamma_t(j,k) - \frac{1}{C-1}\sum_{t=1}^{T}\sum_{c \neq j}\gamma_t(c,k)}\,W'^T$
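A sketch of the mean re-estimate above, assuming the mixture occupancies gamma_t(c, k) and the current transformation W are given; the array layout is an assumption, and the covariance update follows the same pattern with outer products in place of x_t:

```python
import numpy as np

def mc_mean_update(gamma, X, W, j, k):
    """Maximum-confidence mean re-estimate for mixture k of class/state j.

    gamma : (T, C, K) occupancies gamma_t(c, k) from the E-step.
    X     : (T, d) observations x_t.
    W     : (d, d) current feature transformation.
    """
    T, C, K = gamma.shape
    target = gamma[:, j, k]                                    # gamma_t(j, k)
    rivals = np.delete(gamma[:, :, k], j, axis=1).sum(axis=1)  # sum_{c != j} gamma_t(c, k)
    weights = target - rivals / (C - 1)
    return (weights @ (X @ W.T)) / weights.sum()               # weighted mean of W x_t
```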
MC Estimate of Transformation Matrix
The transformation matrix is updated iteratively along the gradient of $Q_g$:
$W^{(i+1)} = W^{(i)} + \epsilon\,\dfrac{\partial Q_g(W^{(i)} \mid W')}{\partial W^{(i)}}$
where the gradient involves terms of the form $\sum_{j=1}^{C}\sum_{k=1}^{K} T_{jk}\, W^{(i)} \big(W^{(i)T} \Sigma_{jk} W^{(i)}\big)^{-1}$.
[Flowchart: MC-HMM training procedure]
1. Take training features from face images and apply uniform segmentation.
2. Estimate the initial HMM parameters by Viterbi decoding and initialize W.
3. Extract features from the observations with the estimated W, and update W with the GPD algorithm by maximizing Q(W | W') until W converges.
4. Transform the HMM parameters with W.
5. If the overall procedure has not converged, return to step 3; otherwise output the MC-based HMM parameters.
MC Classification Rule
Let $Y$ denote an input test image. We apply the same criterion to identify the most likely category corresponding to $Y$:
$c_{MC} = \arg\max_{c} LLR(Y \mid \lambda_c)$
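A short sketch of this decision rule over all classes, reusing the same averaged anti-model as before; the log-likelihood values in the example are made up:

```python
import numpy as np

def mc_classify(log_lik):
    """Return argmax_c LLR(Y | lambda_c) given per-class log-likelihoods of Y."""
    C = len(log_lik)
    llr = np.empty(C)
    for c in range(C):
        rivals = np.delete(log_lik, c)
        m = rivals.max()
        llr[c] = log_lik[c] - (m + np.log(np.exp(rivals - m).sum()) - np.log(C - 1))
    return int(np.argmax(llr))

print(mc_classify(np.array([-120.3, -118.9, -131.0])))  # -> 1
```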
Summary
A new maximum confidence HMM framework was proposed.
The hypothesis testing principle was used to build the training criterion.
Discriminative feature extraction and HMM modeling were performed under the same criterion.
J.-T. Chien and C.-P. Liao, "Maximum Confidence Hidden Markov Modeling for Face Recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 30, no. 4, pp. 606-616, April 2008.
Machine Learning Approaches
Introduction
Conditional Random Fields (CRFs)
relax the usual conditional independence assumption of the likelihood model
enforce the homogeneity of the labeling variables conditioned on the observation
Due to the weak assumptions of the CRF model and its discriminative nature, it
allows arbitrary relationships among the data
may require fewer resources to train its parameters
CRF models perform better than the Hidden Markov Model (HMM) and Maximum Entropy Markov Models (MEMMs) on
language and text processing problems
object recognition problems
image and video segmentation
tracking problems in video sequences
Generative & Discriminative Model
Two Classes of Models
Generative model (HMM)
- models the distribution of states
$\hat{S} = \arg\max_{S} p(S \mid X)$, where $P(S \mid X) \propto P(X \mid S)\,P(S)$
[Graphical model: HMM with states $s_{t-1}, s_t, s_{t+1}$ and observations $x_{t-1}, x_t, x_{t+1}$]
Direct model (MEMM and CRF)
- models the posterior probability directly
[Graphical models: MEMM (directed) and CRF (undirected) over states $s_{t-1}, s_t, s_{t+1}$ and observations $x_{t-1}, x_t, x_{t+1}$]
Comparison of the Two Kinds of Models
Generative model – HMM
Uses the Bayes rule approximation
Assumes that the observations are independent
Multiple overlapping features are not modeled
Inference uses a recursive Viterbi/forward algorithm:
$\alpha_{t+1}(s) = \Big[\sum_{s' \in S} \alpha_t(s')\, P(s \mid s')\Big] P(x_{t+1} \mid s)$
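For reference, a runnable toy version of this recursion (the summed, forward form shown above; replacing the sum with a max would give the Viterbi recursion). The transition, observation, and initial probabilities below are invented for the example:

```python
import numpy as np

def forward_recursion(pi, A, B, obs):
    """alpha_{t+1}(s) = [sum_{s'} alpha_t(s') P(s | s')] * P(x_{t+1} | s).

    pi  : (S,)   initial state probabilities
    A   : (S, S) transitions, A[s_prev, s] = P(s | s_prev)
    B   : (S, V) observation probabilities, B[s, x] = P(x | s)
    obs : sequence of observation symbol indices
    """
    alpha = pi * B[:, obs[0]]
    for x in obs[1:]:
        alpha = (alpha @ A) * B[:, x]   # one step of the recursion
    return alpha.sum()                  # P(X | model)

pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
print(forward_recursion(pi, A, B, [0, 1, 0]))
```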
Direct model – MEMM and CRF
The posterior probability is modeled directly
Dependencies among observations are flexibly modeled
Inference uses the same style of recursive algorithm:
$\alpha_{t+1}(s) = \sum_{s' \in S} \alpha_t(s')\, P(s \mid s', x_{t+1})$
Hidden Markov Model &
Maximum Entropy Markov Model
HMM for Human Motion Recognition
[Graphical model: HMM with states $s_{t-1}, s_t, s_{t+1}$ and observations $x_{t-1}, x_t, x_{t+1}$]
The HMM is defined by
the transition probability $p(s_t \mid s_{t-1})$
the observation probability $p(x_t \mid s_t)$
Maximum Entropy Markov Model
[Graphical model: MEMM with states $s_{t-1}, s_t, s_{t+1}$ and observations $x_{t-1}, x_t, x_{t+1}$]
The MEMM is defined by
$p(s_t \mid s_{t-1}, x_t)$, which replaces the transition and observation probabilities of the HMM
Maximum Entropy Criterion
Definition of the feature functions
$f_{b,s'}(c_t, s_t) = \begin{cases} 1, & \text{if } b(c_t) = 1 \text{ and } s_t = s' \\ 0, & \text{otherwise} \end{cases}$
where $c_t = \{x_t, s_{t-1}\}$
Constrained optimization problem
$\forall f_i:\; E[f_i] = \tilde{E}[f_i]$
where
$E[f_i] = \sum_{c \in C,\, s \in V} \tilde{p}(c)\, p(s \mid c)\, f_i(c, s)$ (model expectation)
$\tilde{E}[f_i] = \sum_{c \in C,\, s \in V} \tilde{p}(c, s)\, f_i(c, s) = \frac{1}{N}\sum_{j=1}^{N} f_i(c_j, s_j)$ (empirical expectation)
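A toy illustration of a binary feature function and its empirical expectation; the predicate, labels, and data tuples are made-up examples rather than anything from the slides:

```python
import numpy as np

def make_feature(b, target_state):
    """f_{b,s'}(c, s) = 1 if b(c) = 1 and s = s', else 0."""
    return lambda c, s: 1.0 if b(c) and s == target_state else 0.0

# context c_t = (x_t, s_{t-1}); example predicate: the previous state was "walk"
f = make_feature(lambda c: c[1] == "walk", target_state="run")

data = [(("frame1", "walk"), "run"),
        (("frame2", "walk"), "walk"),
        (("frame3", "run"),  "run")]

empirical = np.mean([f(c, s) for c, s in data])   # E~[f] = (1/N) sum_j f(c_j, s_j)
print(empirical)                                   # 1/3
```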
Solution of MEMM
Lagrange multipliers are used for the constrained optimization:
$\Lambda(p(s \mid c), \lambda) = H(p(s \mid c)) + \sum_i \lambda_i\big(E[f_i] - \tilde{E}[f_i]\big)$
where $\{\lambda_i\}$ are the model parameters and
$H(p(s \mid c)) = -\sum_{c \in C,\, s \in V} \tilde{p}(c)\, p(s \mid c)\log p(s \mid c)$
The solution is given by
$p(s \mid c) = \dfrac{1}{Z(c)}\exp\Big(\sum_i \lambda_i f_i(c, s)\Big), \qquad Z(c) = \sum_{s \in S}\exp\Big(\sum_j \lambda_j f_j(c, s)\Big)$
GIS Algorithm
Optimize the maximum mutual information (MMI) criterion.
Step 1: Calculate the empirical expectation
$\tilde{E}[f_i] = \frac{1}{N}\sum_{j=1}^{N} f_i(c_j, s_j)$
Step 2: Start from an initial value $\lambda_i^{(0)} = 1$
Step 3: Calculate the model expectation
$E[f_i] = \frac{1}{N}\sum_{c \in C,\, s \in V} p(s \mid c)\, f_i(c, s)$
Step 4: Update the model parameters
$\lambda_i^{(\text{new})} = \lambda_i^{(\text{current})} + \log\Big(\dfrac{\tilde{E}[f_i]}{E[f_i]^{(\text{current})}}\Big)$
Repeat steps 3 and 4 until convergence.
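A minimal, self-contained sketch of these four steps for a toy conditional maximum entropy model p(s | c); the feature tensor layout and the data are assumptions, and note that standard GIS divides the log ratio in step 4 by the (constant) total feature count, a factor the simplified update on the slide omits:

```python
import numpy as np

def p_given_c(lam, feats_c):
    """p(s | c) = exp(sum_i lam_i f_i(c, s)) / Z(c); feats_c has shape (S, I)."""
    scores = feats_c @ lam
    scores -= scores.max()                 # for numerical stability
    p = np.exp(scores)
    return p / p.sum()

def gis(feats, labels, n_iter=50):
    """feats: (N, S, I) feature values f_i(c_j, s); labels: (N,) observed s_j."""
    N, S, I = feats.shape
    lam = np.ones(I)                                   # step 2: lambda_i^(0) = 1
    emp = feats[np.arange(N), labels].mean(axis=0)     # step 1: empirical expectation
    for _ in range(n_iter):
        model = np.zeros(I)
        for j in range(N):                             # step 3: model expectation
            model += p_given_c(lam, feats[j]) @ feats[j]
        model /= N
        lam += np.log(emp / model)                     # step 4: parameter update
    return lam

# toy problem: 2 contexts, 2 labels, 2 binary features
feats = np.array([[[1.0, 0.0], [0.0, 1.0]],
                  [[1.0, 0.0], [0.0, 1.0]]])
labels = np.array([0, 1])
print(gis(feats, labels))
```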
Conditional Random Field
Conditional Random Field
Definition
Let $G = (V, E)$ be a graph and $S = (S_v)_{v \in V}$, so that $S$ is indexed by the vertices of $G$. When conditioned on $X$, the random variables $S_v$ obey the Markov property with respect to the graph:
$p(S_v \mid X, S_w, w \neq v) = p(S_v \mid X, S_w, w \sim v)$
Then $(X, S)$ is a conditional random field.
[Graphical model: linear-chain CRF over states $s_{t-1}, s_t, s_{t+1}$ and observations $x_{t-1}, x_t, x_{t+1}$]
CRF Model Parameters
The undirected graphical structure can be used to factorize
into a normalized product of potential functions
Consider the graph as a linear-chain structure
p(S | X, ) exp i f i (e, S e , X) j g j (v, S v , X)
vV , j
eE ,i
Model parameter set
{1 , 2 ,...; 1 , 2 ,...}
Feature function set
{ f1, f 2 ,...; g1, g2 ,...}
35
CRF Parameter Estimation
We can rewrite and maximize the posterior probability
$p(S \mid X, \theta) = \dfrac{1}{Z(X)}\exp\Big(\sum_k \theta_k F_k(S, X)\Big)$
where $\theta = \{\theta_1, \theta_2, \ldots\} = \{\lambda_1, \lambda_2, \ldots;\ \mu_1, \mu_2, \ldots\}$ and $\{F_1, F_2, \ldots\} = \{f_1, f_2, \ldots;\ g_1, g_2, \ldots\}$
The log posterior probability over the training samples is given by
$L(\theta) = \sum_k \log\Big[\dfrac{1}{Z(X^{(k)})}\exp\Big(\sum_j \theta_j F_j(S^{(k)}, X^{(k)})\Big)\Big] = \sum_k\Big[\sum_j \theta_j F_j(S^{(k)}, X^{(k)}) - \log Z(X^{(k)})\Big]$
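A sketch of this quantity for a linear-chain CRF, where the node and edge feature contributions are summarized as precomputed score matrices (an assumption made for brevity) and log Z(X) is computed with the forward algorithm in log space:

```python
import numpy as np

def logsumexp0(a):
    """Log-sum-exp along the first axis (works for 1-D and 2-D arrays)."""
    m = a.max(axis=0)
    return m + np.log(np.exp(a - m).sum(axis=0))

def crf_log_posterior(node, edge, labels):
    """log p(S | X, theta) = sum_k theta_k F_k(S, X) - log Z(X).

    node[t, s]  : summed node feature scores  sum_j mu_j g_j(t, s, X)
    edge[s', s] : summed edge feature scores  sum_i lambda_i f_i((s', s), X)
    labels      : the labelled state sequence S
    """
    T, _ = node.shape
    score = node[0, labels[0]]
    for t in range(1, T):                      # unnormalized score of the labelled path
        score += edge[labels[t - 1], labels[t]] + node[t, labels[t]]
    log_alpha = node[0].copy()                 # forward recursion for log Z(X)
    for t in range(1, T):
        log_alpha = node[t] + logsumexp0(log_alpha[:, None] + edge)
    return score - logsumexp0(log_alpha)

# toy example: 3 frames, 2 states
node = np.log(np.array([[0.7, 0.3], [0.4, 0.6], [0.5, 0.5]]))
edge = np.log(np.array([[0.8, 0.2], [0.3, 0.7]]))
print(np.exp(crf_log_posterior(node, edge, [0, 0, 1])))   # posterior of that labelling
```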
Parameter Updating by GIS Algorithm
Differentiating the log posterior probability with respect to the parameter $\theta_j$:
$\dfrac{\partial L(\theta)}{\partial \theta_j} = \sum_k\Big[E_{\tilde{p}(S, X)}[F_j(S, X)] - E_{p(S \mid X^{(k)}, \theta)}[F_j(S, X^{(k)})]\Big]$
Setting this derivative to zero yields the same constraint as in the maximum entropy model.
This estimation has no closed-form solution, so we can use the GIS algorithm.
Differences between CRF and MEMM
Objective function: CRF maximizes the posterior probability with a Gibbs distribution; MEMM maximizes entropy under constraints.
Inference in the model: CRF models $p(S \mid X)$; MEMM models $p(s_t \mid s_{t-1}, x_t)$.
Complexity of calculating the normalization term: full summation $O(|s|^N)$; dynamic programming $O(|s|^2 N)$; N-best $O(k)$; top-one $O(1)$.
Similarities
Feature functions are defined on state-observation and state-state pairs.
The parameters are the weights of the feature functions.
Both use the Gibbs distribution.
Summary and Future Works
We construct a complex CRF with cycles for better modeling of contextual dependency; a graphical model algorithm is applied.
In the future, a variational inference algorithm will be developed to improve the calculation of the conditional probability.
The posterior probability can then be calculated directly by an approximate inference approach.
C.-P. Liao and J.-T. Chien, "Graphical modeling of conditional random fields for human motion recognition," in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2008, pp. 1969-1972.
Thanks for your attention
and
Discussion