
Survey on Semi-Supervised CRFs
Yusuke Miyao
Department of Computer Science
The University of Tokyo
Contents
1. Conditional Random Fields (CRFs)
2. Semi-Supervised Log-Linear Model
3. Semi-Supervised CRFs
4. Dynamic Programming for Semi-Supervised CRFs
1. Conditional Random Fields (CRFs)
• Log-linear model: for a sentence x = <x1, …, xn> and a label sequence y = <y1, …, yn>,

$$p(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z} \exp\left( \sum_k \lambda_k f_k(\mathbf{x}, \mathbf{y}) \right)$$

– λk: parameter
– fk: feature function
– Z: partition function

$$Z = \sum_{\mathbf{y}} \exp\left( \sum_k \lambda_k f_k(\mathbf{x}, \mathbf{y}) \right)$$
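To make the definitions concrete, here is a minimal Python sketch (not from the survey or the cited papers) that evaluates p(y | x) and Z by brute-force enumeration; the score matrix is a made-up stand-in for Σk λk fk(x, yt) in a 0-th order model.

```python
import itertools
import numpy as np

# A minimal sketch: evaluate the log-linear model by brute-force enumeration.
# The score matrix is a hypothetical stand-in for sum_k lambda_k f_k(x, y_t).
LABELS = [0, 1]                       # e.g., 0 = Det, 1 = Noun
score = np.array([[1.0, 0.2],         # score[t][y] for each position t
                  [0.1, 1.5],
                  [0.3, 0.9]])

def unnormalized(y):
    """exp(sum_k lambda_k f_k(x, y)), with per-position features."""
    return np.exp(sum(score[t][yt] for t, yt in enumerate(y)))

# Partition function Z: a sum over all |LABELS|^n label sequences.
Z = sum(unnormalized(y) for y in itertools.product(LABELS, repeat=3))

def p(y):
    """p(y | x) = exp(sum_k lambda_k f_k(x, y)) / Z."""
    return unnormalized(y) / Z

print(p((0, 1, 1)))
print(sum(p(y) for y in itertools.product(LABELS, repeat=3)))   # == 1.0
```

The enumeration over all label sequences is exponential in sentence length; the dynamic programming discussed below avoids it.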
Parameter Estimation (1/2)
• Estimate parameters λk, given labeled training data D = {<xi, yi>}
• Objective function: log-likelihood (with a regularization term R(λ))

$$L(\boldsymbol{\lambda}) = \sum_i \log p(\mathbf{y}_i \mid \mathbf{x}_i) - R(\boldsymbol{\lambda}) = \sum_i \left( \sum_k \lambda_k f_k(\mathbf{x}_i, \mathbf{y}_i) - \log Z \right) - R(\boldsymbol{\lambda})$$

Parameter Estimation (2/2)
• Gradient-based optimization is applied (CG, quasi-Newton, etc.)


$$\frac{\partial L(\boldsymbol{\lambda})}{\partial \lambda_j} = \frac{\partial}{\partial \lambda_j} \sum_i \left( \sum_k \lambda_k f_k(\mathbf{x}_i, \mathbf{y}_i) - \log Z \right) - \frac{\partial R(\boldsymbol{\lambda})}{\partial \lambda_j}$$

$$= \sum_i \left( f_j(\mathbf{x}_i, \mathbf{y}_i) - \underbrace{\sum_{\mathbf{y}} p(\mathbf{y} \mid \mathbf{x}_i) f_j(\mathbf{x}_i, \mathbf{y})}_{\text{model expectation}} \right) - \frac{\partial R(\boldsymbol{\lambda})}{\partial \lambda_j}$$

since

$$\frac{\partial}{\partial \lambda_j} \log Z = \frac{1}{Z} \frac{\partial Z}{\partial \lambda_j} = \frac{1}{Z} \sum_{\mathbf{y}} f_j(\mathbf{x}_i, \mathbf{y}) \exp\left( \sum_k \lambda_k f_k(\mathbf{x}_i, \mathbf{y}) \right) = \sum_{\mathbf{y}} f_j(\mathbf{x}_i, \mathbf{y}) \, \frac{1}{Z} \exp\left( \sum_k \lambda_k f_k(\mathbf{x}_i, \mathbf{y}) \right) = \sum_{\mathbf{y}} f_j(\mathbf{x}_i, \mathbf{y}) \, p(\mathbf{y} \mid \mathbf{x}_i)$$
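A minimal sketch of this estimation loop, under assumptions: the feature map, the toy data, and the L2 regularizer R(λ) = ||λ||²/(2σ²) below are all hypothetical, and the model expectation is computed by brute-force enumeration rather than dynamic programming.

```python
import itertools
import numpy as np

# Sketch of gradient-based training: the gradient of the log-likelihood is
# the observed feature vector minus the model expectation, minus dR/dlambda.
LABELS = [0, 1]
N_FEATS = 4

def feats(x, y):
    """f(x, y): counts of (word parity, label) pairs -- a stand-in feature map."""
    v = np.zeros(N_FEATS)
    for t, yt in enumerate(y):
        v[2 * (x[t] % 2) + yt] += 1.0
    return v

def objective_and_grad(lam, x, y_gold, sigma2=10.0):
    """L(lambda) and its gradient for one labeled sentence, by enumeration."""
    ys = list(itertools.product(LABELS, repeat=len(x)))
    scores = np.array([lam @ feats(x, y) for y in ys])
    logZ = np.log(np.exp(scores).sum())
    probs = np.exp(scores - logZ)                    # p(y | x) for every y
    expect = sum(pr * feats(x, y) for pr, y in zip(probs, ys))  # model expectation
    L = lam @ feats(x, y_gold) - logZ - lam @ lam / (2 * sigma2)
    grad = feats(x, y_gold) - expect - lam / sigma2  # observed - expected - dR
    return L, grad

# Plain gradient ascent on one toy example (CG or a quasi-Newton method
# would be used in practice, as the slide says).
lam, x, y_gold = np.zeros(N_FEATS), [3, 8, 5], (1, 0, 1)
for _ in range(500):
    L, g = objective_and_grad(lam, x, y_gold)
    lam += 0.1 * g
print(L, lam)
```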
Dynamic Programming for CRFs
[Figure: a lattice of candidate labels (Det, Noun, …) over the sentence x = "His friend runs the company"; a label sequence y is one path through the lattice]
• Computing the model expectation requires a summation over all label sequences y, whose number is exponential in the sentence length
• Dynamic programming allows for efficient computation of model expectations
Dynamic Programming for CRFs
[Figure: the same lattice over "His friend runs the company", highlighting one label yt (Det, Noun, Verb, Adj) at position t and its marginal probability p(yt | x)]
• Assumption (0-th order): the features decompose per position,

$$f(\mathbf{x}, \mathbf{y}) = \sum_t f(\mathbf{x}, y_t)$$

• Then the model expectation reduces to per-position marginals:

$$\sum_{\mathbf{y}} f_j(\mathbf{x}, \mathbf{y}) \, p(\mathbf{y} \mid \mathbf{x}) = \sum_{\mathbf{y}} p(\mathbf{y} \mid \mathbf{x}) \sum_t f_j(\mathbf{x}, y_t) = \sum_t \sum_{y_t} f_j(\mathbf{x}, y_t) \, p(y_t \mid \mathbf{x})$$
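A brute-force check of this identity, with hypothetical toy scores and a made-up feature: for 0-th order features, the model expectation of fj only needs the per-position marginals p(yt | x).

```python
import itertools
import numpy as np

# Verify: E[f_j] over whole sequences equals a sum over per-position marginals.
score = np.array([[1.0, 0.2], [0.1, 1.5], [0.3, 0.9]])
n, L = score.shape
ys = list(itertools.product(range(L), repeat=n))
s = np.array([sum(score[t][yt] for t, yt in enumerate(y)) for y in ys])
p = np.exp(s); p /= p.sum()                       # p(y | x) over all sequences

f = lambda t, yt: float(yt == 1)                  # a made-up 0-th order feature

# Left side: expectation over whole sequences y.
lhs = sum(pi * sum(f(t, yt) for t, yt in enumerate(y)) for pi, y in zip(p, ys))

# Right side: per-position marginals p(y_t | x), then a small double sum.
marg = np.zeros((n, L))
for pi, y in zip(p, ys):
    for t, yt in enumerate(y):
        marg[t, yt] += pi
rhs = sum(f(t, yt) * marg[t, yt] for t in range(n) for yt in range(L))
assert np.isclose(lhs, rhs)
```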
Forward/Backward Probability
[Figure: the lattice split at position t into the prefix y1..t-1, the label yt, and the suffix yt+1..n]
$$p(y_t \mid \mathbf{x}) = \frac{1}{Z} \exp\left( \sum_k \lambda_k f_k(\mathbf{x}, y_t) \right) \underbrace{\sum_{\mathbf{y}_{1..t-1}} \exp\left( \sum_k \lambda_k f_k(\mathbf{x}, \mathbf{y}_{1..t-1}) \right)}_{\alpha(\mathbf{y}_{1..t-1} \mid \mathbf{x})} \; \underbrace{\sum_{\mathbf{y}_{t+1..n}} \exp\left( \sum_k \lambda_k f_k(\mathbf{x}, \mathbf{y}_{t+1..n}) \right)}_{\beta(\mathbf{y}_{t+1..n} \mid \mathbf{x})}$$

$$= \frac{1}{Z} \exp\left( \sum_k \lambda_k f_k(\mathbf{x}, y_t) \right) \alpha(\mathbf{y}_{1..t-1} \mid \mathbf{x}) \, \beta(\mathbf{y}_{t+1..n} \mid \mathbf{x})$$

Both α and β are computed by dynamic programming.
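A sketch of these forward/backward quantities for the 0-th order model, with the same hypothetical score matrix as before; here α and β reduce to running products of per-position sums, and the resulting marginals are checked against brute-force enumeration.

```python
import itertools
import numpy as np

# Forward/backward marginals p(y_t | x) for a 0-th order model.
score = np.array([[1.0, 0.2],    # 3 positions x 2 labels (toy scores)
                  [0.1, 1.5],
                  [0.3, 0.9]])
n, L = score.shape
e = np.exp(score)
s = e.sum(axis=1)                # per-position sums over labels

# With 0-th order features the prefix/suffix sums factorize, so alpha and
# beta are running products: alpha[t] sums over y_{1..t-1}, beta[t] over
# y_{t+1..n}; Z is the product over all positions.
alpha = np.cumprod(np.concatenate(([1.0], s[:-1])))
beta = np.cumprod(np.concatenate(([1.0], s[::-1][:-1])))[::-1]
Z = np.prod(s)

marg = e * alpha[:, None] * beta[:, None] / Z    # p(y_t | x), shape (n, L)

# Check against brute-force enumeration of all L^n label sequences.
brute = np.zeros_like(marg)
for y in itertools.product(range(L), repeat=n):
    p = np.prod([e[t, yt] for t, yt in enumerate(y)]) / Z
    for t, yt in enumerate(y):
        brute[t, yt] += p
assert np.allclose(marg, brute)
print(marg)
```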
2. Semi-Supervised Log-Linear Model
• Grandvalet and Bengio (2004)
• Given labeled data DL = {<xi, yi>} and unlabeled data DU = {zi}
• Objective function: log-likelihood + negative entropy regularizer

$$RL(\boldsymbol{\lambda}) = \sum_i \log p(\mathbf{y}_i \mid \mathbf{x}_i) - R(\boldsymbol{\lambda}) - \gamma \sum_i H(\mathbf{y} \mid \mathbf{z}_i)$$

$$H(\mathbf{y} \mid \mathbf{x}) = -\sum_{\mathbf{y}} p(\mathbf{y} \mid \mathbf{x}) \log p(\mathbf{y} \mid \mathbf{x})$$
Negative Entropy Regularizer
• Maximizing −H(y | x) = Σy p(y | x) log p(y | x)
→ Minimizing class overlap = targets are separated
• Example: p1(y1 | x) = 0.4, p1(y2 | x) = 0.6 (overlapping classes) versus p2(y1 | x) = 0.1, p2(y2 | x) = 0.9 (separated classes); here H1(y | x) > H2(y | x), so entropy minimization prefers p2
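A quick numeric check of this example, using the probabilities from the slide:

```python
import numpy as np

# The overlapping distribution p1 has higher entropy than the separated p2.
def H(p):
    """H(y | x) = -sum_y p(y | x) log p(y | x)."""
    return -np.sum(p * np.log(p))

p1 = np.array([0.4, 0.6])    # classes overlap
p2 = np.array([0.1, 0.9])    # classes separated
assert H(p1) > H(p2)
print(H(p1), H(p2))          # ~0.673 vs ~0.325 (nats)
```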
Gradient of Entropy (1/2)


$$\frac{\partial}{\partial \lambda_j} \left( -H(\mathbf{y} \mid \mathbf{x}) \right) = \frac{\partial}{\partial \lambda_j} \sum_{\mathbf{y}} p(\mathbf{y} \mid \mathbf{x}) \log p(\mathbf{y} \mid \mathbf{x})$$

$$= \sum_{\mathbf{y}} \left( \frac{\partial p(\mathbf{y} \mid \mathbf{x})}{\partial \lambda_j} \log p(\mathbf{y} \mid \mathbf{x}) + p(\mathbf{y} \mid \mathbf{x}) \frac{\partial}{\partial \lambda_j} \log p(\mathbf{y} \mid \mathbf{x}) \right)$$

$$= \sum_{\mathbf{y}} p(\mathbf{y} \mid \mathbf{x}) \log p(\mathbf{y} \mid \mathbf{x}) \left( f_j(\mathbf{x}, \mathbf{y}) - \sum_{\mathbf{y}'} f_j(\mathbf{x}, \mathbf{y}') \, p(\mathbf{y}' \mid \mathbf{x}) \right)$$

$$= \sum_{\mathbf{y}} p(\mathbf{y} \mid \mathbf{x}) \log p(\mathbf{y} \mid \mathbf{x}) f_j(\mathbf{x}, \mathbf{y}) - \left( \sum_{\mathbf{y}} p(\mathbf{y} \mid \mathbf{x}) \log p(\mathbf{y} \mid \mathbf{x}) \right) \sum_{\mathbf{y}'} f_j(\mathbf{x}, \mathbf{y}') \, p(\mathbf{y}' \mid \mathbf{x})$$

(The third line uses the two identities derived on the next slide.)
Gradient of Entropy (2/2)

First term: using ∂p(y | x)/∂λj = p(y | x)(fj(x, y) − Σy' fj(x, y') p(y' | x)),

$$\sum_{\mathbf{y}} \frac{\partial p(\mathbf{y} \mid \mathbf{x})}{\partial \lambda_j} \log p(\mathbf{y} \mid \mathbf{x}) = \sum_{\mathbf{y}} \left( f_j(\mathbf{x}, \mathbf{y}) \, p(\mathbf{y} \mid \mathbf{x}) - p(\mathbf{y} \mid \mathbf{x}) \sum_{\mathbf{y}'} f_j(\mathbf{x}, \mathbf{y}') \, p(\mathbf{y}' \mid \mathbf{x}) \right) \log p(\mathbf{y} \mid \mathbf{x})$$

$$= \sum_{\mathbf{y}} p(\mathbf{y} \mid \mathbf{x}) \log p(\mathbf{y} \mid \mathbf{x}) \left( f_j(\mathbf{x}, \mathbf{y}) - \sum_{\mathbf{y}'} f_j(\mathbf{x}, \mathbf{y}') \, p(\mathbf{y}' \mid \mathbf{x}) \right)$$

Second term:

$$\sum_{\mathbf{y}} p(\mathbf{y} \mid \mathbf{x}) \frac{\partial}{\partial \lambda_j} \log p(\mathbf{y} \mid \mathbf{x}) = \sum_{\mathbf{y}} p(\mathbf{y} \mid \mathbf{x}) \frac{1}{p(\mathbf{y} \mid \mathbf{x})} \frac{\partial p(\mathbf{y} \mid \mathbf{x})}{\partial \lambda_j}$$

$$= \sum_{\mathbf{y}} p(\mathbf{y} \mid \mathbf{x}) \left( f_j(\mathbf{x}, \mathbf{y}) - \sum_{\mathbf{y}'} f_j(\mathbf{x}, \mathbf{y}') \, p(\mathbf{y}' \mid \mathbf{x}) \right) = \sum_{\mathbf{y}} p(\mathbf{y} \mid \mathbf{x}) f_j(\mathbf{x}, \mathbf{y}) - \sum_{\mathbf{y}'} f_j(\mathbf{x}, \mathbf{y}') \, p(\mathbf{y}' \mid \mathbf{x}) = 0$$
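As a sanity check on this derivation, the sketch below compares the closed-form gradient Σy p log p (fj − E[fj]) against a central finite-difference estimate; the label-count feature map is a made-up assumption.

```python
import itertools
import numpy as np

# Compare the analytic gradient of sum_y p log p with finite differences.
LABELS, n = [0, 1], 3

def feats(y):
    v = np.zeros(2)
    for yt in y:
        v[yt] += 1.0                     # f_j counts occurrences of label j
    return v

YS = list(itertools.product(LABELS, repeat=n))
F = np.array([feats(y) for y in YS])

def dist(lam):
    s = F @ lam
    p = np.exp(s - s.max())
    return p / p.sum()                   # p(y | x) for every sequence y

def sum_plogp(lam):
    p = dist(lam)
    return np.sum(p * np.log(p))         # sum_y p log p (negative entropy)

def analytic_grad(lam):
    p = dist(lam)
    Ef = p @ F                           # model expectation of features
    return ((p * np.log(p))[:, None] * (F - Ef)).sum(axis=0)

lam, eps = np.array([0.3, -0.7]), 1e-6
numeric = np.array([
    (sum_plogp(lam + eps * d) - sum_plogp(lam - eps * d)) / (2 * eps)
    for d in np.eye(2)])
assert np.allclose(analytic_grad(lam), numeric, atol=1e-5)
print(analytic_grad(lam))
```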
3. Semi-Supervised CRFs
• Jiao et al. (2006)
• Given labeled data DL = {<xi, yi>} and unlabeled data DU = {zi}
• Objective function: log-likelihood + negative entropy regularizer

$$RL(\boldsymbol{\lambda}) = \sum_i \log p_{\boldsymbol{\lambda}}(\mathbf{y}_i \mid \mathbf{x}_i) - R(\boldsymbol{\lambda}) - \gamma \sum_i H(\mathbf{y} \mid \mathbf{z}_i)$$
Application to NER
• Gene and protein identification
• A (labeled): 5448 words, B (unlabeled): 5210 words, C: 10208 words, D: 25145 words
Results

γ      A&B               A&C               A&D
       P     R     F     P     R     F     P     R     F
0      0.80  0.36  0.50  0.77  0.29  0.43  0.74  0.30  0.43
0.1    0.82  0.40  0.54  0.79  0.32  0.46  0.74  0.31  0.44
0.5    0.82  0.40  0.54  0.79  0.33  0.46  0.74  0.31  0.44
1      0.82  0.40  0.54  0.77  0.34  0.47  0.73  0.33  0.45
5      0.84  0.45  0.59  0.78  0.38  0.51  0.72  0.36  0.48
10     0.78  0.46  0.58  0.66  0.38  0.48  0.66  0.38  0.47

• Self-training did not yield any improvement
4. Dynamic Programming for Semi-Supervised CRFs
• Mann and McCallum (2007)
• We have to compute:

$$\sum_{\mathbf{y}} p(\mathbf{y} \mid \mathbf{x}) \log p(\mathbf{y} \mid \mathbf{x}) \, f_j(\mathbf{x}, \mathbf{y}) = \sum_{\mathbf{y}} p(\mathbf{y} \mid \mathbf{x}) \log p(\mathbf{y} \mid \mathbf{x}) \sum_t f_j(\mathbf{x}, y_t)$$

$$= \sum_t \sum_{y_t} f_j(\mathbf{x}, y_t) \sum_{\mathbf{y}_{-t}} p(\mathbf{y}_{-t} \cdot y_t \mid \mathbf{x}) \log p(\mathbf{y}_{-t} \cdot y_t \mid \mathbf{x})$$

where y−t = <y1, …, yt−1, yt+1, …, yn> and y−t·y = <y1, …, yt−1, y, yt+1, …, yn>
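A brute-force illustration of this regrouping, with toy scores and a hypothetical feature: both sides are computed by explicit enumeration, so this demonstrates only the algebra, not yet the efficient algorithm.

```python
import itertools
import numpy as np

# Verify: the entropy-weighted feature term regroups into per-position sums.
score = np.array([[1.0, 0.2], [0.1, 1.5], [0.3, 0.9]])
n, L = score.shape
ys = list(itertools.product(range(L), repeat=n))
s = np.array([sum(score[t][yt] for t, yt in enumerate(y)) for y in ys])
p = np.exp(s); p /= p.sum()

f = lambda t, yt: float(yt == 1)          # hypothetical 0-th order feature f_j

# Left side: sum_y p log p * f_j(x, y), with f_j(x, y) = sum_t f_j(x, y_t).
lhs = sum(pi * np.log(pi) * sum(f(t, yt) for t, yt in enumerate(y))
          for pi, y in zip(p, ys))

# Right side: fix position t to label yt, sum p log p over the rest y_-t.
rhs = sum(f(t, yt) * sum(pi * np.log(pi) for pi, y in zip(p, ys) if y[t] == yt)
          for t in range(n) for yt in range(L))
assert np.isclose(lhs, rhs)
```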
Example
[Figure: the lattice over "His friend runs the company" with position t fixed to label yt; y−t ranges over all labelings of the remaining positions, and y−t·yt is the full sequence]

• Enumerate all y while fixing the t-th state to yt
• If we can compute Σy−t p(y−t·yt | x) log p(y−t·yt | x) efficiently, we can compute the gradient
Decomposition of Entropy
• In the following, we use the decomposition

$$\sum_b p(a, b) \log p(a, b) = \sum_b p(a) \, p(b \mid a) \log \left( p(a) \, p(b \mid a) \right)$$

$$= p(a) \log p(a) \sum_b p(b \mid a) + p(a) \sum_b p(b \mid a) \log p(b \mid a)$$

$$= p(a) \log p(a) + p(a) \sum_b p(b \mid a) \log p(b \mid a)$$
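A tiny numeric check of this identity with a made-up joint distribution p(a, b):

```python
import numpy as np

# Check: sum_b p(a,b) log p(a,b) = p(a) log p(a) + p(a) sum_b p(b|a) log p(b|a)
joint = np.array([[0.1, 0.3],            # joint[a][b] = p(a, b), hypothetical
                  [0.4, 0.2]])
a = 0
pa = joint[a].sum()                      # p(a)
pb_a = joint[a] / pa                     # p(b | a)

lhs = np.sum(joint[a] * np.log(joint[a]))
rhs = pa * np.log(pa) + pa * np.sum(pb_a * np.log(pb_a))
assert np.isclose(lhs, rhs)
```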
Subsequence Constrained Entropy
[Figure: the lattice over "His friend runs the company" with yt fixed; the remaining labels y−t are summed out]

$$\sum_{\mathbf{y}_{-t}} p(\mathbf{y}_{-t} \cdot y_t \mid \mathbf{x}) \log p(\mathbf{y}_{-t} \cdot y_t \mid \mathbf{x}) = \underbrace{p(y_t \mid \mathbf{x}) \log p(y_t \mid \mathbf{x})}_{\text{computed from forward-backward probability}} + \; p(y_t \mid \mathbf{x}) \underbrace{\sum_{\mathbf{y}_{-t}} p(\mathbf{y}_{-t} \mid y_t, \mathbf{x}) \log p(\mathbf{y}_{-t} \mid y_t, \mathbf{x})}_{\text{subsequence constrained entropy}}$$
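This is the decomposition from the previous slide instantiated with a = yt and b = y−t; the sketch below verifies it by brute force on a toy model.

```python
import itertools
import numpy as np

# Verify the subsequence-constrained entropy decomposition by enumeration.
score = np.array([[1.0, 0.2], [0.1, 1.5], [0.3, 0.9]])
n, L = score.shape
ys = list(itertools.product(range(L), repeat=n))
s = np.array([sum(score[u][yu] for u, yu in enumerate(y)) for y in ys])
p = np.exp(s); p /= p.sum()

t, yt = 1, 0                                   # fix position t to label yt
sel = [(pi, y) for pi, y in zip(p, ys) if y[t] == yt]

lhs = sum(pi * np.log(pi) for pi, _ in sel)    # sum over y_-t of p log p
p_yt = sum(pi for pi, _ in sel)                # marginal p(y_t | x)
cond = np.array([pi / p_yt for pi, _ in sel])  # p(y_-t | y_t, x)
rhs = p_yt * np.log(p_yt) + p_yt * np.sum(cond * np.log(cond))
assert np.isclose(lhs, rhs)
```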
Forward/Backward Subsequence Constrained Entropy

[Figure: the lattice with yt fixed; the prefix contributes Hα(y1..t−1 | yt, x) and the suffix contributes Hβ(yt+1..n | yt, x)]

$$\sum_{\mathbf{y}_{-t}} p(\mathbf{y}_{-t} \mid y_t, \mathbf{x}) \log p(\mathbf{y}_{-t} \mid y_t, \mathbf{x}) = \sum_{\mathbf{y}_{1..t-1}} p(\mathbf{y}_{1..t-1} \mid y_t, \mathbf{x}) \log p(\mathbf{y}_{1..t-1} \mid y_t, \mathbf{x}) + \sum_{\mathbf{y}_{t+1..n}} p(\mathbf{y}_{t+1..n} \mid y_t, \mathbf{x}) \log p(\mathbf{y}_{t+1..n} \mid y_t, \mathbf{x})$$

$$= H_\alpha(\mathbf{y}_{1..t-1} \mid y_t, \mathbf{x}) + H_\beta(\mathbf{y}_{t+1..n} \mid y_t, \mathbf{x})$$

(The split is possible because the prefix and the suffix are conditionally independent given yt.)
Dynamic Computation of Hα
• Hα can be computed incrementally:

$$H_\alpha(\mathbf{y}_{1..t-1} \mid y_t, \mathbf{x}) = \sum_{y_{t-1}} p(y_{t-1} \mid y_t, \mathbf{x}) \log p(y_{t-1} \mid y_t, \mathbf{x}) + \sum_{y_{t-1}} p(y_{t-1} \mid y_t, \mathbf{x}) \underbrace{\sum_{\mathbf{y}_{1..t-2}} p(\mathbf{y}_{1..t-2} \mid y_{t-1}, y_t, \mathbf{x}) \log p(\mathbf{y}_{1..t-2} \mid y_{t-1}, y_t, \mathbf{x})}_{= \, H_\alpha(\mathbf{y}_{1..t-2} \mid y_{t-1}, \mathbf{x})}$$

The first term is computed from the forward-backward probability, and the inner sum equals Hα(y1..t−2 | yt−1, x).
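A minimal sketch of this recursion, specialized to the 0-th order model: there p(yt−1 | yt, x) reduces to the marginal p(yt−1 | x), since positions are independent given x (the same recursion runs over forward-backward quantities in a chain model). H_alpha[t] below accumulates Σ p log p over prefixes of length t and is checked against explicit enumeration.

```python
import itertools
import numpy as np

# Incremental computation of the prefix quantity H_alpha for a toy model.
score = np.array([[1.0, 0.2], [0.1, 1.5], [0.3, 0.9]])
n, L = score.shape
marg = np.exp(score) / np.exp(score).sum(axis=1, keepdims=True)  # p(y_t | x)

H_alpha = np.zeros(n + 1)
for t in range(1, n + 1):
    step = np.sum(marg[t - 1] * np.log(marg[t - 1]))          # new position
    H_alpha[t] = step + np.sum(marg[t - 1] * H_alpha[t - 1])  # + expected H_alpha

# Check the full-prefix value against explicit enumeration.
def plogp(y):
    pr = np.prod([marg[u, yu] for u, yu in enumerate(y)])
    return pr * np.log(pr)

brute = sum(plogp(y) for y in itertools.product(range(L), repeat=n))
assert np.isclose(H_alpha[n], brute)
print(-H_alpha[n])    # entropy of the full prefix (sign flipped)
```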
References
• Y. Grandvalet and Y. Bengio. 2004. Semi-supervised learning by entropy minimization. In NIPS 2004.
• F. Jiao, S. Wang, C.-H. Lee, R. Greiner, and D. Schuurmans. 2006. Semi-supervised conditional random fields for improved sequence segmentation and labeling. In COLING/ACL 2006.
• G. S. Mann and A. McCallum. 2007. Efficient computation of entropy gradient for semi-supervised conditional random fields. In NAACL-HLT 2007.
• X. Zhu. 2005. Semi-supervised learning literature survey. Technical Report 1530, Computer Sciences, University of Wisconsin-Madison.