Survey on Semi-Supervised CRFs
Yusuke Miyao
Department of Computer Science
The University of Tokyo
Contents
1. Conditional Random Fields (CRFs)
2. Semi-Supervised Log-Linear Model
3. Semi-Supervised CRFs
4. Dynamic Programming for Semi-Supervised CRFs
1. Conditional Random Fields (CRFs)
• Log-linear model: for a sentence $\mathbf{x} = \langle x_1, \ldots, x_n \rangle$ and a label sequence $\mathbf{y} = \langle y_1, \ldots, y_n \rangle$,
$$p(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z} \exp\Bigl(\sum_k \lambda_k f_k(\mathbf{x}, \mathbf{y})\Bigr)$$
  – $\lambda_k$: parameter
  – $f_k$: feature function
  – $Z$: partition function
$$Z = \sum_{\mathbf{y}} \exp\Bigl(\sum_k \lambda_k f_k(\mathbf{x}, \mathbf{y})\Bigr)$$
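As a concrete illustration, here is a minimal brute-force sketch of this model in Python; the labels, feature functions, and weights are toy assumptions for the running POS example, not anything from the cited papers:

```python
import itertools
import math

LABELS = ["Det", "Noun", "Verb"]

def score(x, y, weights, features):
    """Unnormalized log-score: sum_k lambda_k * f_k(x, y)."""
    return sum(w * f(x, y) for w, f in zip(weights, features))

def prob(x, y, weights, features):
    """p(y|x) = exp(score(x, y)) / Z, with Z summed over all label sequences."""
    Z = sum(math.exp(score(x, y2, weights, features))
            for y2 in itertools.product(LABELS, repeat=len(x)))
    return math.exp(score(x, y, weights, features)) / Z

# Toy feature functions: f_0 fires when a word following "the" is tagged
# Noun; f_1 counts Verb tags. Both take the whole sequence pair (x, y).
features = [
    lambda x, y: sum(1 for t in range(1, len(x)) if x[t - 1] == "the" and y[t] == "Noun"),
    lambda x, y: sum(1 for t in range(len(x)) if y[t] == "Verb"),
]
weights = [1.5, 0.5]  # the lambda_k

print(prob(("the", "company"), ("Det", "Noun"), weights, features))
```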
Parameter Estimation (1/2)
• Estimate the parameters $\lambda_k$, given labeled training data $D = \{\langle \mathbf{x}_i, \mathbf{y}_i \rangle\}$
• Objective function: log-likelihood (+ regularizer)
$$L(\boldsymbol{\lambda}) = \sum_i \log p(\mathbf{y}_i \mid \mathbf{x}_i) - R(\boldsymbol{\lambda}) = \sum_i \Bigl(\sum_k \lambda_k f_k(\mathbf{x}_i, \mathbf{y}_i) - \log Z\Bigr) - R(\boldsymbol{\lambda})$$
Parameter Estimation (2/2)
• Gradient-based optimization is applied (CG, quasi-Newton, etc.)
$$\frac{\partial L(\boldsymbol{\lambda})}{\partial \lambda_j} = \frac{\partial}{\partial \lambda_j}\Bigl(\sum_i \Bigl(\sum_k \lambda_k f_k(\mathbf{x}_i, \mathbf{y}_i) - \log Z\Bigr) - R(\boldsymbol{\lambda})\Bigr) = \sum_i \Bigl(f_j(\mathbf{x}_i, \mathbf{y}_i) - \sum_{\mathbf{y}} p(\mathbf{y} \mid \mathbf{x}_i) f_j(\mathbf{x}_i, \mathbf{y})\Bigr) - \frac{\partial R(\boldsymbol{\lambda})}{\partial \lambda_j}$$
  – the second term is the model expectation, obtained from
$$\frac{\partial}{\partial \lambda_j} \log Z = \frac{1}{Z} \frac{\partial Z}{\partial \lambda_j} = \frac{1}{Z} \sum_{\mathbf{y}} f_j(\mathbf{x}_i, \mathbf{y}) \exp\Bigl(\sum_k \lambda_k f_k(\mathbf{x}_i, \mathbf{y})\Bigr) = \sum_{\mathbf{y}} f_j(\mathbf{x}_i, \mathbf{y}) \, p(\mathbf{y} \mid \mathbf{x}_i)$$
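A sketch of this gradient for one labeled pair, brute-forcing the model expectation with the toy `LABELS`, `score`, `features`, and `weights` from the previous sketch (the regularizer term is omitted):

```python
import itertools
import math

def grad_loglik(x, y, weights, features):
    """d/d lambda_j of log p(y|x): empirical count minus model expectation."""
    ys = list(itertools.product(LABELS, repeat=len(x)))
    Z = sum(math.exp(score(x, y2, weights, features)) for y2 in ys)
    grad = []
    for f in features:
        model_exp = sum(math.exp(score(x, y2, weights, features)) / Z * f(x, y2)
                        for y2 in ys)
        grad.append(f(x, y) - model_exp)
    return grad

print(grad_loglik(("the", "company"), ("Det", "Noun"), weights, features))
```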
Dynamic Programming for CRFs
[Figure: label lattice for x = "His friend runs the company"; each word has candidate labels (Det, Noun, ...) and a label sequence y is a path through the lattice]
• Computation of model expectations requires summation over y → exponentially many sequences
• Dynamic programming allows efficient computation of model expectations
Dynamic Programming for CRFs
[Figure: the same lattice with position t highlighted; $y_t$ ranges over the candidate labels Det, Noun, Verb, Adj, with marginal probability $p(y_t \mid \mathbf{x})$]
• Assumption (0-th order): $f(\mathbf{x}, \mathbf{y}) = \sum_t f(\mathbf{x}, y_t)$
$$\sum_{\mathbf{y}} f_j(\mathbf{x}, \mathbf{y}) \, p(\mathbf{y} \mid \mathbf{x}) = \sum_{\mathbf{y}} p(\mathbf{y} \mid \mathbf{x}) \sum_t f_j(\mathbf{x}, y_t) = \sum_t \sum_{y_t} f_j(\mathbf{x}, y_t) \, p(y_t \mid \mathbf{x})$$
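Under this assumption the model expectation needs only per-position marginals. A minimal sketch, assuming a hypothetical array `f_local[t, y]` of feature values $f_j(\mathbf{x}, y_t)$ and `marginals[t, y]` $= p(y_t \mid \mathbf{x})$ obtained by dynamic programming (next slide):

```python
import numpy as np

def expected_feature(f_local: np.ndarray, marginals: np.ndarray) -> float:
    """sum_t sum_{y_t} f_j(x, y_t) * p(y_t | x), vectorized over t and y_t."""
    return float((f_local * marginals).sum())
```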
Forward/Backward Probability
[Figure: the lattice split at position t into a prefix region $\alpha(\mathbf{y}_{1..t-1} \mid \mathbf{x})$, the position t itself with $p(y_t \mid \mathbf{x})$, and a suffix region $\beta(\mathbf{y}_{t+1..n} \mid \mathbf{x})$]
$$p(y_t \mid \mathbf{x}) = \frac{1}{Z} \exp\Bigl(\sum_k \lambda_k f_k(\mathbf{x}, y_t)\Bigr) \sum_{\mathbf{y}_{1..t-1}} \exp\Bigl(\sum_k \lambda_k f_k(\mathbf{x}, \mathbf{y}_{1..t-1})\Bigr) \sum_{\mathbf{y}_{t+1..n}} \exp\Bigl(\sum_k \lambda_k f_k(\mathbf{x}, \mathbf{y}_{t+1..n})\Bigr)$$
$$= \frac{1}{Z} \exp\Bigl(\sum_k \lambda_k f_k(\mathbf{x}, y_t)\Bigr) \, \alpha(\mathbf{y}_{1..t-1} \mid \mathbf{x}) \, \beta(\mathbf{y}_{t+1..n} \mid \mathbf{x})$$
• $\alpha$ and $\beta$ are computed by dynamic programming
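A sketch of the α/β recursions. Under the slide's 0-th order assumption there are no transition features, so the sketch below uses the slightly more general first-order form, with hypothetical local potentials `phi[t, y]` and transition potentials `psi[y_prev, y]`, which is where the dynamic program actually pays off:

```python
import numpy as np

def forward_backward(phi, psi):
    """Per-position marginals p(y_t | x) for a first-order chain."""
    n, L = phi.shape
    alpha = np.zeros((n, L))
    beta = np.zeros((n, L))
    alpha[0] = phi[0]
    for t in range(1, n):
        alpha[t] = phi[t] * (alpha[t - 1] @ psi)    # sum over y_{t-1}
    beta[n - 1] = 1.0
    for t in range(n - 2, -1, -1):
        beta[t] = psi @ (phi[t + 1] * beta[t + 1])  # sum over y_{t+1}
    Z = alpha[n - 1].sum()                          # partition function
    marginals = alpha * beta / Z                    # p(y_t | x)
    return marginals, Z

rng = np.random.default_rng(0)
phi = rng.random((5, 4))   # 5 positions, 4 labels (toy potentials)
psi = rng.random((4, 4))
marginals, Z = forward_backward(phi, psi)
print(marginals.sum(axis=1))  # each row sums to 1
```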
2. Semi-Supervised Log-Linear Model
• Grandvalet and Bengio (2004)
• Given labeled data $D_L = \{\langle \mathbf{x}_i, \mathbf{y}_i \rangle\}$ and unlabeled data $D_U = \{\mathbf{z}_i\}$
• Objective function: log-likelihood + negative entropy regularizer (weighted by $\gamma$)
$$RL(\boldsymbol{\lambda}) = \sum_i \log p(\mathbf{y}_i \mid \mathbf{x}_i) - R(\boldsymbol{\lambda}) - \gamma \sum_i H(\mathbf{y} \mid \mathbf{z}_i)$$
$$H(\mathbf{y} \mid \mathbf{x}) = -\sum_{\mathbf{y}} p(\mathbf{y} \mid \mathbf{x}) \log p(\mathbf{y} \mid \mathbf{x})$$
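A brute-force sketch of this objective at toy scale, reusing `LABELS`, `score`, and `prob` from the earlier sketches; `gamma` and the helper names are assumptions, and $R(\boldsymbol{\lambda})$ is omitted:

```python
import itertools
import math

def entropy(x, weights, features):
    """H(y|x) = -sum_y p(y|x) log p(y|x), by full enumeration."""
    ys = list(itertools.product(LABELS, repeat=len(x)))
    Z = sum(math.exp(score(x, y, weights, features)) for y in ys)
    ps = [math.exp(score(x, y, weights, features)) / Z for y in ys]
    return -sum(p * math.log(p) for p in ps if p > 0)

def objective(labeled, unlabeled, weights, features, gamma=1.0):
    """Log-likelihood of labeled data minus gamma times unlabeled entropy."""
    ll = sum(math.log(prob(x, y, weights, features)) for x, y in labeled)
    H = sum(entropy(z, weights, features) for z in unlabeled)
    return ll - gamma * H

labeled = [(("the", "company"), ("Det", "Noun"))]
unlabeled = [("his", "friend")]
print(objective(labeled, unlabeled, weights, features, gamma=0.5))
```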
Negative Entropy Regularizer
• Maximizing $-H(\mathbf{y} \mid \mathbf{x}) = \sum_{\mathbf{y}} p(\mathbf{y} \mid \mathbf{x}) \log p(\mathbf{y} \mid \mathbf{x})$
  → minimizing class overlap = targets are separated
• Example: a near-uniform model $p_1$ with $p_1(y_1 \mid x) = 0.4$, $p_1(y_2 \mid x) = 0.6$ has high entropy $H_1(y \mid x)$; a confident model $p_2$ with $p_2(y_1 \mid x) = 0.1$, $p_2(y_2 \mid x) = 0.9$ has low entropy $H_2(y \mid x)$, so the regularizer prefers $p_2$
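Plugging the example numbers in as a quick check:

```python
import math

# Entropies of the two toy distributions from the slide.
H1 = -(0.4 * math.log(0.4) + 0.6 * math.log(0.6))  # ~0.673 (near-uniform p1)
H2 = -(0.1 * math.log(0.1) + 0.9 * math.log(0.9))  # ~0.325 (confident p2)
print(H1 > H2)  # True: the confident model incurs the smaller entropy penalty
```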
Gradient of Entropy (1/2)
$$\frac{\partial H(\mathbf{y} \mid \mathbf{x})}{\partial \lambda_j} = -\frac{\partial}{\partial \lambda_j} \sum_{\mathbf{y}} p(\mathbf{y} \mid \mathbf{x}) \log p(\mathbf{y} \mid \mathbf{x}) = -\sum_{\mathbf{y}} \Bigl(\frac{\partial p(\mathbf{y} \mid \mathbf{x})}{\partial \lambda_j} \log p(\mathbf{y} \mid \mathbf{x}) + p(\mathbf{y} \mid \mathbf{x}) \frac{\partial \log p(\mathbf{y} \mid \mathbf{x})}{\partial \lambda_j}\Bigr)$$
• Using $\frac{\partial p(\mathbf{y} \mid \mathbf{x})}{\partial \lambda_j} = p(\mathbf{y} \mid \mathbf{x}) \bigl(f_j(\mathbf{x}, \mathbf{y}) - \sum_{\mathbf{y}'} f_j(\mathbf{x}, \mathbf{y}') p(\mathbf{y}' \mid \mathbf{x})\bigr)$, the first term becomes
$$-\sum_{\mathbf{y}} p(\mathbf{y} \mid \mathbf{x}) \log p(\mathbf{y} \mid \mathbf{x}) \Bigl(f_j(\mathbf{x}, \mathbf{y}) - \sum_{\mathbf{y}'} f_j(\mathbf{x}, \mathbf{y}') p(\mathbf{y}' \mid \mathbf{x})\Bigr)$$
Gradient of Entropy (2/2)
• The second term vanishes:
$$\sum_{\mathbf{y}} p(\mathbf{y} \mid \mathbf{x}) \frac{\partial \log p(\mathbf{y} \mid \mathbf{x})}{\partial \lambda_j} = \sum_{\mathbf{y}} \frac{\partial p(\mathbf{y} \mid \mathbf{x})}{\partial \lambda_j} = \frac{\partial}{\partial \lambda_j} \sum_{\mathbf{y}} p(\mathbf{y} \mid \mathbf{x}) = \frac{\partial}{\partial \lambda_j} 1 = 0$$
• Hence
$$\frac{\partial H(\mathbf{y} \mid \mathbf{x})}{\partial \lambda_j} = -\sum_{\mathbf{y}} p(\mathbf{y} \mid \mathbf{x}) \log p(\mathbf{y} \mid \mathbf{x}) \Bigl(f_j(\mathbf{x}, \mathbf{y}) - \sum_{\mathbf{y}'} f_j(\mathbf{x}, \mathbf{y}') p(\mathbf{y}' \mid \mathbf{x})\Bigr)$$
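A sketch of the resulting closed form, checked against a finite difference; it reuses the toy `LABELS`, `score`, `entropy`, `weights`, and `features` from the earlier sketches:

```python
import itertools
import math

def grad_entropy(x, weights, features):
    """dH(y|x)/d lambda_j = -sum_y p log p * (f_j - E_p[f_j]), by enumeration."""
    ys = list(itertools.product(LABELS, repeat=len(x)))
    Z = sum(math.exp(score(x, y, weights, features)) for y in ys)
    ps = {y: math.exp(score(x, y, weights, features)) / Z for y in ys}
    grad = []
    for f in features:
        ef = sum(ps[y] * f(x, y) for y in ys)  # model expectation E_p[f_j]
        grad.append(-sum(ps[y] * math.log(ps[y]) * (f(x, y) - ef) for y in ys))
    return grad

x = ("the", "company")
eps = 1e-6
fd = (entropy(x, [weights[0] + eps, weights[1]], features)
      - entropy(x, weights, features)) / eps
print(grad_entropy(x, weights, features)[0], fd)  # should agree closely
```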
3. Semi-Supervised CRFs
• Jiao et al. (2006)
• Given labeled data $D_L = \{\langle \mathbf{x}_i, \mathbf{y}_i \rangle\}$ and unlabeled data $D_U = \{\mathbf{z}_i\}$
• Objective function: log-likelihood + negative entropy regularizer, as in Grandvalet and Bengio (2004), but with linear-chain CRFs:
$$RL(\boldsymbol{\lambda}) = \sum_i \log p_{\boldsymbol{\lambda}}(\mathbf{y}_i \mid \mathbf{x}_i) - R(\boldsymbol{\lambda}) - \gamma \sum_i H(\mathbf{y} \mid \mathbf{z}_i)$$
Application to NER
• Gene and protein identification
• A (labeled): 5448 words; B (unlabeled): 5210 words; C: 10208 words; D: 25145 words
• Results (P = precision, R = recall, F = F-score; γ weights the entropy term):

  γ    | A&B: P / R / F     | A&C: P / R / F     | A&D: P / R / F
  0    | 0.80 / 0.36 / 0.50 | 0.77 / 0.29 / 0.43 | 0.74 / 0.30 / 0.43
  0.1  | 0.82 / 0.40 / 0.54 | 0.79 / 0.32 / 0.46 | 0.74 / 0.31 / 0.44
  0.5  | 0.82 / 0.40 / 0.54 | 0.79 / 0.33 / 0.46 | 0.74 / 0.31 / 0.44
  1    | 0.82 / 0.40 / 0.54 | 0.77 / 0.34 / 0.47 | 0.73 / 0.33 / 0.45
  5    | 0.84 / 0.45 / 0.59 | 0.78 / 0.38 / 0.51 | 0.72 / 0.36 / 0.48
  10   | 0.78 / 0.46 / 0.58 | 0.66 / 0.38 / 0.48 | 0.66 / 0.38 / 0.47

• Self-training did not yield any improvement
4. Dynamic Programming for Semi-Supervised CRFs
• Mann and McCallum (2007)
• We have to compute:
$$\sum_{\mathbf{y}} p(\mathbf{y} \mid \mathbf{x}) \log p(\mathbf{y} \mid \mathbf{x}) f_j(\mathbf{x}, \mathbf{y}) = \sum_{\mathbf{y}} p(\mathbf{y} \mid \mathbf{x}) \log p(\mathbf{y} \mid \mathbf{x}) \sum_t f_j(\mathbf{x}, y_t) = \sum_t \sum_{y_t} f_j(\mathbf{x}, y_t) \sum_{\mathbf{y}_{-t}} p(\mathbf{y}_{-t} \cdot y_t \mid \mathbf{x}) \log p(\mathbf{y}_{-t} \cdot y_t \mid \mathbf{x})$$
  where $\mathbf{y}_{-t} = \langle y_1, \ldots, y_{t-1}, y_{t+1}, \ldots, y_n \rangle$ and $\mathbf{y}_{-t} \cdot y = \langle y_1, \ldots, y_{t-1}, y, y_{t+1}, \ldots, y_n \rangle$
Example
[Figure: the label lattice for "His friend runs the company" with the t-th state fixed to $y_t$ (e.g. Noun); the remaining positions range over Det, Noun, Verb, Adj and form $\mathbf{y}_{-t}$]
• Enumerate all $\mathbf{y}_{-t}$ while fixing the t-th state to $y_t$
• If we can compute $\sum_{\mathbf{y}_{-t}} p(\mathbf{y}_{-t} \cdot y_t \mid \mathbf{x}) \log p(\mathbf{y}_{-t} \cdot y_t \mid \mathbf{x})$ efficiently, we can compute the gradient
Decomposition of Entropy
• In the following, we use
$$\sum_b p(a, b) \log p(a, b) = \sum_b p(a) p(b \mid a) \log \bigl(p(a) p(b \mid a)\bigr) = p(a) \log p(a) \sum_b p(b \mid a) + p(a) \sum_b p(b \mid a) \log p(b \mid a) = p(a) \log p(a) + p(a) \sum_b p(b \mid a) \log p(b \mid a)$$
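A quick numeric check of this identity, with assumed toy values for $p(a)$ and $p(b \mid a)$:

```python
import math

p_a = 0.4                    # assumed p(a)
p_b_given_a = [0.25, 0.75]   # assumed p(b|a), summing to 1

lhs = sum(p_a * pb * math.log(p_a * pb) for pb in p_b_given_a)
rhs = p_a * math.log(p_a) + p_a * sum(pb * math.log(pb) for pb in p_b_given_a)
print(abs(lhs - rhs) < 1e-12)  # True
```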
Subsequence Constrained Entropy
[Figure: the lattice with position t fixed to $y_t$; the prefix and suffix positions form $\mathbf{y}_{-t}$]
$$\sum_{\mathbf{y}_{-t}} p(\mathbf{y}_{-t} \cdot y_t \mid \mathbf{x}) \log p(\mathbf{y}_{-t} \cdot y_t \mid \mathbf{x}) = p(y_t \mid \mathbf{x}) \log p(y_t \mid \mathbf{x}) + p(y_t \mid \mathbf{x}) \sum_{\mathbf{y}_{-t}} p(\mathbf{y}_{-t} \mid y_t, \mathbf{x}) \log p(\mathbf{y}_{-t} \mid y_t, \mathbf{x})$$
• The first term is computed from the forward/backward probabilities; the remaining sum is the subsequence constrained entropy
Forward/Backward Subsequence Constrained Entropy
[Figure: the lattice split around the fixed state $y_t$: the prefix entropy $H^{\alpha}(\mathbf{y}_{1..t-1} \mid y_t, \mathbf{x})$ covers positions 1..t-1 and the suffix entropy $H^{\beta}(\mathbf{y}_{t+1..n} \mid y_t, \mathbf{x})$ covers positions t+1..n]
• Since the prefix and the suffix are conditionally independent given $y_t$,
$$\sum_{\mathbf{y}_{-t}} p(\mathbf{y}_{-t} \mid y_t, \mathbf{x}) \log p(\mathbf{y}_{-t} \mid y_t, \mathbf{x}) = -H^{\alpha}(\mathbf{y}_{1..t-1} \mid y_t, \mathbf{x}) - H^{\beta}(\mathbf{y}_{t+1..n} \mid y_t, \mathbf{x})$$
  where
$$H^{\alpha}(\mathbf{y}_{1..t-1} \mid y_t, \mathbf{x}) = -\sum_{\mathbf{y}_{1..t-1}} p(\mathbf{y}_{1..t-1} \mid y_t, \mathbf{x}) \log p(\mathbf{y}_{1..t-1} \mid y_t, \mathbf{x})$$
$$H^{\beta}(\mathbf{y}_{t+1..n} \mid y_t, \mathbf{x}) = -\sum_{\mathbf{y}_{t+1..n}} p(\mathbf{y}_{t+1..n} \mid y_t, \mathbf{x}) \log p(\mathbf{y}_{t+1..n} \mid y_t, \mathbf{x})$$
Dynamic Computation of Hα
• $H^{\alpha}$ can be computed incrementally, by the chain rule of entropy:
$$H^{\alpha}(\mathbf{y}_{1..t-1} \mid y_t, \mathbf{x}) = H^{\alpha}(\mathbf{y}_{1..t-2} \cdot y_{t-1} \mid y_t, \mathbf{x}) = -\sum_{y_{t-1}} p(y_{t-1} \mid y_t, \mathbf{x}) \log p(y_{t-1} \mid y_t, \mathbf{x}) + \sum_{y_{t-1}} p(y_{t-1} \mid y_t, \mathbf{x}) \, H^{\alpha}(\mathbf{y}_{1..t-2} \mid y_{t-1}, \mathbf{x})$$
  – the recursion uses $p(\mathbf{y}_{1..t-2} \mid y_{t-1}, y_t, \mathbf{x}) = p(\mathbf{y}_{1..t-2} \mid y_{t-1}, \mathbf{x})$ (Markov property)
  – $p(y_{t-1} \mid y_t, \mathbf{x})$ is computed from the forward/backward probabilities
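A sketch of this incremental computation for a first-order chain, reusing the hypothetical `phi`/`psi` potential convention from the forward-backward sketch (an illustration of the recursion, not Mann and McCallum's implementation):

```python
import numpy as np

def prefix_entropies(phi, psi):
    """H[t, y] = H_alpha(y_{1..t-1} | y_t = y, x), built left to right."""
    n, L = phi.shape
    alpha = np.zeros((n, L))
    alpha[0] = phi[0]
    H = np.zeros((n, L))  # H[0] = 0: the empty prefix has zero entropy
    for t in range(1, n):
        # p(y_{t-1} | y_t, x) is proportional to alpha[t-1, y_{t-1}] * psi[y_{t-1}, y_t]
        trans = alpha[t - 1][:, None] * psi
        p_prev = trans / trans.sum(axis=0)
        # chain rule: H(y_{t-1} | y_t) + E_{y_{t-1}}[ H_alpha(. | y_{t-1}) ]
        H[t] = (-(p_prev * np.log(p_prev)).sum(axis=0)
                + (p_prev * H[t - 1][:, None]).sum(axis=0))
        alpha[t] = phi[t] * (alpha[t - 1] @ psi)
    return H

rng = np.random.default_rng(0)
phi, psi = rng.random((4, 3)) + 0.1, rng.random((3, 3)) + 0.1  # positive potentials
print(prefix_entropies(phi, psi)[-1])  # prefix entropies at the last position
```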
References
• Y. Grandvalet and Y. Bengio. 2004. Semi-supervised learning by entropy minimization. In NIPS 2004.
• F. Jiao, S. Wang, C.-H. Lee, R. Greiner, and D. Schuurmans. 2006. Semi-supervised conditional random fields for improved sequence segmentation and labeling. In COLING/ACL 2006.
• G. S. Mann and A. McCallum. 2007. Efficient computation of entropy gradient for semi-supervised conditional random fields. In NAACL-HLT 2007.
• X. Zhu. 2005. Semi-supervised learning literature survey. Technical Report 1530, Computer Sciences, University of Wisconsin-Madison.