Conditional Random Fields (CRFs)
Joseph Keshet
May 26, 2014
We consider an abstract set X of possible inputs and a set Y of possible outputs, with pairs (x, y) drawn from a fixed but unknown distribution ρ. We also consider a linear classifier parameterized by a weight vector w ∈ ℓ2, such that the parameter setting w determines a mapping of any input x ∈ X to an output ŷw(x) ∈ Y, defined as follows

$$\hat{y}_w(x) = \arg\max_{y \in \mathcal{Y}} \; w^\top \phi(x, y) \qquad (1)$$

where φ : X × Y → ℓ2 is a vector-valued function that is called a feature map.
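As a concrete illustration, here is a minimal Python sketch of the decoder in Eq. (1), assuming the label set Y is small enough to enumerate explicitly. The toy feature map make_phi and all names below are illustrative placeholders, not part of the original formulation.

import numpy as np

def make_phi(dim, labels):
    """Toy joint feature map: copies the input features x into the block of y."""
    def phi(x, y):
        out = np.zeros(dim * len(labels))
        i = labels.index(y)
        out[i * dim:(i + 1) * dim] = x
        return out
    return phi

def predict(w, x, phi, labels):
    """Eq. (1): y_hat_w(x) = argmax over y in Y of w . phi(x, y), by brute force."""
    scores = [w @ phi(x, y) for y in labels]
    return labels[int(np.argmax(scores))]

For structured output spaces the argmax is of course not computed by enumeration but by a dedicated search (e.g. dynamic programming); the brute-force loop above only serves to make the definition concrete.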
The goal of the learning problem is to find a setting of the weight vector w of the classifier so as to achieve desirable performance. We also assume a cost, or task loss, function L : Y × Y → R+ such that for y, ŷ ∈ Y the value L(y, ŷ) is a non-negative real number giving the cost incurred when the system output is ŷ but the desired output is y. The goal of the learning algorithm is to set the weight vector w so as to minimize the expected cost of the classifier

$$w^* = \arg\min_w \; \mathbb{E}_{(x,y) \sim \rho}\left[ L(y, \hat{y}_w(x)) \right]. \qquad (2)$$

The expected cost is often called the risk.
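The risk in Eq. (2) cannot be evaluated exactly without knowledge of ρ, but it is commonly estimated by an empirical average over a held-out sample. A small Python sketch of such a plug-in estimate follows; the predictor and loss arguments are illustrative placeholders.

import numpy as np

def empirical_risk(data, predictor, loss):
    """(1/m) sum_i L(y_i, y_hat(x_i)): an empirical estimate of the risk in Eq. (2)."""
    return float(np.mean([loss(y, predictor(x)) for x, y in data]))

For instance, predictor could be lambda x: predict(w, x, phi, labels) from the sketch above, and loss the zero-one loss lambda y, yhat: float(y != yhat).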
A conditional random field (CRF) [1] is a probabilistic framework for labeling structured data. CRF training is usually formulated in terms of a finite sample (x1, y1), ..., (xm, ym) drawn i.i.d. from ρ, and is defined by the following training optimization problem

$$w^* = \arg\min_w \; \frac{1}{m} \sum_{i=1}^{m} -\ln P_w(y_i \mid x_i) + \frac{\lambda}{2} \|w\|^2 \qquad (3)$$

where

$$P_w(y \mid x) = \frac{1}{Z(x)} \, e^{w^\top \phi(x, y)} \qquad (4)$$

and

$$Z(x) = \sum_{y \in \mathcal{Y}} e^{w^\top \phi(x, y)}. \qquad (5)$$
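A hedged Python sketch of Eqs. (3)-(5) follows, again assuming Y is small enough to enumerate explicitly; the helpers phi, labels, and data are illustrative placeholders rather than anything specified in the text.

import numpy as np

def log_probs(w, x, phi, labels):
    """log P_w(y|x) for every y in labels: a log-softmax of the scores, Eqs. (4)-(5)."""
    scores = np.array([w @ phi(x, y) for y in labels])
    scores = scores - scores.max()            # shift for numerical stability
    log_z = np.log(np.exp(scores).sum())      # log Z(x), Eq. (5)
    return scores - log_z                     # log of Eq. (4)

def crf_objective(w, data, phi, labels, lam):
    """Eq. (3): (1/m) sum_i -ln P_w(y_i|x_i) + (lam/2) ||w||^2, with labels a list."""
    nll = 0.0
    for x, y in data:
        nll -= log_probs(w, x, phi, labels)[labels.index(y)]
    return nll / len(data) + 0.5 * lam * (w @ w)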
The main objection to using CRF training directly is that the training does not make use
of the cost function L and cannot be expected to produce a good solution to Eq. (2). However,
given a conditional probability model Pw(y|x), when w is found according to Eq. (3), one could use the decision-theoretic prediction defined as follows

$$\hat{y}_w(x) = \arg\min_{\hat{y}} \; \mathbb{E}_{y|x}\left[ L(y, \hat{y}) \right]. \qquad (6)$$

If Pw(y|x) equals ρ(y|x) then Eq. (6) gives an optimal decoding. For example, if the cost function is the zero-one loss, L(y, ŷw) = 1[y ≠ ŷw], then from Eq. (6) we get the Bayes optimal decoder

$$\hat{y}_w(x) = \arg\min_{y'} \; \mathbb{E}_{y|x}\left[ \mathbb{1}[y \neq y'] \right] = \arg\min_{y'} \; P(y \neq y' \mid x) = \arg\max_{y'} \; P(y = y' \mid x).$$
However, in many applications it is unrealistic to assume that Pw (y|x) equals ρ(y|x). For this
reason the CRF training leads to inconsistency – the optimum of Eq. (3) is different from the
optimum of Eq. (2).
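A sketch of the decision-theoretic decoder of Eq. (6), with the model Pw(y|x) used in place of the true posterior ρ(y|x), might look as follows in Python; phi, labels, and loss are again illustrative placeholders. With the zero-one loss it reduces to picking the most probable label, matching the derivation above.

import numpy as np

def bayes_decode(w, x, phi, labels, loss):
    """Eq. (6): argmin over y_hat of sum_y P_w(y|x) * L(y, y_hat)."""
    scores = np.array([w @ phi(x, y) for y in labels])
    p = np.exp(scores - scores.max())
    p = p / p.sum()                           # P_w(y|x), as in Eq. (4)
    risks = [sum(p[i] * loss(y, y_hat) for i, y in enumerate(labels))
             for y_hat in labels]
    return labels[int(np.argmin(risks))]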
The objective of the optimization problem defined in Eq. (3) is convex. One way of solving it is using stochastic gradient descent. The gradient of the objective of the optimization problem given in Eq. (3) with respect to w is

$$\begin{aligned}
\nabla_w \left[ -\ln P_w(y_i \mid x_i) + \frac{\lambda}{2} \|w\|^2 \right]
&= \nabla_w \left[ -\ln \frac{e^{w^\top \phi(x_i, y_i)}}{\sum_{y \in \mathcal{Y}} e^{w^\top \phi(x_i, y)}} + \frac{\lambda}{2} \|w\|^2 \right] \\
&= \nabla_w \left[ -w^\top \phi(x_i, y_i) + \ln \sum_{y \in \mathcal{Y}} e^{w^\top \phi(x_i, y)} + \frac{\lambda}{2} \|w\|^2 \right] \qquad (7) \\
&= -\phi(x_i, y_i) + \frac{\sum_{y \in \mathcal{Y}} \phi(x_i, y) \, e^{w^\top \phi(x_i, y)}}{\sum_{y' \in \mathcal{Y}} e^{w^\top \phi(x_i, y')}} + \lambda w \qquad (8) \\
&= -\phi(x_i, y_i) + \sum_{y \in \mathcal{Y}} \phi(x_i, y) \, P_w(y \mid x_i) + \lambda w \qquad (9)
\end{aligned}$$
Hence the update rule of the algorithm is

$$w \leftarrow (1 - \lambda \eta) \, w + \eta \left[ \phi(x_i, y_i) - \mathbb{E}_{y|x_i}\left[ \phi(x_i, y) \right] \right] \qquad (10)$$

where η > 0 is the learning rate.
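A minimal stochastic-gradient-descent sketch that computes the per-example gradient of Eq. (9) and applies the update of Eq. (10) is given below; the enumeration over labels, the regularization constant lam, the learning rate eta, and the number of epochs are illustrative choices rather than values from the text.

import numpy as np

def crf_gradient(w, x, y_true, phi, labels, lam):
    """Eq. (9): -phi(x, y_true) + sum_y phi(x, y) P_w(y|x) + lam * w."""
    feats = np.array([phi(x, y) for y in labels])
    scores = feats @ w
    p = np.exp(scores - scores.max())
    p = p / p.sum()                           # P_w(y|x_i)
    expected_phi = p @ feats                  # E_{y|x_i}[phi(x_i, y)]
    return -phi(x, y_true) + expected_phi + lam * w

def sgd_train(data, phi, labels, dim, lam=0.1, eta=0.01, epochs=10):
    """Applies w <- w - eta * gradient, which is exactly the update in Eq. (10)."""
    w = np.zeros(dim)
    for _ in range(epochs):
        for x, y in data:
            w = w - eta * crf_gradient(w, x, y, phi, labels, lam)
    return w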
References
[1] J. Lafferty, A. McCallum, and F. Pereira. Conditional random fields: Probabilistic models
for segmenting and labeling sequence data. In Proceedings of the Eighteenth International
Conference on Machine Learning (ICML), pages 282–289, 2001.