Conditional Random Fields (CRFs)

Joseph Keshet

May 26, 2014

We consider an abstract set $\mathcal{X}$ of possible inputs and a set $\mathcal{Y}$ of possible outputs, drawn from a fixed but unknown distribution $\rho$. We also consider a linear classifier parameterized by a weight vector $w \in \ell_2$, such that the parameter setting $w$ determines a mapping of any input $x \in \mathcal{X}$ to an output $\hat{y}_w(x) \in \mathcal{Y}$, defined as follows:
\[
\hat{y}_w(x) = \arg\max_{y \in \mathcal{Y}} \; w^\top \phi(x, y), \tag{1}
\]
where $\phi : \mathcal{X} \times \mathcal{Y} \to \ell_2$ is a vector-valued function called a feature map. The goal of the learning problem is to find a setting of the weight vector $w$ of the classifier so as to achieve desirable performance. We also assume a cost, or task loss, function $L : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}_+$, such that for $y, \hat{y} \in \mathcal{Y}$, $L(y, \hat{y})$ is a non-negative real number giving the cost incurred when the system output is $\hat{y}$ but the desired output is $y$. The goal of the learning algorithm is to set the weight vector $w$ so as to minimize the expected cost of the classifier,
\[
w^* = \arg\min_{w} \; \mathbb{E}_{(x,y) \sim \rho} \big[ L(y, \hat{y}_w(x)) \big]. \tag{2}
\]
The expected cost is often called the risk.

A conditional random field (CRF) [1] is a probabilistic framework for labeling structured data. CRF training is usually formulated in terms of a finite sample $(x_1, y_1), \ldots, (x_m, y_m)$ drawn i.i.d. from $\rho$, and is defined by the following training optimization problem:
\[
w^* = \arg\min_{w} \; \frac{1}{m} \sum_{i=1}^{m} -\ln P_w(y_i \mid x_i) + \frac{\lambda}{2} \|w\|^2, \tag{3}
\]
where
\[
P_w(y \mid x) = \frac{1}{Z(x)} \, e^{w^\top \phi(x, y)} \tag{4}
\]
and
\[
Z(x) = \sum_{y \in \mathcal{Y}} e^{w^\top \phi(x, y)}. \tag{5}
\]

The main objection to using CRF training directly is that the training does not make use of the cost function $L$ and therefore cannot be expected to produce a good solution to Eq. (2). However, given a conditional probability model $P_w(y \mid x)$, where $w$ is found according to Eq. (3), one could use the decision-theoretic prediction defined as follows:
\[
\hat{y}_w(x) = \arg\min_{\hat{y} \in \mathcal{Y}} \; \mathbb{E}_{y \mid x} \big[ L(y, \hat{y}) \big]. \tag{6}
\]
If $P_w(y \mid x)$ equals $\rho(y \mid x)$, then Eq. (6) gives an optimal decoding. For example, if the cost function is the zero-one loss, $L(y, \hat{y}) = \mathbf{1}[y \neq \hat{y}]$, then from Eq. (6) we get the Bayes optimal decoder:
\[
\hat{y}_w(x) = \arg\min_{y'} \; \mathbb{E}_{y \mid x} \big[ \mathbf{1}[y \neq y'] \big]
             = \arg\min_{y'} \; P(y \neq y' \mid x)
             = \arg\max_{y'} \; P(y = y' \mid x).
\]
However, in many applications it is unrealistic to assume that $P_w(y \mid x)$ equals $\rho(y \mid x)$. For this reason, CRF training leads to an inconsistency: the optimum of Eq. (3) is different from the optimum of Eq. (2).

The objective of the optimization problem defined in Eq. (3) is convex. One way of solving it is by stochastic gradient descent. The gradient of the objective of Eq. (3) with respect to $w$, for a single example $(x_i, y_i)$, is
\begin{align*}
\nabla_w \Big[ -\ln P_w(y_i \mid x_i) + \frac{\lambda}{2}\|w\|^2 \Big]
&= \nabla_w \Big[ -\ln \frac{e^{w^\top \phi(x_i, y_i)}}{\sum_{y \in \mathcal{Y}} e^{w^\top \phi(x_i, y)}} + \frac{\lambda}{2}\|w\|^2 \Big] \\
&= \nabla_w \Big[ -w^\top \phi(x_i, y_i) + \ln \sum_{y \in \mathcal{Y}} e^{w^\top \phi(x_i, y)} + \frac{\lambda}{2}\|w\|^2 \Big] \tag{7} \\
&= -\phi(x_i, y_i) + \frac{\sum_{y \in \mathcal{Y}} \phi(x_i, y) \, e^{w^\top \phi(x_i, y)}}{\sum_{y' \in \mathcal{Y}} e^{w^\top \phi(x_i, y')}} + \lambda w \tag{8} \\
&= -\phi(x_i, y_i) + \sum_{y \in \mathcal{Y}} \phi(x_i, y) \, P_w(y \mid x_i) + \lambda w. \tag{9}
\end{align*}
Hence, with learning rate $\eta$, the update rule of the algorithm is
\[
w \leftarrow (1 - \lambda \eta)\, w + \eta \Big[ \phi(x_i, y_i) - \mathbb{E}_{y \mid x_i}\big[ \phi(x_i, y) \big] \Big] \tag{10}
\]
(a small numerical sketch of this update appears after the references).

References

[1] J. Lafferty, A. McCallum, and F. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the Eighteenth International Conference on Machine Learning (ICML), pages 282-289, 2001.
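As an illustration, the following is a minimal sketch of the stochastic gradient update of Eq. (10) on a toy multiclass problem. Everything in it is an assumption made for the example and not part of the note: the block feature map $\phi(x, y)$ that copies $x$ into the coordinate block of label $y$, the constants, and the synthetic labeling rule.

```python
# Minimal sketch of CRF training with the stochastic gradient update of Eq. (10).
# Toy multiclass setting: phi(x, y) copies x into the block of coordinates of label y.
import numpy as np

num_labels = 3   # |Y|
dim_x = 4        # dimension of the raw input x
lam = 0.1        # regularization coefficient lambda
eta = 0.05       # learning rate eta

def phi(x, y):
    """Joint feature map phi(x, y): place x in block y of a |Y|*dim_x vector."""
    f = np.zeros(num_labels * dim_x)
    f[y * dim_x:(y + 1) * dim_x] = x
    return f

def conditional(w, x):
    """P_w(y | x) = exp(w.phi(x, y)) / Z(x), Eqs. (4)-(5), computed stably."""
    scores = np.array([w @ phi(x, y) for y in range(num_labels)])
    scores -= scores.max()           # subtract the max score for numerical stability
    p = np.exp(scores)
    return p / p.sum()

def sgd_step(w, x, y_true):
    """One stochastic step on -ln P_w(y_i|x_i) + (lambda/2)||w||^2, i.e. Eq. (10)."""
    p = conditional(w, x)
    # E_{y|x}[phi(x, y)] under the current model, as in Eq. (9)
    expected_phi = sum(p[y] * phi(x, y) for y in range(num_labels))
    return (1.0 - lam * eta) * w + eta * (phi(x, y_true) - expected_phi)

# Synthetic data: the label is the index of the largest of the first |Y| coordinates.
rng = np.random.default_rng(0)
w = np.zeros(num_labels * dim_x)
for _ in range(2000):
    x = rng.normal(size=dim_x)
    y = int(np.argmax(x[:num_labels]))
    w = sgd_step(w, x, y)

x_test = rng.normal(size=dim_x)
print("P_w(y | x_test):", conditional(w, x_test))
print("prediction:", int(np.argmax([w @ phi(x_test, y) for y in range(num_labels)])))
```

The update moves $w$ toward the features of the observed label and away from the model's expected features, which is the "soft" counterpart of a perceptron-style update; the factor $(1 - \lambda\eta)$ implements the $\ell_2$ regularization of Eq. (3).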