CONDITIONAL RANDOM FIELDS
JOHN THICKSTUN
logistic regression
CRFs can be seen as a generalization of logistic regression. So we will begin by reviewing
logistic regression. This is the simplest example of a “log-linear model” where the log-odds
of the probability of a binary label y ∈ {0, 1} are a linear function of the data x ∈ Rd :
$$\operatorname{logit}(P(y=1|x)) = \log \frac{P(y=1|x)}{P(y=0|x)} = \log \frac{P(y=1|x)}{1-P(y=1|x)} = \beta^T x = \beta_0 + \sum_{k=1}^d \beta_k x_k.$$
Rearranging terms, we find that the probability that y = 1 is given by the sigmoid function
$$P(y=1|x) = \frac{1}{1+\exp(-\beta^T x)} = S(\beta^T x).$$
The logistic regression game is to find the MLE of the weights. The likelihood for $n$
independent observations $X = (X_1, \dots, X_n)^T$ is
$$L_X(\beta) = \prod_{i=1}^n P(Y_i=1|X_i)^{Y_i}\, P(Y_i=0|X_i)^{1-Y_i}.$$
And the log-likelihood is
$$\mathcal{L}_X(\beta) = \log L_X(\beta) = \sum_{i=1}^n \left( Y_i \log P(Y_i=1|X_i) + (1-Y_i)\log P(Y_i=0|X_i) \right).$$
Proposition.
$$\nabla\mathcal{L}_X(\beta) = X^T\left(Y - S(X\beta)\right).$$
Proof. Note that $P(Y_i=0|X_i) = 1 - P(Y_i=1|X_i)$, so the log-likelihood can be rewritten:
$$\mathcal{L}_X(\beta) = \sum_{i=1}^n \left( Y_i \log S(\beta^T X_i) + (1-Y_i)\log\left(1 - S(\beta^T X_i)\right) \right).$$
Calculus verifies that $\frac{d}{dz}S(z) = S(z)(1-S(z))$ and
$$\nabla\mathcal{L}_X(\beta) = \sum_{i=1}^n \left( Y_i \frac{\nabla S(\beta^T X_i)}{S(\beta^T X_i)} - (1-Y_i)\frac{\nabla S(\beta^T X_i)}{1 - S(\beta^T X_i)} \right)$$
$$= \sum_{i=1}^n \left( Y_i \frac{S(\beta^T X_i)(1-S(\beta^T X_i))}{S(\beta^T X_i)} - (1-Y_i)\frac{S(\beta^T X_i)(1-S(\beta^T X_i))}{1-S(\beta^T X_i)} \right) X_i$$
$$= \sum_{i=1}^n \left( Y_i(1-S(\beta^T X_i)) - (1-Y_i)S(\beta^T X_i) \right) X_i = \sum_{i=1}^n \left( Y_i - S(\beta^T X_i)\right) X_i.$$
We can proceed from here with first-order gradient updates:
$$\beta' = \beta + t\nabla\mathcal{L}_X(\beta) = \beta + tX^T(Y - S(X\beta)).$$
As an aside, if we observe that $\mathbb{E}Y_i = P(Y_i=1|X_i) = S(\beta^T X_i)$, then the first-order
optimality condition can be written
$$X^T Y = \mathbb{E}_\beta X^T Y.$$
I.e., logistic regression is the distribution in the class of log-linear models satisfying the
constraint that the empirical expectation of each feature matches its expectation under the model.
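As a concrete illustration, here is a minimal NumPy sketch of these gradient updates on synthetic data; the function names, step size, and iteration count are illustrative choices, not part of the notes.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_gd(X, Y, t=0.01, iters=2000):
    """Gradient ascent on the log-likelihood: beta <- beta + t * X^T (Y - S(X beta))."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        beta = beta + t * X.T @ (Y - sigmoid(X @ beta))
    return beta

# Illustrative usage on synthetic data.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
beta_true = np.array([1.0, -2.0, 0.5])
Y = (rng.uniform(size=200) < sigmoid(X @ beta_true)).astype(float)
beta_hat = fit_logistic_gd(X, Y)
```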
Alternatively, we can do something second-order.
Proposition. Let $W = \operatorname{diag}\left(S(\beta^T X_i)(1 - S(\beta^T X_i))\right)$. Then
$$\nabla^2\mathcal{L}_X(\beta) = -X^T W X.$$
Proof.
$$\nabla^2\mathcal{L}_X(\beta) = -\sum_{i=1}^n X_i\, \nabla S(\beta^T X_i)^T = -\sum_{i=1}^n S(\beta^T X_i)(1 - S(\beta^T X_i))\, X_i X_i^T$$
$$= -X^T \operatorname{diag}\left(S(\beta^T X_i)(1 - S(\beta^T X_i))\right) X = -X^T W X.$$
Note that $W \succ 0$, so $-\nabla^2\mathcal{L}_X(\beta) = X^T W X \succeq 0$: the log-likelihood is concave (equivalently, the logistic loss is convex). This guarantees the convergence of both
the first and second order methods to the MLE. In particular, the Newton updates are
$$\beta' = \beta - \left(\nabla^2\mathcal{L}_X(\beta)\right)^{-1}\nabla\mathcal{L}_X(\beta) = \beta + (X^T W X)^{-1} X^T(Y - S(X\beta)).$$
Compare this with least squares regression. In that case we fit $y \approx X\beta$, and
the closed form solution is
$$\beta = (X^T X)^{-1} X^T Y.$$
For this reason, logistic regression is sometimes described as an iteratively reweighted
least squares (IRLS) regression.
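A corresponding sketch of the Newton (IRLS) iteration, continuing the previous snippet; it reuses sigmoid and the synthetic X, Y from above, and the small ridge term is an illustrative guard against a singular $X^T W X$, not part of the notes.

```python
def fit_logistic_newton(X, Y, iters=25, ridge=1e-8):
    """Newton updates: beta <- beta + (X^T W X)^{-1} X^T (Y - S(X beta)),
    where W = diag(S(X beta) (1 - S(X beta)))."""
    n, d = X.shape
    beta = np.zeros(d)
    for _ in range(iters):
        p = sigmoid(X @ beta)                 # S(X beta)
        W = p * (1.0 - p)                     # diagonal entries of W
        H = X.T @ (W[:, None] * X)            # X^T W X
        g = X.T @ (Y - p)                     # gradient of the log-likelihood
        beta = beta + np.linalg.solve(H + ridge * np.eye(d), g)
    return beta
```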
multinomial logistic regression
We will now generalize the model from the previous section to a multiclass setting. For $k$
classes, we can generalize the log-odds characterization of logistic regression by expressing
the log-odds of each class $j > 1$ versus class $1$ as a log-linear function of the data:
$$\operatorname{logit}(P(y=j|x)) = \log \frac{P(y=j|x)}{P(y=1|x)} = \beta_j^T x.$$
Because probabilities sum to one, we must have
$$P(y=1|x) = \frac{1}{1 + \sum_{j=2}^k e^{\beta_j^T x}}.$$
And a little algebra shows that for $j > 1$,
$$P(y=j|x) = \frac{e^{\beta_j^T x}}{1 + \sum_{j'=2}^k e^{\beta_{j'}^T x}}.$$
We can symmetrize the model by reparameterizing; observe that
$$P(y=1|x) = \frac{e^{\beta_1^T x}}{e^{\beta_1^T x} + \sum_{j=2}^k e^{(\beta_1+\beta_j)^T x}} = \frac{e^{\beta_1'^T x}}{\sum_{j=1}^k e^{\beta_j'^T x}},$$
where $\beta_1' = \beta_1$ is arbitrary and $\beta_j' = \beta_1 + \beta_j$ for $j > 1$. Therefore, replacing $\beta'$ with $\beta$, for all $j$,
$$P(y=j|x) = \frac{e^{\beta_j^T x}}{\sum_{j'=1}^k e^{\beta_{j'}^T x}} = \frac{1}{Z_\beta(x)}\, e^{\beta_j^T x}.$$
This is the softmax representation of multinomial logistic regression. The log-likelihood
for $n$ independent observations is $\mathcal{L}_X : \mathbb{R}^{k\times d} \to \mathbb{R}$ defined by
$$\mathcal{L}_X(\beta) = \log L_X(\beta) = \sum_{i=1}^n\sum_{j=1}^k \log P(Y_i=j|X_i)^{1_{Y_i=j}}$$
$$= \sum_{i=1}^n\sum_{j=1}^k \left(\beta_j^T X_i - \log Z_\beta(X_i)\right) 1_{Y_i=j} = \sum_{i=1}^n \left(\beta_{Y_i}^T X_i - \log Z_\beta(X_i)\right).$$
Proposition.
$$\nabla_j\mathcal{L}_X(\beta) = X^T\left(1_{Y=j} - P(Y=j|X)\right).$$
Proof. The derivative of the log-partition function is
$$\frac{\partial \log Z_\beta(x)}{\partial \beta_{jl}} = \frac{1}{Z_\beta(x)} \sum_{j'=1}^k x_l\, e^{\beta_{j'}^T x}\, 1_{j'=j} = \frac{1}{Z_\beta(x)}\, x_l\, e^{\beta_j^T x} = x_l\, P(y=j|x).$$
And therefore the derivative of the log-likelihood is
$$\frac{\partial\mathcal{L}_X(\beta)}{\partial\beta_{jl}} = \sum_{i=1}^n \left(X_{il}\, 1_{Y_i=j} - X_{il}\, P(Y_i=j|X_i)\right) = \sum_{i=1}^n \left(1_{Y_i=j} - P(Y_i=j|X_i)\right) X_{il}.$$
We can rewrite this as
$$\nabla_j\mathcal{L}_X(\beta) = \sum_{i=1}^n \left(1_{Y_i=j} - P(Y_i=j|X_i)\right) X_i.$$
This gives us the first order optimality conditions
$$X^T 1_{Y=j} = X^T P(Y=j|X) = X^T \mathbb{E}_\beta 1_{Y=j}.$$
I.e., for each class $j$, the empirical expectation of each feature over examples labeled $j$ matches
its expectation under the distribution parameterized by the optimal $\beta$.
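A sketch of this gradient for the softmax model, with $\beta$ stored as a $k \times d$ matrix and the indicators $1_{Y_i=j}$ encoded one-hot; the names and array layout are illustrative assumptions, not part of the notes.

```python
import numpy as np

def softmax_probs(beta, X):
    """Row i holds P(Y_i = j | X_i) for j = 1, ..., k; beta has shape (k, d)."""
    logits = X @ beta.T                          # entry (i, j) is beta_j^T X_i
    logits -= logits.max(axis=1, keepdims=True)  # stabilize the exponentials
    expl = np.exp(logits)
    return expl / expl.sum(axis=1, keepdims=True)

def grad_loglik(beta, X, Y_onehot):
    """Stack grad_j L_X(beta) = X^T (1_{Y=j} - P(Y=j|X)) into a (k, d) matrix."""
    P = softmax_probs(beta, X)                   # shape (n, k)
    return (Y_onehot - P).T @ X                  # row j is sum_i (1_{Y_i=j} - P(Y_i=j|X_i)) X_i
```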
conditional random fields
Define a (log-linear) potential over observations $x \in \mathcal{X}$ and labelings $y \in \mathcal{Y}$ by
$$\psi_t(y,x) = \exp\left(\beta^T\varphi_t(y,x)\right).$$
Here $\varphi_t$ is a feature map. We can think of $\psi_t$ as an un-normalized probability distribution
and consider the product distribution
$$\Psi(y,x) = \prod_{t=1}^T \psi_t(y,x).$$
The chain-structured CRF is then given by
$$P(Y=y|x) = \frac{1}{Z(x,\beta)}\,\Psi(y,x) = \frac{1}{Z(x,\beta)}\prod_{t=1}^T \psi_t(y,x)$$
$$= \frac{1}{Z(x,\beta)}\exp\left(\beta^T\sum_{t=1}^T\varphi_t(y,x)\right) = \frac{1}{Z(x,\beta)}\exp\left(\beta^T\Phi(y,x)\right).$$
Above we define
$$\Phi(y,x) = \sum_{t=1}^T \varphi_t(y,x).$$
And the partition function is
$$Z(x,\beta) = \sum_y \Psi(y,x) = \sum_y \exp\left(\beta^T\Phi(y,x)\right).$$
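For a tiny label set and a short sequence, the definitions above can be evaluated by brute force, which makes them concrete even though the enumeration is exponential in $T$. A minimal sketch, assuming the feature maps are supplied as a function phi(y, x, t) returning a vector (an illustrative interface, not fixed by the notes):

```python
import itertools
import numpy as np

def crf_prob(y, x, beta, phi, labels, T):
    """P(Y = y | x) for the chain CRF, with Phi(y, x) = sum_t phi(y, x, t)."""
    def score(y_seq):
        Phi = sum(phi(y_seq, x, t) for t in range(T))   # Phi(y, x)
        return np.exp(beta @ Phi)                       # exp(beta^T Phi(y, x))
    Z = sum(score(y_prime) for y_prime in itertools.product(labels, repeat=T))
    return score(tuple(y)) / Z
```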
training
Given supervised training sequences $(x^i, y^i)$, the log-likelihood is
$$\mathcal{L}(\beta) = \frac{1}{n}\sum_{i=1}^n \log P(Y=y^i|x^i) = \frac{1}{n}\sum_{i=1}^n \left(\beta^T\Phi(y^i,x^i) - \log Z(x^i,\beta)\right).$$
And the first order optimality conditions are
$$0 = \frac{\partial\mathcal{L}}{\partial\beta_k} = \frac{1}{n}\sum_{i=1}^n\left(\Phi_k(y^i,x^i) - \frac{\partial}{\partial\beta_k}\log Z(x^i,\beta)\right),$$
where
$$\frac{\partial}{\partial\beta_k}\log Z(x^i,\beta) = \frac{1}{Z(x^i,\beta)}\sum_{y'}\Phi_k(y',x^i)\exp\left(\beta^T\Phi(y',x^i)\right) = \mathbb{E}_\beta\Phi_k(Y,x^i).$$
So the first order condition reduces to
$$\sum_{i=1}^n \Phi(y^i,x^i) = \sum_{i=1}^n \mathbb{E}_\beta\Phi(Y,x^i).$$
This is the maximum entropy condition: find the weights $\beta$ that make the model
expectation match the empirical expectation. We can find the optimal weights with first-order gradient updates:
$$\beta' = \beta + t\nabla\mathcal{L}(\beta) = \beta + \frac{t}{n}\sum_{i=1}^n\left(\Phi(y^i,x^i) - \mathbb{E}_\beta\Phi(Y,x^i)\right).$$
How do we compute $\mathbb{E}_\beta\Phi_k(Y, x^i)$? This is intractable in general because we need to sum
over all possible sequences $y'$. But if we impose additional structure on our feature map,
the computation may simplify.
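Given any routine for the model expectation $\mathbb{E}_\beta\Phi(Y,x)$ (brute-force enumeration as in the sketch above for tiny problems, or the forward-backward recursions developed below), the gradient update itself is short. A sketch, where Phi and expected_Phi are hypothetical helpers rather than functions defined in these notes:

```python
def crf_gradient_step(beta, data, Phi, expected_Phi, t=0.1):
    """One update beta' = beta + (t/n) sum_i (Phi(y_i, x_i) - E_beta Phi(Y, x_i)).
    `data` is a list of (x_i, y_i) pairs; Phi and expected_Phi are assumed helpers."""
    n = len(data)
    grad = sum(Phi(y_i, x_i) - expected_Phi(x_i, beta) for x_i, y_i in data) / n
    return beta + t * grad
```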
logistic regression revisited. For example, suppose there is some $\varphi'$ such that
$$\varphi_t(y,x) = \varphi'(y_t, x, t).$$
That is, each $\varphi_t$ depends only on $y_t$, rather than the whole label sequence. In this case the
expectation reduces to
$$\mathbb{E}_\beta\Phi(Y,x) = \sum_{t=1}^T \sum_{y} \varphi'(y_t,x,t)\, P(Y=y|x) = \sum_{t=1}^T \sum_{y_t} \varphi'(y_t,x,t) \sum_{y_s:\, s\neq t} P(Y=y|x).$$
The marginal probability is
$$P(Y_t=y_t|x) = \sum_{y_s:\, s\neq t} P(y|x) = \frac{1}{Z(x,\beta)}\sum_{y_s:\, s\neq t}\prod_{s=1}^T \psi_s(y,x) = \frac{\psi_t(y,x)}{Z(x,\beta)}\sum_{y_s:\, s\neq t}\prod_{s\neq t}\psi_s(y,x).$$
And the product reduces to
$$\prod_{s\neq t}\psi_s(y,x) = \prod_{s\neq t}\exp\left(\beta^T\varphi'(y_s,x,s)\right) = \exp\left(\beta^T\sum_{s\neq t}\varphi'(y_s,x,s)\right).$$
For example, if $\varphi'(y_s,x,s) = x_s y_s$ and $y_s \in \{0,1\}$, then we recover a simple logistic
regression model: the maximum entropy condition becomes
$$x^T y = \Phi(y,x) = \mathbb{E}_\beta\Phi(Y,x) = \sum_{y'} x^T y'\, P(Y=y'|x) = x^T\mathbb{E}_\beta Y.$$
In particular, the partition function factors:
$$Z(x,\beta) = \prod_{t=1}^T\sum_{y_t}\exp\left(\beta^T x_t y_t\right) = \prod_{t=1}^T\left(1+\exp(\beta^T x_t)\right).$$
And the expectation $\mathbb{E}_\beta Y_t = P(Y_t=1|x)$ can be written as
$$\frac{\exp\left(\beta^T x_t\right)}{Z(x,\beta)}\sum_{y_s:\, s\neq t}\prod_{s\neq t}\exp(\beta^T x_s y_s) = S(\beta^T x_t)\sum_{y_s:\, s\neq t}\prod_{s\neq t}\frac{\exp(\beta^T x_s y_s)}{1+\exp(\beta^T x_s)} = S(\beta^T x_t).$$
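A quick brute-force numerical check of this reduction (illustrative: it enumerates all binary labelings of a short synthetic sequence and compares the CRF marginal to the sigmoid):

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
T, d = 4, 3
x = rng.normal(size=(T, d))                    # one feature vector x_t per position
beta = rng.normal(size=d)

def Phi(y):                                    # Phi(y, x) = sum_t x_t y_t = x^T y
    return sum(x[t] * y[t] for t in range(T))

scores = {y: np.exp(beta @ Phi(y)) for y in itertools.product([0, 1], repeat=T)}
Z = sum(scores.values())
marginal = sum(s for y, s in scores.items() if y[0] == 1) / Z    # P(Y_1 = 1 | x)
assert np.isclose(marginal, 1.0 / (1.0 + np.exp(-beta @ x[0])))  # equals S(beta^T x_1)
```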
quadratic interactions. We get a much richer model when we allow interaction terms
between the labels $y_t$, but there is a delicate balance between the richness of these interactions
and tractable computations. In the case of pairwise interactions, the computations
remain tractable. Here we let
$$\varphi_t(y,x) = \varphi''(y_t, y_{t-1}, x, t).$$
By marginalization,
$$\mathbb{E}_\beta\Phi(Y,x) = \sum_{t=1}^T\sum_{y_t,y_{t-1}}\varphi''(y_t,y_{t-1},x,t)\sum_{y_j:\, j\notin\{t-1,t\}} P(Y=y|x)$$
$$= \sum_{t=1}^T\sum_{y_t,y_{t-1}}\varphi''(y_t,y_{t-1},x,t)\, P(Y_t=y_t,\, Y_{t-1}=y_{t-1}|x).$$
The probabilities $P(Y_t=y_t, Y_{t-1}=y_{t-1}|x)$ can be computed by dynamic programming,
known in this context as the forward-backward algorithm. From definitions and algebra,
$$P(Y_t=y_t,\, Y_{t-1}=y_{t-1}|x) = \sum_{y_j:\, j\notin\{t-1,t\}} P(Y=y|x) = \frac{1}{Z(x,\beta)}\sum_{y_j:\, j\notin\{t-1,t\}}\prod_{s=1}^T\psi(y_s,y_{s-1},x)$$
$$= \frac{1}{Z(x,\beta)}\,\psi(y_t,y_{t-1},x)\left(\sum_{y_j:\, j<t-1}\ \prod_{s=1}^{t-1}\psi(y_s,y_{s-1},x)\right)\left(\sum_{y_j:\, j>t}\ \prod_{s=t+1}^T\psi(y_s,y_{s-1},x)\right).$$
The latter two terms can be calculated recursively; define
$$\alpha_t(y_t) = \sum_{y_j:\, j<t}\prod_{s=1}^t\psi(y_s,y_{s-1},x), \qquad \gamma_t(y_t) = \sum_{y_j:\, j>t}\prod_{s=t+1}^T\psi(y_s,y_{s-1},x).$$
If we define $\alpha_0(y) = 1$ and $\gamma_T(y) = 1$ (both are empty sums of empty products), then
$$\alpha_t(y) = \sum_{y'}\psi(y,y',x)\,\alpha_{t-1}(y'), \qquad \gamma_t(y) = \sum_{y'}\psi(y',y,x)\,\gamma_{t+1}(y').$$
To summarize,
$$P(Y_t=y_t,\, Y_{t-1}=y_{t-1}|x) = \frac{1}{Z(x,\beta)}\,\alpha_{t-1}(y_{t-1})\,\psi(y_t,y_{t-1},x)\,\gamma_t(y_t).$$
And because probabilities sum to one, the normalizing constant is
$$Z(x,\beta) = \sum_y\prod_{t=1}^T\psi(y_t,y_{t-1},x) = \sum_y\alpha_T(y).$$
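A sketch of these recursions for a finite label set, assuming the potentials are supplied as an array with psi[t-1, a, b] = ψ(y_t = a, y_{t-1} = b, x) for t = 1, ..., T, and with the dummy initial label y_0 summed over the same label set (assumptions of this sketch; the names are illustrative). In practice one would run the recursions in log space to avoid overflow.

```python
import numpy as np

def forward_backward(psi):
    """psi has shape (T, K, K), psi[t-1, a, b] = psi(y_t = a, y_{t-1} = b, x).
    Returns alpha, gamma, the pairwise marginals P(Y_t = a, Y_{t-1} = b | x), and Z."""
    T, K, _ = psi.shape
    alpha = np.zeros((T, K))      # alpha[t-1, a] stores alpha_t(a)
    gamma = np.zeros((T, K))      # gamma[t-1, a] stores gamma_t(a)

    alpha[0] = psi[0].sum(axis=1)                 # alpha_1(y_1) = sum_{y_0} psi(y_1, y_0, x)
    for t in range(1, T):
        alpha[t] = psi[t] @ alpha[t - 1]          # alpha_t(y) = sum_{y'} psi(y, y', x) alpha_{t-1}(y')

    gamma[T - 1] = 1.0                            # gamma_T(y) = 1
    for t in range(T - 2, -1, -1):
        gamma[t] = psi[t + 1].T @ gamma[t + 1]    # gamma_t(y) = sum_{y'} psi(y', y, x) gamma_{t+1}(y')

    Z = alpha[T - 1].sum()                        # Z(x, beta) = sum_y alpha_T(y)

    # pair[t-1, a, b] = P(Y_t = a, Y_{t-1} = b | x) = alpha_{t-1}(b) psi(a, b) gamma_t(a) / Z, t >= 2;
    # pair[0] is left as zeros since position 1 pairs only with the dummy label y_0.
    pair = np.zeros((T, K, K))
    for t in range(1, T):
        pair[t] = gamma[t][:, None] * psi[t] * alpha[t - 1][None, :] / Z
    return alpha, gamma, pair, Z
```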