Supplement: Marginal Structured SVM with Hidden Variables
Wei Ping (wping@ics.uci.edu)
Qiang Liu (qliu1@ics.uci.edu)
Alexander Ihler (ihler@ics.uci.edu)
Department of Computer Science, UC Irvine
Constraint Form of Marginal Structured SVM
Here we give the constraint form of Eq. (3) in the main paper,
\[
\min_{w,\{\xi_i \ge 0\}} \;\frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i, \qquad (1)
\]
\[
\text{s.t.}\;\; \forall i \in \{1,\dots,n\},\; \forall y \in \mathcal{Y},\quad
\log \sum_{h} \exp\big[w^{T}\phi(x_i, y_i, h)\big] \;-\; \log \sum_{h} \exp\big[w^{T}\phi(x_i, y, h)\big] \;\ge\; \Delta(y_i, y) - \xi_i,
\]
where $\{\xi_i\}$ are the slack variables. One can show that the optimal solution $\{\xi_i^{*}\}, w^{*}$ satisfies
\[
\xi_i^{*} = \max_{y}\Big\{ \Delta(y_i, y) + \log \sum_{h} \exp\big[w^{*T}\phi(x_i, y, h)\big] \Big\} - \log \sum_{h} \exp\big[w^{*T}\phi(x_i, y_i, h)\big],
\]
which gives the same objective value as the unconstrained form. One can also derive a cutting plane-based training algorithm (Joachims et al., 2009) for this constraint formulation.
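To make the slack computation concrete, the following is a minimal Python sketch (not the authors' implementation) of the resulting separation step for one training example, assuming the label space Y and hidden space H are small enough to enumerate explicitly; the function name mssvm_slack and its arguments are illustrative.

import numpy as np
from scipy.special import logsumexp

def mssvm_slack(w, phi, delta, x_i, y_i, Y, H):
    # Optimal slack for one example:
    #   xi_i* = max_y { Delta(y_i, y) + log sum_h exp(w^T phi(x_i, y, h)) }
    #           - log sum_h exp(w^T phi(x_i, y_i, h)).
    # phi(x, y, h) returns a feature vector (numpy array); delta(y_true, y) is the task loss.
    truth_term = logsumexp([w @ phi(x_i, y_i, h) for h in H])
    aug = [delta(y_i, y) + logsumexp([w @ phi(x_i, y, h) for h in H]) for y in Y]
    y_star = Y[int(np.argmax(aug))]   # most violated label, i.e., the cutting-plane oracle
    xi = max(aug) - truth_term        # nonnegative, since y = y_i is included in the max
    return xi, y_star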
Details of Proofs
In this section, we give proofs for two lemmas referenced but omitted from the main paper.
Lemma 1. The objective of the unified framework (Eq. (5) in the main paper) is an upper bound of the empirical loss function $\Delta(y_i, \hat{y}_i^{\epsilon_h}(w))$ over the training set, where the prediction $\hat{y}_i^{\epsilon_h}(w)$ is decoded by "annealed" marginal MAP,
\[
\hat{y}_i^{\epsilon_h}(w) = \arg\max_{y} \, \log \sum_{h} \exp\Big[\frac{w^{T}\phi(x_i, y, h)}{\epsilon_h}\Big].
\]
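As an illustration, this decoding rule can be written as a short enumeration-based Python sketch; it assumes explicit enumeration of Y and H (a practical implementation for chain or grid models would instead rely on message passing), and eps_h stands for the temperature $\epsilon_h$.

import numpy as np
from scipy.special import logsumexp

def annealed_marginal_map(w, phi, x_i, Y, H, eps_h=1.0):
    # argmax_y log sum_h exp( w^T phi(x_i, y, h) / eps_h )
    # eps_h = 1 gives ordinary marginal MAP; eps_h -> 0+ approaches joint MAP over (y, h).
    scores = [logsumexp([w @ phi(x_i, y, h) / eps_h for h in H]) for y in Y]
    return Y[int(np.argmax(scores))]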
Proof.
\[
\begin{aligned}
\Delta\big(y_i, \hat{y}_i^{\epsilon_h}(w)\big)
&\le \Delta\big(y_i, \hat{y}_i^{\epsilon_h}(w)\big)
 + \epsilon_h \log \sum_{h} \exp\Big[\frac{w^{T}\phi(x_i, \hat{y}_i^{\epsilon_h}(w), h)}{\epsilon_h}\Big]
 - \epsilon_h \log \sum_{h} \exp\Big[\frac{w^{T}\phi(x_i, y_i, h)}{\epsilon_h}\Big] \\
&\le \epsilon_y \log \sum_{y} \exp\Big\{\frac{1}{\epsilon_y}\Big[\Delta(y_i, y)
 + \epsilon_h \log \sum_{h} \exp\Big(\frac{w^{T}\phi(x_i, y, h)}{\epsilon_h}\Big)\Big]\Big\}
 - \epsilon_h \log \sum_{h} \exp\Big[\frac{w^{T}\phi(x_i, y_i, h)}{\epsilon_h}\Big],
\end{aligned}
\]
where the first inequality holds by the definition of $\hat{y}_i^{\epsilon_h}(w)$, and the second holds for all $\epsilon_y > 0$, because the summation over $y$ contains $\hat{y}_i^{\epsilon_h}(w)$.
For convenience, we denote this upper bound as
\[
U_i(w; \epsilon_y, \epsilon_h) = U_i^{+}(w; \epsilon_y, \epsilon_h) - U_i^{-}(w; \epsilon_h), \qquad (2)
\]
where
\[
U_i^{+}(w; \epsilon_y, \epsilon_h) = \epsilon_y \log \sum_{y} \exp\Big\{\frac{1}{\epsilon_y}\Big[\Delta(y_i, y) + \epsilon_h \log \sum_{h} \exp\Big(\frac{w^{T}\phi(x_i, y, h)}{\epsilon_h}\Big)\Big]\Big\},
\]
\[
U_i^{-}(w; \epsilon_h) = \epsilon_h \log \sum_{h} \exp\Big[\frac{w^{T}\phi(x_i, y_i, h)}{\epsilon_h}\Big].
\]
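For checking an implementation on small toy problems, the surrogate $U_i$ can be evaluated by brute-force enumeration as in the following sketch; it assumes strictly positive temperatures eps_y and eps_h (the special cases discussed in the main paper arise as limits), and the helper names are illustrative.

import numpy as np
from scipy.special import logsumexp

def surrogate_loss(w, phi, delta, x_i, y_i, Y, H, eps_y=1.0, eps_h=1.0):
    # U_i^-(w; eps_h): annealed log-partition over h at the true label y_i
    u_minus = eps_h * logsumexp([w @ phi(x_i, y_i, h) / eps_h for h in H])
    # loss-augmented annealed scores for every candidate label y
    inner = [delta(y_i, y) + eps_h * logsumexp([w @ phi(x_i, y, h) / eps_h for h in H])
             for y in Y]
    # U_i^+(w; eps_y, eps_h): soft maximum over y at temperature eps_y
    u_plus = eps_y * logsumexp(np.asarray(inner) / eps_y)
    return u_plus - u_minus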
Lemma 2. The (sub-)gradient of $U_i(w; \epsilon_y, \epsilon_h)$ in (2) is
\[
\nabla_w U_i(w; \epsilon_y, \epsilon_h) = \mathbb{E}_{p_{(\epsilon_y, \epsilon_h)}(y, h \mid x_i)}\big[\phi(x_i, y, h)\big] - \mathbb{E}_{p_{\epsilon_h}(h \mid x_i, y_i)}\big[\phi(x_i, y_i, h)\big],
\]
where the corresponding temperature-controlled distributions are defined as
\[
p_{\epsilon_h}(h \mid x_i, y) \propto \exp\Big[\frac{w^{T}\phi(x_i, y, h)}{\epsilon_h}\Big],
\]
\[
p_{(\epsilon_y, \epsilon_h)}(y \mid x_i) \propto \exp\Big\{\frac{1}{\epsilon_y}\Big[\Delta(y, y_i) + \epsilon_h \log \sum_{h} \exp\Big(\frac{w^{T}\phi(x_i, y, h)}{\epsilon_h}\Big)\Big]\Big\},
\]
\[
p_{(\epsilon_y, \epsilon_h)}(y, h \mid x_i) = p_{\epsilon_h}(h \mid x_i, y) \cdot p_{(\epsilon_y, \epsilon_h)}(y \mid x_i).
\]
Proof.
\[
\begin{aligned}
\nabla_w \Big[\epsilon_h \log \sum_{h} \exp\Big(\frac{w^{T}\phi(x_i, y, h)}{\epsilon_h}\Big)\Big]
&= \epsilon_h \cdot \frac{\sum_{h} \exp\Big(\frac{w^{T}\phi(x_i, y, h)}{\epsilon_h}\Big) \cdot \frac{\phi(x_i, y, h)}{\epsilon_h}}{\sum_{h} \exp\Big(\frac{w^{T}\phi(x_i, y, h)}{\epsilon_h}\Big)} \\
&= \sum_{h} \Bigg\{\frac{\exp\Big(\frac{w^{T}\phi(x_i, y, h)}{\epsilon_h}\Big)}{\sum_{h} \exp\Big(\frac{w^{T}\phi(x_i, y, h)}{\epsilon_h}\Big)} \cdot \phi(x_i, y, h)\Bigg\} \\
&= \mathbb{E}_{p_{\epsilon_h}(h \mid x_i, y)}\big[\phi(x_i, y, h)\big]. \qquad (3)
\end{aligned}
\]
As a result, $\nabla_w U_i^{-}(w; \epsilon_h) = \mathbb{E}_{p_{\epsilon_h}(h \mid x_i, y_i)}[\phi(x_i, y_i, h)]$, and
\[
\nabla_w U_i^{+}(w; \epsilon_y, \epsilon_h)
= \epsilon_y \cdot \frac{\sum_{y} \exp\Big\{\frac{1}{\epsilon_y}\Big[\Delta(y_i, y) + \epsilon_h \log \sum_{h} \exp\Big(\frac{w^{T}\phi(x_i, y, h)}{\epsilon_h}\Big)\Big]\Big\} \cdot \nabla_w \Big\{\frac{1}{\epsilon_y}\Big[\Delta(y_i, y) + \epsilon_h \log \sum_{h} \exp\Big(\frac{w^{T}\phi(x_i, y, h)}{\epsilon_h}\Big)\Big]\Big\}}{\sum_{y} \exp\Big\{\frac{1}{\epsilon_y}\Big[\Delta(y_i, y) + \epsilon_h \log \sum_{h} \exp\Big(\frac{w^{T}\phi(x_i, y, h)}{\epsilon_h}\Big)\Big]\Big\}}.
\]
Substituting the gradient result (3),
\[
\begin{aligned}
\nabla_w U_i^{+}(w; \epsilon_y, \epsilon_h)
&= \frac{\sum_{y} \exp\Big\{\frac{1}{\epsilon_y}\Big[\Delta(y_i, y) + \epsilon_h \log \sum_{h} \exp\Big(\frac{w^{T}\phi(x_i, y, h)}{\epsilon_h}\Big)\Big]\Big\} \cdot \mathbb{E}_{p_{\epsilon_h}(h \mid x_i, y)}\big[\phi(x_i, y, h)\big]}{\sum_{y} \exp\Big\{\frac{1}{\epsilon_y}\Big[\Delta(y_i, y) + \epsilon_h \log \sum_{h} \exp\Big(\frac{w^{T}\phi(x_i, y, h)}{\epsilon_h}\Big)\Big]\Big\}} \\
&= \mathbb{E}_{p_{(\epsilon_y, \epsilon_h)}(y \mid x_i)}\Big[\mathbb{E}_{p_{\epsilon_h}(h \mid x_i, y)}\big[\phi(x_i, y, h)\big]\Big] \\
&= \mathbb{E}_{p_{(\epsilon_y, \epsilon_h)}(y, h \mid x_i)}\big[\phi(x_i, y, h)\big], \qquad (4)
\end{aligned}
\]
which completes the proof.
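To make the two expectations in Lemma 2 concrete, here is a minimal enumeration-based sketch of the (sub-)gradient; the softmax weights correspond to the temperature-controlled distributions defined above, and the function names are illustrative rather than taken from any released code.

import numpy as np
from scipy.special import logsumexp, softmax

def surrogate_gradient(w, phi, delta, x_i, y_i, Y, H, eps_y=1.0, eps_h=1.0):
    def hidden_expectation(y):
        # p_eps_h(h | x_i, y) is proportional to exp( w^T phi(x_i, y, h) / eps_h )
        feats = np.array([phi(x_i, y, h) for h in H])
        scores_h = feats @ w / eps_h
        p_h = softmax(scores_h)
        # returns E_{p_eps_h}[phi] and the annealed log-partition eps_h * logsumexp(.)
        return p_h @ feats, eps_h * logsumexp(scores_h)

    exps, logZ = zip(*[hidden_expectation(y) for y in Y])
    # p_(eps_y, eps_h)(y | x_i) is proportional to
    #   exp{ (1/eps_y) [ Delta(y, y_i) + eps_h log sum_h exp(.) ] }
    p_y = softmax((np.array([delta(y, y_i) for y in Y]) + np.array(logZ)) / eps_y)
    grad_plus = p_y @ np.array(exps)           # E_{p(y,h|x_i)}[phi(x_i, y, h)]
    grad_minus = hidden_expectation(y_i)[0]    # E_{p_eps_h(h|x_i,y_i)}[phi(x_i, y_i, h)]
    return grad_plus - grad_minus

A finite-difference check of surrogate_loss against surrogate_gradient on small random instances is a convenient sanity test for such an implementation.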
Likelihood vs. Prediction Accuracy
In our main paper, we demonstrate that our proposed MSSVM consistently outperforms HCRF on prediction accuracy.
However, it is worth noting that the HCRF model always achieves higher test likelihood than the MSSVM and LSSVM
on our simulated data set. As an example, Figure 1 shows the test log-likelihood across the different methods on these
data. This should not be surprising, since the HCRF model directly optimizes the likelihood objective, and (in this case)
the model class being optimized is correct (i.e., the data were drawn from a true model with the same structure). However,
higher likelihood does not necessarily imply that the HCRF will have better predictions on the target variables. As was
illustrated in the main paper (see details in Section 7.1, Training Sample Size), explicitly minimizing the empirical loss can
lead to better predictions in situations with high dimensional model parameters and relatively few training instances.
[Figure 1 appears here: log-likelihood on test data (y-axis, roughly −1400 to −500) versus number of iterations (x-axis, 0 to 300), with curves for MSSVMs, LSSVMs, and HCRFs.]
Figure 1. The test log-likelihood of MSSVM, LSSVM and HCRF using SGD when 20 training and 100 test instances are sampled from
a 40-node hidden chain MRF (the same setting as Table 2 in the main paper).
References
Joachims, T., Finley, T., and Yu, C.N.J. Cutting-plane training of structural SVMs. Machine Learning, 77:27–59, 2009.