An Introduction to Structured Output Learning Using Support Vector Machines
Yisong Yue
Cornell University
Some material used courtesy of Thorsten Joachims (Cornell University)
Supervised Learning
• Find a function from input space X to output space Y such that the
  prediction error is low. For example:
  – Text classification: x = “Microsoft announced today that they acquired
    Apple for the amount equal to the gross national product of Switzerland.
    Microsoft officials stated that they first wanted to buy Switzerland, but
    eventually were turned off by the mountains and the snowy winters…”,
    y = 1
  – Sequence classification: x = “GATACAACCTATCCCCGTATATATATTCTA
    TGGGTATAGTATTAAATCAATACAACCTATCCCCGTATATATATTCTATGGGTATAGTAT
    TAAATCAATACAACCTATCCCCGTATATATATTCTATGGGTATAGTATTAAATCAGATAC
    AACCTATCCCCGTATATATATTCTATGGGTATAGTATTAAATCACATTTA”, y = −1
  – Regression: a real-valued output, e.g., y = 7.3
Examples of Complex Output Spaces
• Natural Language Parsing
  – Given a sequence of words x, predict the parse tree y.
  – Dependencies come from structural constraints, since y has to be a tree.
  – Example: x = “The dog chased the cat”,
    y = (S (NP (Det The) (N dog)) (VP (V chased) (NP (Det the) (N cat))))
Examples of Complex Output Spaces
• Part-of-Speech Tagging
  – Given a sequence of words x, predict the sequence of tags y.
  – Dependencies come from tag–tag transitions in a Markov model.
  – Example: x = “The rain wet the cat”, y = (Det, N, V, Det, N)
  – Similarly for other sequence labeling problems, e.g., RNA Intron/Exon
    Tagging.
Examples of Complex Output Spaces
• Multi-class Labeling
• Protein Sequence Alignment
• Noun Phrase Co-reference Clustering
• Learning Parameters of Graphical Models
  – Markov Random Fields
• Multivariate Performance Measures
  – F1 Score
  – ROC Area
  – Average Precision
  – NDCG
Notation
• Bold x, y are structured input/output examples.
  – Each usually consists of a collection of elements:
    x = (x_1,…,x_n), y = (y_1,…,y_n)
  – Each input element x_i belongs to some high-dimensional feature space R^d.
  – Each output element y_i is usually a multiclass label or a real-valued
    number.
• Joint feature functions Ψ, Φ map input/output examples to points in R^D.
1st Order Sequence Labeling
• Given:
  – scoring function S(x, y_1, y_2)
  – input example x = (x_1,…,x_n)
• Find the sequence (y_1,…,y_n) that maximizes the total score:
  h(x) = argmax_{y_1,…,y_n} Σ_t S(x_t, y_t, y_{t−1})
• Solved with dynamic programming (Viterbi).
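As a concrete illustration, here is a minimal Viterbi sketch in Python. The scoring function S and the dummy start tag are assumptions made for illustration; the slides only name the algorithm.

    # Minimal Viterbi sketch; S(x_t, y_t, y_prev) and the dummy start tag are
    # illustrative assumptions.
    def viterbi(x, tags, S, start="<s>"):
        n = len(x)
        # best[t][y] = best score of any tag sequence ending in tag y at position t
        best = [{y: S(x[0], y, start) for y in tags}]
        back = [{}]
        for t in range(1, n):
            best.append({})
            back.append({})
            for y in tags:
                # pick the previous tag maximizing (score so far) + (local score)
                prev = max(tags, key=lambda yp: best[t - 1][yp] + S(x[t], y, yp))
                best[t][y] = best[t - 1][prev] + S(x[t], y, prev)
                back[t][y] = prev
        # backtrack from the best final tag
        y_t = max(tags, key=lambda y: best[n - 1][y])
        seq = [y_t]
        for t in range(n - 1, 0, -1):
            seq.append(back[t][seq[-1]])
        return seq[::-1]

Each position considers every (previous tag, current tag) pair, so the runtime is O(n·|tags|²) rather than exponential in n.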
Some Formulation Restrictions
• Assume S is parameterized linearly by some weight vector w in R^D:
  h(x; w) = argmax_{y_1,…,y_n} Σ_t S(x_t, y_t, y_{t−1}; w)
• This means that
  S(x, y_1, y_2; w) = w^T Φ(x, y_1, y_2)
  so the “hypothesis function” becomes
  h(x; w) = argmax_{y_1,…,y_n} Σ_t w^T Φ(x_t, y_t, y_{t−1})
The Linear Discriminant
• From the last slide:
  h(x; w) = argmax_{y_1,…,y_n} Σ_t w^T Φ(x_t, y_t, y_{t−1})
• Putting it together, define the joint feature map
  Ψ(y, x) = Σ_t Φ(x_t, y_t, y_{t−1})
• Our hypothesis function:
  h(x; w) = argmax_y w^T Ψ(y, x)
  where w^T Ψ(y, x) is the “linear discriminant function”.
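To make Ψ concrete, here is one common instantiation (an assumption for illustration; the slides leave Φ abstract): per-tag emission blocks stacked with tag-transition indicator counts.

    import numpy as np

    # One possible joint feature map Psi(y, x): emission blocks plus transition
    # counts. The layout is an illustrative choice, not prescribed by the slides.
    def joint_feature_map(y, x, tags, d):
        K = len(tags)
        idx = {tag: k for k, tag in enumerate(tags)}
        psi = np.zeros(K * d + K * K)            # D = K*d emission + K*K transition
        for t in range(len(x)):
            k = idx[y[t]]
            psi[k * d:(k + 1) * d] += x[t]       # emission block for tag y_t
            if t > 0:
                kp = idx[y[t - 1]]
                psi[K * d + kp * K + k] += 1.0   # transition count y_{t-1} -> y_t
        return psi

Here each x_t is a length-d NumPy vector, matching the notation slide.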
Structured Learning Problem
• Efficient Inference/Prediction – the hypothesis function solves for y when
  given x and w:
  h(x; w) = argmax_{y ∈ Y} w^T Ψ(y, x)
  – Viterbi in sequence labeling
  – CKY parser for parse trees
  – Belief propagation for Markov random fields
  – Sorting for ranking
• Efficient Learning/Training – need to efficiently learn parameters w from
  training data {x_i, y_i}_{i=1..N}
  – Solution: use the Structural SVM framework
  – Can also use Perceptrons, CRFs, MEMMs, M3Ns, etc.
How to Train?
• Given a set of structured training examples {x^(i), y^(i)}_{i=1..N}, we want
  to learn the discriminant w^T Ψ(y, x).
• Different training methods can be used:
  – Perceptrons perform an update whenever the current model mispredicts
    (see the sketch after this list).
  – CRFs plug the discriminant into a conditional log-likelihood function to
    optimize.
  – Structural SVMs solve a quadratic program that minimizes a tradeoff
    between model complexity and a convex upper bound on the performance loss.
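A minimal sketch of the perceptron variant, assuming an inference routine predict(x, w) (e.g., the Viterbi sketch above) and a joint feature map psi(y, x); both names are placeholders.

    import numpy as np

    # Structured perceptron sketch; predict and psi are assumed interfaces.
    def structured_perceptron(examples, psi, predict, D, epochs=10):
        w = np.zeros(D)
        for _ in range(epochs):
            for x, y in examples:
                y_hat = predict(x, w)             # current model's best labeling
                if y_hat != y:                    # update only on a misprediction
                    w += psi(y, x) - psi(y_hat, x)
        return w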
Support Vector Machines
• Input examples denoted by x (a high-dimensional point)
• Output targets denoted by y (either +1 or −1)
• SVMs learn a hyperplane w; predictions are sign(w^T x)
• Training involves finding the w that minimizes
  (1/2)‖w‖² + (C/N) Σ_{i=1..N} ξ_i
• subject to
  ∀i: y_i (w^T x_i) ≥ 1 − ξ_i
• The sum of slacks Σ_i ξ_i upper bounds the accuracy loss.
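One way to see this objective in code: a subgradient-descent sketch on the equivalent hinge-loss form ξ_i = max(0, 1 − y_i·w^T x_i). The optimizer and step size are illustrative choices; the slides do not prescribe a solver.

    import numpy as np

    # Subgradient descent on (1/2)||w||^2 + (C/N) * sum_i max(0, 1 - y_i * w.x_i).
    # Optimizer and step size are illustrative, not from the slides.
    def train_linear_svm(X, y, C=1.0, epochs=100, lr=0.01):
        N, d = X.shape
        w = np.zeros(d)
        for _ in range(epochs):
            viol = y * (X @ w) < 1                # examples with positive hinge loss
            grad = w - (C / N) * (y[viol, None] * X[viol]).sum(axis=0)
            w -= lr * grad
        return w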
Structural SVM Formulation
• Let x denote a structured input example (x_1,…,x_n)
• Let y denote a structured output target (y_1,…,y_n)
• Same objective function:
  (1/2)‖w‖² + (C/N) Σ_i ξ_i
• Constraints are defined for each incorrect labeling y′ over input x^(i):
  ∀ y′ ≠ y^(i): w^T Ψ(y^(i), x^(i)) ≥ w^T Ψ(y′, x^(i)) + Δ(y^(i), y′) − ξ_i
• The discriminant score of the correct labeling must be at least as large as
  that of any incorrect labeling plus the performance loss.
• Another interpretation: the margin between the correct label and an
  incorrect label must be at least as large as how ‘bad’ the incorrect label is.
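In code, checking one such constraint is a single comparison; psi (for Ψ) and delta (for Δ) are placeholder names for user-supplied functions.

    # Check one margin-rescaling constraint for a candidate labeling y'.
    def constraint_satisfied(w, psi, delta, x_i, y_i, y_prime, xi_i):
        margin = w @ psi(y_i, x_i) - w @ psi(y_prime, x_i)
        return margin >= delta(y_i, y_prime) - xi_i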
Adapting to Sequence Labeling
• Minimize
  (1/2)‖w‖² + (C/N) Σ_i ξ_i
  subject to
  ∀ y′ ≠ y^(i): w^T Ψ(y^(i), x^(i)) ≥ w^T Ψ(y′, x^(i)) + Δ(y^(i), y′) − ξ_i
  where
  Ψ(y′, x) = Σ_t Φ(x_t, y′_t, y′_{t−1})
  and the per-position (Hamming) loss is
  Δ(y, y′) = (1/n) Σ_t 1[y_t ≠ y′_t]
• Use the same slack variable ξ_i for all constraints of the same structured
  training example.
• The sum of slacks Σ_i ξ_i upper bounds the performance loss.
• Too many constraints!
Structural SVM Training
  ∀ y′ ≠ y^(i): w^T Ψ(y^(i), x^(i)) ≥ w^T Ψ(y′, x^(i)) + Δ(y^(i), y′) − ξ_i
• Suppose we only solve the SVM objective over a small subset of constraints
  (the working set).
• Some constraints from the global set might then be violated.
• When finding a violated constraint, only y′ is free; everything else is fixed:
  – the y’s and x’s are fixed from training
  – w and the slack variables are fixed from solving the SVM objective
• The degree of violation of a constraint is measured by:
  w^T Ψ(y′, x^(i)) − w^T Ψ(y^(i), x^(i)) + Δ(y^(i), y′) − ξ_i
Structural SVM Training
• STEP 1: Solve the SVM objective function using only the current working set
  of constraints.
• STEP 2: Using the model learned in STEP 1, find the most violated constraint
  from the global set of constraints.
• STEP 3: If the constraint returned in STEP 2 is violated by more than
  epsilon, add it to the working set.
• Repeat STEPs 1–3 until no additional constraints are added, and return the
  most recent model trained in STEP 1.
• STEPs 1–3 are guaranteed to loop for at most O(1/epsilon²) iterations
  [Tsochantaridis et al. 2005].
• This is known as a “cutting plane” method.
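A schematic of the cutting-plane loop, assuming a working-set QP solver solve_qp and a separation oracle most_violated; both are placeholder interfaces, not real library calls.

    # Schematic cutting-plane loop over a growing working set of constraints.
    def cutting_plane_train(examples, psi, delta, solve_qp, most_violated, eps=0.01):
        working_set = [[] for _ in examples]       # per-example constraint lists
        while True:
            w, xi = solve_qp(working_set)          # STEP 1: solve over working set
            added = False
            for i, (x, y) in enumerate(examples):
                y_bad = most_violated(x, y, w)     # STEP 2: separation oracle
                violation = (delta(y, y_bad) + w @ psi(y_bad, x)
                             - w @ psi(y, x) - xi[i])
                if violation > eps:                # STEP 3: grow the working set
                    working_set[i].append(y_bad)
                    added = True
            if not added:
                return w                           # nothing violated by > eps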
Illustrative Example
• Original SVM problem:
  – Exponentially many constraints
  – Most are dominated by a small set of “important” constraints
• Structural SVM approach:
  – Repeatedly finds the next most violated constraint…
  – …until the set of constraints is a good approximation.
Finding Most Violated Constraint
• Structural SVM is an oracle framework.
• It requires a subroutine to find the most violated constraint, which depends
  on the formulation of the loss function and the joint feature representation.
• There is an exponential number of constraints!
• We can usually expect an efficient solution whenever inference has an
  efficient algorithm.
Finding Most Violated Constraint
  ∀ y′ ≠ y^(i): w^T Ψ(y^(i), x^(i)) ≥ w^T Ψ(y′, x^(i)) + Δ(y^(i), y′) − ξ_i
• Finding the most violated constraint is equivalent to maximizing the RHS
  without the slack:
  H(y′; y, x, w) = w^T Ψ(y′, x) + Δ(y, y′)
• Requires solving:
  argmax_{y′} H(y′; y, x, w) = argmax_{y′} [w^T Ψ(y′, x) + Δ(y, y′)]
• Highly related to inference:
  h(x; w) = argmax_{y ∈ Y} w^T Ψ(y, x)
Sequence Labeling Revisited
• Finding the most violated constraint…
  argmax_{y′} H(y′; y, x, w) = argmax_{y′} [w^T Ψ(y′, x) + Δ(y, y′)]
    = argmax_{y′} Σ_t [w^T Φ(x_t, y′_t, y′_{t−1}) + (1/n) 1[y′_t ≠ y_t]]
    = argmax_{y′} Σ_t w′^T Φ′(x_t, y′_t, y′_{t−1})
  …can be solved using Viterbi!
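In code this is just the earlier Viterbi sketch run on a loss-augmented score; the wrapper below follows that sketch’s interface and is illustrative.

    # Loss-augmented inference: Viterbi with the per-position Hamming term
    # (1/n) * 1[y'_t != y_t] folded into the local score.
    def most_violated(x, y_true, tags, S, viterbi):
        n = len(x)
        xt = list(enumerate(x))                  # pair each input with its position
        def S_aug(xt_pair, y_t, y_prev):
            t, x_t = xt_pair
            loss = (1.0 / n) if y_t != y_true[t] else 0.0
            return S(x_t, y_t, y_prev) + loss    # loss-augmented local score
        return viterbi(xt, tags, S_aug)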
SVMStruct Abstracts Away “Structure”
• Minimize:
  (1/2)‖w‖² + (C/N) Σ_i ξ_i
• Subject to:
  ∀ y′ ≠ y^(i): w^T Ψ(y^(i), x^(i)) ≥ w^T Ψ(y′, x^(i)) + Δ(y^(i), y′) − ξ_i
• In the working set of constraints, the y’s and x’s are fixed, so each
  constraint takes the form
  w^T v_1 ≥ w^T v_2 + c
• Just like solving a conventional linear SVM!
• The notion of “structure” is almost completely confined to finding the most
  violated constraint. (This is just an interpretation.)
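Concretely, once an incorrect labeling y′ is fixed, its constraint collapses to a linear constraint on w; psi and delta are placeholder names as before.

    # Reduce a structural constraint to the linear form w @ v >= c - xi_i.
    def as_linear_constraint(psi, delta, x_i, y_i, y_bad):
        v = psi(y_i, x_i) - psi(y_bad, x_i)      # fixed difference vector
        c = delta(y_i, y_bad)                    # fixed margin target
        return v, c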