Max-Margin Markov Networks
by Ben Taskar, Carlos Guestrin, and Daphne Koller
Presented by Michael Cafarella
CSE574, May 25, 2005
Introduction

- Kernel methods (SVMs) and max-margin training are terrific for classification, but they give us no way to model structure and relations
- Graphical models (Markov networks) can capture complex structure, but they are not trained for discrimination
- Maximum Margin Markov (M3) Networks capture the advantages of both
Standard classification

- We want to learn a classification function:

  $$h_{\mathbf{w}}(x) = \arg\max_{y} \sum_{i=1}^{n} w_i f_i(x, y) = \arg\max_{y} \mathbf{w}^\top \mathbf{f}(x, y)$$

- f(x, y) are the features (basis functions); w are the weights
- y is a multi-label assignment; the set of possible assignments, Y, is exponential in the number of labels l (made concrete in the sketch below)
- So we can't compute the argmax; we can't even represent all the features
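To make the blow-up concrete, here is a minimal sketch, not from the paper: brute-force enumeration of the argmax. The feature map `toy_feats` and the problem sizes are illustrative assumptions; at OCR scale (26^8, roughly 2 x 10^11 labelings for an 8-letter word) the loop below would never finish.

```python
# A minimal sketch (assumptions: toy feature map and sizes, not the paper's
# setup): brute-force argmax over every joint labeling y.
import itertools
import numpy as np

def toy_feats(x, y):
    """Toy joint feature map f(x, y) -> a fixed-length vector."""
    v = np.zeros(3)
    for i, yi in enumerate(y):
        v[0] += x[i] * yi            # node feature: input/label compatibility
    for a, b in zip(y, y[1:]):
        v[1] += float(a == b)        # edge feature: adjacent labels agree
        v[2] += abs(a - b)           # edge feature: label distance
    return v

def brute_force_argmax(w, x, k, l):
    """argmax_y w . f(x, y) over all k**l assignments -- exponential in l."""
    best_y, best_s = None, -np.inf
    for y in itertools.product(range(k), repeat=l):
        s = w @ toy_feats(x, y)
        if s > best_s:
            best_y, best_s = y, s
    return best_y

w = np.array([1.0, 0.5, -0.2])
x = np.array([0.3, -1.0, 2.0, 0.1])
print(brute_force_argmax(w, x, k=5, l=4))   # 5**4 = 625 labelings: fine here;
                                            # 26**8 at OCR scale: hopeless
```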
Probabilistic classification

- A graphical model defines P(Y | X); we select the label $\arg\max_y P(y \mid x)$
- Exploit sparseness in the dependencies through model design (e.g., OCR characters are independent given their neighbors; a chain-structured sketch follows this slide)
- We'll use a pairwise Markov network to model:

  $$P(y \mid x) \propto \prod_{(i,j) \in E} \psi_{ij}(x, y_i, y_j)$$

- Each potential is log-linear in the basis functions:

  $$\psi_{ij}(x, y_i, y_j) = \exp\Big[\sum_{k=1}^{n} w_k f_k(x, y_i, y_j)\Big] = \exp\big[\mathbf{w}^\top \mathbf{f}(x, y_i, y_j)\big]$$
M3N

- For regular Markov networks, we train w to maximize the likelihood or conditional likelihood
- For M3N, we'll train w to maximize the margin
- The main contribution of this paper is how to choose w accordingly
Choosing w

- With SVMs, we choose w to maximize the margin $\gamma$:

  $$\max \gamma \quad \text{s.t.} \quad \|\mathbf{w}\| \le 1; \quad \mathbf{w}^\top \Delta\mathbf{f}_x(y) \ge \gamma, \;\; \forall x \in S, \; \forall y \ne t(x)$$

  where $\Delta\mathbf{f}_x(y) = \mathbf{f}(x, t(x)) - \mathbf{f}(x, y)$

- The constraints ensure that $\arg\max_y \mathbf{w}^\top \mathbf{f}(x, y) = t(x)$
- Maximizing the margin magnifies the difference between the value of the true label and that of the best runner-up
Multiple labels

- Structured problems have multiple labels, not a single classification
- We extend the "margin" to scale with the number of mistaken labels (a one-line sketch follows this slide), so we now have:

  $$\max \gamma \quad \text{s.t.} \quad \|\mathbf{w}\| \le 1; \quad \mathbf{w}^\top \Delta\mathbf{f}_x(y) \ge \gamma\, \Delta t_x(y), \;\; \forall x \in S, \; \forall y$$

  where

  $$\Delta t_x(y) = \sum_{i=1}^{l} \Delta t_x(y_i), \qquad \Delta t_x(y_i) = I\big(y_i \ne (t(x))_i\big)$$
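Concretely, $\Delta t_x(y)$ is just the Hamming distance between a candidate labeling and the truth; a minimal sketch (the toy strings are illustrative, not the paper's data):

```python
# Delta t_x(y): the number of mistaken labels (Hamming distance to the truth).
def delta_t(y, t):
    return sum(int(yi != ti) for yi, ti in zip(y, t))

# Two wrong letters => the required margin scales to 2 * gamma.
print(delta_t("brace", "bravo"))  # -> 2
```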
Convert to an optimization problem

- We can remove the margin term to obtain a quadratic program:

  $$\min \; \tfrac{1}{2}\|\mathbf{w}\|^2 \quad \text{s.t.} \quad \mathbf{w}^\top \Delta\mathbf{f}_x(y) \ge \Delta t_x(y), \;\; \forall x \in S, \; \forall y$$

- We have to add slack variables, because the data might not be separable
- We can now reformulate the whole M3N learning problem as the following optimization task...
Grand formulation

- The primal:

  $$\min \; \tfrac{1}{2}\|\mathbf{w}\|^2 + C \sum_{x} \xi_x \quad \text{s.t.} \quad \mathbf{w}^\top \Delta\mathbf{f}_x(y) \ge \Delta t_x(y) - \xi_x, \;\; \forall x, y$$

- The dual:

  $$\max \; \sum_{x,y} \alpha_x(y)\, \Delta t_x(y) - \tfrac{1}{2}\Big\|\sum_{x,y} \alpha_x(y)\, \Delta\mathbf{f}_x(y)\Big\|^2 \quad \text{s.t.} \quad \sum_{y} \alpha_x(y) = C, \;\forall x; \quad \alpha_x(y) \ge 0, \;\forall x, y$$

- Note the extra dual variables; they have no effect on the solution (a toy solve of the primal follows this slide)
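For intuition only, here is a minimal sketch of the primal on synthetic data, solved with cvxpy. This is an assumption for illustration, not the paper's solver, and the $\Delta\mathbf{f}_x(y)$, $\Delta t_x(y)$ values are random stand-ins. Every constraint is enumerated explicitly, which is exactly what fails to scale:

```python
# Toy sketch of the M3N primal QP via cvxpy (an assumption -- not the paper's
# method). delta_f and delta_t are random stand-ins for the real quantities.
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
n_ex, n_alt, d, C = 5, 7, 4, 1.0
delta_f = rng.normal(size=(n_ex, n_alt, d)) + 0.5   # Delta f_x(y), per wrong y
delta_t = rng.integers(1, 4, size=(n_ex, n_alt))    # Delta t_x(y): # of mistakes

w = cp.Variable(d)
xi = cp.Variable(n_ex, nonneg=True)                 # one slack per example
# One margin constraint per (example, alternative labeling) pair:
constraints = [delta_f[x] @ w + xi[x] >= delta_t[x] for x in range(n_ex)]
prob = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w) + C * cp.sum(xi)),
                  constraints)
prob.solve()
print(w.value, xi.value)
```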
Unfortunately, not enough!

- The constraints in the primal, and the number of variables in the dual, are exponential in the number of labels l
- Let's interpret the dual variables as a density function over y, conditional on x
- The dual objective is a function of expectations; we need just the node and edge marginals of the dual variables to compute them
- Define the marginal dual variables (sketched in code after this slide) as:

  $$\mu_x(y_i, y_j) = \sum_{y \sim [y_i, y_j]} \alpha_x(y), \quad \forall (i,j) \in E, \; \forall y_i, y_j, \; \forall x$$

  $$\mu_x(y_i) = \sum_{y \sim [y_i]} \alpha_x(y), \quad \forall i, \; \forall y_i, \; \forall x$$
Now reformulate the QP

- But first, a pause:
- I can't copy any more formulae. I'm sorry. It's making me crazy. I just can't.
- Please refer to the paper, Section 4!
- OK, now back to work...
Now reformulate the QP (2)

- The dual variables must arise from a legal density; that is, they must lie in the marginal polytope
- That means we must enforce consistency between the pairwise and singleton marginal variables (see Equations 9 and 10 in the paper; a numeric check follows this slide)
- If the network is not a forest, those constraints aren't enough
  - We can triangulate and add new variables and constraints
  - Or, we can approximate a relaxation of the polytope using belief propagation
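Continuing the marginals sketch from two slides back, the forest-case consistency constraints just say that each edge marginal must sum back to its endpoint node marginals:

```python
# Continuing the sketch above: forest-case consistency between the pairwise
# and singleton marginals -- each edge marginal sums back to its endpoints.
for (i, j) in edges:
    assert np.allclose(edge[(i, j)].sum(axis=1), node[i])   # sum over y_j
    assert np.allclose(edge[(i, j)].sum(axis=0), node[j])   # sum over y_i
```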
Experiment #1: Handwriting

- 6100 words, 8 characters long, from 150 subjects
- Each character is a 16x8 pixel image
- Y is the classified word; each Y_i is one of the 26 letters
- Logistic regression and CRFs are trained by maximizing the conditional likelihood of the labels given the features
- SVMs and M3Ns are trained by margin maximization
Experiment #2: Hypertext

The usual collective classification task

Four CS departments. Each page is one of
course, faculty, student, project,
other




Each page has web & anchor text,
represented as binary feature vector
Also has hyperlinks to other examples
RMN trained to max CP of labels, given
text & links
SVM and M3N trained w/max-margin
Conclusions

- M3Ns seem to work great for discriminative tasks
- It's nice to borrow theoretical results from SVMs
- Not much testing so far
- Future work should use more complicated models and problems
- Future presentations should be done in LaTeX, not PowerPoint