Max-Margin Markov Networks
by Ben Taskar, Carlos Guestrin, and Daphne Koller
Presented by Michael Cafarella, CSE574, May 25, 2005

Introduction
- Kernel methods (SVMs) and max-margin training are terrific for classification, but have no way to model structure or relations
- Graphical models (Markov networks) can capture complex structure, but are not trained for discrimination
- Maximum Margin Markov (M3) Networks capture the advantages of both

Standard classification
- We want to learn a classification function:

    h_w(x) = \arg\max_y \sum_{j=1}^n w_j f_j(x, y) = \arg\max_y w^\top f(x, y)

- f(x, y) are the features (basis functions); w are the weights
- y is a multi-label classification; the set of possible assignments, Y, is exponential in the number of labels l
- So we can't compute the argmax by enumeration, and we can't even represent all the features explicitly

Probabilistic classification
- A graphical model defines P(Y | X); select the label \arg\max_y P(y | x)
- Exploit sparseness in the dependencies through model design (e.g., OCR characters are independent given their neighbors)
- We'll use a pairwise Markov network to model:

    P(y | x) \propto \prod_{(i,j) \in E} \psi_{ij}(x, y_i, y_j)

- Each potential function is log-linear in the basis functions:

    \psi_{ij}(x, y_i, y_j) = \exp\left[ \sum_{k=1}^n w_k f_k(x, y_i, y_j) \right] = \exp\left[ w^\top f(x, y_i, y_j) \right]

M3N
- For regular Markov networks, we train w to maximize the likelihood or the conditional likelihood
- For M3N, we'll train w to maximize the margin
- The main contribution of this paper is how to choose w accordingly

Choosing w
- With SVMs, we choose w to maximize the margin \gamma:

    \max \gamma \quad \text{s.t.} \quad \|w\| \le 1; \quad w^\top \Delta f_x(y) \ge \gamma, \;\; \forall x \in S, \; \forall y \ne t(x)

- where

    \Delta f_x(y) = f(x, t(x)) - f(x, y)

- The constraints ensure \arg\max_y w^\top f(x, y) = t(x)
- Maximizing the margin magnifies the difference between the value of the true label and that of the best runner-up

Multiple labels
- Structured problems have multiple labels, not a single classification
- We extend the "margin" to scale with the number of mistaken labels, so we now have:

    \max \gamma \quad \text{s.t.} \quad \|w\| \le 1; \quad w^\top \Delta f_x(y) \ge \gamma \, \Delta t_x(y), \;\; \forall x \in S, \; \forall y

- where

    \Delta t_x(y) = \sum_{i=1}^l \Delta t_x(y_i), \qquad \Delta t_x(y_i) = I(y_i \ne (t(x))_i)

Convert to an optimization problem
- We can remove the margin term to obtain a quadratic program:

    \min \; \tfrac{1}{2} \|w\|^2 \quad \text{s.t.} \quad w^\top \Delta f_x(y) \ge \Delta t_x(y), \;\; \forall x \in S, \; \forall y

- We have to add slack variables, because the data might not be separable
- We can now reformulate the whole M3N learning problem as the following optimization task…

Grand formulation
- The primal:

    \min \; \tfrac{1}{2} \|w\|^2 + C \sum_x \xi_x \quad \text{s.t.} \quad w^\top \Delta f_x(y) \ge \Delta t_x(y) - \xi_x, \;\; \forall x, y

- The dual:

    \max \; \sum_{x, y} \alpha_x(y) \Delta t_x(y) - \tfrac{1}{2} \Big\| \sum_{x, y} \alpha_x(y) \Delta f_x(y) \Big\|^2
    \quad \text{s.t.} \quad \sum_y \alpha_x(y) = C, \; \forall x; \quad \alpha_x(y) \ge 0, \; \forall x, y

- Note the extra dual variables; they have no effect on the solution

Unfortunately, not enough!
- The number of constraints in the primal, and of variables in the dual, is exponential in the number of labels l
- Let's interpret the dual variables as a density function over y, conditional on x
- The dual objective is a function of expectations, so we need only the node and edge marginals of the dual variables to compute it
- Define the marginal dual variables as:

    \mu_x(y_i, y_j) = \sum_{y' \sim [y_i, y_j]} \alpha_x(y'), \quad \forall (i,j) \in E, \; \forall y_i, y_j, \; \forall x

    \mu_x(y_i) = \sum_{y' \sim [y_i]} \alpha_x(y'), \quad \forall i, \; \forall y_i, \; \forall x

Now reformulate the QP

But first, a pause
- I can't copy any more formulae. I'm sorry. It's making me crazy. I just can't.
- Please refer to the paper, section 4!
- OK, now back to work…

Now reformulate the QP (2)
- The dual variables must arise from a legal density; that is, they must lie within the marginal polytope
- That means we must enforce consistency between the pairwise and singleton marginal variables
- See equation 9! See equation 10!
- If the network is not a forest, those constraints aren't enough
- We can triangulate the graph and add new variables and constraints
- Or we can approximate with a relaxation of the polytope, using belief propagation
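To make the factored dual concrete, here is a minimal numerical sketch for a single training example on a chain, where the local consistency constraints describe the marginal polytope exactly. It uses numpy and cvxpy (my choice of tooling, not the paper's); all names and the random toy data are hypothetical. The basis functions live only on the edges, matching the pairwise network above, and the per-node Hamming loss enters through the node marginals.

```python
import cvxpy as cp
import numpy as np

# Toy problem (hypothetical): one training example x, a chain of L nodes with
# k states each, and d-dimensional edge basis functions f(x, y_i, y_{i+1}).
rng = np.random.default_rng(0)
L, k, d, C = 4, 3, 5, 1.0
t = rng.integers(k, size=L)               # true labeling t(x)
F = rng.normal(size=(L - 1, k, k, d))     # F[e, a, b] = f(x, a, b) on edge e

# Delta f_x on each edge: f(x, t_i, t_j) - f(x, y_i, y_j)
dF = np.stack([F[e, t[e], t[e + 1]] - F[e] for e in range(L - 1)])
# Per-node loss Delta t_x(y_i) = I(y_i != t_i), shape (L, k)
dT = (np.arange(k)[None, :] != t[:, None]).astype(float)

# Marginal dual variables: node marginals mu_x(y_i), edge marginals mu_x(y_i, y_j)
mu_n = [cp.Variable(k, nonneg=True) for _ in range(L)]
mu_e = [cp.Variable((k, k), nonneg=True) for _ in range(L - 1)]

# The weight vector is the mu-weighted sum of Delta f over edges and assignments
w = cp.hstack([
    sum(cp.sum(cp.multiply(mu_e[e], dF[e, :, :, c])) for e in range(L - 1))
    for c in range(d)
])

# Factored dual objective: sum_i E_mu[Delta t_x(y_i)] - 1/2 ||w||^2
objective = cp.Maximize(sum(dT[i] @ mu_n[i] for i in range(L))
                        - 0.5 * cp.sum_squares(w))

constraints = [cp.sum(mu_n[i]) == C for i in range(L)]   # from sum_y alpha_x(y) = C
for e in range(L - 1):                                   # pairwise/singleton consistency
    constraints.append(cp.sum(mu_e[e], axis=1) == mu_n[e])       # sum out y_{e+1}
    constraints.append(cp.sum(mu_e[e], axis=0) == mu_n[e + 1])   # sum out y_e

cp.Problem(objective, constraints).solve()
print("learned weights:", w.value)
```

At the optimum, the primal weights are recovered through the KKT link w = \sum_{x,y} \alpha_x(y) \Delta f_x(y), which in marginal form is exactly the hstack expression built above.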
Experiment #1: Handwriting
- 6100 words, 8 characters long, from 150 subjects
- Each character is a 16x8 pixel image
- Y is the classified word; each Y_i is one of the 26 letters
- Logistic regression and CRFs are trained by maximizing the conditional likelihood of the labels given the features
- SVMs and M3N are trained by margin maximization
- (A decoding sketch for this kind of chain model appears at the end of the deck)

Experiment #2: Hypertext
- The usual collective classification task
- Four CS departments; each page is one of course, faculty, student, project, or other
- Each page has web and anchor text, represented as a binary feature vector
- Each page also has hyperlinks to the other examples
- The RMN is trained to maximize the conditional probability of the labels given the text and links
- The SVM and M3N are trained with max-margin

Conclusions
- M3Ns seem to work great for discriminative tasks
- It's nice to borrow theoretical results from SVMs
- Not much testing so far
- Future work should use more complicated models and problems
- Future presentations should be done in LaTeX, not PowerPoint
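As a closing companion to the handwriting slides: the opening slide notes that the argmax over exponentially many labelings can't be computed by enumeration, yet on a chain it is tractable by dynamic programming. Below is a minimal max-sum (Viterbi) decoding sketch in plain numpy; the function name and toy dimensions are mine, not the paper's, and any node potentials are assumed folded into the edge scores.

```python
import numpy as np

def viterbi_decode(theta):
    """Compute arg max_y sum_e theta[e, y_e, y_{e+1}] over a chain.

    theta has shape (L-1, k, k), with theta[e, a, b] = w^T f(x, a, b) on edge e.
    """
    E, k, _ = theta.shape
    score = np.zeros(k)                  # best score of any prefix ending in each state
    back = np.zeros((E, k), dtype=int)   # backpointers for reconstructing the argmax
    for e in range(E):
        cand = score[:, None] + theta[e]  # cand[a, b]: extend a prefix ending at a by b
        back[e] = cand.argmax(axis=0)
        score = cand.max(axis=0)
    y = [int(score.argmax())]
    for e in reversed(range(E)):          # walk the backpointers to recover the labels
        y.append(int(back[e, y[-1]]))
    return y[::-1]

# Example: decode an 8-letter word over the 26-letter alphabet (random scores)
theta = np.random.default_rng(1).normal(size=(7, 26, 26))
print(viterbi_decode(theta))
```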