Discriminative, Unsupervised, Convex Learning
Dale Schuurmans
Department of Computing Science, University of Alberta
MITACS Workshop, August 26, 2005

Current Research Group
- PhD Tao Wang: reinforcement learning
- PhD Ali Ghodsi: dimensionality reduction
- PhD Dana Wilkinson: action-based embedding
- PhD Yuhong Guo: ensemble learning
- PhD Feng Jiao: bioinformatics
- PhD Jiayuan Huang: transduction on graphs
- PhD Qin Wang: statistical natural language
- PhD Adam Milstein: robotics, particle filtering
- PhD Dan Lizotte: optimization, everything
- PhD Linli Xu: unsupervised SVMs
- PDF Li Cheng: computer vision

Current Research Group
- PhD Tao Wang: reinforcement learning
- PhD Dana Wilkinson: action-based embedding
- PhD Feng Jiao: bioinformatics
- PhD Qin Wang: statistical natural language
- PhD Dan Lizotte: optimization, everything
- PDF Li Cheng: computer vision

Today I will talk about: One Current Research Direction
Learning Sequence Classifiers (HMMs)
- Discriminative
- Unsupervised
- Convex
EM?

Outline
- Unsupervised SVMs
- Discriminative, unsupervised, convex HMMs
- Tao, Dana, Feng, Qin, Dan, Li

Unsupervised Support Vector Machines
Joint work with Linli Xu

Main Idea
- Unsupervised SVMs (and semi-supervised SVMs)
- Harder computational problem than SVMs
- Convex relaxation: semidefinite program (polynomial time)

Background: Two-class SVM
- Supervised classification learning
- Labeled data → linear discriminant $w \cdot x + b = 0$
- Classification rule: $y = \operatorname{sgn}(w \cdot x + b)$
- Some better than others?

Maximum Margin Linear Discriminant
Choose a linear discriminant $w \cdot x + b = 0$ to maximize
$$\min_{(x_i, y_i)} \operatorname{dist}\big((x_i, y_i),\ \text{Plane } w \cdot x + b = 0\big)$$

Unsupervised Learning
- Given unlabeled data, how to infer classifications?
- Organize objects into groups: clustering

Idea: Maximum Margin Clustering
- Given unlabeled data, find maximum margin separating hyperplane
- Clusters the data
- Constraint: class balance (bound the difference in sizes between classes)

Challenge
- Find label assignment that results in a large margin
- Hard
- Convex relaxation, based on semidefinite programming

How to Derive Unsupervised SVM?
Two-class case:
1. Start with Supervised Algorithm
Given vector of assignments, $y$, solve for the inverse squared margin
$$\gamma^{*-2} = \max_{\lambda}\ \lambda^\top e - \tfrac{1}{2} \langle K \circ \lambda\lambda^\top,\ yy^\top \rangle \quad \text{subject to } 0 \le \lambda \le 1$$

How to Derive Unsupervised SVM?
2. Think of $\gamma^{*-2}$ as a function of $y$
If given $y$, would then solve
$$\gamma^{*-2}(y) = \max_{\lambda}\ \lambda^\top e - \tfrac{1}{2} \langle K \circ \lambda\lambda^\top,\ yy^\top \rangle \quad \text{subject to } 0 \le \lambda \le 1$$
- Goal: choose $y$ to minimize the inverse squared margin
- Problem: not a convex function of $y$

How to Derive Unsupervised SVM?
3. Re-express problem with indicators comparing $y$ labels
New variables: $M = yy^\top$, an equivalence relation matrix
$$M_{ij} = \begin{cases} 1 & \text{if } y_i = y_j \\ -1 & \text{if } y_i \ne y_j \end{cases}$$
If given $M$, would then solve
$$\gamma^{*-2}(M) = \max_{\lambda}\ \lambda^\top e - \tfrac{1}{2} \langle K \circ \lambda\lambda^\top,\ M \rangle \quad \text{subject to } 0 \le \lambda \le 1$$
- Note: convex function of $M$ (a maximum of linear functions is convex)

How to Derive Unsupervised SVM?
4. Get constrained optimization problem
Solve for $M$:
$$\min_{M}\ \gamma^{*-2}(M) \quad \text{subject to } M \in \{-1,+1\}^{n \times n},\ M = yy^\top,\ -\epsilon e \le Me \le \epsilon e$$
- Not convex: $M \in \{-1,+1\}^{n \times n}$ and $M = yy^\top$
- Class balance: $-\epsilon e \le Me \le \epsilon e$
- $M \in \{-1,+1\}^{n \times n}$ encodes an equivalence relation iff $M \succeq 0$, $\operatorname{diag}(M) = e$

How to Derive Unsupervised SVM?
4.
Get constrained optimization problem
Solve for $M$, where $\gamma^{*-2}(M)$ is the inverse squared margin (the inner maximum over $0 \le \lambda \le 1$):
$$\min_{M}\ \gamma^{*-2}(M) \quad \text{subject to } M \in \{-1,+1\}^{n \times n},\ M \succeq 0,\ \operatorname{diag}(M) = e,\ -\epsilon e \le Me \le \epsilon e$$
- $M \in \{-1,+1\}^{n \times n}$ encodes an equivalence relation iff $M \succeq 0$, $\operatorname{diag}(M) = e$

How to Derive Unsupervised SVM?
5. Relax indicator variables to obtain a convex optimization problem
Solve for $M$, with the integrality constraint $M \in \{-1,+1\}^{n \times n}$ dropped:
$$\min_{M}\ \gamma^{*-2}(M) \quad \text{subject to } M \succeq 0,\ \operatorname{diag}(M) = e,\ -\epsilon e \le Me \le \epsilon e$$
- Semidefinite program

Multi-class Unsupervised SVM?
1. Start with Supervised Algorithm (Crammer & Singer 01)
Given vector of assignments, $y$, solve for the margin loss
$$\omega = \max_{\Lambda}\ n - \frac{1}{2\beta} \sum_{i,j} K_{ij}\, (1_{y_i} - \lambda_i)^\top (1_{y_j} - \lambda_j) - \sum_{i,r} \lambda_{ir}\, 1(y_i = r) \quad \text{subject to } \lambda_i \ge 0,\ \lambda_i^\top e = 1\ \ \forall i$$
Here $\lambda_i$ is row $i$ of $\Lambda$ and $1_{y_i}$ is the indicator vector of class $y_i$.

Multi-class Unsupervised SVM?
2. Think of $\omega$ as a function of $y$
If given $y$, would then solve the same problem for $\omega(y)$.
- Goal: choose $y$ to minimize the margin loss
- Problem: not a convex function of $y$

Multi-class Unsupervised SVM?
3. Re-express problem with indicators comparing $y$ labels
New variables $M$ and $D$:
$$M_{ij} = 1(y_i = y_j), \qquad D_{ir} = 1(y_i = r), \qquad M = DD^\top$$
If given $M$ and $D$, would then solve
$$\omega(M, D) = \max_{\Lambda}\ Q(\Lambda, M, D) \quad \text{subject to } \Lambda \ge 0,\ \Lambda e = e$$
where
$$Q(\Lambda, M, D) = n - \langle D, \Lambda \rangle + \frac{1}{\beta} \langle KD, \Lambda \rangle - \frac{1}{2\beta} \langle K, M \rangle - \frac{1}{2\beta} \langle \Lambda\Lambda^\top, K \rangle$$
- $\omega(M, D)$ is a convex function of $M$ and $D$

Multi-class Unsupervised SVM?
4.
Get constrained optimization problem
Solve for $M$ and $D$, where $\omega(M, D)$ is the margin loss:
$$\min_{M, D}\ \omega(M, D) \quad \text{subject to } M = DD^\top,\ \operatorname{diag}(M) = e,\ M \in \{0,1\}^{n \times n},\ D \in \{0,1\}^{n \times k},\ \left(\tfrac{1}{k} - \epsilon\right) n e \le Me \le \left(\tfrac{1}{k} + \epsilon\right) n e$$
- Class balance: $\left(\tfrac{1}{k} - \epsilon\right) n e \le Me \le \left(\tfrac{1}{k} + \epsilon\right) n e$

Multi-class Unsupervised SVM?
5. Relax indicator variables to obtain a convex optimization problem
Solve for $M$ and $D$, replacing $M = DD^\top$ by $M \succeq DD^\top$ and the $\{0,1\}$ constraints by box constraints:
$$\min_{M, D}\ \omega(M, D) \quad \text{subject to } M \succeq DD^\top,\ \operatorname{diag}(M) = e,\ 0 \le M \le 1,\ 0 \le D \le 1,\ \left(\tfrac{1}{k} - \epsilon\right) n e \le Me \le \left(\tfrac{1}{k} + \epsilon\right) n e$$
- Semidefinite program

Experimental Results
[Figure: clustering results compared across three methods: Kmeans, Spectral Clustering, SemiDef]

Experimental Results
[Figure]

Experimental Results
Percentage of misclassification errors: Digit dataset

Extension to Semi-Supervised Algorithm
- Matrix $M$: rows and columns $1, \dots, t$ are labeled, the rest are unlabeled
- Labeled block is clamped: $M_{ij} = y_i y_j$ for $i, j \le t$
- For each unlabeled $i \in \{t+1, \dots, n\}$: a constraint on $\sum_{j=1}^{t} M_{ij}$ involving $2t$ [relation not recoverable]

Experimental Results
Percentage of misclassification errors: Face dataset

Experimental Results
[Figure]

Discriminative, Unsupervised, Convex HMMs
Joint work with Linli Xu
With help from Li Cheng and Tao Wang

Hidden Markov Model
- Must coordinate local classifiers $y_i = f(x_i)$
- $y_1, y_2, y_3$: "hidden" states
- $x_1, x_2, x_3$: observations
- Joint probability model $P(x, y)$
- Viterbi classifier: $\arg\max_{y} P(y \mid x)$

HMM Training: Supervised
- Given $x_1 y_1, x_2 y_2, \dots, x_n y_n$
- Maximum likelihood: $\max \prod_{i=1}^{n} P(x_i, y_i) = \max \prod_{i=1}^{n} P(y_i \mid x_i)\, P(x_i)$, which also models the input distribution
- Conditional likelihood: $\max \prod_{i=1}^{n} P(y_i \mid x_i)$, which is discriminative (CRFs)

HMM Training: Unsupervised
- Given only $x_1, x_2, \dots, x_n$
- Now what? EM!
- Marginal likelihood: $\max \prod_{i=1}^{n} P(x_i)$
- Exactly the part we don't care about

HMM Training: Unsupervised
- Given only $x_1, x_2, \dots,$
$x_n$

The problem with EM:
- Not convex
- Wrong objective
- Too popular
- Doesn't work

HMM Training: Unsupervised
- Given only $x_1, x_2, \dots, x_n$
The dream:
- Convex training
- Discriminative training of $P(y \mid x)$
When will someone invent unsupervised CRFs?

HMM Training: Unsupervised
- Given only $x_1, x_2, \dots, x_n$
The question:
- How to learn $P(y \mid x)$ effectively without seeing any $y$'s?
The answer:
- That's what we already did!
- Unsupervised SVMs

HMM Training: Unsupervised
- Given only $x_1, x_2, \dots, x_n$
The plan:

                supervised      unsupervised
  single y      SVM         →   unsup SVM
  sequence y    M3N         →   ?

M3N: Max Margin Markov Nets
- Relational SVMs
- [Figure: chain of labels $y_1, y_2, y_3$ over observations $x_1, x_2, x_3$, with pairwise function $f_x(y_1, y_2)$]
- Supervised training:
  - Given $x_1 y_1, x_2 y_2, \dots, x_n y_n$
  - Solve factored QP

Unsupervised M3Ns
Strategy:
- Start with supervised M3N QP
- $y$-labels → re-express in local $M$, $D$ equivalence relations
- Impose class balance
- Relax non-convex constraints
- Then solve a really big SDP
- But still polynomial size

Unsupervised M3Ns
- SDP

Some Initial Results
- Synthetic HMM
- Protein secondary structure prediction

Current Research Group
- PhD Tao Wang: reinforcement learning
- PhD Dana Wilkinson: action-based embedding
- PhD Feng Jiao: bioinformatics
- PhD Qin Wang: statistical natural language
- PhD Dan Lizotte: optimization, everything
- PDF Li Cheng: computer vision

Brief Research Background
- Sequential PAC Learning
- Linear Classifiers: Boosting, SVMs
- Metric-Based Model Selection
- Greedy Importance Sampling
- Adversarial Optimization & Search
- Large Markov Decision Processes
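
To make steps 1 and 2 of the two-class derivation ("How to Derive Unsupervised SVM?") concrete, here is a minimal sketch of the unrelaxed problem: evaluate the inverse squared margin for each class-balanced labeling and keep the smallest. Everything in it is an illustrative assumption rather than the method used in the talk: the function names are hypothetical, the kernel is linear, there is no bias term (matching the dual as written, which has no equality constraint), and the inner maximization uses plain projected gradient ascent instead of a QP solver. Enumerating labelings is exponential in n, which is exactly the cost the semidefinite relaxation avoids.

```python
from itertools import product

def linear_kernel(points):
    """Gram matrix K[i][j] = <x_i, x_j> for a list of coordinate tuples."""
    return [[sum(a * b for a, b in zip(p, q)) for q in points] for p in points]

def inv_sq_margin(K, y, iters=2000):
    """Approximate max over 0 <= lam <= 1 of lam'e - 1/2 <K o lam lam', y y'>
    by projected gradient ascent (a stand-in for a real QP solver)."""
    n = len(y)
    # A = K o (y y'), so the objective is sum(lam) - 1/2 lam' A lam.
    A = [[K[i][j] * y[i] * y[j] for j in range(n)] for i in range(n)]
    # A is positive semidefinite, so trace(A) bounds its largest eigenvalue.
    step = 1.0 / max(sum(A[i][i] for i in range(n)), 1e-12)
    lam = [0.0] * n
    for _ in range(iters):
        grad = [1.0 - sum(A[i][j] * lam[j] for j in range(n)) for i in range(n)]
        # Gradient step followed by projection onto the box [0, 1]^n.
        lam = [min(1.0, max(0.0, lam[i] + step * grad[i])) for i in range(n)]
    quad = sum(A[i][j] * lam[i] * lam[j] for i in range(n) for j in range(n))
    return sum(lam) - 0.5 * quad

def max_margin_clustering_bruteforce(points, eps=0):
    """Return (value, y) minimizing the inverse squared margin over
    labelings y in {-1,+1}^n with |sum(y)| <= eps (class balance)."""
    K = linear_kernel(points)
    best = None
    for tail in product((1, -1), repeat=len(points) - 1):
        y = (1,) + tail  # fix y[0] = +1: y and -y describe the same clustering
        if abs(sum(y)) > eps:
            continue
        value = inv_sq_margin(K, y)
        if best is None or value < best[0]:
            best = (value, y)
    return best

# Two well-separated groups on either side of the origin.
points = [(-2.0, 0.0), (-2.2, 0.1), (2.0, 0.0), (2.1, -0.1)]
value, labels = max_margin_clustering_bruteforce(points)
print(labels, value)
```

On this toy input the selected labeling groups the first two points against the last two. The balance test `abs(sum(y)) <= eps` is the labeling form of $-\epsilon e \le Me \le \epsilon e$ when $M = yy^\top$, since $Me = y\,(e^\top y)$.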
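
The two-class slides end at "Semidefinite program" without showing how the inner maximization over $\lambda$ becomes a linear matrix inequality. The following is a sketch of one standard route, under assumptions that are not stated in the slides: $K \circ M$ is taken to be positive definite so it can be inverted, and the multipliers $\mu$, $\nu$ and the bound $\delta$ are symbols introduced here for illustration. It uses Lagrangian duality for the box constraints and then a Schur complement.

```latex
\begin{align*}
\gamma^{*-2}(M)
  &= \max_{0 \le \lambda \le 1}\ \lambda^\top e
     - \tfrac{1}{2}\, \lambda^\top (K \circ M)\, \lambda \\
  &= \min_{\mu \ge 0,\ \nu \ge 0}\ \max_{\lambda}\
     (e + \mu - \nu)^\top \lambda
     - \tfrac{1}{2}\, \lambda^\top (K \circ M)\, \lambda + \nu^\top e \\
  &= \min_{\mu \ge 0,\ \nu \ge 0}\
     \tfrac{1}{2}\, (e + \mu - \nu)^\top (K \circ M)^{-1} (e + \mu - \nu)
     + \nu^\top e .
\end{align*}
% Hence gamma^{*-2}(M) <= delta iff, for some mu, nu >= 0 (Schur complement):
\[
\begin{bmatrix}
  K \circ M & e + \mu - \nu \\
  (e + \mu - \nu)^\top & 2\,(\delta - \nu^\top e)
\end{bmatrix} \succeq 0 .
\]
% The relaxed problem of step 5 can then be written as
\[
\min_{M,\ \delta,\ \mu,\ \nu}\ \delta
\quad \text{subject to the matrix inequality above},\quad
\mu \ge 0,\ \nu \ge 0,\ M \succeq 0,\ \operatorname{diag}(M) = e,\
-\epsilon e \le M e \le \epsilon e .
\]
```

The first line uses $\langle K \circ \lambda\lambda^\top, M \rangle = \lambda^\top (K \circ M)\, \lambda$. Because $K \circ M$ is linear in $M$, the block matrix is linear in $(M, \delta, \mu, \nu)$, which is what makes the whole problem a semidefinite program.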