Max-margin Classification of Data with Absent Features
Gal Chechik, Geremy Heitz, Gal Elidan, Pieter Abbeel, and Daphne Koller
Journal of Machine Learning Research, 2008

Outline
• Introduction
• Background knowledge
• Problem description
• Algorithms
• Experiments
• Conclusions

Introduction
• In traditional supervised learning, data instances are viewed as feature vectors in a high-dimensional space.
• Why are features missing?
  – noise
  – undefined parts of objects
  – structurally absent features
  – etc.
• How can classification be handled when features are missing? The standard approaches fill in (impute) the missing values, for example with expectation maximization (EM) or Markov chain Monte Carlo (MCMC).
• However, features are sometimes non-existent, rather than existing with an unknown value.
• Goal: classify without filling in the missing values.

Background
• Support Vector Machines (SVMs)
• Second-Order Cone Programming (SOCP)

Support Vector Machines
• Support Vector Machines (SVMs): a supervised learning method used for classification and regression.
• They simultaneously minimize the empirical classification error and maximize the geometric margin, and are therefore also called maximum-margin classifiers.
• Given a set of $n$ labeled samples $x_1, \dots, x_n$ in a feature space $\mathcal{F}$ of dimension $d$, each sample $x_i$ has a binary class label $y_i \in \{-1, +1\}$.
• We want the maximum-margin hyperplane that divides the samples with $y_i = 1$ from those with $y_i = -1$.
• Any hyperplane can be written as the set of points $x$ satisfying $w \cdot x + b = 0$, where $w$ is a normal vector and $b$ determines the offset of the hyperplane from the origin along $w$.
• The hyperplane separates the samples into two classes:
  $w \cdot x_i + b \geq 1$ for samples of the first class and
  $w \cdot x_i + b \leq -1$ for samples of the second class,
  or compactly $y_i (w \cdot x_i + b) \geq 1$, $i = 1, \dots, n$.
• Geometric margin: we define the margin as $\rho = \min_i \frac{y_i (w \cdot x_i + b)}{\|w\|}$ and learn a classifier $w$ by maximizing $\rho$.
• This becomes the optimization problem
  $\min_{w,b} \|w\| \quad \text{s.t.} \quad y_i (w \cdot x_i + b) \geq 1, \; i = 1, \dots, n$,
  or equivalently the quadratic programming (QP) problem
  $\min_{w,b} \tfrac{1}{2}\|w\|^2 \quad \text{s.t.} \quad y_i (w \cdot x_i + b) \geq 1, \; i = 1, \dots, n$.
[Figure: hyperplane H3 does not separate the two classes; H1 does with a small margin, and H2 achieves the maximum margin.]
• Soft-margin SVMs: when the training samples are not linearly separable, we introduce slack variables $\xi_i$:
  $\min_{w,b,\xi} \tfrac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i \quad \text{s.t.} \quad y_i (w \cdot x_i + b) \geq 1 - \xi_i, \; i = 1, \dots, n$,
  where $C$ controls the trade-off between training accuracy and model complexity.
• The dual problem of SVMs: we form the Lagrangian of the primal problem,
  $L(w, b, \alpha) = \tfrac{1}{2}\|w\|^2 - \sum_{i=1}^{n} \alpha_i \left[ y_i (w \cdot x_i + b) - 1 \right]$.
• Setting the first derivatives of $L$ to zero gives
  $\frac{\partial L}{\partial w} = w - \sum_{i=1}^{n} \alpha_i y_i x_i = 0$, hence $w = \sum_{i=1}^{n} \alpha_i y_i x_i$, and
  $\frac{\partial L}{\partial b} = -\sum_{i=1}^{n} \alpha_i y_i = 0$.
• Substituting back into the Lagrangian yields the dual problem:
  $\max_{\alpha} \; \sum_{i=1}^{n} \alpha_i - \tfrac{1}{2} \sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle
  \quad \text{s.t.} \quad \alpha_i \geq 0, \; i = 1, \dots, n, \quad \sum_{i=1}^{n} \alpha_i y_i = 0$.

Second-Order Cone Programming
• Second-order cone programming (SOCP): a convex optimization problem of the form
  $\min_{x} \; f^{T} x \quad \text{s.t.} \quad \|A_i x + b_i\| \leq c_i^{T} x + d_i, \; i = 1, \dots, m$,
  where $x \in \mathbb{R}^d$ is the optimization variable.
• SOCPs can be solved with off-the-shelf solvers such as MOSEK.
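To make the background concrete, here is a minimal sketch (not code from the paper) of the soft-margin SVM QP above, written with the cvxpy modeling library; the function name, variable names, and the value of C are illustrative, and cvxpy can dispatch the problem to a solver such as MOSEK if one is installed.

```python
import numpy as np
import cvxpy as cp

def soft_margin_svm(X, y, C=1.0):
    """Solve the soft-margin SVM primal QP:
         min_{w,b,xi}  0.5*||w||^2 + C*sum(xi)
         s.t.          y_i (w . x_i + b) >= 1 - xi_i,  xi_i >= 0.
    X: (n, d) feature matrix, y: (n,) labels in {-1, +1}."""
    n, d = X.shape
    w = cp.Variable(d)
    b = cp.Variable()
    xi = cp.Variable(n, nonneg=True)

    objective = cp.Minimize(0.5 * cp.sum_squares(w) + C * cp.sum(xi))
    constraints = [cp.multiply(y, X @ w + b) >= 1 - xi]
    cp.Problem(objective, constraints).solve()  # solver=cp.MOSEK also works if installed
    return w.value, b.value

if __name__ == "__main__":
    # Tiny separable toy set, just to show the call
    X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
    y = np.array([1.0, 1.0, -1.0, -1.0])
    w, b = soft_margin_svm(X, y)
    print("w =", w, "b =", b)
```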
Problem Description
• Given a set of $n$ labeled samples $x_1, \dots, x_n$ in a feature space $\mathcal{F}$ of dimension $d$, each sample $x_i$ has a binary class label $y_i \in \{-1, +1\}$.
• Let $F_i$ denote the set of features that are valid for the $i$-th sample. Each sample $x_i$ can be viewed as embedded in its relevant subspace $\mathbb{R}^{F_i} \subseteq \mathbb{R}^d$.
• Traditional SVMs maximize a single geometric margin $\rho$ over all instances.
• With missing features, that margin, which measures the distance to one hyperplane in the full feature space, is no longer well defined, so traditional SVMs cannot be used directly.
• Instead, we treat the margin of each instance in its own relevant subspace.
• We define the instance margin $\rho_i(w)$ of the $i$-th instance as
  $\rho_i(w) = \frac{y_i \, w^{(i)} \cdot x_i}{\|w^{(i)}\|}$,
  where $w^{(i)}$ is the vector obtained by taking the entries of $w$ that are relevant for $x_i$.
• We take the new geometric margin to be the minimum over all instance margins, $\rho(w) = \min_i \rho_i(w)$, and arrive at the new optimization problem
  $\max_{w} \; \min_i \frac{y_i \, w^{(i)} \cdot x_i}{\|w^{(i)}\|}$.
• However, since the margin terms are normalized by different norms $\|w^{(i)}\|$, the norm cannot be pulled out of the minimization.
• Moreover, each term $\frac{y_i \, w^{(i)} \cdot x_i}{\|w^{(i)}\|}$ is non-convex in $w$, so the problem is difficult to solve directly.

Algorithms
• How can this optimization problem be solved? Three approaches:
  – linearly separable case: a convex formulation
  – the general case:
    • average norm
    • instance-specific margins

A Convex Formulation
• In the linearly separable case, the optimization problem can be transformed into a series of convex optimization problems.
• First step: by maximizing a lower bound $\gamma$, we move the minimization out of the objective and into the constraints:
  $\max_{w, \gamma} \; \gamma \quad \text{s.t.} \quad \min_i \frac{y_i \, w^{(i)} \cdot x_i}{\|w^{(i)}\|} \geq \gamma$.
  The resulting problem is equivalent to the original one, since the bound $\gamma$ can always be raised until it is tight.
• Second step: replace the single constraint by one constraint per instance:
  $\max_{w, \gamma} \; \gamma \quad \text{s.t.} \quad \frac{y_i \, w^{(i)} \cdot x_i}{\|w^{(i)}\|} \geq \gamma, \; i = 1, \dots, n$.
• Finally, because $\|w^{(i)}\| \geq 0$ for all instances, we can write
  $\max_{w, \gamma} \; \gamma \quad \text{s.t.} \quad y_i \, w^{(i)} \cdot x_i \geq \gamma \, \|w^{(i)}\|, \; i = 1, \dots, n$.
• Assume first that $\gamma$ is given. For any fixed value of $\gamma$, each constraint bounds a norm by a linear function of $w$, so the problem has the general structure of an SOCP.
• We can therefore solve it by a bisection search over $\gamma$, solving one SOCP per iteration.
• One remaining problem is that any rescaled version of a solution is also a solution: each constraint is invariant to rescaling of $w$, and the null solution $w = 0$ is always feasible.
• How can this be fixed? We can add a constraint:
  – a non-vanishing norm, e.g. $\|w\| \geq 1$ — but this is no longer convex;
  – or a constraint on a single entry of $w$, fixing $w_k = 1$ or $w_k = -1$, for each entry.
• With the second option we solve the SOCP twice for each entry of $w$, once for $w_k = 1$ and once for $w_k = -1$, for a total of $2d$ problems.
• The convex formulation is difficult to extend to the non-separable case: when slack variables are introduced, they are not normalized by $\|w^{(i)}\|$, so we cannot be sure the problem remains jointly convex in $w$ and the slacks.
• In this case the vanishing solution $w = 0$ is also encountered, and we can no longer guarantee that the modifications discussed above will coincide with the desired solution.
• The non-separable convex formulation is therefore unlikely to be of practical use.

Average Norm
• We consider an alternative solution based on an approximation of the margin $\rho_i(w)$.
• The idea is to approximate the different norms $\|w^{(i)}\|$ by a common term that does not depend on the instance.
• Replace each low-dimensional norm by the root-mean-square norm over all instances,
  $\|w^{(i)}\| \approx \overline{\|w\|} = \sqrt{\tfrac{1}{n} \sum_{i=1}^{n} \|w^{(i)}\|^2}$.
• When all samples have all their features, this reduces exactly to the original SVM.
• In the case of missing features, the approximation is also good when the norms are (nearly) equal, and in that case we expect to find (nearly) optimal solutions.
• With this approximation we can derive
  $\min_{w} \; \tfrac{1}{2}\,\overline{\|w\|}^{\,2} \quad \text{s.t.} \quad y_i \, w^{(i)} \cdot x_i \geq 1, \; i = 1, \dots, n$.
• The linearly separable case can be solved using the same techniques as standard SVMs, and the non-separable case (with slack variables) is again a quadratic programming problem; a sketch of this formulation is given below.
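The following is a minimal sketch (not code from the paper) of the non-separable averaged-norm formulation in cvxpy. It assumes that absent features are marked with NaN, that zero-filling them makes $w^{(i)} \cdot x_i$ equal to an ordinary dot product, and that the squared RMS norm $\tfrac{1}{n}\sum_i \|w^{(i)}\|^2$ can be rewritten as $\sum_k p_k w_k^2$, where $p_k$ is the fraction of instances in which feature $k$ is present; the bias term is omitted to match the instance-margin definition above.

```python
import numpy as np
import cvxpy as cp

def average_norm_svm(X, y, C=1.0):
    """Soft-margin SVM with the averaged-norm approximation.

    X: (n, d) array with np.nan marking structurally absent features.
    y: (n,) labels in {-1, +1}.
    """
    n, d = X.shape
    valid = ~np.isnan(X)            # mask of observed (relevant) features
    Xz = np.where(valid, X, 0.0)    # zero-fill, so w^(i) . x_i == w . x_hat_i
    p = valid.mean(axis=0)          # fraction of instances containing each feature

    w = cp.Variable(d)
    xi = cp.Variable(n, nonneg=True)

    # (1/n) * sum_i ||w^(i)||^2  ==  sum_k p_k * w_k^2  (squared RMS norm)
    rms_sq = cp.sum(cp.multiply(p, cp.square(w)))
    objective = cp.Minimize(0.5 * rms_sq + C * cp.sum(xi))
    constraints = [cp.multiply(y, Xz @ w) >= 1 - xi]
    cp.Problem(objective, constraints).solve()
    return w.value
```

When no features are missing, every $p_k = 1$ and the program reduces to the standard soft-margin SVM of the earlier sketch (without the bias term).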
Average Norm
• However, the averaged norm is not expected to perform well when the norms $\|w^{(i)}\|$ vary considerably across instances.
• How can the problem be solved in that case? With the instance-specific margins approach.

Instance-specific Margins
• We can represent each of the norms $\|w^{(i)}\|$ as a scaling of the full norm $\|w\|$.
• Defining scaling coefficients $s_i = \|w^{(i)}\| / \|w\|$, we can rewrite the instance margin as
  $\rho_i(w) = \frac{y_i \, w^{(i)} \cdot x_i}{s_i \, \|w\|}$.
• Following the same steps as above, this yields
  – the separable case: $\min_{w} \tfrac{1}{2}\|w\|^2 \;\; \text{s.t.} \;\; \frac{y_i \, w^{(i)} \cdot x_i}{s_i} \geq 1$, and
  – the non-separable case: $\min_{w,\xi} \tfrac{1}{2}\|w\|^2 + C \sum_i \xi_i \;\; \text{s.t.} \;\; \frac{y_i \, w^{(i)} \cdot x_i}{s_i} \geq 1 - \xi_i, \; \xi_i \geq 0$.
• How can this be solved? Since the $s_i$ themselves depend on $w$, it is not a quadratic programming problem; it is not even convex in $w$.
• One option is a projected gradient approach: iterate between steps in the direction of the gradient of the Lagrangian and projections onto the constrained space, recalculating $s_i = \|w^{(i)}\| / \|w\|$. With the right choice of step sizes, it converges to a local minimum.
• Are there other solutions? If the $s_i$ are held fixed, the problem is a quadratic programming problem, and we can use this fact to devise an iterative algorithm.
• For a given tuple of $s_i$'s we solve a QP for $w$, and then use the resulting $w$ to calculate new $s_i$'s.
• To solve the QP, we derive its dual for given $s_i$'s, which has the same form as the dual problem of SVMs:
  $\max_{\alpha} \; \sum_{i} \alpha_i - \tfrac{1}{2} \sum_{i,j} \alpha_i \alpha_j \frac{y_i y_j}{s_i s_j} \langle x_i, x_j \rangle$,
  with the usual non-negativity constraints on the $\alpha_i$.
• The inner product $\langle \cdot, \cdot \rangle$ is taken only over the features that are valid for both $x_i$ and $x_j$.
• Kernels for this modified SVM are discussed below.
• Iterative optimization/projection algorithm: alternate between solving this SVM-like dual for fixed $s_i$'s and recomputing the $s_i$'s from the solution. Convergence is not always guaranteed. The dual solution yields the optimal classifier by setting $w = \sum_i \frac{\alpha_i y_i}{s_i} x_i$ (with missing entries treated as zero).
• Two other approaches were considered for this optimization:
  – an updating approach, and
  – a hybrid approach that combines gradient ascent over $s$ with a QP for $w$.
  Neither performed as well as the iterative approach above.

Kernels for Missing Features
• Why use kernels with SVMs? Kernels allow non-linear decision boundaries while the optimization still depends on the data only through inner products.
• Some common kernels:
  – polynomial: $k(x, x') = (x \cdot x')^p$
  – polynomial (inhomogeneous): $k(x, x') = (x \cdot x' + 1)^p$
  – radial basis function: $k(x, x') = \exp(-\gamma \|x - x'\|^2)$
  – Gaussian radial basis function: $k(x, x') = \exp\!\left(-\tfrac{\|x - x'\|^2}{2\sigma^2}\right)$
  – sigmoid: $k(x, x') = \tanh(\kappa \, x \cdot x' + c)$
[Figure: decision boundary of an SVM with an RBF kernel.]
• In the dual formulation above, the dependence on the instances is only through their inner products, so we focus on kernels with the same kind of dependence, such as the polynomial and sigmoid kernels.
• For a polynomial kernel we define the modified kernel as
  $K(x_i, x_j) = \left( \langle x_i, x_j \rangle_{F_i \cap F_j} + c \right)^p$,
  with the inner product calculated only over the features that are valid in both instances.
• We define $\hat{x}$ as the vector obtained by replacing invalid entries (missing features) with zeros.
• Then $\langle \hat{x}_i, \hat{x}_j \rangle = \langle x_i, x_j \rangle_{F_i \cap F_j}$, simply because multiplying by zero is equivalent to skipping the missing values.
• This gives kernels for missing features; a small sketch follows.
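A minimal sketch (not from the paper) of such a kernel, assuming absent features are marked with NaN; the degree and constant term are illustrative parameters.

```python
import numpy as np

def missing_poly_kernel(xi, xj, degree=2, coef0=1.0):
    """Polynomial kernel whose inner product runs only over features
    that are valid (non-NaN) in BOTH instances.

    Zero-filling absent entries yields exactly that inner product,
    since a term multiplied by zero is the same as skipping it.
    """
    xi_hat = np.where(np.isnan(xi), 0.0, xi)
    xj_hat = np.where(np.isnan(xj), 0.0, xj)
    return (xi_hat @ xj_hat + coef0) ** degree
```

Such a kernel can replace the raw inner product in the dual of the iterative algorithm above.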
Experiments
• Three experiments:
  – Features missing at random.
  – Visual object recognition: features are missing because they cannot be located in the image.
  – Biological network completion (metabolic pathway reconstruction): the missingness pattern of the features is determined by the known structure of the network.
• Five common approaches for filling in missing features serve as baselines:
  – Zero
  – Mean
  – Flag
  – kNN
  – EM
  These are compared with the proposed Average Norm and Geometric Margins (instance-specific margins) approaches.

Missing at Random
• Features are removed at random from:
  – data sets from the UCI repository, and
  – MNIST digit images.
[Figure: MNIST digits with randomly missing features.]
• Experiment results: the Geometric Margins approach performs well.
[Figure: results on the UCI data sets and on the MNIST images.]

Visual Object Recognition
• Visual object recognition: determine whether an object from a certain class is present in a given input image.
• For example, the trunk of a car cannot be found in a picture of a hatchback, so the corresponding features are structurally missing.
• The object model contains a set of "landmarks" defining the outline of an object, and several candidate matches are found for each landmark in a given image.
[Figure: five matches for the front-windshield landmark.]
• In the car model, up to 10 matches (candidates) were located for each of the 19 landmarks.
• For each candidate, the first 10 principal component (PCA) coefficients of the image patch are computed.
• These descriptors are concatenated to form 1900 (19 × 10 × 10) features per image. If fewer than 10 candidates are found for a given landmark, the remaining descriptors are considered structurally absent.
• Experiment results:
[Figure: object recognition results.]
• Examples:
[Figure: example detections.]

Metabolic Pathway Reconstruction
• Metabolic pathway reconstruction: predicting missing enzymes in metabolic pathways.
• Instances in this task have missing features due to the structure of the biochemical network.
• Cells use a complex network of chemical reactions to produce their building blocks.
[Figure: an enzyme catalyzes a reaction that converts molecular compounds into other molecular compounds.]
• For many reactions, the enzyme responsible for catalysis is unknown, which makes predicting the identity of such missing enzymes an important computational task.
• How can they be predicted? Enzymes in local network neighborhoods usually participate in related functions.
• Different types of network neighborhood relations between enzyme pairs lead to different relations between their properties. Three types:
  – forks (same inputs, different outputs)
  – funnels (same outputs, different inputs)
  – linear chains
[Figure: linear chains, forks (same inputs, different outputs), and funnels (same outputs, different inputs).]
• Each enzyme is represented by a vector of features that measure its relatedness to each of its different neighbors, across different data types.
• A feature vector has structurally missing entries if the enzyme does not have all types of neighbors.
• Three types of data are used as enzyme attributes:
  – a compendium of gene expression assays,
  – the protein domain content of the enzymes, and
  – the cellular localization of the proteins.
  These data are used to measure the similarity between enzymes.
[Figure: similarity measures used for enzyme predictions.]
• Positive examples come from reactions with known enzymes; negative examples are created by plugging a random impostor gene into each neighborhood.
• Experiment results:
[Figure: pathway reconstruction results.]

Conclusions
• A novel method for max-margin training of classifiers in the presence of missing features.
• Instances are classified by skipping the non-existing features rather than filling them in with hypothetical values.