A tutorial about SVM
Omer Boehm, [email protected]
Neurocomputation Seminar
IBM Haifa Labs, © 2011 IBM Corporation

Outline
- Introduction
- Classification
- Perceptron
- SVM for linearly separable data
- SVM for almost linearly separable data
- SVM for non-linearly separable data

Introduction
Machine learning is a branch of artificial intelligence: a scientific discipline concerned with the design and development of algorithms that allow computers to evolve behaviors based on empirical data.
An important task of machine learning is classification. Classification is also referred to as pattern recognition.

Example
Objects:
| Name    | Income  | Debt   | Married | Age |
|---------|---------|--------|---------|-----|
| Shelley | 60,000  | 1,000  | No      | 30  |
| Elad    | 200,000 | 0      | Yes     | 80  |
| Dan     | 20,000  | 0      | No      | 25  |
| Alona   | 100,000 | 10,000 | Yes     | 40  |
Classes: approve / deny.
A learning machine maps each object to one of the classes.

Types of learning problems
- Supervised learning (n classes, n > 1): classification, regression.
- Unsupervised learning (no class labels): clustering (building equivalence classes), density estimation.

Supervised learning
Regression: learn a continuous function from input samples. Example, stock prediction: the input is a future date, the output is an approximation of the stock price, and training uses information on the stock price over the last period. In fact, these are approximation problems.
Classification: learn a separation function from discrete inputs to classes. Example, optical character recognition (OCR): the input is an image of a digit, the output is a label 0-9, and training uses labeled images of digits.

Regression (figure slide)

Classification (figure slide)

Density estimation (figure slide)

What makes learning difficult
Given the following examples, how should we draw the line? Which of the candidate lines is most appropriate? And how does it behave on the hidden test points? (figure slides)

What is Learning (mathematically)?
We would like to ensure that a small change in an input point relative to a learning point does not result in a jump to a different classification: if $x$ is close to $x_i$, then $f(x)$ should be close to $y_i$.
Such an approximation is called a stable approximation. As a rule of thumb, small derivatives ensure a stable approximation.

Stable vs. unstable approximation
Lagrange approximation (unstable): given $n$ points $(x_i, y_i)$, we find the unique polynomial $f(x) = L_{n-1}(x)$ that passes through the given points.
Spline approximation (stable): given $n$ points $(x_i, y_i)$, we find a piecewise approximation $f(x)$ by third-degree polynomials such that they pass through the given points, have common tangents at the division points, and in addition $\int_{x_1}^{x_n} |f''(x)|^2 \, dx \to \min$.

What would be the best choice?
The "simplest" solution: a solution where the distance from each example is as small as possible and where the derivative is as small as possible.

Vector Geometry
Just in case...

Dot product
The dot product of two vectors $a = [a_1, a_2, a_3, \ldots, a_n]$ and $b = [b_1, b_2, b_3, \ldots, b_n]$ is defined as
$a \cdot b = \sum_{i=1}^{n} a_i b_i = a_1 b_1 + a_2 b_2 + a_3 b_3 + \ldots + a_n b_n$
An example: $[1, 3, -5] \cdot [4, -2, -1] = (1)(4) + (3)(-2) + (-5)(-1) = 3$
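The dot-product example above, along with the length, angle, and unit-vector identities that the next slides rely on, is easy to verify numerically. A minimal NumPy sketch (the perpendicular pair at the end is an arbitrary illustrative choice):

```python
import numpy as np

a = np.array([1.0, 3.0, -5.0])
b = np.array([4.0, -2.0, -1.0])

# a . b = sum_i a_i * b_i
print(np.dot(a, b))                                  # 3.0, as in the slide example

# a . a equals the squared length ||a||^2
print(np.dot(a, a), np.linalg.norm(a) ** 2)

# a . b = ||a|| * ||b|| * cos(theta)
cos_theta = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(np.degrees(np.arccos(cos_theta)))              # angle between a and b

# perpendicular vectors have a zero dot product
print(np.dot(np.array([1.0, 0.0]), np.array([0.0, 1.0])))   # 0.0

# a unit vector a / ||a|| has length 1
print(np.linalg.norm(a / np.linalg.norm(a)))                 # 1.0
```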
Dot product
$a \cdot a = \|a\|^2$, where $\|a\|$ denotes the length (magnitude) of $a$.
$a \cdot b = \|a\| \, \|b\| \cos\theta$; if $a$ is perpendicular to $b$ then $a \cdot b = 0$ (since $\cos 90^\circ = 0$).
Unit vector: $\hat{a} = a / \|a\|$.

Plane/Hyperplane
A hyperplane can be defined by:
- three points
- two vectors
- a normal vector and a point

Plane/Hyperplane
Let $n$ be a vector perpendicular to the hyperplane $H$, and let $p_1$ be the position vector of some known point in the plane.
A point $p$ (with position vector $p$) is in the plane iff the vector drawn from $p_1$ to $p$ is perpendicular to $n$, and two vectors are perpendicular iff their dot product is zero. The hyperplane $H$ can therefore be expressed as
$n \cdot (p - p_1) = 0 \;\Leftrightarrow\; n \cdot p - n \cdot p_1 = 0$
Substituting $w = n$, $x = p$, $b = -n \cdot p_1$ gives the familiar form $w \cdot x + b = 0$.

Classification

Solving approximation problems
First we define the family of approximating functions $F$. Next we define the cost function $C(f)$; this function tells how well $f \in F$ performs the required approximation. Having done this, the approximation/classification consists of solving the minimization problem $\min_{f \in F} C(f)$.
A first necessary condition (after Fermat) is $\partial C / \partial f = 0$. As we know, it is always possible to apply Newton-Raphson and obtain a sequence of approximations.

Classification
A classifier is a function or an algorithm that maps every possible input (from a legal set of inputs) to a finite set of categories.
$X$ is the input space, and $x \in X$ is a data point from the input space. A typical input space is high-dimensional, for example $x = (x_1, x_2, \ldots, x_d) \in R^d$, $d > 1$; $x$ is also called a feature vector.
$\Omega$ is a finite set of categories to which the input data points belong: $\Omega = \{1, 2, \ldots, C\}$; the $\omega_i$ are called labels.
$Y$ is a finite set of decisions, the output set of the classifier.
The classifier is a function $f : X \to Y$: a data point $x \in X$ is classified to $y = f(x) \in Y$.
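To make these definitions concrete, the toy sketch below wires the hyperplane construction into a classifier $f : X \to Y$: a normal vector and a known point fix $b = -w \cdot p_1$, and the sign of $w \cdot x + b$ selects one of two labels. The normal, the point, and the label names are invented for illustration; they do not come from the slides.

```python
import numpy as np

# Hyperplane H: w . x + b = 0, defined by a normal vector w and a point p1 on H
w = np.array([2.0, -1.0])          # normal vector (illustrative choice)
p1 = np.array([1.0, 1.0])          # a known point on the hyperplane
b = -np.dot(w, p1)                 # from w . (x - p1) = 0  =>  b = -w . p1

LABELS = {+1: "class A", -1: "class B"}   # Omega / Y: a finite set of categories

def classify(x):
    """Map a feature vector x in R^2 to one of the two labels."""
    side = np.sign(np.dot(w, x) + b)
    return LABELS[+1] if side >= 0 else LABELS[-1]

print(classify(np.array([3.0, 0.0])))   # positive side of H
print(classify(np.array([0.0, 3.0])))   # negative side of H
print(np.dot(w, p1) + b)                # p1 itself satisfies w . p1 + b = 0
```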
The Perceptron

Perceptron - Frank Rosenblatt (1957)
Linear separation of the input space:
$f(x) = w \cdot x + b$, $h(x) = \mathrm{sign}(f(x))$

Perceptron algorithm
Start: set $t = 0$; the weight vector $w_0$ is generated randomly.
Test: a vector $x \in P \cup N$ is selected randomly (where $P$ and $N$ denote the sets of positive and negative examples);
- if $x \in P$ and $w_t \cdot x > 0$, go to Test
- if $x \in P$ and $w_t \cdot x \le 0$, go to Add
- if $x \in N$ and $w_t \cdot x < 0$, go to Test
- if $x \in N$ and $w_t \cdot x \ge 0$, go to Subtract
Add: $w_{t+1} = w_t + x$, $t = t + 1$, go to Test.
Subtract: $w_{t+1} = w_t - x$, $t = t + 1$, go to Test.

Perceptron algorithm - shorter version
Update rule for the $(k+1)$-th iteration (one iteration per data point):
if $y_i (w_k \cdot x_i) \le 0$, then $w_{k+1} = w_k + y_i x_i$ and $k = k + 1$.

Perceptron - visualization (intuition) (figure slides)

Perceptron - analysis
- The solution is a linear combination of the training points: $w = \sum_i \alpha_i y_i x_i$ with $\alpha_i \ge 0$.
- Only informative points are used (the algorithm is mistake driven).
- The coefficient of a point reflects its 'difficulty'.
- The perceptron learning algorithm does not terminate if the learning set is not linearly separable (e.g. XOR).
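A minimal sketch of the shorter update rule above, assuming the bias is absorbed by appending a constant 1 to every input vector; the toy data set and the epoch cap are illustrative choices, not part of the original slides.

```python
import numpy as np

def perceptron(X, y, max_epochs=100):
    """Train w on labels y in {-1, +1}; the last column of X is 1 to absorb the bias b."""
    w = np.zeros(X.shape[1])
    for _ in range(max_epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * np.dot(w, xi) <= 0:       # misclassified (or on the boundary)
                w = w + yi * xi               # w_{k+1} = w_k + y_i x_i
                mistakes += 1
        if mistakes == 0:                     # converged: the data is linearly separated
            break
    return w

# toy linearly separable set (illustrative)
X = np.array([[-3.0, -1.0, 1.0], [-1.0, -2.0, 1.0], [2.0, 3.0, 1.0], [3.0, 1.0, 1.0]])
y = np.array([-1, -1, +1, +1])
w = perceptron(X, y)
print(w, np.sign(X @ w))                      # all signs should match y
```

The mistake-driven nature of the loop is exactly why the final $w$ is a (sparse) linear combination of the training points, as the analysis above notes.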
Support Vector Machines

Advantages of SVM (Vladimir Vapnik, 1979, 1998)
- Exhibits good generalization.
- Can implement confidence measures, etc.
- The hypothesis has an explicit dependence on the data (via the support vectors).
- Learning involves optimization of a convex function (no false minima, unlike neural networks).
- Few parameters are required for tuning the learning machine (unlike neural networks, where the architecture and various parameters must be found).

Advantages of SVM
From the perspective of statistical learning theory, the motivation for considering binary classifier SVMs comes from theoretical bounds on the generalization error. These generalization bounds have two important features:
- The upper bound on the generalization error does not depend on the dimensionality of the space.
- The bound is minimized by maximizing the margin, i.e. the minimal distance between the hyperplane separating the two classes and the closest data points of each class.

Basic scenario - Separable data set (figure slide)

Basic scenario - define margin (figure slide)

In an arbitrary-dimensional space a separating hyperplane can be written as
$w \cdot x + b = 0$, where $w$ is the normal.
The decision function is $D(x) = \mathrm{sign}(w \cdot x + b)$.

Note that the argument in $D(x)$ is invariant under a rescaling of the form $w \to \lambda w$, $b \to \lambda b$. Implicitly the scale can be fixed by defining
$H_1: w \cdot x + b = +1$ and $H_2: w \cdot x + b = -1$
as the canonical hyperplanes, the hyperplanes passing through the support vectors.

The task is to select $w, b$ so that the training data can be described as:
$x_i \cdot w + b \ge +1$ for $y_i = +1$
$x_i \cdot w + b \le -1$ for $y_i = -1$
These can be combined into: $y_i (x_i \cdot w + b) - 1 \ge 0 \;\; \forall i$.

The margin is given by the projection of the vector $(x_1 - x_2)$ onto the unit normal of the hyperplane, i.e. $w / \|w\|$, where $x_1$ and $x_2$ lie on $H_1$ and $H_2$ respectively. So the (Euclidean) distance can be written as
$d = \dfrac{|(x_1 - x_2) \cdot w|}{\|w\|}$

Note that $x_1$ lies on $H_1$, i.e. $w \cdot x_1 + b = +1$, and similarly for $x_2$: $w \cdot x_2 + b = -1$. Subtracting the two results in $w \cdot (x_1 - x_2) = 2$.

The margin can therefore be written as
$m = \dfrac{|(x_1 - x_2) \cdot w|}{\|w\|} = \dfrac{2}{\|w\|}$
Maximizing the margin can be converted into the problem: minimize $J(w) = \frac{1}{2}(w \cdot w)$ subject to the constraints $y_i (x_i \cdot w + b) - 1 \ge 0 \;\; \forall i$.
$J(w)$ is a quadratic function, thus there is a single global minimum.

Lagrange multipliers
Problem definition: maximize $f(x, y)$ subject to $g(x, y) = c$.
A new variable $\lambda$, called a 'Lagrange multiplier', is used to define
$\Lambda(x, y, \lambda) = f(x, y) + \lambda \left( g(x, y) - c \right)$

Lagrange multipliers (figure slides)

Lagrange multipliers - example
Maximize $f(x, y) = x + y$ subject to $x^2 + y^2 = 1$.
Formally, set $\Lambda(x, y, \lambda) = x + y + \lambda (x^2 + y^2 - 1)$ and set the derivatives to 0:
$\partial \Lambda / \partial x = 1 + 2\lambda x = 0$
$\partial \Lambda / \partial y = 1 + 2\lambda y = 0$
$\partial \Lambda / \partial \lambda = x^2 + y^2 - 1 = 0$
Combining the first two yields $x = y$; substituting into the last gives $x = y = \pm\sqrt{2}/2$. Evaluating the objective function at these points yields
$f(\sqrt{2}/2, \sqrt{2}/2) = \sqrt{2}$ and $f(-\sqrt{2}/2, -\sqrt{2}/2) = -\sqrt{2}$.

Lagrange multipliers - example (figure slide)

Primal problem
Minimize $J(w) = \frac{1}{2}(w \cdot w)$ s.t. $y_i (x_i \cdot w + b) - 1 \ge 0 \;\; \forall i$.
Introduce Lagrange multipliers $\alpha_i \ge 0$ associated with the constraints. The solution to the primal problem is equivalent to determining the saddle point of the function
$L_p(w, b, \alpha) = \frac{1}{2}(w \cdot w) - \sum_{i=1}^{n} \alpha_i \left( y_i (x_i \cdot w + b) - 1 \right)$

At the saddle point, $L_p$ has a minimum with respect to $w$ and $b$, requiring
$\dfrac{\partial L_p}{\partial w} = w - \sum_i \alpha_i y_i x_i = 0 \;\Rightarrow\; w = \sum_i \alpha_i y_i x_i$
$\dfrac{\partial L_p}{\partial b} = \sum_i \alpha_i y_i = 0$

Primal-Dual
$L_p = \frac{1}{2}(w \cdot w) - \sum_{i=1}^{n} \alpha_i y_i (x_i \cdot w + b) + \sum_{i=1}^{n} \alpha_i$
Primal: minimize $L_p$ with respect to $w, b$, subject to $\alpha_i \ge 0 \;\forall i$.
Substitute $w = \sum_i \alpha_i y_i x_i$ and $\sum_i \alpha_i y_i = 0$ to obtain the dual.
Dual: maximize
$L_d = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j)$
with respect to $\alpha$, subject to $\alpha_i \ge 0 \;\forall i$ and $\sum_i \alpha_i y_i = 0$.

Solving the QP using the dual problem
Maximize
$L_d = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j)$
constrained to $\alpha_i \ge 0 \;\forall i$ and $\sum_i \alpha_i y_i = 0$.
We have $n$ new variables $\{\alpha_1, \alpha_2, \ldots, \alpha_n\}$, one for each data point. This is a convex quadratic optimization problem, and we run a QP solver to get $\alpha$ and $w$ ($w = \sum_i \alpha_i y_i x_i$).
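The dual can be handed to any QP package. Purely as an illustration, the sketch below solves it with scipy.optimize.minimize (SLSQP) on a tiny made-up separable set and then recovers $w = \sum_i \alpha_i y_i x_i$; a dedicated QP solver would be the usual choice, and the data here is not from the tutorial.

```python
import numpy as np
from scipy.optimize import minimize

# tiny separable toy set (made up for illustration)
X = np.array([[-2.0, -1.0], [-1.0, -2.0], [1.0, 2.0], [2.0, 1.0]])
y = np.array([-1.0, -1.0, 1.0, 1.0])
n = len(y)
Q = (y[:, None] * y[None, :]) * (X @ X.T)      # Q_ij = y_i y_j (x_i . x_j)

def neg_dual(a):                                # minimize -L_d(alpha)
    return 0.5 * a @ Q @ a - a.sum()

res = minimize(neg_dual, x0=np.zeros(n), method="SLSQP",
               bounds=[(0, None)] * n,                                 # alpha_i >= 0
               constraints=[{"type": "eq", "fun": lambda a: a @ y}])   # sum_i alpha_i y_i = 0
alpha = res.x
w = (alpha * y) @ X                             # w = sum_i alpha_i y_i x_i
print(np.round(alpha, 3), w)
```

The bias $b$ then follows from the KKT conditions described next.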
$b$ can be determined from the optimal $\alpha$ and the Karush-Kuhn-Tucker (KKT) conditions:
$\alpha_i \left[ y_i (w \cdot x_i + b) - 1 \right] = 0 \;\; \forall i$
$\alpha_i > 0$ (data points residing on the canonical hyperplanes, the support vectors) implies $y_i (w \cdot x_i + b) = 1$, i.e. $w \cdot x_i + b = y_i$, so
$b = y_i - w \cdot x_i$ for any support vector, or averaged: $b = \dfrac{1}{N_s} \sum_{i \in SV} \left( y_i - w \cdot x_i \right)$

For every data point $i$, one of the following must hold:
- $\alpha_i = 0$, or
- $\alpha_i > 0$ and $y_i (w \cdot x_i + b) - 1 = 0$.
Many $\alpha_i$ are 0, so $w = \sum_i \alpha_i y_i x_i$ is a sparse solution. Data points with $\alpha_i > 0$ are the support vectors, and the optimal hyperplane is completely defined by the support vectors.

SVM - The classification
Given a new data point $z$, find its label $y$:
$D(z) = \mathrm{sign}\left( \sum_{i=1}^{m} \alpha_i y_i (x_i \cdot z) + b \right)$

Extended scenario - Non-Separable data set (figure slide)

Data is most likely not to be separable (inconsistencies, outliers, noise), but a linear classifier may still be appropriate. SVM can be applied in this non-separable case as long as the data is almost linearly separable.

SVM with slacks
Use non-negative slack variables $\xi_1, \xi_2, \ldots, \xi_n$, one per data point, and change the constraints from
$y_i (x_i \cdot w + b) \ge 1 \;\; \forall i$ to $y_i (x_i \cdot w + b) \ge 1 - \xi_i \;\; \forall i$.
$\xi_i$ is a measure of the deviation from the ideal position for sample $i$: $\xi_i > 1$ means the sample is on the wrong side of the separating hyperplane, and $0 < \xi_i \le 1$ means it is on the correct side but inside the margin.

SVM with slacks
We would like to minimize
$J(w, \xi_1, \ldots, \xi_n) = \frac{1}{2}(w \cdot w) + C \sum_i \xi_i$
constrained to $y_i (x_i \cdot w + b) \ge 1 - \xi_i \;\; \forall i$ and $\xi_i \ge 0$.
The parameter $C$ is a regularization term, which provides a way to control over-fitting:
- if $C$ is small, we allow many samples not in the ideal position;
- if $C$ is large, we want very few samples not in the ideal position.

SVM with slacks - Dual formulation
Maximize
$L_d(\alpha) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j)$
constrained to $0 \le \alpha_i \le C \;\forall i$ and $\sum_{i=1}^{n} \alpha_i y_i = 0$.

SVM - non linear mapping
Cover's theorem: "A pattern-classification problem cast in a high-dimensional space non-linearly is more likely to be linearly separable than in a low-dimensional space."
Example: a one-dimensional data set that is not linearly separable can be lifted to a two-dimensional space with $\phi(x) = (x, x^2)$.

SVM - non linear mapping
Solve a non-linear classification problem with a linear classifier:
- project the data $x$ to a high dimension using a function $\phi(x)$;
- find a linear discriminant function for the transformed data $\phi(x)$.
The final non-linear discriminant function is $g(x) = w^t \phi(x) + w_0$: in the lifted 2D space the discriminant function is linear, while in the original 1D space it is not.
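The $\phi(x) = (x, x^2)$ lifting is easy to see numerically. In the sketch below, a made-up one-dimensional set (positive outside $[-1, 1]$, negative inside) is not linearly separable in 1D, yet a single line in the lifted 2D space, chosen here by hand for illustration, separates it perfectly.

```python
import numpy as np

x = np.array([-3.0, -2.0, -0.5, 0.0, 0.5, 2.0, 3.0])
y = np.where(np.abs(x) > 1.0, 1, -1)     # +1 outside [-1, 1], -1 inside: not separable in 1D

phi = np.column_stack([x, x ** 2])       # lift: phi(x) = (x, x^2)

# In the lifted space the line x2 = 2 (i.e. w = (0, 1), b = -2) separates the two classes:
w, b = np.array([0.0, 1.0]), -2.0
print(np.sign(phi @ w + b) == y)         # all True
```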
SVM - non linear mapping
Any linear classifier can be used after lifting the data into a higher dimensional space. However, we will then have to deal with the "curse of dimensionality":
- poor generalization to test data;
- computationally expensive.
SVM handles the "curse of dimensionality" problem:
- enforcing the largest margin permits good generalization; it can be shown that generalization in SVM is a function of the margin, independent of the dimensionality;
- computation in the higher dimensional case is performed only implicitly, through the use of kernel functions.

Non linear SVM - kernels
Recall: maximize
$L_d = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j)$
and classify with $D(z) = \mathrm{sign}\left( \sum_{i=1}^{m} \alpha_i y_i (x_i \cdot z) + b \right)$.
The data points $x_i$ appear only in dot products. If $x_i$ is mapped to a high-dimensional space using $\phi(x)$, only the high-dimensional dot product $\phi(x_i)^t \phi(x_j)$ is needed: maximize
$L_d = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j \, \phi(x_i)^t \phi(x_j)$
The dimensionality of the lifted space $F$ is not necessarily important; we may not even know the map $\phi$.

Kernel
A kernel is a function that returns the value of the dot product between the images of its two arguments: $k(x, y) = \phi(x) \cdot \phi(y)$.
Given a function $k$, it is possible to verify that it is a kernel. We then only need to compute $k(x, y)$ instead of $\phi(x) \cdot \phi(y)$: maximize
$L_d = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j \, k(x_i, x_j)$
This is the "kernel trick": there is no need to perform operations in the high-dimensional space explicitly.

Kernel Matrix
The kernel (Gram) matrix is the central structure in kernel machines: it contains all the information necessary for the learning algorithm, it fuses information about the data AND the kernel, and it has many interesting properties.

Mercer's Theorem
The kernel matrix is symmetric positive (semi-)definite. Any symmetric positive definite matrix can be regarded as a kernel matrix, that is, as an inner product matrix in some space. Every (semi-)positive definite, symmetric function is a kernel: i.e. there exists a mapping $\phi$ such that it is possible to write $k(x, y) = \phi(x) \cdot \phi(y)$.
Positive definite here means $\iint k(x, y) f(x) f(y) \, dx \, dy \ge 0$ for all $f \in L_2$.
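A quick numerical check of the kernel-matrix properties claimed above, using the Gaussian RBF kernel on random points; the data, the kernel width, and the tolerance are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))                      # 20 random points in R^3
sigma = 1.0

# Gram (kernel) matrix K_ij = k(x_i, x_j) for the RBF kernel
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq_dists / (2 * sigma ** 2))

print(np.allclose(K, K.T))                        # symmetric
print(np.linalg.eigvalsh(K).min() >= -1e-10)      # eigenvalues >= 0: positive semi-definite
```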
Examples of kernels
Some common choices (both satisfying Mercer's condition):
- Polynomial kernel: $k(x_i, x_j) = (x_i \cdot x_j + 1)^p$
- Gaussian radial basis function (RBF): $k(x_i, x_j) = \exp\left( -\dfrac{\|x_i - x_j\|^2}{2\sigma^2} \right)$

Polynomial Kernel - example
$(x \cdot z)^2 = (x_1 z_1 + x_2 z_2)^2 = x_1^2 z_1^2 + x_2^2 z_2^2 + 2 x_1 z_1 x_2 z_2 = (x_1^2, x_2^2, \sqrt{2}\, x_1 x_2) \cdot (z_1^2, z_2^2, \sqrt{2}\, z_1 z_2) = \phi(x) \cdot \phi(z)$

Applying - non linear SVM
Start with data $\{x_1, x_2, \ldots, x_n\}$, which lives in a feature space of dimension $d$. Choose a kernel $k(x_i, x_j)$ corresponding to some function $\phi(x)$ that maps a data point $x_i$ to a higher dimensional space. Find the largest-margin linear discriminant function in the higher dimensional space by using a quadratic programming package to solve: maximize
$L_d(\alpha) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j \, k(x_i, x_j)$
constrained to $0 \le \alpha_i \le C \;\forall i$ and $\sum_{i=1}^{n} \alpha_i y_i = 0$.

Applying - non linear SVM
The weight vector $w$ in the high-dimensional space is $w = \sum_i \alpha_i y_i \phi(x_i)$.
The linear discriminant function of largest margin in the high-dimensional space:
$g(\phi(x)) = w^t \phi(x) = \sum_{x_i \in SV} \alpha_i y_i \, \phi(x_i)^t \phi(x)$
The non-linear discriminant function in the original space:
$g(x) = \sum_{x_i \in SV} \alpha_i y_i \, \phi(x_i)^t \phi(x) = \sum_{x_i \in SV} \alpha_i y_i \, k(x_i, x)$

Applying - non linear SVM (figure slide)

SVM summary
Advantages:
- Based on a nice theory.
- Excellent generalization properties.
- The objective function has no local minima.
- Can be used to find non-linear discriminant functions.
- The complexity of the classifier is characterized by the number of support vectors rather than by the dimensionality of the transformed space.
Disadvantages:
- It is not clear how to select a kernel function in a principled manner.
- Tends to be slower than other methods (in the non-linear case).
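To tie the pieces together, here is a short usage sketch with scikit-learn's SVC on a made-up two-class problem (an inner disc versus an outer ring) that is not linearly separable in the original space; the RBF kernel and the C and gamma values are arbitrary illustrative settings, not recommendations from the tutorial.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# inner disc = class -1, outer ring = class +1 (not linearly separable in R^2)
r = np.concatenate([rng.uniform(0.0, 1.0, 100), rng.uniform(2.0, 3.0, 100)])
t = rng.uniform(0.0, 2 * np.pi, 200)
X = np.column_stack([r * np.cos(t), r * np.sin(t)])
y = np.concatenate([-np.ones(100), np.ones(100)])

clf = SVC(kernel="rbf", C=1.0, gamma=0.5).fit(X, y)
print(clf.score(X, y))                          # training accuracy
print(len(clf.support_vectors_))                # number of support vectors
print(clf.predict([[0.1, 0.2], [2.5, 0.0]]))    # D(z) = sign(sum_i alpha_i y_i k(x_i, z) + b)
```

The number of support vectors reported by the fitted model is the complexity measure mentioned in the summary.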