Linear Discriminant Functions and SVM
Seong-Wook Joo

1. Fisher linear discriminant

(1) Motivation
• We want dimensionality reduction in a classification problem.
• PCA seeks directions (principal directions) that best represent the original data.
• Fisher linear discriminant analysis (FDA) seeks directions that are efficient for discrimination.
  (Figure omitted: two candidate projection directions w; which is better for discrimination?)

(2) Projection onto one direction w, two-class problem
• Samples: n d-dimensional vectors x_1, ..., x_n, consisting of two subsets D_1 and D_2.
• Projected samples: y = w^t x, forming two subsets Y_1 and Y_2.
• Criterion: maximize the Fisher linear discriminant

    J(w) = (\tilde{m}_1 - \tilde{m}_2)^2 / (\tilde{s}_1^2 + \tilde{s}_2^2)

  where \tilde{m}_i is the mean of the y ∈ Y_i and \tilde{s}_i^2 = \sum_{y ∈ Y_i} (y - \tilde{m}_i)^2 is the scatter of Y_i. In terms of the original samples,

    J(w) = (w^t S_B w) / (w^t S_W w)

  where S_B = (m_1 - m_2)(m_1 - m_2)^t is the between-class scatter matrix (m_i = mean of the x ∈ D_i), and S_W = S_1 + S_2 is the within-class scatter matrix, with S_i = \sum_{x ∈ D_i} (x - m_i)(x - m_i)^t.
• Solution: w = S_W^{-1}(m_1 - m_2) (see the first sketch below).
• Threshold (decision boundary): not given by FDA. Use Bayesian decision theory (minimize risk or error rate), or any general classifier.

(3) Projection onto multiple directions w_i, c-class problem: Multiple Discriminant Analysis (MDA)
• Reduction to (c-1) dimensions: there can be at most c-1 solutions w_i.
• Projected samples: y_i = w_i^t x, i = 1, ..., c-1, or in matrix notation y = W^t x.
• Generalization of the scatter matrices:

    S_W = \sum_{i=1}^{c} S_i
    S_B = \sum_{i=1}^{c} n_i (m_i - m)(m_i - m)^t

  where S_B comes from decomposing the total (all-classes) scatter matrix and m is the total mean.
• Criterion: maximize the Fisher linear discriminant

    J(W) = |W^t S_B W| / |W^t S_W W|

  where |·| denotes the matrix determinant. (We want a scalar measure of a matrix; the determinant of a scatter matrix, the product of its eigenvalues, is a measure of the scattering volume.)
• Solution: solve the generalized eigenvector problem S_B w_i = \lambda_i S_W w_i. Since S_B has rank at most c-1, there can be at most c-1 nonzero eigenvalues \lambda_i with corresponding eigenvectors w_i (see the second sketch below).
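A minimal numpy sketch of the two-class solution w = S_W^{-1}(m_1 - m_2), assuming the two classes are given as row-sample arrays X1 and X2; the name fisher_direction and the toy data are illustrative, not from the notes.

```python
import numpy as np

def fisher_direction(X1, X2):
    """Two-class Fisher direction w = S_W^{-1} (m1 - m2).

    X1, X2: (n_i, d) arrays, one d-dimensional sample per row.
    Assumes S_W is nonsingular (enough samples per class).
    """
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    # Within-class scatter S_W = S_1 + S_2, S_i = sum (x - m_i)(x - m_i)^t
    Sw = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)
    # Solve S_W w = m1 - m2 rather than forming the inverse explicitly
    w = np.linalg.solve(Sw, m1 - m2)
    return w / np.linalg.norm(w)  # the scale of w does not affect J(w)

# Toy example: two clouds in 2-D, projected onto one direction
rng = np.random.default_rng(0)
X1 = rng.normal([0.0, 0.0], 0.5, size=(50, 2))
X2 = rng.normal([2.0, 1.0], 0.5, size=(50, 2))
w = fisher_direction(X1, X2)
y1, y2 = X1 @ w, X2 @ w  # 1-D projections, well separated
```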
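For MDA, the generalized eigenproblem S_B w_i = \lambda_i S_W w_i can be handed directly to scipy.linalg.eigh, which accepts a second (symmetric positive-definite) matrix. A sketch assuming the input is a list of per-class sample arrays and that S_W is nonsingular; all names are illustrative.

```python
import numpy as np
from scipy.linalg import eigh

def mda_projection(classes):
    """classes: list of (n_i, d) arrays. Returns W, a (d, c-1) projection."""
    X = np.vstack(classes)
    m = X.mean(axis=0)                             # total mean
    d, c = X.shape[1], len(classes)
    Sw, Sb = np.zeros((d, d)), np.zeros((d, d))
    for Xi in classes:
        mi = Xi.mean(axis=0)
        Sw += (Xi - mi).T @ (Xi - mi)              # within-class scatter
        Sb += len(Xi) * np.outer(mi - m, mi - m)   # between-class scatter
    evals, evecs = eigh(Sb, Sw)                    # solves S_B w = lambda S_W w
    order = np.argsort(evals)[::-1]                # largest eigenvalues first
    return evecs[:, order[:c - 1]]                 # at most c-1 nonzero ones
```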
2. Linear discriminant functions (linear classifiers [1])

(1) Problem formulation
• Linear discriminant function for a two-class (ω_1 or ω_2) problem:

    g(x) = w^t x + w_0

• The goal is to find the weight vector w and the threshold (bias) w_0 such that x is assigned to ω_1 if g(x) > 0 and to ω_2 if g(x) < 0. The decision boundary is a hyperplane.
  − Note: the distance from x to the hyperplane is g(x)/||w||. The margin is the minimum of these distances.
• For the multi-category case, define a linear machine g_i(x) = w_i^t x + w_{i0}, i = 1, ..., c, and assign ω_i if g_i(x) > g_j(x) for all j ≠ i.
• Generalized linear discriminant function:

    g(x) = a^t y,  y = φ(x)

  − Allows for a nonlinear decision boundary.
  − If y is of higher dimension, we can have multiply connected decision regions.
  − But beware of "the curse of dimensionality".
• Augmented feature vector:

    y = [1, x^t]^t,  a = [w_0, w^t]^t,  so that g(x) = a^t y

  − Note: the distance from y to the hyperplane is g(y)/||a||.
• Normalization
  − Replace every sample y belonging to ω_2 by −y.
  − The problem becomes finding a such that a^t y > 0 for all samples y.

(2) Criterion functions and algorithms
• Linearly separable case: Perceptron criterion
  − J_P(a) = \sum_{y ∈ Y} (−a^t y), where Y = {y's misclassified by a}.
  − Algorithm: keep adding misclassified samples y to a (a sketch follows this section).
  − Finite convergence only if the data are linearly separable; otherwise the algorithm does not terminate.
• Non-separable case: Minimum squared error (MSE)
  − J_S(a) = ||Ya − b||^2 = \sum_{i=1}^{n} (a^t y_i − b_i)^2 for some given target vector b (e.g., all ones).
  − Solution: a = (Y^t Y)^{-1} Y^t b, i.e., the pseudoinverse of Y applied to b (see the second sketch below).
  − The resulting a is the same as FDA if b = [n/n_1 ... (n_1 times), n/n_2 ... (n_2 times)]^t.
  − In addition, this choice gives the threshold w_0 = −m^t w (m = mean of all samples).
• Ho-Kashyap procedure (third sketch below)
  − A combination of MSE (both a and b are unknown) and the Perceptron.
  − Finite convergence if linearly separable; otherwise it provides a proof of non-separability (but with no bound on the number of iterations needed).
• Linear programming
  − Finite convergence in both the separable and the non-separable case (but the solution is useful only if separable?).
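A minimal sketch of the fixed-increment perceptron rule, assuming Y is an (n, d+1) array of augmented samples with the ω_2 rows already negated; the function name and parameters are illustrative.

```python
import numpy as np

def perceptron(Y, eta=1.0, max_epochs=1000):
    """Fixed-increment perceptron on normalized augmented samples.

    Y: (n, d+1) array; each row is [1, x^t], negated for class omega_2,
    so a solution vector a satisfies a^t y > 0 for every row y.
    """
    a = np.zeros(Y.shape[1])
    for _ in range(max_epochs):
        errors = 0
        for y in Y:
            if a @ y <= 0:       # misclassified: a^t y should be positive
                a += eta * y     # add the misclassified sample to a
                errors += 1
        if errors == 0:          # converged; happens only if separable
            return a
    return None                  # gave up; data may be non-separable
```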
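The MSE solution is a single least-squares solve. A sketch using the FDA-equivalent target vector b from the notes, with illustrative names; in the returned vector, a[0] is the threshold w_0 and a[1:] is w.

```python
import numpy as np

def mse_discriminant(X1, X2):
    """MSE solution a = (Y^t Y)^{-1} Y^t b with the FDA-equivalent b."""
    n1, n2 = len(X1), len(X2)
    n = n1 + n2
    # Augment with a leading 1 and negate the omega_2 rows (normalization)
    Y = np.vstack([np.hstack([np.ones((n1, 1)), X1]),
                   -np.hstack([np.ones((n2, 1)), X2])])
    # b = [n/n1 (n1 times), n/n2 (n2 times)]^t makes w match the FDA direction
    b = np.concatenate([np.full(n1, n / n1), np.full(n2, n / n2)])
    a, *_ = np.linalg.lstsq(Y, b, rcond=None)  # least squares = pseudoinverse
    return a
```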
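A sketch of the Ho-Kashyap iteration as described in [1]: alternate a pseudoinverse solve for a with a one-sided update of b, stopping when the error vector e = Ya − b is (nearly) zero, or when e becomes nonpositive and nonzero, which proves non-separability. Names and the step size eta are illustrative.

```python
import numpy as np

def ho_kashyap(Y, eta=0.5, max_iter=1000, tol=1e-8):
    """Y: (n, d+1) normalized augmented samples. Returns (a, b) or None."""
    Ypinv = np.linalg.pinv(Y)
    b = np.ones(Y.shape[0])
    a = Ypinv @ b
    for _ in range(max_iter):
        e = Y @ a - b
        if np.all(np.abs(e) < tol):
            return a, b                    # separating solution found
        if np.all(e <= 0) and np.any(e < 0):
            return None                    # proof of non-separability
        b = b + eta * (e + np.abs(e))      # raise b only where e > 0
        a = Ypinv @ b
    return a, b                            # undecided after max_iter
```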
3. Support Vector Machines (SVM) [2]

(1) Capacity of a classifier
• A classifier needs to be accurate on the training samples, but it must also generalize well to testing samples.
• A classifier with too much capacity will not generalize well (e.g., a high-degree polynomial).
• The "VC dimension" is a measure of the capacity of a class of functions.
  − Definition: the largest number of points that can be shattered by the class, i.e., classified correctly under every possible labeling.
  − A line in R^2 has VC dimension 3; a hyperplane in R^n has VC dimension n+1.
• VC bound
  − The expected risk (error) of a classifier is bounded by

      R ≤ R_emp + φ(h, ...)

    where R_emp is the training error and h is the VC dimension.
  − φ(h, ...) increases monotonically with h.

(2) Linear SVM
• Training data: {(x_i, y_i) | i = 1, ..., l}, where the x_i are training samples and y_i ∈ {+1, −1} are the labels.
• SVM looks for the separating hyperplane (w, b) that maximizes the margin (a scikit-learn sketch follows this section).
• Say (w, b) is scaled so that

    x_i · w + b ≥ +1 for y_i = +1
    x_i · w + b ≤ −1 for y_i = −1

  Then the margin is 2/||w||, so we minimize ||w||^2 subject to the constraints above.
  − Solve for the Lagrange multipliers α_i ≥ 0 from the "dual" problem (details omitted).
  − w = \sum_i α_i y_i x_i.
• The support vectors are the x_i that satisfy the constraints with equality. The capacity decreases as the margin increases [2].
• Slack variables ξ_i ≥ 0 can be added to allow for outliers:

    x_i · w + b ≥ +1 − ξ_i for y_i = +1
    x_i · w + b ≤ −1 + ξ_i for y_i = −1

  with some penalty: minimize ||w||^2 + C (\sum_i ξ_i).

(3) Non-linear SVM
• Use a non-linear mapping Φ that maps x into a higher-dimensional (possibly infinite-dimensional) space.
• Note that the x_i appear only in the form of dot products in the training phase.
• If there exists a "kernel function" K such that K(x_i, x_j) = Φ(x_i) · Φ(x_j), then we never need to evaluate Φ(x) itself.
• We don't need to know w in the testing phase either, since

    Φ(x) · w + b = \sum_i α_i y_i K(x, x_i) + b

• Examples of such kernels (the second sketch below checks the polynomial case numerically for p = 2):

    K(x, y) = (x · y + 1)^p                 (polynomial)
    K(x, y) = exp(−||x − y||^2 / (2σ^2))    (Gaussian)
    K(x, y) = tanh(κ x · y − δ)             (sigmoid)

(4) Notes
• SVM training always finds a global solution (a neural network doesn't).
• SVM performance depends on the choice of kernel, and the best choice is not known.
• Speed and size (large datasets) are problems yet to be solved.
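One way to experiment with the soft-margin linear SVM is scikit-learn's SVC; this sketch assumes that library is available, and the toy data are illustrative. The fitted model exposes the support vectors and the dual coefficients α_i y_i, from which w = \sum_i α_i y_i x_i can be recovered.

```python
import numpy as np
from sklearn.svm import SVC

# Two separable clouds with labels -1 / +1
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.5, (40, 2)), rng.normal(2.0, 0.5, (40, 2))])
y = np.array([-1] * 40 + [+1] * 40)

clf = SVC(kernel="linear", C=1.0)     # C penalizes the slack sum
clf.fit(X, y)

# dual_coef_ holds alpha_i * y_i for the support vectors only
w = (clf.dual_coef_ @ clf.support_vectors_).ravel()  # w = sum alpha_i y_i x_i
b = clf.intercept_[0]
print("support vectors:", len(clf.support_vectors_))
print("margin = 2/||w|| =", 2 / np.linalg.norm(w))
```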
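To see the kernel identity K(x_i, x_j) = Φ(x_i) · Φ(x_j) concretely: the polynomial kernel with p = 2 on 2-D inputs has an explicit 6-dimensional feature map. The sketch below (names illustrative) checks numerically that both sides agree.

```python
import numpy as np

def poly_kernel(x, y, p=2):
    return (x @ y + 1) ** p           # K(x, y) = (x . y + 1)^p

def phi(x):
    """Explicit feature map with phi(x) . phi(y) = (x . y + 1)^2 for 2-D x."""
    x1, x2 = x
    r2 = np.sqrt(2)
    return np.array([1, r2 * x1, r2 * x2, x1**2, x2**2, r2 * x1 * x2])

x = np.array([1.0, 2.0])
y = np.array([0.5, -1.0])
print(poly_kernel(x, y))   # 0.25, computed in R^2 via the kernel
print(phi(x) @ phi(y))     # 0.25, computed explicitly in R^6
```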
References

[1] R. O. Duda, P. E. Hart, and D. G. Stork, "Pattern Classification", 2nd ed., Chapter 5, Wiley, 2000.
[2] C. J. C. Burges, "A Tutorial on Support Vector Machines for Pattern Recognition", Data Mining and Knowledge Discovery, 2(2), 1998.
[3] B. Schölkopf and A. J. Smola, "Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond", Chapter 1, MIT Press, 2002.