Lecture 7: Support Vector Machines
Isabelle Guyon, [email protected]

References
• A training algorithm for optimal margin classifiers. Boser, Guyon, Vapnik. COLT, 1992. http://www.clopinet.com/isabelle/Papers/colt92.ps.Z
• Book chapters 1 and 12.
• Software: LibSVM, http://www.csie.ntu.edu.tw/~cjlin/libsvm/

Perceptron Learning Rule
• wj ← wj + yi xij  if example xi is misclassified.
• Equivalently, wj = Σi αi yi xij.
• Converges to "one" solution classifying all the examples well.
[Figure: perceptron diagram with inputs xj, weights wj, summation Σ, output y]

Optimum Margin Perceptron
• wj ← wj + yi xij  if xi is the least well classified example.
• wj = Σi αi yi xij.
• Converges to the optimum margin perceptron (Mézard & Krauth, 1988). QP formulation (Vapnik, 1962).

Optimum Margin Solution
• Unique solution.
• Depends only on the support vectors (SV): wj = Σi∈SV αi yi xij.
• SVs are the examples closest to the boundary.
• Bound on the leave-one-out error: LOO ≤ nSV / m.
• Most "stable", good from an MDL point of view.
• But: sensitive to outliers, and works only in the linearly separable case.

Negative Margin
• Multiple negative-margin solutions, which all give the same number of errors.

Soft Margin
• Examples within the margin area incur a penalty and become non-marginal support vectors.
• Unique solution again (Cortes & Vapnik, 1995): wj = Σi∈SV αi yi xij.

Non-Linear Perceptron (Rosenblatt, 1957)
• f(x) = w • Φ(x) + b
[Figure: network with inputs x1…xn, features φ1(x)…φN(x), weights w1…wN, bias b, output f(x)]

Kernel "Trick" (dual forms, Aizerman, Braverman & Rozonoer, 1964)
• f(x) = w • Φ(x)
• w = Σi αi yi Φ(xi)
• f(x) = Σi αi yi k(xi, x)
• k(xi, x) = Φ(xi) • Φ(x)

Kernel Method (potential functions, Aizerman et al., 1964)
• f(x) = Σi αi k(xi, x) + b
• k(. , .) is a similarity measure or "kernel".
[Figure: network with units k(x1, x)…k(xm, x), weights α1…αm, bias b, output f(x)]

Some Kernels (reminder)
A kernel is a dot product in some feature space: k(s, t) = Φ(s) • Φ(t). Examples:
• Linear kernel: k(s, t) = s • t
• Gaussian kernel: k(s, t) = exp(-γ ||s-t||²)
• Exponential kernel: k(s, t) = exp(-γ ||s-t||)
• Polynomial kernel: k(s, t) = (1 + s • t)^q

Support Vector Classifier (Boser, Guyon & Vapnik, 1992)
• f(x) = Σj∈SV αj yj k(x, xj)
[Figure: decision boundary f(x) = 0 separating f(x) < 0 from f(x) > 0 in the (x1, x2) plane; example boundaries for linear, Gaussian, exponential, polynomial, and hybrid kernels]

Margin and ||w||
• Maximizing the margin is equivalent to minimizing ||w||.
[Figure: margin around the decision boundary]

Quadratic Programming
• Hard margin: min ||w||² such that yj (w • xj + b) ≥ 1 for all examples.
• Soft margin: min ||w||² + C Σj ξj^β, ξj ≥ 0, β = 1 or 2, such that yj (w • xj + b) ≥ 1 - ξj for all examples.

Dual Formulation
• Non-linear case: x → Φ(x). Soft margin: min ||w||² + C Σj ξj^β, ξj ≥ 0, β = 1 or 2, such that yj (w • Φ(xj) + b) ≥ 1 - ξj for all examples.
• Dual: max -½ αᵀ K α + αᵀ 1 such that αᵀ y = 0 and 0 ≤ αi ≤ C.
• Ridge SVC (β = 2): K = [yi yj k(xi, xj)] + (1/C) δij.

"Ridge SVC"
• min ||w||² + C Σj ξj^β, ξj ≥ 0, β = 1 or 2, such that yj f(xj) = yj (w • Φ(xj) + b) ≥ 1 - ξj for all examples.
• 1 - yj f(xj) < 0: ξj = 0, no penalty, not an SV.
• 1 - yj f(xj) = 0: ξj = 0, no penalty, marginal SV.
• 1 - yj f(xj) = ξj > 0: penalty ξj, non-marginal SV.
• Loss: L(xj) = max(0, 1 - yj f(xj))^β. Risk: Σj L(xj).
• Regularized risk: min (1/C) ||w||² + Σj L(xj).

Ridge Regression (reminder)
• Sum of squares: R = Σi (f(xi) - yi)² = Σi (1 - yi f(xi))² in the classification case (yi = ±1).
• Add a "regularizer": R = Σi (1 - yi f(xi))² + λ ||w||².
• Compare with SVC: R = Σj max(0, 1 - yj f(xj))^β + λ ||w||².

Structural Risk Minimization (Vapnik, 1984)
• Nested subsets of models of increasing complexity/capacity: S1 ⊂ S2 ⊂ … ⊂ SN.
• Example: rank with ||w||², Sk = { w : ||w||² < Ak }, A1 < A2 < … < Ak.
• Minimization under constraint: min Remp[f] such that ||w||² < Ak.
• Lagrangian: Rreg[f] = Remp[f] + λ ||w||².
• Radius-margin bound (Vapnik & Chapelle, 2000): LOOcv ≤ 4 r² ||w||².
[Figure: risk vs. capacity]
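To make the regularized-risk reading of the soft-margin SVC concrete, here is a minimal Python/numpy sketch that minimizes λ ||w||² + Σj max(0, 1 - yj (w • xj + b)) by sub-gradient descent. It only illustrates the loss-plus-regularizer view, not the QP formulation above or LibSVM; the function name, step size, and toy data are invented for the example.

import numpy as np

def train_linear_svc(X, y, lam=0.01, lr=0.01, epochs=200):
    # X: (m, n) data matrix, y: (m,) labels in {-1, +1}
    n = X.shape[1]
    w, b = np.zeros(n), 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        viol = margins < 1                       # hinge loss is active for these examples
        grad_w = 2 * lam * w - (y[viol, None] * X[viol]).sum(axis=0)
        grad_b = -y[viol].sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Toy usage on a linearly separable 2-D problem
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, (50, 2)), rng.normal(+2, 1, (50, 2))])
y = np.hstack([-np.ones(50), np.ones(50)])
w, b = train_linear_svc(X, y)
print("training error:", np.mean(np.sign(X @ w + b) != y))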
Loss Functions L(y, f(x))
As a function of z = y f(x):
• SVC loss, β = 2: max(0, 1 - z)²
• SVC loss, β = 1: max(0, 1 - z)
• Adaboost loss: exp(-z)
• Logistic loss: log(1 + exp(-z))
• Square loss: (1 - z)²
• Perceptron loss: max(0, -z)
• 0/1 loss
[Figure: the losses plotted against z = y f(x); the decision boundary and the margin are marked; z < 0 misclassified, z > 0 well classified]

Regularizers
• ||w||₂² = Σi wi² : 2-norm regularization (ridge regression, original SVM).
• ||w||₁ = Σi |wi| : 1-norm regularization (Lasso, Tibshirani 1996; 1-norm SVM, 1965).
• ||w||₀ = number of non-zero weights : 0-norm regularization (Weston et al., 2003).

Regression SVM
• Epsilon-insensitive loss: |yi - f(xi)|ε

Unsupervised Learning
SVMs for:
• density estimation: fit F(x) (Vapnik, 1998)
• finding the support of a distribution (one-class SVM) (Schoelkopf et al., 1999)
• novelty detection (Schoelkopf et al., 1999)
• clustering (Ben-Hur et al., 2001)

Summary
• For statistical model inference, two ingredients are needed:
– A loss function: defines the residual error, i.e. what is not explained by the model; characterizes the data uncertainty or "noise".
– A regularizer: defines our "prior knowledge" and biases the solution; characterizes our uncertainty about the model. We usually bet on simpler solutions (Ockham's razor).

Exercise Class
Filters: see Chapter 3.

Homework 7
1) Download the software for homework 7.
2) Taking inspiration from the examples, write a new feature ranking filter object. Choose one from Chapter 3 or invent your own.
3) Provide the p-value and FDR (using a tabulated distribution or the probe method).
4) Email a zip file with your object and a plot of the FDR to [email protected] with subject "Homework7", no later than Tuesday, December 13th.

Dexter Baseline
DEXTER (filters texts):
Size                 0.9 MB
Type                 sparse integer
Features             20000
Training examples    300
Validation examples  300
Test examples        2000

Baseline model (CLOP):
my_classif=svc({'coef0=1', 'degree=1', 'gamma=0', 'shrinkage=0.5'});
my_model=chain({s2n('f_max=300'), normalize, my_classif})

• Results:
% feat   % probe   Train BER%   Valid BER%   Test BER%   Train AUC   Valid AUC   Test AUC
1.5      16.33     0.33         7            5           1           0.982       0.988

DEXTER (challenge results)
• Best entries: BER ~3.3-3.9%, AUC ~0.97-0.99, Frac_feat ~1.5%, Frac_probe ~50%.
[Figure: histogram of challenge entries' BER on DEXTER]

Evaluation of pval and FDR: analytic vs. probe
• Ttest object:
– computes pval analytically
– FDR ~ pval * nsc/n
• probe object:
– takes any feature ranking object as an argument (e.g. s2n, relief, Ttest)
– pval ~ nsp/np
– FDR ~ pval * nsc/n
• FDR: fraction of features falsely found significant.
[Figure: FDR vs. rank on Arcene, Dexter, Dorothea, Gisette, and Madelon; red: analytic, blue: probe]

Relief vs. Ttest
[Figure: FDR (fraction of features falsely found significant) vs. rank (number of features found significant) for Relief; dashed blue: Ttest with probes]
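To illustrate the probe method above, here is a hedged Python/numpy sketch, not the CLOP probe object. It assumes that pval at rank r is estimated as the fraction of probes scoring at least as well as the r-th real feature (pval ~ nsp/np), and that the FDR estimate scales this by the ratio of candidate features to selected features; the function name and toy scores are invented for the example.

import numpy as np

def probe_fdr(real_scores, probe_scores):
    # Returns estimated pval and FDR at each rank of the real features.
    real_scores = np.sort(real_scores)[::-1]          # best score first
    n, n_probes = len(real_scores), len(probe_scores)
    pval = np.empty(n)
    fdr = np.empty(n)
    for r, s in enumerate(real_scores, start=1):
        nsp = np.sum(probe_scores >= s)               # probes scoring at least as well
        pval[r - 1] = nsp / n_probes
        fdr[r - 1] = min(1.0, pval[r - 1] * n / r)    # FDR ~ pval * n_candidates / n_selected
    return pval, fdr

# Toy usage: 1000 mildly informative feature scores plus 1000 pure-noise probes
rng = np.random.default_rng(0)
real = rng.normal(0.5, 1, 1000)
probes = rng.normal(0.0, 1, 1000)
pval, fdr = probe_fdr(real, probes)
print(fdr[:5])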
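Returning to Homework 7: the Dexter baseline ranks features with a signal-to-noise (s2n) criterion before the SVC. As a language-neutral illustration only, not a CLOP filter object, the sketch below assumes the usual definition s2n = |mu+ - mu-| / (sigma+ + sigma-); the function name and toy data are invented for the example.

import numpy as np

def s2n_ranking(X, y, eps=1e-12):
    # X: (m, n) data matrix, y: (m,) labels in {-1, +1}.
    # Returns feature indices sorted from most to least relevant, plus the scores.
    pos, neg = X[y == 1], X[y == -1]
    score = np.abs(pos.mean(axis=0) - neg.mean(axis=0)) / (pos.std(axis=0) + neg.std(axis=0) + eps)
    return np.argsort(score)[::-1], score

# Toy usage: keep the 300 best features, in the spirit of the s2n('f_max=300') step of the baseline
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2000))
y = np.where(rng.random(300) < 0.5, -1.0, 1.0)
ranking, score = s2n_ranking(X, y)
top300 = ranking[:300]
print(top300[:10])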