Lecture 7: Support Vector Machines
Isabelle Guyon
[email protected]

References
• A training algorithm for optimal margin classifiers.
  Boser-Guyon-Vapnik, COLT, 1992.
  http://www.clopinet.com/isabelle/Papers/colt92.ps.Z
• Book chapters 1 and 12.
• Software: LibSVM
  http://www.csie.ntu.edu.tw/~cjlin/libsvm/
Perceptron Learning Rule
• Update rule: w_j ← w_j + y_i x_ij   if example x_i is misclassified.
• The resulting weight vector is a linear combination of the training examples:
  w_j = Σ_i α_i y_i x_ij
[Figure: perceptron diagram with inputs x_j, weights w_j, summation unit Σ and output y.]
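As an illustration (not part of the original slides), a minimal Python/numpy sketch of the perceptron learning rule above; the bias update b ← b + y_i is an assumption, since the slide only shows the weight update:

    import numpy as np

    def perceptron(X, y, epochs=100):
        # w_j <- w_j + y_i * x_ij whenever example x_i is misclassified
        m, n = X.shape
        w, b = np.zeros(n), 0.0
        for _ in range(epochs):
            mistakes = 0
            for i in range(m):
                if y[i] * (X[i] @ w + b) <= 0:   # x_i misclassified (labels y in {-1, +1})
                    w += y[i] * X[i]             # weight update from the slide
                    b += y[i]                    # bias update (assumption)
                    mistakes += 1
            if mistakes == 0:                    # converged: all examples classified well
                break
        return w, b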
Optimum Margin Perceptron
• Update rule: w_j ← w_j + y_i x_ij   if x_i is the least well classified example.
• As before, w_j = Σ_i α_i y_i x_ij
• The plain perceptron rule converges to "one" solution classifying well all the examples.
• Updating on the least well classified example converges to the optimum margin perceptron (Mezard-Krauth, 1988).
• QP formulation (Vapnik, 1962).
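A rough Python sketch (not from the slides) of the minover-style rule just described: repeatedly update on the example with the smallest margin; the fixed iteration count, unit learning rate and absence of a bias term are simplifying assumptions.

    import numpy as np

    def minover(X, y, n_iter=1000):
        # Always update on the least well classified example (smallest y_i * w.x_i).
        w = np.zeros(X.shape[1])
        alphas = np.zeros(len(y))            # alpha_i counts the updates on example i,
        for _ in range(n_iter):              # so that w = sum_i alpha_i * y_i * x_i
            i = np.argmin(y * (X @ w))       # least well classified example
            w += y[i] * X[i]
            alphas[i] += 1
        return w, alphas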
Optimum Margin Solution
• Unique solution.
• Depends only on the support vectors (SV): w_j = Σ_{i∈SV} α_i y_i x_ij
• The SVs are the examples closest to the boundary.
• Bound on the leave-one-out error: LOO ≤ n_SV / m.
• Most "stable", good from an MDL point of view.
• But: sensitive to outliers, and works only in the linearly separable case.

Negative Margin
When the classes are not linearly separable, there are multiple negative-margin solutions, which all give the same number of errors.
Soft-Margin
Examples within the margin area incur a penalty and become non-marginal support vectors. Unique solution again (Cortes-Vapnik, 1995):
w_j = Σ_{i∈SV} α_i y_i x_ij
Non-Linear Perceptron (Rosenblatt, 1957)
[Figure: inputs x_1, x_2, …, x_n are mapped to features φ_1(x), φ_2(x), …, φ_N(x), which are combined with weights w_1, w_2, …, w_N and a bias b (constant input 1) in a summation unit Σ to produce f(x).]
f(x) = w • Φ(x) + b
Kernel "Trick"
• f(x) = w • Φ(x)
• w = Σ_i α_i y_i Φ(x_i)
• Dual form: f(x) = Σ_i α_i y_i k(x_i, x)
• k(x_i, x) = Φ(x_i) • Φ(x)
Aizerman-Braverman-Rozonoer, 1964.

Kernel Method
[Figure: the input x = (x_1, …, x_n) is compared to each training example through kernel units k(x_1, x), k(x_2, x), …, k(x_m, x); their outputs are combined with weights α_1, α_2, …, α_m and a bias b (constant input 1) to give f(x) = Σ_i α_i k(x_i, x) + b.]
k(.,.) is a similarity measure or "kernel" (potential functions, Aizerman et al., 1964).

Some Kernels (reminder)
A kernel is a dot product in some feature space: k(s, t) = Φ(s) • Φ(t)
Examples:
• k(s, t) = s • t                (linear kernel)
• k(s, t) = exp(-γ ||s-t||²)     (Gaussian kernel)
• k(s, t) = exp(-γ ||s-t||)      (exponential kernel)
• k(s, t) = (1 + s • t)^q        (polynomial kernel)
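For illustration (not part of the original slides), a small Python/numpy sketch of these kernels and of the dual-form decision function; the degree-2 feature map Φ shown for the polynomial kernel is an example chosen here, not taken from the lecture:

    import numpy as np

    def linear_kernel(s, t):
        return s @ t

    def gaussian_kernel(s, t, gamma=1.0):
        return np.exp(-gamma * np.sum((s - t) ** 2))

    def exponential_kernel(s, t, gamma=1.0):
        return np.exp(-gamma * np.linalg.norm(s - t))

    def polynomial_kernel(s, t, q=2):
        return (1.0 + s @ t) ** q

    def phi_poly2(x):
        # Explicit feature map for the degree-2 polynomial kernel in 2 dimensions,
        # chosen so that (1 + s.t)^2 = Phi(s).Phi(t).
        x1, x2 = x
        return np.array([1.0, np.sqrt(2) * x1, np.sqrt(2) * x2,
                         x1 ** 2, x2 ** 2, np.sqrt(2) * x1 * x2])

    s, t = np.array([1.0, 2.0]), np.array([0.5, -1.0])
    assert np.isclose(polynomial_kernel(s, t), phi_poly2(s) @ phi_poly2(t))

    def f(x, alphas, ys, X_sv, kernel=gaussian_kernel, b=0.0):
        # Dual-form decision function: f(x) = sum_i alpha_i * y_i * k(x_i, x) + b
        return sum(a * yi * kernel(xi, x) for a, yi, xi in zip(alphas, ys, X_sv)) + b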
Support Vector Classifier (Boser-Guyon-Vapnik, 1992)
f(x) = Σ_{j∈SV} α_j y_j k(x, x_j)
[Figure: decision boundary f(x) = 0 in the (x_1, x_2) plane for x = [x_1, x_2], separating the regions f(x) < 0 and f(x) > 0, shown for a linear kernel, a Gaussian kernel exp(-γ||s-t||²), an exponential kernel, a polynomial kernel and a hybrid kernel.]
Margin and ||w||
[Figure: separating hyperplane with the margin area between the two classes.]
Maximizing the margin is equivalent to minimizing ||w||.

Quadratic Programming
• Hard margin:
  min ||w||²  such that  y_j(w • x_j + b) ≥ 1  for all examples.
• Soft margin:
  min ||w||² + C Σ_j ξ_j^β,  ξ_j ≥ 0,  β = 1 or 2,
  such that  y_j(w • x_j + b) ≥ 1 - ξ_j  for all examples.
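A minimal sketch (not part of the course software) of the soft-margin primal QP above for β = 1, written with the cvxpy modeling library; the toy data and the value C = 1 are arbitrary:

    import numpy as np
    import cvxpy as cp

    # Toy 2-D data with labels in {-1, +1}
    X = np.array([[2.0, 2.0], [1.5, 2.5], [1.0, 0.5],
                  [-1.0, -1.5], [-2.0, -1.0], [-0.5, -2.0]])
    y = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])
    m, n = X.shape
    C = 1.0

    w = cp.Variable(n)
    b = cp.Variable()
    xi = cp.Variable(m, nonneg=True)          # slack variables xi_j >= 0

    # Soft margin: min ||w||^2 + C * sum_j xi_j  s.t.  y_j (w.x_j + b) >= 1 - xi_j
    objective = cp.Minimize(cp.sum_squares(w) + C * cp.sum(xi))
    constraints = [cp.multiply(y, X @ w + b) >= 1 - xi]
    cp.Problem(objective, constraints).solve()

    print(w.value, b.value)                   # separating hyperplane parameters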
Dual Formulation
• Soft margin, non-linear case: x → Φ(x)
  Primal:  min ||w||² + C Σ_j ξ_j^β,  ξ_j ≥ 0,  β = 1 or 2,
  such that  y_j(w • Φ(x_j) + b) ≥ 1 - ξ_j  for all examples.
• Dual:  max -½ α^T K α + α^T 1
  such that  α^T y = 0 ;  0 ≤ α_i ≤ C,
  with  K = [y_i y_j k(x_i, x_j)] + (1/C) δ_ij.

"Ridge SVC"
• Soft margin:  min ||w||² + C Σ_j ξ_j^β,  ξ_j ≥ 0,  β = 1 or 2,
  such that  y_j(w • Φ(x_j) + b) ≥ 1 - ξ_j  for all examples.
• 1 - y_j f(x_j) < 0:  ξ_j = 0, no penalty, not an SV.
• 1 - y_j f(x_j) = 0:  ξ_j = 0, no penalty, marginal SV.
• 1 - y_j f(x_j) = ξ_j > 0:  penalty ξ_j, non-marginal SV.
• Loss:  L(x_j) = max(0, 1 - y_j f(x_j))^β
• Risk:  Σ_j L(x_j)
• Regularized risk:  min (1/C) ||w||² + Σ_j L(x_j)
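As an illustration of the regularized-risk view just given (and not of the QP/dual solver on the slides), a small numpy subgradient-descent sketch for β = 1; the learning rate and number of epochs are arbitrary choices:

    import numpy as np

    def ridge_svc_subgradient(X, y, C=1.0, lr=0.01, epochs=500):
        # Minimize (1/C) * ||w||^2 + sum_j max(0, 1 - y_j * f(x_j)),  f(x) = w.x + b
        m, n = X.shape
        w, b = np.zeros(n), 0.0
        for _ in range(epochs):
            margins = y * (X @ w + b)
            active = margins < 1                            # examples paying a hinge penalty
            grad_w = (2.0 / C) * w - (y[active, None] * X[active]).sum(axis=0)
            grad_b = -y[active].sum()
            w -= lr * grad_w
            b -= lr * grad_b
        return w, b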
Ridge Regression (reminder)
• Sum of squares (classification case, y_i = ±1):
  R = Σ_i (f(x_i) - y_i)² = Σ_i (1 - y_i f(x_i))²
• Add a "regularizer":
  R = Σ_i (1 - y_i f(x_i))² + λ ||w||²
• Compare with the SVC:
  R = Σ_i max(0, 1 - y_i f(x_i))^β + λ ||w||²

Structural Risk Minimization (Vapnik, 1984)
• Nested subsets of models of increasing complexity/capacity:
  S_1 ⊂ S_2 ⊂ … ⊂ S_N
• Example: rank with ||w||²:
  S_k = {w | ||w||² < A_k},  A_1 < A_2 < … < A_k
• Minimization under constraint:
  min R_emp[f]  s.t.  ||w||² < A_k
• Lagrangian:
  R_reg[f] = R_emp[f] + λ ||w||²
• Radius-margin bound (Vapnik-Chapelle, 2000):
  LOO_CV ≤ 4 r² ||w||²
[Figure: risk R versus capacity.]
Loss Functions
L(y, f(x)) as a function of z = y f(x):
• SVC loss, β = 2:  max(0, 1-z)²
• SVC loss, β = 1:  max(0, 1-z)
• Adaboost loss:  e^(-z)
• Logistic loss:  log(1 + e^(-z))
• Perceptron loss:  max(0, -z)
• Square loss:  (1-z)²
• 0/1 loss
[Figure: the losses plotted against z = y f(x); z < 0 means misclassified, z > 0 well classified; the decision boundary is at z = 0 and the margin at z = 1.]
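For reference (added here, not in the slides), the same losses written in numpy as functions of z = y f(x):

    import numpy as np

    z = np.linspace(-1.0, 2.0, 301)               # z = y * f(x)
    svc_loss_beta2  = np.maximum(0.0, 1 - z) ** 2 # SVC loss, beta = 2
    svc_loss_beta1  = np.maximum(0.0, 1 - z)      # SVC loss, beta = 1 (hinge)
    adaboost_loss   = np.exp(-z)                  # Adaboost loss
    logistic_loss   = np.log(1 + np.exp(-z))      # logistic loss
    perceptron_loss = np.maximum(0.0, -z)         # perceptron loss
    square_loss     = (1 - z) ** 2                # square loss
    zero_one_loss   = (z < 0).astype(float)       # 0/1 loss (misclassified when z < 0)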
Regularizers
• ||w||₂² = Σ_i w_i² :  2-norm regularization (ridge regression, original SVM).
• ||w||₁ = Σ_i |w_i| :  1-norm regularization (Lasso, Tibshirani 1996; 1-norm SVM, 1965).
• ||w||₀ = number of non-zero weights :  0-norm regularization (Weston et al., 2003).
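A one-line-per-norm numpy illustration of these regularizers (added for clarity, not from the slides):

    import numpy as np

    w = np.array([0.0, 2.0, -1.0, 0.0, 0.5])
    l2 = np.sum(w ** 2)        # ||w||_2^2 : 2-norm regularizer (ridge, original SVM)
    l1 = np.sum(np.abs(w))     # ||w||_1   : 1-norm regularizer (Lasso, 1-norm SVM)
    l0 = np.count_nonzero(w)   # ||w||_0   : number of non-zero weights (0-norm)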
Regression SVM
• Epsilon-insensitive loss: |y_i - f(x_i)|_ε
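Assuming the standard definition |r|_ε = max(0, |r| - ε), which the slide does not spell out, a short numpy sketch:

    import numpy as np

    def eps_insensitive_loss(y, f_x, eps=0.1):
        # |y - f(x)|_eps: zero inside the eps-tube, linear outside it
        return np.maximum(0.0, np.abs(y - f_x) - eps)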
Unsupervised Learning
SVMs for:
• density estimation: fit F(x) (Vapnik, 1998)
• finding the support of a distribution (one-class SVM) (Schoelkopf et al., 1999)
• novelty detection (Schoelkopf et al., 1999)
• clustering (Ben-Hur et al., 2001)
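As a pointer (not part of the course software), a one-class SVM for support estimation / novelty detection can be run with scikit-learn's OneClassSVM; the data and hyperparameters below are arbitrary:

    import numpy as np
    from sklearn.svm import OneClassSVM

    rng = np.random.RandomState(0)
    X_train = rng.randn(200, 2)                          # "normal" data only
    model = OneClassSVM(kernel="rbf", gamma=0.5, nu=0.05).fit(X_train)

    X_new = np.array([[0.1, -0.2], [6.0, 6.0]])
    print(model.predict(X_new))                          # +1 = in the support, -1 = novelty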
Summary
• For statistical model inference, two ingredients are needed:
  – A loss function: defines the residual error, i.e. what is not explained by the model; characterizes the data uncertainty or "noise".
  – A regularizer: defines our "prior knowledge" and biases the solution; characterizes our uncertainty about the model. We usually bet on simpler solutions (Ockham's razor).
Exercise Class
Filters: see chapter 3
Homework 7
1) Download the software for homework 7.
2) Taking inspiration from the examples, write a new feature-ranking filter object. Choose one from Chapter 3 or invent your own.
3) Provide the pvalue and FDR (using a tabulated distribution or the probe method).
4) Email a zip file with your object and a plot of the FDR to [email protected] with subject "Homework7" no later than: Tuesday December 13th.
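As a loose illustration of what such a filter computes (the homework expects a CLOP object, not this Python sketch; the signal-to-noise criterion below is one possible choice among the Chapter 3 filters):

    import numpy as np

    def s2n_ranking(X, y):
        # Signal-to-noise criterion per feature: |mu(+) - mu(-)| / (sigma(+) + sigma(-)),
        # for labels y in {-1, +1}; returns feature indices sorted from best to worst.
        Xp, Xm = X[y == 1], X[y == -1]
        score = np.abs(Xp.mean(axis=0) - Xm.mean(axis=0)) / (Xp.std(axis=0) + Xm.std(axis=0) + 1e-12)
        return np.argsort(-score), score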
Baseline Dexter
DEXTER is a text filtering task.

Size: 0.9 MB
Type: sparse integer
Features: 20000
Training examples: 300
Validation examples: 300
Test examples: 2000

> my_classif = svc({'coef0=1', 'degree=1', 'gamma=0', 'shrinkage=0.5'});
> my_model = chain({s2n('f_max=300'), normalize, my_classif})
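A rough scikit-learn analogue of this CLOP chain (an assumption, not the course code): s2n('f_max=300') is read as keeping the 300 top features by signal-to-noise, and the svc settings as a linear SVC; the C value is arbitrary and dense arrays are assumed:

    import numpy as np
    from sklearn.feature_selection import SelectKBest
    from sklearn.preprocessing import Normalizer
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import SVC

    def s2n_score(X, y):
        # signal-to-noise score per feature, labels assumed in {-1, +1}, X dense
        X = np.asarray(X)
        Xp, Xm = X[y == 1], X[y == -1]
        return np.abs(Xp.mean(axis=0) - Xm.mean(axis=0)) / (Xp.std(axis=0) + Xm.std(axis=0) + 1e-12)

    my_model = make_pipeline(
        SelectKBest(score_func=s2n_score, k=300),  # ~ s2n('f_max=300')
        Normalizer(),                              # ~ normalize
        SVC(kernel="linear", C=1.0),               # ~ svc with degree=1 (linear)
    )
    # my_model.fit(X_train, y_train); y_pred = my_model.predict(X_valid)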
DEXTER
• Results:
[Figure: histogram of BER (%) values over the entries, axis 0-50%.]

% feat | % probe | Train BER% | Valid BER% | Test BER% | Train AUC | Valid AUC | Test AUC
1.5    | 16.33   | 0.33       | 7          | 5         | 1         | 0.982     | 0.988

Best entries: BER ~ 3.3-3.9%, AUC ~ 0.97-0.99, Frac_feat ~ 1.5%, Frac_probe ~ 50%.
Evaluation of pval and FDR
Analytic vs. probe
• Ttest object:
  – computes pval analytically
  – FDR ~ pval * nsc/n
• probe object:
  – takes any feature ranking object as an argument (e.g. s2n, relief, Ttest)
  – pval ~ nsp/np
  – FDR ~ pval * nsc/n
FDR = fraction of features falsely found significant.
[Figure: FDR versus rank for Arcene, Dexter, Dorothea, Gisette and Madelon; red = analytic, blue = probe.]
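A hedged Python sketch of the probe idea (not the course's probe object): random probe features are ranked together with the real ones, the fraction of probes above each cutoff estimates the pvalue, and the FDR line uses the usual "expected false positives over number selected" estimate; how this maps onto the slide's nsc/n notation is my assumption.

    import numpy as np

    def probe_pval_fdr(X, y, score_func, n_probes=1000, rng=None):
        # score_func(X, y) must return one relevance score per feature column
        # (higher = better), e.g. a signal-to-noise or t-statistic criterion.
        rng = rng or np.random.RandomState(0)
        m, n = X.shape
        # Probes: fake features obtained by permuting randomly chosen real columns.
        cols = rng.randint(0, n, size=n_probes)
        probes = np.array([rng.permutation(X[:, c]) for c in cols]).T
        real_scores = score_func(X, y)
        probe_scores = score_func(probes, y)
        order = np.argsort(-real_scores)                 # real features ranked best first
        pval = np.empty(n)
        fdr = np.empty(n)
        for rank, j in enumerate(order, start=1):
            nsp = np.sum(probe_scores >= real_scores[j]) # probes scoring at least as well
            pval[j] = nsp / n_probes                     # pval ~ nsp / np (slide notation)
            fdr[j] = min(1.0, pval[j] * n / rank)        # expected false positives / number selected
        return order, pval, fdr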
Relief vs. Ttest
[Figure: Relief vs. Ttest comparison (dashed blue: Ttest with probes); x-axis: rank (number of features found significant).]