Machine Learning
10-701/15-781, Spring 2010
Support Vector Machines
Eric Xing
Lecture 17, March 15, 2010
Reading: Chap. 6 & 7, C. Bishop's book, and listed papers
© Eric Xing @ CMU, 2006-2010

What is a good Decision Boundary?

- Consider a binary classification task with y = ±1 labels (not 0/1 as before).
- When the training examples are linearly separable, we can set the parameters of a linear classifier so that all the training examples are classified correctly.
- Many decision boundaries!
  - Generative classifiers
  - Logistic regression ...
- Are all decision boundaries equally good?

[Figure: two linearly separable point clouds, Class 1 and Class 2, with several candidate separating lines.]

What is a good Decision Boundary?

[Figure: the same two classes with several different decision boundaries drawn, all separating the training data.]

Not All Decision Boundaries Are Equal!

- Why might we end up with such boundaries?
  - Irregular distribution
  - Imbalanced training set sizes
  - Outliers

Classification and Margin

- Parameterizing the decision boundary
  - Let w denote a vector orthogonal to the decision boundary, and b denote a scalar "offset" term; then we can write the decision boundary as:
      wᵀx + b = 0

[Figure: Class 1 and Class 2 separated by the hyperplane wᵀx + b = 0, with distances d− and d+ from the closest points of each class to the boundary.]

Classification and Margin

- Parameterizing the decision boundary
  - Let w denote a vector orthogonal to the decision boundary, and b denote a scalar "offset" term; then we can write the decision boundary as:
      wᵀx + b = 0
- Margin
      wᵀx_i + b > +c   for all x_i in class 2
      wᵀx_i + b < −c   for all x_i in class 1
  Or more compactly:
      (wᵀx_i + b) y_i > c
  The margin between the closest points on the two sides of the boundary is
      m = d− + d+

[Figure: the same two classes, with the half-margins d− and d+ marked on either side of the boundary.]

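As a concrete illustration (my own sketch, not from the slides, with hypothetical toy data): the signed distance of a point x_i from the hyperplane is y_i (wᵀx_i + b) / ||w||, and the margin of a given separating hyperplane is the smallest such distance over the training set.

    import numpy as np

    def geometric_margin(w, b, X, y):
        """Smallest signed distance y_i (w^T x_i + b) / ||w|| over the data.

        Positive for every point iff (w, b) separates the two classes."""
        scores = X @ w + b                      # functional outputs w^T x_i + b
        return np.min(y * scores) / np.linalg.norm(w)

    # Toy illustration: two separable clusters in 2-D
    X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -1.0], [-3.0, -2.0]])
    y = np.array([+1, +1, -1, -1])
    w, b = np.array([1.0, 1.0]), 0.0            # one of many separating hyperplanes
    print(geometric_margin(w, b, X, y))         # margin of this particular boundary
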
Maximum Margin Classification

- The "minimum" permissible margin, attained by the closest points x_i* and x_j* on either side of the boundary, is:
      m = wᵀ(x_i* − x_j*) / ||w|| = 2c / ||w||
- Here is our Maximum Margin Classification problem:
      max_w   2c / ||w||
      s.t.    y_i (wᵀx_i + b) ≥ c, ∀i

Maximum Margin Classification, cont'd.

- The optimization problem:
      max_{w,b}   c / ||w||
      s.t.        y_i (wᵀx_i + b) ≥ c, ∀i
- But note that the magnitude of c merely scales w and b, and does not change the classification boundary at all! (Why?)
- So we instead work on this cleaner problem:
      max_{w,b}   1 / ||w||
      s.t.        y_i (wᵀx_i + b) ≥ 1, ∀i
- The solution to this leads to the famous Support Vector Machines -- believed by many to be the best "off-the-shelf" supervised learning algorithm.

Support vector machine

- A convex quadratic programming problem with linear constraints:
      max_{w,b}   1 / ||w||
      s.t.        y_i (wᵀx_i + b) ≥ 1, ∀i
- The attained margin is now given by m = d− + d+ = 2 / ||w|| (each half-margin equals 1/||w||).
- Only a few of the classification constraints are relevant ⇒ support vectors

- Constrained optimization
  - We can directly solve this using commercial quadratic programming (QP) code.
  - But we want to take a more careful look at Lagrange duality, and the solution of the above problem in its dual form.
      ⇒ deeper insight: support vectors, kernels ...
      ⇒ more efficient algorithms

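The slides defer to commercial QP packages; as a rough stand-in (my own sketch, not a commercial solver), the equivalent primal min ½||w||² s.t. y_i(wᵀx_i + b) ≥ 1, introduced a few slides later, can be handed to scipy's general-purpose SLSQP solver on toy data. Data and names are illustrative only.

    import numpy as np
    from scipy.optimize import minimize

    # Toy separable data (hypothetical)
    X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -1.0], [-3.0, -2.0]])
    y = np.array([+1.0, +1.0, -1.0, -1.0])

    def objective(theta):
        w = theta[:-1]
        return 0.5 * w @ w                   # 1/2 ||w||^2 (b = theta[-1] is unpenalized)

    constraints = [{'type': 'ineq',          # y_i (w^T x_i + b) - 1 >= 0
                    'fun': lambda theta, xi=xi, yi=yi: yi * (theta[:-1] @ xi + theta[-1]) - 1.0}
                   for xi, yi in zip(X, y)]

    res = minimize(objective, x0=np.zeros(X.shape[1] + 1), method='SLSQP',
                   constraints=constraints)
    w, b = res.x[:-1], res.x[-1]
    print(w, b, 2.0 / np.linalg.norm(w))     # the attained margin is 2 / ||w||
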
Digression to Lagrangian Duality

- The Primal Problem:
      min_w   f(w)
      s.t.    g_i(w) ≤ 0,  i = 1, ..., k
              h_i(w) = 0,  i = 1, ..., l

  The generalized Lagrangian:
      L(w, α, β) = f(w) + Σ_{i=1}^{k} α_i g_i(w) + Σ_{i=1}^{l} β_i h_i(w)
  the α's (α_i ≥ 0) and β's are called the Lagrangian multipliers.

  Lemma:
      max_{α,β: α_i ≥ 0} L(w, α, β) =  f(w)   if w satisfies the primal constraints
                                       ∞      otherwise

  A re-written Primal:
      min_w max_{α,β: α_i ≥ 0} L(w, α, β)

Lagrangian Duality, cont.

- Recall the Primal Problem:
      min_w max_{α,β: α_i ≥ 0} L(w, α, β)
- The Dual Problem:
      max_{α,β: α_i ≥ 0} min_w L(w, α, β)
- Theorem (weak duality):
      d* = max_{α,β: α_i ≥ 0} min_w L(w, α, β)  ≤  min_w max_{α,β: α_i ≥ 0} L(w, α, β) = p*
- Theorem (strong duality):
      If and only if there exists a saddle point of L(w, α, β), we have d* = p*.

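To make the inequality concrete, here is a small numerical sketch (my own example, not from the slides) for the convex problem min_w w² s.t. 1 − w ≤ 0, where the dual function can be written in closed form and strong duality holds.

    import numpy as np

    # Primal: min_w f(w) = w^2  s.t.  g(w) = 1 - w <= 0   =>   p* = 1 at w* = 1.
    # Lagrangian: L(w, a) = w^2 + a (1 - w); for fixed a >= 0 the inner min over w
    # is attained at w = a/2, giving the dual function q(a) = a - a^2 / 4.
    alphas = np.linspace(0.0, 4.0, 401)
    dual_values = alphas - alphas**2 / 4.0   # q(a) = min_w L(w, a)

    d_star = dual_values.max()               # best lower bound obtainable from the dual
    p_star = 1.0                             # primal optimum
    print(d_star, p_star)                    # weak duality: d* <= p*; here d* = p* = 1
    # (convex problem with a strictly feasible point -> saddle point exists -> strong duality)
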
A sketch of strong and weak duality

- Now, ignoring h(x) for simplicity, let's look at what's happening graphically in the duality theorems:
      d* = max_{α_i ≥ 0} min_w [ f(w) + αᵀg(w) ]  ≤  min_w max_{α_i ≥ 0} [ f(w) + αᵀg(w) ] = p*

[Figure, built up over three slides: plots of f(w) and g(w) illustrating where the inner minimization and the outer maximization land, and when the two sides of the inequality coincide.]

The KKT conditions

- If there exists some saddle point of L, then the saddle point satisfies the following "Karush-Kuhn-Tucker" (KKT) conditions:
      ∂L(w, α, β)/∂w_i = 0,   i = 1, ..., n
      ∂L(w, α, β)/∂β_i = 0,   i = 1, ..., l
      α_i g_i(w) = 0,         i = 1, ..., k      (complementary slackness)
      g_i(w) ≤ 0,             i = 1, ..., k      (primal feasibility)
      α_i ≥ 0,                i = 1, ..., k      (dual feasibility)
- Theorem: If w*, α*, and β* satisfy the KKT conditions, then they are also a solution to the primal and the dual problems.

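Continuing the toy problem from the duality sketch above (min w² s.t. 1 − w ≤ 0; again my own illustration), the candidate pair w* = 1, α* = 2 can be checked against each KKT condition numerically:

    import numpy as np

    f  = lambda w: w**2                  # objective
    g  = lambda w: 1.0 - w               # inequality constraint, g(w) <= 0
    dL = lambda w, a: 2.0 * w - a        # d/dw [ f(w) + a g(w) ]

    w_star, a_star = 1.0, 2.0            # candidate saddle point

    checks = {
        "stationarity":            np.isclose(dL(w_star, a_star), 0.0),
        "complementary slackness": np.isclose(a_star * g(w_star), 0.0),
        "primal feasibility":      g(w_star) <= 0.0,
        "dual feasibility":        a_star >= 0.0,
    }
    print(checks)                        # all True -> (w*, a*) solves primal and dual
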
Solving the optimal margin classifier

- Recall our optimization problem:
      max_{w,b}   1 / ||w||
      s.t.        y_i (wᵀx_i + b) ≥ 1, ∀i
- This is equivalent to
      min_{w,b}   (1/2) wᵀw                                  (*)
      s.t.        1 − y_i (wᵀx_i + b) ≤ 0, ∀i
- Write the Lagrangian:
      L(w, b, α) = (1/2) wᵀw − Σ_{i=1}^{m} α_i [ y_i (wᵀx_i + b) − 1 ]
- Recall that (*) can be reformulated as  min_{w,b} max_{α_i ≥ 0} L(w, b, α).
  Now we solve its dual problem:  max_{α_i ≥ 0} min_{w,b} L(w, b, α).

The Dual Problem

      max_{α_i ≥ 0} min_{w,b} L(w, b, α)

- We minimize L with respect to w and b first:
      ∇_w L(w, b, α) = w − Σ_{i=1}^{m} α_i y_i x_i = 0        (*)
      ∇_b L(w, b, α) = Σ_{i=1}^{m} α_i y_i = 0                (**)
  Note that (*) implies:
      w = Σ_{i=1}^{m} α_i y_i x_i                             (***)
- Plugging (***) back into L, and using (**), we have:
      L(w, b, α) = Σ_{i=1}^{m} α_i − (1/2) Σ_{i,j=1}^{m} α_i α_j y_i y_j (x_iᵀ x_j)

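As a small implementation note (my own sketch, toy data hypothetical), the dual objective can be evaluated for any candidate α with one Gram-matrix expression; this is also the form used when a kernel later replaces x_iᵀx_j.

    import numpy as np

    def dual_objective(alpha, X, y):
        """J(alpha) = sum_i alpha_i - 1/2 sum_{i,j} alpha_i alpha_j y_i y_j (x_i . x_j)."""
        K = X @ X.T                              # Gram matrix of inner products x_i^T x_j
        Q = (y[:, None] * y[None, :]) * K        # Q_ij = y_i y_j x_i^T x_j
        return alpha.sum() - 0.5 * alpha @ Q @ alpha

    # toy check
    X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -1.0], [-3.0, -2.0]])
    y = np.array([+1.0, +1.0, -1.0, -1.0])
    print(dual_objective(np.full(4, 0.1), X, y))
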
The Dual problem, cont.

- Now we have the following dual optimization problem:
      max_α   J(α) = Σ_{i=1}^{m} α_i − (1/2) Σ_{i,j=1}^{m} α_i α_j y_i y_j (x_iᵀ x_j)
      s.t.    α_i ≥ 0,  i = 1, ..., m
              Σ_{i=1}^{m} α_i y_i = 0
- This is, (again,) a quadratic programming problem.
  - A global maximum over α can always be found.
- But what's the big deal??
- Note two things:
  1. w can be recovered by  w = Σ_{i=1}^{m} α_i y_i x_i        (see next ...)
  2. The "kernel"  x_iᵀ x_j                                    (more later ...)

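A hedged sketch of solving this dual on the same toy data, again with scipy's generic SLSQP solver standing in for a real QP code (data and names illustrative only): nonnegativity becomes a bound, the equality Σ α_i y_i = 0 an equality constraint, and w is then recovered from α.

    import numpy as np
    from scipy.optimize import minimize

    X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -1.0], [-3.0, -2.0]])  # toy data
    y = np.array([+1.0, +1.0, -1.0, -1.0])
    Q = (y[:, None] * y[None, :]) * (X @ X.T)

    neg_J = lambda a: 0.5 * a @ Q @ a - a.sum()           # minimize -J(alpha)
    res = minimize(neg_J, x0=np.zeros(len(y)), method='SLSQP',
                   bounds=[(0.0, None)] * len(y),         # alpha_i >= 0
                   constraints=[{'type': 'eq', 'fun': lambda a: a @ y}])  # sum_i alpha_i y_i = 0

    alpha = res.x
    w = (alpha * y) @ X                                   # w = sum_i alpha_i y_i x_i
    print(np.round(alpha, 3), w)
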
Support vectors

- Note the KKT condition --- only a few α_i's can be nonzero!!
      α_i g_i(w) = 0,  i = 1, ..., m
  Call the training data points whose α_i's are nonzero the support vectors (SV).

[Figure: separable data with the max-margin boundary; the points on the margin carry nonzero multipliers (α1 = 0.8, α6 = 1.4, α8 = 0.6), while all other points have α_i = 0.]

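Given a solved α (for example from the dual sketch above; all names here are illustrative), the support vectors are simply the points whose α_i exceeds numerical noise, and b can be recovered from any of them because they sit exactly on the margin, y_i (wᵀx_i + b) = 1.

    import numpy as np

    def support_vectors_and_bias(alpha, X, y, tol=1e-6):
        """Indices of support vectors and the offset b they imply."""
        sv = np.where(alpha > tol)[0]        # complementary slackness: alpha_i > 0
        w = (alpha * y) @ X                  # only SV terms actually contribute
        # each support vector satisfies y_i (w^T x_i + b) = 1  =>  b = y_i - w^T x_i
        b = np.mean(y[sv] - X[sv] @ w)       # average over SVs for numerical stability
        return sv, w, b

    # e.g. with alpha, X, y from the dual solve above:
    # sv, w, b = support_vectors_and_bias(alpha, X, y)
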
Support vector machines

- Once we have the Lagrange multipliers {α_i}, we can reconstruct the parameter vector w as a weighted combination of the training examples:
      w = Σ_{i ∈ SV} α_i y_i x_i
- For testing with a new data point z:
  - Compute
      wᵀz + b = Σ_{i ∈ SV} α_i y_i (x_iᵀ z) + b
    and classify z as class 1 if the sum is positive, and class 2 otherwise.
  - Note: w need not be formed explicitly.

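A minimal sketch of that test-time computation (my own, names illustrative): the score is accumulated over the support vectors only, through inner products x_iᵀz, without ever materializing w; swapping the inner product for a kernel function later requires no other change.

    import numpy as np

    def svm_predict(z, alpha, X, y, b, tol=1e-6):
        """Evaluate w^T z + b = sum_{i in SV} alpha_i y_i (x_i^T z) + b without forming w."""
        sv = alpha > tol                                    # nonzero multipliers only
        score = np.sum(alpha[sv] * y[sv] * (X[sv] @ z)) + b
        return np.sign(score)                               # +1 / -1 decision

    # usage sketch: svm_predict(np.array([1.0, 2.0]), alpha, X, y, b)
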
Interpretation of support vector machines

- The optimal w is a linear combination of a small number of data points. This "sparse" representation can be viewed as data compression, as in the construction of the kNN classifier.
- To compute the weights {α_i}, and to use support vector machines, we need to specify only the inner products (or kernel) between the examples: x_iᵀ x_j.
- We make decisions by comparing each new example z with only the support vectors:
      y* = sign( Σ_{i ∈ SV} α_i y_i (x_iᵀ z) + b )

Non-linearly Separable Problems

- We allow "errors" ξ_i in classification; each is based on the output of the discriminant function wᵀx + b.
- Σ_i ξ_i approximates the number of misclassified samples.

[Figure: two overlapping classes; margin-violating and misclassified points are marked with their slacks ξ_i.]

Soft Margin Hyperplane

- Now we have a slightly different optimization problem:
      min_{w,b}   (1/2) wᵀw + C Σ_{i=1}^{m} ξ_i
      s.t.        y_i (wᵀx_i + b) ≥ 1 − ξ_i, ∀i
                  ξ_i ≥ 0, ∀i
- ξ_i are "slack variables" in the optimization
- Note that ξ_i = 0 if there is no error for x_i
- Σ_i ξ_i is an upper bound on the number of training errors
- C: tradeoff parameter between error and margin

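At the optimum each slack is driven down to ξ_i = max(0, 1 − y_i(wᵀx_i + b)), so for any candidate (w, b) the soft-margin objective can be evaluated directly. A small sketch (my own, with illustrative names):

    import numpy as np

    def soft_margin_objective(w, b, X, y, C):
        """1/2 w^T w + C * sum_i xi_i, with xi_i = max(0, 1 - y_i (w^T x_i + b))."""
        xi = np.maximum(0.0, 1.0 - y * (X @ w + b))   # slacks: 0 for points beyond the margin
        return 0.5 * w @ w + C * xi.sum(), xi

    # points with xi_i > 1 are misclassified, so xi.sum() upper-bounds the training errors
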
The Optimization Problem

- The dual of this new constrained optimization problem is
      max_α   J(α) = Σ_{i=1}^{m} α_i − (1/2) Σ_{i,j=1}^{m} α_i α_j y_i y_j (x_iᵀ x_j)
      s.t.    0 ≤ α_i ≤ C,  i = 1, ..., m
              Σ_{i=1}^{m} α_i y_i = 0
- This is very similar to the optimization problem in the linearly separable case, except that there is now an upper bound C on each α_i.
- Once again, a QP solver can be used to find the α_i.

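As a practical aside (not from the slides), scikit-learn's SVC solves exactly this box-constrained dual and exposes the corresponding quantities; the example below is illustrative toy usage.

    import numpy as np
    from sklearn.svm import SVC

    X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -1.0], [-3.0, -2.0]])  # toy data
    y = np.array([1, 1, -1, -1])

    clf = SVC(kernel='linear', C=1.0).fit(X, y)   # C is the upper bound on each alpha_i
    print(clf.support_)                           # indices of the support vectors
    print(clf.dual_coef_)                         # y_i * alpha_i for each support vector
    print(clf.coef_, clf.intercept_)              # w and b, recoverable for the linear kernel
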
The SMO algorithm

- Consider solving the unconstrained optimization problem:
- We've already seen three optimization algorithms!
  - ?
  - ?
  - ?
- Coordinate ascent:

Coordinate ascent

[Figure: contour plot of an objective; coordinate ascent zig-zags toward the maximum by optimizing one coordinate at a time.]

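A minimal illustration of the idea (my own toy example): coordinate ascent on the concave function f(a1, a2) = −a1² − 2a2² + a1·a2 + 4a1 + 4a2 repeatedly maximizes over one coordinate while holding the other fixed.

    import numpy as np

    # Concave toy objective: f(a1, a2) = -a1^2 - 2*a2^2 + a1*a2 + 4*a1 + 4*a2
    f = lambda a1, a2: -a1**2 - 2*a2**2 + a1*a2 + 4*a1 + 4*a2

    a1, a2 = 0.0, 0.0
    for it in range(20):
        a1 = (a2 + 4.0) / 2.0      # argmax over a1 with a2 held fixed (set df/da1 = 0)
        a2 = (a1 + 4.0) / 4.0      # argmax over a2 with a1 held fixed (set df/da2 = 0)

    print(a1, a2, f(a1, a2))        # converges to the global maximizer (20/7, 12/7)
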
Sequential minimal optimization

- Constrained optimization:
      max_α   J(α) = Σ_{i=1}^{m} α_i − (1/2) Σ_{i,j=1}^{m} α_i α_j y_i y_j (x_iᵀ x_j)
      s.t.    0 ≤ α_i ≤ C,  i = 1, ..., m
              Σ_{i=1}^{m} α_i y_i = 0
- Question: can we do coordinate ascent along one direction at a time (i.e., hold all α_[-i] fixed, and update α_i)?
  - (Note: the equality constraint Σ_i α_i y_i = 0 pins down α_i once all the others are held fixed, so a single-coordinate update cannot move; this is why SMO updates a pair of α's at a time.)

The SMO algorithm

Repeat till convergence:
  1. Select some pair α_i and α_j to update next (using a heuristic that tries to pick the two that will allow us to make the biggest progress towards the global maximum).
  2. Re-optimize J(α) with respect to α_i and α_j, while holding all the other α_k's (k ≠ i, j) fixed.

Will this procedure converge?

Convergence of SMO

      max_α   J(α) = Σ_{i=1}^{m} α_i − (1/2) Σ_{i,j=1}^{m} α_i α_j y_i y_j (x_iᵀ x_j)
      s.t.    0 ≤ α_i ≤ C,  i = 1, ..., m
              Σ_{i=1}^{m} α_i y_i = 0
      KKT (at the optimum, with f(x) = Σ_j α_j y_j (x_jᵀ x) + b):
          α_i = 0      ⇒  y_i f(x_i) ≥ 1
          0 < α_i < C  ⇒  y_i f(x_i) = 1
          α_i = C      ⇒  y_i f(x_i) ≤ 1

- Let's hold α_3, ..., α_m fixed and re-optimize J w.r.t. α_1 and α_2.

Convergence of SMO

- The constraints: with α_3, ..., α_m fixed, the equality constraint forces α_1 y_1 + α_2 y_2 = ζ for some constant ζ, and each variable must stay in the box [0, C].
- The objective: substituting α_1 = (ζ − α_2 y_2) y_1 makes J a quadratic function of α_2 alone.
- Constrained opt: maximize that one-dimensional quadratic in closed form, then clip α_2 back to the feasible interval implied by the box constraints. (See the sketch below.)

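Below is a hedged sketch of that two-variable subproblem in the style of Platt's SMO (the "simplified SMO" variant): solve the 1-D quadratic in α_j analytically, clip to the box implied by the equality constraint, then update α_i and b to match. The random pair-selection driver and all names are illustrative, not the heuristic mentioned on the earlier slide.

    import numpy as np

    def smo_pair_update(i, j, alpha, b, K, y, C):
        """One SMO step: re-optimize J(alpha) over (alpha_i, alpha_j), all others fixed."""
        if i == j:
            return alpha, b
        f = lambda k: (alpha * y) @ K[:, k] + b            # current decision value on x_k
        E_i, E_j = f(i) - y[i], f(j) - y[j]                # prediction errors

        # Box [L, H] for alpha_j implied by 0 <= alpha <= C and alpha_i y_i + alpha_j y_j = const
        if y[i] != y[j]:
            L, H = max(0.0, alpha[j] - alpha[i]), min(C, C + alpha[j] - alpha[i])
        else:
            L, H = max(0.0, alpha[i] + alpha[j] - C), min(C, alpha[i] + alpha[j])
        eta = K[i, i] + K[j, j] - 2.0 * K[i, j]            # curvature of the 1-D quadratic
        if H <= L or eta <= 0.0:
            return alpha, b

        a_i_old, a_j_old = alpha[i], alpha[j]
        a_j = np.clip(a_j_old + y[j] * (E_i - E_j) / eta, L, H)   # unconstrained optimum, clipped
        a_i = a_i_old + y[i] * y[j] * (a_j_old - a_j)             # keep sum_i alpha_i y_i unchanged

        # Update b so the KKT conditions hold for the updated pair
        b1 = b - E_i - y[i] * (a_i - a_i_old) * K[i, i] - y[j] * (a_j - a_j_old) * K[i, j]
        b2 = b - E_j - y[i] * (a_i - a_i_old) * K[i, j] - y[j] * (a_j - a_j_old) * K[j, j]
        if 0 < a_i < C:
            b = b1
        elif 0 < a_j < C:
            b = b2
        else:
            b = (b1 + b2) / 2.0

        alpha = alpha.copy()
        alpha[i], alpha[j] = a_i, a_j
        return alpha, b

    # Minimal driver with a random pair heuristic (illustrative only)
    rng = np.random.default_rng(0)
    X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -1.0], [-3.0, -2.0]])
    y = np.array([1.0, 1.0, -1.0, -1.0])
    K = X @ X.T
    alpha, b = np.zeros(len(y)), 0.0
    for _ in range(200):
        i, j = rng.integers(len(y)), rng.integers(len(y))
        alpha, b = smo_pair_update(i, j, alpha, b, K, y, C=1.0)
    print(np.round(alpha, 3), b)
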
Cross-validation error of SVM

- The leave-one-out cross-validation error does not depend on the dimensionality of the feature space, but only on the number of support vectors!
      Leave-one-out CV error ≤ (# support vectors) / (# of training examples)

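Continuing the (illustrative) scikit-learn aside from the soft-margin slide, this bound is simply the fraction of training points that end up as support vectors:

    import numpy as np
    from sklearn.svm import SVC

    X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -1.0], [-3.0, -2.0]])  # toy data
    y = np.array([1, 1, -1, -1])

    clf = SVC(kernel='linear', C=1.0).fit(X, y)
    loo_bound = len(clf.support_) / len(X)     # (# support vectors) / (# training examples)
    print(loo_bound)                           # an upper bound on the leave-one-out CV error
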
Summary

- Max-margin decision boundary
- Constrained convex optimization
- Duality
- The KKT conditions and the support vectors
- Non-separable case and slack variables
- The SMO algorithm