
STT592-002: Intro. to Statistical Learning
SUPPORT VECTOR MACHINES (SVM)
Chapter 09
Disclaimer: This PPT is adapted from IOM 530: Intro. to Statistical Learning
9.1 SUPPORT VECTOR CLASSIFIER
Separating Hyperplanes
• Two-class classification problem with two predictors, X1 and X2.
• Consider the “linearly separable” case: one can draw a straight line such that all
points on one side belong to the first class and all points on the other side belong
to the second class.
• A natural approach is then to find the straight line that gives the biggest
separation between the classes, i.e., the line from which the points are as far
away as possible (formalized below).
• This is the basic idea of a support vector classifier.
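As a quick formalization (standard ISL notation; a sketch, not part of the original slide): with two predictors the separating line is the set of points where a linear expression equals zero, and a point's class is given by the sign of that expression.

```latex
% Separating hyperplane with p = 2 predictors
\beta_0 + \beta_1 X_1 + \beta_2 X_2 = 0
% Classification rule for a point x^* = (x_1^*, x_2^*):
%   assign class +1 if \beta_0 + \beta_1 x_1^* + \beta_2 x_2^* > 0,
%   assign class -1 if \beta_0 + \beta_1 x_1^* + \beta_2 x_2^* < 0.
```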
Classification with a Separating Hyperplane
For a test observation x*, find the value of:
Q: what can the sign and magnitude of f(x*) tell us?
Magnitude of f(x∗): If f(x∗) is far from zero, then this means that x∗ lies far from
the hyperplane, and so we can be confident about our class assignment for
x∗. On the other hand, if f(x∗) is close to zero, then x∗ is located near the
hyperplane, and so we are less certain about the class assignment for x∗.
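In the standard ISL notation, the quantity in question is the following (a sketch of the formula, with the sign rule included for completeness):

```latex
f(x^*) = \beta_0 + \beta_1 x_1^* + \beta_2 x_2^* + \cdots + \beta_p x_p^*
% Sign:      f(x^*) > 0  =>  assign x^* to one class;  f(x^*) < 0  =>  the other class.
% Magnitude: |f(x^*)| measures how far x^* lies from the hyperplane,
%            and hence how confident we are in the class assignment.
```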
Maximal Margin Classifier (Optimal separating Hyperplane)
• Consider distance from each training observation to a given
separating hyperplane;
• Margin: smallest distance from observations to hyperplane.
• Maximal margin hyperplane is a separating hyperplane with largest
margin—that is, it is the hyperplane that has the farthest minimum
distance to the training observations.
• Maximal margin classifier: classify a test observation based on which
side of the maximal margin hyperplane it lies on.
Widest “slab” between the two classes
Support Vectors
• Right panel of the figure: three observations are known as support
vectors.
• They “support” the maximal margin hyperplane, in the sense that
if they moved slightly, the maximal margin hyperplane would
move as well.
• Interestingly, the maximal margin hyperplane depends directly on the
support vectors, but not on the other observations.
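A minimal sketch of this idea using scikit-learn (an assumption of this note, not part of the slides; the toy data and parameter choices are illustrative): after fitting a linear SVM on separable data, the fitted model exposes the support vectors, and only those few points determine the hyperplane.

```python
# Sketch: extract support vectors from a (near) hard-margin linear SVM (scikit-learn assumed).
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two well-separated Gaussian clouds in 2-D (illustrative toy data).
X = np.vstack([rng.normal(loc=-2, size=(20, 2)), rng.normal(loc=+2, size=(20, 2))])
y = np.array([-1] * 20 + [+1] * 20)

# A very large penalty approximates the maximal margin (hard-margin) classifier.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

print("support vectors:\n", clf.support_vectors_)   # the few points that define the margin
print("hyperplane: %.2f + %.2f*x1 + %.2f*x2 = 0"
      % (clf.intercept_[0], clf.coef_[0, 0], clf.coef_[0, 1]))
```

Refitting on only the support vectors would recover the same hyperplane, which is the sense in which the other observations do not matter.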
Widest “slab” between the two classes
Construction of the Maximal Margin Classifier
From the separating hyperplane to the maximal margin classifier (optimization sketched below).
M > 0: margin of the hyperplane.
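The optimization problem referred to here is, in standard ISL form (a reconstruction of ISL 9.9–9.11, treat as a sketch):

```latex
% Maximal margin classifier (ISL 9.9 - 9.11)
\max_{\beta_0,\beta_1,\ldots,\beta_p,\; M} \; M
\quad \text{subject to} \quad
\sum_{j=1}^{p} \beta_j^2 = 1,
\qquad
y_i\bigl(\beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip}\bigr) \ge M
\quad \text{for all } i = 1,\ldots,n.
```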
Motivation for Support Vector Classifiers
(Figure: effect of adding a single point to the training data.)
• The maximal margin hyperplane is extremely sensitive to a change in
a single point.
• We may be willing to pay a price (some misclassified training observations) in exchange for:
• Greater robustness to individual observations, and
• Better classification of most of the training observations.
Support Vector Classifiers (soft margin classifier)
• Allow some observations to be on the incorrect side of the margin,
or even the incorrect side of the hyperplane.
Support Vector Classifiers
• The maximal margin problem is relaxed; M > 0 is still the margin of the hyperplane.
• εi are slack variables:
• εi = 0: correct side of the margin;
• εi > 0: wrong side of the margin;
• εi > 1: wrong side of the hyperplane.
• C: tuning parameter, chosen via cross-validation.
• Support vectors: observations that lie on the margin or on the wrong side of the
margin/hyperplane.
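The formulation these annotations describe is, in standard ISL form (a reconstruction of ISL 9.12–9.15, treat as a sketch):

```latex
% Support vector classifier / soft margin (ISL 9.12 - 9.15)
\max_{\beta_0,\ldots,\beta_p,\;\epsilon_1,\ldots,\epsilon_n,\; M} \; M
\quad \text{subject to} \quad
\sum_{j=1}^{p} \beta_j^2 = 1,
\qquad
y_i\bigl(\beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip}\bigr) \ge M(1 - \epsilon_i),
\qquad
\epsilon_i \ge 0, \quad \sum_{i=1}^{n} \epsilon_i \le C .
```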
• C large: low variance (many observations are support vectors) but potentially high bias.
• C small: low bias but high variance.
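A sketch of choosing the tuning parameter by cross-validation with scikit-learn (assumed here, not part of the slides; data is synthetic). Note that scikit-learn's C is a per-violation penalty, so a large scikit-learn C corresponds to a small budget C in the ISL notation above.

```python
# Sketch: tune the SVM cost parameter by cross-validation (scikit-learn assumed).
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)

# Candidate costs; in scikit-learn, larger C penalizes margin violations more heavily.
grid = GridSearchCV(SVC(kernel="linear"),
                    param_grid={"C": [0.01, 0.1, 1, 10, 100]},
                    cv=5)
grid.fit(X, y)

print("best C:", grid.best_params_["C"])
best = grid.best_estimator_
print("fraction of support vectors:", len(best.support_) / len(y))
```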
It's Easiest To See With A Picture (Grand. Book)
• M is the minimum perpendicular
distance between the points and
the separating line.
• Find the line which maximizes M.
• This line is called the “optimal
separating hyperplane”.
• The classification of a point
depends on which side of the line
it falls on.
Non-Separable Example (Grand. Book)
• Let ξ*i represent the amount by
which the ith point is on the wrong side of
the margin (dashed line).
• Then we want to maximize M
subject to a constraint on the total amount of slack.
A Simulation Example With A Small Constant
• This is the simulation
example from chapter 1.
• The distance between the
dashed lines represents the
margin or 2M.
• The purple lines represent
the Bayes decision
boundaries.
• E.g., C = 10000: 62% of the observations are
support vectors.
The Same Example With A Larger Constant
• Using a larger constant allows for a greater
margin and creates a slightly different
classifier.
• Notice, however, that the decision boundary
must always be linear.
C = 10000: 62% are support vectors
C = 0.01: 85% are support vectors
9.2 SUPPORT VECTOR
MACHINE CLASSIFIER
Non-linear Separating Classes
• What about non-linear class boundaries?
• In practice, we may not be able to find a hyperplane that perfectly separates the
two classes.
• Then we find the plane that gives the best separation between the points that
are correctly classified, subject to the points on the wrong side of the
line not being off by too much.
• It is easier to see with a picture!
Classification with Non-linear Decision Boundaries
• Consider enlarging the feature space using functions of the
predictors, such as quadratic and cubic terms, in order to
address this non-linearity.
• Or use Kernel functions.
Left: SVM with a polynomial kernel of degree 3 fit to the non-linear data gives an appropriate decision rule.
Right: SVM with a radial kernel. In this example, either kernel is capable of capturing the decision boundary.
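A sketch contrasting the two routes just described, using scikit-learn (assumed; the data here is synthetic): explicitly enlarging the feature space with polynomial terms and fitting a linear support vector classifier, versus fitting an SVM with a polynomial or radial kernel directly.

```python
# Sketch: enlarged feature space vs. kernels (scikit-learn assumed; synthetic data).
from sklearn.datasets import make_moons
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)   # non-linear class boundary

# Route 1: add quadratic/cubic terms explicitly, then use a *linear* support vector classifier.
enlarged = make_pipeline(PolynomialFeatures(degree=3, include_bias=False),
                         SVC(kernel="linear", C=1.0)).fit(X, y)

# Route 2: keep the original features and let a kernel do the work.
poly_svm   = SVC(kernel="poly", degree=3, C=1.0).fit(X, y)
radial_svm = SVC(kernel="rbf", gamma=1.0, C=1.0).fit(X, y)

for name, model in [("enlarged + linear", enlarged),
                    ("polynomial kernel", poly_svm),
                    ("radial kernel", radial_svm)]:
    print(name, "training accuracy:", model.score(X, y))
```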
Non-Linear Classifier with Kernel functions
• The support vector classifier is fairly easy to think about.
However, because it only allows for a linear decision
boundary, it may not be all that powerful.
• Recall that in Chapter 3 we extended linear regression to non-
linear regression using basis functions, i.e.,
\( Y_i = \beta_0 + \beta_1 b_1(X_i) + \beta_2 b_2(X_i) + \cdots + \beta_p b_p(X_i) + \epsilon_i \)
A Basis Approach
• Conceptually, we take a similar approach with the support vector
classifier.
• The support vector classifier finds the optimal hyperplane in the space
spanned by X1, X2,…, Xp.
• Create transformations (or a basis) b1(x), b2(x), …, bM(x) and
find the optimal hyperplane in the transformed space spanned
by b1(X), b2(X), …, bM(X).
• This approach produces a linear plane in the transformed space
but a non-linear decision boundary in the original space.
• This is called the Support Vector Machine (SVM) Classifier.
SVM for classification
• Inner Products
For a test observation x*, find the value of:
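The representation being referred to is, in standard ISL notation (a sketch; the inner-product form is from ISL Ch. 9):

```latex
% Support vector classifier written with inner products
f(x^*) = \beta_0 + \sum_{i=1}^{n} \alpha_i \,\langle x^*, x_i \rangle ,
\qquad
\langle x, x_i \rangle = \sum_{j=1}^{p} x_j \, x_{ij}
% \alpha_i \ne 0 only for the support vectors, so the sum effectively runs
% over i \in S, the set of support-vector indices.
```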
SVM for classification
• Kernels: replace the inner product in the support vector classifier
with a generalization of the form K(xi, xi′):
• A kernel is a function that quantifies the similarity of two
observations.
• linear kernel:
• polynomial kernel of degree d:
• radial kernel:
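The kernels listed above have the standard ISL forms (a sketch of the missing formulas):

```latex
% Standard kernel choices (ISL Ch. 9)
\text{linear:} \quad K(x_i, x_{i'}) = \sum_{j=1}^{p} x_{ij} x_{i'j}
\text{polynomial of degree } d: \quad K(x_i, x_{i'}) = \Bigl(1 + \sum_{j=1}^{p} x_{ij} x_{i'j}\Bigr)^{d}
\text{radial:} \quad K(x_i, x_{i'}) = \exp\Bigl(-\gamma \sum_{j=1}^{p} (x_{ij} - x_{i'j})^{2}\Bigr)
```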
In Reality
• While conceptually the basis approach is how the support
vector machine works, some complicated math (which
I will spare you) means that we don't actually choose
b1(x), b2(x), …, bM(x).
• Instead, we choose a kernel function, which
takes the place of the basis.
• Common kernel functions include
• Linear
• Polynomial
• Radial Basis
• Sigmoid
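In scikit-learn (assumed here, not part of the slides), these four choices map directly onto the kernel argument of SVC; a minimal sketch on synthetic data:

```python
# Sketch: the four common kernels as they appear in scikit-learn's SVC.
from sklearn.datasets import make_moons
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_moons(n_samples=300, noise=0.25, random_state=1)

for kernel in ["linear", "poly", "rbf", "sigmoid"]:   # rbf = radial basis function
    scores = cross_val_score(SVC(kernel=kernel, degree=3, gamma="scale"), X, y, cv=5)
    print(f"{kernel:>8}: mean CV accuracy = {scores.mean():.3f}")
```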
Summary: Support Vector Machines
Support Vector Classifier
Support Vector Machines
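A sketch of the two decision functions being summarized (standard ISL notation):

```latex
% Support vector classifier (linear, inner products):
f(x) = \beta_0 + \sum_{i \in S} \alpha_i \,\langle x, x_i \rangle
% Support vector machine (general kernel K):
f(x) = \beta_0 + \sum_{i \in S} \alpha_i \, K(x, x_i)
% S = set of indices of the support vectors.
```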
Polynomial Kernel On Sim Data
• Using a polynomial kernel,
we now allow the SVM to
produce a non-linear
decision boundary.
• Notice that the test error
rate is a lot lower.
Radial Basis Kernel
• Using a Radial Basis
Function (RBF) Kernel you
get an even lower error rate.
More Than Two Predictors
• This idea works just as well with more than two predictors.
• For example, with three predictors you want to find the
plane that produces the largest separation between the
classes.
• With more than three dimensions it becomes hard to
visualize a plane, but it still exists. In general these are
called hyperplanes.
• One versus One: compare each pair of classes.
• One versus All: one of the K classes vs. all remaining K − 1
classes.
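A sketch of the two multi-class strategies with scikit-learn (assumed; the iris data is just a convenient placeholder): SVC handles more than two classes with a one-versus-one scheme internally, while OneVsRestClassifier gives the one-versus-all variant.

```python
# Sketch: multi-class SVMs via one-versus-one and one-versus-all (scikit-learn assumed).
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)          # K = 3 classes

# One versus One: SVC trains a binary SVM for every pair of classes (K*(K-1)/2 of them).
ovo = SVC(kernel="rbf", gamma="scale").fit(X, y)

# One versus All: one binary SVM per class, that class against the remaining K-1.
ova = OneVsRestClassifier(SVC(kernel="rbf", gamma="scale")).fit(X, y)

print("OvO training accuracy:", ovo.score(X, y))
print("OvA training accuracy:", ova.score(X, y))
```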
REVIEW: CHAP4
CHAP8: HOW TO DRAW ROC CURVE
LDA for Default
• Overall accuracy = 97.25%.
• Now the total number of mistakes is 252 + 23 = 275 (2.75%
misclassification error rate).
• But we mis-predicted 252/333 = 75.7% of defaulters.
• Examine the error rate with other thresholds: sensitivity and specificity.
E.g., Sensitivity = % of true defaulters that are identified = 24.3% (low).
Specificity = % of non-defaulters that are correctly identified = 99.8%.
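A sketch of how these quantities come out of a 2x2 confusion table, using the counts implied by the slide (the code itself is illustrative, not from the slides):

```python
# Sketch: sensitivity and specificity from the confusion counts quoted on the slide.
# Counts implied by the slide (Default data, LDA, threshold 0.5):
#   333 true defaulters, of whom 252 are missed; 9667 non-defaulters, of whom 23 are flagged.
TP = 333 - 252        # defaulters correctly identified
FN = 252              # defaulters missed
FP = 23               # non-defaulters incorrectly flagged
TN = 9667 - 23        # non-defaulters correctly identified

sensitivity = TP / (TP + FN)                    # = 81/333   ~ 24.3%
specificity = TN / (TN + FP)                    # = 9644/9667 ~ 99.8%
error_rate  = (FP + FN) / (TP + FN + FP + TN)   # = 275/10000 = 2.75%

print(f"sensitivity = {sensitivity:.1%}, specificity = {specificity:.1%}, error = {error_rate:.2%}")
```

Lowering the probability threshold (e.g., to 0.2, as on the next slide) trades specificity for sensitivity.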
Use 0.2 as Threshold for Default
• Now the total number of mistakes is 138 + 235 = 373 (3.73%
misclassification error rate).
• But we mis-predicted 138/333 = 41.4% of defaulters.
• Examine the error rate with other thresholds: sensitivity and specificity.
E.g., Sensitivity = % of true defaulters that are identified = 58.6% (higher).
Specificity = % of non-defaulters that are correctly identified = 97.6%.
Receiver Operating Characteristics (ROC) curve
• Overall performance of a classifier, summarized over all
possible thresholds, is given by the area under the (ROC)
curve (AUC).
• An ideal ROC curve will hug the top left corner, so the larger
the AUC, the better the classifier.
For this data the AUC is 0.95, which
is close to the maximum of one, so it
would be considered very good.
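A sketch of how an ROC curve and its AUC are typically computed with scikit-learn (assumed; the classifier and synthetic data here are placeholders for the Default/LDA example):

```python
# Sketch: ROC curve and AUC for a probabilistic classifier (scikit-learn assumed).
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

lda = LinearDiscriminantAnalysis().fit(X_tr, y_tr)
scores = lda.predict_proba(X_te)[:, 1]          # P(positive class) for each test observation

fpr, tpr, thresholds = roc_curve(y_te, scores)  # one (FPR, TPR) point per threshold
print("AUC =", roc_auc_score(y_te, scores))     # area under the ROC curve
```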
Receiver Operating Characteristics (ROC) curve
https://en.wikipedia.org/wiki/Sensitivity_and_specificity
Receiver Operating Characteristics (ROC) curve
E.g., in the Default data, “+” indicates an individual who defaults, and “−”
indicates one who does not.
To connect to the classical hypothesis testing literature, we think of “−” as the
null hypothesis and “+” as the alternative (non-null) hypothesis.
Receiver Operating Characteristics (ROC) curve
E.g., Sensitivity = % of true defaulters that are identified = 195/333 = 58.6%.
Specificity = % of non-defaulters that are correctly identified = 97.6%.
False positive rate = 1 − Specificity = 1 − 97.6% = 2.4% (or 235/9667).
True positive rate = Sensitivity = 58.6% = 195/333.
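In the usual 2x2 notation (TP, FP, TN, FN), the quantities above are (a sketch):

```latex
\text{Sensitivity (TPR)} = \frac{TP}{TP + FN}, \qquad
\text{Specificity} = \frac{TN}{TN + FP}, \qquad
\text{FPR} = 1 - \text{Specificity} = \frac{FP}{FP + TN}.
```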
An application to the Heart Disease data with the training set
An optimal classifier will hug the top left corner of the ROC plot.
An application to the Heart Disease data with the test data
An optimal classifier will hug the top left corner of the ROC plot.