Maximum Margin Linear Classifier

Computational BioMedical Informatics
SCE 5095: Special Topics Course
Instructor: Jinbo Bi
Computer Science and Engineering Dept.
1
Course Information

Instructor: Dr. Jinbo Bi
– Office: ITEB 233
– Phone: 860-486-1458
– Email: [email protected]

– Web: http://www.engr.uconn.edu/~jinbo/
– Time: Mon / Wed. 2:00pm – 3:15pm
– Location: CAST 204
– Office hours: Mon. 3:30-4:30pm
HuskyCT
– http://learn.uconn.edu
– Login with your NetID and password
– Illustration
2
Review of last chapter

General introduction to the topics in medical
informatics and the data mining techniques involved
– Reviewed some basics of probability and statistics
– More slides on probability and linear algebra
  uploaded to HuskyCT

In this class, we start to discuss supervised learning:
classification and regression
3
Regression and classification
Both regression and classification problems are
typically supervised learning problems
 The main property of supervised learning
– Training example contains the input variables
and the corresponding target label
– The goal is to find a good mapping from the
input variables to the target variable

4
Classification: Definition

Given a collection of examples (the training set)
– Each example contains a set of variables
  (features) and a target variable: the class.

Find a model for the class attribute as a function
of the values of the other variables.

Goal: previously unseen examples should be
assigned a class as accurately as possible.
– A test set is used to determine the accuracy of the
  model. Usually, the given data set is divided into
  training and test sets, with the training set used to build
  the model and the test set used to validate it.
5
Classification Application 1
Fraud detection – goal: predict fraudulent cases in credit card transactions.

Training set (past transaction records, already labeled):

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Test set (current data, want to use the model to predict):

Refund  Marital Status  Taxable Income  Cheat
No      Single          75K             ?
Yes     Married         50K             ?
No      Married         150K            ?
Yes     Divorced        90K             ?
No      Single          40K             ?
No      Married         80K             ?

Training Set → Learn Classifier → Model → apply to Test Set
6
Classification: Application 2

Handwritten Digit Recognition

Goal: identify the digit of a handwritten number
– Approach:
  – Align all images to derive the features
  – Model the class (identity) based on these features
7
Illustrating Classification Task

Training Set:

Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

Test Set:

Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?

Training Set → Learning algorithm (Induction) → Learn Model → Model
Test Set → Apply Model (Deduction) → Class
8
Classification algorithms

K-Nearest-Neighbor classifiers
Naïve Bayes classifier
Neural Networks
Linear Discriminant Analysis (LDA)
Support Vector Machines (SVM)
Decision Tree
Logistic Regression
Graphical models
9
Regression: Definition

Goal: predict the value of one or more continuous
target attributes given the values of the input
attributes

Difference between classification and regression
lies only in the target attribute
– Classification: discrete or categorical target
– Regression: continuous target

Extensively studied in statistics and in the neural
network field.
10
Regression application 1

Goal: predict the possible loss from a customer.

Training set (past transaction records, already labeled):

Tid  Refund  Marital Status  Taxable Income  Loss
1    Yes     Single          125K            100
2    No      Married         100K            120
3    No      Single          70K             -200
4    Yes     Married         120K            -300
5    No      Divorced        95K             -400
6    No      Married         60K             -500
7    Yes     Divorced        220K            -190
8    No      Single          85K             300
9    No      Married         75K             -240
10   No      Single          90K             90

Test set (current data, want to use the model to predict):

Refund  Marital Status  Taxable Income  Loss
No      Single          75K             ?
Yes     Married         50K             ?
No      Married         150K            ?
Yes     Divorced        90K             ?
No      Single          40K             ?
No      Married         80K             ?

Training Set → Learn Regressor → Model → apply to Test Set
11
Regression applications

Examples:
– Predicting sales amounts of new product
based on advertising expenditure.
– Predicting wind velocities as a function of
temperature, humidity, air pressure, etc.
– Time series prediction of stock market indices.
12
Regression algorithms

Least squares methods
Regularized linear regression (ridge regression)
Neural networks
Support vector machines (SVM)
Bayesian linear regression
13
Practical issues in the training

– Underfitting
– Overfitting

Before introducing these important concepts, let us
study a simple regression algorithm – linear
regression
14
Least squares

We wish to use some real-valued input variables
x to predict the value of a target y
– We collect training data of pairs (xi, yi), i = 1,…,N
– Suppose we have a model f that maps each
  example x to a predicted value y'
– Sum-of-squares function: the sum of the squared
  deviations between the observed target value y
  and the predicted value y':

    \sum_{i=1}^{N} (y_i - y_i')^2 = \sum_{i=1}^{N} (y_i - f(x_i))^2
15
Least squares

Find a function f such that the sum of squares is
minimized:

    \min_f \sum_{i=1}^{N} (y_i - f(x_i))^2

For example, if the function is restricted to linear
functions f(x) = w^T x:

    \min_w \sum_{i=1}^{N} (y_i - w^T x_i)^2

Least squares with a function that is linear in the
parameters w is called “linear regression”
16
Linear regression

Linear regression has a closed-form solution for w:

    \min_w \sum_{i=1}^{N} (y_i - x_i^T w)^2 = (y - Xw)^T (y - Xw) = E(w)

The minimum is attained where the derivative is zero:

    \frac{\partial E(w)}{\partial w} = -2 X^T (y - Xw) = 0
    \quad\Rightarrow\quad w = (X^T X)^{-1} X^T y
17
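The closed-form solution above is easy to check numerically. A minimal MATLAB sketch (the data X, y and the variable names are illustrative, not part of the slides; an intercept can be handled by appending a column of ones to X):

% Closed-form least squares: w = (X'X)^(-1) X'y
N = 50; d = 3;
X = randn(N, d);                  % synthetic input matrix, one example per row
w_true = [1; -2; 0.5];
y = X * w_true + 0.1*randn(N, 1); % noisy targets
w = (X' * X) \ (X' * y);          % solve the normal equations
% equivalently, w = X \ y performs a numerically safer least-squares solve
yhat = X * w;                     % fitted values on the training inputs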
Polynomial Curve Fitting

– x is evenly distributed on [0,1]
– y = f(x) + random error
– y = sin(2πx) + ε,  ε ~ N(0, σ)
18
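A short MATLAB sketch of this data-generating setup and the polynomial fits examined on the next slides (the sample size N = 10 and the noise level sigma are assumptions for illustration; polyfit/polyval are standard MATLAB functions):

N = 10;
x = linspace(0, 1, N)';                 % evenly spaced inputs on [0,1]
sigma = 0.2;                            % assumed noise standard deviation
y = sin(2*pi*x) + sigma*randn(N, 1);    % noisy observations of sin(2*pi*x)
M = 3;                                  % polynomial order; try 0, 1, 3, 9 as on the slides
p = polyfit(x, y, M);                   % least-squares polynomial coefficients
xg = linspace(0, 1, 200)';
plot(x, y, 'o', xg, polyval(p, xg), '-', xg, sin(2*pi*xg), '--');
legend('data', 'fitted polynomial', 'sin(2\pi x)');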
Polynomial Curve Fitting
19
Sum-of-Squares Error Function
20
0th Order Polynomial
21
1st Order Polynomial
22
3rd Order Polynomial
23
9th Order Polynomial
24
Over-fitting
Root-Mean-Square (RMS) Error:
25
Polynomial Coefficients
26
Data Set Size:
9th Order Polynomial
27
Data Set Size:
9th Order Polynomial
28
Regularization

– Penalize large coefficient values
– Ridge regression
29
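A minimal sketch of ridge regression in MATLAB, assuming the penalized objective ||y - Xw||^2 + lambda ||w||^2, whose minimizer is w = (X'X + lambda I)^(-1) X'y (the data and the value of lambda are illustrative):

N = 30; d = 5;
X = randn(N, d);
y = X * randn(d, 1) + 0.1*randn(N, 1);
lambda = 0.1;                                   % regularization strength
w_ridge = (X'*X + lambda*eye(d)) \ (X'*y);      % penalized normal equations
w_ls    = (X'*X) \ (X'*y);                      % ordinary least squares, for comparison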
Regularization:
30
Regularization:
31
Regularization:
vs.
32
Polynomial Coefficients
33
Classification

Underfitting or overfitting can also happen in
classification approaches

We will illustrate these practical issues on a
classification problem

Before the illustration, we introduce a simple
classification technique – the K-nearest neighbor
method
34
K-nearest neighbor (K-NN)

K-NN is one of the simplest machine learning
algorithms

K-NN classifies a test example based on the closest
training examples in the feature space

An example is classified by a majority vote of its
neighbors

k is a positive integer, typically small. If k = 1,
then the example is simply assigned to the class
of its nearest neighbor.
35
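A minimal MATLAB sketch of this voting rule for a single test point, using Euclidean distance (the toy data and the names Xtrain, ytrain, xtest are illustrative):

% k-NN classification of one test example by majority vote
Xtrain = [0 0; 0 1; 1 0; 5 5; 5 6; 6 5];     % toy training features, one row per example
ytrain = [1; 1; 1; 2; 2; 2];                 % class labels
xtest  = [0.5 0.5];
k      = 3;
dists  = sqrt(sum((Xtrain - xtest).^2, 2));  % Euclidean distance to every training point
[~, idx] = sort(dists);                      % nearest first
label = mode(ytrain(idx(1:k)));              % majority vote among the k nearest neighbors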
K-NN
K=3
K=1
36
K-NN on real problem data

• Oil data set
• K acts as a smoother; choosing K is a model selection problem
• As the number of training examples N → ∞, the error rate of the
  1-nearest-neighbour classifier is never more than twice the optimal
  (Bayes) error obtained from the true conditional class distributions.
37
Limitation of K-NN

K-NN is a nonparametric model (no particular
functional form is fitted)

Nonparametric models require storing, and
computing with, the entire data set.

Parametric models, once fitted, are much more
efficient in terms of storage and computation.
38
Probabilistic interpretation of K-NN

Given a data set with Nk data points from class Ck
and Σk Nk = N, draw a sphere around the test point x
containing K points, Kk of which belong to class Ck.
If the sphere has volume V, we have

    p(x | Ck) = Kk / (Nk V)   and correspondingly   p(x) = K / (N V)

Since p(Ck) = Nk / N, Bayes’ theorem gives

    p(Ck | x) = p(x | Ck) p(Ck) / p(x) = Kk / K
39
Underfit and Overfit (Classification)

500 circular and 500
triangular data points.

Circular points:
0.5 ≤ sqrt(x1² + x2²) ≤ 1

Triangular points:
sqrt(x1² + x2²) > 1 or
sqrt(x1² + x2²) < 0.5

40

Underfit and Overfit (Classification)

500 circular and 500
triangular data points.

Circular points:
0.5 ≤ sqrt(x1² + x2²) ≤ 1

Triangular points:
sqrt(x1² + x2²) > 1 or
sqrt(x1² + x2²) < 0.5
41
Underfitting and Overfitting

[Plot: training and test error vs. number of iterations; the growing gap on the right illustrates overfitting]

Underfitting: when the model is too simple, both training and test errors are large
42
Overfitting due to Noise
Decision boundary is distorted by noise point
43
Overfitting due to Insufficient Examples

Lack of data points in the lower half of the diagram makes it difficult
to predict correctly the class labels in that region

– An insufficient number of training records in the region causes the
  classifier (here a neural net) to predict the test examples using other
  training records that are irrelevant to the classification task
44
Notes on Overfitting

Overfitting results in classifiers (a neural net, or a
support vector machine) that are more complex
than necessary

Training error no longer provides a good estimate
of how well the classifier will perform on
previously unseen records

Need new ways for estimating errors
45
Occam’s Razor

Given two models with similar generalization errors,
one should prefer the simpler model over the
more complex model

For a complex model, there is a greater chance
that it was fitted accidentally to errors in the data

Therefore, one should include model complexity
when evaluating a model
46
How to Address Overfitting

Minimizing training error no longer guarantees a
good model (a classifier or a regressor)

We need a better estimate of the error on the true
population – the generalization error
P_population( f(x) ≠ y )

In practice, design a procedure that gives a better
estimate of the error than the training error

In theoretical analysis, derive an analytical bound on
the generalization error, or use the Bayesian
formulation
47
Model Evaluation (pp. 295—304 of data mining)

Metrics for Performance Evaluation
– How to evaluate the performance of a model?

Methods for Performance Evaluation
– How to obtain reliable estimates?

Methods for Model Comparison
– How to compare the relative performance
among competing models?
48
Model Evaluation

Metrics for Performance Evaluation
– How to evaluate the performance of a model?

Methods for Performance Evaluation
– How to obtain reliable estimates?

Methods for Model Comparison
– How to compare the relative performance
among competing models?
49
Metrics for Performance Evaluation

Regression
– Sum of squares
– Sum of deviation
– Exponential function of the deviation
50
Metrics for Performance Evaluation

Focus on the predictive capability of a model
– Rather than how fast it classifies or builds
  models, scalability, etc.

Confusion Matrix:

                     PREDICTED CLASS
                     Class=Yes   Class=No
ACTUAL   Class=Yes   a           b
CLASS    Class=No    c           d

a: TP (true positive)
b: FN (false negative)
c: FP (false positive)
d: TN (true negative)
51
Metrics for Performance Evaluation…

                     PREDICTED CLASS
                     Class=Yes   Class=No
ACTUAL   Class=Yes   a (TP)      b (FN)
CLASS    Class=No    c (FP)      d (TN)

Most widely-used metric:

    Accuracy = (a + d) / (a + b + c + d)
             = (TP + TN) / (TP + TN + FP + FN)
52
Limitation of Accuracy

Consider a 2-class problem
– Number of Class 0 examples = 9990
– Number of Class 1 examples = 10

If model predicts everything to be class 0,
accuracy is 9990/10000 = 99.9 %
– Accuracy is misleading because model does
not detect any class 1 example
53
Cost Matrix

                     PREDICTED CLASS
C(i|j)               Class=Yes     Class=No
ACTUAL   Class=Yes   C(Yes|Yes)    C(No|Yes)
CLASS    Class=No    C(Yes|No)     C(No|No)

C(i|j): cost of misclassifying a class j example as class i
54
Computing Cost of Classification

Cost Matrix          PREDICTED CLASS
C(i|j)               +      -
ACTUAL    +          -1     100
CLASS     -          1      0

Model M1             PREDICTED CLASS
                     +      -
ACTUAL    +          150    40
CLASS     -          60     250

Accuracy = 80%
Cost = 3910

Model M2             PREDICTED CLASS
                     +      -
ACTUAL    +          250    45
CLASS     -          5      200

Accuracy = 90%
Cost = 4255
55
Cost vs Accuracy

Count                PREDICTED CLASS
                     Class=Yes   Class=No
ACTUAL   Class=Yes   a           b
CLASS    Class=No    c           d

N = a + b + c + d
Accuracy = (a + d) / N

Cost                 PREDICTED CLASS
                     Class=Yes   Class=No
ACTUAL   Class=Yes   p           q
CLASS    Class=No    q           p

Accuracy is proportional to cost if
1. C(Yes|No) = C(No|Yes) = q
2. C(Yes|Yes) = C(No|No) = p

Cost = p (a + d) + q (b + c)
     = p (a + d) + q (N – a – d)
     = q N – (q – p)(a + d)
     = N [q – (q – p) × Accuracy]
56
Cost-Sensitive Measures

    Precision (p) = a / (a + c)
    Recall (r)    = a / (a + b)

Count                PREDICTED CLASS
                     Class=Yes   Class=No
ACTUAL   Class=Yes   a           b
CLASS    Class=No    c           d

– Precision is biased towards C(Yes|Yes) & C(Yes|No)
– Recall is biased towards C(Yes|Yes) & C(No|Yes)
– A model that declares every record to be the positive class (b = d = 0)
  has high recall
– A model that assigns the positive class only to the (sure) test records
  (c is small) has high precision
57
Cost-Sensitive Measures (Cont’d)

    Precision (p) = a / (a + c)
    Recall (r)    = a / (a + b)
    F-measure (F) = 2rp / (r + p) = 2a / (2a + b + c)

Count                PREDICTED CLASS
                     Class=Yes   Class=No
ACTUAL   Class=Yes   a           b
CLASS    Class=No    c           d

F-measure is biased towards all cells except C(No|No)

    Weighted Accuracy = (w1 a + w4 d) / (w1 a + w2 b + w3 c + w4 d)
58
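These metrics are straightforward to compute from the four confusion-matrix counts; a small MATLAB sketch (the counts and weights are illustrative):

a = 150; b = 40; c = 60; d = 250;        % TP, FN, FP, TN counts
precision = a / (a + c);
recall    = a / (a + b);
F         = 2*recall*precision / (recall + precision);   % equals 2a / (2a + b + c)
w1 = 1; w2 = 1; w3 = 1; w4 = 1;          % with unit weights this reduces to accuracy
weighted_accuracy = (w1*a + w4*d) / (w1*a + w2*b + w3*c + w4*d);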
Model Evaluation

Metrics for Performance Evaluation
– How to evaluate the performance of a model?

Methods for Performance Evaluation
– How to obtain reliable estimates?

Methods for Model Comparison
– How to compare the relative performance
among competing models?
59
Methods for Performance Evaluation

How to obtain a reliable estimate of
performance?

Performance of a model may depend on other
factors besides the learning algorithm:
– Class distribution
– Cost of misclassification
– Size of training and test sets
60
Learning Curve

Learning curve shows
how accuracy changes
with varying sample size

Requires a sampling
schedule for creating
learning curve:

Arithmetic sampling
(Langley, et al)

Geometric sampling
(Provost et al)
Effect of small sample size:
-
Bias in the estimate
-
Variance of estimate
61
Methods of Estimation

Holdout
– Reserve 2/3 for training and 1/3 for testing
Random subsampling
– Repeated holdout
Cross validation
– Partition data into k disjoint subsets
– k-fold: train on k-1 partitions, test on the remaining one
– Leave-one-out: k = n
Stratified sampling
– oversampling vs undersampling
Bootstrap
– Sampling with replacement
62
Methods of Estimation (Cont’d)

Holdout method
– The given data is randomly partitioned into two independent sets:
  a training set (e.g., 2/3) for model construction and
  a test set (e.g., 1/3) for accuracy estimation
– Random subsampling: a variation of holdout
  Repeat holdout k times; accuracy = avg. of the accuracies obtained

Cross-validation (k-fold, where k = 10 is most popular)
– Randomly partition the data into k mutually exclusive subsets,
  each of approximately equal size
– At the i-th iteration, use Di as the test set and the others as the training set
– Leave-one-out: k folds where k = # of tuples, for small data sets
– Stratified cross-validation: folds are stratified so that the class dist. in
  each fold is approx. the same as that in the initial data
63
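A minimal sketch of k-fold cross-validation in MATLAB; a simple nearest-centroid rule stands in for whatever classifier is being evaluated, so the loop is self-contained (the data, k, and the classifier are illustrative assumptions):

% k-fold cross-validation with a toy nearest-centroid classifier
X = [randn(50,2); randn(50,2) + 2];        % two-class synthetic data
y = [ones(50,1); 2*ones(50,1)];
k = 10;
n = size(X, 1);
fold = mod(randperm(n)', k) + 1;           % randomly assign each example to one of k folds
acc = zeros(k, 1);
for i = 1:k
    te = (fold == i);  tr = ~te;           % i-th fold is the test set, the rest is training
    mu1 = mean(X(tr & y == 1, :), 1);
    mu2 = mean(X(tr & y == 2, :), 1);
    d1 = sum((X(te,:) - mu1).^2, 2);
    d2 = sum((X(te,:) - mu2).^2, 2);
    pred = 1 + (d2 < d1);                  % predict class 2 when closer to the class-2 centroid
    acc(i) = mean(pred == y(te));
end
cv_accuracy = mean(acc);                   % average accuracy over the k folds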
Methods of Estimation (Cont’d)

Bootstrap
– Works well with small data sets
– Samples the given training tuples uniformly with replacement
  i.e., each time a tuple is selected, it is equally likely to be
  selected again and re-added to the training set

There are several bootstrap methods; a common one is the .632 bootstrap
– Suppose we are given a data set of d examples. The data set is sampled d
  times, with replacement, resulting in a training set of d samples. The data
  points that did not make it into the training set form the test set.
  About 63.2% of the original data will end up in the bootstrap sample, and the
  remaining 36.8% will form the test set (since (1 – 1/d)^d ≈ e^-1 = 0.368)
– Repeat the sampling procedure k times; the overall accuracy of the model is

    acc(M) = \sum_{i=1}^{k} ( 0.632 × acc(M_i)_{test\_set} + 0.368 × acc(M_i)_{train\_set} )
64
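A sketch of the .632 bootstrap in MATLAB, reusing a toy nearest-centroid classifier so the loop runs on its own; the 0.632/0.368 weighting follows the slide, and the k repetitions are averaged here (the data and k are illustrative assumptions):

X = [randn(40,2); randn(40,2) + 2];
y = [ones(40,1); 2*ones(40,1)];
dN = size(X, 1);
k = 20;                                     % number of bootstrap repetitions
acc = zeros(k, 1);
for i = 1:k
    in  = randi(dN, dN, 1);                 % sample dN examples with replacement
    out = setdiff((1:dN)', in);             % out-of-bag examples (~36.8%) form the test set
    mu1 = mean(X(in(y(in) == 1), :), 1);
    mu2 = mean(X(in(y(in) == 2), :), 1);
    classify = @(Z) 1 + (sum((Z - mu2).^2, 2) < sum((Z - mu1).^2, 2));
    acc_test  = mean(classify(X(out,:)) == y(out));
    acc_train = mean(classify(X(in,:))  == y(in));
    acc(i) = 0.632*acc_test + 0.368*acc_train;   % the .632 weighting from the slide
end
overall_accuracy = mean(acc);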
Model Evaluation

Metrics for Performance Evaluation
– How to evaluate the performance of a model?

Methods for Performance Evaluation
– How to obtain reliable estimates?

Methods for Model Comparison
– How to compare the relative performance
among competing models?
65
ROC (Receiver Operating Characteristic)
Developed in 1950s for signal detection theory to
analyze noisy signals
– Characterize the trade-off between positive
hits and false alarms
 ROC curve plots TPR (on the y-axis) against FPR
(on the x-axis)
 Performance of each classifier represented as a
point on the ROC curve
 If the classifier returns a real-valued prediction,
– changing the threshold of algorithm, sample
distribution or cost matrix changes the location
of the point

66
ROC Curve

                     PREDICTED CLASS
                     Class=Yes   Class=No
ACTUAL   Class=Yes   a (TP)      b (FN)
CLASS    Class=No    c (FP)      d (TN)

TPR = TP/(TP+FN)
FPR = FP/(FP+TN)

At threshold t:
TP=50, FN=50, FP=12, TN=88
67
ROC Curve

                     PREDICTED CLASS
                     Class=Yes   Class=No
ACTUAL   Class=Yes   a (TP)      b (FN)
CLASS    Class=No    c (FP)      d (TN)

TPR = TP/(TP+FN)
FPR = FP/(FP+TN)

(TPR, FPR):
– (0,0): declare everything to be negative class
  TP = 0, FP = 0
– (1,1): declare everything to be positive class
  FN = 0, TN = 0
– (1,0): ideal
  FN = 0, FP = 0
68
ROC Curve

(TPR, FPR):
– (0,0): declare everything to be negative class
– (1,1): declare everything to be positive class
– (1,0): ideal

Diagonal line:
– Random guessing
– Below the diagonal line: prediction is opposite of the true class
69
How to Construct an ROC curve

Instance   P(+|A)   True Class
1          0.95     +
2          0.93     +
3          0.87     -
4          0.85     -
5          0.85     -
6          0.85     +
7          0.76     -
8          0.53     +
9          0.43     -
10         0.25     +

• Use a classifier that produces a posterior probability
  P(+|A) for each test instance A
• Sort the instances according to P(+|A) in decreasing order
• Apply a threshold at each unique value of P(+|A)
• Count the number of TP, FP, TN, FN at each threshold
• TP rate, TPR = TP/(TP+FN)
• FP rate, FPR = FP/(FP+TN)
70
How to Construct an ROC curve

Instance   P(+|A)   True Class
1          0.95     +
2          0.93     +
3          0.87     -
4          0.85     -
5          0.85     -
6          0.85     +
7          0.76     -
8          0.53     +
9          0.43     -
10         0.25     +

• Use a classifier that produces a posterior probability
  P(+|A) for each test instance A
• Sort the instances according to P(+|A) in decreasing order
• Pick a threshold, e.g. 0.85
  – p >= 0.85: predicted positive
  – p < 0.85:  predicted negative
• TP = 3, FP = 3, TN = 2, FN = 2
• TP rate, TPR = 3/5 = 60%
• FP rate, FPR = 3/5 = 60%
71
How to construct an ROC curve

Class            +     -     +     -     -     -     +     -     +     +
Threshold >=     0.25  0.43  0.53  0.76  0.85  0.85  0.85  0.87  0.93  0.95  1.00
TP               5     4     4     3     3     3     3     2     2     1     0
FP               5     5     4     4     3     2     1     1     0     0     0
TN               0     0     1     1     2     3     4     4     5     5     5
FN               0     1     1     2     2     2     2     3     3     4     5
TPR              1     0.8   0.8   0.6   0.6   0.6   0.6   0.4   0.4   0.2   0
FPR              1     1     0.8   0.8   0.6   0.4   0.2   0.2   0     0     0

ROC Curve: plot of (FPR, TPR) at each threshold
72
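A MATLAB sketch of this construction: sweep a threshold over the unique score values and count TP and FP at each step (the scores and labels are taken from the table above; the plot call is optional):

p = [0.95 0.93 0.87 0.85 0.85 0.85 0.76 0.53 0.43 0.25]';   % P(+|A) for the 10 instances
ytrue = [1 1 -1 -1 -1 1 -1 1 -1 1]';                         % true classes (+1 / -1)
thr = [unique(p); 1.00];                                     % unique thresholds (ascending), plus 1.00
P = sum(ytrue == 1);  N = sum(ytrue == -1);
TPR = zeros(size(thr));  FPR = zeros(size(thr));
for i = 1:numel(thr)
    pred = (p >= thr(i));                                    % predict positive at or above the threshold
    TPR(i) = sum(pred & ytrue == 1) / P;                     % TP / (TP + FN)
    FPR(i) = sum(pred & ytrue == -1) / N;                    % FP / (FP + TN)
end
plot(FPR, TPR, '-o'); xlabel('FPR'); ylabel('TPR');          % the ROC curve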
Using ROC for Model Comparison

No model consistently outperforms the other
– M1 is better for small FPR
– M2 is better for large FPR

Area Under the ROC curve (AUC)
– Ideal: area = 1
– Random guess: area = 0.5
73
Revisit K-Nearest Neighbor

K-NN:
– Instance-based algorithm: uses the k “closest” points
  (nearest neighbors) to perform classification
– k-NN classifiers are lazy learners (they do not
  build models explicitly)
– Classifying unknown examples is relatively more
  expensive than with model-learning algorithms (or
  parametric approaches)
74
Nearest Neighbor Classifiers

Basic idea:
– If it walks like a duck and quacks like a duck, then
  it’s probably a duck

[Diagram: test record → compute distance to training records → choose k of the “nearest” records]
75
Nearest-Neighbor Classifiers
Unknown record

Requires three things
– The set of stored examples
– Distance Metric to compute
distance between examples
– The value of k, the number of
nearest neighbors to retrieve

To classify an unknown record:
– Compute distance to other
training records
– Identify k nearest neighbors
– Use class labels of nearest
neighbors to determine the
class label of unknown record
(e.g., by taking majority vote)
76
Definition of Nearest Neighbor

[Figures: (a) 1-nearest neighbor, (b) 2-nearest neighbor, (c) 3-nearest neighbor]

The k-nearest neighbors of a record x are the data points
that have the k smallest distances to x
77
1 nearest-neighbor
Voronoi Diagram
78
Nearest Neighbor Classification

Compute the distance between two points:
– Euclidean distance

    d(p, q) = sqrt( Σ_i (p_i - q_i)² )

Determine the class from the nearest neighbor list
– take the majority vote of class labels among
  the k-nearest neighbors
– or weigh the vote according to distance,
  e.g. weight factor w = 1/d²
79
Nearest Neighbor Classification…

Choosing the value of k:
– If k is too small, the classifier is sensitive to noise points
– If k is too large, the neighborhood may include points from
  other classes
80
Nearest Neighbor Classification…

Scaling issues
– Attributes may have to be scaled to prevent
distance measures from being dominated by
one of the attributes
– Example:
height of a person may vary from 1.5m to 1.8m
 weight of a person may vary from 90lb to 300lb
 income of a person may vary from $10K to $1M

81
Nearest Neighbor Classification…

Problems with the Euclidean measure:
– High dimensional data
  – curse of dimensionality
  – one solution is to do dimension reduction first
– Can produce counter-intuitive results, e.g.

    111111111110  vs  011111111111    d = 1.4142
    100000000000  vs  000000000001    d = 1.4142

Solution: normalize the data
82
Data normalization

Example-wise normalization
– Each example is normalized
  and mapped to the unit sphere

Feature-wise normalization
– [0,1]-normalization: normalize each feature into
  a unit interval
– Standard normalization: normalize each feature to
  have mean 0 and standard deviation 1
83
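A small MATLAB sketch of these normalizations (the toy matrix is illustrative; each row is an example, each column a feature):

X = [170 65 30000; 180 90 250000; 160 50 12000];   % e.g. height, weight, income
% [0,1]-normalization per feature
Xmin = min(X, [], 1);  Xmax = max(X, [], 1);
X01  = (X - Xmin) ./ (Xmax - Xmin);
% standard normalization per feature: zero mean, unit standard deviation
Xstd = (X - mean(X, 1)) ./ std(X, 0, 1);
% example-wise normalization: map each row onto the unit sphere
Xsph = X ./ sqrt(sum(X.^2, 2));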
Classification

Training data is given.
– Each object is associated with a class label Y ∈ {1, 2,
  …, K} and a feature vector of d measurements: X =
  (X1, …, Xd).

Build a model from the training data.

Unseen objects are to be classified as belonging to one of
the predefined classes {1, 2, …, K}.

Linear Discriminant Analysis / Fisher’s linear discriminant
84
Two classes

[Scatter plot of Variable 1 vs. Variable 2 showing the two class means u1 and u2]
85
Three classes
86
Classifiers

Classifiers are built from a training set (TS)
L = (X1, Y1), ..., (Xn, Yn)

Classifier C built from a learning set L:
C: X → {1, 2, ..., K}

The Bayes classifier is based on the conditional densities p(Ck | X):
C(X) = arg maxk p(Ck | X)
This is a maximum a posteriori rule, and p(Ck | X) is the
posterior density
87
The Rules of Probability

Sum Rule
    p(X) = Σ_Y p(X, Y)

Product Rule
    p(X, Y) = p(X|Y) p(Y)

Bayes’ Rule
    p(Y = C | X = data) ∝ p(X | Y) p(Y)
    posterior ∝ likelihood × prior
    (the normalizer p(X) is irrelevant to the choice Y = C)
88
Maximum a posteriori

p(Ck | X) = p(X | Ck) p(Ck) / p(X)

Find a class label C(X) so that
    maxk p(Ck | X) = maxk p(X | Ck) p(Ck)

Naïve Bayes assumes independence among all
features (last class)
– p(X | Ck) = p(x1 | Ck) p(x2 | Ck) . . . p(xd | Ck)
– a very strong assumption
89
Multivariate normal dist for each class

Assume multivariate Gaussian (normal) class densities
    X | Y = k ~ N(μk, Σk)

Maximizing the posterior is equivalent to maximizing p(X|Ck) p(Ck), and
equivalent to maximizing the logarithm of p(X|Ck) p(Ck):

    C(X) = arg mink { (X - μk)' Σk^{-1} (X - μk) + log|Σk| - 2 log p(Ck) }
90
Two-class case

If p(X | C1) p(C1) ≥ p(X | C2) p(C2), then C(X) = C1;
otherwise C(X) = C2.

Equivalently, C(X) = C1 if

    p(X | C1) p(C1) / ( p(X | C2) p(C2) ) ≥ 1

    ⟺  p(X | C1) / p(X | C2)  ≥  p(C2) / p(C1)

    ⟺  log( p(X | C1) / p(X | C2) )  ≥  log( p(C2) / p(C1) )
91
Gaussian discriminant rule

For multivariate Gaussian (normal) class densities
X | Y = k ~ N(μk, Σk), the classification rule is

    C(X) = arg mink { (X - μk)' Σk^{-1} (X - μk) + log|Σk| }

– In general, this is a quadratic rule (Quadratic discriminant
  analysis, or QDA)
– In practice, the population mean vectors μk and covariance
  matrices Σk are estimated by the corresponding sample
  quantities
92
Sample mean and variance

Class mean

    μi = (1 / |Ci|) Σ_{x ∈ Ci} x

Class covariance

    Σi = (1 / |Ci|) Σ_{x ∈ Ci} (x - μi)(x - μi)^T
93
Example

    X1 = (1, 0, 1)^T,   X2 = (2, 1, 1)^T,   X3 = (0, 2, 1)^T

    μ = (X1 + X2 + X3) / 3 = (1, 1, 1)^T

    Σ = (1/3) [ (X1 - μ)(X1 - μ)^T + (X2 - μ)(X2 - μ)^T + (X3 - μ)(X3 - μ)^T ]

      = (1/3) [ (0, -1, 0)^T (0, -1, 0) + (1, 0, 0)^T (1, 0, 0) + (-1, 1, 0)^T (-1, 1, 0) ]

      = (1/3) [ [0 0 0; 0 1 0; 0 0 0] + [1 0 0; 0 0 0; 0 0 0] + [1 -1 0; -1 1 0; 0 0 0] ]

      = (1/3) [ 2 -1 0; -1 2 0; 0 0 0 ]
94
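The same numbers can be checked in MATLAB; note that cov(X) normalizes by 1/(n-1), so cov(X, 1) gives the 1/n class covariance used above:

X = [1 0 1; 2 1 1; 0 2 1];   % the three examples X1, X2, X3 as rows
mu = mean(X, 1)              % returns [1 1 1]
Sigma = cov(X, 1)            % returns [2 -1 0; -1 2 0; 0 0 0] / 3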
Two-class case

If the two classes have the same covariance matrix, Σk = Σ,
the discriminant rule is linear (Linear discriminant
analysis, or LDA; FLDA for K = 2 classes):

The quadratic rule becomes

    X^T Σ^{-1} (μ2 - μ1) ≥ c,   i.e.,   X^T w ≥ c   where   w = Σ^{-1} (μ2 - μ1)

Usually the pooled covariance is used:

    Σ = (1/n) (n1 Σ1 + n2 Σ2)
95
Illustration
μ1
μ2
96
Two-class case

Maximize the signal-to-noise ratio

    max_w  ( w^T Σ_between w ) / ( w^T Σ_within w )

    Between-class separation:  Σ_between = (μ2 - μ1)(μ2 - μ1)^T
    Within-class cohesion:     Σ_within  = (1/n) (n1 Σ1 + n2 Σ2)

The solution is

    w = Σ_within^{-1} (μ2 - μ1)
97
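A minimal MATLAB sketch of this two-class rule: pooled within-class covariance, direction w = Sigma_within^(-1) (mu2 - mu1), and a threshold at the midpoint of the projected class means (the midpoint assumes roughly equal priors; the data and names are illustrative):

X1 = randn(40, 2);                        % class 1 samples
X2 = randn(40, 2) + 3;                    % class 2 samples, shifted
n1 = size(X1, 1); n2 = size(X2, 1); n = n1 + n2;
mu1 = mean(X1, 1)';  mu2 = mean(X2, 1)';
Sw = (n1*cov(X1, 1) + n2*cov(X2, 1)) / n; % pooled within-class covariance
w = Sw \ (mu2 - mu1);                     % LDA direction
c = w' * (mu1 + mu2) / 2;                 % threshold: midpoint of the projected class means
xnew = [1.5 1.5]';
predicted_class = 1 + (w' * xnew > c)     % class 2 if the projection exceeds the threshold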
Two-class case (illustration)

LDA gives the yellow direction
Two classes
overlap
Two classes are
separated
98
Two-class case (illustration)
2- 1
1
2
LDA axis
Best Threshold
99
Multi-class case

Two approaches
– Apply Fisher LDA to each “one-versus-rest”
class
100
Multi-class case

Second approach:
Similarly, find multiple directions that form a low-dimensional
space.

The transformation matrix (denoted W) that projects the data to be
most separable maximizes

    max_W  | W^T Sb W | / | W^T Sw W |

A correct way to write it is

    max_W  trace( (W^T Sb W)(W^T Sw W)^{-1} )

    Between-class matrix:  Sb = (1/n) Σ_{k=1}^{K} nk (μk - μ)(μk - μ)^T
    Within-class matrix:   Sw = (1/n) Σ_{i=1}^{K} Σ_{x ∈ Ci} (x - μi)(x - μi)^T
101
Intuition

The goal is to simultaneously maximize the
between-class separation and minimize the within-class
cohesion

The solution to

    max_W  trace( (W^T Sb W)(W^T Sw W)^{-1} )

is given by the generalized eigenvalue problem of Sw^{-1} Sb:
the columns of W are the generalized eigenvectors obtained by solving

    Sb g = λ Sw g
102
Graphic view of the transformation (projection)

Training data (matrix):      A ∈ R^{n×d}
Transformation matrix:       W ∈ R^{d×(K-1)}
Reduced training data:       A_L = A W ∈ R^{n×(K-1)}
103
Graphical view of classification

Training data:               A ∈ R^{n×d},   reduced data  A_L = A G ∈ R^{n×(K-1)}
A test data point:           h ∈ R^{1×d},   projected     h_L = h G ∈ R^{1×(K-1)}
Transformation matrix:       G ∈ R^{d×(K-1)}

Classify h by finding the nearest neighbor (or nearest centroid)
in the reduced (K-1)-dimensional space
104
Summary
First applied by M. Barnard at the suggestion of R. A. Fisher (1936),
Fisher linear discriminant analysis (FLDA):

Dimension reduction
– Finds linear combinations of the features X=X1,...,Xd with
large ratios of between-groups to within-groups sums of
squares - discriminant variables;

Classification
– Predicts the class of an observation X by the class whose
mean vector is closest to X in terms of the discriminant
variables
105
We just introduced Fisher discriminant analysis,
particularly linear discriminant analysis
 Now let us discuss Support Vector Machine

106
History of SVM

– SVM is inspired by statistical learning theory [3].
– SVM was first introduced in 1992 [1].
– SVM became popular because of its success in
  handwritten digit recognition [2].
– SVM is now regarded as an important example of
  “kernel methods”, arguably the hottest area in
  machine learning. http://www.kernel-machines.org/

[1] B.E. Boser et al. A Training Algorithm for Optimal Margin Classifiers. Proceedings
of the Fifth Annual Workshop on Computational Learning Theory 5, 144-152,
Pittsburgh, 1992.
[2] L. Bottou et al. Comparison of classifier methods: a case study in handwritten
digit recognition. Proceedings of the 12th IAPR International Conference on Pattern
Recognition, vol. 2, pp. 77-82, 1994.
[3] V. Vapnik. The Nature of Statistical Learning Theory. 1st edition, Springer, 1996.
107
Support Vector Machines

Find a linear hyperplane (decision boundary) that will separate the data
108
Support Vector Machines
B1

One Possible Solution
109
Support Vector Machines
B2

Another possible solution
110
Support Vector Machines
B2

Other possible solutions
111
Support Vector Machines
B1
B2


Which one is better? B1 or B2?
How do you define better?
112
Support Vector Machines
B1
B2
b21
b22
margin
b11
b12

Find hyperplane maximizes the margin => B1 is better than B2
113
Support Vector Machines

    B1:  w · x + b = 0
    b11: w · x + b = +1
    b12: w · x + b = -1

    f(x) = +1  if  w · x + b ≥ 1
    f(x) = -1  if  w · x + b ≤ -1

    Margin = 2 / ||w||
114
Support Vector Machines

What if the problem is not linearly separable?
115
Nonlinear Support Vector Machines

What if decision boundary is not linear?
116
Nonlinear Support Vector Machines

Transform data into higher dimensional space
117
Outline of SVM lecture

Linear classifier

Maximum margin classifier
– Estimate the margin

SVM for separable data

SVM for non-separable data
118
Linear classifiers

denotes +1
denotes -1

f(x, w, b) = sign(w · x + b)

How would you classify this data?
119
Linear classifiers

denotes +1
denotes -1

f(x, w, b) = sign(w · x + b)

How would you classify this data?
120
Classifier Margin

denotes +1
denotes -1

f(x, w, b) = sign(w · x + b)

Define the margin of a linear classifier as the width that the
boundary could be increased by before hitting a datapoint.
121
Maximum Margin

denotes +1
denotes -1

f(x, w, b) = sign(w · x + b)

The maximum margin linear classifier is the linear classifier
with the maximum margin.

This is the simplest kind of SVM (called a Linear SVM, or LSVM)
122
Maximum Margin

denotes +1
denotes -1

f(x, w, b) = sign(w · x + b)

The maximum margin linear classifier is the linear classifier
with the maximum margin.

Support Vectors are those data points that the margin
pushes up against.

This is the simplest kind of SVM (called a Linear SVM, or LSVM)
123
Why Maximum Margin?

denotes +1
denotes -1

f(x, w, b) = sign(w · x + b)

Support Vectors are those datapoints that the margin pushes up against.

1. Intuitively this feels safest.
2. If we’ve made a small error in the location of the boundary, this gives
   us the least chance of causing a misclassification.
3. The model is immune to removal of any non-support-vector datapoints.
4. There’s some theory (using, um, VC dimension) that is related to (but not
   the same as) the proposition that this is a good thing.
5. Empirically it works very very well.
124
Estimate the Margin

denotes +1
denotes -1

The separating plane:  w · x + b = 0

What is the distance expression for a point x to the line w · x + b = 0?

    d(x) = |x · w + b| / ||w||_2 = |x · w + b| / sqrt( Σ_{i=1}^{d} w_i² )
125
Estimate the Margin

For a point x and a point y on the plane w · x + b = 0, the distance
is the projection of (y - x) onto the unit normal w / ||w||:

    ⟨ y - x, w/||w|| ⟩ = ( y · w - x · w ) / ||w||

Using y · w + b = 0, we have

    d = | -b - x · w | / ||w|| = |b + x · w| / ||w|| = |b + x · w| / sqrt( Σ_{i=1}^{d} w_i² )
126
Estimate the Margin

denotes +1
denotes -1

The separating plane:  w · x + b = 0

What is the expression for the margin?

    margin = min_{x ∈ D} d(x) = min_{x ∈ D} |x · w + b| / sqrt( Σ_{i=1}^{d} w_i² )
127
Maximize Margin

denotes +1
denotes -1

The separating plane:  w · x + b = 0

    argmax_{w,b} margin(w, b, D)
    = argmax_{w,b} min_{x_i ∈ D} d(x_i)
    = argmax_{w,b} min_{x_i ∈ D} |b + x_i · w| / sqrt( Σ_{i=1}^{d} w_i² )
128
Maximize Margin

denotes +1
denotes -1

    argmax_{w,b} min_{x_i ∈ D} |b + x_i · w| / sqrt( Σ_{i=1}^{d} w_i² )

    subject to  ∀ x_i ∈ D:  y_i (x_i · w + b) ≥ 0

This is a min-max problem
129
Maximize Margin

denotes +1
denotes -1

    argmax_{w,b} min_{x_i ∈ D} |b + x_i · w| / sqrt( Σ_{i=1}^{d} w_i² )
    subject to  ∀ x_i ∈ D:  y_i (x_i · w + b) ≥ 0

Strategy: rescale (w, b) so that ∀ x_i ∈ D: |b + x_i · w| ≥ 1; then maximizing
the margin is equivalent to

    argmin_{w,b}  Σ_{i=1}^{d} w_i²
    subject to  ∀ x_i ∈ D:  y_i (x_i · w + b) ≥ 1
130
Maximum Margin Linear Classifier

    {w*, b*} = argmax_{w,b}  2 / sqrt( Σ_{k=1}^{d} w_k² )

    subject to
        y_1 (w · x_1 + b) ≥ 1
        y_2 (w · x_2 + b) ≥ 1
        ....
        y_N (w · x_N + b) ≥ 1

How do we solve it?
131
Learning via Quadratic Programming

QP is a well-studied class of optimization
problems for maximizing (or minimizing) a quadratic function of
some real-valued variables subject to linear
constraints.

Available open-source solvers:
– SVMLight http://svmlight.joachims.org/
– LibSVM http://www.csie.ntu.edu.tw/~cjlin/libsvm/
– Matlab optimization toolbox
132
Quadratic Programming

Find

    argmax_u  c + d^T u + (u^T R u) / 2      (quadratic criterion)

subject to n linear inequality constraints

    a_11 u_1 + a_12 u_2 + ... + a_1m u_m ≤ b_1
    a_21 u_1 + a_22 u_2 + ... + a_2m u_m ≤ b_2
    :
    a_n1 u_1 + a_n2 u_2 + ... + a_nm u_m ≤ b_n

and subject to e additional linear equality constraints

    a_(n+1)1 u_1 + a_(n+1)2 u_2 + ... + a_(n+1)m u_m = b_(n+1)
    a_(n+2)1 u_1 + a_(n+2)2 u_2 + ... + a_(n+2)m u_m = b_(n+2)
    :
    a_(n+e)1 u_1 + a_(n+e)2 u_2 + ... + a_(n+e)m u_m = b_(n+e)
133
Quadratic Programming of SVM

    {w*, b*} = argmin_{w,b}  Σ_i w_i²
    subject to  y_i (w · x_i + b) ≥ 1  for all training data (x_i, y_i)

Written in the standard QP form (maximize a quadratic criterion):

    {w*, b*} = argmax_{w,b}  0 + 0 · w − w^T I w

    subject to the inequality constraints
        y_1 (w · x_1 + b) ≥ 1
        y_2 (w · x_2 + b) ≥ 1
        ....
        y_N (w · x_N + b) ≥ 1
134
Non-separable
This is going to be a problem!
What should we do?
denotes +1
denotes -1
135
Non-separable

This is going to be a problem!
What should we do?

denotes +1
denotes -1

Idea 1:
Find minimum ||w||², while
minimizing the number of
training set errors.
Problemette: two things
to minimize makes for an
ill-defined optimization

Separable case:
    argmin_{w,b}  Σ_{i=1}^{d} w_i²
    subject to  ∀ x_i ∈ D:  y_i (x_i · w + b) ≥ 1
136
Non-separable

This is going to be a problem!
What should we do?

denotes +1
denotes -1

Idea 1.1:
Minimize
    ||w||² + C (# train errors)
where C is a tradeoff parameter

Some points will violate  y_i (w · x_i + b) ≥ 1.
We allow errors to occur:
    y_i (w · x_i + b) ≥ 1 − ε_i,   ε_i ≥ 0      (hinge loss)
137
Non-separable

This is going to be a problem!
What should we do?

denotes +1
denotes -1

Idea 2.0:
Minimize
    ||w||² + C (distance of error points to their correct place)
    = ||w||² + C Σ_{i=1}^{N} ε_i

    y_i (w · x_i + b) ≥ 1 − ε_i,   ε_i ≥ 0
138
Linear inseparable case

    {w*, b*} = argmin_{w,b}  Σ_{i=1}^{d} w_i² + c Σ_{j=1}^{N} ε_j

    subject to
        y_1 (w · x_1 + b) ≥ 1 − ε_1,   ε_1 ≥ 0
        y_2 (w · x_2 + b) ≥ 1 − ε_2,   ε_2 ≥ 0
        ...
        y_N (w · x_N + b) ≥ 1 − ε_N,   ε_N ≥ 0

Balance the trade-off between the margin and the classification errors
139
Determining value for c

How do we determine the appropriate value for c?

Cross-validation on training data:
– Take possible choices for c
– For each choice,
  run a cross-validation procedure and
  calculate the error metric (chosen properly)
– Find the choice that achieves the best metric
– Use the best choice on all training data
140
A toy example on SVM (assignment 2)

Training data (10 points x1, ..., x10 in 2-D with labels y):

         feature 1   feature 2   y
x1       1.3162      0.8281      1
x2       1.1447      2.0391      2
x3       2.2966      1.9653      2
x4       2.3856      0.4878      1
x5       0.5606      0.3570      1
x6       1.4693      1.4951      2
x7       1.3368      2.8792      2
x8       1.9389      1.0212      1
x9       2.1281      1.7558      2
x10      2.2641      0.6714      1

[Scatter plot of the training data in the (x1, x2) plane]
141
Separable case

    argmin_{w,b}  Σ_{i=1}^{d} w_i²
    subject to  ∀ x_i ∈ D:  y_i (x_i · w + b) ≥ 1

Matlab scripts

[N, d] = size(X);                     % X is N-by-d, y is N-by-1 with labels +1/-1
% inequality constraints: diag(y) * [X ones] * [w; b] >= 1
A   = diag(y) * [X ones(N,1)];
Rhs = ones(N,1);
% quadratic objective: minimize (1/2)*u'*H*u over u = [w; b], penalizing only w
H = [eye(d) zeros(d,1)];
H = [H; zeros(1, d+1)];
f = zeros(d+1, 1);
% quadprog solves min 0.5*u'*H*u + f'*u s.t. A*u <= b, so negate the constraints
[sol, FVAL, EXITFLAG, OUTPUT] = quadprog(H, f, -A, -Rhs);
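As a usage note (an assumption about the intended workflow, not part of the assignment text): with the solution stored in sol as above, the first d entries are w and the last entry is b, and new points are classified by the sign of w · x + b.

w = sol(1:d);                 % learned weight vector
b = sol(d+1);                 % learned bias
ypred = sign(X * w + b);      % predicted labels (+1 / -1) for the training points
margin_width = 2 / norm(w);   % geometric margin achieved by the separating hyperplane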
142
Inseparable case

    min_{w,b,ε}  Σ_{i=1}^{d} w_i² + c Σ_{i=1}^{N} ε_i
    subject to  y_i (w · x_i + b) ≥ 1 − ε_i,   ε_i ≥ 0

Matlab scripts

[N, d] = size(X);                     % X is N-by-d, y is N-by-1 with labels +1/-1
% constraints: diag(y)*[X ones]*[w; b] + eps >= 1, over the variable u = [w; b; eps]
A   = [diag(y) * [X ones(N,1)] eye(N)];
Rhs = ones(N,1);
% objective: quadratic in w only, linear (weight c) in the slack variables
H = [eye(d) zeros(d, 1+N)];
H = [H; zeros(1+N, d+1+N)];           % pad H to a (d+1+N)-by-(d+1+N) matrix
f = [zeros(d+1, 1); c*ones(N,1)];
% bound constraints: w and b are free, the slacks are nonnegative
Lb = [-Inf * ones(d+1,1); zeros(N,1)];
[sol, FVAL, EXITFLAG, OUTPUT] = quadprog(H, f, -A, -Rhs, [], [], Lb);
143

Next couple of slides are backup slides (not
required in this class)
144
Support Vector Machine for Noisy Data

    ε_i > 1       ⇒  y_i (w · x_i + b) < 0, i.e., misclassification
    0 < ε_i ≤ 1   ⇒  x_i is correctly classified, but lies inside the margin
    ε_i = 0       ⇒  x_i is classified correctly, and lies outside the margin

    Σ_{i=1}^{k} ε_i  is an upper bound on the number of training errors.

[Figure: Class 1 and Class 2 points with the margin and the slack variables]
145
Support Vector Machine for Noisy Data

    {w*, b*} = argmin_{w,b}  Σ_i w_i² + c Σ_{j=1}^{N} ε_j

    subject to the inequality constraints
        y_1 (w · x_1 + b) ≥ 1 − ε_1,   ε_1 ≥ 0
        y_2 (w · x_2 + b) ≥ 1 − ε_2,   ε_2 ≥ 0
        ....
        y_N (w · x_N + b) ≥ 1 − ε_N,   ε_N ≥ 0

How do we determine the appropriate value for c?
– Cross-validation
146
Support Vector Machine for Noisy Data

General optimization problem:
    minimize f(w)   subject to  g_i(w) ≤ 0,  i = 1, …, k.

Define the Lagrangian
    L_p(w, α) = f(w) + Σ_{i=1}^{k} α_i g_i(w) = f(w) + α^T g(w)

Lagrangian dual problem:
    maximize  L_D(α) = inf_w L_p(w, α)   subject to  α ≥ 0

Weak duality theorem:  L_D(α) ≤ f(w);  the duality gap is f(w) − L_D(α)

Let w* be the minimizer of the Lagrangian with respect to w, and let
α* be the maximizer of the Lagrangian dual with respect to α.
If the constraints g are linear functions of w, then the duality gap is 0 and
    α_i* g_i(w*) = 0  for all i.
147
Support Vector Machine for Noisy Data

Karush-Kuhn-Tucker Conditions

    ∂L_p(w*, α*)/∂w = 0
    α_i* g_i(w*) = 0,  i = 1, …, k        (complementarity condition)
    g_i(w*) ≤ 0,  i = 1, …, k             (feasibility condition)
    α_i* ≥ 0,  i = 1, …, k
148
Support Vector Machine for Noisy Data

Use the Lagrangian formulation for the optimization problem.
Introduce a positive Lagrange multiplier for each inequality constraint:

    α_i  for   y_i (x_i · w + b) − 1 + ε_i ≥ 0,   for all i
    λ_i  for   ε_i ≥ 0,                           for all i

We get the following Lagrangian:

    L_p = ||w||² + c Σ_i ε_i − Σ_i α_i [ y_i (x_i · w + b) − 1 + ε_i ] − Σ_i λ_i ε_i
149
Support Vector Machine for Noisy Data

    L_p = ||w||² + c Σ_i ε_i − Σ_i α_i [ y_i (x_i · w + b) − 1 + ε_i ] − Σ_i λ_i ε_i

Take the derivatives of L_p with respect to w, b, and ε_i:

    ∂L_p/∂w   = 2w − Σ_i α_i y_i x_i = 0    ⇒   w = (1/2) Σ_i α_i y_i x_i
    ∂L_p/∂b   = − Σ_i α_i y_i = 0           ⇒   Σ_i α_i y_i = 0
    ∂L_p/∂ε_i = c − λ_i − α_i = 0           ⇒   c − λ_i = α_i

Substituting back gives the dual objective

    L_D = Σ_i α_i − (1/2) Σ_{i,j} α_i α_j y_i y_j (x_i · x_j),    0 ≤ α_i ≤ c  ∀i

Both ε_i and its multiplier λ_i are no longer involved in the function.
150
The Dual Form of QP

Maximize
    Σ_{k=1}^{R} α_k − (1/2) Σ_{k=1}^{R} Σ_{l=1}^{R} α_k α_l Q_kl,
    where Q_kl = y_k y_l (x_k · x_l)

Subject to these constraints:
    0 ≤ α_k ≤ c  for all k
    Σ_{k=1}^{R} α_k y_k = 0

Then define:
    w = (1/2) Σ_{k=1}^{R} α_k y_k x_k
151
The Dual Form of QP

Maximize
    Σ_{k=1}^{R} α_k − (1/2) Σ_{k=1}^{R} Σ_{l=1}^{R} α_k α_l Q_kl,
    where Q_kl = y_k y_l (x_k · x_l)

Subject to these constraints:
    0 ≤ α_k ≤ C  for all k
    Σ_{k=1}^{R} α_k y_k = 0

Then define:
    w = (1/2) Σ_{k=1}^{R} α_k y_k x_k

Then classify with:
    f(x, w, b) = sign(w · x + b)
152
An Equivalent QP

Maximize
    Σ_{k=1}^{R} α_k − (1/2) Σ_{k=1}^{R} Σ_{l=1}^{R} α_k α_l Q_kl,
    where Q_kl = y_k y_l (x_k · x_l)

Subject to these constraints:
    0 ≤ α_k ≤ c  for all k
    Σ_{k=1}^{R} α_k y_k = 0

Then define:
    w = (1/2) Σ_{k=1}^{R} α_k y_k x_k

Datapoints with α_k > 0 will be the support vectors,
so this sum only needs to be over the support vectors.
153
Support Vectors

denotes +1
denotes -1

The two margin planes:  w · x + b = +1  and  w · x + b = −1

Complementarity:
    ∀i:  α_i [ y_i (w · x_i + b) − 1 + ε_i ] = 0

    α_i = 0  for non-support vectors
    α_i > 0  for support vectors

    w = (1/2) Σ_{k=1}^{R} α_k y_k x_k

The decision boundary is determined only by the support vectors!
154
The Dual Form of QP

Maximize
    Σ_{k=1}^{R} α_k − (1/2) Σ_{k=1}^{R} Σ_{l=1}^{R} α_k α_l Q_kl,
    where Q_kl = y_k y_l (x_k · x_l)

Subject to these constraints:
    0 ≤ α_k ≤ c  for all k
    Σ_{k=1}^{R} α_k y_k = 0

Then define:
    w = (1/2) Σ_{k=1}^{R} α_k y_k x_k

Then classify with:
    f(x, w, b) = sign(w · x + b)

How to determine b?
155
An Equivalent QP: Determine b

Recall the primal problem:

    {w*, b*} = argmin_{w,b}  Σ_i w_i² + c Σ_{j=1}^{N} ε_j
    subject to  y_i (w · x_i + b) ≥ 1 − ε_i,   ε_i ≥ 0,   i = 1, …, N

Fix w at the value obtained from the dual; then b solves

    b* = argmin_{b, ε_i}  Σ_{j=1}^{N} ε_j
    subject to  y_i (w · x_i + b) ≥ 1 − ε_i,   ε_i ≥ 0,   i = 1, …, N

which is a linear programming problem!

Another approach is based on support vectors with 0 < α_i < c:

    α_i [ y_i (w · x_i + b) − 1 + ε_i ] = 0,   λ_i ε_i = 0
    λ_i = c − α_i > 0   ⇒   ε_i = 0
    ⇒   y_i (x_i · w + b) − 1 = 0
    ⇒   b = y_i − x_i · w    (using y_i ∈ {+1, −1})
156