Linear Discriminant Functions and SVM

Seong-Wook Joo
1. Fisher linear discriminant
(1) Motivation
• Want dimensionality reduction in a classification problem.
• PCA seeks directions (principal directions) that best represent the original data.
• Fisher linear discriminant analysis (FDA) seeks directions that are efficient for discrimination. (Figure omitted: which w is better for discrimination?)
(2) Projection onto one direction w, two-class problem
• Samples: n d-dimensional vectors x_1, …, x_n, consisting of two subsets D_1 and D_2
• Projected samples: y = w^t x, forming two subsets Y_1 and Y_2
• Criterion: maximize the Fisher linear discriminant

  J(w) = \frac{(\tilde{m}_1 - \tilde{m}_2)^2}{\tilde{s}_1^2 + \tilde{s}_2^2}

  where \tilde{m}_i is the mean of y ∈ Y_i and \tilde{s}_i^2 = \sum_{y \in Y_i} (y - \tilde{m}_i)^2 is the scatter of Y_i (i = 1, 2). In terms of the original samples,

  J(w) = \frac{w^t S_B w}{w^t S_W w}

  S_B = (m_1 - m_2)(m_1 - m_2)^t : between-class scatter matrix (m_i = mean of x ∈ D_i)
  S_W = S_1 + S_2 : within-class scatter matrix, where S_i = \sum_{x \in D_i} (x - m_i)(x - m_i)^t
• Solution (a numpy sketch follows below):

  w = S_W^{-1} (m_1 - m_2)
• Threshold (decision boundary): not given by FDA. Use "Bayesian decision theory" (minimize risk or error rate), or any general classifier.
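A minimal numpy sketch of the two-class Fisher direction, assuming X1 and X2 are arrays of shape (n1, d) and (n2, d) holding the samples of D_1 and D_2 (the array names and the pseudo-inverse fallback are illustrative choices, not from the notes):

    import numpy as np

    def fisher_direction(X1, X2):
        """Fisher discriminant direction w = S_W^{-1} (m_1 - m_2)."""
        m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
        # Within-class scatter S_W = S_1 + S_2, with S_i = sum_{x in D_i} (x - m_i)(x - m_i)^t
        S1 = (X1 - m1).T @ (X1 - m1)
        S2 = (X2 - m2).T @ (X2 - m2)
        SW = S1 + S2
        # pinv instead of inv in case S_W is singular (an implementation choice)
        return np.linalg.pinv(SW) @ (m1 - m2)

Projecting a sample x onto w gives the scalar y = w^t x; as noted above, the threshold on y is left to Bayesian decision theory or another classifier.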
(3) Projection onto multiple directions wi, c-class problem: Multiple Discriminant Analysis (MDA)
• Reduction to (c−1) dimensions: there can be at most c−1 solutions for w_i
• Projected samples: y_i = w_i^t x, i = 1, …, c−1, or in matrix notation y = W^t x
• Generalization of the scatter matrices:

  S_W = \sum_{i=1}^{c} S_i

  S_B = \sum_{i=1}^{c} n_i (m_i - m)(m_i - m)^t, which comes from the total (all-classes) scatter matrix (m = mean of all samples)
• Criterion: maximize the Fisher linear discriminant

  J(W) = \frac{|W^t S_B W|}{|W^t S_W W|}

  where |·| denotes the matrix determinant (we want a scalar measure of a matrix, and the determinant of the scatter, i.e., the product of its eigenvalues, is a measure of the scattering volume).
• Solution: solve the generalized eigenvector problem (sketched below)

  S_B w_i = \lambda_i S_W w_i

  Since S_B has rank at most c−1, there are at most c−1 nonzero eigenvalues λ_i and corresponding w_i.
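A sketch of the MDA directions, assuming the data arrive as a list of per-class numpy arrays (the name class_data and the use of scipy are my choices, not from the notes); it solves the generalized eigenproblem S_B w = λ S_W w and keeps the c−1 leading eigenvectors:

    import numpy as np
    from scipy.linalg import eigh

    def mda_directions(class_data):
        """class_data: list of (n_i, d) arrays, one per class. Returns W of shape (d, c-1)."""
        c = len(class_data)
        all_x = np.vstack(class_data)
        m = all_x.mean(axis=0)                       # total mean over all classes
        d = all_x.shape[1]
        SW = np.zeros((d, d))
        SB = np.zeros((d, d))
        for X in class_data:
            mi = X.mean(axis=0)
            SW += (X - mi).T @ (X - mi)              # within-class scatter
            SB += len(X) * np.outer(mi - m, mi - m)  # between-class scatter
        # Generalized symmetric eigenproblem S_B w = lambda S_W w (assumes S_W is positive definite)
        eigvals, eigvecs = eigh(SB, SW)
        order = np.argsort(eigvals)[::-1]            # largest eigenvalues first
        return eigvecs[:, order[:c - 1]]             # at most c-1 nonzero eigenvalues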
2. Linear discriminant functions (linear classifiers [1])
(1) Problem formulation
• Linear discriminant function for a two-class (ω1 or ω2) problem:

  g(x) = w^t x + w_0

  − The goal is to find w (the weight vector) and w_0 (the threshold, or bias)
  − so that if g(x) > 0 then x is ω1, and if g(x) < 0 then x is ω2.
  − The decision boundary is a hyperplane.
  − Note: the distance from x to the hyperplane is g(x)/||w||. The margin is the minimum of these distances.
• For a multi-category case, define a linear machine

  g_i(x) = w_i^t x + w_{i0},   i = 1, …, c

  and assign ωi if g_i(x) > g_j(x) for all j ≠ i.
• Generalized linear discriminant function:

  g(x) = a^t y + w_0,   y = φ(x)

  − Allows for a nonlinear decision boundary (e.g., quadratic features; a sketch follows below).
  − If y is of higher dimension, we can have multiply connected decision regions.
  − But beware of "the curse of dimensionality".
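For example, a sketch of one such mapping φ for 2-D inputs (the specific quadratic feature set is an illustrative choice):

    import numpy as np

    def phi(x):
        """Quadratic feature map: (x1, x2) -> (1, x1, x2, x1^2, x1*x2, x2^2)."""
        x1, x2 = x
        return np.array([1.0, x1, x2, x1 ** 2, x1 * x2, x2 ** 2])

    # A linear discriminant g(x) = a^t phi(x) in this 6-D space gives a quadratic
    # (conic-section) decision boundary in the original 2-D space.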
• Augmented feature vector:

  y = \begin{bmatrix} 1 \\ x \end{bmatrix},   a = \begin{bmatrix} w_0 \\ w \end{bmatrix},   g(x) = a^t y

  − Note: the distance from y to the hyperplane is g(y)/||a||.
• Normalization
  − Replace every y belonging to ω2 by −y.
  − The problem becomes finding a such that g(x) = a^t y > 0 for all y (see the code sketch below).
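A small numpy sketch of the augmentation and normalization steps, assuming X is an (n, d) array of samples and labels is an array of +1/−1 class labels (both names are illustrative):

    import numpy as np

    def augment_and_normalize(X, labels):
        """Prepend a 1 to each sample, then negate the samples of class omega_2 (label -1)."""
        Y = np.hstack([np.ones((X.shape[0], 1)), X])  # augmented vectors y = [1; x]
        Y[labels == -1] *= -1                          # "normalization": replace y in omega_2 by -y
        return Y

    # After this step, a weight vector a separates the classes iff Y @ a > 0 componentwise.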
(2) Criterion functions and algorithms
• Linearly separable case: Perceptron criterion
  − J_P(a) = \sum_{y \in Y} (-a^t y), where Y = {y's misclassified by a}
  − Algorithm: keep adding misclassified y to a (sketched below)
  − Finite convergence only if linearly separable; otherwise it does not terminate
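A minimal perceptron-style sketch over the normalized, augmented samples Y from the previous step (the learning rate and iteration cap are assumptions, not part of the notes):

    import numpy as np

    def perceptron(Y, eta=1.0, max_iter=1000):
        """Batch fixed-increment rule: add the misclassified (normalized) samples to a."""
        a = np.zeros(Y.shape[1])
        for _ in range(max_iter):                # cap iterations; may never terminate if not separable
            misclassified = Y[Y @ a <= 0]        # samples with a^t y <= 0
            if len(misclassified) == 0:
                return a                         # converged: data are linearly separable
            a = a + eta * misclassified.sum(axis=0)
        return a                                 # no convergence within the cap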
• Non-separable case: minimum squared error (MSE)
  − J_S(a) = ||Ya − b||^2 = \sum_{i=1}^{n} (a^t y_i − b_i)^2 for some given b (e.g., all ones)
  − Solution: a = (Y^t Y)^{-1} Y^t b (sketched below)
  − The resulting a is the same as the FDA solution if b = [n/n_1 … (n_1 times), n/n_2 … (n_2 times)]^t
  − In addition, this choice gives the threshold w_0 = −m^t w (m = mean of all samples)
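A least-squares sketch of the MSE criterion (using numpy's lstsq instead of forming the pseudo-inverse explicitly, an implementation choice rather than anything in the notes):

    import numpy as np

    def mse_solution(Y, b=None):
        """Minimize ||Y a - b||^2; b defaults to the all-ones margin vector."""
        if b is None:
            b = np.ones(Y.shape[0])
        # Equivalent to a = (Y^t Y)^{-1} Y^t b when Y^t Y is invertible
        a, *_ = np.linalg.lstsq(Y, b, rcond=None)
        return a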
• Ho-Kashyap procedure
  − Combination of MSE (both a and b are unknown) and the Perceptron (a sketch follows below)
  − Finite convergence if linearly separable; otherwise it provides proof of non-separability (but with no bound on the number of iterations needed)
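A sketch of the Ho-Kashyap iteration as I recall it from [1] (the learning rate, tolerance, and iteration cap are my assumptions): a is always the MSE solution for the current b, and b is only ever increased through the positive part of the error.

    import numpy as np

    def ho_kashyap(Y, eta=0.5, tol=1e-6, max_iter=10000):
        """Jointly adjust a and the margin vector b > 0."""
        b = np.ones(Y.shape[0])
        Y_pinv = np.linalg.pinv(Y)
        a = Y_pinv @ b
        for _ in range(max_iter):
            e = Y @ a - b                        # error vector
            e_plus = 0.5 * (e + np.abs(e))       # positive part of e
            if np.all(np.abs(e) < tol):
                return a, b, True                # Ya ~ b > 0: linearly separable
            if np.all(e <= tol) and np.any(e < -tol):
                return a, b, False               # e <= 0 but e != 0: evidence of non-separability
            b = b + 2.0 * eta * e_plus           # increase b only where a^t y exceeds it
            a = Y_pinv @ b
        return a, b, None                        # undecided within the iteration cap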
• Linear programming
  − Finite convergence in both the separable and the non-separable case (but the solution is useful only if separable?)
3. Support Vector Machines (SVM) [2]
(1) Capacity of a classifier
• Need to be accurate on the training samples but also generalize well to testing samples
• A classifier with too much capacity will not generalize well, e.g., a high-degree polynomial
• "VC dimension" is a measure of the capacity of a class of functions
  − Definition of VC dimension: the largest number of points that can be shattered (classified correctly under every possible labeling) by the class of functions
  − A line in R^2 has VC dim = 3; a hyperplane in R^n has VC dim = n+1
• VC bound
  − The expected risk (error) of a classifier is bounded by

    R ≤ R_emp + φ(h, …)

    where R_emp is the training error and h is the VC dimension
  − φ(h, …) increases monotonically with h
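For reference, the confidence term in [2] has the explicit form

  φ(h, l, η) = \sqrt{\frac{h(\log(2l/h) + 1) - \log(\eta/4)}{l}}

which holds with probability 1 − η over l training samples, so a lower VC dimension h tightens the bound for a fixed training error.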
(2) Linear SVM
• Training data: {(x_i, y_i) | i = 1, …, l}, x_i: training samples, y_i: labels (+1 or −1)
• The SVM looks for the separating hyperplane (w, b) that maximizes the margin
• Say (w, b) is scaled so that

  x_i · w + b ≥ +1 for y_i = +1
  x_i · w + b ≤ −1 for y_i = −1

  Then the margin is 2/||w||: the closest points on either side satisfy x_i · w + b = ±1, so each lies at distance 1/||w|| from the hyperplane. Therefore minimize ||w||^2 subject to the above constraints.
  − Solve for the Lagrange multipliers α_i ≥ 0 from the "dual" problem (details omitted)
  − w = \sum_i α_i y_i x_i
• The support vectors are the x_i's that satisfy the constraints with equality.
• The capacity decreases as the margin increases [2]
• Slack variables (ξ_i ≥ 0) can be added to allow for outliers:

  x_i · w + b ≥ +1 − ξ_i for y_i = +1
  x_i · w + b ≤ −1 + ξ_i for y_i = −1

  with some penalty: minimize ||w||^2 + C (\sum_i ξ_i) (a code sketch follows below)
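A quick soft-margin linear SVM sketch using scikit-learn (the library choice, the toy data, and C = 1.0 are all assumptions for illustration):

    import numpy as np
    from sklearn.svm import SVC

    # Toy 2-D data: two overlapping Gaussian blobs with labels -1 / +1
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(-1.5, 1.0, (50, 2)), rng.normal(+1.5, 1.0, (50, 2))])
    y = np.hstack([-np.ones(50), np.ones(50)])

    clf = SVC(kernel="linear", C=1.0)   # C weighs the slack penalty sum_i xi_i
    clf.fit(X, y)

    w = clf.coef_[0]                    # w = sum_i alpha_i y_i x_i
    b = clf.intercept_[0]
    print("bias b:", b)
    print("margin:", 2.0 / np.linalg.norm(w))
    print("number of support vectors:", len(clf.support_vectors_))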
(3) Non-linear SVM
• Use a non-linear mapping Φ that maps x into a higher (possibly infinite) dimensional space
• Note that x_i appears only in the form of dot products in the training phase
• If there exists a "kernel function" K such that K(x_i, x_j) = Φ(x_i) · Φ(x_j), then we do not need to know what Φ(x) is.
• We do not need to know what w is in the testing phase either, since

  Φ(x) · w + b = \sum_i α_i y_i (Φ(x) · Φ(x_i)) + b = \sum_i α_i y_i K(x, x_i) + b
• Examples of such kernels (sketched in code below):

  K(x, y) = (x · y + 1)^p   (polynomial)
  K(x, y) = \exp(-||x - y||^2 / 2σ^2)   (Gaussian)
  K(x, y) = \tanh(κ x · y - δ)   (sigmoid)
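A sketch of these three kernels and of the kernel form of the decision function \sum_i α_i y_i K(x, x_i) + b (the function names and parameter defaults are illustrative):

    import numpy as np

    def poly_kernel(x, z, p=3):
        return (np.dot(x, z) + 1.0) ** p

    def gaussian_kernel(x, z, sigma=1.0):
        return np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma ** 2))

    def sigmoid_kernel(x, z, kappa=1.0, delta=1.0):
        # not a valid (positive semi-definite) kernel for every choice of kappa, delta
        return np.tanh(kappa * np.dot(x, z) - delta)

    def decision_function(x, support_x, support_y, alphas, b, kernel):
        """f(x) = sum_i alpha_i y_i K(x, x_i) + b, summed over the support vectors."""
        return sum(a * y * kernel(x, xi)
                   for a, y, xi in zip(alphas, support_y, support_x)) + b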
(4) Notes
• SVM training always finds a global solution (neural network training does not)
• SVM performance depends on the choice of kernel; the best choice is not known
• Speed and size (for large datasets) are problems yet to be solved
References
[1] R. O. Duda, P. E. Hart, and D. G. Stork, "Pattern Classification," Chapter 5, Wiley, 2000.
[2] C. J. C. Burges, "A Tutorial on Support Vector Machines for Pattern Recognition," Data Mining and Knowledge Discovery, 2(2), 1998.
[3] B. Scholkopf and A. J. Smola, "Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond," Chapter 1, MIT Press, 2002.