Neural networks and support vector machines
Perceptron

[Figure: inputs x1, x2, x3, …, xD, each multiplied by a weight w1, w2, w3, …, wD, feed a single unit with output sgn(w⋅x + b)]

Can incorporate bias as a component of the weight
vector by always including a feature with
value set to 1

Loose inspiration: Human neurons
Linear separability
Perceptron training algorithm
•  Initialize weights
•  Cycle through training examples in multiple
passes (epochs)
•  For each training example:
–  If classified correctly, do nothing
–  If classified incorrectly, update weights
Perceptron update rule
•  For each training instance x with label y:
–  Classify with current weights: y’ = sgn(w⋅x)
–  Update weights: w ← w + α(y − y’)x
–  α is a learning rate that should decay as a
function of epoch t, e.g., 1000/(1000+t)
–  What happens if y’ is correct? Then y − y’ = 0 and the
weights are unchanged
–  Otherwise, consider what happens to individual
weights: wi ← wi + α(y − y’)xi
•  If y = 1 and y’ = −1, wi will be increased if xi is positive
or decreased if xi is negative → w⋅x will get bigger
•  If y = −1 and y’ = 1, wi will be decreased if xi is positive
or increased if xi is negative → w⋅x will get smaller
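As a sketch, the training algorithm and update rule above might look as follows in Python (the function name, data layout, and epoch count are illustrative assumptions; labels are ±1, and a bias feature of value 1 is assumed to be appended to each x):

```python
import random

def train_perceptron(examples, epochs=20):
    """Perceptron training: cycle through examples in multiple epochs,
    updating the weights only on misclassified examples.

    `examples` is a list of (x, y) pairs; x is a list of features
    (with a constant 1 appended to absorb the bias), y is +1 or -1.
    """
    dim = len(examples[0][0])
    w = [0.0] * dim                        # initialize weights to zero
    for t in range(epochs):
        alpha = 1000.0 / (1000.0 + t)      # learning rate decaying with epoch
        random.shuffle(examples)           # present examples in random order
        for x, y in examples:
            y_pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else -1
            if y_pred != y:                # w <- w + alpha * (y - y') * x
                for i in range(dim):
                    w[i] += alpha * (y - y_pred) * x[i]
    return w
```

On linearly separable data this loop finds a separating weight vector; on non-separable data the decaying learning rate dampens the oscillation between epochs.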
Convergence of perceptron
update rule
•  Linearly separable data: converges to a
perfect solution
•  Non-separable data: converges to a
minimum-error solution assuming learning
rate decays as O(1/t) and examples are
presented in random sequence
Implementation details
•  Bias (add feature dimension with value fixed to 1)
vs. no bias
•  Initialization of weights: all zeros vs. random
•  Learning rate decay function
•  Number of epochs (passes through the training
data)
•  Order of cycling through training examples
(random)
Multi-class perceptrons
•  One-vs-others framework: Need to keep a
weight vector wc for each class c
•  Decision rule: c = argmaxc wc⋅x
•  Update rule: suppose an example from class
c gets misclassified as c’
–  Update for c: wc ← wc + αx
–  Update for c’: wc’ ← wc’ − αx
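A sketch of one multi-class update in Python (the names are illustrative; `W` maps each class to its weight vector):

```python
def multiclass_update(W, x, y_true, alpha=1.0):
    """One multi-class perceptron step on example (x, y_true)."""
    # Decision rule: the predicted class maximizes w_c . x
    y_pred = max(W, key=lambda c: sum(wi * xi for wi, xi in zip(W[c], x)))
    if y_pred != y_true:
        for i, xi in enumerate(x):
            W[y_true][i] += alpha * xi   # pull the correct class toward x
            W[y_pred][i] -= alpha * xi   # push the wrongly predicted class away
    return y_pred
```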
Multi-class perceptrons
•  One-vs-others framework: Need to keep a
weight vector wc for each class c
•  Decision rule: c = argmaxc wc⋅x
•  Softmax: convert the class scores into probabilities:

P(c | x) = exp(wc⋅x) / Σk=1..C exp(wk⋅x)

[Figure: inputs feed C perceptrons with weights wc; the max (or softmax) over their outputs gives the prediction]
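The softmax above can be computed directly; subtracting the maximum score before exponentiating is a standard numerical-stability trick and does not change the result (function names are illustrative):

```python
import math

def softmax_probs(W, x):
    """P(c | x) = exp(w_c . x) / sum_k exp(w_k . x) for each class c."""
    scores = {c: sum(wi * xi for wi, xi in zip(w, x)) for c, w in W.items()}
    m = max(scores.values())                 # subtract max to avoid overflow
    exps = {c: math.exp(s - m) for c, s in scores.items()}
    Z = sum(exps.values())                   # normalizer over all C classes
    return {c: e / Z for c, e in exps.items()}
```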
Visualizing linear classifiers
Source: Andrej Karpathy, http://cs231n.github.io/linear-classify/
Differentiable perceptron

[Figure: inputs x1, x2, x3, …, xd with weights w1, w2, w3, …, wd feed a single unit with output σ(w⋅x + b)]

Sigmoid function: σ(t) = 1 / (1 + e−t)
Update rule for differentiable perceptron
•  Define total classification error or loss on the
training set:

E(w) = Σj=1..N (yj − fw(xj))²,   fw(xj) = σ(w⋅xj)

•  Update weights by gradient descent:

w ← w − α ∂E/∂w
Update rule for differentiable perceptron
•  Define total classification error or loss on the
training set:

E(w) = Σj=1..N (yj − fw(xj))²,   fw(xj) = σ(w⋅xj)

•  Update weights by gradient descent: w ← w − α ∂E/∂w

∂E/∂w = Σj=1..N [ −2(yj − f(xj)) σ′(w⋅xj) ∂(w⋅xj)/∂w ]
      = Σj=1..N [ −2(yj − f(xj)) σ(w⋅xj)(1 − σ(w⋅xj)) xj ]

•  For a single training point, the update is:

w ← w + α(y − f(x)) σ(w⋅x)(1 − σ(w⋅x)) x
Update rule for differentiable perceptron
•  For a single training point, the update is:

w ← w + α(y − f(x)) σ(w⋅x)(1 − σ(w⋅x)) x

–  This is called stochastic gradient descent
•  Compare with the update rule for the non-differentiable
perceptron:

w ← w + α(y − f(x)) x
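One stochastic gradient step for the differentiable perceptron, as a sketch (names are illustrative; the factor of 2 from the squared loss is absorbed into α, as in the slides):

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def sgd_step(w, x, y, alpha=0.1):
    """w <- w + alpha * (y - f(x)) * sigma(w.x) * (1 - sigma(w.x)) * x,
    where f(x) = sigma(w.x) and sigma'(t) = sigma(t)(1 - sigma(t))."""
    s = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
    scale = alpha * (y - s) * s * (1.0 - s)
    return [wi + scale * xi for wi, xi in zip(w, x)]
```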
Network with a hidden layer

[Figure: input layer connected to hidden units by weights wj, hidden units connected to the output by weights w′]

•  Can learn nonlinear functions (provided each perceptron is nonlinear
and differentiable):

f(x) = σ( Σj w′j σ( Σk wjk xk ) )
Network with a hidden layer
•  Hidden layer size and network capacity:
Source: http://cs231n.github.io/neural-networks-1/
Training of multi-layer networks
•  Find network weights to minimize the error between true and
estimated labels of training examples:

E(w) = Σj=1..N (yj − fw(xj))²

•  Update weights by gradient descent: w ← w − α ∂E/∂w
•  This requires perceptrons with a differentiable nonlinearity

Sigmoid: g(t) = 1 / (1 + e−t)
Rectified linear unit (ReLU): g(t) = max(0, t)
Training of multi-layer networks
•  Find network weights to minimize the error between true and
estimated labels of training examples:

E(w) = Σj=1..N (yj − fw(xj))²

•  Update weights by gradient descent: w ← w − α ∂E/∂w
•  Back-propagation: gradients are computed in the direction
from output to input layers and combined using the chain rule
•  Stochastic gradient descent: compute the weight update
w.r.t. one training example (or a small batch of examples) at
a time, cycle through training examples in random order in
multiple epochs
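A minimal back-propagation sketch for a one-hidden-layer sigmoid network trained by stochastic gradient descent on the squared error above (all names and hyperparameters are illustrative assumptions; a constant 1 feature is assumed appended to each input to serve as a bias):

```python
import math, random

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def train_two_layer(data, hidden=4, alpha=0.5, epochs=2000, seed=0):
    """Train a one-hidden-layer sigmoid network with backprop + SGD.

    Minimizes sum_j (y_j - f(x_j))^2; `data` holds (x, y) pairs with
    y in [0, 1]. Returns a predict function over the trained weights.
    """
    rng = random.Random(seed)
    d = len(data[0][0])
    # hidden-layer weights w_jk and output weights w'_j, randomly initialized
    W1 = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(hidden)]
    W2 = [rng.uniform(-1, 1) for _ in range(hidden)]
    for _ in range(epochs):
        for x, y in data:
            # forward pass
            h = [sigmoid(sum(wk * xk for wk, xk in zip(row, x))) for row in W1]
            out = sigmoid(sum(wj * hj for wj, hj in zip(W2, h)))
            # output delta: derivative of (y - out)^2 through the sigmoid
            delta_out = (y - out) * out * (1.0 - out)
            # hidden deltas: back-propagate delta_out through W2 (chain rule)
            delta_h = [delta_out * W2[j] * h[j] * (1.0 - h[j])
                       for j in range(hidden)]
            for j in range(hidden):
                W2[j] += alpha * delta_out * h[j]
                for k in range(d):
                    W1[j][k] += alpha * delta_h[j] * x[k]
    def predict(x):
        h = [sigmoid(sum(wk * xk for wk, xk in zip(row, x))) for row in W1]
        return sigmoid(sum(wj * hj for wj, hj in zip(W2, h)))
    return predict
```

On a small nonlinear problem such as XOR, this typically drives the training error well below its initial value.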
Regularization
•  It is common to add a penalty on weight magnitudes to
the objective function:

E(f) = Σi=1..N (yi − f(xi))² + (λ/2) Σj wj²

–  This encourages the network to use all of its inputs “a little” rather
than a few inputs “a lot”
Source: http://cs231n.github.io/neural-networks-1/
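The penalty’s effect on gradient descent is simple: differentiating (λ/2)Σj wj² adds λwj to each weight’s gradient, so every step also shrinks the weights slightly (“weight decay”). A one-line sketch (name is illustrative):

```python
def l2_penalty_grad(w, lam):
    """Gradient of (lam/2) * sum_j w_j^2 with respect to w: lam * w_j."""
    return [lam * wj for wj in w]
```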
Multi-Layer Network Demo
http://playground.tensorflow.org/
Neural networks: Pros and cons
•  Pros
–  Flexible and general function approximation
framework
–  Can build extremely powerful models by adding more
layers
•  Cons
–  Hard to analyze theoretically (e.g., training is prone to
local optima)
–  Huge amounts of training data and computing
power are required to get good performance
–  The space of implementation choices is huge
(network architectures, parameters)
Support vector machines
•  When the data is linearly separable, there may
be more than one separator (hyperplane)
Which separator
is best?
Support vector machines
•  Find hyperplane that maximizes the margin
between the positive and negative examples
[Figure: maximum-margin separating hyperplane; the support vectors are the examples lying on the margin boundaries]
C. Burges, A Tutorial on Support Vector Machines for Pattern Recognition, Data Mining
and Knowledge Discovery, 1998
SVM parameter learning
•  Separable data:

min(w,b) (1/2)||w||²   subject to   yi(w⋅xi + b) ≥ 1

(maximize the margin while classifying all training data correctly)

•  Non-separable data:

min(w,b) (1/2)||w||² + C Σi=1..n max(0, 1 − yi(w⋅xi + b))

(maximize the margin while minimizing classification mistakes)
SVM parameter learning

min(w,b) (1/2)||w||² + C Σi=1..n max(0, 1 − yi(w⋅xi + b))

[Figure: separating hyperplane with margin boundaries at w⋅x + b = +1, 0, −1]
Demo: http://cs.stanford.edu/people/karpathy/svmjs/demo
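The non-separable objective can be minimized by subgradient descent on the hinge loss. A sketch of one step (the names are illustrative; applying the full regularizer on every example is a simplification of schedules such as Pegasos):

```python
def svm_subgradient_step(w, b, x, y, C=1.0, alpha=0.01):
    """One subgradient step on (1/2)||w||^2 + C * max(0, 1 - y(w.x + b))."""
    margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
    if margin < 1:
        # hinge term active: subgradient is w - C*y*x (and -C*y for b)
        w = [wi - alpha * (wi - C * y * xi) for wi, xi in zip(w, x)]
        b = b + alpha * C * y
    else:
        # margin satisfied: only the regularizer pulls w toward zero
        w = [wi - alpha * wi for wi in w]
    return w, b
```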
Nonlinear SVMs
•  General idea: the original input space can
always be mapped to some higher-dimensional
feature space where the training set is separable
Φ: x → φ(x)
Nonlinear SVMs
•  Linearly separable dataset in 1D:
[Figure: two classes of points on a line, separable by a threshold at 0]
•  Non-separable dataset in 1D:
[Figure: points on a line where one class surrounds the other, so no single threshold works]
•  We can map the data to a higher-dimensional space:
[Figure: mapping x → (x, x²) lifts the points onto a parabola, where they become linearly separable]
Slide credit: Andrew Moore
The kernel trick
•  General idea: the original input space can
always be mapped to some higher-dimensional
feature space where the training set is separable
•  The kernel trick: instead of explicitly computing
the lifting transformation φ(x), define a kernel
function K such that
K(x, y) = φ(x) ⋅ φ(y)
(to be valid, the kernel function must satisfy
Mercer’s condition)
The kernel trick
•  Linear SVM decision function:

w⋅x + b = Σi αi yi xi⋅x + b

(the learned weights αi are nonzero only for the support vectors xi)
C. Burges, A Tutorial on Support Vector Machines for Pattern Recognition, Data Mining
and Knowledge Discovery, 1998
The kernel trick
•  Linear SVM decision function:

w⋅x + b = Σi αi yi xi⋅x + b

•  Kernel SVM decision function:

Σi αi yi φ(xi)⋅φ(x) + b = Σi αi yi K(xi, x) + b

•  This gives a nonlinear decision boundary in the
original feature space

C. Burges, A Tutorial on Support Vector Machines for Pattern Recognition, Data Mining
and Knowledge Discovery, 1998

Polynomial kernel: K(x, y) = (c + x⋅y)^d
Gaussian kernel
•  Also known as the radial basis function (RBF)
kernel:

K(x, y) = exp(−(1/σ²) ||x − y||²)

[Figure: K(x, y) as a function of ||x − y||: a bell curve peaking at 1 when x = y]

[Figure: Gaussian-kernel SVM decision boundary in 2D with the support vectors (SVs) highlighted]
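The kernels above, and the kernel decision function from the previous slides, can be sketched directly (function names are illustrative):

```python
import math

def polynomial_kernel(x, y, c=1.0, d=2):
    """K(x, y) = (c + x . y)^d."""
    return (c + sum(xi * yi for xi, yi in zip(x, y))) ** d

def gaussian_kernel(x, y, sigma=1.0):
    """RBF kernel K(x, y) = exp(-(1/sigma^2) * ||x - y||^2)."""
    sq = sum((xi - yi) ** 2 for xi, yi in zip(x, y))
    return math.exp(-sq / sigma ** 2)

def kernel_decision(alphas, labels, svs, K, x, b=0.0):
    """Kernel SVM decision value: sum_i alpha_i y_i K(x_i, x) + b."""
    return sum(a * y * K(sv, x) for a, y, sv in zip(alphas, labels, svs)) + b
```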
SVMs: Pros and cons
•  Pros
–  Kernel-based framework is very powerful and flexible
–  Training is convex optimization; a globally optimal solution can
be found
–  Amenable to theoretical analysis
–  SVMs work very well in practice, even with very small
training sample sizes
•  Cons
–  No “direct” multi-class SVM; must combine two-class SVMs
(e.g., with one-vs-others)
–  High computation and memory cost (especially for nonlinear SVMs)
Neural networks vs. SVMs
(a.k.a. “deep” vs. “shallow” learning)