Neural networks and support vector machines

Perceptron
• Inputs x_1, …, x_D with weights w_1, …, w_D; output: sgn(w⋅x + b)
• Can incorporate the bias as a component of the weight vector by always including a feature with value set to 1
• Loose inspiration: human neurons

Linear separability

Perceptron training algorithm
• Initialize weights
• Cycle through training examples in multiple passes (epochs)
• For each training example:
– If classified correctly, do nothing
– If classified incorrectly, update the weights

Perceptron update rule
• For each training instance x with label y:
– Classify with current weights: y′ = sgn(w⋅x)
– Update weights: w ← w + α(y − y′)x
– α is a learning rate that should decay as a function of epoch t, e.g., 1000/(1000 + t)
– If y′ is correct, nothing happens: y − y′ = 0, so the weights are unchanged
– Otherwise, consider what happens to the individual weights w_i ← w_i + α(y − y′)x_i:
• If y = 1 and y′ = −1, w_i is increased if x_i is positive and decreased if x_i is negative → w⋅x gets bigger
• If y = −1 and y′ = 1, w_i is decreased if x_i is positive and increased if x_i is negative → w⋅x gets smaller

Convergence of perceptron update rule
• Linearly separable data: converges to a perfect solution
• Non-separable data: converges to a minimum-error solution, assuming the learning rate decays as O(1/t) and examples are presented in random order

Implementation details
• Bias (add a feature dimension with value fixed to 1) vs. no bias
• Initialization of weights: all zeros vs. random
• Learning rate decay function
• Number of epochs (passes through the training data)
• Order of cycling through training examples (random)
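Putting the update rule and the implementation choices above together gives a very short training loop. The sketch below is mine, not from the slides; it assumes labels in {−1, +1}, zero initialization, the 1000/(1000 + t) decay mentioned above, and a random example order each epoch.

```python
import numpy as np

def train_perceptron(X, y, n_epochs=100):
    """Minimal perceptron training loop (sketch).

    X: (N, D) feature array; y: (N,) labels in {-1, +1}.
    The bias is folded into the weights by appending a constant-1 feature.
    """
    X = np.hstack([X, np.ones((X.shape[0], 1))])   # bias as an extra feature
    w = np.zeros(X.shape[1])                       # initialize weights to zero
    rng = np.random.default_rng(0)

    for t in range(n_epochs):
        alpha = 1000.0 / (1000.0 + t)              # decaying learning rate
        for i in rng.permutation(len(y)):          # random order each epoch
            y_pred = 1 if X[i] @ w >= 0 else -1    # classify with current weights
            if y_pred != y[i]:                     # update only on mistakes
                w += alpha * (y[i] - y_pred) * X[i]
    return w
```

On linearly separable data this loop finds a separating hyperplane; on non-separable data the decaying learning rate is what lets it settle near a minimum-error solution, as noted on the convergence slide.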
Multi-class perceptrons
• One-vs-others framework: keep a weight vector w_c for each class c
• Decision rule: c = argmax_c w_c⋅x
• Update rule: suppose an example from class c gets misclassified as c′
– Update for c: w_c ← w_c + αx
– Update for c′: w_c′ ← w_c′ − αx
• Softmax turns the C perceptron outputs into class probabilities:
P(c | x) = exp(w_c⋅x) / Σ_{k=1}^{C} exp(w_k⋅x)
(figure: the inputs feed C perceptrons with weights w_c; the class with the maximum response gives the decision)

Visualizing linear classifiers
Source: Andrej Karpathy, http://cs231n.github.io/linear-classify/

Differentiable perceptron
• Inputs x_1, …, x_d with weights w_1, …, w_d; output: σ(w⋅x + b)
• Sigmoid function: σ(t) = 1 / (1 + e^(−t))

Update rule for differentiable perceptron
• Define the total classification error, or loss, on the training set:
E(w) = Σ_{j=1}^{N} (y_j − f_w(x_j))²,  where f_w(x_j) = σ(w⋅x_j)
• Update weights by gradient descent: w ← w − α ∂E/∂w
• The gradient follows from the chain rule:
∂E/∂w = Σ_{j=1}^{N} −2(y_j − f(x_j)) σ′(w⋅x_j) ∂(w⋅x_j)/∂w
      = Σ_{j=1}^{N} −2(y_j − f(x_j)) σ(w⋅x_j)(1 − σ(w⋅x_j)) x_j
• For a single training point, the update is:
w ← w + α(y − f(x)) σ(w⋅x)(1 − σ(w⋅x)) x
– This is called stochastic gradient descent
• Compare with the update rule for the non-differentiable perceptron:
w ← w + α(y − f(x)) x

Network with a hidden layer
• Can learn nonlinear functions (provided each unit is nonlinear and differentiable):
f(x) = σ(Σ_j w′_j σ(Σ_k w_jk x_k))
(figure: inputs connect to hidden units through weights w_j, and hidden units connect to the output through weights w′)
• Hidden layer size and network capacity
Source: http://cs231n.github.io/neural-networks-1/

Training of multi-layer networks
• Find network weights that minimize the error between the true and estimated labels of the training examples:
E(w) = Σ_{j=1}^{N} (y_j − f_w(x_j))²
• Update weights by gradient descent: w ← w − α ∂E/∂w
• This requires units with a differentiable nonlinearity:
– Sigmoid: g(t) = 1 / (1 + e^(−t))
– Rectified linear unit (ReLU): g(t) = max(0, t)
• Back-propagation: gradients are computed in the direction from the output to the input layer and combined using the chain rule
• Stochastic gradient descent: compute the weight update w.r.t. one training example (or a small batch of examples) at a time; cycle through the training examples in random order over multiple epochs

Regularization
• It is common to add a penalty on weight magnitudes to the objective function:
E(f) = Σ_{i=1}^{N} (y_i − f(x_i))² + (λ/2) Σ_j w_j²
– This encourages the network to use all of its inputs "a little" rather than a few inputs "a lot"
Source: http://cs231n.github.io/neural-networks-1/
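To make the back-propagation, stochastic gradient descent, and regularization steps above concrete, here is a minimal sketch (my own, not from the slides) of a network with one sigmoid hidden layer trained on the squared-error loss plus the (λ/2) Σ w² penalty.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def train_two_layer(X, y, n_hidden=8, alpha=0.1, lam=1e-4, n_epochs=200):
    """One-hidden-layer network trained by back-propagation (sketch).

    X: (N, D) inputs; y: (N,) targets in [0, 1].
    Per-example loss: (y - f(x))^2 + (lam/2) * (sum of squared weights).
    """
    rng = np.random.default_rng(0)
    N, D = X.shape
    W1 = rng.normal(scale=0.1, size=(n_hidden, D))   # input-to-hidden weights
    w2 = rng.normal(scale=0.1, size=n_hidden)        # hidden-to-output weights

    for epoch in range(n_epochs):
        for i in rng.permutation(N):                 # stochastic gradient descent
            # Forward pass
            h = sigmoid(W1 @ X[i])                   # hidden activations
            f = sigmoid(w2 @ h)                      # network output
            # Backward pass: chain rule applied from the output toward the input
            d_out = -2.0 * (y[i] - f) * f * (1.0 - f)     # dLoss / d(w2.h)
            grad_w2 = d_out * h + lam * w2
            d_hid = d_out * w2 * h * (1.0 - h)            # dLoss / d(W1.x)
            grad_W1 = np.outer(d_hid, X[i]) + lam * W1
            # Gradient descent step
            w2 -= alpha * grad_w2
            W1 -= alpha * grad_W1
    return W1, w2
```

Swapping the hidden-layer sigmoid for a ReLU only changes the h⋅(1 − h) factor in the backward pass to an indicator of the pre-activation being positive; the rest of the loop is unchanged.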
Multi-Layer Network Demo
http://playground.tensorflow.org/

Neural networks: Pros and cons
• Pros
– Flexible and general function approximation framework
– Can build extremely powerful models by adding more layers
• Cons
– Hard to analyze theoretically (e.g., training is prone to local optima)
– Huge amounts of training data and computing power are required to get good performance
– The space of implementation choices is huge (network architectures, parameters)

Support vector machines
• When the data is linearly separable, there may be more than one separator (hyperplane). Which separator is best?
• Find the hyperplane that maximizes the margin between the positive and negative examples; the training points that lie on the margin are the support vectors
C. Burges, A Tutorial on Support Vector Machines for Pattern Recognition, Data Mining and Knowledge Discovery, 1998

SVM parameter learning
• Separable data:
min over w, b of (1/2)||w||²  subject to  y_i(w⋅x_i + b) ≥ 1
(maximize the margin while classifying all training data correctly)
• Non-separable data:
min over w, b of (1/2)||w||² + C Σ_{i=1}^{n} max(0, 1 − y_i(w⋅x_i + b))
(maximize the margin while minimizing classification mistakes)
(figure: decision boundary w⋅x + b = 0 with margin boundaries at +1 and −1)
Demo: http://cs.stanford.edu/people/karpathy/svmjs/demo

Nonlinear SVMs
• General idea: the original input space can always be mapped to some higher-dimensional feature space where the training set is separable: Φ: x → φ(x)
• Example in 1D: a linearly separable dataset on the x axis needs no mapping, but a non-separable 1D dataset becomes separable once each point x is mapped to a higher-dimensional space, e.g., (x, x²)
Slide credit: Andrew Moore

The kernel trick
• Instead of explicitly computing the lifting transformation φ(x), define a kernel function K such that K(x, y) = φ(x)⋅φ(y)
(to be valid, the kernel function must satisfy Mercer's condition)
• Linear SVM decision function: w⋅x + b = Σ_i α_i y_i x_i⋅x + b, where the x_i with nonzero learned weights α_i are the support vectors
• Kernel SVM decision function: Σ_i α_i y_i φ(x_i)⋅φ(x) + b = Σ_i α_i y_i K(x_i, x) + b
• This gives a nonlinear decision boundary in the original feature space
C. Burges, A Tutorial on Support Vector Machines for Pattern Recognition, Data Mining and Knowledge Discovery, 1998

Polynomial kernel: K(x, y) = (c + x⋅y)^d

Gaussian kernel
• Also known as the radial basis function (RBF) kernel:
K(x, y) = exp(−||x − y||² / σ²)
(figures: K(x, y) as a function of ||x − y||; decision boundary of a Gaussian-kernel SVM with its support vectors)
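The kernel decision function above only needs the support vectors, their coefficients α_i y_i, and the bias b. Below is a small sketch of evaluating it with the Gaussian kernel; the function and variable names are my own, and obtaining the α_i (e.g., by solving the dual problem) is assumed to have been done already.

```python
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    """RBF kernel K(x, y) = exp(-||x - y||^2 / sigma^2), as defined above."""
    return np.exp(-np.sum((x - y) ** 2) / sigma ** 2)

def svm_decision(x, support_vectors, alpha_y, b, sigma=1.0):
    """Kernel SVM decision value: sum_i alpha_i * y_i * K(x_i, x) + b.

    support_vectors: (M, D) array of support vectors x_i
    alpha_y:         (M,) array of the products alpha_i * y_i from training
    The sign of the returned value is the predicted class label.
    """
    k = np.array([gaussian_kernel(sv, x, sigma) for sv in support_vectors])
    return float(alpha_y @ k + b)
```

Each prediction costs one kernel evaluation per support vector, which is the computation and memory concern listed under the cons below.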
SVMs: Pros and cons
• Pros
– The kernel-based framework is very powerful and flexible
– Training is a convex optimization problem; the globally optimal solution can be found
– Amenable to theoretical analysis
– SVMs work very well in practice, even with very small training sample sizes
• Cons
– No "direct" multi-class SVM; two-class SVMs must be combined (e.g., with one-vs-others)
– Computation and memory costs (especially for nonlinear SVMs)

Neural networks vs. SVMs (a.k.a. "deep" vs. "shallow" learning)