CSE 473/573 Computer Vision and Image Processing (CVIP)
Ifeoma Nwogu
Lecture 31 – Neural networks

Schedule
• Last class
– Markov random fields (MRFs) in vision
• Today
– Introduction to neural networks
• Readings for this week:
– Deep learning tutorial on neural networks (optional)
• See online schedule for link

Brief history
• Late 1800s – neural networks appeared as an analogy to biological systems
• 1960s and 70s – simple neural networks appeared
– Fell out of favor because the perceptron was not effective by itself, and there were no good algorithms for multilayer nets
• 1986 – backpropagation algorithm appeared
– Neural networks have had a resurgence in popularity
• 2006 – contrastive divergence for training deep neural nets

Basic idea
• Modeled on biological systems
– This association has become much looser
• Learn to classify objects
– Can do more than this
• Learn from given training data of the form (x1 ... xn, output)

Properties
• Inputs are flexible – any real values
– Highly correlated or independent
• Target function may be discrete-valued, real-valued, or vectors of discrete or real values
– Outputs are real numbers between 0 and 1
• Resistant to errors in the training data
• Long training time
• Fast evaluation
• The function produced can be difficult for humans to interpret

Diagram
[Figure: a perceptron unit – inputs x0, x1, ..., xn with weights w0, w1, ..., wn feed a summing node y = Σ wi xi, followed by a threshold that outputs 1 if y > 0 and -1 otherwise]

Perceptrons
• Basic unit in a neural network
• Linear separator
• Parts
– N inputs, x1 ... xn
– Weights for each input, w1 ... wn
– A bias input x0 (constant) and associated weight w0
– Weighted sum of inputs, y = w0x0 + w1x1 + ... + wnxn
– A threshold function, i.e. 1 if y > 0, -1 if y <= 0

Linear Separator
[Figure: a single line in the x1–x2 plane can separate the first arrangement of + and - points, but not the second (XOR), where no single line splits the classes]

Perceptron training rule
1. Initialize the weights (either to zero or to a small random value)
2. Pick a learning rate η (a number between 0.0 and 1.0)
3. Until a stopping condition is satisfied (e.g.
weights don't change):
4. For each training pattern (x, t):
– Compute the output activation y = f(w · x)
– If y = t, don't change the weights
– If y ≠ t, update the weights:
wj(new) = wj(old) + 2 η t xj; OR
wj(new) = wj(old) + η (t - y) xj, for all j
• Here y = Σj wj xj − b; the output of the perceptron is +1 if y > 0, and 0 otherwise.

Perceptron training rule (2)
• p1 and p2 are initially misclassified
• If we choose p1 to update the weights (left), the weight vector is moved slightly in the direction of p1
• If we choose p2 (right), the weight vector is moved slightly in the direction of -p2
• In either case, we get a new, improved boundary (blue dashed line)

In summary
[Figure: summary of the perceptron training rule]

Gradient Descent
• The perceptron training rule may not converge if the points are not linearly separable
• Gradient descent tries to fix this by changing the weights by the total error over all training points, rather than point by point
– If the data is not linearly separable, it will converge to the best fit

LMS Learning
LMS = Least Mean Square learning, more general than the previous perceptron learning rule. The idea is to minimize the total error, as measured over all training examples D. O is the raw output, as calculated by O = Σi wi Ii:

Error(LMS) = (1/2) ΣD (TD − OD)^2

e.g., if we have two patterns, i.e. D = 2, with T1 = 1, O1 = 0.8, T2 = 0, O2 = 0.5, then
E = (0.5)[(1 − 0.8)^2 + (0 − 0.5)^2] = 0.145
We want to minimize the LMS error.
[Figure: one step of gradient descent with learning rate c, moving W(old) to W(new) down the error surface E]

Gradient descent
[Figures: derivation of the gradient descent weight update, continued over four slides]

So far….
• The perceptron training rule is guaranteed to succeed if
– the training examples are linearly separable, and
– the learning rate is sufficiently small
• The linear unit training rule uses gradient descent
– Guaranteed to converge to the hypothesis with minimum squared error
– Given a sufficiently small learning rate
– Even when the training data contains noise
– Even when the training data is not separable by a hyperplane

LMS vs.
Limiting Threshold
• With the new sigmoidal function, which is differentiable, we can apply the delta rule to learning.
• Perceptron method
– Forces the output to 0 or 1, while LMS uses the net output
– Guaranteed to separate the classes if there is no error and they are linearly separable
• Otherwise it may not converge
• Gradient descent method
– May oscillate and not converge
– May converge to the wrong answer
– Will converge to some minimum even if the classes are not linearly separable, unlike the earlier perceptron training method

Gradient Descent Issues
• Converging to a local minimum can be very slow
– The while loop may have to run many times
• May converge to a local minimum
• Stochastic gradient descent
– Updates the weights after each training example rather than all at once
– Takes less memory
– Can sometimes avoid local minima
– η must decrease with time in order for it to converge

Differentiable Threshold Unit
• Our old threshold function (1 if y > 0, 0 otherwise) is not differentiable
• Gradient descent requires a differentiable threshold unit in order to continue
• One solution is the sigmoid unit

Sigmoid unit
[Figure: the sigmoid unit, σ(y) = 1 / (1 + e^(−y))]

Graph of Sigmoid Function
[Figure: the S-shaped curve of the sigmoid function]

Multi-layer Neural Networks
• A single perceptron can only learn linearly separable functions
• We would like to have a network of perceptrons, but how do we determine the error of the output for an internal node?
• Solution: the backpropagation algorithm

Multilayer network of sigmoid units
[Figure: a multilayer network of sigmoid units]

Backpropagation Networks
• Attributed to Rumelhart and McClelland, late 70's
• To bypass the linear classification problem, we can construct multilayer networks. Typically we have fully connected, feedforward networks with an input layer, a hidden layer, and an output layer (plus constant 1's as bias inputs), with weights Wi,j from inputs to hidden units and Wj,k from hidden units to outputs:

Hidden units: Hj(x) = 1 / (1 + e^(−Σi wi,j xi))
Output units: Ok(x) = 1 / (1 + e^(−Σj wj,k Hj))

Backpropagation - learning
[Figure: learning in a backpropagation network]

Backpropagation Algorithm
• Initialize all weights to small random numbers
• Until the stopping condition is met, do
– For each training input:
1. Input the training example to the network and propagate the computations to the output
2. Adjust the weights according to the delta rule, propagating the errors back; the weights will begin to converge to a place where the network learns to give the desired output
3. Repeat; stop when there are no errors, or enough epochs are completed

Backpropagation - Modifying Weights
We had computed:

Δwd = c xd (Tj − Oj) f'(ActivationFunction), where f(sum) = 1 / (1 + e^(−sum))

Since f'(sum) = f(sum)(1 − f(sum)), this becomes:

Δwd = c xd (Tj − Oj) f(sum)(1 − f(sum))

For an output unit d, f(sum) = Od. For the output units, the update is:

Δwj,d = c Hj (Td − Od) Od (1 − Od)

For the hidden units, the update is:

Δwi,j = c xi Hj (1 − Hj) Σd (Td − Od) Od (1 − Od) wj,d

[Figure: x → Wi,j → H → Wj,k → O]

Properties of backpropagation
• Very powerful – can learn any function, given enough hidden units! With enough hidden units, we can generate any function.
• Has the standard problem of generalization vs. memorization. With too many units, the network will tend to memorize the input and not generalize well. Some schemes exist to "prune" the neural network.
• Networks require extensive training and have many parameters to fiddle with. Can be extremely slow to train. May also fall into local minima.
• Inherently parallel algorithm, ideal for multiprocessor hardware.
• Despite the cons, it is a very powerful algorithm that has seen widespread successful deployment.

Momentum
• Add a fraction 0 <= α < 1 of the previous update for a weight to the current update
• May allow the learner to avoid local minima
• May speed up convergence to the global minimum

When to Stop Learning
• Learning until the error on the training set is below some threshold is a bad idea! It can result in overfitting
– If you match the training examples too well, your performance on real problems may suffer
• Instead, learn while trying to get the best result on some validation data
– Data from your training set that is not trained on, but instead used to check the function
– Stop when performance seems to be decreasing on this, while saving the best network seen so far.
– There may still be local minima, so watch out!

Example: Face Recognition
• From Machine Learning by Tom M. Mitchell
• Input: 30 by 32 pictures of people with the following properties:
– Wearing eyeglasses or not
– Facial expression: happy, sad, angry, neutral
– Direction in which they are looking: left, right, up, straight ahead
• Output: determine which category the picture fits into for one of these properties

Input Encoding
• Each pixel is an input
– 30*32 = 960 inputs
• The value of the pixel (0 – 255) is linearly mapped onto the range of reals between 0 and 1

Output Encoding
• Could use a single output node with the classifications assigned to 4 values (e.g. 0.2, 0.4, 0.6, and 0.8)
• Instead, use 4 output nodes (one for each value)
– 1-of-N output encoding
– Provides more degrees of freedom to the network
• Use values of 0.1 and 0.9 instead of 0 and 1
– The sigmoid function can never reach 0 or 1!
• Example: (0.9, 0.1, 0.1, 0.1) = left, (0.1, 0.9, 0.1, 0.1) = right, etc.

Network structure
[Figure: 960 inputs x1 ... x960 feeding 3 hidden units, which feed the output units]

Other Parameters
• Training rate: η = 0.3
• Momentum: α = 0.3
• Used full gradient descent (as opposed to stochastic)
• Weights in the output units were initialized to small random values, but input weights were initialized to 0
– Yields better visualizations
• Result: 90% accuracy on the test set!

Try it out!
• Get the code from http://www.cs.cmu.edu/~tom/mlbook.html
– Go to the Software and Data page, then follow the "Neural network learning to recognize faces" link
– Follow the documentation

Slide Credits
• Greg Grudic – University of Colorado
• Kenrick Mock – UAA
• Joel Anderson – UIUC

Next 2 classes
• Deep networks and vision applications
• Readings for next lecture:
– None (come and listen…)
• Readings for today:
– Deep learning tutorial on neural networks (optional)

Questions
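As a closing illustration, the perceptron training rule covered earlier in the lecture (update wj ← wj + η(t − y)xj only on mistakes, stop when the weights no longer change) can be sketched in a few lines of Python. This is a minimal sketch, not the lecture's code: the function names, the ±1 target convention, and the OR dataset are illustrative assumptions.

```python
# Sketch of the perceptron training rule (hypothetical helper names,
# not from the lecture materials). Targets t are in {-1, +1}.
def train_perceptron(data, eta=0.1, epochs=100):
    n = len(data[0][0])
    w = [0.0] * n          # weights, initialized to zero
    b = 0.0                # bias weight w0 (with constant input x0 = 1)
    for _ in range(epochs):
        changed = False
        for x, t in data:
            # Hard-threshold output: +1 if the weighted sum is positive
            y = 1 if b + sum(wi * xi for wi, xi in zip(w, x)) > 0 else -1
            if y != t:     # update only on misclassified patterns
                for j in range(n):
                    w[j] += eta * (t - y) * x[j]
                b += eta * (t - y)
                changed = True
        if not changed:    # stopping condition: weights don't change
            break
    return w, b

# Learn logical OR, which is linearly separable (unlike XOR)
data = [([0, 0], -1), ([0, 1], 1), ([1, 0], 1), ([1, 1], 1)]
w, b = train_perceptron(data)
```

Because OR is linearly separable and the learning rate is small, the rule converges and every training pattern ends up on the correct side of the learned boundary.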
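The backpropagation weight updates from the "Modifying Weights" discussion (output deltas (Td − Od)Od(1 − Od), hidden deltas Hj(1 − Hj) Σd δd wj,d) can be sketched for a tiny fully connected feedforward network. This is a minimal sketch under stated assumptions: the 2-2-1 network size, learning rate c, and all names are illustrative, and bias inputs are omitted for brevity.

```python
import math

def sigmoid(y):
    """Differentiable threshold unit: 1 / (1 + e^(-y))."""
    return 1.0 / (1.0 + math.exp(-y))

def backprop_step(x, t, w_ij, w_jk, c=0.5):
    """One forward pass plus one delta-rule weight update.

    w_ij[i][j]: input-to-hidden weights; w_jk[j][k]: hidden-to-output.
    Returns the outputs computed before the update.
    """
    # Forward pass: hidden activations H_j, then outputs O_k
    H = [sigmoid(sum(w_ij[i][j] * x[i] for i in range(len(x))))
         for j in range(len(w_ij[0]))]
    O = [sigmoid(sum(w_jk[j][k] * H[j] for j in range(len(H))))
         for k in range(len(w_jk[0]))]
    # Output-unit deltas: (T_k - O_k) * O_k * (1 - O_k)
    d_out = [(t[k] - O[k]) * O[k] * (1 - O[k]) for k in range(len(O))]
    # Hidden-unit deltas: H_j * (1 - H_j) * sum_k d_out[k] * w_jk[j][k]
    d_hid = [H[j] * (1 - H[j]) *
             sum(d_out[k] * w_jk[j][k] for k in range(len(O)))
             for j in range(len(H))]
    # Weight updates, propagating the errors back
    for j in range(len(H)):
        for k in range(len(O)):
            w_jk[j][k] += c * d_out[k] * H[j]
    for i in range(len(x)):
        for j in range(len(H)):
            w_ij[i][j] += c * d_hid[j] * x[i]
    return O

# Repeated updates on one pattern drive the output toward the target
x, t = [1.0, 0.0], [1.0]
w_ij = [[0.1, -0.2], [0.3, 0.4]]   # illustrative small initial weights
w_jk = [[0.5], [-0.5]]
o_first = backprop_step(x, t, w_ij, w_jk)[0]
for _ in range(200):
    o_last = backprop_step(x, t, w_ij, w_jk)[0]
```

Note that, as on the slides, the sigmoid derivative is computed from the activation itself as f(1 − f), so no separate derivative evaluation is needed during the backward pass.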