Backpropagation

The Plague of Linear Separability
• The good news:
– Learn-Perceptron is guaranteed to converge to a correct assignment of weights if such an assignment exists
• The bad news:
– Learn-Perceptron can only learn classes that are linearly separable (i.e., separable by a single hyperplane)
• The really bad news:
– A very large number of interesting problems are not linearly separable (e.g., XOR)

Linear Separability
• Let $d$ be the number of inputs:
$$\lim_{d \to \infty} (\text{number of LS functions}) = \infty$$
$$\lim_{d \to \infty} \frac{\text{number of LS functions}}{\text{total number of functions}} = 0$$
• Hence, as $d$ grows, almost all functions escape the algorithm

Historical Perspective
• The result on linear separability (Minsky & Papert, 1969) virtually put an end to connectionist research
• The solution seemed obvious: since multi-layer networks could in principle handle arbitrary problems, one only needed to design a learning algorithm for them
• This proved to be a major challenge
• AI would have to wait over 15 years for a general-purpose NN learning algorithm, devised by Rumelhart in 1986

Towards a Solution
• Main problem:
– Learn-Perceptron implements a discrete model of error (i.e., it only detects the existence of an error and adapts to it)
• First thing to do:
– Allow nodes to have real-valued activations (amount of error = difference between computed and target output)
• Second thing to do:
– Design a learning rule that adjusts weights based on the error
• Last thing to do:
– Use the learning rule to implement a multi-layer algorithm

Real-valued Activation
• Replace the threshold unit (step function) with a linear unit, where:
$$out = \sum_{i=1}^{n} w_i x_i$$
• The error is no longer discrete: $E = (O_t - O_p) \in \mathbb{R}$

Training Error
• We define the training error of a hypothesis, or weight vector, by:
$$E(\vec{w}) = \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2$$
which we will seek to minimize

The Delta Rule
• Implements gradient descent (i.e., steepest descent) on the error surface:
$$\Delta w_i = \eta \sum_{d \in D} (t_d - o_d)\, x_{id}$$
• Note how the multiplicative factor $x_{id}$ implicitly identifies "active" lines, as in Learn-Perceptron

Gradient-descent Learning (b)
• Initialize weights to small random values
• Repeat
– Initialize each $\Delta w_i$ to 0
– For each training example $\langle \vec{x}, t \rangle$
• Compute output $o$ for $\vec{x}$
• For each weight $w_i$
– $\Delta w_i \leftarrow \Delta w_i + \eta (t - o) x_i$
– For each weight $w_i$
• $w_i \leftarrow w_i + \Delta w_i$

Gradient-descent Learning (i)
• Initialize weights to small random values
• Repeat
– For each training example $\langle \vec{x}, t \rangle$
• Compute output $o$ for $\vec{x}$
• For each weight $w_i$
– $w_i \leftarrow w_i + \eta (t - o) x_i$

Discussion
• Gradient-descent learning (with linear units) requires more than one pass through the training set
• The good news:
– Convergence is guaranteed if the problem is solvable
• The bad news:
– It still produces only linear functions
– Even when used in a multi-layer context (a composition of linear functions is itself linear)
• It needs to be further generalized!

Non-linear Activation
• Introduce non-linearity with a sigmoid unit:
$$net = \sum_{i=1}^{n} w_i x_i \qquad out = \frac{1}{1 + e^{-net}}$$
1. Differentiable (required for gradient descent)
2. Most unstable in the middle

Sigmoid Function
$$\sigma(x) = \frac{1}{1 + e^{-x}} \qquad \frac{d\sigma(x)}{dx} = \sigma(x)\,(1 - \sigma(x))$$
• The derivative reaches its maximum when the output is most unstable. Hence, the weight change will be largest when the output is most uncertain.

Multi-layer Feed-forward NN
[Figure: a multi-layer feed-forward network; input nodes $i$ feed hidden nodes $j$, which feed output nodes $k$]

Backpropagation (i)
• Repeat
– Present a training instance
– Compute the error $\delta_k$ of the output units
– For each hidden layer
• Compute the error $\delta_j$ using the error from the next layer
– Update all weights: $w_{ij} \leftarrow w_{ij} + \Delta w_{ij}$, where $\Delta w_{ij} = \eta\, O_i \delta_j$
• Until ($E <$ CriticalError)
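The slides give the training loop only in outline. As a concrete illustration, here is a minimal NumPy sketch of one incremental backpropagation step for a single-hidden-layer network, using the output-unit and hidden-unit error formulas spelled out in the Error Computation slide that follows. The function and variable names (backprop_step, W_ih, W_ho) and the omission of bias terms are assumptions made for this sketch, not part of the original slides.

```python
import numpy as np

def sigmoid(net):
    # out = 1 / (1 + e^{-net})
    return 1.0 / (1.0 + np.exp(-net))

def backprop_step(x, t, W_ih, W_ho, C=0.5):
    """One incremental backpropagation step (sketch; no bias terms).

    x    : input vector, shape (n_in,)
    t    : target vector, shape (n_out,)
    W_ih : input-to-hidden weights, shape (n_in, n_hid)
    W_ho : hidden-to-output weights, shape (n_hid, n_out)
    C    : learning rate
    """
    # Forward pass
    O_j = sigmoid(x @ W_ih)      # hidden activations
    O_k = sigmoid(O_j @ W_ho)    # output activations

    # Output error: E_k = (T_k - O_k) f'(net_k), with f'(net) = O(1 - O)
    E_k = (t - O_k) * O_k * (1.0 - O_k)

    # Hidden error: E_j = (sum_k E_k W_jk) f'(net_j)
    E_j = (W_ho @ E_k) * O_j * (1.0 - O_j)

    # Weight changes: dW_jk = C * O_j * E_k  and  dW_ij = C * O_i * E_j
    W_ho += C * np.outer(O_j, E_k)
    W_ih += C * np.outer(x, E_j)
    return W_ih, W_ho
```

Calling it with $x = (1, 0, 1)$, $t = (0, 1)$, all weights initialized to 0.2, and $C = 0.5$ reproduces the first update of Example (II) below.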
Error Computation
• $O_j = f(net_j)$, where $f$ is a sigmoid, so $f'(net_j) = O_j (1 - O_j)$
– $f'(net_j)$ is maximum when $O_j$ is most unstable
• Output units: error $E_k = (T_k - O_k)\, f'(net_k)$; weight change $\Delta W_{jk} = C\, O_j E_k$
• Hidden units: error $E_j = \big(\sum_k E_k W_{jk}\big) f'(net_j)$; weight change $\Delta W_{ij} = C\, O_i E_j$
– $C$ is the learning rate; errors are computed backward, from the output layer toward the inputs
[Figure: computation of errors, flowing from output nodes $k$ back through hidden nodes $j$ to input nodes $i$]

Example (I)
• Consider a simple network composed of:
– 3 inputs: a, b, c
– 1 hidden node: h
– 2 outputs: q, r
• Assume $\eta = 0.5$, all weights initialized to 0.2, and incremental weight updates
• Consider the training set (inputs → targets):
– 101 → 01
– 011 → 11
• Run 4 iterations over the training set

Example (II)
[Table: numeric trace of the incremental weight updates over the 4 passes; all weights start at 0.2. For the first example, (1, 0, 1) → (0, 1): $O_h \approx 0.5987$, $O_q = O_r \approx 0.5299$, errors $E_q \approx -0.132$, $E_r \approx 0.1171$, $E_h \approx -0.00072$, and weight changes $\Delta W_{h\text{-}q} \approx -0.0395$, $\Delta W_{h\text{-}r} \approx 0.0351$, $\Delta W_{a\text{-}h} = \Delta W_{c\text{-}h} \approx -0.00036$, $\Delta W_{b\text{-}h} = 0$ (input b is inactive).]

Dealing with Local Minima
• There is no guarantee of convergence to the global minimum
– Use a momentum term (see the sketch at the end of this section):
$$\Delta w_{ij}(n) = \eta\, O_i \delta_j + \alpha\, \Delta w_{ij}(n-1)$$
• Keeps the weights moving through small local (global!) minima or along flat regions
– Use the incremental/stochastic version of the algorithm
– Train multiple networks with different starting weights
• Select the best on a hold-out validation set
• Or combine their outputs (e.g., a weighted average)

Discussion
• 3-layer backpropagation neural networks are universal function approximators
• Backpropagation is the standard
– Extensions have been proposed to automatically set the various parameters (e.g., number of hidden layers, number of nodes per layer, learning rate)
– Dynamic models have been proposed (e.g., ASOCS)
• Other neural network models exist: Kohonen maps, Hopfield networks, Boltzmann machines, etc.
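To make the momentum idea from the Dealing with Local Minima slide concrete, here is a minimal sketch of the momentum-augmented weight change $\Delta w_{ij}(n) = \eta O_i \delta_j + \alpha \Delta w_{ij}(n-1)$. The function name and the default values of $\eta$ and $\alpha$ are illustrative assumptions, not part of the original slides.

```python
import numpy as np

def momentum_delta(dW_prev, O_i, delta_j, eta=0.5, alpha=0.9):
    """Momentum-augmented weight change for one layer:
        dW(n) = eta * outer(O_i, delta_j) + alpha * dW(n-1)

    dW_prev : previous weight change, shape (n_from, n_to)
    O_i     : activations of the upstream layer, shape (n_from,)
    delta_j : error terms of the downstream layer, shape (n_to,)
    alpha   : momentum coefficient (alpha = 0 recovers plain backprop)
    """
    return eta * np.outer(O_i, delta_j) + alpha * dW_prev
```

Because the previous change is folded in with weight $\alpha$, the update retains its recent direction, which is what carries the search through small local minima and across flat regions of the error surface.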