COSC343: Artificial Intelligence
Lecture 10: Artificial neural networks
Lech Szymanski, Dept. of Computer Science, University of Otago

In today's lecture
• Multi-layer perceptrons (MLPs)
• Sigmoid activation function
• Backpropagation
• Backpropagated least squares

Recall: Limits of the perceptron
• The perceptron splits the data space into two half-spaces.
• If the required separation boundary is non-linear, the perceptron cannot form a reasonable separation boundary.
• But the perceptron corresponds to a single artificial neuron with many inputs: what if we created a network of neurons?

[Figure: perceptron decision boundaries over the input space for three training runs, titled "Epoch 286, # of errors: 0", "Epoch 500, # of errors: 44" and "Epoch 500, # of errors: 5"]

Multi-layer perceptrons (MLP)
Minsky and Papert's book Perceptrons (1969) set out the following ideas:
• Multi-layer perceptrons can compute any function
• 1-layer perceptrons can only compute linearly separable functions
• The perceptron learning rule only works for 1-layer perceptrons

[Network diagram: the input feeds the layer 1 neurons (hidden); layer 1's output is layer 2's input; the layer 2 neurons (hidden) produce layer 2's output, which is layer 3's input; the final layer produces the model output.]

MLP computation
Notation, for neuron j in layer l (with U_l neurons in layer l):
• w_{ji}^{(l)}: weight in layer l connecting neuron i of the previous layer to neuron j of this layer
• b_j^{(l)}: bias of neuron j in layer l
• u_j^{(l)}: activity of neuron j in layer l
• y_j^{(l)}: output of neuron j in layer l
Each layer has its own set of parameters (the layer 1, layer 2 and layer 3 weights and biases in the diagram).

• Activity is the weighted sum of the inputs from the previous layer minus the bias:
    u_j^{(l)} = \sum_{i=1}^{U_{l-1}} w_{ji}^{(l)} y_i^{(l-1)} - b_j^{(l)}
• Output is a function of the activity:
    y_j^{(l)} = g(u_j^{(l)})
  where g is the activation function and y^{(0)} = x is the network input.
• The programmer makes a decision about how many layers and how many neurons per layer!

MLP computation (matrix form)
• Collecting the U_l neurons of layer l into vectors, the whole layer can be computed at once:
    u^{(l)} = W^{(l)} y^{(l-1)} - b^{(l)},   y^{(l)} = g(u^{(l)})
  where the rows of W^{(l)} are the weight vectors of the 1st through the U_l-th neuron in the layer.
• Again, the programmer decides how many layers and how many neurons per layer.
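The slides give the matrix-form computation as equations only; the following NumPy sketch is an illustrative rendering of it, not part of the original lecture. The function names, the random initialisation, and the use of a logistic sigmoid for g (introduced later in the lecture) are all assumptions.

import numpy as np

def sigmoid(u):
    # One common choice of activation function g (discussed later in the lecture)
    return 1.0 / (1.0 + np.exp(-u))

def forward(x, weights, biases):
    """Matrix-form forward pass through an MLP.

    weights[l] has shape (U_l, U_{l-1}) with one row per neuron in layer l;
    biases[l] has shape (U_l,). Activity is the weighted sum of the previous
    layer's output minus the bias; the output is g applied to the activity.
    Returns the outputs of every layer (element 0 is the input itself).
    """
    y = [x]
    for W, b in zip(weights, biases):
        u = W @ y[-1] - b          # activity of the whole layer at once
        y.append(sigmoid(u))       # output of the layer
    return y

# Example: the programmer chooses three layers of sizes 2, 2 and 1 for a 2-D input
rng = np.random.default_rng(0)
sizes = [2, 2, 2, 1]
weights = [rng.normal(size=(m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [rng.normal(size=m) for m in sizes[1:]]
print(forward(np.array([0.5, -1.0]), weights, biases)[-1])   # model output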
MLP learning
Recall the perceptron learning (delta) rule:
• Weights change by the value of the input times the resulting error
• The bias changes by the negative of the error (the same rule as above, with a constant input of -1)

• The perceptron rule will work on the output-layer parameters, since the input to that layer is just the output of the previous layer.
• But how do we train the hidden-layer parameters?

Backpropagation
Is there a learning rule that explains how to update the weights of hidden units in a multi-layer perceptron?

Yes!
• The basic idea behind error backpropagation is to take the blame associated with the error in an output unit, and distribute it amongst the units which provided its input.
• The input units which are connected with the strongest weights need to take more 'responsibility' for the error.
• Backpropagation was first discovered by mathematicians (e.g. Bryson and Ho, 1969), first applied to neural networks by Werbos (1981), and made famous by Rumelhart, Hinton and Williams (1986).

Backpropagation
Assume we are minimising some cost function E that depends on the network output and the desired output. The steepest gradient descent updates for neuron j in layer l are
    \Delta w_{ji}^{(l)} = -\eta \frac{\partial E}{\partial w_{ji}^{(l)}}
for the weight connecting input i with neuron j, and
    \Delta b_j^{(l)} = -\eta \frac{\partial E}{\partial b_j^{(l)}}
for the bias, where \eta is the learning rate. Working out the derivatives gives:

• The change in the weight connecting input i with neuron j in layer l:
    \Delta w_{ji}^{(l)} = -\eta \, \delta_j^{(l)} \, g'(u_j^{(l)}) \, y_i^{(l-1)}
  where \delta_j^{(l)} is the blame assigned to neuron j in layer l for the error in the output, g'(u_j^{(l)}) is the derivative of the activation function, and y_i^{(l-1)} is the output of neuron i from the previous layer.
• The change in the bias for neuron j in layer l:
    \Delta b_j^{(l)} = \eta \, \delta_j^{(l)} \, g'(u_j^{(l)})
  (the sign flips because the bias enters the activity with a -1 input).
• The cost blame for neuron j in the output layer is the derivative of the overall cost with respect to that neuron's output:
    \delta_j^{(L)} = \frac{\partial E}{\partial y_j^{(L)}}
• The cost blame for neuron i in layer l-1 is a sum over the neurons in layer l:
    \delta_i^{(l-1)} = \sum_{j=1}^{U_l} \delta_j^{(l)} \, g'(u_j^{(l)}) \, w_{ji}^{(l)}
  where w_{ji}^{(l)} is the weight connecting neuron i in layer l-1 to neuron j in layer l.
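The slides present these updates as equations only; the NumPy sketch below is one possible implementation, not the lecture's own code. It assumes the conventions above (activity is the weighted sum minus the bias), anticipates the logistic sigmoid activation introduced in the next part of the lecture, and leaves the choice of cost open; all names (backprop_step, cost_blame, eta) are illustrative.

import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def sigmoid_deriv(u):
    s = sigmoid(u)
    return s * (1.0 - s)

def backprop_step(x, cost_blame, weights, biases, eta=0.1):
    """One steepest gradient descent step of backpropagation for one input x.

    cost_blame(y_out) must return dE/dy for the output layer, i.e. the
    derivative of the chosen cost E with respect to the network output.
    Updates weights and biases in place and returns the network output.
    """
    # Forward pass, keeping the activity u and the output y of every layer
    ys, us = [x], []
    for W, b in zip(weights, biases):
        u = W @ ys[-1] - b            # activity: weighted sum minus the bias
        us.append(u)
        ys.append(sigmoid(u))

    delta = cost_blame(ys[-1])        # blame for the output layer
    for l in reversed(range(len(weights))):
        grad_u = delta * sigmoid_deriv(us[l])        # delta_j * g'(u_j)
        delta = weights[l].T @ grad_u                # blame for layer l-1
        weights[l] -= eta * np.outer(grad_u, ys[l])  # weight change: -eta * delta * g'(u) * y_prev
        biases[l]  += eta * grad_u                   # bias change:   +eta * delta * g'(u)
    return ys[-1]

Note that the blame for layer l-1 is computed from the weights before they are updated, matching the equations above.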
Backpropagation
• The update equations need the derivative of the activation function: where does this derivative come from?
• The hard limiting function used by the perceptron has no usable derivative (it is flat everywhere except at the step, where it is not differentiable).

Logistic sigmoid activation
• The logistic sigmoid function is a smooth, differentiable stand-in for the hard limit:
    \sigma(u) = \frac{1}{1 + e^{-u}}

[Figure: hard-limit (hardlim) and logistic sigmoid activation functions plotted over activities from -8 to 8]

• The sigmoid function has a derivative, so with g = \sigma every term in the weight, bias and blame equations can be computed:
    \sigma'(u) = \sigma(u) (1 - \sigma(u))

Backpropagated least squares
• Let's use the least squares error between the network output and the desired output as the cost function.
• Then the output-layer blame, and with it the full set of updates, follows directly (a worked form is sketched below).
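The "then..." step appears on the slides as equations only; the following is a hedged reconstruction of the standard least squares specialisation. The 1/2 factor on the cost and the symbol t_j for the desired output are assumptions, not taken from the lecture.

E = \frac{1}{2} \sum_{j=1}^{U_L} \left( y_j^{(L)} - t_j \right)^2
        % least squares cost over the U_L output neurons, with desired outputs t_j

\delta_j^{(L)} = \frac{\partial E}{\partial y_j^{(L)}} = y_j^{(L)} - t_j
        % blame for output neuron j

g'(u) = \sigma(u) \, (1 - \sigma(u)) = y \, (1 - y)
        % sigmoid derivative, expressible from the neuron's own output

\Delta w_{ji}^{(L)} = -\eta \, (y_j^{(L)} - t_j) \, y_j^{(L)} (1 - y_j^{(L)}) \, y_i^{(L-1)}
        % resulting update for an output-layer weight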
An example: a 2-2-2-1 neural network
• The name lists the number of inputs, the number of neurons in layer 1, the number of neurons in layer 2, and the number of output neurons; the single output neuron gives the final classification.
• (A small end-to-end training sketch of this architecture follows the summary below.)

An example: a 2-2-2-3 neural network
• The same architecture, but with three output neurons.

An example: a 2-8-8-1 neural network
• The same task, but with wider hidden layers of 8 neurons each.

Summary and reading
• The multi-layer perceptron is a universal function approximator:
  • In theory it can model anything
  • In practice, it is not obvious what the best architecture is
• Backpropagation allows training of the hidden weights and biases
  • It requires differentiable activation functions; the sigmoid is a very common choice
• Reading for this lecture: AIMA Chapter 18, Sections 7.1 and 7.2
• Reading for next lecture: no reading
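To tie the pieces together, here is a minimal, self-contained training sketch, not from the original slides: a 2-2-2-1 network trained with backpropagated least squares on the XOR problem, used here only as a stand-in for a non-linearly separable task. The learning rate, number of epochs, random seed and all variable names are illustrative assumptions, and with an unlucky initialisation training can stall in a local minimum.

import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

# XOR: four points that no single perceptron can separate
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
T = np.array([[0.], [1.], [1.], [0.]])

# A 2-2-2-1 architecture: 2 inputs, two hidden layers of 2 neurons, 1 output
rng = np.random.default_rng(1)
sizes = [2, 2, 2, 1]
weights = [rng.normal(size=(m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [rng.normal(size=m) for m in sizes[1:]]
eta = 0.5                                   # learning rate (assumed)

for epoch in range(20000):
    for x, t in zip(X, T):
        # Forward pass: activity = weighted sum minus bias, output = sigmoid(activity)
        ys = [x]
        for W, b in zip(weights, biases):
            ys.append(sigmoid(W @ ys[-1] - b))
        # Backward pass with least squares: output-layer blame is (y - t)
        delta = ys[-1] - t
        for l in reversed(range(len(weights))):
            grad_u = delta * ys[l + 1] * (1.0 - ys[l + 1])  # delta * g'(u), using g' = y(1 - y)
            delta = weights[l].T @ grad_u                   # blame passed back to layer l-1
            weights[l] -= eta * np.outer(grad_u, ys[l])
            biases[l]  += eta * grad_u

# After training the outputs should be close to the XOR targets 0, 1, 1, 0
for x in X:
    y = x
    for W, b in zip(weights, biases):
        y = sigmoid(W @ y - b)
    print(x, "->", np.round(y, 2))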