COSC343: Artificial Intelligence
Lecture 10: Artificial neural networks
Multi-layer perceptrons (MLPs)
Lech Szymanski
Dept. of Computer Science, University of Otago

In today's lecture
•  Multi-layer perceptrons (MLPs)
•  Sigmoid activation function
•  Backpropagation
•  Backpropagated least squares
Recall: Limit of the perceptron
•  The perceptron splits the data space into two half-spaces.
•  If the required separation boundary is non-linear, the perceptron cannot form a reasonable separation boundary.
•  But the perceptron corresponds to a single artificial neuron with many inputs – what if we created a network of neurons?

[Figure: perceptron decision boundaries in (x1, x2) space: training converges on linearly separable data (Epoch 286, # of errors: 0) but fails to converge on non-linearly separable data (Epoch 500, # of errors: 44; Epoch 500, # of errors: 5).]
Multi-layer perceptrons (MLP)
Minsky and Papert's book Perceptrons (1969) set out the following ideas:
•  Multi-layer perceptrons can compute any function
•  1-layer perceptrons can only compute linearly separable functions
•  The perceptron learning rule only works for 1-layer perceptrons

[Figure: an MLP with an input layer, layer 1 neurons (hidden) and layer 2 neurons (hidden); layer 1 output feeds layer 2 input, layer 2 output feeds layer 3 input, and the final layer produces the model output.]
MLP computation
Notation for neuron j in layer l (the programmer makes the decision how many layers and how many neurons per layer!):
•  $w^l_{ji}$ – weight in layer l connecting neuron i of the previous layer to neuron j of the layer
•  $b^l_j$ – bias of neuron j in the layer
•  $a^l_j$ – activity of neuron j in the layer
•  $y^l_j$ – output of neuron j in the layer
•  $U_l$ – number of neurons in layer l

•  Activity is the weighted sum of inputs from the previous layer minus the bias:
   $a^l_j = \sum_{i=1}^{U_{l-1}} w^l_{ji}\, y^{l-1}_i - b^l_j$
•  Output is a function of the activity:
   $y^l_j = g(a^l_j)$
   where $y^0 = \mathbf{x}$ is the network input and the last layer's output is the model output.
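A minimal sketch of the per-neuron computation just described; the function name and the example weights are illustrative assumptions, not taken from the lecture, and the sigmoid used for g is only introduced later in the lecture.

import numpy as np

def neuron_output(w, b, y_prev, g):
    """One neuron: activity = weighted sum of inputs minus the bias; output = g(activity)."""
    a = np.dot(w, y_prev) - b          # a^l_j = sum_i w^l_ji * y^(l-1)_i - b^l_j
    return g(a)                        # y^l_j = g(a^l_j)

# Example: a neuron with two inputs from the previous layer (values chosen arbitrarily)
g = lambda a: 1.0 / (1.0 + np.exp(-a))   # sigmoid activation (see later slides)
w = np.array([0.5, -0.3])
b = 0.1
y_prev = np.array([1.0, 2.0])            # outputs of the previous layer
print(neuron_output(w, b, y_prev, g))    # ~0.45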
MLP computation (matrix form)
The whole of layer l can be computed at once:
   $\mathbf{y}^l = g\!\left(W^l \mathbf{y}^{l-1} - \mathbf{b}^l\right)$
where
   $W^l = \begin{bmatrix} (\mathbf{w}^l_1)^T \\ \vdots \\ (\mathbf{w}^l_{U_l})^T \end{bmatrix}$
i.e. the rows of $W^l$ are the weight vectors of the neurons in layer l (from the weight vector of the 1st neuron to the weight vector of the $U_l$-th neuron), and $\mathbf{b}^l$ is the vector of their biases.
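A sketch of the matrix-form computation; the 2-3-1 layer sizes and the random weights are chosen purely for illustration and are not from the lecture.

import numpy as np

def layer_forward(W, b, y_prev, g):
    """Whole-layer computation in matrix form: y^l = g(W^l y^(l-1) - b^l).
    W has one row per neuron in the layer (U_l x U_(l-1))."""
    return g(W @ y_prev - b)

g = lambda a: 1.0 / (1.0 + np.exp(-a))
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(3, 2)), rng.normal(size=3)   # layer 1: 3 neurons, 2 inputs
W2, b2 = rng.normal(size=(1, 3)), rng.normal(size=1)   # layer 2: 1 neuron, 3 inputs

x = np.array([0.2, -0.7])           # network input y^0
y1 = layer_forward(W1, b1, x, g)    # layer 1 output
y2 = layer_forward(W2, b2, y1, g)   # model output
print(y2)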
MLP learning
Recall the perceptron learning (delta) rule:
•  Weights change by the value of the input times the resulting error
•  The bias changes by the negative value of the error (the same rule as above, with a -1 input)
The perceptron rule will work on the output-layer parameters, since their input is just the output of the previous layer. But how do we train the hidden layer parameters? Is there a learning rule that explains how to update the weights of hidden units in a multi-layer perceptron?
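A minimal sketch of the recalled delta rule for a single hard-limiting perceptron; the function name and the default learning rate are illustrative assumptions.

import numpy as np

def delta_rule_update(w, b, x, target, eta=0.1):
    """One perceptron (delta rule) update: weights change by input times error;
    the bias changes by the negative of the error (a weight on a constant -1 input)."""
    y = 1.0 if np.dot(w, x) - b >= 0 else 0.0   # hard-limit output
    err = target - y
    return w + eta * err * x, b - eta * err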
Backpropagation
Yes! The basic idea behind error backpropagation is to take the error associated with an output unit and distribute it amongst the units which provided its input. The input units which are connected with the strongest weights need to take more 'responsibility' for the error.

First discovered by mathematicians (e.g. Bryson and Ho, 1969)...
First applied to neural networks by Werbos (1981)...
Made famous by Rumelhart, Hinton and Williams (1986).
Backpropagation
Let's assume that we are minimising some cost function E that depends on the network output and the desired output. The steepest gradient descent update for the weight connecting input i with neuron j in layer l, and for the bias on neuron j in layer l, is:
   $\Delta w^l_{ji} = -\eta \frac{\partial E}{\partial w^l_{ji}}$   and   $\Delta b^l_j = -\eta \frac{\partial E}{\partial b^l_j}$
where...
•  The change in the weight connecting input i with neuron j in layer l is
   $\Delta w^l_{ji} = -\eta\, \delta^l_j\, g'(a^l_j)\, y^{l-1}_i$
   (blame on neuron j × derivative of the activation function × output of neuron i from the previous layer).
•  The change in the bias for neuron j in layer l is
   $\Delta b^l_j = \eta\, \delta^l_j\, g'(a^l_j)$
•  $\delta^l_j$ is the "blame" assigned to neuron j in layer l for the error in the output. The cost blame for neuron j in the output layer is the derivative of the overall cost with respect to the neuron's output:
   $\delta^L_j = \frac{\partial E}{\partial y^L_j}$
•  The cost blame for neuron i in layer l-1 is a sum over the neurons in layer l (the blame on neuron j, times the derivative of the activation function, times the weight connecting neuron i in layer l-1 to neuron j in layer l):
   $\delta^{l-1}_i = \sum_{j=1}^{U_l} \delta^l_j\, g'(a^l_j)\, w^l_{ji}$
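A minimal sketch of one backward step through a layer using the formulas above; the function name and argument order are mine, and the activation derivative g_prime is assumed to be supplied (the sigmoid on the following slides provides one).

import numpy as np

def backprop_layer(delta, W, a, y_prev, g_prime, eta):
    """One backward step through layer l: given the blame delta^l for its neurons,
    return the weight/bias updates and the blame delta^(l-1) for the previous layer."""
    s = delta * g_prime(a)              # delta^l_j * g'(a^l_j), one entry per neuron j
    delta_prev = W.T @ s                # delta^(l-1)_i = sum_j delta^l_j g'(a^l_j) w^l_ji
    dW = -eta * np.outer(s, y_prev)     # Delta w^l_ji = -eta * delta^l_j g'(a^l_j) y^(l-1)_i
    db = eta * s                        # Delta b^l_j  = +eta * delta^l_j g'(a^l_j)
    return dW, db, delta_prev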
Backpropagation
All of the updates above contain $g'(a^l_j)$, the derivative of the activation function. Where does this derivative come from? The hard limiting function used by the perceptron has no derivative, so we need a different activation function.
Logistic sigmoid activation
The logistic sigmoid function is a smooth alternative to the hard limiter:
   $\sigma(a) = \frac{1}{1 + e^{-a}}$

[Figure: the hardlim and sigmoid functions plotted for a from -8 to 8; both range between 0 and 1, but the sigmoid makes the transition smoothly.]
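A small sketch of the sigmoid and its derivative, using the standard identity $\sigma'(a) = \sigma(a)(1-\sigma(a))$; the names and the finite-difference check are illustrative.

import numpy as np

def sigmoid(a):
    """Logistic sigmoid: a smooth, differentiable alternative to the hard limiter."""
    return 1.0 / (1.0 + np.exp(-a))

def sigmoid_prime(a):
    """Derivative of the sigmoid, expressed via the sigmoid's own output."""
    s = sigmoid(a)
    return s * (1.0 - s)

# Quick finite-difference sanity check at a = 0.3
eps = 1e-6
print(sigmoid_prime(0.3), (sigmoid(0.3 + eps) - sigmoid(0.3)) / eps)  # both ~0.2445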
Backpropagation
The sigmoid function has a derivative:
   $\frac{d\sigma}{da} = \sigma(a)\,\bigl(1 - \sigma(a)\bigr)$
so for a sigmoid neuron $g'(a^l_j) = y^l_j (1 - y^l_j)$ can be computed directly from the neuron's output, and the weight, bias and blame updates can all be evaluated. The remaining piece is the cost blame at the output layer, $\delta^L_j = \partial E / \partial y^L_j$ – where does this derivative come from?
Backpropagated least squares
Let's use the least squares error as the cost function:
   $E = \frac{1}{2} \sum_{j=1}^{U_L} \bigl(y^L_j - t_j\bigr)^2$
Then the cost blame for neuron j in the output layer, the derivative of the overall cost with respect to that neuron's output, is simply
   $\delta^L_j = \frac{\partial E}{\partial y^L_j} = y^L_j - t_j$
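A tiny numeric check of the output-layer blame under the least squares cost; the example outputs and targets are made up for illustration.

import numpy as np

def least_squares_cost(y, t):
    """E = 1/2 * sum_j (y_j - t_j)^2"""
    return 0.5 * np.sum((y - t) ** 2)

y = np.array([0.8, 0.3])     # network outputs (illustrative)
t = np.array([1.0, 0.0])     # desired outputs
delta_L = y - t              # output-layer blame: dE/dy_j = y_j - t_j

# Finite-difference check for the first output neuron
eps = 1e-6
y_eps = y.copy(); y_eps[0] += eps
print(delta_L[0], (least_squares_cost(y_eps, t) - least_squares_cost(y, t)) / eps)  # both ~ -0.2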
An example: a 2-2-2-1 neural network
A network with 2 inputs, 2 neurons in layer 1, 2 neurons in layer 2, and 1 output neuron; the output of the single output neuron gives the final classification.

[Figure: the 2-2-2-1 network and the final classification it learns.]
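A self-contained sketch of training a 2-2-2-1 sigmoid network with backpropagation and the least squares cost, following the formulas in this lecture. The XOR-style data set, learning rate, epoch count and random initialisation are illustrative assumptions, not taken from the slide, and convergence is not guaranteed for every initialisation.

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(1)
sizes = [2, 2, 2, 1]                                    # a 2-2-2-1 network
Ws = [rng.normal(size=(sizes[l + 1], sizes[l])) for l in range(3)]
bs = [rng.normal(size=sizes[l + 1]) for l in range(3)]

# XOR-style data: not linearly separable, so a single perceptron cannot solve it
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([[0], [1], [1], [0]], dtype=float)

eta = 0.5
for epoch in range(10000):
    for x, t in zip(X, T):
        # Forward pass: keep each layer's activity and output
        ys, activities = [x], []
        for W, b in zip(Ws, bs):
            a = W @ ys[-1] - b
            activities.append(a)
            ys.append(sigmoid(a))
        # Backward pass: output-layer blame for least squares, then propagate back
        delta = ys[-1] - t
        for l in reversed(range(3)):
            g_prime = sigmoid(activities[l]) * (1 - sigmoid(activities[l]))
            s = delta * g_prime
            delta = Ws[l].T @ s                 # blame for layer l-1 (pre-update weights)
            Ws[l] -= eta * np.outer(s, ys[l])   # Delta w = -eta * s_j * y^(l-1)_i
            bs[l] += eta * s                    # Delta b = +eta * s_j

def predict(x):
    y = x
    for W, b in zip(Ws, bs):
        y = sigmoid(W @ y - b)
    return y

# Typically ends up close to [0, 1, 1, 0] on this data, though not guaranteed
print(np.round(np.array([predict(x)[0] for x in X]), 2))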
An example: a 2-2-2-3 neural network
The same idea with 2 inputs, 2 neurons in layer 1, 2 neurons in layer 2, and 3 output neurons.

[Figure: the 2-2-2-3 network and its outputs.]
An example: a 2-8-8-1 neural network
A larger network: 2 inputs, 8 neurons in layer 1, 8 neurons in layer 2, and 1 output neuron.

[Figure: the 2-8-8-1 network and the final classification it learns.]
Summary and reading
•  The multi-layer perceptron is a universal function approximator
   •  In theory it can model anything
   •  In practice, it's not obvious what the best architecture is
•  Backpropagation allows training of hidden weights and biases
   •  It requires differentiable activation functions – the sigmoid is a very common choice

Reading for this lecture: AIMA Chapter 18, Sections 7.1 and 7.2
Reading for next lecture: no reading