Chapter 4
Artificial Neural Networks (Cont.)
• Perceptron
• Gradient Descent
• Multilayer Networks
• Backpropagation Algorithm
The problem of linear threshold function
• However, to do gradient descent, we need the output
of a unit to be a differentiable function of its input
and weights.
• Standard linear threshold function is not
differentiable at the threshold.
[Figure: the linear threshold transfer function o = sign(net), which jumps from −1 to +1 at net = 0]
Transfer functions
Suitable transfer function
• Sigmoid function:
– Continuous
– Non-decreasing
– Limited output value
– Well-formed derivative equation
o(x) = 1 / (1 + e^(−x))
o′(x) = o(x) (1 − o(x))
[Figure: sigmoid curve rising from 0 toward 1 as x increases]
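As a minimal illustrative sketch (not from the slides), the sigmoid and its derivative can be written directly from these two formulas:

```python
import numpy as np

def sigmoid(x):
    # o(x) = 1 / (1 + e^(-x))
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(x):
    # o'(x) = o(x) * (1 - o(x)) -- the "well-formed derivative" noted above
    o = sigmoid(x)
    return o * (1.0 - o)
```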
Sigmoid Unit
Training sigmoid unit
Gradient Descent
Gradient Descent for a sigmoid unit
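As a hedged sketch (the standard squared-error result, not copied from the slides): for a single sigmoid unit trained with error E = ½ Σ_d (t_d − o_d)², gradient descent gives the weight update Δw_i = η Σ_d (t_d − o_d) o_d (1 − o_d) x_{i,d}. In NumPy this might look like:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_unit_gd_step(X, t, w, eta=0.1):
    """One batch gradient-descent step for a single sigmoid unit (sketch).

    X: (D, n) training inputs, t: (D,) target outputs, w: (n,) weights.
    """
    o = sigmoid(X @ w)                    # unit output for every training example
    grad = ((t - o) * o * (1 - o)) @ X    # sum_d (t_d - o_d) o_d (1 - o_d) x_d
    return w + eta * grad                 # move along the negative gradient of E
```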
Batch vs. Incremental Gradient Descent
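As an illustrative contrast (assumed, not taken from the slides): batch gradient descent sums the gradient over all training examples before each weight update, while incremental (stochastic) gradient descent updates the weights after every single example. For the sigmoid unit above, the two loops differ only in where the update happens:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def batch_gd(X, t, w, eta=0.1, epochs=100):
    # batch: accumulate the gradient over the whole training set, one update per epoch
    for _ in range(epochs):
        o = sigmoid(X @ w)
        w = w + eta * ((t - o) * o * (1 - o)) @ X
    return w

def incremental_gd(X, t, w, eta=0.1, epochs=100):
    # incremental (stochastic): update the weights after every individual example
    for _ in range(epochs):
        for x_d, t_d in zip(X, t):
            o_d = sigmoid(x_d @ w)
            w = w + eta * (t_d - o_d) * o_d * (1 - o_d) * x_d
    return w
```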
Multi-Layer Networks
• Multi-layer networks can represent arbitrary functions,
but an effective learning algorithm for such networks
was thought to be difficult.
• A typical multi-layer network consists of an input,
hidden and output layer, each fully connected to the
next, with activation feeding forward.
[Figure: a fully connected feed-forward network with input, hidden, and output layers; activation flows from input to output]
• The weights determine the function computed. Given an
arbitrary number of hidden units, any boolean function
can be computed with a single hidden layer.
Signal flow in multi-layer networks
Forward Equation
s^(l): number of neurons in the l-th layer
f^(l): the transfer function of the l-th layer
O^(l): the output of the l-th layer
X: the n × 1 input array
W^(1): the s^(1) × n weight matrix for the first layer
W^(2): the s^(2) × s^(1) weight matrix for the second layer

net^(1) = W^(1) X
O^(1) = f^(1)(net^(1))
net^(2) = W^(2) O^(1)
O^(2) = f^(2)(net^(2))
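A minimal NumPy sketch of this forward pass (the sizes n = 3, s^(1) = 4, s^(2) = 2 and the choice of sigmoid for both layers are assumptions for illustration):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
X  = rng.standard_normal((3, 1))   # n x 1 input array (n = 3 assumed)
W1 = rng.standard_normal((4, 3))   # s(1) x n weight matrix (s(1) = 4 assumed)
W2 = rng.standard_normal((2, 4))   # s(2) x s(1) weight matrix (s(2) = 2 assumed)

net1 = W1 @ X            # net(1) = W(1) X
O1   = sigmoid(net1)     # O(1)  = f(1)(net(1))
net2 = W2 @ O1           # net(2) = W(2) O(1)
O2   = sigmoid(net2)     # O(2)  = f(2)(net(2))
print(O2.shape)          # (2, 1): one output per neuron in the second layer
```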
Multi-layer linear network is meaningless
• Need non-linear transfer function to move beyond
linear functions.
– A multi-layer linear network is still a single linear layer.
– Suppose the transfer functions of both layers are linear (identity). Then
  O^(2) = f^(2)(net^(2)) = net^(2)
        = W^(2) O^(1) = W^(2) f^(1)(net^(1))
        = W^(2) net^(1) = W^(2) W^(1) X
        = W X,   where W = W^(2) W^(1) is a single s^(2) × n matrix.
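A quick NumPy check of this collapse (shapes chosen arbitrarily for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X  = rng.standard_normal((3, 1))   # n = 3 inputs
W1 = rng.standard_normal((4, 3))   # first linear layer
W2 = rng.standard_normal((2, 4))   # second linear layer

# two stacked linear layers compute exactly what the single matrix W = W2 W1 computes
print(np.allclose(W2 @ (W1 @ X), (W2 @ W1) @ X))   # True
```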
Backpropagation Learning Rule
• Each weight is changed by (if all the neurons are sigmoid units):

  Δw_ji^(l) = η δ_j^(l) o_i^(l−1)

  δ_j^(2) = o_j^(2) (1 − o_j^(2)) (t_j − o_j^(2))          if j is an output unit
  δ_j^(1) = o_j^(1) (1 − o_j^(1)) Σ_k δ_k^(2) w_kj^(2)     if j is a hidden unit

  where
  l is the layer number
  η is a constant called the learning rate
  t_j is the correct teacher output for output unit j
  δ_j^(l) is the error measure for unit j in the l-th layer
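A hedged NumPy sketch of these update equations for a 2-layer sigmoid network (the matrix shapes and learning rate are illustrative assumptions):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def backprop_update(x, t, W1, W2, eta=0.5):
    """One application of the learning rule for one training example (sketch).

    x: (n, 1) input column, t: (m, 1) target column,
    W1: (s1, n) and W2: (m, s1) weight matrices.
    """
    o1 = sigmoid(W1 @ x)                         # hidden outputs o^(1)
    o2 = sigmoid(W2 @ o1)                        # network outputs o^(2)

    delta2 = o2 * (1 - o2) * (t - o2)            # output units: o(1-o)(t-o)
    delta1 = o1 * (1 - o1) * (W2.T @ delta2)     # hidden units: o(1-o) * sum_k delta_k w_kj

    W2 = W2 + eta * delta2 @ o1.T                # Δw_ji^(2) = η δ_j^(2) o_i^(1)
    W1 = W1 + eta * delta1 @ x.T                 # Δw_ji^(1) = η δ_j^(1) x_i
    return W1, W2
```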
Proof
s^(l): number of neurons in the l-th layer
f^(l): the transfer function of the l-th layer
O^(l): the output of the l-th layer
X: the n × 1 input array
W^(1): the s^(1) × n weight matrix for the first layer
W^(2): the s^(2) × s^(1) weight matrix for the second layer

Backpropagation = gradient descent + derivative chain rule
Pages 101-103
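A hedged sketch of the output-unit case (the standard squared-error chain-rule derivation; not reproduced from pages 101-103):

```latex
% E is the squared error over the output units; w_{ji} feeds net_j = \sum_i w_{ji} o_i
\begin{aligned}
E &= \tfrac{1}{2}\sum_k (t_k - o_k)^2 \\
\frac{\partial E}{\partial w_{ji}}
  &= \frac{\partial E}{\partial net_j}\,\frac{\partial net_j}{\partial w_{ji}}
   = \frac{\partial E}{\partial o_j}\,\frac{\partial o_j}{\partial net_j}\, o_i
   = -(t_j - o_j)\, o_j (1 - o_j)\, o_i \\
\Delta w_{ji} &= -\eta \frac{\partial E}{\partial w_{ji}}
   = \eta\, \underbrace{o_j (1 - o_j)(t_j - o_j)}_{\delta_j}\, o_i
\end{aligned}
```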
Error Backpropagation
• First calculate the error of the output units and use it to change the top
layer of weights.

  Current output: o_j = 0.2
  Correct output: t_j = 1.0

  δ_j^(2) = o_j^(2) (1 − o_j^(2)) (t_j − o_j^(2)) = 0.2 (1 − 0.2)(1 − 0.2) = 0.128

  Update the weights into j:
  Δw_ji^(2) = η δ_j^(2) o_i^(1)

  [Figure: network with input, hidden, and output layers; output unit j highlighted]
Error Backpropagation
• Next calculate the error for each hidden unit based on the errors of the
output units it feeds into.

  δ_j^(1) = o_j^(1) (1 − o_j^(1)) Σ_k δ_k^(2) w_kj^(2)

  [Figure: output-unit errors propagated back along the weights into hidden unit j]
Error Backpropagation
• Finally update the bottom layer of weights based on the errors calculated
for the hidden units.

  δ_j^(1) = o_j^(1) (1 − o_j^(1)) Σ_k δ_k^(2) w_kj^(2)

  Update the weights into j:
  Δw_ji^(1) = η δ_j^(1) x_i    (x_i = o_i^(0), the i-th input)

  [Figure: input-to-hidden weights updated using the hidden-unit errors]
Backpropagation Training Algorithm
Create a 2-layer network with H hidden units and full connectivity between
layers. Set the weights to small random real values.

Until all training examples produce the correct value (within ε), or the mean
squared error stops decreasing, or another termination criterion is met:
  Begin epoch
    For each training example d:
      Calculate the network output for d's input values
      Compute the error between the current output and the correct output for d
      Update the weights by backpropagating the error using the learning rule
  End epoch
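A compact NumPy sketch of this algorithm (the network size, learning rate, stopping threshold, and the XOR data used as a usage example are all illustrative assumptions, not values from the slides):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_backprop(X, T, H=4, eta=0.5, eps=0.01, max_epochs=10000, seed=0):
    """Train a 2-layer sigmoid network with incremental backpropagation (sketch)."""
    rng = np.random.default_rng(seed)
    n, m = X.shape[1], T.shape[1]               # input and output dimensions
    W1 = rng.uniform(-0.05, 0.05, (H, n))       # small random initial weights
    W2 = rng.uniform(-0.05, 0.05, (m, H))

    for epoch in range(max_epochs):             # Begin epoch
        for x, t in zip(X, T):                  # for each training example d
            x, t = x.reshape(-1, 1), t.reshape(-1, 1)
            o1 = sigmoid(W1 @ x)                # calculate network output for d
            o2 = sigmoid(W2 @ o1)
            d2 = o2 * (1 - o2) * (t - o2)       # error of the output units
            d1 = o1 * (1 - o1) * (W2.T @ d2)    # error of the hidden units
            W2 += eta * d2 @ o1.T               # update weights by the learning rule
            W1 += eta * d1 @ x.T
        # End epoch: stop when the mean squared error is small enough
        O = sigmoid(W2 @ sigmoid(W1 @ X.T))
        if np.mean((T.T - O) ** 2) < eps:
            break
    return W1, W2

# usage example with a tiny illustrative task (XOR)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([[0], [1], [1], [0]], dtype=float)
W1, W2 = train_backprop(X, T)
```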
Mini Project 2
• Find the best NN architecture to train
the mushroom data set.
• What is the average final network error over 10 runs of the training process
(each time with randomly initialized weights)?
• Compare the classification performance on the training, validation, and test
data sets.
• Due date: to be announced