Chapter 4: Artificial Neural Networks (Cont.)
• Perceptron
• Gradient Descent
• Multilayer Networks
• Backpropagation Algorithm

The problem of the linear threshold function
• However, to do gradient descent, we need the output of a unit to be a differentiable function of its inputs and weights.
• The standard linear threshold function is not differentiable at the threshold: o = sign(net) jumps from −1 to +1 at net = 0.

Transfer functions

A suitable transfer function
• Sigmoid function:
  – Continuous
  – Non-decreasing
  – Bounded output value
  – Well-formed derivative:
    o(x) = 1 / (1 + e^(−x)),    o′(x) = o(x)(1 − o(x))

Sigmoid Unit

Training a sigmoid unit

Gradient Descent

Gradient Descent for a sigmoid unit

Batch vs. Incremental Gradient Descent

Multi-Layer Networks
• Multi-layer networks can represent arbitrary functions, but an effective learning algorithm for such networks was thought to be difficult to find.
• A typical multi-layer network consists of an input, a hidden, and an output layer, each fully connected to the next, with activation feeding forward.
• The weights determine the function computed. Given enough hidden units, any boolean function can be computed with a single hidden layer.

Signal flow in multi-layer networks

Forward equations
s^(l): the number of neurons in the lth layer
f^(l): the transfer function of the lth layer
O^(l): the output of the lth layer
X: the n × 1 input vector
W^(1): the s^(1) × n weight matrix of the first layer
W^(2): the s^(2) × s^(1) weight matrix of the second layer

  net^(1) = W^(1) X        O^(1) = f^(1)(net^(1))
  net^(2) = W^(2) O^(1)    O^(2) = f^(2)(net^(2))

A multi-layer linear network is meaningless
• A non-linear transfer function is needed to move beyond linear functions.
  – A multi-layer linear network is still a single linear layer.
  – Suppose the transfer functions of both layers are linear; then
    O^(2) = f^(2)(net^(2)) = net^(2) = W^(2) O^(1) = W^(2) f^(1)(net^(1)) = W^(2) net^(1) = W^(2) W^(1) X = W X

Backpropagation Learning Rule
• Each weight is changed by (if all the neurons are sigmoid units):
    Δw_ji^(l) = η δ_j^(l) o_i^(l−1)
    δ_j^(2) = o_j^(2) (1 − o_j^(2)) (t_j − o_j^(2))           if j is an output unit
    δ_j^(1) = o_j^(1) (1 − o_j^(1)) Σ_k δ_k^(2) w_kj^(2)      if j is a hidden unit
  where
    l is the layer number,
    η is a constant called the learning rate,
    t_j is the correct (teacher) output for output unit j,
    δ_j^(l) is the error measure for unit j in the lth layer.

Proof
• Backpropagation = gradient descent + the derivative chain rule (pages 101–103).

Error Backpropagation
• First calculate the error of the output units and use it to change the top layer of weights. For example, with current output o_j^(2) = 0.2 and correct output t_j = 1.0:
    δ_j^(2) = o_j^(2) (1 − o_j^(2)) (t_j − o_j^(2)) = 0.2 (1 − 0.2)(1 − 0.2) = 0.128
  Update the weights into j:  Δw_ji^(2) = η δ_j^(2) o_i^(1)
• Next calculate the error for each hidden unit based on the errors of the output units it feeds into:
    δ_j^(1) = o_j^(1) (1 − o_j^(1)) Σ_k δ_k^(2) w_kj^(2)
• Finally update the bottom layer of weights based on the errors calculated for the hidden units:
    Δw_ji^(1) = η δ_j^(1) x_i   (the inputs x_i play the role of o^(0))
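Putting the forward equations and the learning rule together, the following is a minimal NumPy sketch of a single incremental update for a 2-layer network of sigmoid units. It is only an illustration of the equations above; the function and variable names (sigmoid, forward, backprop_step, W1, W2, eta) are illustrative, not taken from the lecture.

```python
import numpy as np

def sigmoid(net):
    # o(x) = 1 / (1 + e^(-x))
    return 1.0 / (1.0 + np.exp(-net))

def forward(x, W1, W2):
    # net^(1) = W^(1) X,    O^(1) = f^(1)(net^(1))
    o1 = sigmoid(W1 @ x)
    # net^(2) = W^(2) O^(1),  O^(2) = f^(2)(net^(2))
    o2 = sigmoid(W2 @ o1)
    return o1, o2

def backprop_step(x, t, W1, W2, eta=0.1):
    """One incremental gradient-descent update for a single example (x, t)."""
    o1, o2 = forward(x, W1, W2)
    # Output-layer error:  delta_j^(2) = o_j^(2) (1 - o_j^(2)) (t_j - o_j^(2))
    delta2 = o2 * (1.0 - o2) * (t - o2)
    # Hidden-layer error:  delta_j^(1) = o_j^(1) (1 - o_j^(1)) * sum_k delta_k^(2) w_kj^(2)
    delta1 = o1 * (1.0 - o1) * (W2.T @ delta2)
    # Weight updates:  Delta w_ji^(l) = eta * delta_j^(l) * o_i^(l-1)
    W2 = W2 + eta * np.outer(delta2, o1)
    W1 = W1 + eta * np.outer(delta1, x)
    return W1, W2
```

Note that both delta vectors are computed from the current weights before either weight matrix is updated, which is the same dependency order as the "first / next / finally" description above.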
Backpropagation Training Algorithm
Create a 2-layer network with H hidden units and full connectivity between layers. Set the weights to small random real values.
Until all training examples produce the correct output (within ε), or the mean squared error stops decreasing, or some other termination criterion is met:
  Begin epoch
  For each training example d:
    Calculate the network output for d's input values
    Compute the error between the current output and the correct output for d
    Update the weights by backpropagating the error using the learning rule
  End epoch
(A code sketch of this loop is given after the mini project description below.)

Mini Project 2
• Find the best NN architecture to train on the mushroom data set.
• What is the average final network error over 10 runs of the training process (each time with randomly initialized weights)?
• Compare the classification performance on the training, validation and test data sets.
• Due date: to be announced
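The training algorithm above can be read directly as code. Below is a minimal epoch loop in the same style, reusing the sigmoid/forward/backprop_step functions from the earlier sketch (it is not self-contained without them); the termination test and default parameter values (eta, max_epochs, eps) are illustrative assumptions, and the loop could also serve as the training routine for the mini project.

```python
import numpy as np

# Assumes sigmoid(), forward() and backprop_step() from the sketch above.
def train(examples, W1, W2, eta=0.1, max_epochs=1000, eps=1e-6):
    """Epoch-based training: repeat passes over the data until the mean squared
    error stops decreasing (or max_epochs is reached)."""
    prev_mse = np.inf
    for epoch in range(max_epochs):
        for x, t in examples:                     # one epoch = one pass over all examples
            W1, W2 = backprop_step(x, t, W1, W2, eta)
        # Mean squared error over the training set after this epoch
        mse = np.mean([np.sum((t - forward(x, W1, W2)[1]) ** 2) for x, t in examples])
        if prev_mse - mse < eps:                  # error stopped decreasing
            break
        prev_mse = mse
    return W1, W2, mse
```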
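For the mini project itself, one possible starting point (not the required approach) is scikit-learn's MLPClassifier. The sketch below assumes the mushroom data set has already been loaded and its categorical attributes encoded into a numeric feature matrix X and label vector y; loading and encoding are not shown, and the hidden-layer size, split ratios and iteration limit are placeholders to be tuned.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# X, y: encoded mushroom features and labels (loading/encoding not shown here).
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=0)

final_errors, val_scores, test_scores = [], [], []
for run in range(10):                                # 10 runs with different random initial weights
    net = MLPClassifier(hidden_layer_sizes=(20,),    # placeholder architecture to tune
                        activation='logistic',       # sigmoid units, as in the lecture
                        max_iter=500,
                        random_state=run)
    net.fit(X_train, y_train)
    final_errors.append(net.loss_)                   # final training loss of this run
    val_scores.append(net.score(X_val, y_val))       # validation accuracy
    test_scores.append(net.score(X_test, y_test))    # test accuracy

print("average final training error:", np.mean(final_errors))
print("average validation accuracy :", np.mean(val_scores))
print("average test accuracy       :", np.mean(test_scores))
```

Comparing these averages across different hidden_layer_sizes values is one way to search for the "best" architecture asked for in the project.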