Multilayer Perceptrons
CS679 Lecture Note by Jin Hyung Kim, Computer Science Department, KAIST

Multilayer Perceptron
- hidden layers of computation nodes
- input propagates in a forward direction, on a layer-by-layer basis
- also called Multilayer Feedforward Network (MLP)

Error Back-Propagation Algorithm
- a supervised, error-correction learning algorithm
- forward pass: an input vector is applied to the input nodes; its effect propagates through the network layer by layer with fixed synaptic weights
- backward pass: the synaptic weights are adjusted in accordance with an error signal, which propagates backward in a layer-by-layer fashion

Distinctive Characteristics
- non-linear, differentiable activation function: the sigmoidal (logistic) function $y_j = \frac{1}{1 + \exp(-v_j)}$; the nonlinearity prevents reduction to a single-layer perceptron
- one or more layers of hidden neurons, progressively extracting more meaningful features from the input patterns
- high degree of connectivity
- the nonlinearity and the high degree of connectivity make theoretical analysis difficult, and the learning process is hard to visualize
- BP is a landmark in neural networks: computationally efficient training

Preliminaries
- function signal: an input signal that comes in at the input end of the network and propagates forward to the output nodes
- error signal: originates at an output neuron and propagates backward to the input nodes
- two computations in training:
  - computation of the function signal
  - computation of an estimate of the gradient vector (the gradient of the error surface with respect to the weights)

Back-Propagation Algorithm
1. Initialization: randomize the weights to small values.
2. Presentation: apply a pattern to the input and calculate the network output.
3. Error computation: compare the output with the desired output and compute the error (difference).
4. Backward computation: backpropagate the error through the network and adjust the weights to minimize the error.
5. Iteration: repeat steps 2-4 until a desired error goal is reached.
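A minimal sketch of these five steps for a one-hidden-layer MLP with logistic activations and sequential (pattern-by-pattern) updates. The layer sizes, the learning rate, and the XOR-style toy data are illustrative choices, not prescribed by the notes:

```python
# Five-step back-propagation sketch for a one-hidden-layer MLP.
import numpy as np

rng = np.random.default_rng(0)

def logistic(v):
    return 1.0 / (1.0 + np.exp(-v))

# Step 1 -- Initialization: randomize the weights to small values.
n_in, n_hid, n_out = 2, 2, 1
W1 = rng.uniform(-0.5, 0.5, (n_hid, n_in + 1))   # +1 column for the bias
W2 = rng.uniform(-0.5, 0.5, (n_out, n_hid + 1))

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
D = np.array([[0], [1], [1], [0]], dtype=float)   # XOR targets
eta = 0.5

for epoch in range(10000):
    sq_err = 0.0
    for x, d in zip(X, D):
        # Step 2 -- Presentation: forward pass with fixed weights.
        y0 = np.append(x, 1.0)                    # input plus bias signal
        v1 = W1 @ y0
        y1 = np.append(logistic(v1), 1.0)
        v2 = W2 @ y1
        y2 = logistic(v2)

        # Step 3 -- Error computation: e_j(n) = d_j(n) - y_j(n).
        e = d - y2
        sq_err += 0.5 * np.sum(e ** 2)

        # Step 4 -- Backward computation: local gradients, then delta rule.
        delta2 = e * y2 * (1.0 - y2)              # output-node gradient
        delta1 = y1[:-1] * (1.0 - y1[:-1]) * (W2[:, :-1].T @ delta2)
        W2 += eta * np.outer(delta2, y1)
        W1 += eta * np.outer(delta1, y0)

    # Step 5 -- Iteration: repeat until a desired error goal is reached.
    if sq_err < 1e-3:
        break
```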
Back-Propagation Algorithm
- error signal for neuron j at iteration n: $e_j(n) = d_j(n) - y_j(n)$
- total error energy, where C is the set of output nodes: $E(n) = \frac{1}{2} \sum_{j \in C} e_j^2(n)$
- average squared error energy, averaged over all N training samples and used as the cost function measuring learning performance: $E_{av} = \frac{1}{N} \sum_{n=1}^{N} E(n)$
- objective of the learning process: adjust the NN parameters (synaptic weights) to minimize $E_{av}$
- weights are updated on a pattern-by-pattern basis until one epoch, i.e., one complete presentation of the entire training set

BPA
- induced local field: $v_j(n) = \sum_{i=0}^{m} w_{ji}(n)\, y_i(n)$; output of neuron j: $y_j(n) = \varphi_j(v_j(n))$
- the gradient $\frac{\partial E(n)}{\partial w_{ji}(n)}$ is a sensitivity factor determining the direction of search in weight space; by the chain rule,
  $\frac{\partial E(n)}{\partial w_{ji}(n)} = \frac{\partial E(n)}{\partial e_j(n)} \frac{\partial e_j(n)}{\partial y_j(n)} \frac{\partial y_j(n)}{\partial v_j(n)} \frac{\partial v_j(n)}{\partial w_{ji}(n)}$
  with $\frac{\partial E(n)}{\partial e_j(n)} = e_j(n)$, $\frac{\partial e_j(n)}{\partial y_j(n)} = -1$, $\frac{\partial y_j(n)}{\partial v_j(n)} = \varphi'_j(v_j(n))$, $\frac{\partial v_j(n)}{\partial w_{ji}(n)} = y_i(n)$

Gradient Descent
- therefore $\frac{\partial E(n)}{\partial w_{ji}(n)} = -e_j(n)\, \varphi'_j(v_j(n))\, y_i(n)$
- by the delta rule, which is gradient descent in weight space: $\Delta w_{ji}(n) = -\eta \frac{\partial E(n)}{\partial w_{ji}(n)} = \eta\, \delta_j(n)\, y_i(n)$
- local gradient: $\delta_j(n) = -\frac{\partial E(n)}{\partial v_j(n)} = e_j(n)\, \varphi'_j(v_j(n))$

Local Gradient (I)
- neuron j is an output node: $e_j(n) = d_j(n) - y_j(n)$ is directly available
- neuron j is a hidden node: the credit assignment problem — how to determine its share of responsibility for the errors at the output neurons k
  $\delta_j(n) = -\frac{\partial E(n)}{\partial y_j(n)}\, \varphi'_j(v_j(n))$, where $E(n) = \frac{1}{2} \sum_{k \in C} e_k^2(n)$, so
  $\frac{\partial E(n)}{\partial y_j(n)} = \sum_k e_k(n) \frac{\partial e_k(n)}{\partial y_j(n)} = \sum_k e_k(n) \frac{\partial e_k(n)}{\partial v_k(n)} \frac{\partial v_k(n)}{\partial y_j(n)}$

Local Gradient (II)
- error at neuron k: $e_k(n) = d_k(n) - y_k(n) = d_k(n) - \varphi_k(v_k(n))$, hence $\frac{\partial e_k(n)}{\partial v_k(n)} = -\varphi'_k(v_k(n))$
- since $v_k(n) = \sum_{j=0}^{m} w_{kj}(n)\, y_j(n)$, the desired partial derivative is $\frac{\partial v_k(n)}{\partial y_j(n)} = w_{kj}(n)$
- therefore $\frac{\partial E(n)}{\partial y_j(n)} = -\sum_k e_k(n)\, \varphi'_k(v_k(n))\, w_{kj}(n) = -\sum_k \delta_k(n)\, w_{kj}(n)$
- back-propagation formula for hidden neuron j: $\delta_j(n) = \varphi'_j(v_j(n)) \sum_k \delta_k(n)\, w_{kj}(n)$

BP Summary
- weight correction = (learning-rate parameter) × (local gradient) × (input signal to neuron j): $\Delta w_{ji}(n) = \eta\, \delta_j(n)\, y_i(n)$
- forward pass: compute $v_j(n) = \sum_{i=0}^{m} w_{ji}(n)\, y_i(n)$ and $y_j(n) = \varphi_j(v_j(n))$ with fixed weights
- backward pass: compute the local gradients recursively from the output layer toward the input layer; change the synaptic weights by the delta rule

Activation Function (logistic function)
- $\varphi_j(v_j(n)) = \frac{1}{1 + \exp(-a\, v_j(n))}$, with $a > 0$ and $-\infty < v_j(n) < \infty$
- $\varphi'_j(v_j(n)) = \frac{a \exp(-a\, v_j(n))}{[1 + \exp(-a\, v_j(n))]^2} = a\, y_j(n)\,[1 - y_j(n)]$
- local gradient for an output node: $\delta_j(n) = e_j(n)\, \varphi'_j(v_j(n)) = a\,[d_j(n) - o_j(n)]\, o_j(n)\,[1 - o_j(n)]$
- for a hidden node: $\delta_j(n) = \varphi'_j(v_j(n)) \sum_k \delta_k(n)\, w_{kj}(n) = a\, y_j(n)\,[1 - y_j(n)] \sum_k \delta_k(n)\, w_{kj}(n)$

Activation Function (hyperbolic tangent function)
- $\varphi_j(v_j(n)) = a \tanh(b\, v_j(n))$, with $(a, b) > 0$
- $\varphi'_j(v_j(n)) = ab\, \mathrm{sech}^2(b\, v_j(n)) = ab\,[1 - \tanh^2(b\, v_j(n))] = \frac{b}{a}\,[a - y_j(n)]\,[a + y_j(n)]$
- local gradient for an output node: $\delta_j(n) = \frac{b}{a}\,[d_j(n) - o_j(n)]\,[a - o_j(n)]\,[a + o_j(n)]$
- for a hidden node: $\delta_j(n) = \frac{b}{a}\,[a - y_j(n)]\,[a + y_j(n)] \sum_k \delta_k(n)\, w_{kj}(n)$
- (a numerical check of these two derivative identities follows at the end of this section)

Momentum Term
- BP approximates the trajectory of steepest descent; a smaller learning-rate parameter makes a smoother path
- momentum increases the rate of learning while avoiding the danger of instability:
  $\Delta w_{ji}(n) = \alpha\, \Delta w_{ji}(n-1) + \eta\, \delta_j(n)\, y_i(n)$, where $\alpha$ is the momentum constant
- solving this difference equation gives $\Delta w_{ji}(n) = \eta \sum_{t=0}^{n} \alpha^{n-t}\, \delta_j(t)\, y_i(t) = -\eta \sum_{t=0}^{n} \alpha^{n-t} \frac{\partial E(t)}{\partial w_{ji}(t)}$
- if $0 \le |\alpha| < 1$:
  - when the partial derivative has the same sign on consecutive iterations, $\Delta w_{ji}(n)$ grows in magnitude — accelerating descent
  - when it has the opposite sign, $\Delta w_{ji}(n)$ shrinks — a stabilizing effect
- additional benefit: prevents the learning process from terminating in a shallow local minimum
- (a sketch of this update also follows below)
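The two derivative identities above are easy to confirm numerically. In this check the values a = 1.7159 and b = 2/3 (a pair commonly quoted for the tanh activation) and the test point v are arbitrary; any positive constants would do:

```python
# Numerical check of the derivative identities:
#   logistic: phi'(v) = a * y * (1 - y),        y = 1/(1 + exp(-a v))
#   tanh:     phi'(v) = (b/a) * (a - y)(a + y), y = a * tanh(b v)
import numpy as np

a, b, v = 1.7159, 2.0 / 3.0, 0.8
h = 1e-6

y_log = 1 / (1 + np.exp(-a * v))
num = (1/(1 + np.exp(-a*(v+h))) - 1/(1 + np.exp(-a*(v-h)))) / (2*h)
print(num, a * y_log * (1 - y_log))                # the two should agree

y_tanh = a * np.tanh(b * v)
num = (a*np.tanh(b*(v+h)) - a*np.tanh(b*(v-h))) / (2*h)
print(num, (b/a) * (a - y_tanh) * (a + y_tanh))    # the two should agree
```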
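And a sketch of the generalized delta rule with the momentum term; the function name, shapes, and constants are illustrative:

```python
# Generalized delta rule with momentum:
#   dW(n) = alpha * dW(n-1) + eta * delta_j(n) * y_i(n)
# Unrolled, dW(n) = eta * sum_t alpha^(n-t) delta_j(t) y_i(t), so a
# persistent gradient sign grows the step while an alternating sign
# shrinks it, as stated above.
import numpy as np

eta, alpha = 0.1, 0.9   # learning-rate and momentum constants

def momentum_update(W, dW_prev, delta, y):
    """One iteration of the update; dW must be carried to the next call."""
    dW = alpha * dW_prev + eta * np.outer(delta, y)
    return W + dW, dW
```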
Mode of Training
- epoch: one complete presentation of the training data; randomize the order of presentation for each epoch
- sequential mode: for each training sample, the synaptic weights are updated
  - requires less storage
  - converges much faster, particularly when the training data is redundant
  - random order of presentation makes trapping at a local minimum less likely
- batch mode: at the end of one epoch, the synaptic weights are updated:
  $\Delta w_{ji} = -\eta \frac{\partial E_{av}}{\partial w_{ji}} = -\frac{\eta}{N} \sum_{n=1}^{N} e_j(n) \frac{\partial e_j(n)}{\partial w_{ji}(n)}$
  - may be more robust with outliers

Stopping Criteria
- there are no well-defined stopping criteria; terminate when:
  - the gradient vector g(w) = 0 (located at a local or global minimum)
  - the error measure is stationary
  - the NN's generalization performance is adequate

XOR Problem
- McCulloch-Pitts model (threshold units): two hidden neurons and one output neuron
  - hidden neuron 1: input weights +1, +1, bias -1.5 (fires only when both inputs are on: an AND detector)
  - hidden neuron 2: input weights +1, +1, bias -0.5 (fires when either input is on: an OR detector)
  - output neuron: weight -2 from neuron 1, weight +1 from neuron 2, bias -0.5
- a verification sketch follows this section

Heuristics for Making BP Better (I)
- training with BP is more an art than a science; these heuristics are the result of one's own experience
- sequential vs. batch update
- maximizing information content: present the examples with the largest training error, and examples radically different from previous ones
- randomize the order of presentation so that successive examples rarely belong to the same class
- activation function: an antisymmetric function learns faster, e.g., $\varphi(v) = a \tanh(bv)$
- target values should be within the range of the sigmoid activation function, offset by some $\epsilon$ from the limiting values

Heuristics for Making BP Better (II)
- normalizing the inputs: preprocess so that the mean value is close to zero; input variables should be uncorrelated (by principal component analysis) and scaled so that their covariances are equal (Fig. 4.11)
- weight initialization:
  - large weight values lead to saturation, where the local gradient is small and learning is slow
  - small weight values make the neurons operate on the flat area around the origin — also slow learning
  - choose somewhere between the two extremes; for the hyperbolic tangent activation function, set $\mu = 0$, $\sigma^2 = 1/m$ (m: the number of synaptic connections of a neuron)

Heuristics for Making BP Better (III)
- learning from hints: prior information should be included in the learning process (invariance properties, symmetries, etc.)
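The verification sketch for the McCulloch-Pitts XOR network above, taking the convention that a threshold unit fires when its induced local field is non-negative:

```python
# The McCulloch-Pitts XOR network with the weights and biases listed above.
def step(v):
    return 1 if v >= 0 else 0

def xor_net(x1, x2):
    h1 = step(x1 + x2 - 1.5)          # AND detector (bias -1.5)
    h2 = step(x1 + x2 - 0.5)          # OR detector  (bias -0.5)
    return step(-2 * h1 + h2 - 0.5)   # "OR but not AND" = XOR

for x in ((0, 0), (0, 1), (1, 0), (1, 1)):
    print(x, '->', xor_net(*x))       # prints 0, 1, 1, 0
```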
Learning Rates
- ideally, all the neurons should learn at the same rate; the last layer has large local gradients (by the limiting effect) and therefore learns fast, so the $\eta$ of the last layer should be assigned a smaller value
- LeCun's suggestion: the learning rate should be inversely proportional to the square root of the number of synaptic connections ($\eta \propto m^{-1/2}$)

Output Representation and Decision Rule
- for an M-class pattern classification problem, the desired response vector is one-hot, $d = [0, \ldots, 0, 1, 0, \ldots, 0]^t$: the kth element is 1 if the input vector x belongs to $C_k$, and 0 otherwise
- the conditional expectation of the desired response vector equals the posterior class probability $P(C_k \mid x)$, k = 1, 2, ..., M
- Bayes classification rule: classify x to class $C_k$ if $P(C_k \mid x) > P(C_j \mid x)$ for all $j \ne k$
- approximated Bayes rule by MLP: classify x to class $C_k$ if $F_k(x) > F_j(x)$ for all $j \ne k$
- multiple class assignment: classify x to class $C_k$ if $F_k(x) > t$

Computer Experiment
- Bayesian decision: likelihood ratio $\Lambda(x) = \frac{p(x \mid C_1)}{p(x \mid C_2)}$; the decision boundary is where $\Lambda(x) = \frac{P(C_2)}{P(C_1)}$
- probability of error: $P_e = P(C_1)\,P(e \mid C_1) + P(C_2)\,P(e \mid C_2)$
- optimal number of hidden neurons: a small mean-square error does not necessarily imply good generalization
- number of training samples, from the Chernoff bound (worked out in the first sketch following this section):
  $P(|p_N - p| > \epsilon) \le 2 \exp(-2 \epsilon^2 N)$; $\epsilon = 0.01$, $\delta = 0.01$ yields N = 26,500, so N was picked as 32,000
- optimal learning and momentum constants:
  - a small learning rate results in slower convergence but locates a deeper local minimum
  - increasing the momentum constant results in faster learning with a small learning rate
  - with a large learning rate, a small momentum constant is required to ensure learning stability (Figures 4.15, 4.16)
- decision boundary by the BPA (Figure 4.17): the decision boundaries are convex

Feature Detection
- hidden neurons act as feature detectors: as learning progresses, they gradually discover the "salient" features that characterize the training data — a nonlinear transformation of the input data onto a feature space
- with linear output nodes and the one-from-M coding scheme, the MLP maximizes a discriminant function that is the trace of the product of two matrices: the weighted between-class covariance matrix and the pseudo-inverse of the total covariance matrix
- close resemblance to Fisher's linear discriminant (Duda and Hart, pp. 114-118)

Fisher's Linear Discriminant
- aim: reduction of the dimensionality of the feature space; project d-dimensional data onto a line so that the classes are well separated
- samples $x_1, \ldots, x_n$; $n_1$ of class $\omega_1$, $n_2$ of class $\omega_2$; $m_i$ denotes the d-dimensional sample mean of class i
- a linear combination of the components of x gives the scalars $y_i = w^t x_i$, divided into subsets $Y_1$ and $Y_2$; $y_i$ is the projection of $x_i$ onto a line in the direction of w
- measure of separation: $|\tilde m_1 - \tilde m_2| = |w^t (m_1 - m_2)|$; scatter: $\tilde s_i^2 = \sum_{y \in Y_i} (y - \tilde m_i)^2$
- criterion function: $J(w) = \frac{|\tilde m_1 - \tilde m_2|^2}{\tilde s_1^2 + \tilde s_2^2}$
- Fisher's linear discriminant is the linear function $w^t x$ for which $J(w)$ is maximum
- scatter matrices: define $S_i = \sum_{x \in X_i} (x - m_i)(x - m_i)^t$ and the within-class scatter $S_W = S_1 + S_2$; then
  $\tilde s_i^2 = \sum_{x \in X_i} (w^t x - w^t m_i)^2 = \sum_{x \in X_i} w^t (x - m_i)(x - m_i)^t w = w^t S_i w$
- with the between-class scatter $S_B = (m_1 - m_2)(m_1 - m_2)^t$, rewrite $J(w) = \frac{w^t S_B w}{w^t S_W w}$
- the vector w that maximizes J must satisfy $S_W^{-1} S_B w = \lambda w$; since $S_B w$ is always in the direction of $m_1 - m_2$,
  $w = S_W^{-1}(m_1 - m_2)$
- this gives the maximum ratio of between-class scatter to within-class scatter (exercised in the second sketch following this section)
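The Chernoff-bound sample size quoted above can be checked directly by solving $2 \exp(-2 \epsilon^2 N) \le \delta$ for N:

```python
# Chernoff bound: P(|p_N - p| > eps) <= 2 exp(-2 eps^2 N).
# Requiring the right-hand side to be at most delta gives
#   N >= ln(2 / delta) / (2 eps^2).
import math

eps = delta = 0.01
N = math.log(2 / delta) / (2 * eps ** 2)
print(N)   # ~26,492, i.e. roughly 26,500; the experiment used N = 32,000
```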
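And a sketch of the closed-form discriminant $w = S_W^{-1}(m_1 - m_2)$ derived above; the two Gaussian clouds are made-up data purely to exercise the formula:

```python
# Fisher's linear discriminant: w = S_W^{-1} (m1 - m2).
import numpy as np

rng = np.random.default_rng(0)
X1 = rng.normal([0, 0], 1.0, (100, 2))     # class 1 samples (illustrative)
X2 = rng.normal([3, 2], 1.0, (100, 2))     # class 2 samples (illustrative)

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
S1 = (X1 - m1).T @ (X1 - m1)               # scatter matrix of class 1
S2 = (X2 - m2).T @ (X2 - m2)
SW = S1 + S2                               # within-class scatter

w = np.linalg.solve(SW, m1 - m2)           # direction maximizing J(w)
y1, y2 = X1 @ w, X2 @ w                    # projections onto the line

J = (y1.mean() - y2.mean())**2 / (((y1 - y1.mean())**2).sum()
                                  + ((y2 - y2.mean())**2).sum())
print('criterion J(w) =', J)
```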
Generalization
- the input-output mapping is correct for data never seen before
- the learning process is curve fitting — a non-linear mapping
- overfitting (overtraining): the network memorizes the training data rather than its essence; it learns idiosyncrasies and noise, and loses the ability to generalize
- Occam's Razor: find the simplest function among those which satisfy the given conditions — the smoothest function (Figure 4.19)

Training Set Size for Generalization
- generalization is influenced by the size of the training set and by the architecture of the neural network
- given an architecture, determine the size of the training set needed for good generalization; given a set of training samples, determine the best architecture for good generalization
- the VC dimension provides the theoretical basis: $N = O(W / \epsilon)$

Approximation of Functions
- a non-linear input-output mapping from an $m_0$-dimensional input space to an $m_L$-dimensional output space
- what is the minimum number of hidden layers in an MLP that can approximate any continuous mapping?

Universal Approximation Theorem
- asserts the existence of an approximation of an arbitrary continuous function: a single hidden layer is sufficient for an MLP to compute a uniform $\epsilon$-approximation to a given training set
- it does not say that a single layer is optimum in the sense of training time, ease of implementation, or generalization

Bound of Approximation Errors
- for a single-hidden-layer NN: the larger the number of hidden nodes, the more accurate the approximation; the smaller the number of hidden nodes, the more accurate the empirical fit

Curse of Dimensionality
- for good generalization, $N > m_0 m_1 / \epsilon = W / \epsilon$, where W is the total number of synaptic weights
- we need dense sample points to learn the mapping well, but dense samples are hard to find in high dimensions: complexity grows exponentially as the dimensionality increases

Practical Consideration
- single hidden layer vs. double (multiple) hidden layers
- a single-hidden-layer NN is good for approximating any continuous function; a double-hidden-layer NN may sometimes be better
- in a double (multiple) hidden-layer network: the first hidden layer performs local feature detection, the second hidden layer global feature detection

Cross-Validation
- validate the learned model on a different set to assess generalization performance, guarding against overfitting
- partition the training set into an estimation subset and a validation subset
- cross-validation is used for best model selection and to determine when to stop training

Model Selection
- choose the MLP with the best number of free parameters, given N training samples
- the issue is to choose the ratio r that determines the split of the training set between the estimation set and the validation set, so as to minimize the classification error of the model trained on the estimation set when tested on the validation set
- Kearns (1996), qualitative properties of the optimum r from a VC-dimension analysis:
  - for problems of small complexity (target-function complexity small compared to N), the performance of cross-validation is insensitive to r
  - a single fixed r is nearly optimal for a wide range of target functions
  - suggests a 0.2 validation share: 80% of the training set is the estimation set

Stopping Method of Training
- the right time to stop training so as to avoid overfitting
- early stopping method: after some period of training, with the synaptic weights fixed, compute the validation error; then resume training; repeat (a skeleton of this procedure follows this section)
- [Figure: mean squared error vs. number of epochs for the training and validation samples; the early-stopping point is where the validation curve begins to rise]
- Amari (1996), with r the fraction of the training set assigned to the estimation subset:
  - for N < 30W, overfitting occurs and early stopping improves generalization, with
    $r_{opt} = 1 - \frac{\sqrt{2W - 1} - 1}{2(W - 1)} \approx 1 - \frac{1}{\sqrt{2W}}$ for large W
  - example: W = 100 gives 93% for estimation, 7% for validation (computed below)
  - for N > 30W, the improvement from early stopping is small
- leave-one-out method
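A skeleton of the early-stopping procedure described above: train for a period, freeze the weights to compute the validation error, keep the best weights seen, and stop once the validation error keeps rising. Here `train_one_epoch` and `validation_error` are hypothetical placeholders standing in for the BP pass over the estimation subset and the frozen-weight evaluation; the period and patience values are illustrative:

```python
# Early-stopping training skeleton.
import copy

def early_stopping_train(net, train_one_epoch, validation_error,
                         period=5, patience=3, max_epochs=1000):
    best_err, best_net, strikes = float('inf'), copy.deepcopy(net), 0
    for epoch in range(max_epochs):
        train_one_epoch(net)                  # estimation subset only
        if (epoch + 1) % period == 0:         # pause training, weights fixed
            err = validation_error(net)
            if err < best_err:
                best_err, best_net, strikes = err, copy.deepcopy(net), 0
            else:
                strikes += 1                  # validation curve rising
                if strikes >= patience:
                    break                     # early-stopping point reached
    return best_net, best_err
```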
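Amari's formula, as reconstructed above, can be evaluated directly; it reproduces the quoted 93%/7% split at W = 100:

```python
# Amari's optimum split, with r the estimation-subset fraction:
#   r_opt = 1 - (sqrt(2W - 1) - 1) / (2 (W - 1))  ~  1 - 1/sqrt(2W).
import math

def r_opt(W):
    return 1 - (math.sqrt(2 * W - 1) - 1) / (2 * (W - 1))

W = 100
print(r_opt(W))                    # ~0.93: 93% estimation, 7% validation
print(1 - 1 / math.sqrt(2 * W))    # large-W approximation, ~0.929
```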
Network Pruning
- minimizing the size of the network improves generalization: a smaller network is less likely to learn idiosyncrasies or noise
- two approaches: network growing, and network pruning (weakening or eliminating synaptic weights)

Complexity Regularization
- a tradeoff between the reliability of the training data and the goodness of the model
- supervised learning by minimizing the risk function $R(w) = E_s(W) + \lambda E_c(W)$, where $E_s(W)$ is the standard performance measure and $E_c(W)$ is the complexity penalty

Weight Decay
- $E_c(W) = \|W\|^2 = \sum_i w_i^2$
- some weights are forced to take values near zero; the weights in the network are grouped into two categories: those of large influence, and those of little or no influence (the excess weights)

Weight Elimination
- $E_c(W) = \sum_i \frac{(w_i / w_0)^2}{1 + (w_i / w_0)^2}$
- when $|w_i| \ll w_0$, the weight is eliminated

Approximate Smoother
- a further complexity-regularization approach

Hessian-Based Network Pruning
- identify the parameters whose deletion will cause the least increase in $E_{av}$, via the Taylor series
  $\Delta E_{av} = E_{av}(w + \Delta w) - E_{av}(w) = g^t(w)\, \Delta w + \frac{1}{2} \Delta w^t H\, \Delta w + O(\|\Delta w\|^3)$
- parameters are deleted after the training process has converged ($g \approx 0$), giving the quadratic approximation
  $\Delta E_{av} \approx \frac{1}{2} \Delta w^t H\, \Delta w$
- eliminating weight $w_i$ imposes the constraint $\mathbf{1}_i^t \Delta w + w_i = 0$; solving this constrained optimization problem gives
  $\Delta w = -\frac{w_i}{[H^{-1}]_{i,i}}\, H^{-1} \mathbf{1}_i$
- if $[H^{-1}]_{i,i}$ is small, even a small weight is important

Optimal Brain Surgeon
- saliency of $w_i$, representing the increase in the mean-squared error from deleting $w_i$:
  $S_i = \frac{w_i^2}{2\,[H^{-1}]_{i,i}}$
- the OBS procedure deletes the weights of small saliency; it requires computation of the inverse of the Hessian
- Optimal Brain Damage makes the additional assumption that the Hessian matrix is diagonal (see the first sketch below)

Accelerated Convergence
Heuristics (illustrated in the second sketch below):
1. Every adjustable weight should have its own learning-rate parameter.
2. The learning-rate parameters should be allowed to vary from one iteration to the next.
3. If the sign of the derivative is the same for several iterations, the learning-rate parameter should be increased — applying the momentum idea to the learning rates themselves.
4. If the sign of the derivative alternates for several iterations, the learning-rate parameter should be decreased.
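Under the OBD diagonal-Hessian assumption, $[H^{-1}]_{i,i} = 1 / H_{i,i}$, so the saliency simplifies to $S_i = w_i^2 H_{i,i} / 2$. A small sketch ranking weights for pruning; the weight vector and Hessian diagonal are made-up inputs:

```python
# OBD saliencies with a diagonal Hessian: S_i = w_i^2 * H_ii / 2.
import numpy as np

def obd_prune_order(w, h_diag):
    """Return weight indices ordered from least to most salient."""
    saliency = 0.5 * w ** 2 * h_diag
    return np.argsort(saliency)

w = np.array([0.8, -0.05, 0.3, 1.2])
h_diag = np.array([0.5, 2.0, 0.1, 0.4])   # stand-in Hessian diagonal
print(obd_prune_order(w, h_diag))         # least salient weights come first
```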
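And a sketch of the four heuristics: a separate learning rate per weight, increased when the gradient sign persists across iterations and decreased when it alternates. The up/down factors are illustrative choices, not values from the notes:

```python
# Per-weight adaptive learning rates (heuristics 1-4 above).
import numpy as np

def adaptive_rate_step(W, grad, prev_grad, rates, up=1.1, down=0.5):
    same_sign = np.sign(grad) == np.sign(prev_grad)
    rates = np.where(same_sign, rates * up, rates * down)  # rules 3 and 4
    return W - rates * grad, rates    # each weight uses its own rate (1, 2)
```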