Neural Networks
EFREI 2010
Laurent Orseau ([email protected])
AgroParisTech, based on slides by Antoine Cornuejols

Plan
1. Introduction
2. The perceptron
3. The multi-layer perceptron (MLP)
4. Learning in MLPs
5. Computational aspects
6. Methodological aspects of learning
7. Applications
8. Developments and perspectives
9. Conclusions

Introduction: Why neural networks?
• Biological inspiration: the natural brain is a very seductive model
  – Robust and fault tolerant
  – Flexible, easily adaptable
  – Can work with incomplete, uncertain, noisy data
  – Massively parallel
  – Can learn
• Neurons
  – ≈ 10^11 neurons in the human brain
  – ≈ 10^4 connections (synapses + axons) per neuron
  – Action potential / refractory period / neurotransmitters
  – Excitatory / inhibitory signals

Introduction: Why neural networks?
• Some properties
  – Parallel computation
  – Directly implementable on dedicated circuits
  – Robust and fault tolerant (distributed representation)
  – Simple algorithms
  – Very general
• Some defects
  – Opacity of the acquired knowledge

Historical notes (briefly)
• Premises
  – McCulloch & Pitts (1943): first formal neuron model; neuron and logical calculus, a basis of artificial intelligence
  – Hebb rule (1949): learning by reinforcing synaptic coupling
• First realizations
  – ADALINE (Widrow & Hoff, 1960)
  – PERCEPTRON (Rosenblatt, 1958-1962)
  – Analysis of Minsky & Papert (1969)
• New models
  – Kohonen (competitive learning), ...
  – Hopfield (1982) (recurrent nets)
  – Multi-layer perceptron (1985)
• Analysis and developments
  – Control theory, generalization (Vapnik), ...

The perceptron
Rosenblatt (1958-1962)

Linear discrimination: the perceptron [Rosenblatt, 1957, 1962]
• Decision function (input nodes and a bias node feeding one output node):
  y(x) = sign(w^T x + w_0)

Linear discrimination: the perceptron
• Geometry, 2 classes (figure): the weights define a separating hyperplane.
• Geometry, multiclass
  – Discrimination of one class against all others: one discriminant function per class, which leaves an ambiguous region.
  – Discrimination between pairs of classes: N(N-1)/2 discriminant functions.

The perceptron: performance criterion
• Optimization criterion (error function):
  – Total number of classification errors: NO (no usable gradient).
  – Perceptron criterion: for every training pattern we want w^T x_l u_l > 0, so we minimize
    E_P(w) = - Σ w^T x_l u_l, the sum running over the wrongly classified examples (those with w^T x_l u_l < 0).
  – Proportional to the distance to the decision surface (for all wrongly classified examples).
  – Piecewise linear and continuous function.

Direct learning: pseudo-inverse method
• A direct solution (pseudo-inverse method) requires:
  – knowledge of all pairs (x_l, u_l)
  – a matrix inversion (often ill-conditioned)
  – (and only applies to a linear network with a quadratic error function)
• Hence the need for an iterative method without matrix inversion: gradient descent.

The perceptron: algorithm
• Exploration method of H: gradient search
  – Minimization of the error function
  – Principle, in the spirit of the Hebb rule: modify a connection proportionally to its input and output
  – Learn only on a classification error
• Algorithm:
  – if the example is correctly classified: do nothing
  – otherwise: w(t+1) = w(t) + η x_l u_l
  – loop over all training examples until a stopping criterion is met
• Convergence?
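A minimal sketch of this update rule in Python (NumPy), assuming labels u_l in {-1, +1}, a bias handled by appending a constant input 1, and an illustrative learning rate and toy data set (none of these choices come from the slides):

import numpy as np

def train_perceptron(X, u, eta=1.0, max_epochs=100):
    """Perceptron learning rule: update only on misclassified examples."""
    X = np.hstack([X, np.ones((X.shape[0], 1))])   # append bias input x_0 = 1
    w = np.zeros(X.shape[1])
    for _ in range(max_epochs):
        errors = 0
        for x_l, u_l in zip(X, u):
            if u_l * (w @ x_l) <= 0:                # wrongly classified (or on the boundary)
                w += eta * u_l * x_l                # w(t+1) = w(t) + eta * u_l * x_l
                errors += 1
        if errors == 0:                             # stopping criterion: no more mistakes
            break
    return w

# Toy linearly separable problem (AND-like, labels in {-1, +1})
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
u = np.array([-1., -1., -1., 1.])
w = train_perceptron(X, u)
print(w, np.sign(np.hstack([X, np.ones((4, 1))]) @ w))

On a linearly separable problem such as this one, the loop stops as soon as every training example satisfies w^T x_l u_l > 0, as guaranteed by the perceptron convergence theorem mentioned on the next slide.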
The perceptron: convergence, memory capacity
• Questions
  – What can be learned? Result from [Minsky & Papert, 1969]: only linear separators.
  – Convergence guarantees? Perceptron convergence theorem [Rosenblatt, 1962].
  – Reliability of learning and number of examples: how many examples do we need to have some guarantee about what is learned?

Expressive power: linear separations
(figures: problems that are, and are not, linearly separable)

The multi-layer perceptron
• Usual topology: an input layer, a hidden layer and an output layer.
  Input: x_k; output: y_k; desired output: u_k. The signal flows from the input layer to the output layer.

The multi-layer perceptron: propagation
• For each neuron k:
  a_k = Σ_{j=0..d} w_jk z_j     (activation of node k)
  y_k = g(a_k)
  where w_jk is the weight of the connection from node j to node k and g is the activation function.
• Usual activation functions: threshold function, ramp function, radial basis function, and the sigmoid
  g(a) = 1 / (1 + e^{-a}),  whose derivative is g'(a) = g(a)(1 - g(a)).
  (figures: output z_i as a function of the activation a_i for each of these functions)

The multi-layer perceptron: the XOR example
• A two-layer network of threshold units computes XOR, for instance with hidden nodes A and B and output node C:
  – A: weights (1, 1) on (x1, x2), bias -0.5   (A fires if x1 OR x2)
  – B: weights (1, 1) on (x1, x2), bias -1.5   (B fires if x1 AND x2)
  – C: weight 1 from A, weight -1 from B, bias -0.5   (C = A AND NOT B = XOR)
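A quick check of this construction in Python with threshold units. The assignment of the slide's weights to the two hidden nodes is an assumption made for the sketch (swapping A and B does not change the result):

import numpy as np

def step(a):
    """Threshold activation: 1 if a > 0, else 0."""
    return (a > 0).astype(float)

# Hidden layer: one row of weights per unit (A, B), plus biases.
W_hidden = np.array([[1., 1.],    # A: "x1 OR x2"
                     [1., 1.]])   # B: "x1 AND x2"
b_hidden = np.array([-0.5, -1.5])
w_out = np.array([1., -1.])       # C: +1 from A, -1 from B
b_out = -0.5

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    x = np.array([x1, x2], dtype=float)
    z = step(W_hidden @ x + b_hidden)   # hidden activities (A, B)
    y = step(w_out @ z + b_out)         # output C
    print(f"x = ({x1}, {x2})  ->  y = {int(y)}")   # prints 0, 1, 1, 0 (XOR)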
Example of a network (JavaNNS)
(figure)

The MLP: learning
• Find weights such that the network realizes an input-output mapping consistent with the given examples (the same old generalization problem).
• Learning: minimize a loss function E(w, {x_l, u_l}) as a function of w, using a gradient descent method:
  Δw_ij = -η ∂E/∂w_ij     (gradient back-propagation algorithm)
• Inductive principle: we suppose that what works on the training examples (empirical risk minimization) should also work on test (unseen) examples (real risk minimization).

Learning: gradient descent
• Learning = searching the multidimensional parameter space (the synaptic weights) to minimize the loss function.
• Almost all learning rules are gradient descent methods.
• Optimal solution w* such that ∇E(w*) = 0, with ∇E = (∂E/∂w_1, ..., ∂E/∂w_N)^T.
• Update: w_ij(t+1) = w_ij(t) - η ∂E/∂w_ij.

The multi-layer perceptron: learning
• Goal: w* = argmin_w (1/m) Σ_{l=1..m} [y(x_l; w) - u(x_l)]^2
• Algorithm (gradient back-propagation): iterative gradient descent, w(t+1) = w(t) - η ∇E(w(t)).
  – Off-line case (total gradient):
    w_ij(t) = w_ij(t-1) - η(t) (1/m) Σ_{k=1..m} ∂RE(x_k, w)/∂w_ij,   with RE(x_k, w) = [u_k - f(x_k, w)]^2
  – On-line case (stochastic gradient):
    w_ij(t) = w_ij(t-1) - η(t) ∂RE(x_k, w)/∂w_ij

The multi-layer perceptron: learning (procedure)
1. Take one example from the training set
2. Compute the output state of the network
3. Compute the error = fct(output - desired output), e.g. (y_l - u_l)^2
4. Compute the gradients, with the gradient back-propagation algorithm
5. Modify the synaptic weights
6. Stopping criterion (based on the global error, the number of examples, etc.)
7. Go back to 1

MLP: gradient back-propagation
• The problem: determine the responsibilities ("credit assignment problem"): which connection is responsible for the error E, and by how much?
• Principle: compute the error on a connection as a function of the error on the next layer.
• Two steps:
  1. Evaluation of the derivative of the error with respect to the weights
  2. Use of these derivatives to compute the modification of each weight

MLP: gradient back-propagation
1. Evaluation of the derivative of the error E_l due to each connection:
   Idea: compute the error on connection w_ij as a function of the errors on the next layer.
   ∂E_l/∂w_ij = (∂E_l/∂a_j)(∂a_j/∂w_ij) = -δ_j z_i,   writing δ_j = -∂E_l/∂a_j for the error attached to node j.
   – For nodes of the output layer: δ_k = g'(a_k) (u_k(x_l) - y_k)
   – For nodes of a hidden layer: δ_j = g'(a_j) Σ_k w_jk δ_k   (sum over the nodes k of the next layer)
   (Notation: a_i = activation of node i, z_i = output of node i, δ_i = error attached to node i.)

MLP: gradient back-propagation
2. Modification of the weights, with a gradient step η(t) (constant or not):
   – Stochastic learning (after the presentation of each example): Δw_ji(t) = η δ_j z_i
   – Batch learning (after the presentation of the whole set of examples): Δw_ji = η Σ_n δ_j^n z_i^n

MLP: forward and backward passes (summary)
• Forward pass, for a network with d inputs (plus a bias x_0) and k neurons on the hidden layer:
  a_i(x) = Σ_{j=0..d} w_ji x_j,   y_i(x) = g(a_i(x)),   y_s(x) = g(Σ_{j=0..k} w_js y_j(x))
• Backward pass:
  δ_s = g'(a_s) (u_s - y_s),                         w_is(t+1) = w_is(t) + η(t) δ_s y_i
  δ_j = g'(a_j) Σ_s w_js δ_s  (s over the next layer),   w_ej(t+1) = w_ej(t) + η(t) δ_j x_e
  (a code sketch of both passes is given at the end of this part)

MLP: gradient back-propagation
• Learning efficiency
  – O(|w|) per learning pass, where |w| = number of weights
  – Usually several hundred passes (see below)
  – And learning must typically be repeated several dozen times with different initial random weights
• Recognition efficiency
  – Real-time use is possible
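A compact NumPy sketch of these two passes for a network with one sigmoid hidden layer, using the on-line (stochastic) update and the conventions above (δ = error attached to a node, η = learning rate). The architecture, learning rate, number of passes and data are illustrative; it is trained here on XOR with sigmoid units, which usually succeeds but is not guaranteed to converge from every random initialization:

import numpy as np

def g(a):   # sigmoid activation
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])   # XOR inputs
U = np.array([0., 1., 1., 0.])                           # desired outputs

n_hidden, eta = 3, 0.5
W1 = rng.normal(0, 0.5, (n_hidden, 2)); b1 = np.zeros(n_hidden)   # input -> hidden
W2 = rng.normal(0, 0.5, n_hidden);      b2 = 0.0                   # hidden -> output

for epoch in range(20000):                       # several thousand passes over the data
    for x, u in zip(X, U):
        # Forward pass
        a1 = W1 @ x + b1; z = g(a1)              # hidden activations and outputs
        a2 = W2 @ z + b2; y = g(a2)              # network output
        # Backward pass: deltas, from the output layer back to the hidden layer
        d2 = g(a2) * (1 - g(a2)) * (u - y)       # delta_k = g'(a_k) (u_k - y_k)
        d1 = g(a1) * (1 - g(a1)) * (W2 * d2)     # delta_j = g'(a_j) sum_k w_jk delta_k
        # Stochastic weight updates: delta_w = eta * delta * input of the connection
        W2 += eta * d2 * z;          b2 += eta * d2
        W1 += eta * np.outer(d1, x); b1 += eta * d1

print([float(round(g(W2 @ g(W1 @ x + b1) + b2), 2)) for x in X])   # on a successful run, close to [0, 1, 1, 0]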
Applications: multi-objective optimization
• cf. [Tom Mitchell]: predict both the class and the color, instead of the class only.

Role of the hidden layer
(figures)

MLP: applications
• Control: identification and control of processes (e.g. robot control)
• Signal processing: filtering, data compression, speech processing (recognition, prediction, production), ...
• Pattern recognition, image processing: hand-writing recognition, automated postal code recognition (Zip codes, USA), face recognition, ...
• Prediction: water and electricity consumption, meteorology, stock market, ...
• Diagnosis: industry, medicine, science, ...

Application to postal Zip codes
• [Le Cun et al., 1989, ...] (AT&T Bell Labs: a very smart team)
• ≈ 10 000 examples of handwritten digits
• Segmented and rescaled on a 16 x 16 matrix
• Weight sharing
• Optimal brain damage
• 99% correct recognition (on the training set)
• 9% rejected (delegated to human recognition)

The database
(figure: samples of handwritten digits)

Application to postal Zip codes: architecture
• 16 x 16 input matrix → 12 segment detectors (8 x 8) → 12 segment detectors (4 x 4) → 30 hidden nodes → 10 output nodes (digits 1, 2, ..., 9, 0)

Some mistakes made by the network
(figure)

Regression
(figure: an MLP used as a function approximator)

A failure: QSAR
• Quantitative Structure-Activity Relations: predict certain properties of molecules (for example biological activity) from their descriptions:
  – chemical
  – geometric
  – electrical

MLP: practical view
• Technical problems: how can the performance of the algorithm be improved?
• The MLP as an optimization method: variants
  – Momentum
  – Second order methods (Hessian)
  – Conjugate gradient
• Heuristics
  – Sequential learning vs batch learning
  – Choice of the activation function
  – Normalization of the inputs
  – Initialization of the weights
  – Learning gains

MLP: gradient back-propagation (variants)
• Momentum:
  Δw_ji(t+1) = -η ∂E/∂w_ji + α Δw_ji(t)

Convergence
• Tweaking of the learning step (figure).

MLP: convergence problems
• Local minima. Remedies:
  – Add momentum (inertia)
  – Conditioning of the parameters
  – Adding noise to the learning data
  – On-line algorithm (stochastic rather than total gradient)
  – Variable gradient step (in time and for each node)
  – Use of the second derivatives (Hessian), conjugate gradient

MLP: convergence problems (variable gradient)
• Adaptive gain: increase the gain if the gradient does not change sign, decrease it otherwise.
  – Much lower gain for the stochastic than for the total gradient
  – Specific gain for each layer (e.g. 1 / sqrt(number of input nodes))
• More complex algorithms
  – Conjugate gradients: try to minimize independently along each direction, using a momentum of search
  – Second order methods (Hessian): faster convergence but slower computations

Overfitting
(figure: real risk and empirical risk as a function of the data quantity; the gap between the two illustrates overfitting)

Preventing overfitting: regularization
• Principle: limit the expressiveness of H.
• New empirical risk:
  R_emp(α) = (1/m) Σ_{l=1..m} L(h(x_l, α), u_l) + λ Ω[h(., α)]     (the last term is the penalization term)
• Some useful regularizers:
  – Control of the NN architecture
  – Parameter control: soft weight sharing, weight decay, convolutional networks
  – Noisy examples

Control by limiting the exploration of H
• Early stopping
• Weight decay
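Momentum (above) and weight decay both amount to small modifications of the basic gradient step. A minimal sketch, assuming a generic gradient function and illustrative coefficients (η for the step, α for the momentum, λ for the decay), none of which come from the slides:

import numpy as np

def sgd_step(w, grad_E, velocity, eta=0.1, alpha=0.9, lam=1e-4):
    """One gradient step with momentum and weight decay.

    delta_w(t+1) = -eta * dE/dw + alpha * delta_w(t),
    with weight decay adding lam * w to the gradient (penalizing large weights).
    """
    grad = grad_E(w) + lam * w                   # weight decay = quadratic penalty on the weights
    velocity = -eta * grad + alpha * velocity    # momentum: reuse part of the previous step
    return w + velocity, velocity

# Illustrative use on a toy quadratic error E(w) = ||w - w_target||^2
w_target = np.array([1.0, -2.0])
grad_E = lambda w: 2.0 * (w - w_target)

w, v = np.zeros(2), np.zeros(2)
for _ in range(200):
    w, v = sgd_step(w, grad_E, v)
print(w)   # close to w_target, slightly shrunk towards 0 by the weight decay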
Generalization: optimize the network structure
• Progressive growth: cascade correlation [Fahlman, 1990]
• Pruning: optimal brain damage [Le Cun, 1990], optimal brain surgeon [Hassibi, 1993]

Introduction of prior knowledge
• Invariances: symmetries in the example space (translation / rotation / dilatation)
• Cost functions involving derivatives

ANN application areas
• Classification
• Clustering
• Associative memory
• Control
• Function approximation

Applications for ANN classifiers
• Pattern recognition: industrial inspection, fault diagnosis, image recognition, target recognition, speech recognition, natural language processing
• Character recognition: handwriting recognition, automatic text-to-speech conversion

ALVINN: a neural network approach
• ALVINN = Autonomous Land Vehicle In a Neural Network (slides by Martin Ho, Eddy Li, Eric Wong and Kitty Wong, 2000)
• Developed in 1993; performs driving with neural networks
• An intelligent VLSI image sensor for road following
• Learns to filter out image details that are not relevant to driving
• Architecture (figure): input units, a hidden layer, output units

MLP with Radial Basis Functions (RBF)
• Definition
  – The hidden layer uses a radial basis activation function (e.g. a Gaussian); idea: "pave" the input space with "receptive fields"
  – The output layer is a linear combination of the hidden layer
• Properties
  – Still a universal approximator ([Hartman et al., 1990], ...)
  – But not parsimonious (combinatorial explosion with the input dimension): only for problems with a small input dimension
  – Strong links with fuzzy inference systems and neuro-fuzzy systems

MLP with Radial Basis Functions (RBF)
• Parameters to tune: number of hidden nodes, initial positions of the receptive fields, diameters of the receptive fields, output weights
• Methods
  – Adaptation of back-propagation
  – Determination of each type of parameter with a specific method (usually more effective):
    centers determined by clustering methods (k-means, ...), diameters by optimizing the covering rate (nearest neighbours, ...), output weights by linear optimization (pseudo-inverse computation, ...)
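A small sketch of this two-stage recipe, assuming Gaussian receptive fields and a pseudo-inverse for the output weights; the toy data, the regular grid of centers (a stand-in for k-means) and the width sigma are all illustrative choices, not taken from the slides:

import numpy as np

def rbf_design_matrix(X, centers, sigma):
    """Gaussian receptive fields: phi_ij = exp(-||x_i - c_j||^2 / (2 sigma^2))."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * sigma ** 2))

rng = np.random.default_rng(0)
# Toy 1-D regression problem: u = sin(x) + noise
X = rng.uniform(0, 2 * np.pi, size=(50, 1))
u = np.sin(X[:, 0]) + 0.1 * rng.normal(size=50)

# 1) Centers of the receptive fields: here a regular grid (k-means would be used in practice)
centers = np.linspace(0, 2 * np.pi, 10).reshape(-1, 1)
sigma = 0.7                                       # receptive-field diameter (illustrative)

# 2) Output weights by linear optimization (pseudo-inverse)
Phi = rbf_design_matrix(X, centers, sigma)
w = np.linalg.pinv(Phi) @ u

# Prediction on a few test points
X_test = np.array([[0.5], [1.5], [3.0]])
print(rbf_design_matrix(X_test, centers, sigma) @ w)   # approximately sin(x) at those points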
Neural networks for sequence processing
• Task: take the time dimension into account
  – Sequence recognition: e.g. recognize the word corresponding to a vocal signal
  – Reproduction of sequences: e.g. predict the next values of a sequence (e.g. electricity consumption prediction)
  – Temporal association: production of one sequence in response to the recognition of another sequence
• Time Delay Neural Networks (TDNNs): duplicate the inputs for several past time steps
• Recurrent neural networks

Recurrent ANN architectures
• Feedback connections
• Dynamic memory: y(t+1) = f(x(τ), y(τ), s(τ)), with τ ∈ {t, t-1, ...}
• Models: Jordan / Elman ANNs, Hopfield, Adaptive Resonance Theory (ART)

Recurrent neural networks
• Can learn regular grammars (finite state machines), with back-propagation through time
• Can even model full computers with 11 neurons (!)
  – A very special use of RNNs: it exploits the fact that a weight can be any real number, i.e. an unlimited memory, plus chaotic dynamics; there is no learning algorithm for this
• Problems
  – Complex trajectories, chaotic dynamics
  – Limited memory of the past
  – Learning is very difficult: exponential decay of the error signal in time

Long Short-Term Memory (Hochreiter & Schmidhuber, 1997)
• Idea
  – Only some nodes are recurrent, and only with self-recurrence
  – Linear activation function, so the error decays linearly, not exponentially
• Can learn
  – Regular languages (FSM)
  – Some context-free (stack machine) and context-sensitive grammars: a^n b^n, a^n b^n c^n

Reservoir computing
• Idea: a random recurrent neural network; only the output layer weights are learned
• The reservoir provides many internal dynamics; the output layer selects the interesting ones, and combinations thereof

Conclusions
• Advantages
  – Can learn a wide variety of problems
• Limits
  – Learning is slow and difficult
  – The result is opaque: difficult to extract knowledge, difficult to use prior knowledge (but see KBANN)
  – Incremental learning of new concepts is difficult: catastrophic forgetting

Bibliography
• Books / articles
  – Bishop C. (1995): Neural Networks for Pattern Recognition. Clarendon Press, Oxford.
  – Haykin S. (1998): Neural Networks. Prentice Hall.
  – Hertz J., Krogh A. & Palmer R. (1991): Introduction to the Theory of Neural Computation. Addison Wesley.
  – Thiria S., Gascuel O., Lechevallier Y. & Canu S. (1997): Statistiques et méthodes neuronales. Dunod.
  – Vapnik V. (1995): The Nature of Statistical Learning Theory. Springer Verlag.
• Web sites
  – http://www.lps.ens.fr/~nadal/