Neural Networks
EFREI 2010
Laurent Orseau ([email protected])
AgroParisTech
based on slides by Antoine Cornuejols
1
Neural Networks
Plan
1. Introduction
2. The perceptron
3. The multi-layer perceptron (MLP)
4. Learning in MLP
5. Computational aspects
6. Methodological aspects of learning
7. Applications
8. Developments and perspectives
9. Conclusions
L. Orseau
2
L. Orseau
Neural Networks
Introduction: Why neural networks?
• Biological inspiration

Natural brain: a very seductive model
– Robust and fault tolerant
– Flexible. Easily adaptable
– Can work with incomplete, uncertain, noisy data ...
– Massively parallel
– Can learn

Neurons
– ≈ 10^11 neurons in the human brain
– ≈ 10^4 connections (synapses + axons) / neuron
– Action potential / refractory period / neurotransmitters
– Excitatory / inhibitory signals
4
L. Orseau
Neural Networks
Introduction: Why neural networks?
• Some properties

Parallel computation

Directly implementable on dedicated circuits

Robust and fault tolerant (distributed representation)

Simple algorithms

Very general
• Some defects

Opacity of acquired knowledge
5
Neural Networks
Historical notes (quickly)

Premises
– McCulloch & Pitts (1943): 1st formal neuron model.
Neuron as logical calculus: a basis of artificial intelligence.
– Hebb rule (1949): learning by reinforcing synaptic coupling

First realizations
– ADALINE (Widrow-Hoff, 1960)
– PERCEPTRON (Rosenblatt, 1958-1962)
– Analysis of Minsky & Papert (1969)

New models
– Kohonen (competitive learning), ...
– Hopfield (1982) (recurrent net)
– Multi-layer perceptron (1985)

Analysis and developments
– Control theory, generalization (Vapnik), ...
L. Orseau
6
Neural Networks
The perceptron

Rosenblatt (1958-1962)
L. Orseau
7
Neural Networks
Plan
1. Introduction
2. The perceptron
3. The multi-layer perceptron (MLP)
4. Learning in MLP
5. Computational aspects
6. Methodological aspects of learning
7. Applications
8. Developments and perspectives
9. Conclusions
L. Orseau
8
L. Orseau
Neural Networks
Linear discrimination: the perceptron
[Rosenblatt, 1957,1962]
Decision function: output = sign of the weighted sum of the input nodes plus the bias node
[Figure: perceptron with input nodes, a bias node, and an output node]
9
L. Orseau
Neural Networks
Linear discrimination: the perceptron
• Geometry - 2 classes
11
L. Orseau
Neural Networks
Linear discrimination: the perceptron
• Geometry – multiclass
Discrimination of one class against all the others
[Figure: one-versus-all separators; ambiguous region]
12
Neural Networks
Linear discrimination: the perceptron
• Geometry – multiclass
Discrimination between two classes at a time
N(N−1)/2 discriminant functions
[Figure: pairwise (one-versus-one) separators]
L. Orseau
13
L. Orseau
Neural Networks
The perceptron: Performance criterion
• Optimization criterion (error function):

Total # classification errors: NO (piecewise constant, unusable for gradient descent)

Perceptron criterion:
For every training form x, we want:
w^T x ≥ 0 for class C_1 and w^T x < 0 for class C_2,
i.e. w^T x · u > 0, with desired output u = +1 for C_1 and u = −1 for C_2
E_P(w) = − Σ_{misclassified x} (w^T x · u)
 Proportional to the distance to the decision surface (for all wrongly classified examples)
 Piecewise linear and continuous function
14
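As an illustration (not from the original slides), a minimal NumPy sketch of this criterion, assuming labels u ∈ {−1, +1} and the bias folded into w as an extra constant input component:

```python
import numpy as np

def perceptron_criterion(w, X, u):
    """E_P(w) = - sum of (w.x * u) over the wrongly classified examples.

    X: (m, d) inputs (with a constant column for the bias), u: (m,) labels in {-1, +1}.
    """
    margins = (X @ w) * u            # positive iff the example is correctly classified
    return -np.sum(margins[margins <= 0])
```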
L. Orseau
Neural Networks
Direct learning: pseudo-inverse method
•
Direct solution (pseudo-inverse method) requires:

Knowledge of all pairs (xi, yi)

Matrix inversion (often ill-conditioned)
 (only for a linear network and a quadratic error function)
•
Hence: an iterative method without matrix inversion is preferred

Gradient descent
15
L. Orseau
Neural Networks
The perceptron: algorithm
• Exploration method of H

Gradient search
– Minimization of error function
– Principle: in the spirit of the Hebb rule:
modify connection proportionally to input and output
– Learn only if classification error

Algorithm:
if example is correctly classified: do nothing
otherwise:
w(t  1)  w(t)   xi ui
Loop over all training examples until a stopping criterion

Convergence?
16
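A minimal sketch of this algorithm (assumed conventions: labels u_i ∈ {−1, +1}, bias folded into w, η is the gain, and the stopping criterion is a fixed number of passes or zero errors):

```python
import numpy as np

def train_perceptron(X, u, eta=1.0, n_passes=100):
    """Rosenblatt's rule: learn only when an example is misclassified."""
    w = np.zeros(X.shape[1])
    for _ in range(n_passes):               # loop over the training set
        n_errors = 0
        for x_i, u_i in zip(X, u):
            if u_i * (w @ x_i) <= 0:        # classification error
                w = w + eta * x_i * u_i     # w(t+1) = w(t) + eta * x_i * u_i
                n_errors += 1
        if n_errors == 0:                   # stopping criterion: everything classified
            break
    return w
```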
L. Orseau
Neural Networks
The perceptron: convergence, memory capacity
• Questions:

What can be learned?
– Result from [Minsky & Papert, 1969]: linear separators only

Convergence guarantees?
– Perceptron convergence theorem [Rosenblatt,62]

Reliability of learning and number of examples
– How many examples do we need to have some guarantee about what should
be learned?
17
Neural Networks
Expressive power: Linear separations
L. Orseau
18
Neural Networks
Expressive power: Linear separations
L. Orseau
19
Neural Networks
Plan
1. Introduction
2. The perceptron
3. The multi-layer perceptron (MLP)
4. Learning in MLP
5. Computational aspects
6. Methodological aspects of learning
7. Applications
8. Developments and perspectives
9. Conclusions
L. Orseau
20
L. Orseau
Neural Networks
The multi-layer perceptron
• Usual topology
[Figure]
Input layer → Hidden layer → Output layer (signal flow)
Input: x_k   Output: y_k   Desired output: u_k
22
L. Orseau
Neural Networks
The multi-layer perceptron: propagation
• For each neuron:


y_k = g( Σ_{j=0..d} w_jk · z_j ) = g(a_k)
wjk : weight of the connection from node j to node k

ak : activation of node k

g : activation function
[Figure: common activation functions g (output z_i vs. activation a_i)]
– Threshold function
– Ramp function
– Sigmoid function: g(a) = 1 / (1 + e^(−a)),  with g'(a) = g(a)(1 − g(a))
– Radial Basis Function
23
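A small sketch of the propagation rule and of the sigmoid activation with its derivative (function and variable names are illustrative):

```python
import numpy as np

def sigmoid(a):
    """Sigmoid activation g(a) = 1 / (1 + e^-a)."""
    return 1.0 / (1.0 + np.exp(-a))

def sigmoid_prime(a):
    """g'(a) = g(a) * (1 - g(a))."""
    g = sigmoid(a)
    return g * (1.0 - g)

def propagate_layer(z_prev, W, b):
    """a_k = sum_j w_jk * z_j + b_k ; y_k = g(a_k).

    z_prev: outputs of the previous layer, shape (d,)
    W: weights w_jk, shape (d, n_units); b: biases, shape (n_units,)
    """
    a = z_prev @ W + b
    return sigmoid(a), a
```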
L. Orseau
Neural Networks
The multi-layer perceptron: the XOR example
[Figure: a 2-2-1 network of threshold units computing XOR]
Hidden node A: inputs x1, x2 with weights 1 and 1, bias −0.5
Hidden node B: inputs x1, x2 with weights 1 and 1, bias −1.5
Output node C: inputs A and B with weights 1 and −1, bias −0.5; output y = x1 XOR x2
24
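The XOR network above can be checked directly with threshold units; a small sketch using the weight assignment reconstructed from the figure:

```python
def step(a):
    """Threshold activation: 1 if a > 0, else 0."""
    return 1.0 if a > 0 else 0.0

def xor_net(x1, x2):
    A = step(1 * x1 + 1 * x2 - 0.5)    # hidden node A (acts as x1 OR x2)
    B = step(1 * x1 + 1 * x2 - 1.5)    # hidden node B (acts as x1 AND x2)
    return step(1 * A - 1 * B - 0.5)   # output node C: y = x1 XOR x2

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, "->", xor_net(x1, x2))   # prints the XOR truth table
```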
Neural Networks
Example of network (JavaNNS)
L. Orseau
25
Neural Networks
Plan
1. Introduction
2. The perceptron
3. The multi-layer perceptron (MLP)
4. Learning in MLP
5. Computational aspects
6. Methodological aspects of learning
7. Applications
8. Developments and perspectives
9. Conclusions
L. Orseau
26
L. Orseau
Neural Networks
27
The MLP: learning
• Find weights such that the network computes an input-output mapping
consistent with the given examples
(same old generalization problem)
• Learning:

Minimize the loss function E(w, {x_l, u_l}) as a function of w

Use a gradient descent method
Δw_ij = −η ∂E/∂w_ij
(gradient back-propagation algorithm )

Inductive principle: We suppose that what works on training examples
(empirical risk minimization) should work on test (unseen) examples (real risk
minimization)
L. Orseau
Neural Networks
Learning: gradient descent
• learning = search in the multidimensional parameter space (synaptic weights) to
minimize loss function
• Almost all learning rules
= gradient descent method

Optimal solution w* such that ∇E(w*) = 0
∇E = ( ∂E/∂w_1, ∂E/∂w_2, ..., ∂E/∂w_N )^T
Update rule:  w_ij^(τ+1) = w_ij^(τ) − η ∂E/∂w_ij |_(w^(τ)),  i.e.  w^(τ+1) = w^(τ) − η ∇_w E
[Figure: one gradient descent step on the error surface E(w) around w^(τ)]
28
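A generic sketch of this update rule; the gradient function ∇_w E is assumed to be given (for an MLP it is supplied by back-propagation):

```python
import numpy as np

def gradient_descent(w0, grad_E, eta=0.1, n_steps=1000, tol=1e-6):
    """Plain gradient descent: w <- w - eta * grad E(w)."""
    w = np.asarray(w0, dtype=float)
    for _ in range(n_steps):
        g = grad_E(w)
        w = w - eta * g
        if np.linalg.norm(g) < tol:       # close to a point where grad E(w*) = 0
            break
    return w

# usage: minimize E(w) = ||w - 3||^2, whose gradient is 2 * (w - 3)
w_star = gradient_descent([0.0, 0.0], lambda w: 2.0 * (w - 3.0))
```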
L. Orseau
Neural Networks
The multi-layer perceptron: learning
Goal:  w* = argmin_w (1/m) Σ_{l=1..m} || y(x_l; w) − u(x_l) ||²

 Algorithm (gradient back-propagation): gradient descent
Iterative algorithm:  w^(t) = w^(t−1) − η ∇_w E

Off-line case (total gradient):
w_ij(t) = w_ij(t−1) − η(t) · (1/m) Σ_{k=1..m} ∂R_E(x_k, w) / ∂w_ij
where:  R_E(x_k, w) = [ u_k − f(x_k, w) ]²

On-line case (stochastic gradient):
w_ij(t) = w_ij(t−1) − η(t) · ∂R_E(x_k, w) / ∂w_ij
29
Neural Networks
L. Orseau
The multi-layer perceptron: learning
1. Take one example from training set
2. Compute output state of network
3. Compute error = f(output − desired output) (e.g. (y_l − u_l)²)
4. Compute gradients
With gradient back-propagation algorithm
5. Modify synaptic weights
6. Stopping criterion
Based on global error, number of examples, etc.
7. Go back to 1
30
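The seven steps above, written as a stochastic (on-line) training loop; `network.forward`, `network.backprop_gradients` and `network.weights` are placeholder names for an MLP object, not an actual API from the slides:

```python
import numpy as np

def train(network, X, U, eta=0.1, n_epochs=100, target_error=1e-3):
    """On-line gradient training loop for an MLP."""
    for epoch in range(n_epochs):
        total_error = 0.0
        for x, u in zip(X, U):                        # 1. take one example
            y = network.forward(x)                    # 2. compute the output state
            total_error += np.sum((y - u) ** 2)       # 3. error, e.g. (y - u)^2
            grads = network.backprop_gradients(x, u)  # 4. gradients by back-propagation
            for w, g in zip(network.weights, grads):  # 5. modify the synaptic weights
                w -= eta * g
        if total_error / len(X) < target_error:       # 6. stopping criterion (global error)
            return network
    return network                                    # 7. otherwise go back to step 1
```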
Neural Networks
L. Orseau
MLP: gradient back-propagation
• The problem:
Determine responsibilities (“credit assignment problem”):
which connection is responsible for the error E, and by how much?
• Principle:
Compute the error on a connection as a function of the error on the next
layer
• Two steps:
1. Evaluation of the derivatives of the error with respect to the weights
2. Use of these derivatives to compute the modification of each weight
31
L. Orseau
Neural Networks
MLP: gradient back-propagation
1. Evaluation of the error E_l (or E) due to each connection:  ∂E_l / ∂w_ij
Idea: compute the error on connection w_ij as a function of the error after node j
Let δ_j = ∂E_l / ∂a_j. Then:
∂E_l / ∂w_ij = (∂E_l / ∂a_j) · (∂a_j / ∂w_ij) = δ_j · z_i

For nodes in the output layer:
δ_k = ∂E_l / ∂a_k = g'(a_k) · ∂E_l / ∂y_k = g'(a_k) · [ u_k(x_l) − y_k ]

For nodes in the hidden layer:
δ_j = ∂E_l / ∂a_j = Σ_k (∂E_l / ∂a_k) · (∂a_k / ∂a_j) = Σ_k δ_k · (∂a_k / ∂z_j) · (∂z_j / ∂a_j) = g'(a_j) · Σ_k w_jk · δ_k
32
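These two formulas translate directly into code; a sketch for one example with sigmoid units, using the slides' notation (a: activations, z: outputs, δ: errors):

```python
import numpy as np

def sigmoid_prime(a):
    g = 1.0 / (1.0 + np.exp(-a))
    return g * (1.0 - g)

def output_deltas(a_out, y_out, u):
    """delta_k = g'(a_k) * (u_k - y_k) for the output layer."""
    return sigmoid_prime(a_out) * (u - y_out)

def hidden_deltas(a_hidden, W_next, delta_next):
    """delta_j = g'(a_j) * sum_k w_jk * delta_k (back-propagated from the next layer).

    W_next has shape (n_hidden, n_next), delta_next has shape (n_next,).
    """
    return sigmoid_prime(a_hidden) * (W_next @ delta_next)
```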
L. Orseau
Neural Networks
MLP: gradient back-propagation
a_i : activation of node i
z_i : output of node i
δ_i : error attached to node i
[Figure: node i (output z_i) feeds hidden node j through w_ij (activation a_j, output z_j, error δ_j), which feeds output node k through w_jk (activation a_k, output y_k, error δ_k)]
33
L. Orseau
Neural Networks
MLP: gradient back-propagation
• 2. Modification of weights

We assume a gradient step η(t) (constant or not)

If stochastic learning (after presentation of each example):
Δw_ji = η(t) · δ_j · a_i

If batch learning (after presentation of the whole set of examples):
Δw_ji = η(t) · Σ_n δ_j^(n) · a_i^(n)
34
L. Orseau
Neural Networks
MLP: forward and backward passes (summary)
[Figure: forward pass. Inputs x_0 (bias), x_1, ..., x_d feed the k neurons of the hidden layer; the hidden outputs y_i(x) feed the output nodes through weights w_is, producing outputs y_1(x), ..., y_s(x)]
For each hidden or output neuron i:
a_i(x) = Σ_{j=1..d} w_j x_j + w_0
y_i(x) = g(a_i(x))
35
L. Orseau
Neural Networks
MLP: forward and backward passes (summary)
[Figure: backward pass]
Output node s:  δ_s = g'(a_s) · (u_s − y_s)
Output weights:  w_is(t+1) = w_is(t) + η(t) · δ_s · a_i
Hidden node i:  δ_i = g'(a_i) · Σ_{s ∈ nodes of next layer} w_is · δ_s
Input weights:  w_ei(t+1) = w_ei(t) + η(t) · δ_i · a_e
36
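Putting both passes together, a compact sketch of one on-line back-propagation step for a one-hidden-layer MLP with sigmoid units and squared error, assuming the bias is carried by a constant input component as in the figure:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def backprop_step(x, u, W1, W2, eta):
    """One stochastic gradient step.

    W1: (d, h) input-to-hidden weights, W2: (h, s) hidden-to-output weights,
    x: input (bias component included), u: desired output.
    """
    # forward pass
    a1 = x @ W1
    z1 = sigmoid(a1)                                   # hidden outputs y_i(x)
    a2 = z1 @ W2
    y = sigmoid(a2)                                    # network outputs y_s(x)
    # backward pass
    d2 = sigmoid(a2) * (1 - sigmoid(a2)) * (u - y)     # delta_s = g'(a_s) (u_s - y_s)
    d1 = sigmoid(a1) * (1 - sigmoid(a1)) * (W2 @ d2)   # delta_i = g'(a_i) sum_s w_is delta_s
    # weight updates
    W2 += eta * np.outer(z1, d2)                       # w_is(t+1) = w_is(t) + eta delta_s y_i
    W1 += eta * np.outer(x, d1)                        # w_ei(t+1) = w_ei(t) + eta delta_i x_e
    return W1, W2, y
```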
L. Orseau
Neural Networks
37
MLP: gradient back-propagation
• Learning efficiency

O(|w|) for each learning pass, |w| = # weights

Usually several hundreds of passes (see below)

And learning must typically be done several dozens of times with different initial
random weights
• Recognition efficiency

Possibility of real time
L. Orseau
Neural Networks
Applications: multi-objective optimization
• cf [Tom Mitchell]

Predict both class and color

Instead of class only
40
Neural Networks
Role of the hidden layer
L. Orseau
41
Neural Networks
Role of the hidden layer
L. Orseau
42
Neural Networks
Role of the hidden layer
L. Orseau
43
Neural Networks
L. Orseau
MLP: Applications
• Control: identification and control of processes
(e.g. Robot control)
• Signal Processing (filtering, data compression, speech processing
(recognition, prediction, production),…)
• Pattern recognition, image processing (hand-writing recognition,
automated postal code recognition (Zip codes, USA), face
recognition...)
• Prediction (water, electricity consumption, meteorology, stock market,
...)
• Diagnostic (industry, medical, science, ...)
44
Neural Networks
Application to postal Zip codes
• [Le Cun et al., 1989, ...] (ATT Bell Labs: very smart team)
• ≈ 10000 examples of handwritten numbers
• Segmented and rescaled to a 16 × 16 matrix
• Weight sharing
• Optimal brain damage
• 99% correct recognition (on training set)
• 9% reject (delegated to human recognition)
L. Orseau
45
Neural Networks
The database
L. Orseau
46
L. Orseau
Neural Networks
Application to postal Zip codes
[Figure: network architecture]
16 × 16 input matrix → 12 segment detectors (8 × 8) → 12 segment detectors (4 × 4) → 30 nodes → 10 output nodes (digits 1, 2, ..., 9, 0)
47
Neural Networks
Some mistakes made by the network
L. Orseau
48
Neural Networks
Regression
L. Orseau
49
Neural Networks
A failure: QSAR
• Quantitative Structure Activity Relations
Predict certain properties of molecules
(for example biological activity) from
descriptions that are:
- chemical
- geometric
- electrical
L. Orseau
50
Neural Networks
Plan
1. Introduction
2. The perceptron
3. The multi-layer perceptron (MLP)
4. Learning in MLP
5. Computational aspects
6. Methodological aspects of learning
7. Applications
8. Developments and perspectives
9. Conclusions
L. Orseau
51
L. Orseau
Neural Networks
MLP: Practical view (1)
• Technical problems:
how to improve the algorithm performance?

MLP as an optimization method: variants
• Momentum
• Second order methods
• Hessian
• Conjugate gradient

Heuristics
• Sequential learning vs batch learning
• Choice of activation function
• Normalization of inputs
• Weight initialization
• Learning gains
52
L. Orseau
Neural Networks
MLP: gradient back-propagation (variants)
• Momentum
Δw_ji(t+1) = −η ∂E/∂w_ji + α Δw_ji(t)
53
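A sketch of the momentum update; `alpha` is the momentum coefficient and `velocity` stores the previous weight change Δw(t):

```python
def momentum_update(w, grad, velocity, eta=0.1, alpha=0.9):
    """delta_w(t+1) = -eta * dE/dw + alpha * delta_w(t)."""
    velocity = -eta * grad + alpha * velocity
    return w + velocity, velocity
```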
Neural Networks
Convergence
• Tuning the learning step (gain) η
L. Orseau
54
L. Orseau
Neural Networks
MLP: Convergence problems
• Local minima
 Add momentum (inertia)
 Conditioning of parameters
 Adding noise to the training data
 Online algorithm (stochastic vs. total gradient)
 Variable gradient step (in time and for each node)
 Use of second derivatives (Hessian). Conjugate gradient
55
L. Orseau
Neural Networks
MLP: Convergence problems (variable gradient step)
• Adaptive gain

Increase the gain if the gradient does not change sign, decrease it otherwise
 Much lower gain for stochastic than for total gradient
 Specific gain for each layer (e.g. 1 / √(# input nodes))
• More complex algorithms

Conjugate gradients
– Idea: try to minimize along each direction independently, reusing previous
search directions (a kind of momentum)

Second order methods (Hessian)
– Faster convergence but slower computations
56
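A per-weight adaptive-gain sketch of the heuristic above; the increase/decrease factors are illustrative choices, not values from the slides:

```python
import numpy as np

def adaptive_gain_step(w, grad, prev_grad, gains, up=1.05, down=0.7):
    """Gradient step with one gain per weight, adapted from the sign of the gradient."""
    same_sign = grad * prev_grad > 0
    gains = np.where(same_sign, gains * up, gains * down)  # grow or shrink each gain
    w = w - gains * grad
    return w, gains, grad                                  # grad is reused as prev_grad
```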
Neural Networks
Plan
1. Introduction
2. The perceptron
3. The multi-layer perceptron (MLP)
4. Learning in MLP
5. Computational aspects
6. Methodological aspects of learning
7. Applications
8. Developments and perspectives
9. Conclusions
L. Orseau
57
L. Orseau
Neural Networks
Overfitting
[Figure: real risk and empirical risk as a function of data quantity; overfitting is the gap between them]
58
L. Orseau
Neural Networks
Preventing overfitting: regularization
• Principle: limit expressiveness of H
• New empirical risk:
R_emp(α) = (1/m) Σ_{l=1..m} L( h(x_l, α), u_l )  +  λ Ω[ h(·, α) ]
where the last term is a penalization term
• Some useful regularizers:
– Control of NN architecture
– Parameter control
• Soft weight sharing
• Weight decay
• Convolution networks
– Noisy examples
59
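Weight decay is the simplest such penalization: Ω[h] = ||w||². A sketch of the regularized empirical risk (`lam` stands for the coefficient λ; `predict` is a placeholder for the network's forward pass):

```python
import numpy as np

def regularized_risk(w, X, U, predict, lam=1e-3):
    """Mean squared error plus a weight-decay penalization term lam * ||w||^2."""
    errors = [np.sum((predict(x, w) - u) ** 2) for x, u in zip(X, U)]
    return np.mean(errors) + lam * np.sum(w ** 2)

# the corresponding gradient simply adds 2 * lam * w to the back-propagated gradient
```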
Neural Networks
Control by limiting the exploration of H
• Early stopping
• Weight decay
L. Orseau
60
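A minimal early-stopping sketch: training stops, and the best weights are kept, once the error on a held-out validation set stops improving (`train_one_pass` and `validation_error` are placeholder callables):

```python
import copy

def train_with_early_stopping(network, train_one_pass, validation_error,
                              max_epochs=1000, patience=10):
    """Stop when the validation error has not improved for `patience` epochs."""
    best_err, best_net, bad_epochs = float("inf"), copy.deepcopy(network), 0
    for _ in range(max_epochs):
        train_one_pass(network)             # one pass over the training set
        err = validation_error(network)     # error on held-out examples
        if err < best_err:
            best_err, best_net, bad_epochs = err, copy.deepcopy(network), 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break                       # generalization stopped improving
    return best_net
```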
L. Orseau
Neural Networks
Generalization: optimize the network structure
• Progressive growth

Cascade correlation [Fahlman,1990]
• Pruning

Optimal brain damage [Le Cun,1990]

Optimal brain surgeon [Hassibi,1993]
61
L. Orseau
Neural Networks
Introduction of prior knowledge
Invariances
• Symmetries in the example space

– Translation / rotation / dilation
• Cost function involving derivatives
62
Neural Networks
Plan
1. Introduction
2. The perceptron
3. The multi-layer perceptron (MLP)
4. Learning in MLP
5. Computational aspects
6. Methodological aspects of learning
7. Applications
8. Developments and perspectives
9. Conclusions
L. Orseau
63
Neural Networks
ANN Application Areas
• Classification
• Clustering
• Associative memory
• Control
• Function approximation
L. Orseau
64
L. Orseau
Neural Networks
Applications for ANN Classifiers
• Pattern recognition

Industrial inspection

Fault diagnosis

Image recognition

Target recognition

Speech recognition

Natural language processing
• Character recognition

Handwriting recognition

Automatic text-to-speech conversion
65
L. Orseau
Neural Networks
ALVINN
Neural Network Approaches
ALVINN - Autonomous Land Vehicle In a Neural Network
Presented by Martin Ho, Eddy Li, Eric Wong and Kitty Wong - Copyright© 2000
66
L. Orseau
Neural Networks
ALVINN
[Figure: ALVINN network with input units, a hidden layer, and output units]
- Developed in 1993.
- Performs driving with neural networks.
- An intelligent VLSI image sensor for road following.
- Learns to filter out image details not relevant to driving.
67
Neural Networks
Plan
1. Introduction
2. The perceptron
3. The multi-layer perceptron (MLP)
4. Learning in MLP
5. Computational aspects
6. Methodological aspects of learning
7. Applications
8. Developments and perspectives
9. Conclusions
L. Orseau
68
L. Orseau
Neural Networks
MLP with Radial Basis Functions (RBF)
• Definition

Hidden layer uses radial basis activation function (e.g. Gaussian)
– Idea: “pave” the input space with “receptive fields”

Output layer: linear combination upon the hidden layer
• Properties

Still universal approximator ([Hartman et al.,90], ...)

But not parsimonious (combinatorial explosion with the input dimension)
 Only for small input dimension problems

Strong links with fuzzy inference systems and neuro-fuzzy systems
69
L. Orseau
Neural Networks
70
MLP with Radial Basis Functions (RBF)
• Parameters to tune:

# hidden nodes

Initial positions of receptive fields

Diameter of receptive fields

Output weights
• Methods

Adaptation of back-propagation

Determination of each type of parameter with a specific method (usually
more effective)
– Centers determined by “clustering” methods (k-means, ...)
– Diameters determined by covering-rate optimization (nearest neighbors, ...)
– Output weights by linear optimization (pseudo-inverse computation, ...)
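A sketch of this "one method per parameter type" recipe for an RBF network; for brevity the centers are taken as a random subset of the data (the slides suggest k-means), the diameter comes from a simple covering heuristic, and the output weights from the pseudo-inverse:

```python
import numpy as np

def fit_rbf(X, U, n_centers=20, seed=0):
    """Train an RBF network: Gaussian receptive fields + linear output layer."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), n_centers, replace=False)]      # receptive field centers
    dists = np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=-1)
    sigma = dists.mean()                                           # common diameter (heuristic)
    Phi = np.exp(-np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=-1) ** 2
                 / (2 * sigma ** 2))                               # hidden-layer activations
    W = np.linalg.pinv(Phi) @ U                                    # output weights (pseudo-inverse)
    return centers, sigma, W

def rbf_predict(x, centers, sigma, W):
    phi = np.exp(-np.linalg.norm(x - centers, axis=-1) ** 2 / (2 * sigma ** 2))
    return phi @ W
```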
L. Orseau
Neural Networks
Neural Networks for sequence processing
• Tasks : Take the Time dimension into account

Sequence recognition
E.g. recognize a word corresponding to a vocal signal

Reproduction of sequence
E.g. predict the next values of the sequence (e.g. electricity consumption prediction)

Temporal association
Production of one sequence in response to the recognition of another sequence
 Time Delay Neural Networks (TDNNs)

Duplicate inputs for several past time steps
 Recurrent Neural Networks
71
L. Orseau
Neural Networks
Recurrent ANN Architectures
• Feedback connections
• Dynamic memory: y(t+1) = f(x(τ), y(τ), s(τ)),  τ ∈ {t, t−1, ...}
• Models:

Jordan/Elman ANNs

Hopfield

Adaptive Resonance Theory (ART)
72
L. Orseau
Neural Networks
Recurrent Neural Networks
• Can learn regular grammars

Finite State Machines

Back Propagation Through Time
• Can even model full computers with 11 neurons (!)

Very special use of RNNs…

Uses the property that a weight can be any number,
i.e. it is an unlimited memory

+ Chaotic dynamics

No learning algorithm for this
73
L. Orseau
Neural Networks
Recurrent Neural Networks
• Problems

Complex trajectories
– Chaotic dynamics

Limited memory of past

Learning is very difficult!
– Exponential decay of error signal in time
75
L. Orseau
Neural Networks
Long Short-Term Memory (Hochreiter & Schmidhuber, 1997)
• Idea:

Only some nodes are recurrent

Only self-recurrence

Linear activation function
– Error decays linearly, not exponentially
• Can learn

Regular languages (FSM)

Some Context-free (stack machine) and Context-sensitive grammars
– a^n b^n, a^n b^n c^n
76
L. Orseau
Neural Networks
Reservoir computing
• Idea:

Random recurrent neural network,

Learn only output layer weights
• Many internal dynamics
• Output layer selects the interesting ones,
and combinations thereof
[Figure: input → random recurrent reservoir → output]
77
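An echo-state-network-style sketch of this idea: the recurrent reservoir is random and fixed, and only the output (readout) weights are learned, here by least squares (all constants are illustrative):

```python
import numpy as np

def run_reservoir(inputs, n_reservoir=100, seed=0):
    """Drive a fixed random recurrent network and collect its internal states."""
    rng = np.random.default_rng(seed)
    d = inputs.shape[1]
    W_in = rng.uniform(-0.5, 0.5, (n_reservoir, d))
    W = rng.uniform(-0.5, 0.5, (n_reservoir, n_reservoir))
    W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))   # keep the internal dynamics stable
    x = np.zeros(n_reservoir)
    states = []
    for u in inputs:
        x = np.tanh(W_in @ u + W @ x)                 # reservoir dynamics (never trained)
        states.append(x)
    return np.array(states)

def train_readout(states, targets):
    """Learn only the output layer, by least squares over the reservoir states."""
    return np.linalg.pinv(states) @ targets
```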
Neural Networks
Plan
1. Introduction
2. The perceptron
3. The multi-layer perceptron (MLP)
4. Learning in MLP
5. Computational aspects
6. Methodological aspects of learning
7. Applications
8. Developments and perspectives
9. Conclusions
L. Orseau
79
L. Orseau
Neural Networks
Conclusions
• Advantages

Can learn a wide variety of problems
• Limits

Learning is slow and difficult

Result is opaque
– Difficult to extract knowledge
– Difficult to use prior knowledge (but KBANN)

Incremental learning of new concepts is difficult: catastrophic forgetting
80
L. Orseau
Neural Networks
81
Bibliography
• Books / articles

Bishop C. (1995): Neural Networks for Pattern Recognition. Clarendon Press, Oxford, 1995.

Haykin (98): Neural Networks. Prentice Hall, 1998.

Hertz, Krogh & Palmer (91): Introduction to the theory of neural computation.
Addison Wesley, 1991.

Thiria, Gascuel, Lechevallier & Canu (97): Statistiques et methodes neuronales.
Dunod, 1997.

Vapnik (95): The nature of statistical learning. Springer Verlag, 1995.
• Web sites

http://www.lps.ens.fr/~nadal/