
MACHINE LEARNING - Doctoral Class - EDIC
Information Theory and The Neuron - II
Aude Billard
EPFL - LASA © 2006 A. Billard
http://lasa.epfl.ch
Overview
LECTURE I:
• Neuron – Biological Inspiration
• Information Theory and the Neuron
• Weight Decay + Anti-Hebbian Learning → PCA
• Anti-Hebbian Learning → ICA
LECTURE II:
• Capacity of the single Neuron
• Capacity of Associative Memories (Willshaw Net,
Extended Hopfield Network)
LECTURE III:
• Continuous Time-Delay NN
• Limit-Cycles, Stability and Convergence
Neural Processing - The Brain
[Figure: schematic of a neuron. Inputs x_1, x_2 arrive at the dendrites through synapses; the electrical potential E of the cell body integrates them over time (E \approx \int x\, dt) and decays between spikes; once E exceeds a critical level the cell fires and enters a refractory period.]
A neuron receives and integrates input from other neurons. Once the integrated input exceeds a critical level, the neuron discharges a spike. This spiking event, also called depolarization, is followed by a refractory period during which the neuron is unable to fire.
Information Theory and The Neuron


y = f\left(\sum_i w_i\, x_i\right)

[Diagram: a single neuron with inputs X = (x_1, \dots, x_4), weights w_1, \dots, w_4 and output y.]
You can view the neuron as a memory.
• What can you store in this memory?
• What is the maximal capacity?
• How can you find a learning rule that maximizes the capacity?
Information Theory and The Neuron
A fundamental requirement for learning systems is robustness to noise.
One way to measure a system's robustness to noise is to determine the mutual information between its inputs and its output.

[Diagram: input X and noise \nu enter the neuron, which produces the output y = f(X, \nu).]
Information Theory and The Neuron
[Diagram: inputs X with weights w_1, \dots, w_4 feeding the output y.]

y = f\left(\sum_i w_i\, x_i\right)
Consider the neuron as a sender-receiver system, with X being the message sent and y
the received message.
Information theory can give you a measure of the information conveyed by y about X.
If the transmission system is imperfect (noisy), you must find a way to ensure minimal
disturbance in the transmission.
Information Theory and The Neuron
The mutual information between the neuron output y and its inputs X is given by:

I(x, y) = \frac{1}{2} \log_2 \frac{\sigma_y^2}{\sigma_\nu^2}

where \sigma_y^2 / \sigma_\nu^2 is the signal-to-noise ratio.
In order to maximize this ratio, one can simply increase the magnitude of the weights.
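As a rough numerical check (all values below are illustrative, not from the lecture), the sketch below estimates this quantity for a linear neuron with additive output noise and shows that it indeed grows when the weights are simply scaled up:

```python
# Sketch (illustrative values): mutual information of a linear neuron with
# additive output noise, I = 1/2 * log2( var(y) / var(nu) ).
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 4))                 # zero-mean Gaussian inputs
w = np.array([1.0, 0.5, -0.3, 2.0])              # hypothetical weights
sigma_nu = 0.5                                   # output noise standard deviation

def mutual_info(w):
    y = X @ w + sigma_nu * rng.normal(size=len(X))
    return 0.5 * np.log2(np.var(y) / sigma_nu**2)

print(mutual_info(w))        # some value in bits
print(mutual_info(10 * w))   # larger: scaling up the weights increases the ratio
```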
Information Theory and The Neuron
1
2

X
W
1
3
Output: y
W
2
4
y    wi   xi   i  
W
3
W
i
4
The mutual information between the neuron output y and its inputs X is now given by:

I(x, y) = \frac{1}{2} \log \left( \frac{\sigma_y^2}{\sigma_\nu^2 \sum_j w_j^2} \right)

This time, one cannot simply increase the magnitude of the weights, as this affects the value of \sigma_y^2 as well.
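A quick check of this point (covariance, weights and noise level below are made-up numbers): rescaling w leaves the ratio, and hence the mutual information, unchanged when the noise enters on the inputs.

```python
# Sketch: with noise on each input, y = sum_i w_i (x_i + nu_i), the quantity
# I = 1/2 * log( var(y) / (var(nu) * sum_i w_i^2) ) is invariant to rescaling w.
import numpy as np

def mi_input_noise(w, C, sigma_nu2):
    # var(y) = w^T C w + sigma_nu2 * |w|^2 ; noise contribution = sigma_nu2 * |w|^2
    var_y = w @ C @ w + sigma_nu2 * np.sum(w**2)
    return 0.5 * np.log(var_y / (sigma_nu2 * np.sum(w**2)))

C = np.diag([3.0, 1.0, 0.5, 0.2])        # hypothetical input covariance
w = np.array([1.0, 0.2, 0.1, 0.0])
print(mi_input_noise(w, C, 0.1))          # some value
print(mi_input_noise(10 * w, C, 0.1))     # identical: scaling w does not help
```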
Information Theory and The Neuron
Now consider two output neurons:

y_j = \sum_i w_{ij}\, x_i + \nu_j, \qquad j = 1, 2

[Diagram: the input x feeds the two output units y_1 and y_2.]

I(x, y) = \frac{1}{2} \log \frac{\det(R)}{\sigma_\nu^4}

\det(R) = \sigma_\nu^4 + \sigma_\nu^2 (\sigma_1^2 + \sigma_2^2) + \sigma_1^2 \sigma_2^2 (1 - \rho_{12}^2)

where R is the covariance matrix of the outputs, \sigma_i^2 the variance of y_i and \rho_{12} the correlation between y_1 and y_2.
How to define a learning rule to optimize the
mutual information?
Hebbian Learning
[Diagram: input x_i connects to output y_j through the weight w_{ij}.]

y_j = \sum_i w_{ij}\, x_i

\Delta w_{ij} = \eta\, x_i\, y_j, \qquad \eta: learning rate
If x_i and y_j fire simultaneously, the weight of the connection between them is strengthened in proportion to the strength of their firing.
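A minimal sketch of this update in code (network size, learning rate and data are illustrative):

```python
# Plain Hebbian update, no decay: y_j = sum_i w_ij x_i, delta w_ij = eta * x_i * y_j.
import numpy as np

def hebbian_step(W, x, eta=0.01):
    y = W.T @ x                  # outputs, shape (n_out,)
    W += eta * np.outer(x, y)    # delta w_ij = eta * x_i * y_j
    return y

rng = np.random.default_rng(1)
W = rng.normal(scale=0.1, size=(4, 2))   # 4 inputs, 2 outputs
for _ in range(1_000):
    hebbian_step(W, rng.normal(size=4))
print(np.linalg.norm(W))   # without decay the weight norm keeps growing (see next slide)
```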
Hebbian Learning – Limit Cycle
\frac{d}{dt} W(t) = C\, W(t)
Stability?
\frac{d}{dt} W(t) = 0 \iff \langle \Delta W(t) \rangle = 0

\exists\, w^* \text{ such that } E[\Delta w_i^*] = 0 \ \forall i

E[\Delta w_i^*] = E[y\, x_i] = E\Big[\sum_j w_j^* x_j x_i\Big] = \sum_j C_{ij}\, w_j^* = 0

C: the correlation matrix
This holds for all i; thus w^* is an eigenvector of C with associated eigenvalue 0.
C is a symmetric, positive semi-definite matrix, so all its eigenvalues are >= 0.
Under a small disturbance \epsilon:

E[\Delta(w^* + \epsilon)] = C\,(w^* + \epsilon) = C\,\epsilon \neq 0

⟹ The weights tend to grow in the direction of the eigenvector of C with the largest eigenvalue.
Hebbian Learning – Weight Decay
The simple weight decay rule belongs to a class of decay rules called subtractive rules:

\Delta w_{ij} = \eta\, (x_i\, y_j - \gamma\, w_{ij})

The only advantage of subtractive rules over simply clipping the weights is that they eliminate weights that have little importance.
Another important type of decay rule is the multiplicative rule:

\Delta w_{ij} = \eta\, x_i\, y_j - \gamma(w_{ij})\, w_{ij}, \qquad \gamma(w_{ij}): a function of the weight
The advantage of multiplicative rules is that, in addition to keeping the weights small, they also yield useful weights.
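The two variants differ only in how the decay term is formed; a minimal sketch (the concrete decay function in the multiplicative version is a hypothetical choice, shown only for illustration):

```python
# Minimal sketch of the two decay variants (shapes and rates are illustrative).
import numpy as np

def subtractive_decay_step(W, x, eta=0.01, gamma=0.1):
    """delta w_ij = eta * (x_i * y_j - gamma * w_ij)"""
    y = W.T @ x
    W += eta * (np.outer(x, y) - gamma * W)
    return y

def multiplicative_decay_step(W, x, gamma_fn, eta=0.01):
    """delta w_ij = eta * x_i * y_j - gamma(w_ij) * w_ij"""
    y = W.T @ x
    W += eta * np.outer(x, y) - gamma_fn(W) * W
    return y

# Example decay function whose strength grows with the weight magnitude (hypothetical).
quadratic_gamma = lambda W: 0.1 * W**2
```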
Information Theory and The Neuron
1
2

X
W
1
3
Output: y
W
2
4
y    wi   xi   i  
W
3
i
W
4
\tilde{J}(w_i) = \frac{w_i^T C\, w_i}{w_i^T w_i}

Oja's one-neuron model maximizes this objective, and hence the mutual information

I(x, y) = \frac{1}{2} \log \left( \frac{\sigma_y^2}{\sigma_\nu^2 \sum_j w_j^2} \right),

with the learning rule

\Delta w_i = \eta\, (x_i\, y - y^2\, w_i)
The weights converge toward the first eigenvector of the input covariance matrix and are normalized.
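A small simulation of Oja's rule (covariance matrix and learning rate are illustrative) confirming both properties:

```python
# Oja's single-neuron rule: delta w = eta * (x*y - y^2 * w).
# The weight vector should align with the leading eigenvector and have unit norm.
import numpy as np

rng = np.random.default_rng(2)
C = np.array([[3.0, 1.0], [1.0, 1.0]])            # hypothetical input covariance
X = rng.multivariate_normal(np.zeros(2), C, size=20_000)

w = rng.normal(size=2)
eta = 0.01
for x in X:
    y = w @ x
    w += eta * (y * x - y**2 * w)                 # Oja's rule

eigvals, eigvecs = np.linalg.eigh(C)
pc1 = eigvecs[:, -1]                              # leading eigenvector of C
print(np.linalg.norm(w), np.abs(w @ pc1))         # ~1 and ~1 (up to sign)
```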
Hebbian Learning – Weight Decay
Oja’s subspace algorithm

wij   x j  yi  yi   wkj  yk
k

Equivalent to minimizing the generalized form of J:
\tilde{J}(w_{ij}) = \frac{w_i^T C\, w_j}{w_i^T w_i} = \frac{\langle y_i\, y_j \rangle_n}{w_i^T I\, w_i}

where \langle \cdot \rangle_n denotes the average over the n training patterns, and

I(x, y) = \frac{1}{2} \log \left( \frac{\det R}{\sigma_\nu^{2n}} \right)
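In matrix form, Oja's subspace rule above reads \Delta W = \eta\,(y\,x^T - y\,y^T W); a short sketch with made-up input statistics shows the rows of W converging to an orthonormal basis of the leading subspace:

```python
# Oja's subspace rule for m outputs, W of shape (m_outputs, n_inputs):
# delta W = eta * (y x^T - y y^T W).
import numpy as np

def oja_subspace_step(W, x, eta=0.005):
    y = W @ x
    W += eta * (np.outer(y, x) - np.outer(y, y) @ W)
    return y

rng = np.random.default_rng(3)
C = np.diag([4.0, 2.0, 0.5, 0.1])                 # hypothetical input covariance
X = rng.multivariate_normal(np.zeros(4), C, size=30_000)
W = rng.normal(scale=0.1, size=(2, 4))            # 2 outputs spanning the top-2 subspace
for x in X:
    oja_subspace_step(W, x)
print(W @ W.T)                                    # ~ identity: rows become orthonormal
```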
Hebbian Learning – Weight Decay
Why PCA, LDA, ICA with ANN?
• They explain how the brain could derive important properties of the sensory and motor space.
• They allow one to discover new modes of computation with simple, iterative, local learning rules.
Recurrence in Neural Networks
So far, we have considered only feed-forward neural networks. Most biological networks, however, have recurrent connections.
This change in the direction of the flow of information is interesting, as it allows the network:
• to keep a memory of the activation of the neurons;
• to propagate information across output neurons.
Anti-Hebbian Learning
[Diagram: the input x drives two output units y_1 and y_2, linked by anti-Hebbian lateral connections.]
How can we maximize information transmission in such a network, i.e. maximize I(x; y)?
Anti-Hebbian Learning
Anti-Hebbian learning is also known as lateral inhibition:

\Delta w_{ij} = -\eta\, \langle y_i\, y_j \rangle

where \langle \cdot \rangle denotes the average over all training patterns.
Anti-Hebbian Learning
If the two outputs are highly correlated, the weight between them will grow to a large negative value, and each will tend to turn the other off.

\Delta w_{ij} = -\eta\, \langle y_i\, y_j \rangle

\Delta w_{ij} \to 0 \iff \langle y_i\, y_j \rangle \to 0
There is no need for weight decay or renormalization of anti-Hebbian weights: they are automatically self-limiting!
Anti-Hebbian Learning
Foldiak’s first Model
y_i = x_i + \sum_{j=1}^{n} w_{ij}\, y_j

\Delta w_{ij} = -\eta\, y_i\, y_j \quad \text{for } i \neq j \qquad (w_{ii} = 0)

In matrix terms:

y = x + W\, y \;\Rightarrow\; y = (I - W)^{-1}\, x = T\, x
Anti-Hebbian Learning
Foldiak’s first Model
One can further show that there is a stable point in the weight space.
w_f = \frac{-1 + \sqrt{1 - \rho^2}}{\rho}

where \rho is the correlation coefficient of the two (unit-variance) inputs.
Anti-Hebbian Learning
Foldiak’s 2ND Model
wii   1  yi yi 
Allows all neurons to receive their own outputs with weight 1
W    I  YY T 
This network will converge when:
1) the outputs are decorrelated
2) the expected variance of the outputs is equal to 1.
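The fixed point of \Delta W = \eta\,(I - y\,y^T) is indeed \langle y\,y^T \rangle = I. The sketch below applies the same update to a plain linear map y = M x, a simplification of Foldiak's recurrent architecture, just to illustrate that this update whitens the outputs (numbers are illustrative):

```python
# The update delta M = eta * (I - y y^T) M drives <y y^T> toward the identity.
import numpy as np

rng = np.random.default_rng(5)
C = np.array([[2.0, 1.2], [1.2, 1.0]])
X = rng.multivariate_normal(np.zeros(2), C, size=50_000)

M = np.eye(2)
eta = 0.002
for x in X:
    y = M @ x
    M += eta * (np.eye(2) - np.outer(y, y)) @ M    # symmetric decorrelating update

Y = X @ M.T
print(np.cov(Y.T))                                 # ~ identity: unit variance, decorrelated
```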
PCA versus ICA
PCA looks at the covariance matrix only. What if the data is not well
described by the covariance matrix?
The only distribution that is uniquely specified by its covariance (once the mean has been subtracted) is the Gaussian distribution. Distributions that deviate from the Gaussian are poorly described by their covariances.
PCA versus ICA
Even with non-Gaussian data, variance maximization leads to the
most faithful representation in a reconstruction error sense.
The mean-square error measure implicitly assumes Gaussianity, since it penalizes datapoints close to the mean less than those that are far away.
But it does not, in general, lead to the most meaningful representation.
⟹ We need to perform gradient descent on some function other than the reconstruction error.
Uncorrelated versus Statistically Independent

Uncorrelated:  E[y_1\, y_2] = E[y_1]\, E[y_2]

Independent:  E[f(y_1)\, f(y_2)] = E[f(y_1)]\, E[f(y_2)]  (true for any non-linear transformation f)
Statistical Independence is a stronger constraint
than decorrelation.
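A quick illustration of the difference (a made-up example): with x drawn from a standard Gaussian, y_1 = x and y_2 = x^2 - 1 are uncorrelated but clearly not independent, as a non-linear test function reveals.

```python
# Uncorrelated does not imply independent.
import numpy as np

rng = np.random.default_rng(6)
x = rng.normal(size=1_000_000)
y1, y2 = x, x**2 - 1

print(np.mean(y1 * y2))              # ~ 0  -> uncorrelated (E[x^3] = 0)
print(np.mean(y1**2 * y2))           # ~ 2  -> E[f(y1) f(y2)] != E[f(y1)] E[f(y2)]
print(np.mean(y1**2) * np.mean(y2))  # ~ 0
```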
Objective Function of ICA
We want to ensure that the outputs y_i are maximally independent. This is equivalent to requiring that their mutual information be small, or alternatively that their joint entropy be large.
[Diagram: entropy Venn diagram: H(x, y) = H(x|y) + I(x, y) + H(y|x), with H(x) = H(x|y) + I(x, y) and H(y) = H(y|x) + I(x, y).]
Anti-Hebbian Learning and ICA
Anti-Hebbian learning can also lead to a decomposition into statistically independent components, and as such allows an ICA-type decomposition.
To ensure independence, the network must converge to a solution that satisfies the condition

E[f(y_1)\, f(y_2)] = E[f(y_1)]\, E[f(y_2)]

for any given function f.
ICA for TIME-DEPENDENT SIGNALS
[Figure: the two original signals s_1(t), s_2(t) and the two mixed signals x_1(t), x_2(t).]

X(t) = A\, S(t)
ICA for TIME-DEPENDENT SIGNALS
Given only the mixed signals x_1(t), x_2(t), can we recover the sources S(t) = A^{-1} X(t) when both S(t) and A^{-1} are unknown?

[Figure: the mixed signals x_1(t), x_2(t).]
Adapted from Hyvärinen (2000).
Anti-Hebbian Learning and ICA
Jutten and Herault model:

y_i = x_i - \sum_{j=1}^{n} w_{ij}\, y_j

In matrix terms:  y = x - W\, y \;\Rightarrow\; y = (I + W)^{-1} x

Non-linear learning rule:

\Delta w_{ij} = \eta\, f(y_i)\, g(y_j) \quad \text{for } i \neq j

If f and g are the identity, we recover the Hebbian rule, which ensures convergence to uncorrelated outputs: E[y_1\, y_2] = 0.
To ensure independence, the network must converge to a solution that satisfies the condition

E[f(y_1)\, f(y_2)] = E[f(y_1)]\, E[f(y_2)]

for any given function f.
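A structural sketch of the Jutten-Herault loop follows. The sources, mixing matrix, non-linearities (cubic and tanh) and step size are illustrative choices, not taken from the lecture, and convergence of this rule is known to be sensitive to them:

```python
# Jutten-Herault feedback network: y = (I + W)^{-1} x, delta w_ij = eta * f(y_i) * g(y_j).
import numpy as np

rng = np.random.default_rng(9)
T = 20_000
t = np.arange(T)
S = np.stack([np.sin(2 * np.pi * 0.013 * t),            # two independent toy sources
              np.sign(np.sin(2 * np.pi * 0.007 * t))], axis=1)
A = np.array([[1.0, 0.5], [0.6, 1.0]])                  # hypothetical mixing matrix
X = S @ A.T

W = np.zeros((2, 2))
eta = 0.01
f, g = lambda y: y**3, np.tanh                          # two odd non-linearities
for x in X:
    y = np.linalg.solve(np.eye(2) + W, x)               # y = (I + W)^{-1} x
    dW = eta * np.outer(f(y), g(y))
    np.fill_diagonal(dW, 0.0)                           # update off-diagonal weights only
    W += dW

Y = np.linalg.solve(np.eye(2) + W, X.T).T
print(np.mean(f(Y[:, 0]) * g(Y[:, 1])))                 # ~ 0 at a separating fixed point
```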
Anti-Hebbian Learning and ICA
HINT: use two odd functions for f and g (f(-x) = -f(x)); their Taylor series expansions then consist solely of odd terms:

f(x) = \sum_{j \geq 0} a_{2j+1}\, x^{2j+1}, \qquad g(x) = \sum_{j \geq 0} b_{2j+1}\, x^{2j+1}

\Delta w_{ij} = \eta\, f(y_1)\, g(y_2) = \eta \sum_{j \geq 0} \sum_{k \geq 0} a_{2j+1}\, b_{2k+1}\; y_1^{2j+1}\, y_2^{2k+1}

\Delta w_{ij} = 0 \;\Rightarrow\; E[\, y_1^{2j+1}\, y_2^{2k+1} \,] = 0 \quad \text{for all } j, k

Since most (audio) signals have an even (symmetric) distribution, their odd moments vanish, so at convergence one has

E[\, y_1^{2j+1}\, y_2^{2k+1} \,] = E[\, y_1^{2j+1} \,]\, E[\, y_2^{2k+1} \,]
Anti-Hebbian Learning and ICA
Application to Blind Source Separation

[Figure: the mixed signals.]

[Figure: the signals unmixed through generalized anti-Hebbian learning.]

Hsiao-Chun Wu et al., ICNN 1996, MWSCAS 1998, ICASSP 1999
Information Maximization
Bell & Sejnowski proposed a network that maximizes the mutual information between the output and the input when these are not subject to noise (or rather, when the input and the noise can no longer be distinguished; H(y|x) then tends to negative infinity).
1
2

X
W
1
3
Output: y
W
2
4
W
3
W
W
4
0
y
1
1  e W X  w0
Bell A.J. and Sejnowski T.J. (1995). An information maximisation approach to blind separation and blind deconvolution. Neural Computation, 7(6), 1129-1159.
Information Maximization
I x, y   H  y   H  y | x
H(Y|X) is independent of the weights W and so
Information Maximization
The entropy of a distribution is maximized when all outcomes are
equally likely.
 We must choose an activation function at the output neurons which
equalizes each neuron’s chances of firing and so maximizes their
collective entropy.
Anti-Hebbian Learning and ICA
The sigmoid is the optimal solution to even out a Gaussian distribution so that all outputs are equally probable.
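Intuitively, squashing a signal through (an approximation of) its own cumulative distribution yields a near-uniform output, which has maximal entropy. A small check with a Gaussian input and a logistic whose slope roughly matches the Gaussian CDF (the slope value 1.7 below is just a convenient approximation):

```python
# A Gaussian input squashed by logistic(1.7 x) ~ Gaussian CDF gives ~uniform outputs.
import numpy as np

rng = np.random.default_rng(8)
x = rng.normal(size=100_000)
y = 1.0 / (1.0 + np.exp(-1.7 * x))
hist, _ = np.histogram(y, bins=20, range=(0.0, 1.0), density=True)
print(hist.round(2))          # close to flat -> output values roughly equiprobable
```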
Anti-Hebbian Learning and ICA
The pdf of the output can be written as:

p(y) = \frac{p(x)}{\left| \partial y / \partial x \right|}

The entropy of the output is then given by:

H(y) = -E[\ln p(y)] = E\left[ \ln \left| \frac{\partial y}{\partial x} \right| \right] + H(x)

The learning rules that optimize this entropy (for the sigmoidal unit above) are given by:

\Delta w \propto \frac{1}{w} + x\, (1 - 2y), \qquad \Delta w_0 \propto 1 - 2y
Anti-Hebbian Learning and ICA
In \Delta w \propto \frac{1}{w} + x\,(1 - 2y), the term x\,(1 - 2y) is anti-Hebbian: it avoids the saturated solution y = 1. The term 1/w acts as an anti-weight-decay: it moves the weights away from the trivial solution w = 0.
Anti-Hebbian Learning and ICA
This can be generalized to a many-inputs, many-outputs network with sigmoidal output units. The learning rule that maximizes the mutual information between input and output is then:

\Delta W \propto (W^T)^{-1} + (1 - 2y)\, x^T, \qquad \Delta w_0 \propto 1 - 2y
Such a network can linearly decompose up to 10 sources.
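A hedged sketch of this generalized rule applied to a 2x2 blind source separation problem (sources, mixing matrix and step size are made up; the bias term w_0 is omitted for brevity):

```python
# Bell & Sejnowski infomax rule with logistic outputs y = sigma(W x):
# delta W = eta * ( (W^T)^{-1} + (1 - 2 y) x^T ).
import numpy as np

rng = np.random.default_rng(7)
n, T = 2, 50_000
S = rng.laplace(size=(T, n))                  # two super-Gaussian sources
A = np.array([[1.0, 0.6], [0.4, 1.0]])        # hypothetical mixing matrix
X = S @ A.T                                   # mixed signals

W = np.eye(n)
eta = 0.001
for x in X:
    u = W @ x
    y = 1.0 / (1.0 + np.exp(-u))              # logistic non-linearity
    W += eta * (np.linalg.inv(W.T) + np.outer(1.0 - 2.0 * y, x))

print(W @ A)   # ~ diagonal up to permutation and scaling, if the rule has converged
```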