Self-Organization: Hebbian Learning
CS/CMPE 333 – Neural Networks
Introduction

- So far, we have studied neural networks that learn from their environment in a supervised manner.
- Neural networks can also learn in an unsupervised manner, known as self-organized learning.
- Self-organized learning discovers significant features or patterns in the input data through general rules that operate locally.
- Self-organizing networks typically consist of two layers with feedforward connections and elements that facilitate 'local' learning.
Self-Organization

- "Global order can arise from local interactions" – Turing (1952)
- An input signal produces certain activity patterns in the network <-> the weights are modified (a feedback loop)
- Principles of self-organization:
  1. Modifications in the weights tend to self-amplify
  2. Limitation of resources leads to competition and selection of the most active synapses, and disregard of the less active synapses
  3. Modifications in the weights tend to cooperate
Hebbian Learning

- A self-organizing principle was proposed by Hebb in 1949 in the context of biological neurons.
- Hebb's principle:
  - When a neuron repeatedly excites another neuron, the threshold of the latter neuron is decreased, or the synaptic weight between the neurons is increased, in effect increasing the likelihood of the second neuron firing.
- Hebbian learning rule:
  Δw_ji = η y_j x_i
  - There is no desired or target signal required in the Hebbian rule, hence it is unsupervised learning.
  - The update rule is local to the weight.
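A minimal sketch of this rule in Python (an illustrative setup of my own; the function and variable names are not from the slides):

```python
import numpy as np

def hebbian_update(W, x, y, eta=0.01):
    """Plain Hebbian rule: delta_w_ji = eta * y_j * x_i (outer-product form)."""
    return W + eta * np.outer(y, x)

rng = np.random.default_rng(0)
W = rng.normal(size=(1, 3))      # one linear neuron with three inputs
x = rng.normal(size=3)
y = W @ x                        # linear activation: y = W x
W = hebbian_update(W, x, y)
```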
Hebbian Update

- Consider the update of a single weight w (x and y are the pre- and post-synaptic activities):
  w(n + 1) = w(n) + η x(n) y(n)
- For a linear activation function:
  w(n + 1) = w(n)[1 + η x²(n)]
  - The weight increases without bound. If the initial weight is negative, it grows more negative; if it is positive, it grows more positive.
- Hebbian learning is intrinsically unstable, unlike error-correction learning with the BP algorithm.
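The instability is easy to see numerically; a toy illustration with a constant input (parameters chosen arbitrarily):

```python
# Repeated Hebbian updates on a single linear weight with a fixed input x.
eta, x = 0.1, 1.0
for w0 in (0.5, -0.5):
    w = w0
    for n in range(20):
        w = w * (1 + eta * x**2)   # w(n+1) = w(n)[1 + eta * x^2(n)]
    print(w0, "->", w)             # the magnitude grows without bound
```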
Geometric Interpretation of Hebbian Learning

- Consider a single linear neuron with p inputs:
  y = w^T x = x^T w
  and
  Δw = η [x_1 y, x_2 y, …, x_p y]^T
- The dot product can be written as
  y = ||w|| ||x|| cos(α)
  - α = angle between the vectors x and w
  - If α is zero (x and w are 'close'), y is large. If α is 90° (x and w are 'far'), y is zero.
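A quick numerical illustration of this dot-product view (toy vectors of my own choosing):

```python
import numpy as np

w = np.array([1.0, 2.0, 0.5])
x = np.array([0.9, 2.1, 0.4])            # roughly aligned with w
y = w @ x                                # y = w^T x
cos_alpha = y / (np.linalg.norm(w) * np.linalg.norm(x))
print(y, cos_alpha)                      # large output, cos(alpha) close to 1
```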
Similarity Measure

- A network trained with Hebbian learning creates a similarity measure (the inner product) in its input space according to the information contained in the weights.
  - The weights capture (memorize) the information in the data during training.
- During operation, when the weights are fixed, a large output y signifies that the present input is "similar" to the inputs x that created the weights during training.
- Other similarity measures:
  - Hamming distance
  - Correlation
Hebbian Learning as Correlation Learning

- Hebbian learning (pattern-by-pattern mode):
  Δw(n) = η x(n) y(n) = η x(n) x^T(n) w(n)
- Using batch mode:
  Δw = η [Σ_{n=1}^{N} x(n) x^T(n)] w(0)
- The term Σ_{n=1}^{N} x(n) x^T(n) is a sample approximation of the autocorrelation of the input data.
  - Thus Hebbian learning can be thought of as learning the autocorrelation of the input space.
- Correlation is a well-known operation in signal processing and statistics. In particular, it completely describes signals defined by Gaussian distributions.
  - Applications in signal processing.
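A short sketch checking that, with the weights frozen at w(0), accumulating the pattern-by-pattern updates gives the batch update driven by the sample autocorrelation (toy data of my own):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))            # N = 100 patterns, p = 3 inputs
w0 = rng.normal(size=3)
eta = 0.01

R_hat = X.T @ X                          # sum_n x(n) x^T(n)
delta_w_batch = eta * R_hat @ w0         # eta [sum_n x x^T] w(0)

delta_w_sum = sum(eta * x * (x @ w0) for x in X)   # sum of per-pattern updates
print(np.allclose(delta_w_batch, delta_w_sum))     # True
```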
Oja’s Rule

- The simple Hebbian rule causes the weights to increase (or decrease) without bound.
- The weights can be normalized to unit length as
  w_ji(n + 1) = [w_ji(n) + η x_i(n) y_j(n)] / sqrt( Σ_i [w_ji(n) + η x_i(n) y_j(n)]² )
  - This equation effectively imposes the constraint that the norm of the weight vector at each neuron be equal to 1.
- Oja approximated the normalization (for small η) as
  w_ji(n + 1) = w_ji(n) + η y_j(n)[x_i(n) – y_j(n) w_ji(n)]
  - This is Oja’s rule, or the normalized Hebbian rule.
  - It involves a 'forgetting term' that prevents the weights from growing without bound.
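A minimal sketch of Oja's rule for a single linear neuron (the function name, data, and learning parameters are my own choices):

```python
import numpy as np

def oja_train(X, eta=0.005, epochs=100, seed=0):
    """Train one linear neuron with Oja's rule on data X of shape (N, p)."""
    w = np.random.default_rng(seed).normal(size=X.shape[1])
    for _ in range(epochs):
        for x in X:
            y = w @ x
            w += eta * y * (x - y * w)    # Hebbian term plus forgetting term
    return w

X = np.random.default_rng(2).normal(size=(200, 3))
w = oja_train(X)
print(np.linalg.norm(w))                  # stays close to 1
```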
Oja’s Rule – Geometric Interpretation

- The simple Hebbian rule finds the weight vector direction along which the input data has the largest variance.
  - However, the magnitude of the weight vector increases without bound.
- Oja’s rule has the same interpretation; the normalization only changes the magnitude, while the direction of the weight vector is the same.
  - The magnitude is equal to one.
- Oja’s rule converges asymptotically, unlike the Hebbian rule, which is unstable.
The Maximum Eigenfilter

- A linear neuron trained with Oja’s rule produces a weight vector that is the principal eigenvector of the input autocorrelation matrix, and the variance of its output equals the largest eigenvalue.
- Such a neuron solves the following eigenvalue problem:
  R e_1 = λ_1 e_1
  - R = autocorrelation matrix of the input data
  - e_1 = eigenvector associated with the largest eigenvalue; it corresponds to the weight vector w obtained by Oja’s rule
  - λ_1 = largest eigenvalue; it corresponds to the variance of the network’s output
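An informal numerical check of this claim (toy anisotropic data; the Oja loop mirrors the earlier sketch):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(500, 3)) * np.array([3.0, 1.0, 0.3])   # anisotropic inputs
R = (X.T @ X) / len(X)                    # sample autocorrelation matrix
e1 = np.linalg.eigh(R)[1][:, -1]          # eigenvector of the largest eigenvalue

w = rng.normal(size=3)
for _ in range(100):
    for x in X:
        y = w @ x
        w += 0.005 * y * (x - y * w)      # Oja's rule
print(abs(w @ e1) / np.linalg.norm(w))    # near 1: w aligns with e1 (up to sign)
```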
Principal Component Analysis (1)

- Oja’s rule, applied to a single neuron, creates a principal component of the input space in the form of the weight vector.
- How can we find other components of the input space with significant variance?
- In statistics, PCA is used to obtain the significant components of the data in the form of orthogonal principal axes.
  - PCA is also known as K-L (Karhunen-Loève) filtering in signal processing.
  - It was first proposed in 1901; later developments occurred in the 1930s, 1940s and 1960s.
- A Hebbian network with Oja’s rule can perform PCA.
Principal Component Analysis (2)

- PCA
  - Consider a set of vectors x with zero mean. There exists an orthogonal transformation y = Q^T x such that the covariance matrix of y, Λ = E[y y^T], is diagonal:
    Λ_ij = λ_i if i = j, and Λ_ij = 0 otherwise
  - λ_1 > λ_2 > … > λ_p are the eigenvalues of the covariance matrix of x, C = E[x x^T]
  - The columns of Q are the corresponding eigenvectors
  - The first component of y is the principal component; it has the maximum variance, and each component of y is uncorrelated with all the others
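A compact illustration of this decomposition with NumPy (synthetic data; the mixing matrix is an arbitrary choice of mine):

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(1000, 3)) @ np.array([[2.0, 0.5, 0.0],
                                           [0.0, 1.0, 0.2],
                                           [0.0, 0.0, 0.3]])
X -= X.mean(axis=0)                          # zero-mean the data
C = np.cov(X, rowvar=False)                  # C = E[x x^T]
eigvals, Q = np.linalg.eigh(C)               # columns of Q are eigenvectors
order = np.argsort(eigvals)[::-1]            # sort by decreasing eigenvalue
eigvals, Q = eigvals[order], Q[:, order]

Y = X @ Q                                    # y = Q^T x, applied row-wise
print(np.round(np.cov(Y, rowvar=False), 2))  # ~diagonal, with eigvals on the diagonal
```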
PCA – Example
(figure slide; image not reproduced)
Hebbian Network for PCA

- Procedure
  - Use Oja’s rule to find the principal component
  - Project the data onto the subspace orthogonal to the principal component
  - Use Oja’s rule on the projected data to find the next major component
  - Repeat the above for m <= p (m = number of desired components; p = input space dimensionality)
- How do we find the projection onto the orthogonal direction?
  - Deflation method: subtract the principal component from the input (see the sketch below)
- Oja’s rule can be modified to perform this operation: Sanger’s rule
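A toy sketch of the deflation idea using a plain Oja loop (the helper name and the data are my own):

```python
import numpy as np

def oja(X, eta=0.005, epochs=100, seed=0):
    """One linear neuron trained with Oja's rule; returns a unit-length weight vector."""
    w = np.random.default_rng(seed).normal(size=X.shape[1])
    for _ in range(epochs):
        for x in X:
            y = w @ x
            w += eta * y * (x - y * w)
    return w / np.linalg.norm(w)

rng = np.random.default_rng(5)
X = rng.normal(size=(500, 3)) * np.array([3.0, 1.5, 0.5])
w1 = oja(X)                                # first principal direction
X_deflated = X - np.outer(X @ w1, w1)      # subtract each pattern's projection on w1
w2 = oja(X_deflated)                       # second principal direction
print(abs(w1 @ w2))                        # near 0: the two directions are orthogonal
```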
Sanger’s Rule

- Sanger’s rule is a modification of Oja’s rule that implements the deflation method for PCA.
  - Classical PCA involves matrix operations.
  - Sanger’s rule implements PCA in an iterative fashion suited to neural networks.
- Consider p inputs and m outputs, where m < p:
  y_j(n) = Σ_{i=1}^{p} w_ji(n) x_i(n),   j = 1, …, m
  and the update (Sanger’s rule) is
  Δw_ji(n) = η [ y_j(n) x_i(n) – y_j(n) Σ_{k=1}^{j} w_ki(n) y_k(n) ]
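A minimal sketch of one Sanger update step for an m-output layer (the shapes, data, and names are my own; convergence behaviour in the comment is indicative):

```python
import numpy as np

def sanger_step(W, x, eta=0.01):
    """One update of Sanger's rule; W has shape (m, p), row j holds the weights of output j."""
    y = W @ x                                    # y_j = sum_i w_ji x_i
    dW = np.zeros_like(W)
    for j in range(W.shape[0]):
        residual = x - W[:j + 1].T @ y[:j + 1]   # x_i - sum_{k<=j} w_ki y_k
        dW[j] = eta * y[j] * residual
    return W + dW

rng = np.random.default_rng(6)
X = rng.normal(size=(500, 4)) * np.array([3.0, 2.0, 1.0, 0.3])
W = 0.1 * rng.normal(size=(2, 4))                # extract m = 2 of p = 4 components
for _ in range(50):
    for x in X:
        W = sanger_step(W, x)
print(np.round(W @ W.T, 2))                      # roughly the 2x2 identity (orthonormal rows)
```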
PCA for Feature Extraction

- PCA is the optimal linear feature extractor: no other linear system provides features that give a better reconstruction.
  - PCA may or may not be the best preprocessing for pattern classification or recognition. Classification requires good discrimination, which PCA might not provide.
- Feature extraction: transform the p-dimensional input space to an m-dimensional space (m < p), such that the m dimensions capture the information with minimal loss.
- The error e in the reconstruction is given by
  e² = Σ_{i=m+1}^{p} λ_i
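A quick numerical check of this error formula (synthetic data; m chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.normal(size=(2000, 4)) * np.array([3.0, 2.0, 1.0, 0.5])
X -= X.mean(axis=0)
eigvals, Q = np.linalg.eigh(np.cov(X, rowvar=False))
order = np.argsort(eigvals)[::-1]
eigvals, Q = eigvals[order], Q[:, order]

m = 2
Qm = Q[:, :m]                               # keep the top m principal directions
X_rec = (X @ Qm) @ Qm.T                     # project onto them and reconstruct
mse = np.mean(np.sum((X - X_rec) ** 2, axis=1))
print(mse, eigvals[m:].sum())               # the two values agree closely
```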
PCA for Data Compression

- PCA identifies an orthogonal coordinate system for the input data such that the variance of the projection onto the principal axis is largest, followed by the next major axis, and so on.
- By discarding some of the minor components, PCA can be used for data compression, where a p-dimensional input is encoded in an m < p dimensional space.
- The weights are computed by Sanger’s rule on typical inputs.
- The de-compressor (receiver) must know the weights of the network to reconstruct the original signal:
  x’ = W^T y
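A compression/reconstruction sketch (for brevity, W is taken from an eigendecomposition here; Sanger's rule would learn an equivalent W iteratively):

```python
import numpy as np

rng = np.random.default_rng(8)
X = rng.normal(size=(1000, 8)) * np.array([4.0, 2.0, 1.0, 0.3, 0.2, 0.15, 0.1, 0.05])
X -= X.mean(axis=0)
eigvals, Q = np.linalg.eigh(np.cov(X, rowvar=False))
W = Q[:, np.argsort(eigvals)[::-1][:3]].T      # m = 3 components; W has shape (3, 8)

Y = X @ W.T                                    # compressed codes: y = W x
X_rec = Y @ W                                  # reconstruction: x' = W^T y
rel_err = np.linalg.norm(X - X_rec) / np.linalg.norm(X)
print(rel_err)                                 # small: most variance lives in the kept components
```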
PCA for Classification (1)

- Can PCA enhance classification?
- In general, no. PCA is good for reconstruction, not for feature discrimination or classification.
PCA for Classification (2)
(figure slide; image not reproduced)
PCA – Some Remarks

- Practical uses of PCA
  - Data compression
  - Cluster analysis
  - Feature extraction
  - Preprocessing for classification/recognition (e.g., preprocessing for MLP training)
- Biological basis
  - It is unlikely that the processing performed by biological neurons in, say, perception involves PCA only. More complex feature extraction processes are involved.
Anti-Hebbian Learning

- Modify the Hebbian rule as
  Δw_ji(n) = – η x_i(n) y_j(n)
- The anti-Hebbian rule finds the direction in the input space that has the minimum variance. In other words, it is the complement of the Hebbian rule.
  - Anti-Hebbian learning performs de-correlation: it de-correlates the output from the input.
- The Hebbian rule is unstable, since it tries to maximize the variance. The anti-Hebbian rule, on the other hand, is stable and converges.
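A toy sketch of the sign-flipped update. Note that the plain rule shrinks the weights toward zero, so the vector is renormalized after each epoch here (my own addition, purely to make the surviving direction visible); it ends up along the minimum-variance axis:

```python
import numpy as np

rng = np.random.default_rng(9)
X = rng.normal(size=(500, 3)) * np.array([3.0, 1.0, 0.3])   # least variance on the third axis
R = (X.T @ X) / len(X)
e_min = np.linalg.eigh(R)[1][:, 0]        # eigenvector of the smallest eigenvalue

w = rng.normal(size=3)
eta = 0.002
for _ in range(100):
    for x in X:
        w -= eta * (w @ x) * x            # delta_w_ji = -eta * x_i * y_j
    w /= np.linalg.norm(w)                # keep only the direction
print(abs(w @ e_min))                     # near 1: w points along the minimum-variance direction
```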