Local learning algorithm for optical neural networks
Yong Qiao and Demetri Psaltis
An anti-Hebbian local learning algorithm for two-layer optical neural networks is introduced. With this
learning rule, the weight update for a certain connection depends only on the input and output of that
connection and a global, scalar error signal. Therefore the backpropagation of error signals through the
network, as required by the commonly used back error propagation algorithm, is avoided. It still
guarantees, however, that the synaptic weights are updated in the error descent direction. With the
apparent advantage of simpler optical implementation this learning rule is also shown by simulations to
be computationally
effective.
Key words: Optical neural networks, anti-Hebbian local learning.
The most widely used algorithm for training multilayer neural networks is the backward error propagation (BEP) algorithm,¹ which is a steepest-descent algorithm that minimizes the error at the output of the network with respect to the weight values of the connections between the neurons. Methods for the optical implementation of BEP have been proposed,²,³ but the practical realization of these ideas is complicated by the fact that BEP is a nonlocal learning rule.
Consider the two-layer network shown in Fig. 1. If we change any one of the weights in the first layer (decrease or increase), the effect of this change on the output of the network depends on the value and sign of the weights of the second layer. BEP is a steepest-descent method and therefore attempts to decrease the output error as quickly as possible. To accomplish this, we require knowledge of the second-layer weights in order to change the weights in the first layer. From an implementation point of view, this complicates matters, since we must communicate information in both directions and have bidirectional neurons with a different functionality in the forward and backward directions.²,³ Here we describe an algorithm for training two-layer optical neural networks in which the weight updates are calculated from the signal at the input of each connection, the signal at the other end of the same connection, and a global scalar error signal. The advantage of this algorithm is that it can be implemented with signals that are locally available, which simplifies the optical system.

The authors are with the Department of Electrical Engineering, California Institute of Technology, Pasadena, California 91125.
Received 30 September 1991.
0003-6935/92/173285-04$05.00/0.
© 1992 Optical Society of America.

We can get an intuitive feel for how our algorithm works by considering the network of Fig. 1 with only a single output neuron. Suppose that the output of that neuron takes the value +1 or -1. The weights of the second layer are trained by using the same procedure as in BEP. For the weights of the first layer, if the output is wrong for a particular input, we can correct it by adjusting the weights of the first layer to produce the negative of the current response of the hidden layer.⁴ Therefore we can treat the first layer as a single-layer net with a known desired output, and thus it can be trained with one of the existing algorithms for training single-layer nets. Below we show that it is possible to select the training algorithm for the first layer so as to guarantee that the output error will decrease at each iteration.
Let the numbers of neurons for the input, first, and second layers of the two-layer network shown in Fig. 1 be $N_0$, $N_1$, and $N_2$, respectively. The inputs to the neurons of the $n$th layer are

$$ s_j^{(n)} = \sum_{i=1}^{N_{n-1}} w_{ji}^{(n)} o_i^{(n-1)}, \tag{1} $$

where $w_{ji}^{(n)}$ is the weight of the interconnection between the $j$th neuron in the $n$th layer and the $i$th neuron in the previous layer, and $o_i^{(n)}$ is the output of the $i$th neuron in the $n$th layer. The signal at the
input layer is denoted by $o_i^{(0)}$. The first and second layers of neurons perform a soft thresholding operation on their inputs to produce the outputs

$$ o_j^{(n)} = f[s_j^{(n)}], \tag{2} $$

where the function $f$ is chosen as $f(x) = \tanh(x)$ for the analysis.

Fig. 1. Schematic diagram of a feed-forward two-layer neural network.

10 June 1992 / Vol. 31, No. 17 / APPLIED OPTICS 3285

The desired response that corresponds to the input $o_i^{(0)}$ is the vector $t$ whose elements are binary, $t_k \in \{1, -1\}$ for $k = 1, 2, \ldots, N_2$. The output error of the network is measured by the logarithmic energy function⁵,⁶:

$$ E = \sum_{k=1}^{N_2} \left[ (1 + t_k) \ln \frac{1 + t_k}{1 + o_k^{(2)}} + (1 - t_k) \ln \frac{1 - t_k}{1 - o_k^{(2)}} \right]. \tag{3} $$

Equation (3) has its global minimum at $E = 0$ and reaches this minimum if and only if the network output is the same as the desired response. We chose this form of error function instead of the more commonly used quadratic error function because we found in simulations that, for the problems we investigated, this energy function gave a better performance. The algorithm we describe here also works with the quadratic error function with a slight, straightforward modification of the weight update rule we derive below.

The BEP rule changes the weights by means of gradient descent, i.e.,

$$ \Delta w_{kj}^{(2)} \propto -\frac{\partial E}{\partial w_{kj}^{(2)}} = 2 \delta_k o_j^{(1)}, \tag{4} $$

$$ \Delta w_{ji}^{(1)} \propto -\frac{\partial E}{\partial w_{ji}^{(1)}} = 2 \left\{ 1 - [o_j^{(1)}]^2 \right\} o_i^{(0)} \sum_{k=1}^{N_2} \delta_k w_{kj}^{(2)}, \tag{5} $$

where $\delta_k = t_k - o_k^{(2)}$ is the output error signal. The nonlocal nature of the BEP algorithm is due to the $\sum_{k=1}^{N_2} \delta_k w_{kj}^{(2)}$ factor in proportion (5), which contains the values of the weights of the second layer.

Let $\gamma_k = \delta_k s_k^{(2)}$. Then $\gamma_k$ is positive if and only if the sign of the $k$th output unit matches the sign of $t_k$. For example, let us assume that $t_k = -1$ and $o_k^{(2)} > 0$. Then $\delta_k < 0$. The input signal to the $k$th output unit, $s_k^{(2)}$, has the same sign as the output of that unit and is, therefore, also positive. This gives us $\gamma_k < 0$ for this case. Let us now define the quantity $\gamma = \sum_{k=1}^{N_2} \gamma_k$, which will be positive if the sign of most of the output units matches the target sign and will be negative if the reverse is true. We can construct a learning rule for the first layer of weights by using $\gamma$ as a performance metric. Notice that $\gamma$ is a scalar quantity that can be calculated from signals that are available at the output layer of the network. The basic idea is to modify the first layer of weights so that we reinforce the production of the current hidden layer response if $\gamma$ is positive or reinforce the production of the negative of the current hidden layer response if $\gamma$ is negative. We know that the simple Hebbian learning rule

$$ \Delta w_{ji}^{(1)} \propto o_j^{(1)} o_i^{(0)} \tag{6} $$

reinforces the reproduction of the current response when presented with the same input. Therefore, if we want to reinforce the negative of the current response when the output is incorrect, then we can simply adopt an anti-Hebbian rule by multiplying the right-hand side of proportion (6) by $\gamma$. This idea leads to the following anti-Hebbian local learning (ALL) algorithm for the first layer:

$$ \Delta w_{ji}^{(1)} \propto \frac{\gamma \, o_j^{(1)} o_i^{(0)}}{1 - [o_j^{(1)}]^2}. \tag{7} $$

The denominator in proportion (7) is an additional term that is needed to guarantee that this learning rule always decreases the overall error at the output. By using proportions (5) and (7), and by assuming that the weights of interconnections between any input neuron and all hidden neurons are updated simultaneously (which is true in most practical situations), we obtain

$$ \Delta E = \sum_{j=1}^{N_1} \frac{\partial E}{\partial w_{ji}^{(1)}} \Delta w_{ji}^{(1)} \propto -\gamma [o_i^{(0)}]^2 \sum_{k=1}^{N_2} \sum_{j=1}^{N_1} \delta_k w_{kj}^{(2)} o_j^{(1)} = -\gamma [o_i^{(0)}]^2 \sum_{k=1}^{N_2} \delta_k s_k^{(2)} = -\gamma^2 [o_i^{(0)}]^2 \le 0, \tag{8} $$

which proves our claim. Thus we have proved that even though the ALL algorithm is not a steepest-descent rule, it is still a descent rule.
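
The descent property can be checked numerically on a small example. The following is a minimal sketch in plain Python, not the authors' simulation code: the toy weights, input, target, and step size are hypothetical values chosen only for illustration, and the function names are ours. It updates the weights leaving one input neuron with proportion (7) and compares the actual change in $E$ with the first-order value $-2 \eta \gamma^2 [o_i^{(0)}]^2$, where $\eta$ is the step size and the factor 2 is the constant carried by proportion (5).

```python
import math


def forward(w1, w2, x):
    """Forward pass of Eqs. (1) and (2) with f = tanh; returns (o1, s2, o2)."""
    o1 = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in w1]
    s2 = [sum(w * oj for w, oj in zip(row, o1)) for row in w2]
    o2 = [math.tanh(s) for s in s2]
    return o1, s2, o2


def energy(o2, t):
    """Logarithmic energy of Eq. (3) for binary targets t_k in {+1, -1}.

    For binary targets one prefactor (1 +/- t_k) vanishes, so only the
    other term of each summand is evaluated.
    """
    return sum(2.0 * math.log(2.0 / (1.0 + tk * ok)) for ok, tk in zip(o2, t))


def all_delta_w1(x, o1, s2, o2, t, step):
    """ALL update of proportion (7) for every first-layer weight w1[j][i].

    Only the connection's input x[i], its output o1[j], and the global
    scalar gamma are used; the second-layer weights never appear.
    """
    gamma = sum((tk - ok) * sk for tk, ok, sk in zip(t, o2, s2))
    dw = [[step * gamma * oj * xi / (1.0 - oj * oj) for xi in x] for oj in o1]
    return dw, gamma


def descent_check(w1, w2, x, t, i, step):
    """Update only the weights leaving input neuron i, then return the actual
    change in E and the first-order prediction -2 * step * gamma^2 * x_i^2."""
    o1, s2, o2 = forward(w1, w2, x)
    dw, gamma = all_delta_w1(x, o1, s2, o2, t, step)
    w1_new = [row[:] for row in w1]
    for j in range(len(w1)):
        w1_new[j][i] += dw[j][i]
    e_old = energy(o2, t)
    e_new = energy(forward(w1_new, w2, x)[2], t)
    return e_new - e_old, -2.0 * step * gamma ** 2 * x[i] ** 2


# Hypothetical toy network: 3 inputs, 2 hidden neurons, 2 outputs.
W1 = [[0.2, -0.1, 0.4], [-0.3, 0.5, 0.1]]
W2 = [[0.6, -0.4], [0.3, 0.7]]
X = [1.0, -1.0, 0.5]
T = [1, -1]

actual, predicted = descent_check(W1, W2, X, T, i=0, step=1e-3)
print(f"actual dE = {actual:.3e}, first-order prediction = {predicted:.3e}")
```

For a small step the two numbers should agree closely and both should be negative; with a large step the first-order argument of Eq. (8) no longer applies.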
In some cases it may happen that during training the output of a certain hidden neuron becomes so close to +1 or -1 that the denominator $1 - [o_j^{(1)}]^2$ in proportion (7) is close to zero. This causes numerical instability. One way to avoid this is to find $o_{\max}^{(1)}$, which is the hidden neuron output that has the maximum magnitude, and normalize the right-hand side of proportion (7) by the factor $|1 - [o_{\max}^{(1)}]^2|$.
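
One way to read this normalization is sketched below, under stated assumptions. We take "normalize by the factor" to mean multiplying the right-hand side of proportion (7) by $|1 - [o_{\max}^{(1)}]^2|$, so that each connection is scaled by a ratio that never exceeds 1. Because the factor is a single nonnegative scalar shared by all connections, it does not change the sign of $\Delta E$ in Eq. (8). The function name and the handling of exactly saturated outputs (where the ratio would be 0/0) are our own choices, not part of the original description.

```python
def stabilized_all_update(x, o1, gamma, step):
    """ALL update of proportion (7) scaled by |1 - o_max^2| (assumed reading).

    x     : input-layer signals o_i^(0)
    o1    : hidden-layer outputs o_j^(1), each with magnitude <= 1
    gamma : global scalar error signal
    step  : learning step size
    """
    o_max = max(o1, key=abs)              # hidden output of largest magnitude
    num = abs(1.0 - o_max * o_max)
    dw = []
    for oj in o1:
        den = 1.0 - oj * oj               # den >= num because |oj| <= |o_max|
        # The ratio is 1 for the largest-magnitude neuron (also when both
        # num and den are exactly zero) and num / den <= 1 otherwise.
        ratio = 1.0 if den <= num else num / den
        dw.append([step * gamma * oj * xi * ratio for xi in x])
    return dw


# Hypothetical example with one nearly saturated hidden neuron.
dw = stabilized_all_update(x=[1.0, 2.0], o1=[0.5, -0.999999, 0.0],
                           gamma=-0.3, step=0.1)
print(dw)
```

With this scaling every update magnitude is bounded by `step * |gamma| * |x_i|`, however close a hidden output comes to +1 or -1.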
Computer simulations of the ALL algorithm were performed for the problem of recognizing handwritten zip-code digits provided by the U.S. Postal Service. For comparison, the BEP algorithm was also used to solve the same problem. The handwritten zip codes were first segmented into single digits, and then each digit was reduced to fit a 10 × 10 binary pixel grid. A network of 100 input neurons (to match the 10 × 10 pixel grid), 5 hidden neurons, and 3 output neurons was selected and trained to classify three classes of handwritten digits: 3, 6, and 8. Each output neuron responds to one class only. A total of 600 digit patterns, with 200 patterns from each class, were selected. These 600 patterns were partitioned into 300 training samples, 150 validation samples, and 150 test samples. The validation samples were used after each learning iteration (i.e., presentation of the whole training set) to calculate the classification error of the network. The network training stopped when the classification error of the network on the validation set stopped decreasing with further iteration. After the network was trained, the test samples were presented to the network to find its generalization error. For the ALL algorithm, the first layer was trained in only 1 of every 40 iterations. By doing this, we relied more on the steepest-descent training in the second layer, which improved learning convergence for this particular classification problem and this particular network. For given training, validation, and test sets, the network was trained four times with different initial conditions for both the ALL and the BEP algorithms. The same step size was used for the two algorithms for the purpose of comparison. The same simulations were repeated with different training, validation, and test sets obtained from a different partitioning of the 600 digit patterns (the numbers of training, validation, and test samples were still 300, 150, and 150, respectively). Therefore there was a total of eight runs for each algorithm.
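
The training schedule described above can be sketched as a short loop. This is an illustration under assumptions, not the simulation code used for the reported results: the three callables are hypothetical placeholders for the second-layer update, the ALL first-layer update, and the validation-set classification error, and the `patience` parameter is our own addition, since the text does not say how many non-improving iterations were tolerated before stopping.

```python
def train(update_second_layer, update_first_layer, validation_error,
          first_layer_period=40, patience=1, max_iterations=100000):
    """Run learning iterations until the validation error stops decreasing.

    The second layer is updated in every iteration; the first layer is
    updated only once every `first_layer_period` iterations.
    Returns (number of iterations run, best validation error seen).
    """
    best = validation_error()
    stale = 0                                  # iterations without improvement
    for iteration in range(1, max_iterations + 1):
        update_second_layer()
        if iteration % first_layer_period == 0:
            update_first_layer()
        error = validation_error()
        if error < best:
            best, stale = error, 0
        else:
            stale += 1
            if stale >= patience:
                return iteration, best
    return max_iterations, best


# Usage with stand-in callables that only record the order of the updates.
log = []
errors = iter([0.5, 0.4, 0.3, 0.3, 0.2])
result = train(lambda: log.append("layer 2"), lambda: log.append("layer 1"),
               lambda: next(errors), first_layer_period=2)
print(result, log)
```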
For the ALL algorithm, the network was able to converge (meaning that all the training patterns were classified correctly) in seven cases. In only one case did the network fall into a local minimum and give a training error of 1%. For the BEP algorithm, the network was able to converge in all eight cases. The average generalization errors for the ALL and the BEP algorithms were 9% and 8%, respectively. As for the average convergence rate of the two algorithms, it took 581 iterations for the BEP algorithm to converge and 3,665 iterations for the ALL algorithm to converge. However, since the amount of computation involved in each learning iteration is different for the two algorithms, we should also compare the convergence rate in terms of the number of computational steps. It turns out that, for this size of network, the number of computational steps in each iteration for BEP is 2.4 times that for ALL. In this sense, ALL is only 2.6 times slower than BEP when implemented on a serial digital computer. In an optical implementation that is fully parallel, the appropriate speed comparison between BEP and ALL should be based on the number of learning iterations, since the time required to complete each iteration is roughly the same for both algorithms. Therefore, for the problem that we studied in our experiment, a parallel optical implementation of ALL would be about 6 times slower than BEP. However, the optical system that implements ALL is much simpler than the BEP system, and this makes it much more likely that an ALL system can be built in practice. We should point out that the relative convergence rates of the two algorithms quoted above apply only to the problem we have tried, and the relative performance will in general be problem dependent.
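
The two slowdown figures follow directly from the iteration counts and the per-iteration cost ratio quoted above. The short calculation below reproduces that arithmetic; the variable names are ours.

```python
bep_iterations = 581          # average iterations for BEP to converge
all_iterations = 3665         # average iterations for ALL to converge
bep_cost_per_iteration = 2.4  # computational steps per BEP iteration, ALL = 1

# Fully parallel (optical) case: every iteration takes about the same time.
parallel_slowdown = all_iterations / bep_iterations

# Serial digital computer: weight each iteration by its computational cost.
serial_slowdown = parallel_slowdown / bep_cost_per_iteration

print(f"parallel: ALL is {parallel_slowdown:.1f} times slower than BEP")
print(f"serial:   ALL is {serial_slowdown:.1f} times slower than BEP")
```

The parallel ratio comes out slightly above 6 and the serial ratio at about 2.6, in line with the figures stated in the text.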
One possible implementation of the ALL algorithm is shown in Fig. 2. This architecture is quite similar to the architecture described in Refs. 2 and 3 for the implementation of BEP, with a few key differences. The input images are recorded on an electrically addressed spatial light modulator (EASLM1), and hologram #1 interconnects the pixels at the input plane to the pixels at the intermediate or hidden layer plane. The nonlinear response of the neurons at the hidden layer is simulated by an optically addressed spatial light modulator (OASLM). The second layer is similar, with hologram #2 interconnecting pixels from the OASLM to the output plane, where a two-dimensional CCD detector (CCD1) is placed to detect the light. At the input stage there is a second EASLM (EASLM2) on which the reference signals that are necessary for the adaptation of hologram #1 are recorded. Similarly, at the hidden layer there is an EASLM3 to record the reference for hologram #2, which is simply the error signal $\delta_k$ [see proportion (4)]. This error signal is produced by subtracting the network output from the desired target signal, a point operation that can be accomplished either optically or electronically. The reference for hologram #1 is more difficult to derive [see proportion (7)]. It involves the global error signal $\gamma$ and the response of the hidden layer ($o^{(1)}$). The scalar $\gamma$ can be calculated with point operations from signals that are already available at the output of the system, but we need a second detector (CCD2) at the output to record $o^{(1)}$ as it is imaged through hologram #2. Once the reference signals are calculated and recorded on EASLM2 and EASLM3, hologram #1 (hologram #2) is exposed to the interference between the signal recorded on EASLM2 (EASLM3) and $o^{(0)}$ ($o^{(1)}$). This requires that a latching device, such as the microchannel spatial light modulator, be used as the OASLM. An exposure schedule (which is described in Ref. 3) must be used to ensure that each holographic exposure contributes equally to the overall hologram. The key difference between the ALL architecture and the BEP architecture described in Refs. 2 and 3 is that the light always travels in the same direction throughout the system (left to right in Fig. 2). This simplifies the construction and the alignment of the system and, most importantly, does not require a device at the hidden layer (OASLM) that operates in both directions and has a different response function in each direction. Thus the ALL architecture is much more likely to be constructed in the foreseeable future.

Fig. 2. Optical architecture that implements the ALL algorithm.
This work was supported by the Defense Advanced Research Projects Agency and the U.S. Air Force Office of Scientific Research.
References
1. D. E. Rumelhart, G. E. Hinton, and R. J. Williams, in Parallel Distributed Processing, D. E. Rumelhart and J. L. McClelland, eds. (MIT Press, Cambridge, Mass., 1986), Vol. 1, pp. 318-362.
2. K. Wagner and D. Psaltis, "Multilayer optical learning networks," Appl. Opt. 26, 5061-5076 (1987).
3. D. Psaltis, D. Brady, and K. Wagner, "Adaptive optical networks using photorefractive crystals," Appl. Opt. 27, 1752-1759 (1988).
4. T. Grossman, R. Meir, and E. Domany, "Learning by choice of internal representations," Complex Syst. 2, 555-575 (1988).
5. J. J. Hopfield, "Learning algorithms and probability distributions in feed-forward and feed-back networks," Proc. Natl. Acad. Sci. USA 84, 8429-8433 (1987).
6. S. A. Solla, E. Levin, and M. Fleisher, "Accelerated learning in layered neural networks," Complex Syst. 2, 625-640 (1988).