
RESEARCH NOTES
Chinese Journal of Chemical Engineering, 20(6) 1219—1224 (2012)
Fast Learning in Spiking Neural Networks by Learning Rate Adaptation*
FANG Huijuan (方慧娟)**, LUO Jiliang (罗继亮) and WANG Fei (王飞)
College of Information Science and Engineering, Huaqiao University, Xiamen 361021, China
Abstract  To accelerate supervised learning by the SpikeProp algorithm under the temporal coding paradigm in spiking neural networks (SNNs), three learning rate adaptation methods (heuristic rule, delta-delta rule, and delta-bar-delta rule), which are used to speed up training in artificial neural networks, are adopted to develop training algorithms for feedforward SNNs. The performance of these algorithms is investigated in four experiments: the classical XOR (exclusive or) problem, the Iris dataset, fault diagnosis in the Tennessee Eastman process, and Poisson trains of discrete spikes. The results demonstrate that all three learning rate adaptation methods can speed up the convergence of SNNs compared with the original SpikeProp algorithm. Furthermore, if the adaptive learning rate is used in combination with a momentum term, the two modifications balance each other in a beneficial way to achieve rapid and steady convergence. Among the three learning rate adaptation methods, the delta-bar-delta rule performs best: combined with momentum it has the fastest convergence, the most stable training process, and the highest learning accuracy. The proposed algorithms are simple and efficient, and consequently valuable for practical applications of SNNs.
Keywords spiking neural networks, learning algorithm, learning rate adaptation, Tennessee Eastman process
1  INTRODUCTION
In the past two decades, artificial neural networks (ANNs) have been widely used to model or control many industrial processes [1, 2]. The third generation of neural networks, spiking neural networks (SNNs), has received increasing attention in recent years [3]. This kind of network differs from
traditional ANNs in that spiking neurons propagate
information by the timing of individual spikes, rather
than by the rate of spikes. Some studies have shown
that the temporal coding is more biologically plausible
than the rate coding [4], and SNNs can be used to
analyze spike trains directly without losing temporal
information [5]. Since their inherent properties are closer to those of biological neurons than sigmoidal units are, SNNs with temporal coding have, in theory, higher computational power than ANNs with sigmoidal activation functions [6]. Furthermore, networks that communicate through discrete spikes instead of analog values are more suitable for hardware implementation [7]. The SNN model is thus a very promising alternative to the sigmoidal network model.
The learning method is one of the most important
problems for practical applications of temporally encoded spiking neural networks. A supervised learning
rule, called SpikeProp [8], was proposed based on error backpropagation (BP), under the assumption that the internal state of the neuron increases linearly in a sufficiently small region around the instant of neuronal firing. It has since been proven that SpikeProp is mathematically correct even without this linearity assumption [9]. Subsequently, several researchers sought to improve the original SpikeProp, for example by providing additional learning rules [10], adding a momentum term [11], and developing QuickProp and resilient propagation (RProp) algorithms [12]. The performance of gradient-descent-based learning algorithms is very sensitive to the setting of the learning rate [11-13], and modifications of the BP algorithm that adjust the learning rate during training have proven practically valuable for ANNs. However, in the aforementioned extensions of the SpikeProp algorithm, the learning rate is either constant or only simplistically adaptive. A dynamic self-adaptation (DS) method for the learning rate has been proposed [14], but that algorithm is computationally expensive and has a lower convergence success rate than SpikeProp.
In this paper, three methods (heuristic rule,
delta-delta rule, and delta-bar-delta rule) are applied to
update the learning rate adaptively during the weight training
of the original SpikeProp algorithm. A momentum term
is also used to help move out of local minima on the
error surface by taking into account previous movements on this surface. We perform four experiments on
the classical exclusive OR (XOR) problem, the Iris dataset classification problem, fault diagnosis in the Tennessee Eastman (TE) process and decoding information
from Poisson spike trains, to demonstrate that the proposed algorithms are simple, efficient, and easy to use in combination with other SNN learning algorithms.
Received 2012-06-10, accepted 2012-07-31.
* Supported by the National Natural Science Foundation of China (60904018, 61203040), the Natural Science Foundation of Fujian Province of China (2009J05147, 2011J01352), the Foundation for Distinguished Young Scholars of Higher Education of Fujian Province of China (JA10004), and the Science Research Foundation of Huaqiao University (09BS617).
** To whom correspondence should be addressed. E-mail: [email protected]

2  LEARNING RATE ADAPTATION METHODS

The general feedforward SNN architecture with multiple delayed synaptic terminals is shown in Fig. 1.

Figure 1  Feedforward spiking neural network with multiple delayed synaptic terminals for each connection
Input nodes are in a set called H, hidden neurons in set
I, and output neurons in set J. The behavior of each
spiking neuron is modeled according to a simple version of the spike response model (SRM) [15]. The state
of the neuron in SRM is described by its membrane
potential
$x_j(t) = \sum_{i \in \Gamma_j} \sum_k w_{ij}^k\, \varepsilon\!\left(t - t_i - d^k\right)$    (1)
Here neuron j in layer J, having a set $\Gamma_j$ of immediate predecessors ("pre-synaptic neurons"), receives a set of spikes with firing times $t_i$, $i \in \Gamma_j$. $w_{ij}^k$ is the weight of synapse k of the connection from neuron i to j, and $d^k$ is the delay of synapse k. ε is the standard post-synaptic potential (PSP), described by $\varepsilon(t) = (t/\tau)\,e^{1 - t/\tau}$ for t > 0 (and 0 otherwise), where τ is the decay time constant. When the state variable $x_j(t)$ crosses the threshold θ, the neuron fires a pulse, the so-called action potential or spike, described by its spike time $t_j$. The membrane potential then drops to its resting potential and a refractory period follows. The original SpikeProp is developed for SNNs in which each neuron fires only once, so the refractory period need not be considered.
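To make the model concrete, the following minimal Python sketch (ours, not the authors' code) evaluates the PSP and the membrane potential of Eq. (1) and finds the first threshold crossing; the values of τ and θ are those given later in Section 3, and the helper names are ours.

```python
import numpy as np

TAU = 7.0     # PSP decay time constant (value used in Section 3)
THETA = 50.0  # membrane threshold (value used in Section 3)

def epsilon(t):
    """Standard PSP: (t/tau)*exp(1 - t/tau) for t > 0, else 0."""
    t = np.asarray(t, dtype=float)
    safe = np.clip(t, 0.0, None)          # avoids overflow in exp() for t <= 0
    return np.where(t > 0, (safe / TAU) * np.exp(1.0 - safe / TAU), 0.0)

def membrane_potential(t, spike_times, weights, delays):
    """x_j(t) = sum_i sum_k w_ij^k * eps(t - t_i - d^k), Eq. (1).

    spike_times : firing time t_i of each presynaptic neuron i, shape (I,)
    weights     : synaptic weights w_ij^k, shape (I, K)
    delays      : synaptic delays d^k, shape (K,)
    """
    spike_times = np.asarray(spike_times, dtype=float)
    delays = np.asarray(delays, dtype=float)
    lags = t - spike_times[:, None] - delays[None, :]   # lags[i, k] = t - t_i - d^k
    return float(np.sum(np.asarray(weights) * epsilon(lags)))

def first_spike_time(spike_times, weights, delays, t_max=50.0, dt=0.01):
    """Earliest time at which x_j(t) crosses THETA; None if the neuron never fires."""
    for t in np.arange(0.0, t_max, dt):
        if membrane_potential(t, spike_times, weights, delays) >= THETA:
            return t
    return None
```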
SpikeProp, derived in [8], uses the same weight update method as error backpropagation

$\Delta w_{ij}^k = -\eta\, \dfrac{\partial E}{\partial w_{ij}^k}$    (2)

Here η is the learning rate, which is held constant throughout training [8]. The performance of the steepest-descent algorithm can be improved if the learning rate is allowed to change during training. In addition, to improve convergence and get around local minima, momentum is used in combination with the adaptive learning rate

$w_{ij}^k(n+1) = w_{ij}^k(n) - \eta_{ij}^k(n+1)\, \dfrac{\partial E(n)}{\partial w_{ij}^k(n)} + \alpha\, \Delta w_{ij}^k(n-1)$    (3)

where α is the momentum coefficient. The adaptive learning rate $\eta_{ij}^k$ is indexed by n + 1 rather than n simply to indicate that it is updated before $w_{ij}^k$. The adaptive learning rate, which is made responsive to the complexity of the error surface, should attempt to keep the learning step size as large as possible while keeping learning stable. In the following, three methods are used to adjust the learning rate adaptively.
2.1  Heuristic rule
The heuristic rule can be written as
$\eta(n) = \begin{cases} a\,\eta(n-1) & \text{if } E[w(n)] < E[w(n-1)] \\ b\,\eta(n-1) & \text{if } E[w(n)] > k\,E[w(n-1)] \\ \eta(n-1) & \text{otherwise} \end{cases}$    (4)
where a, b and k are all parameters [16]. The heuristic rule updates the learning rate at every learning step, but within one step the learning rate is the same for all weights in the network. The following two learning rate adaptation rules instead assign an individual learning rate to each synaptic weight in order to deal with the complex error surface.
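As an illustration, the heuristic rule of Eq. (4) amounts to a few lines of code; the sketch below (ours, not the authors' implementation) uses the parameter values reported later in Section 3.

```python
# Heuristic rule of Eq. (4): one global learning rate, scaled up when the epoch
# error decreases and scaled down when it grows by more than a factor k.
def heuristic_learning_rate(eta_prev, error_now, error_prev,
                            a=1.05, b=0.8, k=1.04):
    """Return eta(n) given eta(n-1) and the errors E[w(n)], E[w(n-1)]."""
    if error_now < error_prev:          # error decreased: accelerate
        return a * eta_prev
    if error_now > k * error_prev:      # error grew markedly: back off
        return b * eta_prev
    return eta_prev                     # otherwise keep the step size
```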
2.2  Delta-delta rule

As with the delta-delta rule in ANNs [17], we take the partial derivative of the error function in the nth iteration with respect to $\eta_{ij}^k(n)$:

$\dfrac{\partial E(n)}{\partial \eta_{ij}^k(n)} = \dfrac{\partial E(n)}{\partial t_j^a(n)}\, \dfrac{\partial t_j^a(n)}{\partial x_j\!\left[t_j^a(n)\right]}\, \dfrac{\partial x_j\!\left[t_j^a(n)\right]}{\partial \eta_{ij}^k(n)}$    (5)

and obtain the final update rule:

$\Delta \eta_{ij}^k(n) = -\gamma\, \dfrac{\partial E(n)}{\partial \eta_{ij}^k(n)} = \gamma\, \dfrac{\partial E(n)}{\partial w_{ij}^k(n)}\, \dfrac{\partial E(n-1)}{\partial w_{ij}^k(n-1)}$    (6)
where γ is a positive adjustable step size for the learning rate. The delta-delta rule has some potential problems. For example, if the partial derivatives with respect to some weight have the same sign in two successive iterations but small magnitudes, the corresponding learning rate receives only a small increment. On the other hand, if the two successive partial derivatives have opposite signs and large magnitudes, the learning rate is decreased sharply. It is difficult to choose a proper adjustable step γ that handles both cases, and the following delta-bar-delta rule overcomes these difficulties.
2.3  Delta-bar-delta rule

The delta-bar-delta rule is defined by the following functions [17]

$\Delta \eta_{ij}^k(n) = \begin{cases} a & \text{if } S_{ij}^k(n-1)\,D_{ij}^k(n) > 0 \\ -b\,\eta_{ij}^k(n) & \text{if } S_{ij}^k(n-1)\,D_{ij}^k(n) < 0 \\ 0 & \text{otherwise} \end{cases}$    (7)

$D_{ij}^k(n) = \dfrac{\partial E(n)}{\partial w_{ij}^k(n)}$    (8)
$S_{ij}^k(n) = (1 - \xi)\,D_{ij}^k(n) + \xi\, S_{ij}^k(n-1)$    (9)

where a, b and ξ are all parameters. Typical values are $10^{-4} \le a \le 0.1$, $0.1 \le b \le 0.5$, and $0.1 \le \xi \le 0.7$.
In the delta-bar-delta rule, learning rates increase linearly and decrease exponentially. The linear increment prevents the learning rate from growing too quickly, while the exponential decrement lets it decrease very rapidly yet remain positive.
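A per-weight sketch of the delta-bar-delta update of Eqs. (7)-(9), combined with the momentum step of Eq. (3), is given below; this is our own illustrative code, with the gradient dE/dw assumed to be supplied by SpikeProp, and the parameter values taken from Section 3.

```python
import numpy as np

def delta_bar_delta_step(w, eta, s_prev, dw_prev, grad,
                         a=0.1, b=0.2, xi=0.3, alpha=0.9):
    """One training step; all arrays share the shape of the weight tensor w."""
    d = np.asarray(grad, dtype=float)          # D_ij^k(n), Eq. (8)
    s = (1.0 - xi) * d + xi * s_prev           # exponentially averaged gradient, Eq. (9)

    # Eq. (7): increase eta linearly while the averaged and current gradients
    # agree in sign, decrease it exponentially when they disagree.
    agree = s_prev * d
    d_eta = np.where(agree > 0, a, np.where(agree < 0, -b * eta, 0.0))
    eta = eta + d_eta                          # eta(n+1), updated before the weights

    # Eq. (3): gradient step with the per-weight learning rate plus a momentum term.
    dw = -eta * d + alpha * dw_prev
    w = w + dw
    return w, eta, s, dw
```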
3  EXPERIMENTS AND RESULTS
The network architecture adopted in this paper is
a fully connected feedforward network with multiple
delays per connection, as described in Fig. 1. With the
method used in [18], the weight in each connection is
initialized to a random number between 1 and 10.
During the training, only positive weights are allowed
for distinguishing excitatory and inhibitory neurons.
The parameters for the spiking neuron model are:
synaptic time constant τ = 7 and membrane threshold
θ = 50 . The simulation time range is from 0 to 50 ms
with the time step 0.01 ms. The maximum number of
training epochs is set to 500. The initial learning rates
of the three learning rate adaptation methods introduced in Section 2 are all set to 1. The parameters of
the three learning rules are selected within the typical ranges used in ANNs [16, 17] by trial and error: in the heuristic rule, a = 1.05, b = 0.8, k = 1.04; in the delta-delta rule, γ = 0.1; in the delta-bar-delta rule, a = 0.1, b = 0.2, ξ = 0.3.
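For convenience, the common settings listed above can be collected into a single configuration object; the dictionary below is purely illustrative (the field names are ours), with per-experiment overrides noted in the comments.

```python
COMMON_CONFIG = {
    "tau": 7.0,                    # synaptic time constant
    "theta": 50.0,                 # membrane threshold
    "t_max_ms": 50.0,              # simulation window 0-50 ms
    "dt_ms": 0.01,                 # time step (0.1 ms in the Iris, TE and Poisson runs)
    "max_epochs": 500,             # 200 for Iris, 100 for TE and Poisson trains
    "init_weight_range": (1.0, 10.0),
    "eta_init": 1.0,               # initial learning rate for all three adaptation rules
    "heuristic": {"a": 1.05, "b": 0.8, "k": 1.04},
    "delta_delta": {"gamma": 0.1},
    "delta_bar_delta": {"a": 0.1, "b": 0.2, "xi": 0.3},
}
```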
3.1 XOR problem
We attempt to replicate the XOR experiment with a similar encoding scheme and network architecture to those used in the original SpikeProp algorithm [8]. The three
learning rate adaptation methods (heuristic, delta-delta,
and delta-bar-delta) with and without momentum are
used to train the SNN to learn the XOR pattern separately. The results are presented in Table 1. Each entry
represents the average number of epochs for 100 different simulations. Convergence is defined by a mean
squared error (MSE) of 1.0 ms. The momentum coefficient α is set to 1.5 in the delta-bar-delta method and to 0.9 in the other methods.
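The stopping rule can be made explicit with a short sketch of the spike-time MSE check (our illustrative code; the SpikeProp forward and backward passes themselves are not shown).

```python
import numpy as np

def spike_time_mse(actual_times, desired_times):
    """Mean squared error between actual and desired output spike times."""
    actual = np.asarray(actual_times, dtype=float)
    desired = np.asarray(desired_times, dtype=float)
    return float(np.mean((actual - desired) ** 2))

def has_converged(actual_times, desired_times, tol=1.0):
    """Convergence criterion used in Table 1: MSE below 1.0 (units as in the paper)."""
    return spike_time_mse(actual_times, desired_times) < tol
```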
Table 1  Number of average iterations (±SD) with XOR

Algorithm          Without momentum    With momentum
heuristic          75 ± 44             46 ± 16
delta-delta        116 ± 37            38 ± 16
delta-bar-delta    66 ± 17             18 ± 4
SpikeProp          128 ± 51            43 ± 40
As shown in Table 1, using the three learning rate adaptation methods with momentum, the speed of convergence is increased by 73% on average compared with the original SpikeProp. Simulations with momentum are on average 60% faster than those without momentum. The momentum term increases the effective learning rate but may have a destabilizing effect.
The standard deviation (SD) of the SpikeProp with
momentum shows significant dispersion from the average compared with that of the SpikeProp without
momentum, while the learning rate adaptation methods
with momentum are more stable than those without
momentum. The above analyses indicate that learning
rate adaptation and momentum balance each other in a
beneficial way to deal with the complex error surface.
Among all the modifications of SpikeProp, the delta-bar-delta method with momentum has the fastest and most stable convergence. If the momentum coefficient α is set to 0.9 in the delta-bar-delta method, the average number of iterations is 33. Because its learning rate adaptation is the most effective, the momentum coefficient of the delta-bar-delta method can be raised to 1.5, giving convergence within 18 iterations. By contrast, if the momentum coefficient of the heuristic or delta-delta method is greater than 1, the training procedure is likely to take longer or even fail to converge.
3.2  Iris dataset
The Iris dataset consists of three classes, two of
which are not linearly separable, providing a good
practical problem for our algorithm. The dataset contains 150 data samples with 4 continuous input variables. As in [8], we employ 12 neurons with Gaussian
receptive fields to encode each of the 4 input variables.
Ten hidden neurons (including one inhibitory neuron)
and three output neurons are used. Output classification is encoded according to a winner-take-all paradigm, where the neuron coding for the respective class is assigned an early firing time (12 ms) and all other neurons a considerably later one (16 ms). A classification is correct if the neuron that fires earliest corresponds to the neuron required to fire first. Ten different sets of initial weights are tested, and for each set the dataset is randomly partitioned into a training set (50%) and a test set (50%) 10 times, giving 100 simulations in total. Other parameters are the same as in the XOR experiment, except that the time step is 0.1 ms and the maximum number of training epochs is set to 200.
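A sketch of this population coding with Gaussian receptive fields, in the spirit of [8], is shown below; the centre spacing, width and no-spike threshold used here are illustrative assumptions rather than the exact values of the paper.

```python
import numpy as np

def encode_variable(x, x_min, x_max, n_fields=12, t_max=10.0, beta=1.5):
    """Map one scalar input x to n_fields spike times (np.inf means no spike)."""
    # Evenly spaced receptive-field centres over the variable's range (assumed layout).
    centers = np.linspace(x_min, x_max, n_fields)
    width = (x_max - x_min) / (beta * (n_fields - 1))
    # Gaussian response in (0, 1]; a strong response produces an early spike.
    response = np.exp(-0.5 * ((x - centers) / width) ** 2)
    spike_times = t_max * (1.0 - response)
    # Very weak responses are treated as "no spike" (illustrative threshold).
    spike_times[response < 0.1] = np.inf
    return spike_times
```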
Figure 2 shows the classification accuracy averaged over 100 groups as a function of the number of
training epochs. The algorithms with momentum perform better than those without momentum. The delta-bar-delta rule with momentum has the fastest convergence, the most stable training process, and the highest accuracy among all eight algorithms. Its training accuracy reaches 100% after 85 training epochs, whereas the SpikeProp algorithm only reaches 98.5% by 200 epochs. Table 2 shows the best accuracy over the entire 200 epochs and how many training iterations each algorithm takes to reach 95% accuracy.
Figure 2  Comparison of Iris dataset classification accuracy of the original SpikeProp algorithm and the seven modifications versus the number of training epochs, for the training set (a) and the test set (b). Legend: SpikeProp; heuristics; delta-delta; delta-bar-delta; SpikeProp with momentum; heuristics with momentum; delta-delta with momentum; delta-bar-delta with momentum.
Table 2  Results of Iris dataset

                               Iterations to >95% accuracy     Best accuracy
Algorithm                      Training set    Test set        Training set/%    Test set/%
with momentum
  heuristic                    41              48              99.97             95.95
  delta-delta                  66              75              99.89             95.87
  delta-bar-delta              24              35              100               95.76
  SpikeProp                    51              59              99.89             95.84
without momentum
  heuristic                    81              112             98.91             95.95
  delta-delta                  83              165             99.31             94.93
  delta-bar-delta              41              59              99.95             95.63
  SpikeProp                    87              144             98.51             95.65
3.3  TE process
The TE process created by the Eastman Chemical
Co. has been widely used as a benchmark chemical
process for evaluating fault diagnosis methods [19].
The simulation data are generated by MATLAB [20].
Three faults of the TE process (faults 4, 9 and 11) are chosen for identification. For each fault the simulation dataset contains 40 observations, and each observation contains 41 measured variables. The 12 manipulated variables of the TE process are held constant. Only measured variables 9, 18 and 20 are selected as features for recognizing the three faults [21]. The SNN parameters in this experiment are the same as those for the Iris dataset, except that the maximum number of training epochs is set to 100.
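An illustrative data-preparation step for this experiment is sketched below, assuming the observations of each fault are stored as a 40 × 41 array with columns ordered as the TE measured variables; the helper names are ours.

```python
import numpy as np

SELECTED = [9, 18, 20]   # measured-variable indices (1-based), following [21]

def build_dataset(data_by_fault):
    """data_by_fault: dict mapping fault label (4, 9 or 11) -> array of shape (40, 41)."""
    features, labels = [], []
    for fault, obs in data_by_fault.items():
        obs = np.asarray(obs, dtype=float)
        features.append(obs[:, [i - 1 for i in SELECTED]])   # convert to 0-based columns
        labels.append(np.full(len(obs), fault))
    return np.vstack(features), np.concatenate(labels)
```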
Figure 3 shows the fault diagnosis accuracy averaged over 100 groups as a function of the number of
training epochs. After 21 training epochs, the training
accuracy of the delta-bar-delta method with momentum can reach 90.23%, whereas that of the SpikeProp
is 63.72%. The corresponding test accuracies of the two algorithms are 84.87% and 61.82%, respectively. Within 100 training epochs, the best training and test accuracies of the delta-bar-delta method with momentum are 95.08% and 87.42%, and those of the SpikeProp are 90.95% and 85.95%. The results therefore illustrate that the modified SpikeProp algorithms can speed up the convergence of the SNN and improve the accuracy of fault diagnosis. Furthermore, the delta-bar-delta rule with momentum has the best performance among all eight algorithms.
3.4  Poisson spike trains
In this experiment, we investigate the capability
of SNN to decode temporal information from spike
trains. Two spike train templates are produced by
Poisson processes [22] with a frequency of 100 Hz and a duration of 30 ms.
Figure 3  Comparison of fault identification accuracy of the original SpikeProp algorithm and the seven modifications versus the number of training epochs, for the training set (a) and the test set (b). Legend: SpikeProp; heuristics; delta-delta; delta-bar-delta; SpikeProp with momentum; heuristics with momentum; delta-delta with momentum; delta-bar-delta with momentum.
Figure 4  Two noisy patterns of Poisson spike trains for each class: (a) class 1; (b) class 2
Figure 5  Comparison of Poisson spike train classification accuracy of the original SpikeProp algorithm and the seven modifications versus the number of training epochs, for the training set (a) and the test set (b). Legend: SpikeProp; heuristics; delta-delta; delta-bar-delta; SpikeProp with momentum; heuristics with momentum; delta-delta with momentum; delta-bar-delta with momentum.
From these two templates, 50 noisy
patterns are created by randomly shifting each spike
by an amount drawn from a normal distribution with a
SD of 2 ms (see Fig. 4). The resulting set of 100 patterns is then split in half to produce a training set and a test set. The output is coded in the time to first spike: for one class the output neuron is designated to fire an early spike at 31 ms, and for the other a later spike at 36 ms. Every connection between spiking neurons consists of 36 synapses with delays from 1 to 36. The time step is 0.1 ms and the maximum number of training epochs is set to 100. Other parameters are the same as in the XOR experiment.
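The stimulus construction can be sketched as follows (our illustrative code): two Poisson templates at 100 Hz over 30 ms, each jittered 50 times with Gaussian noise of SD 2 ms, as described above and in [22].

```python
import numpy as np

rng = np.random.default_rng(0)

def poisson_template(rate_hz=100.0, duration_ms=30.0, dt_ms=0.1):
    """Spike times (ms) of a homogeneous Poisson process, discretised to dt_ms bins."""
    n_bins = int(duration_ms / dt_ms)
    p_spike = rate_hz * dt_ms / 1000.0            # firing probability per bin
    spikes = rng.random(n_bins) < p_spike
    return np.nonzero(spikes)[0] * dt_ms

def noisy_patterns(template, n_patterns=50, jitter_sd_ms=2.0):
    """Shift each spike of the template independently for every noisy copy."""
    return [np.sort(template + rng.normal(0.0, jitter_sd_ms, size=template.size))
            for _ in range(n_patterns)]

# Example: build the 100-pattern set (50 per class) from two templates.
templates = [poisson_template(), poisson_template()]
dataset = {cls: noisy_patterns(tmpl) for cls, tmpl in enumerate(templates, start=1)}
```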
The spike rates of the two classes are both 100 Hz, so a traditional ANN cannot be applied to this classification problem, whereas the SNN can recognize the difference in the timing of individual spikes. Fig. 5 shows the classification accuracy averaged over 100
groups as a function of the number of training epochs.
The training accuracy of the delta-bar-delta method with momentum reaches 98.46% after 20 training epochs, whereas the SpikeProp algorithm does not reach 98.0% until 54 epochs. The test accuracy of the delta-bar-delta method with momentum reaches 99.06% after 29 training epochs, whereas the SpikeProp algorithm does not reach 98.32% until 59 epochs. The delta-bar-delta method with momentum again has the best performance among all eight algorithms. However, the worst method is not SpikeProp but delta-delta. The network in this experiment has so many synapses that the initial MSE is as high as about 200 ms. The delta-delta rule therefore increases the learning rate rapidly, and once the network error drops steeply, the subsequent reduction of the learning rate is very small. Consequently the learning rate remains so large that the network performs rather badly. Fig. 6 shows how the learning rates of the three adaptation methods, with and without momentum, change during training. The learning rates of the delta-delta and delta-bar-delta methods in Fig. 6 are averaged over all weights. The figure indicates that the learning rate of delta-bar-delta is appropriately adjusted around 1, whereas the learning rate of delta-delta exceeds 5 from the 13th training epoch onward. In addition, the momentum term can smooth the oscillations resulting from the large learning rate in the delta-delta algorithm.
Figure 6  The adaptation processes of the learning rates. Legend: heuristics; delta-delta; delta-bar-delta; heuristics with momentum; delta-delta with momentum; delta-bar-delta with momentum.

REFERENCES

1  Lazzús, J.A., "Prediction of flash point temperature of organic compounds using a hybrid method of group contribution + neural network + particle swarm optimization", Chin. J. Chem. Eng., 18 (5), 817-823 (2010).
2  Zou, Z., Yu, D., Feng, W., Yu, L., Guo, N., "An intelligent neural networks system for adaptive learning and prediction of a bioreactor benchmark process", Chin. J. Chem. Eng., 16 (1), 62-66 (2008).
3  Maass, W., "Networks of spiking neurons: The third generation of neural network models", Neural Networks, 10 (9), 1659-1671 (1997).
4  Thorpe, S., Fize, D., Marlot, C., "Speed of processing in the human visual system", Nature, 381 (6582), 520-522 (1996).
5  Fang, H., Wang, Y., He, J., "Spiking neural networks for cortical neuronal spike train decoding", Neural Computation, 22 (4), 1060-1085 (2010).
6  Maass, W., "Noisy spiking neurons with temporal coding have more computational power than sigmoidal neurons", In: Advances in Neural Information Processing Systems, MIT Press, Cambridge, USA, 9, 211-217 (1997).
7  Maass, W., "Lower bounds for the computational power of networks of spiking neurons", Neural Computation, 8 (1), 1-40 (1996).
8  Bohte, S.M., Kok, J.N., La Poutré, H., "Error-backpropagation in temporally encoded networks of spiking neurons", Neurocomputing, 48, 17-37 (2002).
9  Yang, J., Yang, W., Wu, W., "A remark on the error-backpropagation learning algorithm for spiking neural networks", Applied Mathematics Letters, 25 (8), 1118-1120 (2012).
10  Schrauwen, B., van Campenhout, J., "Extending SpikeProp", In: Proceedings of the International Joint Conference on Neural Networks, IEEE, Piscataway, USA, 471-475 (2004).
11  Xin, J., Embrechts, M., "Supervised learning with spiking neural networks", In: Proceedings of the International Joint Conference on Neural Networks, IEEE, Piscataway, USA, 1772-1777 (2001).
12  McKennoch, S., Liu, D., Bushnell, L.G., "Fast modifications of the SpikeProp algorithm", In: Proceedings of the International Joint Conference on Neural Networks, IEEE, Piscataway, USA, 3970-3977 (2006).
13  Ghosh-Dastidar, S., Adeli, H., "Improved spiking neural networks for EEG classification and epilepsy and seizure detection", Integr. Comput.-Aid E., 14 (3), 187-212 (2007).
14  Delshad, E., Moallem, P., Monadjemi, S.A.H., "Spiking neural network learning algorithms: Using learning rates adaptation of gradient and momentum steps", In: 5th International Symposium on Telecommunications (IST), IEEE, 944-949 (2010).
15  Gerstner, W., Kistler, W., Spiking Neuron Models, Cambridge University Press, England (2002).
16  Vogl, T.P., Mangis, J.K., Rigler, A.K., Zink, W.T., Alkon, D.L., "Accelerating the convergence of the back-propagation method", Biol. Cybern., 59 (4), 257-263 (1988).
17  Jacobs, R.A., "Increased rates of convergence through learning rate adaptation", Neural Networks, 1, 295-307 (1988).
18  Moore, S.C., "Back-propagation in spiking neural networks", Master Thesis, University of Bath, UK (2002).
19  Downs, J.J., Vogel, E.F., "A plant-wide industrial process control problem", Comput. Chem. Eng., 17 (3), 245-255 (1993).
20  "Tennessee Eastman Problem for MATLAB", Control Systems Engineering Laboratory, Arizona State University, 1998 [2012-06-08], http://csel.asu.edu/downloads/Software/TEmatlab.zip.
21  Lu, N., Yu, X., "Fault diagnosis in TE process based on feature selection via second order mutual information", CIESC Journal, 60 (9), 2252-2258 (2009).
22  Heeger, D., "Poisson model of spike generation", New York University, 2000 [2012-06-08], http://www.cns.nyu.edu/~david/ftp/handouts/poisson.pdf.