International Journal of Modern Physics C, Vol. 0, No. 0 (1992) 000-000
© World Scientific Publishing Company
Dipartimento di Sistemi e Informatica
University of Florence
Via di Santa Marta, 3 - 50139 Firenze (Italy)
In this paper we describe the design guidelines of Daphne, a parallel simulator for supervised
recurrent neural networks trained by Backpropagation through time. The simulator has
a modular structure, based on a parallel training kernel running on the CM-2 Connection Machine. The training kernel is written in CM Fortran in order to exploit some
advantages of the slicewise execution model. The other modules are written in serial C
code. They are used for designing and testing the network, and for interfacing with the
training data. A dedicated language is available for defining the network architecture,
which allows the use of linked modules.
The implementation of the learning procedures is based on training example parallelism. This dimension of parallelism has been found to be effective for learning static
patterns using feedforward networks. We extend training example parallelism to learning sequences with fully recurrent networks. Daphne is mainly conceived for applications
in the field of Automatic Speech Recognition, though it can also serve for simulating
feedforward networks.
Keywords : Recurrent Neural Networks, Connection Machine, Training Example Parallelism, Speech Recognition.
1. Introduction
Learning time is probably the least appealing feature of neural networks trained by
Backprop-like algorithms. In these models, the optimization of connection weights
is achieved by defining a quadratic error function and using gradient descent techniques to bring the error function to a minimum. In practice, the size of the experiments which can be carried out is limited by the power of the computer being used.
For example, learning to discriminate phonetic features with a recurrent neural network (RNN) may require many days of computation using an ordinary workstation.
The situation is even worse for complex tasks, such as isolated word recognition on
large dictionaries.
At present, several packages exist for simulating neural networks on supercomputers. Some of them are public domain software, such as NeuralShell, Aspirin
and PlaNet. They run on various platforms, including Cray and workstations. Some
simulators also exist for the Connection Machine, for example the implementation
of Zhang [1] and a version of the McClelland and Rumelhart simulator [2], adapted for
the CM-2 by Thrun using the training example approach proposed by Singer [3].
This research was partially supported by MURST 40%.
2 Paolo Frasconi, Marco Gori, and Giovanni Soda
To the best of our knowledge, the only existing simulator for the Connection Machine supporting recurrent neural networks is GRAD-CM2 [4]. It is written in the
C* programming language and unfortunately it is not adequate to deal with long
sequences, because of the memory limitations under the fieldwise [5] execution model.
In this paper we present the design of a parallel simulator supporting recurrent
networks. The system is particularly conceived for applications to automatic speech
recognition, based on integration of prior knowledge and learning by example [6, 7].
The learning algorithm is based on Backpropagation through time (BPTT) [8, 9] and
is general enough to support many supervised network models.
The proposed approach for implementing the learning procedures is based
on training example parallelism (TEP). Though implementation of TEP is quite
straightforward, some special attention must be devoted to the case of recurrent
networks, in order to avoid the memory and the computation waste due to the
variability in the lengths of the input sequences. Our proposal is based on the optimization of the memory allocation of the training data. This can be accomplished
using sequence concatenation and resetting the network state at the end of each
original sequence.
A definitive performance measurement is not available, since the package is
not completely implemented yet. Preliminary tests indicate learning performance
comparable with the results obtained in [3].
The paper is organized as follows. In section 2 we briefly review some mathematical aspects of the BPTT learning algorithm. In section 3 we discuss the method
we propose for the parallel implementation. Section 4 describes the organization of
the simulator. Finally some conclusions are drawn in section 5.
2. Recurrent networks approach to sequence classification
2.1. Recurrent network formalism
The model considered in this paper is a first order Multi-Delay RNN. This neural
network architecture can be mapped onto an oriented graph $\mathcal{G} \doteq \{\mathcal{U} \cup \mathcal{N}; \mathcal{W}\}$, where
$\mathcal{U}$ is the set of input nodes, $\mathcal{N}$ is a set of sigmoidal units and $\mathcal{W}$ is a set of arcs.
The generic arc $c_{ij}^{(d)} \in \mathcal{W}$ connects the node $j \in \mathcal{U} \cup \mathcal{N}$ to the node $i \in \mathcal{N}$ with
delay $d$. The corresponding connection weight is denoted $w_{ij}^{(d)}$. The delay $d$ can be
any non-negative integer. In particular it can be zero; in that case the connection
is said to be static. Denote with $\mathcal{G}^{(d)}$ the (possibly not connected) graph obtained
by removing from $\mathcal{G}$ all the arcs $c_{ij}^{(r)}$ with delay $r \neq d$. The only topological constraint
we assume is the following hypothesis: the sub-graph $\mathcal{G}^{(0)}$ has no cycles.
Daphne: Data Parallelism Neural Network Simulator 3
The network copes with temporal patterns. Discrete time is assumed. Each
training example $L_p$, $p = 1 \ldots P$, is composed of a sequence of input vectors
$$u_p(t) \in \mathbb{R}^m, \quad t = 1 \ldots T_p \qquad (1)$$
and a sequence of target vectors
$$x_{0p}(t) \in \mathbb{R}^n, \quad t = 1 \ldots T_p. \qquad (2)$$
Each element of the input sequence is also called a frame. The training set is defined as
$$\mathcal{L} \doteq \{L_p,\ p = 1 \ldots P\}. \qquad (3)$$
Special reset markers can be inserted in the input sequence. A reset marker
$r_p(t)$ is a boolean value indicating that the state of the network has to be reset to
the initial condition. Such initial condition is assumed to be zero in this paper.
Reset markers can be used to break the input sequence into a set of independent
input sub-sequences. For example we can regard the training set as a collection of
few sequences, putting reset markers on the first frame of each training example.
The computation of the Multi-Delay RNN is based on the following equations:
$$a_{pi}(t) = \sum_{d} \left(1 - r_p(t-d)\right) \left[ \sum_{j \in S_i^{(d)}(\mathcal{N})} w_{ij}^{(d)} x_{pj}(t-d) + \sum_{j \in S_i^{(d)}(\mathcal{U})} w_{ij}^{(d)} u_{pj}(t-d) \right]$$
$$x_{pi}(t) = f(a_{pi}(t)) = \tanh\left(a_{pi}(t)\right) \qquad (4)$$
$$i \in \mathcal{N}, \quad p = 1 \ldots P, \quad t = 1 \ldots T_p$$
where $a_{pi}(t)$ is the activation of the $i$-th neuron at time $t$, $x_{pi}(t)$ is its squashed output,
and $u_{pj}(t)$ is the $j$-th external input. $S_i^{(d)}(\mathcal{N})$ is the set of network senders to neuron
$i$ with delay $d$; $S_i^{(d)}(\mathcal{U})$ is the set of external senders to neuron $i$ with delay $d$. The
computation in eq. (4) is referred to as the forward pass.
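As an illustration, the forward pass can be sketched in a few lines of Python. This is a minimal serial sketch under stated assumptions, not the CM Fortran kernel: the function name `forward`, the dictionary layout of the weights, the 0-based frame index and the requirement that static arcs between units go from lower to higher unit index are all choices made here for the example. It also assumes that the factor (1 - r_p(t - d)) gates every contribution read from frame t - d.

```python
import math

def forward(weights, inputs, resets, n_units):
    """Forward pass of a multi-delay RNN over one sequence (illustrative only).

    weights: dict mapping (i, src, d) -> w, where i is the receiving unit,
             src is ("x", j) for unit j or ("u", j) for external input j,
             and d >= 0 is the delay. Static (d == 0) unit-to-unit arcs are
             assumed to go from lower to higher unit index (no cycles).
    inputs:  list of frames, each a list of external input values.
    resets:  list of 0/1 reset markers, one per frame.
    Returns x, with x[t][i] the output of unit i at frame t.
    """
    n_frames = len(inputs)
    x = [[0.0] * n_units for _ in range(n_frames)]
    for t in range(n_frames):
        for i in range(n_units):            # ascending order resolves d == 0 arcs
            a = 0.0
            for (dst, (kind, j), d), w in weights.items():
                if dst != i or t - d < 0:   # signals before the first frame are zero
                    continue
                if resets[t - d]:           # (1 - r(t - d)) == 0: contribution dropped
                    continue
                src = x[t - d][j] if kind == "x" else inputs[t - d][j]
                a += w * src
            x[t][i] = math.tanh(a)
    return x
```

For instance, with one unit, a static input weight and a delay-1 self connection, placing a marker on the second frame removes the contribution of that frame from the third one, so the two parts of the sequence no longer interact through the delayed arc.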
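In this framework we assume supervised learning. We denote with $e_{pi}(t) \doteq \sigma_{pi}(t)\left(x_{pi}(t) - x_{0pi}(t)\right)$
the error on unit $i \in \mathcal{N}$ with respect to the target at time
$t$ for the $p$-th sequence; $\sigma_{pi}(t)$ is a supervision weight. The cost function is then
defined as:
$$C \doteq \sum_{p=1}^{P} C_p \qquad (5)$$
$$C_p = \frac{1}{2} \sum_{t=1}^{T_p} \sum_{i \in \mathcal{N}} e_{pi}^2(t). \qquad (6)$$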
The goal of learning is to find a set of connection weights which minimizes
the cost (5). Such a solution can be obtained by using standard gradient descent.
We describe hereafter a method for gradient computation, which is based on error
back-propagation through time [8, 9]. Define:
$$y_{pi}(t) \doteq \frac{\partial C_p}{\partial a_{pi}(t)}; \qquad (7)$$
then we can rewrite the gradient as
$$\frac{\partial C_p}{\partial w_{ij}^{(d)}} = \sum_{t=1}^{T_p} y_{pi}(t)\, x_{pj}(t-d) \qquad (8)$$
with the convention $x_{pj}(\tau) = 0$ for $\tau \leq 0$. The terms $y_{pi}(t)$ are computed as:
$$y_{pi}(t) = f'(a_{pi}(t)) \left[ e_{pi}(t) + \sum_{d} \left(1 - r_p(t+d)\right) \sum_{k \in R_i^{(d)}} w_{ki}^{(d)} y_{pk}(t+d) \right] \qquad (9)$$
with the convention $y_{pk}(\tau) = 0$ for $\tau > T_p$. $R_i^{(d)}$ denotes the set of receivers
from neuron $i$ with delay $d$. The computation in eq. (9) is referred to as the backward
pass. According to eq. (8), since the computation of the partial derivatives $y_{pi}(t)$
cannot start before the complete sequence has been processed, all the past network
activities must be stored in memory.
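The backward pass and the gradient accumulation can be illustrated with the following Python sketch. It is restricted, by assumption, to a fully connected network with delay-1 recurrent weights `W`, static input weights `V`, unit supervision weights and no reset markers; the function names are hypothetical and the code is serial. It shows why all the activities `x[t]` must be kept until the backward sweep starts.

```python
import math

def forward(W, V, u):
    """Outputs x[t][i] for a(t) = W x(t-1) + V u(t), x = tanh(a), zero initial state."""
    n = len(W)
    x, prev = [], [0.0] * n
    for frame in u:
        a = [sum(W[i][j] * prev[j] for j in range(n)) +
             sum(V[i][j] * frame[j] for j in range(len(frame)))
             for i in range(n)]
        prev = [math.tanh(ai) for ai in a]
        x.append(prev)
    return x

def cost(W, V, u, target):
    """Quadratic cost: half the sum of squared errors over frames and units."""
    x = forward(W, V, u)
    return 0.5 * sum((xi - ti) ** 2
                     for xt, tt in zip(x, target) for xi, ti in zip(xt, tt))

def bptt_gradient(W, V, u, target):
    """Return (dC/dW, dC/dV) using one backward sweep over the stored activities."""
    n, m, n_frames = len(W), len(u[0]), len(u)
    x = forward(W, V, u)                    # every past activity is kept in memory
    gW = [[0.0] * n for _ in range(n)]
    gV = [[0.0] * m for _ in range(n)]
    y_next = [0.0] * n                      # y is zero beyond the last frame
    for t in reversed(range(n_frames)):
        y = []
        for i in range(n):
            back = sum(W[k][i] * y_next[k] for k in range(n))
            error = x[t][i] - target[t][i]
            y.append((1.0 - x[t][i] ** 2) * (error + back))   # tanh'(a) = 1 - x^2
        for i in range(n):
            if t > 0:                       # x before the first frame is zero
                for j in range(n):
                    gW[i][j] += y[i] * x[t - 1][j]
            for j in range(m):
                gV[i][j] += y[i] * u[t][j]
        y_next = y
    return gW, gV
```

A simple sanity check for code of this kind is to compare one entry of the returned gradient with a central finite difference of `cost` with respect to the same weight.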
Besides fully recurrent networks, the framework proposed above includes other
models of supervised networks, such as local-feedback MLN [10] and TDNN [11]. As a
special case, when all sub-graphs $\mathcal{G}^{(d)}$ are empty for $d > 0$, we obtain feedforward networks.
2.2. Insertion of rule-based prior knowledge
Learning by example has a primary role in connectionist models. Nevertheless,
many cases exist in which some form of prior knowledge is available. Taking advantage of such knowledge can relieve learning from discovering complex rules [7]. For
sequence classification problems, it is convenient to represent the prior knowledge
by means of automaton rules [6, 7]. Such rules can be injected into the connections of
a recurrent architecture [12]. It can be proved that the resulting network carries
out the automaton computation, provided that the weight values satisfy an appropriate set of linear constraints. The resulting network can therefore be used in
a modular system, which also includes randomly initialized sub-networks, used to
learn the unknown rules.
Fig. 1 provides an example of this kind of network in a problem of isolated
word recognition. Each word is modeled by an RNN composed of two sub-networks.
The left hand sub-network (Mano_K) is generated using lexical knowledge, whereas
the right hand one (Mano_L) is devoted only to learning by examples. The system
classifies the incoming word by selecting the model having the highest value for the
output neuron activation. Each network is trained using positive and negative
examples. The injected prior knowledge makes it possible to train each model using
only a subset of the dictionary, selecting just the words more likely to be confused.
This approach has proven to be effective in preliminary experiments of automatic speech recognition [6, 7]. One of the primary aims of Daphne is to allow
massive experimentation with this method.
Fig. 1. Neural network model for the Italian word "mano" (surviving label: External input).
3. The training kernel
The kernel of Daphne is the parallel program, running on the Connection Machine,
which carries out the learning procedures. The input to the kernel is composed of:
- the training data, i.e. the input sequences and the supervision information;
- a network description, including the topology, an initial set of weights and a
  set of constraints;
- a set of parameters controlling the learning procedure.
The result of the computation is an updated network description with learned
weights. The kernel has a low level interface with the network description and the
training data, basically composed of files and pipes, whose formats are suitable for
internal use but unreadable to humans. A set of tools, designed to provide a high
level interface, is described in the next section.
3.1. Training example parallelism for feedforward nets
Many possible dimensions of parallelism exist for implementing neural networks on
massively parallel machines [13]. Training example parallelism
(TEP) is the most effective dimension for achieving high efficiency when using such
machines. We introduce the basic concepts of TEP by first considering the case
of feedforward networks (i.e. $d = 0$ and $T_p = 1$ in each equation of the previous
section). The basic steps of TEP are the following:
1. Store each training example into the memory of a different processor element
(PE). This includes both the external input and the target vector.
2. For each PE, allocate the memory to hold the network activities and the partial
derivatives (7).
3. Allocate the connection weights on the front-end.
4. Compute in parallel the forward pass on each training example (eq. (4)).
5. Similarly, compute in parallel the backward pass (eq. (9)).
6. In the above two steps, the values of the connection weights are broadcast to
each PE.
7. For each connection weight, compute in parallel the partial gradient relative
to each training pattern (see eq. (8)). To obtain the complete gradient, sum
the partial terms using a scan prefix operation [5] (the sum is carried out in
logarithmic time).
8. Update the weights on the front-end.
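The steps above can be sketched serially in Python, with each list element playing the role of one processor element. This is only an illustration under simplifying assumptions: a single tanh unit without hidden layers, a plain gradient step with a fixed `learning_rate`, and a pairwise `tree_sum` standing in for the scan operation of the Connection Machine. All function names are hypothetical.

```python
import math

def example_gradient(w, u, target):
    """Steps 4, 5 and 7 on one PE: forward pass, backward pass, partial gradient."""
    a = sum(wj * uj for wj, uj in zip(w, u))
    x = math.tanh(a)
    y = (1.0 - x * x) * (x - target)        # dC_p/da for C_p = 0.5 * (x - target)^2
    return [y * uj for uj in u]

def tree_sum(vectors):
    """Pairwise reduction: the number of rounds grows as log2 of the number of PEs."""
    vectors = list(vectors)
    while len(vectors) > 1:
        paired = [[a + b for a, b in zip(vectors[k], vectors[k + 1])]
                  for k in range(0, len(vectors) - 1, 2)]
        if len(vectors) % 2:                # odd element is carried to the next round
            paired.append(vectors[-1])
        vectors = paired
    return vectors[0]

def tep_epoch(w, examples, learning_rate):
    """One batch epoch: the same weights are used by every PE (step 6)."""
    partial = [example_gradient(w, u, t) for u, t in examples]  # conceptually parallel
    gradient = tree_sum(partial)                                # step 7
    return [wj - learning_rate * gj for wj, gj in zip(w, gradient)]  # step 8
```

Since the weights change only after the reduction, the update is a batch update by construction, which is the property discussed in the following subsection.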
This basic formulation is sufficient to discuss some general properties of TEP. A first
advantage of the method is the low amount of interprocessor communication. Except
for the initial loading of the training data, the only communication occurs when
the partial components of the gradient are collected, and this communication is
relatively fast on the Connection Machine [5].
In this context we are interested in the time required for one iteration of the Backpropagation algorithm. Such an iteration, also referred to as an epoch, includes the forward and the backward computations over the training set, as well as the weight
update. A commonly accepted index of performance is MCUPS (Mega Connection Updates Per Second), defined as:
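$$\text{MCUPS} \doteq \frac{\text{Number of connections} \times \text{Number of training patterns}}{\text{Epoch execution time in } \mu\text{s}}.$$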
Another common measure is MCPS (Mega Connections Per Second), which takes
into account only the forward pass computation:
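$$\text{MCPS} \doteq \frac{\text{Number of connections} \times \text{Number of training patterns}}{\text{Forward pass execution time in } \mu\text{s}}.$$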
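As a small worked example, both indexes can be computed with the helper below. It assumes the usual reading of these measures, namely connections times training patterns divided by the elapsed time, with the time given here in seconds and the result scaled to millions; the function name and its parameters are hypothetical.

```python
def mega_connections_per_second(n_connections, n_patterns, seconds):
    """Connections processed per second, in millions.

    Pass the epoch time to obtain MCUPS, or the forward pass time to obtain MCPS.
    """
    return n_connections * n_patterns / (seconds * 1e6)
```

For example, a network with 1000 connections trained on 2000 patterns with an epoch time of one second would score 2 MCUPS under this definition.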
Typically, the ratio between the two performance indexes is about 3 to 5. The following table gives an idea of the differences in the achieved performance when using
different dimensions of parallelism. These differences are essentially due to the
amount of communication required by the implementation.
Table 1. Some CM-2 implementations of Backpropagation
Rosenberg, Blelloch [14]
Node+data, Datavault swap
175 MCPS
31.0 .3 GByte training data
1300 MCPS
Another advantage of TEP is the independence of the performance with respect to
the network topology. For example, the performance of node or weight parallelism
tends to get worse if a sparse network is used. Finally, TEP is easy to program and
general. This makes it possible to exploit different neural network models (e.g. second order
networks) without having to modify the mapping of data on the machine.
3.2. Performance and degree of parallelism
In the previous subsection we showed that TEP is very well suited to achieve high
MCUPS performance. Unfortunately there are two main reasons which may limit
the usefulness of the method in practical cases. The first one is that stochastic
gradient descent [8] (i.e. updating the weights with the partial gradient $\partial C_p / \partial w$) is not
possible. Instead, the true gradient must be used, i.e. the weights must be updated
in batch mode. In many cases this turns out to be a disadvantage because batch
updating can lead to slower convergence of learning. This is not true in general but,
for highly redundant training sets, it could even cancel the advantages
of parallelism. Another potential problem of the implementation is that each PE
must store all the network activities. This limits the maximum allowed network size.
A possible solution to these problems can be given by changing the degree of
parallelism, i.e. the balance between the number of processors and the computational power of each processor. From this point of view, the CM-2 Connection
Machine can be seen as a massively parallel computer, if the elementary processors are used as PEs, or as a highly parallel computer if the floating point units are
used instead. The two models of execution are respectively known as fieldwise and
slicewise [5]. For the slicewise model, assuming 64 bit floating point hardware, the
amount of memory of each PE is eight times the memory of an elementary PE
under fieldwise (see footnote a). Also, the minimum allowed grid size is eight times smaller than the
fieldwise one. The use of slicewise can therefore be a solution for the second problem of TEP.
A solution to the first problem can also be given. Suppose $P$ is an integer
multiple of the minimum allowed grid size $M$ (which depends on the machine size),
i.e. $P = KM$. Then divide the original training set $\mathcal{L}$ into smaller subsets $\mathcal{L}_k$, $k = 1 \ldots K$,
and allocate to the $m$-th PE the examples $L_p$ such that $m = (p - 1) \bmod M$.
In so doing we can compute the partial gradient relative to each subset $k = 1 \ldots K$
and update the weights using that gradient vector. This corresponds to a semi-batch updating, which can exploit the training set redundancy and provide faster convergence.
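The index arithmetic of this semi-batch layout can be sketched as follows. The text fixes only the PE index, m = (p - 1) mod M; the sketch additionally assumes that subset k collects the k-th block of M consecutive examples, so that every PE holds exactly one example of each subset. The function name is hypothetical.

```python
def semi_batch_layout(P, M):
    """Return subsets, where subsets[k][m] is the 1-based example index p
    held by PE m (0-based) in subset k (0-based)."""
    if P % M != 0:
        raise ValueError("P must be an integer multiple of M")
    K = P // M
    return [[k * M + m + 1 for m in range(M)] for k in range(K)]
```

With P = 6 and M = 3 this gives the subsets {1, 2, 3} and {4, 5, 6}, and the weights would be updated twice per epoch, once after each subset.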
3.3. Extension to recurrent nets
For the case of feedforward nets the use of a coarser degree of parallelism can help to
reduce some problems of TEP. For recurrent nets however, we believe that the use
of the slicewise model is the only practical way to implement a general simulator.
This is essentially due to the heavy memory requirements of the learning algorithm
when applied to long sequences. For example, in problems such as isolated word
recognition, a maximum sequence length of 100 or more is quite common. Therefore,
only small networks could be simulated using the fieldwise model.
Footnote a: Actually there is a floating point unit for each set of 32 processors; each unit is seen as a vector
of size 4.
A remarkable difference with respect to the case of feedforward networks is
that the training examples have variable length. A straightforward extension of the
TEP approach would be the following. Allocate a training sequence to each PE,
then compute the forward and backward pass in parallel, disabling the processors
containing a sequence $p$ whose length $T_p$ is less than the currently processed frame
$t$. It is easy to recognize how this approach is wasteful, both in memory and in
computational resources. This is particularly true when the sequence length has a
large variance.
A more efficient extension of TEP is used in the parallel kernel of Daphne.
We recall that by introducing reset points we can concatenate two or more training
examples to build a single sequence. In so doing we can reduce the memory waste,
trying to build a set of sequences having similar length. From a formal point of
view we can state the following optimization problem.
Fig. 2. Allocation of data on the Connection Machine (surviving labels: wasted memory; parallel dimension: training sequences; serial dimension: neurons).
Let $\mathcal{L} = \{L_1, \ldots, L_P\}$ be an initial training set from which to select the subset
of sequences which will actually be used for learning. Let $M$ be the minimum
allowed grid size (function of the size of the machine). Let $N$ be the maximum
number of frames which can be allocated to each processor element (function of the
memory installed and of the network size). Basically, given a sequence $L_p$ of length
$T_p$, we want to decide whether or not to allocate it into the memory of the $m$-th
PE. To this purpose we define the integer variables $x_{m,p}$, $m = 1 \ldots M$, $p = 1 \ldots P$
as follows:
$$x_{m,p} = \begin{cases} 0 & \text{if } L_p \text{ is not allocated to the } m\text{-th PE} \\ 1 & \text{if } L_p \text{ is allocated to the } m\text{-th PE} \end{cases}$$
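Then the best allocation of data can be obtained by solving the integer program [16, 17]:
$$\max \sum_{m=1}^{M} \sum_{p=1}^{P} x_{m,p} T_p$$
subject to:
$$\begin{cases} \sum_{p=1}^{P} x_{m,p} T_p \leq N & \forall\, m = 1 \ldots M \\ \sum_{m=1}^{M} x_{m,p} \leq 1 & \forall\, p = 1 \ldots P \\ x_{m,p} \in \{0, 1\} \end{cases}$$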
In fig. 2 we illustrate this concept. Fig. 2 also shows how the network activation
is allocated on the Connection Machine. Basically, a three-dimensional grid is
used. The dimension relative to the training sequences is parallel. The two other
dimensions, associated with time and neuron index, are serial.
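To make the allocation problem concrete, the following Python sketch packs sequences into M processor elements of N frames each. It is a greedy first-fit-decreasing heuristic chosen here for brevity, not an exact solver of the integer program and not necessarily the method used by the data generator; the function name and the return format are hypothetical.

```python
def pack_sequences(lengths, M, N):
    """Greedy packing of sequences into M PEs holding at most N frames each.

    lengths: list of sequence lengths T_p.
    Returns (assignment, used): assignment[p] is the 0-based PE index holding
    sequence p, or None if the sequence was left out; used[m] is the number
    of frames stored on PE m.
    """
    assignment = [None] * len(lengths)
    used = [0] * M
    # Longest sequences first: they are the hardest to place.
    order = sorted(range(len(lengths)), key=lambda p: lengths[p], reverse=True)
    for p in order:
        for m in range(M):
            if used[m] + lengths[p] <= N:   # capacity constraint of PE m
                assignment[p] = m           # each sequence goes to at most one PE
                used[m] += lengths[p]
                break
    return assignment, used
```

The wasted memory of a layout is then `M * N - sum(used)`; an exact solution of the integer program can only match or improve on the value found by this heuristic.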
During the computation of the forward pass, the presence of reset points must
be detected. According to eq. (4), where an active reset point is found, no multiplication needs to be carried out. Therefore, it is sufficient to disable the context of the
processors having an active reset point. The operation introduces a small overhead.
Similarly, during the backward pass, reset points must be checked in order to avoid
propagating the error between independent sequences.
Fig. 3. Block diagram of Daphne (surviving labels: Training data; Learning report; Network file; Network language and linker; Network file (symbolic test); High level; priori rules).
4. Organization of the simulator
In this section we briefly describe the tools available with the simulator for interfacing the training kernel with the external data and with the network description.
Figure 3 sketches the overall interaction among the modules.
4.1. Network language
The network description comprises the following pieces of information:
- network topology, i.e. a representation of the graph $\mathcal{G}$;
- initial values for the connection weights;
- constraints on weights.
The network architecture is described by a special purpose high level language.
The language is particularly conceived to deal with architectures composed of sub-network modules. Each module is compiled into an object module, in which unit
indexes are relocatable. In this way, a set of object modules can be linked together
to produce a so-called "runnable net", which is actually used in the learning process.
The syntax of the network language resembles the syntax of C. The language has three
data types: unit (used to declare external inputs and network neurons), weight
and float. Arrays of units can be declared with the same syntax used in C. A right
index is used to refer to an array element. A left index specifies a time delay.
#define #Istar 3.214
network Mano_K(unit NasVow[7],
unit M, unit A, unit N, unit O,
float a, float b, float c)
weights wm, wa, wn, wo, sm, sa, sn, so;
wn :- connect({ Bias, NasVow[1] },
wu :- connect({ Bias, M, NasVow[2]
wm :- connect({ M, A, NasVow[0] },
wa :- connect({ A, N, NasVow[5] },
sm :- connect([1]M, M);
sm := 5.342;
wn * {a, b} > Istar;
}, A);
main network Mano(input unit u[7], output unit x)
unit NFA[4]; /* Automaton state encoding neurons */
unit Learnable[2]; /* Learnable sub-net hidden units */
Mano_K(u, NFA, 1.32, -3.132, 3.43);
Mano_L(u, Learnable);
connect( {NFA, Learnable}, x);
Fig. 4. Example of the network language
Sub-networks are declared with a syntax which is very close to the syntax of a C
function. A sub-network declaration is an intentional declaration. It just produces
a description of connections among units. Formal arguments can be specified in the
header of the declaration. They can be of any data type. Units declared
as formal parameters for a sub-network are externally accessible. Other units, which
are hidden for that sub-network, can be declared inside the body.
A runnable network is built by linking sub-network modules. A special type
of module (identified by the keyword main) describes the global architecture. Sub-networks can be referenced inside the body of the main network module or the sub-network modules. When a sub-network is referenced, new neurons and weights are
instantiated, according to the intentional declaration contained in the sub-network declaration.
The basic statement to declare connections is connect. It takes as arguments
two lists of units, and produces a set of connections from each unit in the first
list towards each unit in the second list. The return value of connect is a weights
descriptor, which can be assigned to a weight variable. Such a variable can be subsequently used to assign initial values or to impose constraints on the corresponding
weights. For more details of the language see [18]. In fig. 4 an example of code is
reported, relative to the declaration of the network of fig. 1.
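The effect of connect, as described above, can be mimicked in Python to clarify which arcs are generated. This is only an analogy, not part of the network language: the function below, its `delay` parameter and the tuple representation of an arc are assumptions made for the example, and the names of the units are taken as plain strings.

```python
from itertools import product

def connect(senders, receivers, delay=0):
    """Return one (sender, receiver, delay) arc for every sender/receiver pair."""
    return [(s, r, delay) for s, r in product(senders, receivers)]
```

For example, `connect(["Bias", "M"], ["A"])` yields two static arcs, one from `Bias` to `A` and one from `M` to `A`, so a list of n senders and a list of k receivers produce n times k connections.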
4.2. Other modules
The automata compiler module serves to inject rule-based prior knowledge
into the connection weights. The program takes as input a set of automata rules
and produces a source file in network language. The rule injection algorithm is
described in detail in [12]. Basically, a recurrent architecture is built and an initial
value is provided for the connection weights. A set of linear constraints on weights is
also generated by the compiler. The resulting network architecture can be integrated
with other modules written by the designer. Also, the designer can modify the
generated source code, for example to add connections or neurons.
The data generator interfaces the kernel with the training set. Its main
task is to compute the optimal allocation of data, according to the method
of subsection 3.3. In this implementation there is no datavault support; the data
generator runs on the front-end and feeds the simulator using a UNIX pipe. There is
no predefined data format: the routines which access the training data
must be rewritten when the data format changes. This operation is trivial, increases
the flexibility during the experiments, and avoids duplicating large databases only
to conform to a predefined data format.
Finally a serial test program is provided, which carries out only the forward
pass computation. Its primary purpose is to inspect the behavior of the
network after learning has occurred. The program can be conveniently run on a
local workstation and features graphic animation and symbolic access to the network
neurons and weights.
5. Conclusions
In this paper we have described the basic ideas for the implementation of a parallel
simulator for recurrent neural networks on the Connection Machine. The simulator
uses data parallelism combined with optimal allocation of the training sequences
into the processors' memory. The use of the slicewise model makes it possible to deal with
reasonably long sequences. This is an essential feature for some tasks in automatic
speech recognition. Finally, a dedicated language allows modular design of the neural system, with particular emphasis on the injection of rule-based prior knowledge.
References
1. X. Zhang et al., "An Efficient Implementation of the Backpropagation Algorithm on
the Connection Machine CM-2," in Neural Information Processing Systems 2, Denver,
CO, 1989, pp. 801-809.
2. D.E. Rumelhart and J.L. McClelland, "Exploration in Parallel Distributed Processing", Vol. 3, MIT Press, 1988.
3. A. Singer, "Implementations of Artificial Neural Networks on the Connection Machine," Parallel Computing, 14, 1990, pp. 305-315.
4. T. Fontaine, "GRAD-CM2: A Data-parallel Connectionist Network Simulator", MS-CIS-92-55/LINC LAB 232, University of Pennsylvania, July 1992.
5. CM Fortran Programming Guide, Thinking Machines Corporation, Cambridge, MA,
Version 1.1, January 1991.
6. P. Frasconi, M. Gori, M. Maggini, and G. Soda, "An Unified Approach for Integrating
Explicit Knowledge and Learning by Example in Recurrent Networks", Proceedings
of IEEE-IJCNN91, Seattle, I 811-816, 1991.
7. P. Frasconi, M. Gori, M. Maggini, and G. Soda, "Unified Integration of Explicit Rules
and Learning by Example in Recurrent Networks," IEEE Trans. on Knowledge and
Data Engineering, in press.
8. D.E. Rumelhart, G.E. Hinton, and R.J. Williams, "Learning Internal Representations by Error Propagation," in Parallel Distributed Processing. Exploration in the
Microstructure of Cognition. Vol. 1: Foundations, MIT Press, 1986.
9. R.L. Watrous, "Speech Recognition Using Connectionist Networks," Ph.D. Thesis,
University of Pennsylvania, Philadelphia, November 1988.
10. P. Frasconi, M. Gori, and G. Soda, "Local Feedback Multi-Layered Networks," Neural
Computation 4-(1), 1991, pp. 120-130.
11. A. Waibel, T. Hanazawa, G. Hinton, K. Shikano, and K. Lang, "Phoneme Recognition
Using Time-Delay Neural Networks," IEEE Transactions on ASSP, 37-(3), 1989, pp.
12. P. Frasconi, M. Gori, and G. Soda, "Injecting Nondeterministic Finite State Automata into Recurrent Neural Networks," Technical Report DSI-15/92, University of
Florence, 1992.
13. T. Nordström and B. Svensson, "Using and Designing Massively Parallel Computers
for Artificial Neural Networks," TULEA 1991:13, Luleå Univ. of Technology.
14. C.R. Rosenberg and G. Blelloch, "An Implementation of Network Learning on the
Connection Machine," in D. Waltz and J. Feldman, eds., Connectionist Models and
their Implications, Norwood, NJ: Ablex Pub. Corp., 1988.
15. C. Diegert, "Out-of-core Backpropagation," in Proceedings of IEEE-IJCNN90, San
Diego, II 97-103, 1990.
16. C.H. Papadimitriou, Combinatorial Optimization: Algorithms and Complexity, Englewood Cliffs, NJ: Prentice-Hall, 1982.
17. H.M. Weingartner and D.N. Ness, "Methods for the Solution of the Multi-Dimensional
0/1 Knapsack Problem," Operations Research, 15, pp. 83-103, 1967.
18. G. Bellesi, P. Frasconi, M. Gori, and G. Soda, "A Compiler for a Modular Neural
Network Language," Technical Report DSI-16/90, University of Florence, 1992.