International Journal of Modern Physics C, Vol. 0, No. 0 (1992) 000–000
© World Scientific Publishing Company

DAPHNE: DATA PARALLELISM NEURAL NETWORK SIMULATOR

PAOLO FRASCONI, MARCO GORI, and GIOVANNI SODA
Dipartimento di Sistemi e Informatica, University of Florence
Via di Santa Marta, 3 - 50139 Firenze (Italy)

In this paper we describe the design guidelines of Daphne, a parallel simulator for supervised recurrent neural networks trained by Backpropagation through time. The simulator has a modular structure, based on a parallel training kernel running on the CM-2 Connection Machine. The training kernel is written in CM Fortran in order to exploit some advantages of the slicewise execution model. The other modules are written in serial C code. They are used for designing and testing the network, and for interfacing with the training data. A dedicated language is available for defining the network architecture, which allows the use of linked modules. The implementation of the learning procedures is based on training example parallelism. This dimension of parallelism has been found to be effective for learning static patterns using feedforward networks. We extend training example parallelism to learning sequences with fully recurrent networks. Daphne is mainly conceived for applications in the field of Automatic Speech Recognition, though it can also serve for simulating feedforward networks.

Keywords: Recurrent Neural Networks, Connection Machine, Training Example Parallelism, Speech Recognition.

1. Introduction

Learning time is probably the least appealing feature of neural networks trained by Backprop-like algorithms. In these models, the optimization of connection weights is achieved by defining a quadratic error function and using gradient descent techniques to bring the error function to a minimum. In practice, the size of the experiments which can be carried out is limited by the power of the computer being used.
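As a minimal illustration of this scheme (a generic one-weight toy fit, with invented data and learning rate, not Daphne's CM Fortran kernel), gradient descent on a quadratic error function can be sketched as:

```python
# Toy problem: minimize C(w) = 1/2 * sum_i (w*x_i - t_i)^2 over a single weight w.
xs = [0.5, 1.0, 1.5, 2.0]          # inputs
ts = [1.0, 2.0, 3.0, 4.0]          # targets, generated with w_true = 2.0

def cost(w):
    """Quadratic error function C(w)."""
    return 0.5 * sum((w * x - t) ** 2 for x, t in zip(xs, ts))

def grad(w):
    """dC/dw, the gradient of the quadratic error."""
    return sum((w * x - t) * x for x, t in zip(xs, ts))

w, eta = 0.0, 0.1                  # initial weight and learning rate (arbitrary)
for _ in range(100):
    w -= eta * grad(w)             # one gradient-descent step per iteration
```

Each iteration moves w against the gradient, so for a sufficiently small learning rate the error shrinks monotonically and w converges to the generating value 2.0.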
For example, learning to discriminate phonetic features with a recurrent neural network (RNN) may require many days of computation on an ordinary workstation. The situation is even worse for complex tasks, such as isolated word recognition on large dictionaries. At present, several packages exist for simulating neural networks on supercomputers. Some of them are public domain software, such as NeuralShell, Aspirin and PlaNet. They run on various platforms, including Cray machines and workstations. Some simulators also exist for the Connection Machine: for example, the implementation of Zhang [1] and a version of the McClelland and Rumelhart simulator [2], adapted for the CM-2 by Thrun using the training example approach proposed by Singer [3]. To the best of our knowledge, the only existing simulator for the Connection Machine supporting recurrent neural networks is GRAD-CM2 [4]. It is written in the C* programming language and, unfortunately, it is not adequate to deal with long sequences, because of the memory limitations under the fieldwise [5] execution model.

In this paper we present the design of a parallel simulator supporting recurrent networks. The system is particularly conceived for applications to automatic speech recognition, based on the integration of prior knowledge and learning by example [6,7]. The learning algorithm is based on Backpropagation through time (BPTT) [8,9] and is general enough to support many supervised network models. The proposed approach for implementing the learning procedures is based on training example parallelism (TEP). Though the implementation of TEP is quite straightforward, some special attention must be devoted to the case of recurrent networks, in order to avoid the memory and computation waste due to the variability in the lengths of the input sequences. Our proposal is based on the optimization of the memory allocation of the training data.

This research was partially supported by MURST 40%.
This can be accomplished using sequence concatenation and resetting the network state at the end of each original sequence. A definitive performance measurement is not available, since the package is not completely implemented yet. Preliminary tests indicate learning performances comparable with the results obtained in [3].

The paper is organized as follows. In section 2 we briefly review some mathematical aspects of the BPTT learning algorithm. In section 3 we discuss the method we propose for the parallel implementation. Section 4 describes the organization of the simulator. Finally, some conclusions are drawn in section 5.

2. Recurrent network approach to sequence classification

2.1. Recurrent network formalism

The model considered in this paper is a first-order Multi-Delay RNN. This neural network architecture can be mapped onto an oriented graph $\mathcal{G} := \{\mathcal{U} \cup \mathcal{N}, \mathcal{W}\}$, where $\mathcal{U}$ is the set of input nodes, $\mathcal{N}$ is a set of sigmoidal units and $\mathcal{W}$ is a set of arcs. The generic arc $c_{ij}^{(d)} \in \mathcal{W}$ connects the node $j \in \mathcal{U} \cup \mathcal{N}$ to the node $i \in \mathcal{N}$ with delay $d$. The corresponding connection weight is denoted $w_{ij}^{(d)}$. The delay $d$ can be any non-negative integer. In particular it can be zero; in that case the connection is said to be static. Denote with $\mathcal{G}^{(d)}$ the (possibly not connected) graph obtained by removing from $\mathcal{G}$ all the arcs $c_{ij}^{(r)}$ with delay $r \neq d$. The only topological constraint we assume is the following hypothesis: the sub-graph $\mathcal{G}^{(0)}$ has no cycles.

The network copes with temporal patterns. Discrete time is assumed. Each training example $L_p$, $p = 1 \ldots P$, is composed of a sequence of input vectors

$$u_p(t) \in \mathbb{R}^m, \quad t = 1 \ldots T_p \quad (1)$$

and a sequence of target vectors

$$x'_p(t) \in \mathbb{R}^n, \quad t = 1 \ldots T_p. \quad (2)$$

Each element of the input sequence is also called a frame. The training set is defined as

$$L := \{L_p,\ p = 1 \ldots P\}. \quad (3)$$

Special reset markers can be inserted during the input sequence.
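For illustration only, the training-set structures of eqs. (1)-(3), extended with per-frame reset markers, could be held as plain arrays; the field names and values below are invented for the example and do not reflect the simulator's internal format:

```python
# Two original sequences (m = 2 inputs per frame, n = 1 target per frame).
seq_a = {"frames": [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],
         "targets": [[1.0], [1.0], [1.0]]}
seq_b = {"frames": [[0.7, 0.8], [0.9, 1.0]],
         "targets": [[0.0], [0.0]]}

def concatenate(sequences):
    """Concatenate sequences into one stream, raising a reset flag on the
    first frame of each original sequence, so the network state can be
    zeroed at every boundary between independent sub-sequences."""
    frames, targets, resets = [], [], []
    for seq in sequences:
        for t, (u, x) in enumerate(zip(seq["frames"], seq["targets"])):
            frames.append(u)
            targets.append(x)
            resets.append(t == 0)  # True exactly on each first frame
    return frames, targets, resets

frames, targets, resets = concatenate([seq_a, seq_b])
# resets is [True, False, False, True, False]: one marker per original sequence.
```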
A reset marker $r_p(t)$ is a boolean value indicating that the state of the network has to be reset to the initial condition. Such initial condition is assumed to be zero in this paper. Reset markers can be used to break the input sequence into a set of independent input sub-sequences. For example, we can regard the training set as a collection of a few sequences, putting reset markers on the first frame of each training example sequence. The computation of the Multi-Delay RNN is based on the following equations:

$$a_{pi}(t) = \sum_{d=0}^{D} (1 - r_p(t-d)) \left[ \sum_{j \in S_i^{(d)}(\mathcal{N})} w_{ij}^{(d)}\, x_{pj}(t-d) + \sum_{j \in S_i^{(d)}(\mathcal{U})} w_{ij}^{(d)}\, u_{pj}(t-d) \right]$$

$$x_{pi}(t) = f(a_{pi}(t)) = \tanh(a_{pi}(t)), \quad i \in \mathcal{N},\ p = 1 \ldots P,\ t = 1 \ldots T_p \quad (4)$$

where $a_{pi}(t)$ is the activation of the $i$-th neuron at time $t$, $x_{pi}(t)$ is its squashed output, and $u_{pj}(t)$ is the $j$-th external input. $S_i^{(d)}(\mathcal{N})$ is the set of network senders to neuron $i$ with delay $d$; $S_i^{(d)}(\mathcal{U})$ is the set of external senders to neuron $i$ with delay $d$. The computation in eq. (4) is referred to as the forward pass.

In this framework we assume supervised learning. We denote with $e_{pi}(t) := \lambda_{pi}(t)\,(x_{pi}(t) - x'_{pi}(t))$ the error on unit $i \in \mathcal{N}$ with respect to the target at time $t$ for the $p$-th sequence; $\lambda_{pi}(t)$ is a supervision weight. The cost function is then defined as:

$$C := \sum_{p=1}^{P} C_p \quad (5)$$

$$C_p = \frac{1}{2} \sum_{t=1}^{T_p} \sum_{i \in \mathcal{N}} e_{pi}^2(t). \quad (6)$$

The goal of learning is to find a set of connection weights which minimizes the cost (5). Such a solution can be obtained by using standard gradient descent. We describe hereafter a method for gradient computation, which is based on error back-propagation through time [8,9]. Define:

$$y_{pi}(t) := \frac{\partial C_p}{\partial a_{pi}(t)}; \quad (7)$$

then we can rewrite the gradient as

$$\frac{\partial C_p}{\partial w_{ij}^{(d)}} = \sum_{t=1}^{T_p} y_{pi}(t)\, x_{pj}(t-d) \quad (8)$$

with the convention $x_{pj}(\tau) = 0$ for $\tau \le 0$. The terms $y_{pi}(t)$ are computed as:

$$y_{pi}(t) = f'(a_{pi}(t)) \left[ e_{pi}(t) + \sum_{d=0}^{D} \sum_{k \in R_i^{(d)}} (1 - r_p(t+d))\, w_{ki}^{(d)}\, y_{pk}(t+d) \right] \quad (9)$$

with the convention $y_{pk}(\tau) = 0$ for $\tau > T_p$. $R_i^{(d)}$ denotes the set of receivers from neuron $i$ with delay $d$. The computation in eq. (9) is referred to as the backward pass. According to eq. (9), the computation of the partial derivatives $y_{pi}(t)$ cannot start before the complete sequence has been processed; hence all the past network activities must be stored in memory.

Besides fully recurrent networks, the above proposed framework includes other models of supervised networks, such as local-feedback MLNs [10] and TDNNs [11]. As a special case, when all sub-graphs $\mathcal{G}^{(d)}$ are empty for $d > 0$, we obtain feedforward networks.

2.2. Insertion of rule-based prior knowledge

Learning by example has a primary role in connectionist models. Nevertheless, many cases exist in which some form of prior knowledge is available. Taking advantage of such knowledge can relieve learning from discovering complex rules [7]. For sequence classification problems, it is convenient to represent the prior knowledge by means of automaton rules [6,7]. Such rules can be injected into the connections of a recurrent architecture [12]. It can be proved that the network so obtained carries out the automaton computation, provided that the weight values satisfy an appropriate set of linear constraints. This network can thereby be used in a modular system which also includes randomly initialized sub-networks, used to learn the unknown rules. Fig. 1 provides an example of this kind of network in a problem of isolated word recognition. Each word is modeled by a RNN composed of two sub-networks. The left-hand sub-network (Mano_K) is generated using lexical knowledge, whereas the right-hand one (Mano_L) is only devoted to learning by examples. The system classifies the incoming word by selecting the model having the highest value for the output neuron activation. Each network is trained using positive and negative examples.
The injected prior knowledge makes it possible to train each model using only a subset of the dictionary, selecting just the words more likely to be confused. This approach has proven to be effective in preliminary experiments of automatic speech recognition [6,7]. One of the primary aims of Daphne is that of allowing massive experimentation with this method.

Fig. 1. Neural Network model for the Italian word "mano".

3. The training kernel

The kernel of Daphne is the parallel program, running on the Connection Machine, which carries out the learning procedures. The input to the kernel is composed of: the training data, i.e. the input sequences and the supervision information; a network description, including the topology, an initial set of weights and a set of constraints; a set of parameters controlling the learning procedure. The result of the computation is an updated network description with learned weights. The kernel has a low-level interface with the network description and the training data, basically composed of files and pipes, whose formats are suitable for internal use but unreadable to humans. A set of tools, designed to provide a high-level interface, is described in the next section.

3.1. Training example parallelism for feedforward nets

Many possible dimensions of parallelism exist for implementing neural networks on massively parallel machines [13]. As a matter of fact, training example parallelism (TEP) is the most effective dimension for achieving high efficiency when using such machines. We introduce the basic concepts of TEP by first considering the case of feedforward networks (i.e. $d = 0$ and $T_p = 1$ in each equation of the previous section). The basic steps of TEP are the following:

1. Store each training example into the memory of a different processor element (PE). This includes both the external input and the target vector.
2.
For each PE, allocate the memory to hold the network activities and the partial derivatives (7).
3. Allocate the connection weights on the front-end.
4. Compute in parallel the forward pass on each training example (eq. (4)).
5. Similarly, compute in parallel the backward pass (eq. (9)).
6. In the above two steps, the values of the connection weights are broadcast to each PE.
7. For each connection weight, compute in parallel the partial gradient relative to each training pattern (see eq. (8)). To obtain the complete gradient, sum the partial terms using a prefix-sum (scan) operation [5] (the sum is carried out in logarithmic time).
8. Update the weights on the front-end.

This basic formulation is sufficient to discuss some general properties of TEP. A first advantage of the method is the low amount of processor communication. Except for the initial loading of the training data, the only communication occurs when the partial components of the gradient are collected, and this communication is relatively fast on the Connection Machine [5].

In this context we are interested in the time required for one iteration of the Backpropagation algorithm. Such an iteration, also referred to as an epoch, includes the forward and backward computations over the training set, as well as the weight update. A commonly accepted index of performance is MCUPS (Mega Connection Updates Per Second), defined as:

$$\mathrm{MCUPS} := \frac{|\mathcal{W}|\,P}{\text{epoch execution time in } \mu\text{s}}. \quad (10)$$

Another common measure is MCPS (Mega Connections Per Second), which takes into account only the forward pass computation:

$$\mathrm{MCPS} := \frac{|\mathcal{W}|\,P}{\text{forward pass execution time in } \mu\text{s}}. \quad (11)$$

Typically, the ratio between the two performance indexes is about 3-5. The following table gives an idea of the differences in the achieved performance when using different dimensions of parallelism. These differences are essentially due to the amount of communication required by the implementation.

Table 1.
Some CM-2 implementations of Backpropagation

    Authors                   Parallelism                 MCUPS   Notes
    Rosenberg, Blelloch [14]  Node+weights                2.8     NETtalk
    Zhang [1]                 Node+data                   40.0    175 MCPS
    Diegert [15]              Node+data, DataVault swap   31.0    .3 GByte training data
    Singer [3]                Data                        n.a.    1300 MCPS

Another advantage of TEP is the independence of the performance with respect to the network topology. For example, node or weight parallelism performance degrades if a sparse network is used. Finally, TEP is easy to program and general. This makes it possible to experiment with different neural network models (e.g. second-order networks) without having to modify the mapping of data onto the machine.

3.2. Performance and degree of parallelism

In the previous subsection we showed that TEP is very well suited to achieving high MCUPS performance. Unfortunately, there are two main reasons which may limit the usefulness of the method in practical cases. The first one is that stochastic gradient descent [8] (i.e. updating the weights with the partial gradient $\partial C_p / \partial w_{ij}$) is not possible. Instead, the true gradient must be used, i.e. the weights must be updated in batch mode. In many cases this turns out to be a disadvantage, because batch updating can lead to slower convergence of learning. This is not true in general but, for highly redundant training sets, it could even cancel the advantages of parallelism. Another potential problem of the implementation is that each PE must store all the network activities. This limits the maximum allowed network size. A possible solution to these problems can be given by changing the degree of parallelism, i.e. the balance between the number of processors and the computational power of each processor. From this point of view, the CM-2 Connection Machine can be seen as a massively parallel computer, if the elementary processors are used as PEs, or as a highly parallel computer if the floating point units are used instead.
The two models of execution are respectively known as fieldwise and slicewise [5]. For the slicewise model, assuming 64-bit floating point hardware, the amount of memory of each PE is eight times the memory of an elementary PE under fieldwise.^a Also, the minimum allowed grid size is eight times less than the fieldwise one. The use of slicewise can thus be a solution to the second problem of TEP.

^a Actually there is a floating point unit for each set of 32 processors; each unit is seen as a vector of size 4.

A solution to the first problem can also be given. Suppose P is an integer multiple of the minimum allowed grid size M (which depends on the machine size), i.e. $P = KM$. Then divide the original training set L into smaller subsets $L_k$, $k = 1 \ldots K$, and allocate to the m-th PE the examples $L_p$ such that $m = (p - 1) \bmod M$. In so doing we can compute the partial gradient relative to each subset $k = 1 \ldots K$ and update the weights using that gradient vector. This corresponds to a semi-batch updating, which can exploit the training set redundancy and provide faster convergence.

3.3. Extension to recurrent nets

For the case of feedforward nets, the use of a coarser degree of parallelism can help to reduce some problems of TEP. For recurrent nets, however, we believe that the use of the slicewise model is the only practical way to implement a general simulator. This is essentially due to the heavy memory requirements of the learning algorithm when applied to long sequences. For example, in problems such as isolated word recognition, a maximum sequence length of 100 or more is quite common. Therefore, only small networks could be simulated using the fieldwise model. A remarkable difference with respect to the case of feedforward networks is that the training examples have variable length. A straightforward extension of the TEP approach would be the following.
Allocate a training sequence to each PE, then compute the forward and backward passes in parallel, disabling the processors containing a sequence p whose length $T_p$ is less than the currently processed frame t. It is easy to recognize that this approach is wasteful, both in memory and in computational resources. This is particularly true when the sequence length has a large variance. A more efficient extension of TEP is used in the parallel kernel of Daphne. We recall that by introducing reset points we can concatenate two or more training examples to build a single sequence. In so doing we can reduce the memory waste, trying to build a set of sequences having similar length. From a formal point of view we can state the following optimization problem.

Fig. 2. Allocation of data on the Connection Machine (parallel dimension: training sequences; serial dimensions: time and neurons; reset markers and wasted memory indicated).

Let $L = \{L_1, \ldots, L_P\}$ be an initial training set from which to select the subset of sequences which will actually be used for learning. Let M be the minimum allowed grid size (a function of the size of the machine). Let N be the maximum number of frames which can be allocated to each processor element (a function of the memory installed and of the network size). Basically, given a sequence $L_p$ of length $T_p$, we want to decide whether or not to allocate it into the memory of the m-th PE. To this purpose we define the integer variables $x_{m,p}$, $m = 1 \ldots M$, $p = 1 \ldots P$, as follows:

$$x_{m,p} = \begin{cases} 0 & \text{if } L_p \text{ is not allocated to the } m\text{-th PE} \\ 1 & \text{if } L_p \text{ is allocated to the } m\text{-th PE} \end{cases} \quad (12)$$

Then the best allocation of data can be obtained by solving the integer program [16,17]:

$$\max \sum_{m=1}^{M} \sum_{p=1}^{P} x_{m,p}\, T_p$$

subject to:

$$\sum_{p=1}^{P} x_{m,p}\, T_p \le N \quad \forall\, m = 1 \ldots M \quad (13)$$

$$\sum_{m=1}^{M} x_{m,p} \le 1 \quad \forall\, p = 1 \ldots P \quad (14)$$

$$x_{m,p} \in \{0, 1\}.$$

In Fig. 2 we illustrate this concept. Fig.
2 also shows how the network activation is allocated on the Connection Machine. Basically, a three-dimensional grid is used. The dimension relative to the training sequences is parallel. The two other dimensions, associated with time and neuron index, are serial. During the computation of the forward pass, the presence of reset points must be detected. According to eq. (4), where an active reset point is found, no multiplication needs to be carried out. Therefore, it is sufficient to disable the context of the processors having an active reset point. The operation introduces a small overhead. Similarly, during the backward pass, reset points must be checked in order to avoid propagating the error between independent sequences.

Fig. 3. Block diagram of Daphne (modules: training data generator, network language compiler and linker, automata compilation algorithm, parallel training kernel, and serial simulator for symbolic test).

4. Organization of the simulator

In this section we briefly describe the tools available with the simulator for interfacing the training kernel with the external data and with the network description. Figure 3 sketches the overall interaction among the modules.

4.1. Network language

The network description comprises the following pieces of information: the network topology, i.e. a representation of the graph $\mathcal{G}$; initial values for the connection weights; constraints on weights. The network architecture is described by a special-purpose high-level language. The language is particularly conceived to deal with architectures composed of sub-network modules. Each module is compiled into an object module, in which unit indexes are relocatable.
In this way, a set of object modules can be linked together to produce a so-called "runnable net", which is actually used in the learning process. The syntax of the network language resembles that of C. The language has three data types: unit (used to declare external inputs and network neurons), weight and float. Arrays of units can be declared with the same syntax used in C. A right index is used to refer to an array element. A left index specifies a time delay.

    #define Istar 3.214

    network Mano_K(unit NasVow[7], unit M, unit A, unit N, unit O,
                   float a, float b, float c)
    {
        weights wm, wa, wn, wo, sm, sa, sn, so;

        wn :- connect({ Bias, NasVow[1] }, M);
        wu :- connect({ Bias, M, NasVow[2] }, A);
        wm :- connect({ M, A, NasVow[0] }, N);
        wa :- connect({ A, N, NasVow[5] }, O);
        sm :- connect([1]M, M);
        ....
        sm := 5.342;
        wn * {a, b} > Istar;
    }
    .....
    main network Mano(input unit u[7], output unit x)
    {
        unit NFA[4];       /* Automaton state encoding neurons */
        unit Learnable[2]; /* Learnable sub-net hidden units */

        Mano_K(u, NFA, 1.32, -3.132, 3.43);
        Mano_L(u, Learnable);
        connect( {NFA, Learnable}, x);
    }

Fig. 4. Example of the network language.

Sub-networks are declared with a syntax which is very close to that of a C function. A sub-network declaration is an intentional declaration: it just produces a description of connections among units. Formal arguments can be specified in the header of the declaration; they can belong to any data type. Units declared as formal parameters of a sub-network are externally accessible. Other units, which are hidden for that sub-network, can be declared inside the body. A runnable network is built by linking sub-network modules. A special type of module, identified by the keyword main, describes the global architecture. Sub-networks can be referenced inside the body of the main network module or of the sub-network modules.
When a sub-network is referenced, new neurons and weights are instantiated, according to the intentional declaration contained in the sub-network body. The basic statement to declare connections is connect. It takes as arguments two lists of units, and produces a set of connections from each unit in the first list towards each unit in the second list. The return value of connect is a weights descriptor, which can be assigned to a weight variable. Such a variable can subsequently be used to assign initial values or to impose constraints on the corresponding weights. For more details of the language see [18]. In Fig. 4 an example of code is reported, relative to the declaration of the network of Fig. 1.

4.2. Other modules

The automata compiler module serves to inject rule-based prior knowledge into the connection weights. The program takes as input a set of automata rules and produces a source file in the network language. The rule injection algorithm is described in detail in [12]. Basically, a recurrent architecture is built and an initial value is provided for the connection weights. A set of linear constraints on the weights is also generated by the compiler. The network architecture so built can be integrated with other modules written by the designer. Also, the designer can modify the generated source code, for example to add connections or neurons.

The data generator interfaces the kernel with the training set. Its main task is that of computing the optimal allocation of data, according to the method of subsection 3.3. In this implementation there is no DataVault support; the data generator runs on the front-end and feeds the simulator through a UNIX pipe. There is no predefined data format: the routines which access the training data must be rewritten when the data format changes. This operation is trivial, increases flexibility during the experiments, and avoids duplicating large databases only to conform to a predefined data format.
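The allocation task of subsection 3.3 is a multi-dimensional 0/1 knapsack problem [17], which is expensive to solve exactly for large training sets. A longest-first greedy packing (our illustrative heuristic, not necessarily the algorithm used by the data generator) respects the constraints of the integer program while filling the PEs:

```python
def allocate(lengths, M, N):
    """Greedily assign sequences (given their lengths T_p) to M processor
    elements, each holding at most N frames: longer sequences first,
    always into the PE with the most free space. Sequences that fit
    nowhere are dropped. Returns M buckets of sequence indices."""
    buckets = [[] for _ in range(M)]   # sequence indices placed on each PE
    free = [N] * M                     # remaining frame capacity per PE
    for p in sorted(range(len(lengths)), key=lambda i: -lengths[i]):
        m = max(range(M), key=lambda i: free[i])  # PE with most room
        if lengths[p] <= free[m]:      # capacity constraint: at most N frames
            buckets[m].append(p)
            free[m] -= lengths[p]
        # else: sequence p is left out (each sequence is placed at most once)
    return buckets

# Five sequences of lengths T_p packed onto M = 2 PEs of capacity N = 150.
lengths = [100, 60, 40, 90, 10]
buckets = allocate(lengths, M=2, N=150)
```

The heuristic gives no optimality guarantee, but it keeps the per-PE frame totals balanced, which is exactly what limits the wasted memory in Fig. 2.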
Finally, a serial test program is provided, which carries out only the forward pass computation. Its primary purpose is that of inspecting the behavior of the network after learning has occurred. The program can be conveniently run on a local workstation and features graphic animation and symbolic access to the network neurons and weights.

5. Conclusions

In this paper we have described the basic ideas for the implementation of a parallel simulator for recurrent neural networks on the Connection Machine. The simulator uses data parallelism combined with optimal allocation of the training sequences into the processors' memory. The use of the slicewise model makes it possible to deal with reasonably long sequences. This is an essential feature for some tasks in automatic speech recognition. Finally, a dedicated language allows modular design of the neural system, with particular emphasis on the injection of rule-based prior knowledge.

References

1. X. Zhang et al., "An Efficient Implementation of the Backpropagation Algorithm on the Connection Machine CM-2," in Neural Information Processing Systems 2, Denver, CO, 1989, pp. 801-809.
2. D.E. Rumelhart and J.L. McClelland, Explorations in Parallel Distributed Processing, Vol. 3, MIT Press, 1988.
3. A. Singer, "Implementations of Artificial Neural Networks on the Connection Machine," Parallel Computing, 14, 1990, pp. 305-315.
4. T. Fontaine, "GRAD-CM2: A Data-parallel Connectionist Network Simulator," MS-CIS-92-55/LINC LAB 232, University of Pennsylvania, July 1992.
5. CM Fortran Programming Guide, Thinking Machines Corporation, Cambridge, MA, Version 1.1, January 1991.
6. P. Frasconi, M. Gori, M. Maggini, and G. Soda, "A Unified Approach for Integrating Explicit Knowledge and Learning by Example in Recurrent Networks," Proceedings of IEEE-IJCNN91, Seattle, I 811-816, 1991.
7. P. Frasconi, M. Gori, M. Maggini, and G. Soda, "Unified Integration of Explicit Rules and Learning by Example in Recurrent Networks," IEEE Trans. on Knowledge and Data Engineering, in press.
8. D.E. Rumelhart, G.E. Hinton, and R.J. Williams, "Learning Internal Representations by Error Propagation," in Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1: Foundations, MIT Press, 1986.
9. R.L. Watrous, "Speech Recognition Using Connectionist Networks," Ph.D. Thesis, University of Pennsylvania, Philadelphia, November 1988.
10. P. Frasconi, M. Gori, and G. Soda, "Local Feedback Multi-Layered Networks," Neural Computation, 4(1), 1991, pp. 120-130.
11. A. Waibel, T. Hanazawa, G. Hinton, K. Shikano, and K. Lang, "Phoneme Recognition Using Time-Delay Neural Networks," IEEE Transactions on ASSP, 37(3), 1989, pp. 328-339.
12. P. Frasconi, M. Gori, and G. Soda, "Injecting Nondeterministic Finite State Automata into Recurrent Neural Networks," Technical Report DSI-15/92, University of Florence, 1992.
13. T. Nordstrom and B. Svensson, "Using and Designing Massively Parallel Computers for Artificial Neural Networks," TULEA 1991:13, Lulea Univ. of Technology.
14. C.R. Rosenberg and G. Blelloch, "An Implementation of Network Learning on the Connection Machine," in D. Waltz and J. Feldman, eds., Connectionist Models and their Implications, Norwood, NJ: Ablex Pub. Corp., 1988.
15. C. Diegert, "Out-of-core Backpropagation," in Proceedings of IEEE-IJCNN90, San Diego, II 97-103, 1990.
16. C.H. Papadimitriou and K. Steiglitz, Combinatorial Optimization: Algorithms and Complexity, Englewood Cliffs, NJ: Prentice-Hall, 1982.
17. H.M. Weingartner and D.N. Ness, "Methods for the Solution of the Multi-Dimensional 0/1 Knapsack Problem," Operations Research, 15, pp. 83-103, 1967.
18. G. Bellesi, P. Frasconi, M. Gori, and G. Soda, "A Compiler for a Modular Neural Network Language," Technical Report DSI-16/90, University of Florence, 1992.
