Massively Parallel Processing Implementation of the Toroidal Neural Networks

P. Palazzari (1), M. Coli (2), R. Rughi (2)

(1) ENEA – HPCN Project, C.R. Casaccia, Via Anguillarese 301, SP100, 00060 S. Maria di Galeria (Rome) – [email protected]
(2) Electronic Engineering Department, University "La Sapienza", Via Eudossiana 18, 00184 Rome – [email protected] – [email protected]

ABSTRACT: The Toroidal Neural Networks (TNN), recently introduced, are derived from DT-CNN and are characterized by an appealing mathematical description which allows the development of an exact learning algorithm. In this work, after reviewing the underlying theory, we describe the implementation of TNN on the APE100/Quadrics massively parallel system and, through an efficiency figure, we show that this type of synchronous SIMD system is very well suited to support the TNN (and DT-CNN) computational paradigm.

1. Introduction

The Toroidal Neural Networks (TNN) [4] are a new type of Discrete Time Cellular Neural Networks (DT-CNN, [7]). TNN have binary outputs and, as their name underlines, are characterized by a 2D toroidal topology: using the formalism of circulating matrices, a compact mathematical formulation of the TNN state evolution, along with the exact Polyhedral Intersection Learning Algorithm (PILA), is presented in [4]. Owing to their intrinsic parallelism and to the fast computation of the final output, TNN are very well suited to support an efficient image processing environment. In this work, after a brief review of the theoretical basis of TNN, the parallel implementation of TNN on the massively parallel system APE100/Quadrics [1] is described and comparisons are made with previous implementations of DT-CNN on other massively parallel systems [6].

2. Basic Mathematical Definitions

Given an N-dimensional row vector w, an affine half space is the set

HS = \{ x \mid x \in Q^N, \; wx \ge \beta \}, \quad \beta \in Q.

A polyhedron P is the intersection of finitely many half spaces [9], i.e.

P = \{ x \mid x \in Q^N, \; Ax \ge b \}   (1)

where A is the matrix whose k rows are the row vectors w_1, ..., w_k and b = (\beta_1, ..., \beta_k)^T is a k-dimensional constant vector.

Given a row vector a = (a_1, a_2, ..., a_n), a_i \in Q, the scalar right circulating matrix associated with a is defined as

R_S(a) = \begin{pmatrix} a_1 & a_2 & \cdots & a_n \\ a_n & a_1 & \cdots & a_{n-1} \\ \vdots & & \ddots & \vdots \\ a_2 & a_3 & \cdots & a_1 \end{pmatrix}   (2)

R_S(·) being an operator which receives an n-dimensional row vector a and returns a matrix having a as its first row; the i-th row is obtained by rotating the (i-1)-th row one step to the right (i = 2, 3, ..., n).

A block right circulating matrix is similar to a scalar right circulating matrix, but its entries are circulating matrices instead of scalar values. For example, let us consider M scalar circulating matrices A_i = R_S(a_i) = R_S(a_{i,1}, a_{i,2}, ..., a_{i,n}), i = 1, ..., M. The block right circulating matrix is defined as

\tilde{R}(A) = \tilde{R}(A_1, A_2, ..., A_M) = \begin{pmatrix} A_1 & A_2 & \cdots & A_M \\ A_M & A_1 & \cdots & A_{M-1} \\ \vdots & & \ddots & \vdots \\ A_2 & A_3 & \cdots & A_1 \end{pmatrix}   (3)

\tilde{R}(·) being an operator which receives an M-dimensional block row vector A = (R_S(a_1), ..., R_S(a_M)) and returns a matrix having A as its first block row; the i-th block row is obtained by rotating the (i-1)-th block row one step to the right (i = 2, 3, ..., M). If the vectors a_i (i = 1, 2, ..., M) have length n, \tilde{R}(A) is an (Mn × Mn) matrix.
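To make the two circulating-matrix operators concrete, the following sketch builds them by explicit row rotation (a minimal Python/NumPy illustration, not the TAO code discussed later in the paper; the function names right_circulant and block_right_circulant are ours):

    import numpy as np

    def right_circulant(a):
        """Scalar right circulating matrix R_S(a) of eq. (2): the i-th row is
        the (i-1)-th row rotated one step to the right."""
        a = np.asarray(a, dtype=float)
        return np.stack([np.roll(a, i) for i in range(a.size)])

    def block_right_circulant(blocks):
        """Block right circulating matrix R~(A_1, ..., A_M) of eq. (3):
        'blocks' is the list of the M scalar circulating matrices A_i; the i-th
        block row is the (i-1)-th block row rotated right by one block."""
        M = len(blocks)
        return np.vstack([np.hstack(blocks[M - i:] + blocks[:M - i])
                          for i in range(M)])

    # right_circulant([1, 2, 3]) gives [[1, 2, 3], [3, 1, 2], [2, 3, 1]],
    # i.e. the pattern of eq. (2); block_right_circulant([A1, A2, A3])
    # stacks the blocks in the pattern of eq. (3).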
3. TNN

Toroidal Neural Networks (TNN) have binary output and are characterized by a bidimensional toroidal topology, i.e. neurons are defined over an (M × M) grid G with connections between corresponding neurons on opposite borders. Given two points p_i = (x_{i1}, x_{i2}) (i = 1, 2), the distance between p_1 and p_2 is defined as

D(p_1, p_2) = \max_{i=1,2} \; \min\left( |x_{1i} - x_{2i}|, \; M - |x_{1i} - x_{2i}| \right).

On a TNN, the neighborhood with radius r of neuron p_i is

N_r(p_i) = \{ p \mid p \in G, \; D(p_i, p) \le r \}   (4)

When no ambiguity can arise, neuron p_i = (x_{i1}, x_{i2}) is indicated through its coordinates (x_{i1}, x_{i2}). The neuron with coordinates (i, j) is connected to neuron (k, l) if (k, l) belongs to the neighborhood of (i, j). The weight connecting the two neurons is t_{(i,j)(k,l)}; the set of weights connecting a neuron with its neighborhood is the cloning template CT, i.e.

CT = \{ t_{(i,j)(k,l)} \mid (k, l) \in N_r(i, j) \}   (5)

As usual in Cellular Neural Networks (CNN, [2], [3]), CT is the same for all neurons, i.e. it is spatially invariant. If s_{i,j}(n) is the state of neuron (i, j) at the discrete time instant n, the next state is given by

s_{i,j}(n+1) = \sum_{(k,l) \in N_r(i,j)} t_{(i,j)(k,l)} \, s_{k,l}(n)   (6)

The output y of a TNN is binary and is assigned on the basis of the sign of the difference between the final and the initial state:

y_{i,j}(n+1) = \begin{cases} +1 & \text{if } s_{i,j}(n+1) \ge s_{i,j}(0) \\ -1 & \text{if } s_{i,j}(n+1) < s_{i,j}(0) \end{cases}   (7)

A cloning template with radius r is expressed through the (2r+1) × (2r+1) weight matrix

t = \begin{pmatrix} t_{-r,-r} & \cdots & t_{0,-r} & \cdots & t_{r,-r} \\ \vdots & & \vdots & & \vdots \\ t_{-r,0} & \cdots & t_{0,0} & \cdots & t_{r,0} \\ \vdots & & \vdots & & \vdots \\ t_{-r,r} & \cdots & t_{0,r} & \cdots & t_{r,r} \end{pmatrix}   (8)

In the general case of an M × M TNN, Rp_i (i = 1, ..., M) is defined as the M × M scalar right circulating matrix associated with one row of the cloning template t, extended through the insertion of zeroes into the positions r+2, ..., M-r. For the indices i satisfying i \in \{1, ..., r+1\} \cup \{M-r+1, ..., M\}, letting k = i - 1 in the first case and k = i - M - 1 in the second one,

Rp_i = R_S\left( t_{0,k}, \; t_{1,k}, \; \ldots, \; t_{r,k}, \; 0, \; \ldots, \; 0, \; t_{-r,k}, \; \ldots, \; t_{-1,k} \right)   (9)

while Rp_i is the null M × M matrix when r+2 \le i \le M-r.

The state of an M × M TNN at (discrete) time n can be represented through the M²-entry column vector

s(n) = \begin{pmatrix} s_1(n) \\ s_2(n) \\ \vdots \\ s_M(n) \end{pmatrix}, \quad s_i(n) = \begin{pmatrix} s_{i,1}(n) \\ s_{i,2}(n) \\ \vdots \\ s_{i,M}(n) \end{pmatrix} \quad (i = 1, \ldots, M)   (10)

So, from equations (6), (8), (9), (10), the state evolution of an M × M TNN can be written as

\begin{pmatrix} s_1(n+1) \\ s_2(n+1) \\ \vdots \\ s_M(n+1) \end{pmatrix} = \begin{pmatrix} Rp_1 & Rp_2 & \cdots & Rp_M \\ Rp_M & Rp_1 & \cdots & Rp_{M-1} \\ \vdots & & \ddots & \vdots \\ Rp_2 & Rp_3 & \cdots & Rp_1 \end{pmatrix} \begin{pmatrix} s_1(n) \\ s_2(n) \\ \vdots \\ s_M(n) \end{pmatrix}   (11)

or, in matrix notation, as

s(n+1) = RP \, s(n)   (12)

where s(·) is an M² × 1 vector and RP is an M² × M² block right circulating matrix. Because of the associative property of the matrix product, the evolution of the TNN state over m time instants is given by

s(m) = RP^m \, s(0)   (13)

where RP^m is still a block right circulating matrix (see [4]).

4. TNN Exact and Heuristic Learning

Let us consider a pair <I,O> of input-output (M × M) images describing the desired elaboration; the initial state s_{x,y}(0) is set to the value of pixel I(x,y) and the desired output after m steps is y_{x,y}(m) = O(x,y) (1 \le x, y \le M). From equations (7) and (13) we have

y_{x,y}(m) = O(x,y) \iff s_{x,y}(m) \; \gtrless_{x,y,O} \; s_{x,y}(0)   (14)

where

\gtrless_{x,y,O} = \begin{cases} \ge & \text{if } O(x,y) = 1 \\ < & \text{if } O(x,y) = -1 \end{cases}   (15)

i.e. s_{x,y}(m) \gtrless_{x,y,O} s_{x,y}(0) means that the final state must be greater than or equal to (smaller than) s_{x,y}(0) when the (x,y) pixel of the output image O is equal to 1 (-1). From equations (6), (8), (9), (10), (11), (13), the set of M² inequalities s_{x,y}(m) \gtrless_{x,y,O} s_{x,y}(0) (x, y = 1, ..., M) contained in (14) can be written, for the case m = 1, as

\sum_{k=1}^{M^2} [RP]_{(x-1)M+y, \, k} \; s_k(0) \;\; \gtrless_{x,y,O} \;\; s_{(x-1)M+y}(0), \quad x = 1, \ldots, M; \; y = 1, \ldots, M   (16)

Each of the previous M² inequalities represents a half space in the CT space (the inequality unknowns, i.e. the CT weights, are contained in the RP terms, eq. (9)). Since (16) is the intersection of M² half spaces, it represents a polyhedron (see eq. (1)). Each point in the polyhedron is a CT implementing the desired <I,O> transformation.
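The state evolution (6)-(7) can be emulated directly on the image grid, without ever forming the M² × M² matrix RP of eq. (12), because the toroidal wrap-around is exactly what np.roll implements. The following sketch (a plain serial NumPy emulation, not the parallel implementation; the function names tnn_output and template_cost are ours) returns the binary output of eq. (7) and the pixel mismatch with respect to a target image O, i.e. the kind of cost that the heuristic learning described next tries to minimize:

    import numpy as np

    def tnn_output(I, t, m):
        """Evolve an M x M TNN state (eq. 6) for m steps on the torus and
        return the binary output of eq. (7). I is the M x M initial state,
        t the (2r+1) x (2r+1) cloning template."""
        s0 = np.asarray(I, dtype=float)
        t = np.asarray(t, dtype=float)
        r = (t.shape[0] - 1) // 2
        s = s0.copy()
        for _ in range(m):
            new_s = np.zeros_like(s)
            # each template tap t[dx+r, dy+r] weights the state shifted by
            # (dx, dy); np.roll provides the toroidal wrap-around
            for dx in range(-r, r + 1):
                for dy in range(-r, r + 1):
                    new_s += t[dx + r, dy + r] * np.roll(s, (-dx, -dy), (0, 1))
            s = new_s
        return np.where(s >= s0, 1, -1)      # eq. (7)

    def template_cost(I, O, t, m):
        """Number of pixels where the TNN output differs from the target O:
        the mismatch that heuristic learning tries to minimize."""
        return int(np.sum(tnn_output(I, t, m) != np.asarray(O)))

Each inequality of (16) simply states that, for the given <I,O> and m = 1, the sign test inside this evolution must come out right at pixel (x, y); since the state after one step is linear in the template weights, those conditions are linear in the CT and their intersection is the polyhedron discussed above.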
The exact learning algorithm, based on the polyhedral operations contained in the Polyhedral Library [10], is described in [4]. Whenever the CT which exactly transforms I into O cannot be found through polyhedral operations because the desired elaboration cannot be implemented (this can happen, for instance, if some contradictory information is contained in the training example), the TNN can be trained through heuristic algorithms to find a CT which approximates the desired elaboration. In [4] and [5] we studied the use of the Simulated Annealing (SA) algorithm [8]. Since TNN evolve their state for very few steps (typically m ranges from 1 to 3), the SA algorithm is very fast (times on a Pentium II machine are in the order of a few (~5) minutes). Once <I,O> has been specified, the CT radius r and the number of time steps m are chosen according to the locality of the problem, as discussed in [4]. Using these values, the SA algorithm is applied to search for the CT* which gives the best approximation of the desired elaboration. Indicating with Y(I, CT, m) the output of a TNN with cloning template CT, initial state I and m time steps of evolution, the SA algorithm is used to (nearly) solve the following minimization problem:

\left\| O - Y(I, CT^*, m) \right\| = \min_{CT} \left\| O - Y(I, CT, m) \right\|   (17)

5. TNN Parallel Implementation

In order to implement eq. (6) and (7) on a parallel system, we refer to a bidimensional toroidal parallel machine with p² = p × p processors (this topology can easily be embedded onto existing parallel systems). Given an M × M image, it is partitioned into p² subimages of size (M/p) × (M/p) (for the sake of simplicity we assume M to be an integer multiple of p). In order to avoid communication during the computations, r·m additional border cells are added at each border of the subimages: each processor stores the (M/p) × (M/p) subimage assigned to it, plus the last r·m columns of the subimage of the processor at its left and the first r·m columns of the subimage of the processor at its right (the same holds, with rows, for the processors in the up/down directions); this data distribution is depicted in Fig. 1.

Fig. 1: distribution of the subimages among the processors; the i-th processor holds a copy of the last r·m columns of the subimage of the processor at its left, the M/p columns of the image assigned to it, and a copy of the first r·m columns of the subimage of the processor at its right.

With such an image distribution, the TNN parallel algorithm is the following (a runnable sketch of the per-processor computation is given right after the pseudocode):

    input:  CT, m, I
    output: binary image Y
    begin
      for c = 1 to m
        do in parallel in all the processors
          for i = 1 to M/p
            for j = 1 to M/p
              compute eq. (6)
        enddo in parallel
      endfor
      do in parallel in all the processors
        for i = 1 to M/p
          for j = 1 to M/p
            compute eq. (7)
      enddo in parallel
    end
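As a concrete counterpart of the pseudocode, the following sketch emulates in NumPy the work of one (virtual) processor under the data distribution of Fig. 1; it is only an illustration of the halo idea under our own naming (add_halo, local_tnn), since the actual implementation is written in TAO for APE100/Quadrics. The subimage is padded with a halo of width r·m copied from the neighbouring subimages, so all m applications of eq. (6) and the final eq. (7) need no communication:

    import numpy as np

    def add_halo(image, px, py, M, p, halo):
        """Local block of processor (px, py): its (M/p) x (M/p) subimage plus
        a halo of width halo = r*m copied, with toroidal wrap-around, from
        the neighbouring subimages (the copies of Fig. 1)."""
        rows = [(px * (M // p) + di) % M for di in range(-halo, M // p + halo)]
        cols = [(py * (M // p) + dj) % M for dj in range(-halo, M // p + halo)]
        return image[np.ix_(rows, cols)]

    def local_tnn(I, t, m, px, py, p):
        """Per-processor part of the parallel algorithm: m steps of eq. (6)
        on the padded block, then eq. (7) on the inner (M/p) x (M/p) part."""
        M = I.shape[0]
        t = np.asarray(t, dtype=float)
        r = (t.shape[0] - 1) // 2
        halo = r * m
        block0 = add_halo(np.asarray(I, dtype=float), px, py, M, p, halo)
        block = block0.copy()
        for _ in range(m):
            new = np.zeros_like(block)
            # np.roll wraps around the *block*, which is wrong at its borders,
            # but in m steps the error travels at most r*m cells inward and
            # never reaches the inner (M/p) x (M/p) region kept below
            for dx in range(-r, r + 1):
                for dy in range(-r, r + 1):
                    new += t[dx + r, dy + r] * np.roll(block, (-dx, -dy), (0, 1))
            block = new
        inner = slice(halo, halo + M // p)
        return np.where(block[inner, inner] >= block0[inner, inner], 1, -1)

Stitching together the p² blocks returned by local_tnn gives the same output as the serial evolution; the halo of width r·m is precisely what lets the loop over the m time steps run on the SIMD machine with no inter-processor communication.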
6. Choice of the Parallel System and Comparative Results

Looking at the previous algorithm, it is clear that it is synchronous, that all data are accessed according to regular memory patterns, and that it is completely data parallel. For these reasons we chose the APE100/Quadrics SIMD massively parallel system to implement TNN. Such a machine is based on pipelined VLIW custom processors which offer a peak performance of 50 Mflops each. Since APE100/Quadrics is a synchronous machine, no time is wasted in synchronizing the evolution of the computation.

Thanks to the regularity of the memory access patterns, no caching policy is needed and very efficient data movement from the local memory to the internal registers is possible, allowing very efficient computations (not limited by the memory access bandwidth). The code was written in the TAO language, native to the APE100/Quadrics systems. With a little effort spent in writing efficient code (loop unrolling to avoid pipeline stalls, vectorization of memory accesses to avoid memory startup penalties), the program, tested for r = 1, 2, 3, m = 1, 2, 3 and M = 512, showed a sustained performance of 2.62 Gflops on a 128 processor system characterized by a peak performance of 6.4 Gflops and a main memory of 512 MBytes. Defining the efficiency of the code implementation as

η = (sustained performance) / (peak performance),

we achieved an efficiency η = 0.41. To give an idea of the quality of this figure, we compare it with the results reported in [6], where a very similar problem (the implementation of DT-CNN on massively parallel systems) was tackled. In that work the authors used a 256 processor CM-2, a 32 processor CM-5 and a 32 processor Cray T3D; all these machines have peak performances of nearly 4 Gflops. The best figures reported, for the best image size and CT radius, are 830 Mflops on the CM-5, 410 Mflops on the CM-2 and 95 Mflops on the Cray T3D, corresponding to η_CM-5 = 0.21, η_CM-2 = 0.10 and η_T3D = 0.024. Such efficiency figures clearly show that synchronous SIMD systems are better suited to implement TNN or DT-CNN.

Conclusion

In this work we have briefly reviewed the theory of a recently introduced class of DT-CNN, the Toroidal Neural Networks (TNN). We have described the parallel implementation of TNN on the APE100/Quadrics massively parallel system, where we achieved efficiency figures in the order of 40% of the peak performance. Such a high efficiency clearly shows that parallel systems like APE100/Quadrics are very well suited to implement TNN.

References

[1] A. Bartoloni et al.: A Hardware Implementation of the APE100 Architecture. International Journal of Modern Physics, C4, 1993.
[2] L.O. Chua, L. Yang: Cellular Neural Networks: Applications. IEEE Trans. on CAS, vol. 35, pp. 1273-1290, 1988.
[3] L.O. Chua, L. Yang: Cellular Neural Networks: Theory. IEEE Trans. on CAS, vol. 35, pp. 1257-1272, 1988.
[4] M. Coli, P. Palazzari, R. Rughi: The Toroidal Neural Networks. To appear in Proc. of the Intern. Symp. on Circuits and Systems, ISCAS 2000.
[5] M. Coli, P. Palazzari, R. Rughi: Non Linear Image Processing through Sequences of Fast Cellular Neural Networks (FCNN). IEEE-EURASIP International Workshop on Non Linear Signal and Image Processing (NSIP'99), June 20-23, 1999, Antalya, Turkey.
[6] G. Destri, P. Marenzoni: Cellular Neural Networks as a General Massively Parallel Computational Paradigm. Int. Journal of Circuit Theory and Applications, vol. 24, n. 3, 1996.
[7] H. Harrer, J.A. Nossek: Discrete Time Cellular Neural Networks. Int. Journal of Circuit Theory and Applications, vol. 20, Sept. 1992.
[8] S. Kirkpatrick, C.D. Gelatt, Jr., M.P. Vecchi: Optimization by Simulated Annealing. Science, vol. 220, no. 4598, pp. 671-680, May 1983.
[9] A. Schrijver: Theory of Linear and Integer Programming. John Wiley & Sons, 1986.
[10] D.K. Wilde: A Library for Doing Polyhedral Operations. IRISA Research Report n. 785 (downloadable from http://www.irisa.fr/EXTERNE/bibli/pi/pi785.html).