
Massively Parallel Processing Implementation of the
Toroidal Neural Networks
P. Palazzari¹, M. Coli², R. Rughi²
¹ ENEA – HPCN Project, C.R. Casaccia, Via Anguillarese 301, SP100, 00060 S. Maria di Galeria (Rome) – [email protected]
² Electronic Engineering Department, University “La Sapienza”, Via Eudossiana 18, 00184 Rome – [email protected][email protected]
ABSTRACT: The Toroidal Neural Networks (TNN), recently introduced, are derived from
DT-CNN and are characterized by an appealing mathematical description which allows the
development of an exact learning algorithm. In this work, after reviewing the underlying
theory, we describe the implementation of TNN on the APE100/Quadrics massively parallel
system and, through an efficiency figure, we show that such synchronous SIMD systems
are very well suited to support the TNN (and DT-CNN) computational paradigm.
1. Introduction
The Toroidal Neural Networks (TNN) [4] are a new type of Discrete Time Cellular Neural Networks (DT-CNN, [7]). TNN have binary outputs and, as their name underlines, are characterized by a 2D toroidal topology: using the formalism of circulating matrices, a compact mathematical formulation of the TNN state evolution, along with the exact Polyhedral Intersection Learning Algorithm (PILA), is presented in [4]. Thanks to their intrinsic parallelism and to the small number of steps needed to reach the final output, TNN are very well suited to support an efficient image processing environment.
In this work, after a brief review of the theoretical basis of TNN, the parallel implementation of TNN on the massively parallel system APE100/Quadrics [1] is described and comparisons are made with previous implementations of DT-CNN on other massively parallel systems [6].
2. Basic Mathematical Definitions
Given an N-dimensional row vector w, an affine half space is the set $HS = \{x \mid x \in Q^N,\ wx \geq \theta,\ \theta \in Q\}$. A polyhedron P is the intersection of finitely many half spaces [9], i.e.
$P = \{x \mid x \in Q^N,\ Ax \geq b\}$   (1)
being $A = \begin{pmatrix} w_1 \\ \vdots \\ w_k \end{pmatrix}$ a matrix composed of k row vectors $w_i$, and $b = \begin{pmatrix} \theta_1 \\ \vdots \\ \theta_k \end{pmatrix}$ a k-dimensional constant vector.
Given a row vector $a = (a_1, a_2, \ldots, a_n)$, $a_i \in Q$, a scalar right circulating matrix is defined as
$R_S(a) = \begin{pmatrix} a_1 & a_2 & \cdots & a_n \\ a_n & a_1 & \cdots & a_{n-1} \\ \vdots & \vdots & \ddots & \vdots \\ a_2 & a_3 & \cdots & a_1 \end{pmatrix}$   (2)
being $R_S(\cdot)$ an operator which receives an n-dimensional row vector and returns a matrix whose first row is a; the i-th row is obtained by rotating the (i-1)-th row one step to the right (i = 2, 3, …, n).
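As an illustration (ours, not part of the original paper), the following NumPy sketch builds the scalar right circulating matrix of eq. (2); the helper name right_circulant is a hypothetical choice:

    import numpy as np

    def right_circulant(a):
        # R_S(a) of eq. (2): the first row is a, each following row is the
        # previous one rotated one step to the right
        a = np.asarray(a)
        return np.stack([np.roll(a, i) for i in range(a.size)])

    # example: rows (1, 2, 3), (3, 1, 2), (2, 3, 1)
    print(right_circulant([1, 2, 3]))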
A block right circulating matrix is similar to a scalar right circulating matrix, but its entries are circulating matrices instead of scalar values; for example, let us consider M scalar circulating matrices $A_i = R_S(a_i) = R_S(a_{i,1}, a_{i,2}, \ldots, a_{i,n})$, i = 1, …, M. The block right circulating matrix is defined as
$\tilde{R}(A) = \tilde{R}(A_1, A_2, \ldots, A_M) = \begin{pmatrix} A_1 & A_2 & \cdots & A_M \\ A_M & A_1 & \cdots & A_{M-1} \\ \vdots & \vdots & \ddots & \vdots \\ A_2 & A_3 & \cdots & A_1 \end{pmatrix}$   (3)
being $\tilde{R}(\cdot)$ an operator which receives an M-dimensional block row vector $A = (R_S(a_1), \ldots, R_S(a_M))$ and returns a matrix whose first block row is A; the i-th block row is obtained by rotating the (i-1)-th block row one block to the right (i = 2, 3, …, M). If the vectors $a_i$ (i = 1, 2, …, M) have length n, $\tilde{R}(A)$ is an (Mn × Mn) matrix.
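Along the same lines, a minimal sketch (again our own illustration, reusing the hypothetical right_circulant helper above) of the block operator $\tilde{R}(\cdot)$ of eq. (3):

    import numpy as np

    def block_right_circulant(blocks):
        # R~(A_1, ..., A_M) of eq. (3): the first block row is (A_1, ..., A_M),
        # each following block row is rotated one block to the right
        M = len(blocks)
        rows = []
        for i in range(M):
            rotated = blocks[-i:] + blocks[:-i] if i else list(blocks)
            rows.append(np.hstack(rotated))
        return np.vstack(rows)

    # example: three 2x2 circulating blocks give a 6x6 block circulating matrix
    A1, A2, A3 = (right_circulant([v, v + 1]) for v in (1, 3, 5))
    print(block_right_circulant([A1, A2, A3]).shape)   # (6, 6)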
3. TNN
Toroidal Neural Networks (TNN) have binary output and are characterized by a bidimensional toroidal topology, i.e. neurons are defined over an (M × M) grid G with connections between corresponding neurons on opposite borders. Given two points $p_i = (x_{i1}, x_{i2})$ (i = 1, 2), the distance between $p_1$ and $p_2$ is defined as
$D(p_1, p_2) = \max_{i=1,2} \left\{ \min\left( \left| x_{1i} - x_{2i} \right|,\ M - \left| x_{1i} - x_{2i} \right| \right) \right\}$.
On a TNN, the neighborhood with radius r of neuron $p_i$ is
$N_r(p_i) = \left\{ p \mid p \neq p_i,\ p \in G,\ D(p_i, p) \leq r \right\}$   (4);
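For concreteness, a small Python sketch (our illustration, using 0-based coordinates and hypothetical function names) of the toroidal distance and of the neighborhood of eq. (4):

    def toroidal_distance(p1, p2, M):
        # per coordinate, take the shorter of the direct and the wrap-around
        # distance, then take the maximum over the two coordinates
        return max(min(abs(a - b), M - abs(a - b)) for a, b in zip(p1, p2))

    def neighborhood(p, r, M):
        # N_r(p): all grid points (other than p) within toroidal distance r
        return [(i, j) for i in range(M) for j in range(M)
                if (i, j) != p and toroidal_distance(p, (i, j), M) <= r]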
when ambiguity cannot arise, neuron $p_i = (x_{i1}, x_{i2})$ is indicated through its coordinates $(x_{i1}, x_{i2})$. The neuron with coordinates (i, j) is connected to the neuron (k, l) if (k, l) belongs to the neighborhood of (i, j). The weight connecting the two neurons is $t_{(i,j)(k,l)}$; the set of weights connecting a neuron with its neighborhood is the cloning template CT, i.e.
$CT = \left\{ t_{(i,j)(k,l)} \mid (k, l) \in N_r(i, j) \right\}$   (5).
As usual in Cellular Neural Networks (CNN - [2],[3]), CT is the same for all the neurons, so it is spatially
invariant.
If $s_{i,j}(n)$ is the state of neuron (i, j) at the discrete time instant n, the successive state is given by
$s_{i,j}(n+1) = \sum_{(k,l) \in N_r(i,j)} t_{(i,j)(k,l)} \cdot s_{k,l}(n)$   (6)
The output y of a TNN is binary and is assigned on the basis of the sign of the difference between the final and the
initial states:
$y_{i,j}(n+1) = \begin{cases} 1 & s_{i,j}(n+1) \geq s_{i,j}(0) \\ -1 & s_{i,j}(n+1) < s_{i,j}(0) \end{cases}$   (7)
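As an informal check of eqs. (6) and (7), the sketch below (our own NumPy illustration, not the TAO code of Section 5) performs one toroidal update step, with the state held in an M × M array s and the cloning template in a (2r+1) × (2r+1) array t (the central weight $t_{0,0}$ is included, consistently with eq. (8)):

    import numpy as np

    def tnn_step(s, t):
        # eq. (6): toroidal correlation of the state s with the cloning template t;
        # np.roll implements the wrap-around (toroidal) connections
        r = t.shape[0] // 2
        s_next = np.zeros_like(s, dtype=float)
        for dk in range(-r, r + 1):
            for dl in range(-r, r + 1):
                s_next += t[dk + r, dl + r] * np.roll(s, (-dk, -dl), axis=(0, 1))
        return s_next

    def tnn_output(s_final, s0):
        # eq. (7): +1 where the final state did not decrease, -1 elsewhere
        return np.where(s_final >= s0, 1, -1)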
A cloning template with radius r is expressed through the ((2r+1) × (2r+1)) weight matrix t:
$t_{(2r+1) \times (2r+1)} = \begin{pmatrix} t_{-r,-r} & \cdots & t_{0,-r} & \cdots & t_{r,-r} \\ \vdots & & \vdots & & \vdots \\ t_{-r,0} & \cdots & t_{0,0} & \cdots & t_{r,0} \\ \vdots & & \vdots & & \vdots \\ t_{-r,r} & \cdots & t_{0,r} & \cdots & t_{r,r} \end{pmatrix}$   (8)
In the general case of an M × M TNN, $Rp_i$ is defined as the M × M scalar right circulating matrix associated with the (i-r+1)-th row of the cloning template t, extended through the insertion of zeroes into the positions r+2, …, M-r. For the pairs (i, k) satisfying
$k = i - 1$ for $i = 1, \ldots, r+1$,  or  $k = i - M - 1$ for $i = M-r+1, \ldots, M$,
$Rp_i$ is given by
$Rp_i = R_S\big( t_{0,k},\ t_{1,k},\ \ldots,\ t_{r,k},\ \underbrace{0,\ \ldots,\ 0}_{M-2r-1},\ t_{-r,k},\ \ldots,\ t_{-1,k} \big)$   (9)
while $Rp_i$ is the null M × M matrix when $r+2 \leq i \leq M-r$.
The state of an M × M TNN at (discrete) time n can be represented through the $M^2$-entry column vector
$s(n) = \begin{pmatrix} s_1(n) \\ s_2(n) \\ \vdots \\ s_M(n) \end{pmatrix}$, where $s_i(n) = \begin{pmatrix} s_{i,1}(n) \\ s_{i,2}(n) \\ \vdots \\ s_{i,M}(n) \end{pmatrix}$ (i = 1, …, M);   (10)
So, from equations (6), (8), (9), (10), the state evolution of an M × M TNN can be written as
$\begin{pmatrix} s_1(n+1) \\ s_2(n+1) \\ \vdots \\ s_M(n+1) \end{pmatrix} = \begin{pmatrix} Rp_1 & Rp_2 & \cdots & Rp_M \\ Rp_M & Rp_1 & \cdots & Rp_{M-1} \\ \vdots & \vdots & \ddots & \vdots \\ Rp_2 & \cdots & Rp_M & Rp_1 \end{pmatrix} \cdot \begin{pmatrix} s_1(n) \\ s_2(n) \\ \vdots \\ s_M(n) \end{pmatrix}$   (11)
or, with matrix notation, as
$s(n+1) = RP \cdot s(n)$   (12)
where $s(\cdot)$ is an $M^2 \times 1$ vector and RP is an $M^2 \times M^2$ block right circulating matrix.
Because of the associative property of the matrix product, the evolution of the TNN state for m time instants is given by
$s(m) = RP^m \cdot s(0)$   (13)
where $RP^m$ is still a block right circulating matrix (see [4]).
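Purely as an illustrative cross-check of eqs. (11)-(13), RP can be assembled from the cloning template with the helpers sketched above and applied as an $M^2 \times M^2$ matrix; the index conventions below are our own, hypothetical choices (row-major flattening of the state image):

    import numpy as np

    def build_RP(t, M):
        # assemble the M^2 x M^2 block right circulating matrix RP of eq. (12)
        # from the (2r+1) x (2r+1) cloning template t (illustrative convention)
        r = t.shape[0] // 2
        blocks = []
        for i in range(M):
            d = i if i <= r else i - M          # row offset; non-null only if |d| <= r
            if -r <= d <= r:
                row = np.zeros(M)
                for k in range(-r, r + 1):      # template row extended with zeroes
                    row[k % M] = t[d + r, k + r]
                blocks.append(right_circulant(row))
            else:
                blocks.append(np.zeros((M, M)))
        return block_right_circulant(blocks)

    # eq. (13): m evolution steps as a single matrix power
    # s_m = np.linalg.matrix_power(build_RP(t, M), m) @ s0.reshape(-1)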
4. TNN Exact and Heuristic Learning
Let us consider a pair <I,O> of input-output (M × M) images describing the desired elaboration; the initial state $s_{x,y}(0)$ is set to the value of pixel I(x,y) and the desired output after m steps is obtained as $y_{x,y}(m) = O(x,y)$ (1 ≤ x, y ≤ M). From equations (7) and (13), we have
$y_{x,y}(m) = O(x,y) \iff s_{x,y}(m) \ \triangleright_{x,y,O}\ s_{x,y}(0)$   (14)
where
$\triangleright_{x,y,O} = \begin{cases} \geq & \text{if } O(x,y) = 1 \\ < & \text{if } O(x,y) = -1 \end{cases}$   (15).
$s_{x,y}(m) \ \triangleright_{x,y,O}\ s_{x,y}(0)$ means that the final state must be greater than or equal to (smaller than) $s_{x,y}(0)$ when the (x,y) pixel of the output image O is equal to 1 (-1).
From equations (6), (8), (9), (10), (11), (13), the set of $M^2$ inequalities $s_{x,y}(m) \ \triangleright_{x,y,O}\ s_{x,y}(0)$ (x, y = 1, …, M) contained in (14) can be written, for the case m = 1, as in the following:
$\sum_{k=1,\ldots,M^2} RP_{(x \cdot M + y),\,k} \cdot s_k(0) \ \triangleright_{x,y,O}\ s_{(x \cdot M + y)}(0)$   x = 1, …, M; y = 1, …, M   (16)
Each of the previous $M^2$ inequalities represents an open half space in the CT space (the inequality unknowns, i.e. the CT weights, are contained in the $RP_{ij}$ terms, eq. (9)). Since (16) is the intersection of $M^2$ half spaces, it represents a polyhedron (see eq. (1)). Each point in the polyhedron is a CT implementing the desired <I,O> transformation. The exact learning algorithm, based on the polyhedral operations contained in the Polyhedral Library [10], is described in [4].
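To make eq. (16) concrete for m = 1, the sketch below (our own illustration, not the PILA implementation of [4], which relies on the Polyhedral Library [10]) builds the coefficients of the $M^2$ linear inequalities in the unknown CT weights: the coefficient of weight $t_{dk,dl}$ at pixel (x, y) is simply the toroidally indexed input pixel, and the sense of the inequality is fixed by O(x, y):

    import numpy as np

    def half_spaces(I, O, r):
        # one inequality per pixel: c . vec(t)  >=  (or <)  s_{x,y}(0) = I[x, y]
        M = I.shape[0]
        rows, senses, rhs = [], [], []
        for x in range(M):
            for y in range(M):
                c = np.array([[I[(x + dk) % M, (y + dl) % M]
                               for dl in range(-r, r + 1)]
                              for dk in range(-r, r + 1)]).ravel()
                rows.append(c)
                senses.append('>=' if O[x, y] == 1 else '<')
                rhs.append(I[x, y])
        return np.array(rows), senses, np.array(rhs)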
Whenever the CT which exactly transforms I into O cannot be found through polyhedral operations because the desired elaboration cannot be implemented (this can happen, for instance, if some contradictory information is contained in the training example), TNN can be trained through heuristic algorithms to find a CT which approximates the desired elaboration. In [4] and [5] we studied the use of the Simulated Annealing (SA) algorithm [8]. Since TNN evolve their state for very few steps (typically m ranges from 1 to 3), the SA algorithm is very fast (times on a Pentium II machine are in the order of a few (~5) minutes).
Once <I,O> has been specified, the CT radius r and the number of time steps m are chosen according to the locality of the problem, as discussed in [4]. Using these values, the SA algorithm is applied to search for the CT* which gives the best approximation of the desired elaboration. Indicating with Y(I,CT,m) the output of a TNN with cloning template CT, initial state I and m time steps of evolution, the SA algorithm is used to (nearly) solve the following minimization problem:
$\left\| O - Y(I, CT^*, m) \right\| = \min_{CT} \left\| O - Y(I, CT, m) \right\|$   (17)
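A minimal sketch of a Simulated Annealing search for eq. (17); this is our own simplified loop (random template perturbations, Metropolis acceptance, exponential cooling), not the exact procedure of [4],[5], and it reuses the hypothetical tnn_step/tnn_output helpers sketched above:

    import numpy as np

    def sa_search(I, O, r, m, steps=5000, T0=1.0, alpha=0.999, seed=0):
        rng = np.random.default_rng(seed)

        def cost(ct):
            # ||O - Y(I, CT, m)||: total output mismatch after m steps
            s = I.astype(float)
            for _ in range(m):
                s = tnn_step(s, ct)
            return np.abs(O - tnn_output(s, I)).sum()

        ct = rng.standard_normal((2 * r + 1, 2 * r + 1))
        best, best_cost, T = ct.copy(), cost(ct), T0
        for _ in range(steps):
            cand = ct + 0.1 * rng.standard_normal(ct.shape)   # random perturbation
            dc = cost(cand) - cost(ct)
            if dc <= 0 or rng.random() < np.exp(-dc / T):     # Metropolis rule
                ct = cand
                c = cost(ct)
                if c < best_cost:
                    best, best_cost = ct.copy(), c
            T *= alpha                                        # exponential cooling
        return best, best_cost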
5. TNN Parallel Implementation
In order to implement eq. (6) and (7) on a parallel system, we refer to a bi-dimensional toroidal parallel machine with $p^2 = p \times p$ processors (this topology can easily be embedded onto existing parallel systems). Given an M × M image, it is partitioned into $p^2$ subimages with size (M/p) × (M/p) (for the sake of simplicity we assume M to be an integer multiple of p). In order to avoid communication during the computations, r·m additional border cells are added at each border (see fig. 1). Each processor contains the M/p columns of the image (i.e. the subimage assigned to it) plus the last r·m columns of the subimage in the processor at its left and the first r·m columns of the subimage in the processor at its right (the same for the processors in the up/down directions); this data distribution is depicted in figure 1.
Fig. 1: distribution of subimages among the processors; the i-th processor holds a copy of the last (r·m) columns of the subimage in the processor at its left, the M/p columns of the image assigned to it, and a copy of the first (r·m) columns of the subimage in the processor at its right.
With such an image distribution, the TNN parallel algorithm is the following:
input:
- CT, m, I
output:
- binary image Y
begin
  for c = 1 to m
    do in parallel in all the processors
      for i = 1 to M/p
        for j = 1 to M/p
          compute eq. (6)
    enddo in parallel
  endfor c
  do in parallel in all the processors
    for i = 1 to M/p
      for j = 1 to M/p
        compute eq. (7)
  enddo in parallel
end
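The TAO code itself is not reproduced here; the following sketch (ours, in plain NumPy, sequentially emulating the p × p processor grid) shows the data distribution of fig. 1: each subimage carries an r·m-cell toroidal halo copied from its neighbours, so the m update steps need no communication:

    import numpy as np

    def split_with_halo(img, p, halo):
        # partition an M x M image into p*p subimages of size (M/p + 2*halo)^2,
        # each padded with a toroidal halo of `halo` cells (fig. 1)
        M = img.shape[0]
        b = M // p                                # M assumed multiple of p
        padded = np.pad(img, halo, mode='wrap')   # wrap-around = toroidal borders
        return [[padded[i * b:(i + 1) * b + 2 * halo,
                        j * b:(j + 1) * b + 2 * halo].copy()
                 for j in range(p)] for i in range(p)]

    # with halo = r*m, each "processor" can run m steps of eq. (6) locally and
    # then keep only the central (M/p) x (M/p) block of its result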
6. Choice of the Parallel System and Comparative Results
Looking at the previous algorithm, it is clear that it is synchronous, that all data are accessed according to regular memory patterns, and that it is completely data parallel. For these reasons we have chosen the APE100/Quadrics SIMD massively parallel system to implement TNN. Such a machine is based on pipelined VLIW custom processors, each offering a peak performance of 50 Mflops. Being APE100/Quadrics a synchronous machine, no time must be wasted to synchronize the evolution of the computation. Thanks to the regularity of the memory access patterns, no caching policy is needed and very efficient data movements from local memory to the internal registers are possible, allowing very efficient computations (not limited by the memory data access bandwidth).
The code was written using the TAO language, native for the APE100/Quadrics systems. With a little effort in
writing efficient code (loop unrolling to avoid the pipeline stall, vectorization of memory accesses to avoid memory
startup penalties) the program, tested with r = 1, 2, 3, m = 1, 2, 3 and M = 512, showed a sustained performance of 2.62 Gflops on a 128-processor system, characterized by a peak performance of 6.4 Gflops and a main memory of 512 MBytes. Defining the efficiency of the code implementation as η = sustained performance / peak performance, we achieved an efficiency η = 0.41. To give an idea of the high quality of this figure, we compare it with the results reported in [6], where a very similar problem (the implementation of DT-CNN on massively parallel systems) was tackled. In that work the authors used a 256-processor CM-2, a 32-processor CM-5 and a 32-processor Cray T3D; all these machines have peak performances nearly equal to 4 Gflops. The best figures reported, for the best image size and CT radius, are 830 Mflops on the CM-5, 410 Mflops on the CM-2 and 95 Mflops on the Cray T3D, corresponding to η_CM-5 = 0.21, η_CM-2 = 0.10 and η_T3D = 0.024. Such efficiency figures clearly show that synchronous SIMD systems are better suited to
implement TNN or DT-CNN.
Conclusion
In this work we have briefly reviewed the theory of a recently introduced class of DT-CNN: the Toroidal Neural Networks (TNN). We have described the parallel implementation of TNN on the APE100/Quadrics massively parallel system, where we achieved efficiency figures in the order of 40% of the peak performance. Such a high efficiency clearly shows that parallel systems like the APE100/Quadrics are very well suited to implement TNN.
References
[1] A. Bartoloni et al.: A Hardware Implementation of the APE100 Architecture. International Journal of Modern Physics, C4, 1993
[2] L.O. Chua, L. Yang: Cellular Neural Networks: Applications. IEEE Trans. on CAS, vol. 35, pp. 1273-1290 (1988)
[3] L.O. Chua, L. Yang: Cellular Neural Networks: Theory. IEEE Trans. on CAS, vol. 35, pp. 1257-1272 (1988)
[4] M. Coli, P. Palazzari, R. Rughi: The Toroidal Neural Networks. To appear in the Proc. of the Intern. Symp. on Circuits and Systems, ISCAS 2000
[5] M. Coli, P. Palazzari, R. Rughi: Non Linear Image Processing through Sequences of Fast Cellular Neural Networks (FCNN). IEEE-EURASIP International Workshop on Nonlinear Signal and Image Processing (NSIP'99), June 20-23, 1999, Antalya, Turkey
[6] G. Destri, P. Marenzoni: Cellular Neural Networks as a General Massively Parallel Computational Paradigm. Int. Journal of Circuit Theory and Applications, vol. 24, n. 3, 1996
[7] H. Harrer, J.A. Nossek: Discrete Time Cellular Neural Networks. Int. Journal of Circuit Theory and Applications, vol. 20, Sept. 1992
[8] S. Kirkpatrick, C.D. Gelatt Jr., M.P. Vecchi: Optimization by Simulated Annealing. Science, vol. 220, n. 4598, pp. 671-680, May 1983
[9] A. Schrijver: Theory of Linear and Integer Programming. John Wiley & Sons Ltd, 1986
[10] D.K. Wilde: A Library for Doing Polyhedral Operations. IRISA Research Report n. 785 (downloadable from http://www.irisa.fr/EXTERNE/bibli/pi/pi785.html)