THE TORUS: AN EXERCISE IN CONSTRUCTING A PROCESSING SURFACE

Alain J. Martin
Computer Science Department, California Institute of Technology
Philips Research Laboratories, 5600 MD Eindhoven, The Netherlands

Proceedings of the Second Caltech Conference on VLSI, January 1981

Abstract. A "Processing Surface" is defined as a large, dense, and regular arrangement of processor and storage modules on a two-dimensional surface, e.g. a VLSI chip. A general method is described for distributing parallel recursive computations over such a surface. Scope rules enforcing the "locality" of variables and procedure parameters are introduced in the programming language. These rules and a particular interconnection of the modules on the surface make it possible to transmit parameter and variable values between modules without using extraneous communication actions. The choice of the Processing Surface topology for binary recursive computations is discussed and a torus-like topology is chosen.

0. INTRODUCTION

Let us call a "Processing Surface" a large, dense, regular arrangement of processor and storage modules on a two-dimensional surface, e.g. a VLSI chip. How can a computation be distributed over such a surface? What are the arrangements of the modules on the surface best suited for a certain class of computations? We propose to explore this problem in the following direction.
In such an environment, an action on a variable differs in complexity (in terms of the number of elementary steps necessary to perform the action) depending on the distance between the processor module performing the action and the storage module containing the variable. We want to reflect this issue at the programming level by introducing scope rules defining the distance between the program component where a variable is declared, and the program components where the variable can be used.

Since we expect intense communications between the program components, we expect assignments of the form x:=y where x and y belong to two adjacent components (this assignment can take the form of a procedure call or a pair of matching communication actions) to occur as frequently as assignments between variables of the same component. In most distributed systems, the first type of assignment is an order of magnitude more complex than the second one. We consider this hidden discrepancy between equivalent actions unacceptable. We will show that it is possible to define some locality rule for the program variables, and to organize the processor and storage modules on the surface such that no discrepancy of this sort appears. In such a case, the Processing Surface is said to be "continuous".

Furthermore, since for instance inverting a 2×2 matrix does not require as much parallelism as inverting a 1000×1000 matrix, the potential parallelism of an algorithm should not be fixed beforehand (e.g. by the number of available processors) but should be determined dynamically according to the needs of the computation. The component actions of a computation should be created and destroyed as the computation proceeds, and should be automatically distributed over the available modules.

1. THE GENERAL METHOD

The general method we use has been described in [1]. We shall recall it briefly.

The component actions of a computation (the "nodes") are regarded as the vertices of a graph (the "computation graph") which grows and shrinks during the computation. An edge (a "channel") between two nodes means that one of the two, say node A, has created the other, say node B, by a procedure call, and that A and B communicate directly with each other. A is the "father" of B, and B is a "son" of A. Thanks to a parallel procedure call, a father may create several sons simultaneously. The father/son relation defines a partial ordering of the nodes, and all nodes that are not relatively ordered can be performed in parallel.

A computation graph grows and shrinks through a given finite "implementation graph", whose vertices (the "cells") represent the available modules, and the edges (the "links") the communication possibilities between modules. Each node is mapped on a cell, and each channel on a link.

Hence, each cell may have to accommodate an unbounded number of nodes. Since a cell represents a very small number of sequential automata (in most cases, one!), the activities of all nodes simultaneously present in a cell have to be sequentialized in some way. But such a sequentialization may introduce deadlock.
The main result of [1] is to prove that the nodes of a cell can be interleaved without introducing deadlock provided that the grain of interleaving be correctly chosen. The solution is very simple in that it does not require any particular knowledge about the nodes or the implementation graph, nor complicated scheduling.

In this paper we shall consider a special class of computations, namely recursive computations. For this class of computations we shall describe how to implement a continuous Processing Surface, and we shall propose a torus-like topology for the implementation graph.

2. RECURSION

Much has been said about the use of recursion for parallel programming. The reader is referred to the abundant literature on this subject. For the sake of simplicity, we shall restrict ourselves to one of the most usual recursive methods, namely "divide-and-conquer" (also called "recursive doubling"). Divide-and-conquer algorithms are particularly interesting in that they produce binary trees as computation graphs. Binary trees are regular structures and each node has an outdegree of two, which is interesting in view of their mapping onto a two-dimensional surface.

ARCHITECTURE SESSION

Parallelism is introduced only by calling two procedures "in parallel". The possibility of further increasing parallelism by pipelining the parameters will not be mentioned although it can easily be added.
Nevertheless this class of algorithms is large enough (in particular numerical algorithms) for the exercise to be realistic.

Since a node is created by a procedure call, a node is a procedure instance with its own program counter, and its set of variables and parameters. The following rules define the "locality" of variables and parameters.

- The unit of locality is the node: a variable declared inside a node is local to that node.
- A variable local to a node A is a neighbour for all son nodes of A.
- Since the father/son relation between nodes is not transitive, the locality or neighbourhood of a variable with respect to a node is not transitive either: if a node P1 calls a node P2 which calls a node P3, a variable local to P1 is neighbour for P2, but not for P3.

Three types of parameters are used:

- An input parameter is used to "import" a parameter value into a son node, by an assignment of the actual parameter value to the formal parameter variable.
- An output parameter is used to "export" a value from a son node to its father by an assignment of the formal parameter value to the actual parameter variable.
- A reference parameter is used both to import and to export, but by a process of substitution, or "aliasing": the formal parameter replaces the actual parameter in the son node (it is another name for the same variable).

In the case of the input and the output parameters, the formal parameter is local to the son. In the case of the reference parameter, the formal parameter has the same locality as the actual parameter. The formal parameter is thus not local to the son.

Assume that the value x of a variable is to be imported from a father node P1 into a son node P2. Either an input or a reference mechanism can be used.
Assume now that x is to be passed again from P2 to a son node P3. If x was passed from P1 to P2 as an input parameter, x will be local to P3 if it is passed as an input parameter from P2 to P3, and neighbour to P3 if it is passed as a reference parameter from P2 to P3. But if x was passed from P1 to P2 as a reference parameter, x will neither be local nor neighbour to P3, whether it be passed as an input or as a reference parameter from P2 to P3. (In the case where a value is to be exported from a son node to a father node, exactly the same differences hold according to whether it is passed as an output or a reference parameter.)

Hence, the locality or neighbourhood of a reference parameter with respect to a node is not transitive whereas that of an input or output parameter is. But when a value x is passed as an input or an output parameter from node P to node Q, by definition x is copied from the storage area of P into the storage area of Q. No copying is necessary when x is passed as a reference parameter.

The repetitive transport of values via global variables and reference parameters could be used in its full generality, but we propose to restrict its use by the following "locality rule".

Locality rule: An action of a node involves only variables and parameters that are local and/or neighbour for the node.

(Whether global variables should be used at all is doubtful. They have been included for the sake of completeness.)

We shall see that this locality rule permits the implementation of a continuous Processing Surface.

4. IMPLEMENTATION OF A CONTINUOUS PROCESSING SURFACE

Definition: A Processing Surface is said to be "continuous" when any action performed on the surface involves only variables that are directly accessible to the processor performing the action, i.e. accessible by elementary read or write operations.

Hence, if we succeed in implementing a continuous surface, we shall have suppressed any form of extraneous communication action for accessing variables.

According to the general method, we know that if node N1 is mapped on cell C1, a son node N2 of N1 is mapped on a neighbour cell C2 of C1. For node N1 to be mapped on C1 means that the local variables and parameters of N1 must be allocated in the storage module associated with C1, and the same for N2 relative to C2. Let M1 and M2 be the storage modules associated with C1 and C2, respectively. According to the locality rule, any action of N2 may involve variables located in M1 and M2. The set (M1, M2) is called the "locality area" of N2. In the case where the computation graph is a tree, the locality area of a node consists of at most two elements.

As a direct consequence of the locality rule and of the definition of a continuous Processing Surface, the Processing Surface is continuous if the property C(N) holds for any node N.

C(N): any action of N is performed by a processor directly connected to the two storage modules of the locality area of N.

We shall describe a strategy for placing the processor and storage modules on the implementation graph, and for distributing the actions and the variables of the nodes over the processor and storage modules, such that C(N) holds for any node N.
This strategy thus implements a continuous Processing Surface.

1) The placing strategy is directly suggested by the property C(N). A storage module is placed at each vertex, and a processor module at each edge of the implementation graph. (See an example on fig. 1.) Hence, each processor has direct access to two storage modules, and each storage module is shared by as many processors as the degree of the vertex where it is placed.

Fig. 1. (a) implementation graph; (b) processor and storage placement.

2) Assume that C(F) holds for a node F. For instance, F has been created in cell 2 of fig. 1(a); its local variables are in M2, its neighbour variables in M1, and its actions are processed by P12 (see fig. 2).

Assume that at some stage in the computation of F two son nodes R and D (for right and down) of F are to be created in cells 3 and 4, respectively. The locality areas of R and D must then be (M2, M3) and (M2, M4), respectively (see fig. 2). This means that C(R) and C(D) will hold if and only if R and D are processed by P23 and P24, respectively.

Upon reaching the procedure calls of R and D in the procedure body of F, P12 must transmit the creation of R and D to P23 and P24. Since, by construction, P12, P23, and P24 share a common store, namely M2, the transmission of procedure calls is a simple and local action: P12 adds the names of R and D to the lists located in M2 of nodes to be processed by P23 and P24, respectively.
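The placing strategy and the hand-over of procedure calls through a shared store can be illustrated by a minimal sketch. It is only an illustration under stated assumptions: the link list `LINKS` is a hypothetical implementation graph (the actual graph of fig. 1(a) is not recoverable here), and the names `place_modules`, `shares`, and `spawn` are invented for this example rather than taken from the paper.

```python
from collections import defaultdict

# Hypothetical implementation graph: cell 2 is linked to cells 1, 3 and 4.
LINKS = [(1, 2), (2, 3), (2, 4)]

def place_modules(links):
    """Place one storage module per vertex (cell) and one processor per edge (link)."""
    cells = sorted({c for link in links for c in link})
    # Each storage module keeps, per processor, a list of node names to be processed.
    storage = {c: defaultdict(list) for c in cells}
    processors = {frozenset(link) for link in links}
    return storage, processors

def shares(processor, cell):
    """True if the processor is directly connected to the storage module of the cell."""
    return cell in processor

def spawn(storage, processors, father_proc, father_cell, son_name, son_cell):
    """A father whose local variables are in father_cell creates a son in son_cell."""
    # C(son) requires the processor on the edge (father_cell, son_cell).
    son_proc = frozenset((father_cell, son_cell))
    if son_proc not in processors:
        raise ValueError("son cell is not adjacent to father cell")
    # Both processors reach the father's store, so the call is a plain local write.
    assert shares(father_proc, father_cell) and shares(son_proc, father_cell)
    storage[father_cell][son_proc].append(son_name)
    return son_proc

storage, processors = place_modules(LINKS)
P12 = frozenset((1, 2))                      # processes F, whose locals are in M2
P23 = spawn(storage, processors, P12, 2, "R", 3)
P24 = spawn(storage, processors, P12, 2, "D", 4)
```

In this sketch the store of cell 2 is shared by three processors, which is the degree of that vertex, and the creation of R and D amounts to two list appends in that store.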
A processor switches from one node to the other upon a procedure call in the same way as in a multiprogramming system a processor switches from one process to another upon a P-operation on a zero semaphore. We shall not describe the implementation in more detail.

Hence, if C(F) holds for a node F, C(R) and C(D) hold for the two son nodes of F. Observe that the above strategy is independent of the topologies of the implementation graph and of the computation tree.

The root node P of the computation tree is created by the "environment" of the computation. At least one cell of the implementation graph, a "root cell", is connected to the environment. It is easy to map P onto a root cell in such a way that C(P) holds.

Fig. 2. Locality areas.

5. THE CHOICE OF THE IMPLEMENTATION GRAPH

We look for a finite implementation graph such that
1) an arbitrary binary tree can be mapped onto it without knowing the sizes of the tree and of the graph,
2) the nodes of the tree are optimally spread over the cells of the graph.

Because of 1), we aim at "simulating" an infinite graph on a finite one. Let us assume that we could indeed construct an infinite implementation graph; which graph would we choose? Since we are looking for graphs that can be represented in the plane by regular and dense structures, we are bound to choose between the three regular tessellations of the plane, which are the square, the triangular, and the hexagonal tessellations.
(Although the infinite binary tree is regular, it is not dense, because it grows exponentially and therefore cannot be represented with minimal constant edge lengths.) We have chosen the square tessellation, although the hexagonal is also interesting.

We shall first discuss the problems of mapping a binary tree onto an infinite grid. We shall then simulate the infinite grid on a finite grid.

6. THE INFINITE GRID AS AN IMPLEMENTATION GRAPH

An infinite grid is a graph such that: for $i \ge 0$ and $j \ge 0$, vertex (i, j) is connected with vertex (i+1, j) and vertex (i, j+1).

The mapping of a binary tree on the grid is obvious. The root of the tree is mapped onto vertex (0, 0). If a node is mapped on vertex (i, j), then its right son R is mapped on vertex (i, j+1), and its down son D is mapped on vertex (i+1, j).

When an exponential structure (the binary tree) is mapped on a quadratic one (the grid) a congestion problem is created: vertex (i, j) of the grid may have to accommodate up to $\frac{(i+j)!}{i!\,j!}$ nodes of the binary tree simultaneously.

7. THE STRAIGHT TORUS

The problem now is to simulate the infinite grid on a finite one. For reasons of symmetry we choose a square grid of M*M cells. (We shall return to this choice later.) The first solution consists in connecting cell (x, y) of the finite grid ($0 \le x, y < M$) to the cells:

$((x+1) \bmod M,\; y)$ and $(x,\; (y+1) \bmod M)$.

This amounts to connecting with each other the corresponding elements of the first and last columns, and those of the first and last rows.
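The mapping of the tree onto the grid, the resulting congestion, and the straight-torus connection rule can be checked with a short sketch. The function names below (`grid_vertex`, `occupancy`, `torus_neighbours`) are hypothetical, and a node is represented, as an assumption made for this example, by the sequence of "R" and "D" calls leading to it from the root.

```python
from collections import Counter
from itertools import product
from math import comb

def grid_vertex(path):
    """Map a tree node, given as a sequence of 'R'/'D' calls from the root, to (i, j)."""
    # Each down call increments i, each right call increments j.
    return (path.count("D"), path.count("R"))

def occupancy(depth):
    """Count the nodes of the complete binary tree (up to depth) landing on each vertex."""
    counts = Counter()
    for d in range(depth + 1):
        for path in product("RD", repeat=d):
            counts[grid_vertex(path)] += 1
    return counts

def torus_neighbours(x, y, M):
    """Cells to which cell (x, y) is connected on the straight M*M torus."""
    return ((x + 1) % M, y), (x, (y + 1) % M)

counts = occupancy(6)
# Every vertex (i, j) receives (i+j)! / (i! j!) nodes of the complete tree.
congestion_ok = all(n == comb(i + j, i) for (i, j), n in counts.items())
```

For instance, `counts[(2, 3)]` equals `comb(5, 2)`, and on a 3*3 torus the last cell (2, 2) wraps around to (0, 2) and (2, 0).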
The volume obtained is topologically similar to a torus. Consider an arbitrary cell (i, j) of the infinite grid and the cell (x, y) of the finite grid on which it is mapped. According to the above connecting rule, we have:

$i = x + kM, \quad j = y + lM, \quad 0 \le x, y < M$   (R)

This relation describes the tiling of the infinite grid by square tiles of size M*M: if (i, j) are the coordinates of a cell of the infinite grid, then (x, y) are its coordinates in the tile (k, l) (see fig. 3).

The congestion problem can be solved in the following way. Consider the infinite grid. When a vertex is occupied by a node N of the computation tree, no other node is accepted by the vertex until N and the subtree attached to N have terminated their activity. It is easy to prove that this cannot lead to deadlock on the infinite grid.

But this solution cannot be used in a straightforward manner for the torus without danger of deadlock. Assume that a cell of the torus is occupied by the node N1, and a new node N2 is not accepted by the cell. It may occur that N2 belongs to the subtree of N1. This would be a deadlock.

For each node of the computation tree, it is recorded to which tile the node belongs. When a cell is occupied by a node N1, it may refuse a node N2 only if N2 belongs to the same tile as N1. (If two nodes belong to the same tile, it is impossible that one belongs to the subtree of the other.)

8. THE PROPAGATION PATTERN

Assume that all cells and all nodes in the cells have similar behaviours, and that the propagation speeds are similar in all directions even in the case of an asynchronous implementation.
Then we can say that in a phase of homogeneous expansion or contraction of the computation, there is a front wave of active nodes which are located at a maximum distance from the root, i.e. on a diagonal i + j = K of the infinite grid, which we shall call the "active diagonal".

At step K of the computation, the complete computation tree contains $2^{K-1}$ active nodes (the leaves). But at step K of the computation, at most K(K+1)/2 cells of the infinite grid can be active, and if the strategy for reducing congestion is applied, at most K: the cells of the active diagonal. As a consequence, the $2^{K-1}$ leaf nodes cannot be active simultaneously; their activities have to be sequentialized. The hypothesis of homogeneous expansion and contraction then does not strictly hold anymore because not all cells on a diagonal have to accommodate the same number of leaf nodes, and therefore the contraction of the computation will not start in all cells at the same time. But it is an acceptable approximation.

Fig. 3.

Fig. 3 shows that the active diagonal of the infinite grid is mapped on at most two diagonals of the finite grid, i.e. at most M cells out of the M*M are active. (Algebraically, for a given value of i + j, there are at most two values of x + y ($0 \le x + y \le 2M - 2$) fulfilling R, namely: $(i + j) \bmod M$ and, if $(i + j) \bmod M < M - 1$, $(i + j) \bmod M + M$.)

Hence, if the active diagonal approximation is correct, the straight torus topology leads to a poor distribution of the computation over the Processing Surface.

9. THE TWISTED TORUS

Obviously, the drawback of the straight torus is caused by the symmetry of the tiling of fig. 3 around the axis i = j. We can destroy the symmetry by shifting the tiling by one position and in one direction, as shown by fig. 4. Now we see that for the same active diagonal, more diagonals of the finite grid are occupied. In fact, it can be proved that the distribution of the active diagonal over the finite grid is now optimal: if the active diagonal contains no more than M*M nodes, no two nodes of the active diagonal are mapped on the same cell of the torus. This tiling corresponds to the tiling relation:

[equation missing in source]

Fig. 4.

10. THE DOUBLY TWISTED TORUS

The same result could have been reached by using a rectangular straight torus of M*P cells where M and P are relative primes. The difference is that in a twisted torus, a horizontal chain of nodes, i.e. the succession of nodes with constant i, is mapped on a cycle containing all cells of the torus, i.e. on a cycle of length M*M. On the rectangular torus, such a structure is mapped on a cycle of length M (one row of the torus). In both cases a vertical chain (constant j) is mapped on the cells of only one column. In view of certain degenerate binary trees, which reduce to a chain of only right or left procedure calls, it could be interesting to twist the torus in both directions in such a way that a vertical chain is also mapped on all cells of the torus.
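The distribution claims made for the straight and the twisted torus can be checked numerically with a small sketch. This is an illustration under assumptions, not the paper's implementation: the function names are hypothetical, and the twist is taken, arbitrarily, to shift the row index by +1 each time the column index wraps around (the opposite direction would serve equally well).

```python
def straight(i, j, M):
    """Cell of the straight M*M torus on which grid vertex (i, j) is mapped."""
    return (i % M, j % M)

def twisted(i, j, M):
    """Cell of a singly twisted M*M torus (assumed twist: +1 row per column wrap)."""
    return ((i + j // M) % M, j % M)

def cells_hit(mapping, K, M):
    """Number of distinct torus cells occupied by the diagonal i + j = K."""
    return len({mapping(K - j, j, M) for j in range(K + 1)})

M = 4
K = M * M - 1                                  # a diagonal with exactly M*M vertices
straight_cells = cells_hit(straight, K, M)     # bounded by M
twisted_cells = cells_hit(twisted, K, M)       # reaches M*M

# Chains: constant i (horizontal) and constant j (vertical) on the twisted torus.
horizontal = {twisted(0, j, M) for j in range(M * M)}
vertical = {twisted(i, 0, M) for i in range(M * M)}
```

With M = 4, the straight torus places the 16 vertices of the diagonal on only 4 cells, whereas the twisted mapping uses all 16. The horizontal chain visits every cell, while the vertical chain stays within a single column of M cells, which is the motivation for the second twist.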
To avoid reintroduction of the symmetry, the torus must be twisted in opposite directions in the two dimensions (e.g. +1 for the rows, and -1 for the columns). The fact that the corresponding tiling relation:

$i = x + kM - l, \quad j = y + lM + k$

has no solution for $(i, j) = (M(q+1) - p,\; pM + q)$ means that such a tiling does not represent a "plane" surface. We mean that if, on the infinite grid, point B is reached from point A by r horizontal steps and s vertical ones, B is also reached from A by any permutation of these steps. This is no longer true for this doubly twisted torus.

This is shown by the following counter-example. Consider the 3*3 doubly twisted torus of fig. 5(a). From point A, one horizontal step (indicated in fig. 5 by a dotted path), followed by one vertical step (indicated in fig. 5 by a dashed path) leads to point B (fig. 5(b)). From point A, one vertical step followed by one horizontal step leads to point C (fig. 5(c)). Points B and C are different.

Fig. 5. (a), (b), (c).

This drawback is only significant if one wants to implement computation graphs other than trees. In a tree, there is only one path between two points. If one wants to maintain the planarity of the torus, one must look for tessellations of the plane that are not square, and yet still use the double twist. Two are given in fig. 6. The first one is due to Carlo Sequin [2].

Fig. 6.

11. CONCLUSION

A method has been proposed to construct highly parallel and distributed systems where the basic hardware building blocks are whole processor and storage modules, and the basic software building block is the procedure. The main aspects of the method are the following.

First, on such a Processing Surface the location of variables relative to the processors using them is a relevant factor. Scope rules have therefore been introduced in the programming language, which allow the programmer to determine the "distance" between the variables or procedure parameters and the actions where they are used.

Second, since intense communications between adjacent modules are expected, we have attempted to smooth away the discontinuity in variable access caused by the boundary between storage modules. For this purpose, the access to distant variables has been limited to neighbour variables by a "locality rule". Furthermore the processor and storage modules have been arranged in such a way that no extraneous communication procedure is needed to "move" variable values over a storage module boundary. The result is called a continuous Processing Surface.

Third, by using a "boundary-less" topology for the surface (here, a torus), the automatic diffusion of a divide-and-conquer computation through the surface leads to an optimal spreading of the load over the modules. The programmer need not know the actual number of modules, and no complicated scheduling is required.

12. HISTORY AND ACKNOWLEDGEMENTS

The first torus machine was built at the beginning of 1979 at Philips Research Laboratories. It is a twisted torus of 36 cells. Each cell consists of two INTEL chips (one processor with a 1K byte ROM, and one 256 byte RAM). It is not a Processing Surface but a network of machines communicating by explicit message exchanges.

Acknowledgement is due to W.J. Lippmann and G.A. Slavenburg for their invaluable cooperation during the construction of this machine. Without their hardware and software competence, it would never have been completed within such a short term. The fact that it was completed within 3 months is also a consequence of the regularity of the structure.

Acknowledgement is also due to C.S. Scholten for several valuable comments on the first paper on the subject [3], and to Alan Davis whose comments on the manuscript led to many improvements.

REFERENCES

[1] A.J. Martin: "A Distributed Implementation Method for Parallel Programming." Proceedings IFIP Congress 80, October 1980.
[2] C.H. Sequin: "Doubly Twisted Torus Networks for VLSI Processor Arrays." Draft, December 3, 1980.
[3] A.J. Martin: "A Distributed Architecture for Parallel Recursive Computations." Philips, AJM18, September 1979.