The Torus: An Exercise in Constructing a Processing Surface
Alain J. Martin
Computer Science Department
California Institute of Technology
Proceedings of the Second Caltech Conference on VLSI, January 1981

Philips Research Laboratories
5600 MD Eindhoven
The Netherlands
Abstract. A "Processing Surface" is defined as a large, dense, and
regular arrangement of processor and storage modules on a two-dimensional
surface, e.g. a VLSI chip. A general method is described for distributing
parallel recursive computations over such a surface. Scope rules enforcing
the "locality" of variables and procedure parameters are introduced in the
programming language. These rules and a particular interconnection of the
modules on the surface make it possible to transmit parameter and variable
values between modules without using extraneous communication actions.
The choice of the Processing Surface topology for binary recursive
computations is discussed and a torus-like topology is chosen.
0. INTRODUCTION
Let us call a "Processing Surface" a large, dense, regular arrangement
of processor and storage modules on a two-dimensional surface, e.g. a VLSI
chip. How can a computation be distributed over such a surface? What are
the arrangements of the modules on the surface best suited for a certain
class of computations?
We propose to explore this problem in the following direction. In such
an environment, an action on a variable differs in complexity (in terms of
the number of elementary steps necessary to perform the action) depending
on the distance between the processor module performing the action and the
storage module containing the variable. We want to reflect this issue at
the programming level by introducing scope rules defining the distance
between the program component where a variable is declared, and the program
components where the variable can be used.
Since we expect intense communications between the program components,
we expect assignments of the form x:=y where x and y belong to two
adjacent components (this assignment can take the form of a procedure call
or a pair of matching communication actions) to occur as frequently as
assignments between variables of the same component. In most distributed
systems, the first type of assignment is an order of magnitude more complex
than the second one. We consider this hidden discrepancy between equivalent
actions unacceptable. We will show that it is possible to define some
locality rule for the program variables, and to organize the processor and
storage modules on the surface such that no discrepancy of this sort
appears. In such a case, the Processing Surface is said to be "continuous".
Furthermore, since for instance inverting a 2*2 matrix does not
require as much parallelism as inverting a 1000*1000 matrix, the potential
parallelism of an algorithm should not be fixed beforehand (e.g. by the
number of available processors) but should be determined dynamically
according to the needs of the computation. The component actions of a
computation should be created and destroyed as the computation proceeds, and
should be automatically distributed over the available modules.
1. THE GENERAL METHOD
The general method we use has been described in [1]. We shall recall it
briefly.
The component actions of a computation (the "nodes") are regarded as the
vertices of a graph (the "computation graph") which grows and shrinks during
the computation. An edge (a "channel") between two nodes means that one of
the two, say node A, has created the other, say node B, by a procedure call,
and that A and B communicate directly with each other. A is the "father" of
B, and B is a "son" of A. Thanks to a parallel procedure call, a father may
create several sons simultaneously. The father/son relation defines a
partial ordering of the nodes, and all nodes that are not relatively ordered
can be performed in parallel.
A computation graph grows and shrinks through a given finite
"implementation graph", whose vertices (the "cells") represent the
available modules, and the edges (the "links") the communication
possibilities between modules. Each node is mapped on a cell, and each
channel on a link.
Hence, each cell may have to accommodate an unbounded number of nodes.
Since a cell represents a very small number of sequential automata (in most
cases, one), the activities of all nodes simultaneously present in a cell
have to be sequentialized in some way. But such a sequentialization may
introduce deadlock. The main result of [1] is to prove that the nodes of a
cell can be interleaved without introducing deadlock provided that the grain
of interleaving be correctly chosen. The solution is very simple in that it
does not require any particular knowledge about the nodes or the
implementation graph nor complicated scheduling.
In this paper we shall consider a special class of computations,
namely recursive computations. For this class of computations we shall
describe how to implement a continuous Processing Surface, and we shall
propose a torus-like topology for the implementation graph.
2. RECURSION
Much has been said about the use of recursion for parallel programming.
The reader is referred to the abundant literature on this subject.
For the sake of simplicity, we shall restrict ourselves to one of the most
usual recursive methods, namely "divide-and-conquer" (also called "recursive
doubling"). Divide-and-conquer algorithms are particularly interesting in
that they produce binary trees as computation graphs. Binary trees are
regular structures and each node has an outdegree of two, which is
interesting in view of their mapping onto a two-dimensional surface.
Parallelism is introduced only by calling two procedures "in parallel".
The possibility of further increasing parallelism by pipelining the
parameters will not be mentioned although it can easily be added.
Nevertheless this class of algorithms is large enough (in particular
numerical algorithms) for the exercise to be realistic.
Since a node is created by a procedure call, a node is a procedure
instance with its own program counter, and its set of variables and
parameters. The following rules define the "locality" of variables and
parameters.
. The unit of locality is the node: a variable declared inside a node
is local to that node.
. A variable local to a node A is a neighbour for all son nodes of A.
Since the father/son relation between nodes is not transitive, the
locality or neighbourhood of a variable with respect to a node is not
transitive either: if a node P1 calls a node P2 which calls a node P3,
a variable local to P1 is neighbour for P2, but not for P3.
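The two rules above can be sketched in a few lines of Python. This is only an illustration: the class `Node` and the helpers `is_local` and `is_neighbour` are hypothetical names introduced here, not part of the programming language discussed in the paper.

```python
class Node:
    """A procedure instance; `father` is the node that created it (None for the root)."""

    def __init__(self, name, father=None):
        self.name = name
        self.father = father
        self.variables = set()  # names of the variables declared inside this node

    def declare(self, var):
        self.variables.add(var)


def is_local(var, node):
    # Rule 1: a variable declared inside a node is local to that node.
    return var in node.variables


def is_neighbour(var, node):
    # Rule 2: a variable local to a node is a neighbour for the sons of that node.
    # Only the direct father is inspected, so the relation is not transitive.
    return node.father is not None and var in node.father.variables


# P1 calls P2, which calls P3; v is declared in P1.
p1 = Node("P1")
p1.declare("v")
p2 = Node("P2", father=p1)
p3 = Node("P3", father=p2)
```

With these definitions `v` is local to `p1`, a neighbour for `p2`, and neither local nor neighbour for `p3`.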
Three types of parameters are used:
. An input parameter is used to "import" a parameter value into a son node,
by an assignment of the actual parameter value to the formal parameter
variable.
. An output parameter is used to "export" a value from a son node to its
father by an assignment of the formal parameter value to the actual
parameter variable.
. A reference parameter is used both to import and to export, but by a
process of substitution, or "aliasing": the formal parameter replaces the
actual parameter in the son node (it is another name for the same variable).
In the case of the input and the output parameters, the formal
parameter is local to the son.
In the case of the reference parameter, the formal parameter has the
same locality as the actual parameter. The formal parameter is thus
not local to the son.
Assume that the value x of a variable is to be imported from a father node
P1 into a son node P2. Either an input or a reference mechanism can be
used. Assume now that x is to be passed again from P2 to a son node P3.
If x was passed from P1 to P2 as an input parameter, x will be
local to P3 if it is passed as an input parameter from P2 to P3, and
neighbour to P3 if it is passed as a reference parameter from P2 to P3.
But if x was passed from P1 to P2 as a reference parameter, x will
neither be local nor neighbour to P3, whether it be passed as an input or as a
reference parameter from P2 to P3.
(In the case where a value is to be exported from a son node to a father node,
exactly the same differences hold according to whether it is passed as an
output or a reference parameter.)
Hence, the locality or neighbourhood of a reference parameter with
respect to a node is not transitive whereas that of an input or output
parameter is. But when a value x is passed as an input or an output
parameter from node P to node Q, by definition x is copied from the
storage area of P into the storage area of Q. No copying is necessary
when x is passed as a reference parameter.
The repetitive transport of values via global variables and reference
parameters could be used in its full generality, but we propose to restrict
its use by the following "locality rule".
Locality rule: An action of a node involves only variables and
parameters that are local and/or neighbour for the node.
(Whether global variables should be used at all is doubtful. They have been
included for the sake of completeness.) We shall see that this locality rule
permits the implementation of a continuous Processing Surface.
4. IMPLEMENTATION OF A CONTINUOUS PROCESSING SURFACE
Definition: A Processing Surface is said to be "continuous" when any
action performed on the surface involves only variables
that are directly accessible to the processor performing
the action, i.e. accessible by elementary read or write
operations.
Hence, if we succeed in implementing a continuous surface, we shall
have suppressed any form of extraneous communication action for accessing
variables.
According to the general method, we know that if node N1 is mapped
on cell C1, a son node N2 of N1 is mapped on a neighbour cell C2 of C1.
For node N1 to be mapped on C1 means that the local variables and
parameters of N1 must be allocated in the storage module associated with
C1, and the same for N2 relative to C2. Let M1 and M2 be the
storage modules associated with C1 and C2, respectively. According to
the locality rule, any action of N2 may involve variables located in
M1 and M2. The set (M1, M2) is called the "locality area" of N2.
In the case where the computation graph is a tree, the locality area of a
node consists of at most two elements.
As a direct consequence of the locality rule and of the definition of
a continuous Processing Surface, the Processing Surface is continuous if the
property C(N) holds for any node N.
C(N): any action of N is performed by a processor directly connected
to the two storage modules of the locality area of N.
We shall describe a strategy for placing the processor and storage
modules on the implementation graph, and for distributing the actions and
the variables of the nodes over the processor and storage modules, such that
C(N) holds for any node N. This strategy thus implements a continuous
Processing Surface.
1) The placing strategy is directly suggested by the property C(N).
A storage module is placed at each vertex, and a processor module at
each edge of the implementation graph. (See an example on fig. 1.) Hence,
each processor has direct access to two storage modules, and each storage
module is shared by as many processors as the degree of the vertex where it
is placed.
2) Assume that C(F) holds for a node F. For instance, F has been
created in cell 2 of fig. 1(a); its local variables are in M2, its
neighbour variables in M1, and its actions are processed by P12 (see
fig. 2).
Fig. 1. (a) implementation graph; (b) processor and storage placement.
Assume that at some stage in the computation of F two son nodes R and
D (for right and down) of F are to be created in cells 3 and 4,
respectively. The locality areas of R and D must then be (M2, M3) and
(M2, M4), respectively (see fig. 2).
This means that C(R) and C(D) will hold if and only if R and D are
processed by P23 and P24, respectively. Upon reaching the procedure
calls of R and D in the procedure body of F, P12 must transmit the
creation of R and D to P23 and P24.
Since, by construction, P12, P23, and P24 share a common store, namely
M2, the transmission of procedure calls is a simple and local action:
P12 adds the names of R and D to the lists (located in M2) of
nodes to be processed by P23 and P24, respectively.
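The placing strategy and the transmission of a procedure call through a shared store can be sketched as follows. This is a minimal illustration under stated assumptions: the edge list is a hypothetical fragment containing only cells 1 to 4 (the full graph of fig. 1 is not reproduced here), and `place_modules`, `sharers` and `transmit_call` are names invented for the sketch.

```python
from collections import deque


def place_modules(edges):
    """Put one storage module at each vertex and one processor at each edge."""
    storage = {}     # vertex -> name of its storage module
    processors = {}  # processor name -> the two storage modules it accesses directly
    for u, v in edges:
        for cell in (u, v):
            storage.setdefault(cell, f"M{cell}")
        processors[f"P{u}{v}"] = (storage[u], storage[v])
    return storage, processors


def sharers(module, processors):
    """Processors with direct access to `module` (one per incident edge)."""
    return sorted(p for p, mods in processors.items() if module in mods)


def transmit_call(caller, son, son_processor, processors, work_lists):
    """`caller` appends `son` to the list, kept in the shared module, of nodes
    to be processed by `son_processor`. Returns the shared module used."""
    shared = set(processors[caller]) & set(processors[son_processor])
    if not shared:
        raise ValueError("the two processors share no storage module")
    module = shared.pop()
    work_lists.setdefault(module, {}).setdefault(son_processor, deque()).append(son)
    return module


# Hypothetical fragment: cell 2 is linked to cells 1, 3 and 4.
edges = [(1, 2), (2, 3), (2, 4)]
storage, processors = place_modules(edges)
work_lists = {}
transmit_call("P12", "R", "P23", processors, work_lists)
transmit_call("P12", "D", "P24", processors, work_lists)
```

In this fragment M2 is shared by P12, P23 and P24, so both calls are recorded in M2 without any message exchange.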
A processor switches from one node to the other upon a procedure call
in the same way as in a multiprogramming system a processor switches from
one process to another upon a P-operation on a zero semaphore. We shall not
describe the implementation in more detail.
Hence, if C(F) holds for a node F, C(R) and C(D) hold for the
two son nodes of F. Observe that the above strategy is independent of the
topologies of the implementation graph and of the computation tree.
The root node P of the computation tree is created by the "environment"
of the computation. At least one cell of the implementation graph (a root
cell) is connected to the environment. It is easy to map P onto a root
cell in such a way that C(P) holds.
Fig. 2. Locality areas.
5. THE CHOICE OF THE IMPLEMENTATION GRAPH
We look for a finite implementation graph such that 1) an arbitrary
binary tree can be mapped onto it without knowing the sizes of the tree and
of the graph, 2) the nodes of the tree are optimally spread over the cells
of the graph.
Because of 1), we aim at "simulating" an infinite graph on a finite
one. Let us assume that we could indeed construct an infinite implementation
graph, which graph would we choose? Since we are looking for graphs that can
be represented in the plane by regular and dense structures, we are bound to
choose between the three regular tessellations of the plane, which are the
square, the triangular, and the hexagonal tessellations. (Although the
infinite binary tree is regular, it is not dense, because it grows
exponentially and therefore cannot be represented with minimal constant
edge lengths.)
We have chosen the square tessellation, although the hexagonal is also
interesting. We shall first discuss the problems of mapping a binary tree
onto an infinite grid. We shall then simulate the infinite grid on a finite
grid.
6. THE INFINITE GRID AS AN IMPLEMENTATION GRAPH
An infinite grid is a graph such that: for i ≥ 0 and j ≥ 0, vertex
(i, j) is connected with vertex (i+1, j) and vertex (i, j+1).
The mapping of a binary tree on the grid is obvious. The root of the
tree is mapped onto vertex (0, 0). If a node is mapped on vertex (i, j),
then its right son R is mapped on vertex (i, j+1), and its down son D
is mapped on vertex (i+1, j). When an exponential structure (the binary tree)
is mapped on a quadratic one (the grid) a congestion problem is created: vertex
(i, j) of the grid may have to accommodate up to (i+j)! / (i! j!) nodes of the
binary tree simultaneously.
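The size of the congestion can be checked numerically. Each node mapped on vertex (i, j) corresponds to a distinct sequence of i "down" calls and j "right" calls, so a complete binary tree places a binomial number of nodes on that vertex. The sketch below, with the hypothetical helper names `nodes_on_vertex` and `count_by_walking`, compares the closed form with a direct walk over the tree levels.

```python
from math import comb


def nodes_on_vertex(i, j):
    """Nodes of the complete binary tree mapped on grid vertex (i, j)."""
    return comb(i + j, i)


def count_by_walking(depth):
    """Count, level by level, how many tree nodes land on each grid vertex."""
    counts = {(0, 0): 1}
    frontier = {(0, 0): 1}
    for _ in range(depth):
        next_frontier = {}
        for (i, j), n in frontier.items():
            # Right son goes to (i, j+1), down son to (i+1, j).
            for vertex in ((i, j + 1), (i + 1, j)):
                next_frontier[vertex] = next_frontier.get(vertex, 0) + n
        counts.update(next_frontier)  # levels never share a vertex (i + j differs)
        frontier = next_frontier
    return counts
```

For example, six distinct nodes of the complete tree are mapped on vertex (2, 2).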
7. THE STRAIGHT TORUS
The problem now is to simulate the infinite grid on a finite one. For
reasons of symmetry we choose a square grid of M*M cells. (We shall return to
this choice later.) The first solution consists in connecting cell (x, y) of
the finite grid (0 ≤ x, y < M) to the cells:
(x, (y+1) mod M) and ((x+1) mod M, y).
This amounts to connecting with each other the corresponding elements of the
first and last columns, and those of the first and last rows. The volume obtained
is topologically similar to a torus.
Consider an arbitrary cell (i, j) of the infinite grid and the cell (x, y)
of the finite grid on which it is mapped. According to the above connecting rule,
we have:
R: i = x + k*M and j = y + l*M
This relation describes the tiling of the infinite grid by square tiles of size
M*M: if (i, j) are the coordinates of a cell of the infinite grid, then (x, y)
are its coordinates in the tile (k, l) (see fig. 3).
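Under the connecting rule of the straight torus, the cell and the tile of a grid vertex follow from integer division of each coordinate by M. The function name `straight_torus` is an assumption made for this sketch.

```python
def straight_torus(i, j, M):
    """Return the cell (x, y) and the tile (k, l) of infinite-grid vertex (i, j),
    so that i == x + k*M and j == y + l*M with 0 <= x, y < M."""
    k, x = divmod(i, M)
    l, y = divmod(j, M)
    return (x, y), (k, l)


def sons(cell, M):
    """Cells reached from `cell` by the right and down links of the straight torus."""
    x, y = cell
    return (x, (y + 1) % M), ((x + 1) % M, y)
```

The mapping is compatible with the links: the right and down sons of a vertex land on the cells that `sons` gives for the cell of the father.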
The congestion problem can be solved in the following way. Consider the
infinite grid. When a vertex is occupied by a node N of the computation tree,
no other node is accepted by the vertex until N and the subtree attached to N
have terminated their activity. It is easy to prove that this cannot lead to
deadlock on the infinite grid. But this solution cannot be used in a
straightforward manner for the torus without danger of deadlock. Assume that a
cell of the torus is occupied by the node N1, and a new node N2 is not accepted
by the cell. It may occur that N2 belongs to the subtree of N1. This would be
a deadlock. For each node of the computation tree, it is recorded to which tile
the node belongs. When a cell is occupied by a node N1, it may refuse a node
N2 only if N2 belongs to the same tile as N1. (If two nodes belong to the
same tile, it is impossible that one belongs to the subtree of the other.)
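The refusal rule can be illustrated on the straight torus. In this sketch a tree node is represented by its call path, a string of "R" (right) and "D" (down) characters; this encoding and the names `vertex`, `cell_and_tile`, `may_refuse` and `in_subtree` are assumptions made for the example.

```python
def vertex(path):
    """Grid vertex (i, j) of the node reached from the root by `path`."""
    return (path.count("D"), path.count("R"))


def cell_and_tile(path, M):
    i, j = vertex(path)
    return (i % M, j % M), (i // M, j // M)


def may_refuse(occupant, newcomer, M):
    """A cell occupied by `occupant` may refuse `newcomer` only if both
    nodes are mapped on that cell and belong to the same tile."""
    cell1, tile1 = cell_and_tile(occupant, M)
    cell2, tile2 = cell_and_tile(newcomer, M)
    return cell1 == cell2 and tile1 == tile2


def in_subtree(descendant, ancestor):
    """True if `descendant` is a proper descendant of `ancestor`."""
    return descendant != ancestor and descendant.startswith(ancestor)
```

Two nodes on the same cell and in the same tile sit on the same grid vertex, whereas a proper descendant always sits on a different vertex; a refused node is therefore never in the subtree of the occupant.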
8. THE PROPAGATION PATTERN
Assume that all cells and all nodes in the cells have similar behaviours,
and that the propagation speeds are similar in all directions even in the case of
an asynchronous implementation. Then we can say that in a phase of homogeneous
expansion or contraction of the computation, there is a front wave of active
nodes which are located at a maximum distance from the root, i.e. on a diagonal
i + j = K of the infinite grid, which we shall call the "active diagonal".
At step K of the computation, the complete computation tree contains
2^(K-1) active nodes (the leaves). But at step K of the computation, at
most K(K+1)/2 cells of the infinite grid can be active, and if the strategy
for reducing congestion is applied, at most K: the cells of the active
diagonal. As a consequence, the 2^(K-1) leaf nodes cannot be active
simultaneously; their activities have to be sequentialized. The hypothesis of
homogeneous expansion and contraction then does not strictly hold anymore
because not all cells on a diagonal have to accommodate the same number of leaf
nodes, and therefore the contraction of the computation will not start in all
cells at the same time. But it is an acceptable approximation.
Fig. 3 .
Fig. 3 shows that the active diagonal of the infinite grid is mapped on
at most two diagonals of the finite grid, i.e. at most M cells out of the
M*M are active.
(Algebraically, for a given value of i + j, there are at most two values of
x + y (0 ≤ x + y ≤ 2M - 2) fulfilling R, namely:
(i + j) mod M
and, if (i + j) mod M < M - 1:
(i + j) mod M + M.)
Hence, if the active diagonal approximation is correct, the straight torus
topology leads to a poor distribution of the computation over the Processing
Surface.
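The poor spreading on the straight torus is easy to observe numerically. The sketch below, with the hypothetical names `straight_cells_of_diagonal` and `diagonal_sums`, collects the cells hit by the active diagonal i + j = K and the admissible values of x + y.

```python
def straight_cells_of_diagonal(K, M):
    """Cells of the M*M straight torus hit by the active diagonal i + j = K."""
    return {(i % M, (K - i) % M) for i in range(K + 1)}


def diagonal_sums(K, M):
    """Possible values of x + y for the cells of the active diagonal i + j = K."""
    s = K % M
    return {s, s + M} if s < M - 1 else {s}
```

However long the diagonal becomes, it never occupies more than M of the M*M cells.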
9. THE TWISTED TORUS
Obviously, the drawback of the straight torus is caused by the symmetry
of the tiling of fig. 3 around the axis i = j. We can destroy the symmetry
by shifting the tiling by one position and in one direction, as shown by fig. 4.
Now we see that for the same active diagonal, more diagonals of the finite grid
are occupied. In fact, it can be proved that the distribution of the active
diagonal over the finite grid is now optimal: if the active diagonal contains no
more than M*M nodes, no two nodes of the active diagonal are mapped on the same
cell of the torus.
This tiling corresponds to the tiling relation:
Fig. 4.
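The optimality claim for the twisted torus can be checked on small cases. The sketch assumes one particular twist, i = x + k*M - l and j = y + l*M, i.e. the row index shifts by one each time a tile boundary is crossed in the j direction; the direction and sign of the shift are an assumption of this example (the opposite sign gives the same counts), and `twisted_cell` is a hypothetical name.

```python
def twisted_cell(i, j, M):
    """Cell (x, y) of grid vertex (i, j) on an M*M torus twisted in one direction.

    Assumed relation: i = x + k*M - l, j = y + l*M."""
    l, y = divmod(j, M)
    x = (i + l) % M  # one extra row shift per tile boundary crossed in j
    return (x, y)


def twisted_cells_of_diagonal(K, M):
    """Cells hit by the active diagonal i + j = K, one entry per node."""
    return [twisted_cell(i, K - i, M) for i in range(K + 1)]
```

Under this assumption, a diagonal of at most M*M nodes never maps two nodes on the same cell, a chain with constant i runs through all M*M cells, and a chain with constant j keeps the same y.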
10. THE DOUBLY TWISTED TORUS
The same result could have been reached by using a rectangular straight
torus of M*P cells where M and P are relatively prime.
The difference is that in a twisted torus, a horizontal chain of nodes, i.e.
the succession of nodes with constant i, is mapped on a cycle containing
all cells of the torus, i.e. on a cycle of length M*M. On the rectangular
torus, such a structure is mapped on a cycle of length M (one row of the
torus). In both cases a vertical chain (constant j) is mapped on the cells
of only one column.
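The rectangular alternative can be checked in the same way. The sketch assumes that i is reduced modulo M and j modulo P, which is one possible orientation of the rectangle; `rect_cells_of_diagonal` is a hypothetical name.

```python
from math import gcd


def rect_cells_of_diagonal(K, M, P):
    """Cells of an M*P straight torus hit by the diagonal i + j = K,
    one entry per node; M and P are required to be relatively prime."""
    if gcd(M, P) != 1:
        raise ValueError("M and P must be relatively prime")
    return [(i % M, (K - i) % P) for i in range(K + 1)]
```

By the Chinese remainder theorem the pair of residues identifies i modulo M*P, so a diagonal of at most M*P nodes occupies as many distinct cells as it has nodes.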
In view of certain degenerate binary trees, which reduce to a chain of only
right or left procedure calls, it could be interesting to twist the torus in
both directions in such a way that a vertical chain is also mapped on all
cells of the torus.
To avoid reintroduction of the symmetry, the torus must be twisted in
opposite directions in the two dimensions (e.g. +1 for the rows, and -1 for
the columns).
The fact that the corresponding tiling relation:
i = x + k*M - l
j = y + l*M + k
has no solution for (i, j) = (M(q+1) - p, pM + q) means that such
a tiling does not represent a "plane" surface. We mean that if, on the
infinite grid, point B is reached from point A by r horizontal steps
and s vertical ones, B is also reached from A by any permutation of
these steps. This is no longer true for this doubly twisted torus.
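The absence of a solution can be verified by exhaustive search over the M*M candidate cells. Eliminating l from i = x + k*M - l and j = y + l*M + k gives k*(M*M + 1) = M*(i - x) + (j - y), so for each cell (x, y) at most one pair (k, l) is possible. The function name `covering_tile` is an assumption of this sketch.

```python
def covering_tile(i, j, M):
    """Return (x, y, k, l) with i = x + k*M - l and j = y + l*M + k,
    0 <= x, y < M, or None if grid vertex (i, j) is covered by no tile."""
    for x in range(M):
        for y in range(M):
            # k must satisfy k * (M*M + 1) == M*(i - x) + (j - y).
            k, remainder = divmod(M * (i - x) + (j - y), M * M + 1)
            if remainder == 0:
                l = k * M - (i - x)  # then j - y == l*M + k holds as well
                return (x, y, k, l)
    return None
```

For M = 3, for instance, vertex (3, 0) (the case p = q = 0) is covered by no tile.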
This is shown by the following counter-example. Consider the 3*3 doubly
twisted torus of fig. 5(a). From point A, one horizontal step (indicated
in fig. 5 by a dotted path), followed by one vertical step (indicated in
fig. 5 by a dashed path) leads to point B (fig. 5(b)). From point A,
one vertical step followed by one horizontal step leads to point C
(fig. 5(c)). Points B and C are different.
Fig. 5.
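The counter-example can be replayed in code. The step functions below follow the relation i = x + k*M - l, j = y + l*M + k: wrapping in the j direction shifts the row index by +1, and wrapping in the i direction shifts the column index by -1. The starting point A = (2, 2) is chosen for this sketch and is not necessarily the point A of fig. 5; the names `right` and `down` are likewise hypothetical.

```python
def right(cell, M):
    """One horizontal step (j + 1) on the doubly twisted torus."""
    x, y = cell
    if y + 1 < M:
        return (x, y + 1)
    return ((x + 1) % M, 0)  # tile boundary crossed in j: row shifted by +1


def down(cell, M):
    """One vertical step (i + 1) on the doubly twisted torus."""
    x, y = cell
    if x + 1 < M:
        return (x + 1, y)
    return (0, (y - 1) % M)  # tile boundary crossed in i: column shifted by -1


M = 3
A = (2, 2)
B = down(right(A, M), M)  # horizontal step first, then vertical
C = right(down(A, M), M)  # vertical step first, then horizontal
```

Here B is (1, 0) and C is (0, 2), so the two orders of the same steps end on different cells; from a cell whose steps do not cross a tile boundary, such as (0, 0), both orders agree.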
This drawback is only significant if one wants to implement computation
graphs other than trees. In a tree, there is only one path between two
points. If one wants to maintain the planarity of the torus, one must look
for tessellations of the plane that are not square, and yet still use the
double twist. Two are given in fig. 6. The first one is due to Carlo Sequin
[2].
Fig. 6.
11. CONCLUSION
A method has been proposed to construct highly parallel and distributed
systems where the basic hardware building blocks are whole processor and
storage modules, and the basic software building block is the procedure.
The main aspects of the method are the following.
First, on such a Processing Surface the location of variables relative
to the processors using them is a relevant factor. Scope rules have therefore
been introduced in the programming language, which allow the programmer to
determine the "distance" between the variables or procedure parameters and
the actions where they are used.
Second, since intense communications between adjacent modules are
expected, we have attempted to smooth away the discontinuity in variable
access caused by the boundary between storage modules. For this purpose,
the access to distant variables has been limited to neighbour variables by
a "locality rule". Furthermore the processor and storage modules have been
arranged in such a way that no extraneous communication procedure is needed
to "move" variable values over a storage module boundary. The result is
called a continuous Processing Surface.
Third, by using a "boundary-less" topology for the surface (here, a
torus), the automatic diffusion of a divide-and-conquer computation through
the surface leads to an optimal spreading of the load over the modules.
The programmer need not know the actual number of modules, and no
complicated scheduling is required.
12. HISTORY AND ACKNOWLEDGEMENTS
The first torus machine was built at the beginning of 1979 at
Philips Research Laboratories. It is a twisted torus of 36 cells. Each
cell consists of two INTEL chips (one processor with a 1K byte ROM, and
one 256 byte RAM). It is not a Processing Surface but a network of machines
communicating by explicit message exchanges.
Acknowledgement is due to W.J. Lippmann and G.A. Slavenburg for their
invaluable cooperation during the construction of this machine. Without their
hardware and software competence, it would never have been completed within
such a short term. The fact that it was completed within 3 months is also
a consequence of the regularity of the structure. Acknowledgement is also
due to C.S. Scholten for several valuable comments on the first paper on
the subject [3], and to Alan Davis whose comments on the manuscript led
to many improvements.
REFERENCES
[1] A.J. Martin: "A Distributed Implementation Method for Parallel
Programming." Proceedings IFIP Congress 80, October 1980.
[2] C.H. Sequin: "Doubly Twisted Torus Networks for VLSI Processor
Arrays." Draft, December 3, 1980.
[3] A.J. Martin: "A Distributed Architecture for Parallel Recursive
Computations." Philips, AJM18, September 1979.