IEEE TRANSACTIONS ON COMPUTERS, VOL. C-26, NO. 5, MAY 1977

The Indirect Binary n-Cube Microprocessor Array

MARSHALL C. PEASE, III, FELLOW, IEEE

Abstract--This paper explores the possibility of using a large-scale array of microprocessors as a computational facility for the execution of massive numerical computations with a high degree of parallelism. By microprocessor we mean a processor realized on one or a few semiconductor chips that include arithmetic and logical facilities and some memory. The current state of LSI technology makes this approach a feasible and attractive candidate for use in a macrocomputer facility.

The array studied is called the indirect binary n-cube array. It uses a network of switching nodes to obtain a high degree of flexibility of the connections among the microprocessors without placing excessive demands on the scarce resource of the pins on a microprocessor.

A number of computational procedures are discussed. Detailed attention is given to the communication requirements for algorithms used in the solution of partial differential equations in two or three dimensions. The use of the array for the radix-2 fast Fourier transform (FFT) and related signal processing applications is considered. Its use for matrix operations is also discussed, with particular attention given to matrix multiplication as the most basic of such operations.

The final conclusion is that the indirect binary n-cube array is useful for a very wide range of important problems, and thus is an attractive candidate for the design of a large-scale array of microprocessors.

Index Terms--Admissible maps, array processor, fast Fourier transform, grid computations, microprocessor array, n-cube array, parallel matrix multiplication, parallel processing, permutation network, switching network, triangular permutations, virtual array.

I. INTRODUCTION

THIS study explores possible organizations of a computational resource for a variety of numerical applications. The premise is that current LSI technology would allow a computational facility to be built around a large-scale array of microprocessors. By a microprocessor we mean a processor realized by one or a few semiconductor chips that include general arithmetic and logical facilities and some memory. We anticipate that individual microprocessors would use a technology with intermediate values of gate speed and gate density to keep costs low. Therefore, the individual microprocessor circuitry is likely to be of only moderate speed. Total processing speed and throughput for the entire system would be attained through parallelism. We consider using as many as 2^14 = 16 384 microprocessors to obtain a very high degree of parallelism.

A key problem in designing a large-scale array of microprocessors is the structure of the array that provides the interprocessor communications. With the use of a microprocessor on one or a few chips, the number of available pins is severely limited. As the size of the network increases, the potential for serious difficulty grows because of limitations of available connections.

The size of the network being considered here makes the concept substantially different from that used in current practice, although the theory of computational networks has been investigated by many authors [1]-[5]. Illiac IV [6] is the principal example of an actual implementation of a computing facility as an array of processors. Early designs assumed up to 2^10 processing elements (PE's), but the actual machine uses a relatively small number, 64, of quite powerful elements. We do not regard this as a large array. While the possible advantages of large-scale arrays for certain computational problems have long been recognized, the advent of LSI technology appears to make the possibility practical.

In this paper we describe a particular type of array, the "indirect binary n-cube array." It is called indirect since the array is not actually connected according to the topology of the binary n-cube, which would fail badly to meet the pin-limitation constraint, but its design permits it to behave as if it used a binary n-cube connection.

The key to the array is the switching network that is used to provide the interconnections among the microprocessors. The network is specified in a way that is applicable to any array whose size is an arbitrary power of 2. The intent here is to show that there is a good deal of flexibility in implementing the switching network using LSI technology, and therefore considerable freedom to adapt to whatever technology might be indicated by a detailed tradeoff analysis.

Manuscript received February 13, 1975; revised June 3, 1976. This work was supported by the National Science Foundation under Grant GJ-42696.

The author is with the Information Science Laboratory of the Information Science and Engineering Division, Stanford Research Institute, Menlo Park, CA 94025.
Returning to the basic, unpartitioned design, we next
consider how a desired communication pattern among the
microprocessors could be set up. A number of general re-
Fig. 1. The indirect binary 4-cube array.

Fig. 2. The switch node. (a) Direct connection. (b) Crossed connection.
sults are proven. These results are illustrated by the connection patterns that appear to be of principal value for
the applications and algorithms of greatest interest.
Next we consider the use of the array in algorithms and
applications that involve computations based on nearest
neighbors on a grid. The solution of partial differential
equations in two or three dimensions is the principal application that is considered. It is shown that the communication paths required for a variety of grids can be implemented directly by appropriate settings of the switching
network.
A suggested control system for the array is considered
next. We propose a two-level system using a small set of
global commands which can be broadcast to all microprocessors and to a set of what we call switch controllers.
The global commands are then locally interpreted into a
sequence of rewritable microinstructions. The effect of the
microinstructions at the switch controllers can be used to
construct a virtual array at the level of the global commands.
Finally, two other application areas are considered. The
first area includes the fast Fourier transform (FFT) and
other computations that depend on the FFT or related
transforms. The second area includes matrix operations.
A particular algorithm for matrix multiplication is given
in some detail in order to demonstrate the applicability of
the array to matrix operations.
II. THE INDIRECT BINARY n-CUBE ARRAY
The basic form of the indirect binary n-cube array is
illustrated in Fig. 1 for n = 4, N = 2^4 = 16. The circles in
Fig. 1 represent the microprocessors, indexed from 0 to (2^n - 1) as indicated by the numbers in the circles. The main
part of Fig. 1 is the switching network. The lines on the
right from the switching network connect back to the microprocessors with the indices given in parentheses.
The switching network is conceived as being constructed
from switch nodes indicated by the squares in Fig. 1. Each
switch node has two input lines, two output lines, and can
be put into either of the two states indicated in Fig. 2,
providing a "direct" or a "crossed" connection. In Fig. 1
there are n (in this case, four) levels of switching nodes
labeled S1, S2, ..., Sn.
The switching network can be recognized as similar to
the type studied by Clos [8], Benes [9], and more recently
by Joel [10], Opferman and Tsao-Wu [11], and Ramanujam [12]. It is very closely related to the "omega network"
described by Lawrie [7].1 The topologies of the two networks are identical. Fig. 1 can be rearranged, relabeling the
microprocessors to make it look like an omega network.
There are two principal differences between Fig. 1 and
an omega network. First, we do not include in the switch
nodes the capability for what Lawrie calls the broadcast
state, in which one of the input lines is connected to both
outputs and the other input line is unused. We have not
found the need for such states since we can obtain a similar
effect by retaining a copy of the data in the microprocessors while transmitting that data through the network.
Including broadcast states doubles the control requirement
or the number of pins on a switching module that must be
dedicated to control. Further study may demonstrate the
desirability of including broadcast states, but this is currently an open question.
The second difference is in the use of the network.
Lawrie suggests using omega networks to connect a set of
memories to a set of arithmetic logic units (ALU's) and vice
versa. While he does briefly consider a possible variation
in which the memories might be bypassed for intermediate
results, he gives little attention to the possibilities of
multiple passes through the network. The organization
considered here, on the other hand, considers microprocessors that combine memory, arithmetic, and logic. The
network is used to permute data among microprocessors.
In this arrangement it becomes natural to consider multiple passes through the network to obtain permutations
of the data that are otherwise unrealizable. Multiple passes will, of course, entail a sacrifice of speed, but give added flexibility.
The configuration of Fig. 1 is intended to implement a
set of connections that can be described by the set of edges
of the binary 4-cube. Let p be the index of a microprocessor. Let p be expressed in binary notation as (pn, pn-1, ..., p1), where

p = p1 + 2 p2 + ... + 2^(n-1) pn    (1)

and pi = 0 or 1.
If we regard pi as the coordinate of p in the ith direction,
the entire set can be regarded as mapped onto the vertices
of the binary n-cube as in Fig. 3 for n = 4.
In Fig. 1, if the switch nodes in set S1 are put in the
crossed state and all others left direct, data are interchanged between the (0,1), (2,3), (4,5), ... pairs of microprocessors or along the horizontal edges of the 4-cube in
Fig. 3. If the switch nodes in set S2 are put in the crossed
1 More precisely, it is similar to the inverse omega network without
renumbering the microprocessors, and to the omega network itself with
renumbering.
state and all others set direct, the pairs are (0,2), (1,3), (4,6),
(5,7), ... or along the short diagonal edges of Fig. 3. Similarly, the set S3 generates interchanges along the vertical
edges of Fig. 3, and the set S4 along the long diagonal edges.
The array of Fig. 1 can be regarded as implementing the
connections of the 4-cube.
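These level-by-level interchanges are easy to check with a short simulation. The Python sketch below is ours, not part of the paper; it tracks each datum's line index through the n levels, with a crossed node in S_k simply complementing bit l_k of the line index.

```python
def unit_transfer(n, crossed):
    """Simulate one pass through the indirect binary n-cube network.

    crossed(k, line) -> True if the node in level S_k that carries the
    given line is set to the crossed state.  Returns dest, where dest[p]
    is the index of the microprocessor that receives the datum
    originating at microprocessor p.
    """
    dest = []
    for p in range(1 << n):
        line = p
        for k in range(1, n + 1):          # levels S_1, ..., S_n in order
            if crossed(k, line):
                line ^= 1 << (k - 1)       # a node in S_k changes only l_k
        dest.append(line)
    return dest

# Crossing every node in S_2, with all other levels direct, interchanges
# data between the pairs (0,2), (1,3), (4,6), (5,7), ...
perm = unit_transfer(4, lambda k, line: k == 2)
assert perm == [p ^ 2 for p in range(16)]
```

Setting all nodes of a single level crossed thus reproduces the interchange along one dimension of the 4-cube described above.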
While considering the array as developed from the binary n-cube, it is important to recognize that the n-cube
permutations are only a very small subset of the permutations that can be executed by the switching network. It
is assumed that each switch node is independently controllable. The switching network of Fig. 1 has, for example, 2^32 possible states, none of which duplicates any other.
Later we identify and describe the full set of possible
states.
To develop a general specification of the indirect binary
n-cube array for arbitrary n, or for 2^n microprocessors, we
need to identify the various lines in the network. Let the
lines connected to the microprocessors be indexed from 0
to (2n - 1), monotonically increasing with the indices of
the microprocessors. Let this indexing of the lines be carried through the network, assuming all switch nodes are
set to the direct connection. That is, if lines i and j are incident on a switch node on the left, the lines incident on the
right are also indexed as i and j, with i connected to i and
j to j when the node is set to the direct state. As an example, Fig. 4 shows the indexing of the lines in the switching
network of Fig. 1.
Let l be the index of a line, and let it be expressed in binary notation:

l = l1 + 2 l2 + ... + 2^(n-1) ln    (2)

where li = 0 or 1, and the total number of microprocessors is 2^n. Consider a switch node in the kth level, Sk, 1 ≤ k ≤ n. The indices of the lines incident on the switch node on either side differ only in lk. The upper line in the circuit as drawn in Fig. 1 is the line with lk = 0, the lower with lk = 1. Since there are n levels, each binary coefficient, lk, is affected at some point, and can be changed by setting some switch node to the crossed connection. Therefore, any microprocessor can be connected to any other microprocessor.
The order in which the ith binary components of the line
indices, 1, are affected - or the order of the levels - is
arbitrary. The levels can be said to be commutative. With
any other ordering of the levels the indexing of the microprocessors can be modified to correspond and the diagram of the array redrawn accordingly. The result will be
a diagram that looks like Fig. 1, except that the microprocessors will be labeled differently. The switching network of Fig. 1, or that of the arrays of other sizes, is unique
within the different labelings of the microprocessors and
lines.
If there are N = 2^n microprocessors, arrays of the type illustrated in Fig. 1 require
Fig. 3. Map of 2^4 indices on binary 4-cube.
Fig. 4. Indexed lines, indirect binary 4-cube array.
Nsw = n 2^(n-1) = nN/2

switch nodes arranged in n levels. While this number is large if N is large, the ratio of switch nodes to microprocessors is n/2. The ratio does not seem excessive.

What is perhaps more disturbing is that the number of levels is n. Whenever data are passed through the switching network, they must pass through n switch nodes. If n is large (for example, 14 if N = 16 384), there may be a significant delay. Some variations of the basic array, such as the four-partitioned form that is described in the following subsection, are of interest just because they substantially reduce the number of levels, although at the price of limiting what can be done with a single pass through the network. The alternative approach is to accept the single-pass delay, while concentrating on developing the data structures and algorithms that minimize the number of passes. Currently, the latter approach seems the most useful.

The particular representation of the switching network in Fig. 1 emphasizes the structure's regularity. We observe, for example, that the size of an array can be doubled by replicating the switching network and microprocessors and then adding a single new level of switch nodes. More gen-
Fig. 5. 2-partitioned array of 16 microprocessors.
B. Variations of the Array Structure

There are a number of potentially useful variations of the basic configuration illustrated in Fig. 1. Their usefulness would depend on the application of the array and the technological tradeoffs involved in its design. We will briefly describe and discuss two variations that illustrate some of the possibilities.

Two-Partitioned Form: The first variation partitions the whole set of microprocessors into two equal sets which we will call the A-set and the B-set of microprocessors. The partitioning is done on the basis of the index parity, defined as follows. Let the microprocessors be indexed from 0 to (2^n - 1), as before. Let the index, p, of a microprocessor be expressed in binary notation as in (1). Its index parity is even if the number of nonzero pi in (1) is even, and is odd if the number of nonzero pi is odd.

A given microprocessor is assigned to the A-set if its index parity is even, and to the B-set if odd. The microprocessors within each set are ordered monotonically by their indices. Two switching networks are used, one to connect the A-set to the B-set, the other to connect the B-set to the A-set. The switching networks have the configuration illustrated in Fig. 1.

Fig. 5 shows the array for 16 microprocessors, except that we have not included the switching network connecting the B-set to the A-set, which is the mirror image of the network shown.

The two-partitioned form requires fewer switch nodes than the basic, unpartitioned form, and it reduces by one the number of switching levels between microprocessors. The saving, however, is comparatively small.

A more significant effect derives from the partitioning itself. If the applications are algorithms that imply a two-fold partitioning of the data, the partitioning of the microprocessors may be useful. For example, in the radix-2 FFT the pairs of data that enter the calculations in any stage always include one datum whose index parity is even, and one that is odd. The two-partitioned array can be used to exploit this fact, providing the computation and data transfer speeds are adjusted appropriately.

Determining whether or not the two-partitioned array is advantageous under any given circumstances requires detailed design analysis. Here we only wish to acknowledge the possibility of such an array and to identify it as a variation that should be considered.

Four-Partitioned Form: If we continue the partitioning to a greater degree, some of the parameters of the array change significantly. In Fig. 6, an array of 16 microprocessors is again developed, now as a four-partitioned array. In Fig. 6, only half the switching networks are shown. In addition, there will be another identical set for transmitting data in the opposite direction. The dominant effect is to sharply reduce the number of levels in the switching networks, from four in Fig. 1 to one in Fig. 6.

We must now associate a nonunit fan-out and fan-in with each microprocessor. This can be done by including switching within the microprocessors if the required pins are available. Alternatively, we could add switching externally to the microprocessors. In this case, the added switches are tree switches, not the switch nodes of Fig. 2, so that Fig. 6 is not simply a rearrangement of Fig. 5. What is involved is a change in the character of a part of the total switching requirement.

If we include the tree switching as a switching level, the apparent advantage with respect to the number of switching levels of Fig. 6 over Fig. 5 disappears. However, in larger networks, a substantial advantage remains, as is discussed shortly.

The array of Fig. 6 is most easily developed from that of Fig. 5. The A and B sets of microprocessors are each divided into two sets, A1 and A2 or B1 and B2, by dividing the components, pi, of the binary representation of the index into two parts. In the array of Fig. 6, the two parts are (p1,p2) and (p3,p4), where the index is expanded as in (1). We consider the parities of these two parts separately. A microprocessor in the A set is put into A1 if the parities of both parts are even, and into A2 if they are both odd. The B set is divided into B1 and B2, similarly.

erally, the high degree of regularity suggests that networks of large size can be built out of a suitable set of identical subassemblies of switch nodes. The advantage for the construction and maintenance of the array is evident.
Fig. 6. 4-partitioned array of 16 microprocessors.
Developing the array on these sets, or on the analogous sets if a larger fan-out is used, we find that the number of switch nodes required for the array is

Nsw = (n - w) 2^(n-1)    (3)

where w is the fan-out of a microprocessor. The number of switch nodes is reduced by increasing w. However, the reduction is not substantial until w approaches n.

The effect on the number of levels of switching is more significant. With a fan-out and fan-in of w, the maximum number of levels is ⌈(n/w) - 1⌉, or the least integer not less than (n/w) - 1. Even if we add the levels of the tree switching necessary to obtain the fan-out and fan-in, the improvement is substantial if n is large compared to w.

A further variation might be considered in which there is only a single, one-directional network between partitions. An example would be similar to Fig. 6, but with all flows in one direction. This variation requires only half the number of switch nodes of the four-partitioned form, and eliminates the need for fan-in or fan-out. This form has the disadvantage that pairs of microprocessors can be quite far apart in the graph-theoretic sense, so that communication speed may be sharply reduced.

Again, we cannot conclude whether or not these variations are advantageous without a detailed analysis based on the expected use of the array and on the capabilities of the technology to be used. They are included primarily to indicate the possibilities.

C. Switching Implementation

The switching networks might be implemented in several ways. At one extreme, an LSI chip might implement a single switch node in one direction to whatever bit width is practical in view of the pin-limitation of the chip technology used. At the other extreme, the entire switching network, or as much of it as possible, could be put on a single chip with a width of one bit. In either case, chips would be used in parallel to obtain the word or byte size desired for the transfer operations. There are also a variety of intermediate possibilities, incorporating different numbers of switch nodes with different bit widths.

The choice among the various possibilities is a function of several tradeoffs. Putting many switch nodes on a single chip reduces the number of chip-to-chip connections. Since the delay between chips is typically several times the gate delay on a single chip, reducing the number of chip-to-chip transfers is advantageous. On the other hand, the control of each switch node requires a bit, so including many switch nodes on a single chip also requires a corresponding number of control bits. If the control is external to the chip and not microprogrammed within the chip, the need for control is an added demand on the scarce resource of pins.

Determining the best implementation of the switching networks will require detailed study of the tradeoffs in the context of the available technologies and the anticipated uses of the array. The significant point is that a number of options exist, so that there is freedom within which an optimized design can be sought.

III. ADMISSIBLE MAPS

In this section, we consider what permutations of data can be obtained by a single pass through the switching network of Fig. 1, and how other permutations can be achieved. Lawrie [7] has proven a number of theorems about this problem, which can be extended in several important ways.

The discussion in this section is limited to the unpartitioned form illustrated in Fig. 1. The variations discussed in Section II-B and illustrated in Figs. 5 and 6 would require modification of the methods discussed here. For example, since a microprocessor in one subset cannot communicate directly with any other microprocessor in the same subset, the way the data are mapped onto the microprocessors becomes important. This, then, adds a new factor to the discussion of what can be done with a given switching network. We will not here consider the effect of such added factors.

Let P be a nonsingular mapping of the set of microprocessors onto itself. Alternatively, P may be regarded as describing the result of a permutation of data which are distributed over the set of microprocessors, one datum to each microprocessor. If the microprocessor that initially
contains a given datum has the index x, then after the
permutation the microprocessor that contains that datum
has the index P(x). In other words, we describe the permutation as a map of addresses rather than a map of the
actual data.
We say that P is admissible if it can be obtained by a
single unit transfer. By a unit transfer we mean a single
pass through the switching network. As indicated earlier,
there may be a large number of switching levels that are
passed through in a single unit transfer. Therefore, the
delay caused by a unit transfer can be significant. However, for the type of array we are considering, a unit transfer is the primitive transfer operation. It is in this sense that we call it a unit transfer.
Let x be the index of a datum before transfer, and y
after. Then P can be specified as a function mapping x
onto y for all integers in the range 0 ≤ x ≤ 2^n - 1.

Let x and y be expanded in binary notation as in (1), as (x1, x2, ..., xn) and (y1, y2, ..., yn), with x1 and y1 being the least significant bits, xn and yn the most significant. The function describing P can be written as a set of functions

yi = Pi(x1, x2, ..., xn).    (4)
The principal theorem, which is a reformulation of Lawrie's Theorem 2, is the following.

Theorem 1: P is admissible if and only if the functions (4) defining P can be written in the form

yi = xi + fi(y1, ..., yi-1, xi+1, ..., xn)    (5)

for 1 ≤ i ≤ n. The operations used are those of modulo 2 arithmetic.

Consider a switch node in Si. This node switches a pair of lines whose indices differ only in the ith components of their binary representation. The more significant components of the index have not been modified by any switch node in any Sk, k < i. Hence, the more significant components must be the same as those of x, or xi+1, ..., xn. The less significant components must be the same as those of y, or y1, y2, ..., yi-1. The form of fi must be as given.

The statement that yi must be (xi + fi), rather than the more general expression (gi xi + fi), reflects the fact that if one input line to a switch node is switched, the other is also. In fact, given the dependencies of fi as indicated, the initial term must be xi rather than gi xi for the map to be nonsingular.

The function yi describes the effect of the set Si of switch nodes. Individual switch nodes within Si are identified by the variables (y1, y2, ..., yi-1, xi+1, ..., xn). A switch node is set direct if, for the corresponding values of the variables, fi = 0. It is set to the crossed state if fi = 1. The equations given in (5) can be interpreted directly as a specification of the setting of the switching network.

As an example, consider the permutation that describes a unit shift, connecting microprocessor 0 to 1, 1 to 2, ..., (2^n - 1) to 0. The functions describing this map would normally be written

y1 = x1 + 1
y2 = x2 + x1
y3 = x3 + x2 x1
y4 = x4 + x3 x2 x1
...    (6)

to describe the effect of the carry operations. These functions can also be written

y1 = x1 + 1
y2 = x2 + (1 + y1)
y3 = x3 + (1 + y2)(1 + y1)
y4 = x4 + (1 + y3)(1 + y2)(1 + y1)
...    (7)

as can easily be proven by induction. Equation (7) is in the form required by the theorem. Hence, it is an admissible permutation.

As a generalization of this map, which is used later, consider a unit shift on some subset of the components. Let the subset include the components with subscripts t1, t2, ..., tm, with

1 ≤ t1 < t2 < ... < tm ≤ n.

The map analogous to (7) is

yt1 = xt1 + 1
yt2 = xt2 + (1 + yt1)
...
yt(j+1) = xt(j+1) + (1 + ytj)(1 + yt(j-1)) ... (1 + yt1)
...    (8)

which is admissible.

It is also possible to combine shifts on disjoint subsets of the components. For example, if there are n components, the map which induces a unit shift on the m most significant bits (m < n) and a simultaneous unit shift on the remaining bits is admissible.

The form of (7) and (8) suggests the following.

Definition: An admissible permutation, P, is lower triangular if the functions (5) defining P can be written as

y1 = x1 + c
yi = xi + fi(y1, y2, ..., yi-1)    (9)

where 2 ≤ i ≤ n, c = 0 or 1.

We now have the following theorem.

Theorem 2: The set of admissible lower triangular permutations is a group under composition of maps.
By composition of maps, we mean the successive application of mapping functions, or the equivalent of successive passes through the network. The theorem implies that the result of any sequence of passes through the network can be obtained by a single pass, providing all the individual permutations involved are of the lower triangular form.
To prove the theorem, we note first that the composition of maps is associative.

Second, the identity map exists in the set. It is the permutation for which all fi are identically zero.

Third, the set has the group property. To see this, let

yi = xi + fi(y1, y2, ..., yi-1)

and

zi = yi + gi(z1, z2, ..., zi-1)

for all i. Then

zi = xi + fi(y1, y2, ..., yi-1) + gi(z1, z2, ..., zi-1).

However, for a given i we can replace the variables in fi by

yk = zk + gk(z1, z2, ..., zk-1)

where 1 ≤ k ≤ (i - 1). Therefore, we can write

fi + gi = hi(z1, z2, ..., zi-1),

and the set has the group property.

Finally, the inverse exists. For a given i, we can set gi = fi after substituting for the yk, k < i, in fi.

Since these four properties are those required of a group, the entire set is a group. The theorem is proven.

Since (7), describing the unit shift permutation, is of the lower triangular form, it is a member of the group. Hence, any power of this permutation is also. A shift by any fixed distance forward or backward is therefore an admissible permutation, realizable by a single pass through the network.

The same argument applies to (8), describing a unit shift on any subset of the components of the indices. A shift by any fixed amount on the subset is realizable by a single pass through the network.

The statement following (8) about the admissibility of maps which simultaneously shift on disjoint subsets of the components is also a consequence of Theorem 2.

In an analogous way, we make the following definition.

Definition: An admissible permutation, P, is upper triangular if the functions (5) defining P can be written as

yi = xi + fi(xi+1, ..., xn),  1 ≤ i ≤ n - 1
yn = xn + c.    (10)

The following theorem then is true.

Theorem 3: The set of admissible upper triangular permutations is a group under composition of maps.

The proof is similar to that of Theorem 2.

The problem now arises as to how nonadmissible maps can be realized. An example of considerable importance concerns permutations which are sets of independent pairwise interchanges of the components of the index. The transposition of a matrix is of this form. So also is the digit-reversal map used in connection with the FFT.

Suppose the map that is required includes

yi = xj  and  yj = xi    (11)

with i > j. This is not an admissible map since it is not of the form of (5).

Define the two maps

uj = xj
ui = xi + xj    (12)

and

yj = uj + ui
yi = ui + yj    (13)

both of which are of the form required by Theorem 1. Then

yj = uj + ui = xj + (xi + xj) = xi
yi = ui + yj = (xi + xj) + xi = xj

so that we have obtained the desired map (11).

It follows that any map that is composed of a set of pairwise interchanges of the components of the index can be realized with just two passes through the switching network.
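The two passes (12) and (13) can be traced numerically. In the sketch below (our encoding: an index is an integer, and bit positions are 1-based as in the text), the composition of the two admissible maps reproduces the interchange (11) of index bits i and j:

```python
def pass1(x, i, j):
    """(12): u_j = x_j and u_i = x_i + x_j (mod 2); other bits unchanged."""
    xj = (x >> (j - 1)) & 1
    return x ^ (xj << (i - 1))             # flips bit i exactly when x_j = 1

def pass2(u, i, j):
    """(13): y_j = u_j + u_i, then y_i = u_i + y_j (mod 2)."""
    ui = (u >> (i - 1)) & 1
    uj = (u >> (j - 1)) & 1
    yj = uj ^ ui
    yi = ui ^ yj
    mask = (1 << (i - 1)) | (1 << (j - 1))
    return (u & ~mask) | (yi << (i - 1)) | (yj << (j - 1))

# the composition of the two admissible passes interchanges bits 3 and 1
swap31 = lambda x: pass2(pass1(x, 3, 1), 3, 1)
assert [swap31(x) for x in range(8)] == [0, 4, 2, 6, 1, 5, 3, 7]
```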
More generally, consider any map that is linear in the components of the index. Define the vectors

x = (xn, xn-1, ..., x1)^T,  y = (yn, yn-1, ..., y1)^T.    (14)

Linearity means that there exists an n x n nonsingular binary matrix P, such that

y = Px.    (15)

The coefficients of the matrix and the vectors are in the field of characteristic 2 so that modulo 2 arithmetic is to be used here.
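As a small illustration of (14) and (15) (the code and conventions below are ours), the digit-reversal map mentioned earlier is linear: its matrix is the anti-diagonal identity, which is nonsingular over the field of characteristic 2.

```python
def apply_mod2(P, x):
    """y = P x over GF(2); P is an n x n 0/1 matrix (list of rows) and
    x a 0/1 vector ordered (xn, ..., x1) as in (14)."""
    return [sum(a * b for a, b in zip(row, x)) % 2 for row in P]

# digit reversal sends (x3, x2, x1) to (x1, x2, x3): the anti-diagonal
# identity matrix
P = [[0, 0, 1],
     [0, 1, 0],
     [1, 0, 0]]
assert apply_mod2(P, [1, 1, 0]) == [0, 1, 1]
```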
In this form, Theorem 1 states that a map from u to v is admissible if it can be expressed in the form

Uv = Lu

where L is a lower unit triangular matrix and U is an upper unit triangular matrix. (Unit means that all coefficients on the main diagonal are ones.) Since the inverse of an upper unit triangular matrix is also upper unit triangular, this expression can be written

v = U'Lu   (16)

where U' is U^{-1}, and is upper unit triangular.

We will show, shortly, that we can factor an arbitrary nonsingular binary matrix P as

P = L1 U L2.   (17)

Having done so, we can define

u = U L2 x   (18)

and

y = L1 u.   (19)

Equation (18) is of the form of (16) and therefore is an admissible map. Equation (19) is also admissible, and in fact, is lower unit triangular. The map defined by P in (15) can be executed in, at most, two unit transfers.

To obtain the factorization of (17), we use a modified form of Gauss's algorithm using modulo 2 arithmetic. Gauss's algorithm, if it can be executed without pivoting, leads to a factorization of P into LU, the product of a lower and an upper unit triangular matrix. The modification that is needed is introduced when we would normally use pivoting. Suppose we have reduced the first (i - 1) columns to upper triangular form, and the (i,i) term is zero. Since the matrix is assumed to be nonsingular, there must be a one in the ith row to the right of the ith column. Suppose there is a one in the (i,j)th location. We can add, modulo 2, the jth column to the ith; this is equivalent to post-multiplying by a matrix with ones on the main diagonal and a one in the (j,i)th location, which is a lower triangular matrix. The matrix has now been modified to put a one in the (i,i)th location, and the reduction process can be resumed. The end result of this process, given that P is nonsingular, is a factorization into the form of (17).

These results can easily be extended to include any map that is affine in the components of the index, i.e., such that each of the functions (4) is the sum of a linear function and a constant, which can be either zero or one.

We would like to be able to extend these results to arbitrary nonlinear and nonaffine maps, to give an algorithmic procedure for obtaining an efficient factorization of an arbitrary map into a sequence of admissible maps. We have not succeeded in doing so. Fortunately, the maps that have been found to be important are of the types discussed here.

IV. COMPUTATION GRIDS

In this section we consider some specific uses of a microprocessor array using the switching network described in Section II-A. This array developed out of consideration of the binary n-cube. If efficient utilization of the array's capabilities depended on finding the structure of the binary n-cube in the algorithm being executed, its use would be severely limited. To make it useful for a reasonably broad range of applications, we must find ways of adapting the array configuration to problems of various dimensions and different sizes along each dimension. Since this adaptation seems quite fundamental, it is considered here as a problem of array manipulation, rather than described later when various applications are considered.

Even though the present intent is to show that the indirect binary n-cube array can be used in a way that simulates other arrays, the techniques do apply to some application problems of great importance. In particular, they apply to the solution of partial differential equations in several dimensions using relaxation methods. The problem can be regarded as one of handling data that are distributed over the nodes of a grid having several dimensions, usually either two or three, and with a regular pattern of local connections. The computational grid is defined by the algorithm that fits the application, not by the topology of the processing array. The significant question is whether or not a particular processing array can be used effectively for the computational grid that is appropriate to the problem.

In the following subsections, we consider two aspects of matching the indirect binary n-cube array to various computational grids. The first subsection deals with dimensionality: although by definition an n-cube is n-dimensional, it can be made to act as an s-dimensional grid, where generally s = 2 or 3. In the second subsection, we consider other grid patterns, taking as an example a two-dimensional hexagonal grid.

A. s-Dimensional Rectangular Grids

Important applications employ algorithms linking data that are conceived to be on some set of neighbors on a rectangular grid in, say, s dimensions, rather than on the binary n-cube. For example, Fig. 7 shows a two-dimensional rectangular grid with 16 grid points, so that s = 2 and n = 4. It is important to establish how the indirect binary n-cube connection can be used to obtain the communication patterns required for a process based on a rectangular grid.

In general, we wish to make the array act as if it were an s-dimensional array configured as an N1 × N2 × ... × Ns array. We assume that each Ni is itself a power of 2, which is a somewhat restrictive but convenient assumption.

Given the assumption that each Ni is a power of 2, we can assign sets of the binary components of the index to each
Fig. 7. Rectangular grid of 16 points.
dimension. For example, the labeling of Fig. 7 is such that
the two least significant bits of the index determine the
horizontal position, and the two most significant bits the
vertical position.
We have already shown in (8) that we can shift any
constant amount on any subset of the index components
with a single pass through the network. Therefore, we can
shift data in the network of Fig. 1 as if the network were
the rectangular grid of Fig. 7.
For example, a unit right shift in Fig. 7 is generated by
setting to the crossed state all the switch nodes in S1 of Fig.
1 and the first, third, fifth, and seventh in S2, counting
from the top, with all other switch nodes direct. The left
shift is obtained by reversing the states of the switch nodes
in S2, setting the second, fourth, sixth, and eighth to the
crossed states.
A unit downshift in Fig. 7 is generated if all the switch
nodes in S3 and the top four nodes in S4 are set to the
crossed state, all other switch nodes being set to the direct
state. The unit up-shift is generated if the bottom four
switch nodes of S4 are set to the crossed state. Shift distances greater than one can be generated with equal simplicity. We can even handle diagonal shifts on the grid,
since these correspond to simultaneously adding to the two
subsets of the index components.
Thus, it is quite possible to program the switch nodes
of Fig. 1, or larger versions of the array, to make the array
act as if it were a rectangular grid of arbitrary dimensionality. A shift of arbitrary distance along any dimension of
the grid can be obtained by a single pass through the network.
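The bit-field arithmetic behind these grid shifts can be sketched as follows (an illustration, assuming the labeling of Fig. 7: the two least significant bits of the index give the column, the two most significant the row).

```python
# Sketch: unit shifts on the 4x4 grid of Fig. 7, each realized as
# "add 1 modulo 4 to one subset of the index components."
def right_shift(q):
    row, col = q >> 2, q & 3
    return (row << 2) | ((col + 1) % 4)

def down_shift(q):
    row, col = q >> 2, q & 3
    return (((row + 1) % 4) << 2) | col

# Each shift is a permutation of the 16 indices (one pass through the
# network), and four applications return every index to itself.
for shift in (right_shift, down_shift):
    assert sorted(shift(q) for q in range(16)) == list(range(16))
    for q in range(16):
        r = q
        for _ in range(4):
            r = shift(r)
        assert r == q
```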
B. Odd-Even Interleaved Grids
A frequently used technique considers the data as resident on the nodes of two interleaved rectangular grids. The
nodes of one grid are at the center of the cells of the other
grid. The algorithm computes new values on each grid in
alternate steps. In one step, the new value at a point on one
grid is determined from the values on neighboring points
on the second grid. In the succeeding step, new values are
computed on the second grid, from the values on neighboring points of the first grid.
If the array of microprocessors were used to simulate the
pair of grids, with each microprocessor assigned to one
point on one grid, some computational parallelism would
be lost. In each step the values on only one grid are being
computed, so that only half the microprocessors would be
active.
To avoid this loss of parallelism, we can locate two data
points in each microprocessor, one from each grid. For
example, in Fig. 7 the ith microprocessor is made to hold
the ith and (i')th data. The (i')th datum may be regarded,
however, as physically located in the center of the cell in
which the ith datum is located at the upper left corner. To
illustrate, microprocessor 5 contains the data associated
with nodes 5 and 5'. Node 5' is regarded as located at the
center of the cell defined by the nodes 5, 6, 9, and 10.
Assuming that the algorithm, in a given step, calls only
for the nearest neighbors, the computation of new values
for the data at node 5 calls for data from nodes 0', 1', 4', and
5', or for data resident in microprocessors 0, 1, 4, and 5 itself. Similarly, the computation of new values for the data
at node 5' requires the data at nodes 5, 6, 9, and 10, or for
data from microprocessors 6, 9, 10, and 5 itself.
The data movements required are then, left and right
one-shifts, up and down one-shifts, and one-shifts in both
directions along the diagonals from upper left to lower
right in Fig. 7.
The right, left, up, and down shifts are those required
by the simple rectangular grid of Fig. 7 itself. The diagonal
shift is a little more unusual. However, it is an example of
a map which executes a simultaneous shift on disjoint
subsets of binary components of the index. These maps
were mentioned following (8) and their admissibility, in
general, recognized as a consequence of Theorem 2. Hence,
each data shift required for computation on odd-even interleaved grids is obtainable by a single unit transfer.
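The diagonal shift just described amounts to incrementing both bit subsets at once; a minimal sketch (same Fig. 7 labeling assumption as before):

```python
# Sketch: a diagonal shift on the 4x4 grid is a simultaneous unit shift
# on both disjoint bit subsets (row bits and column bits) of the index.
def diagonal_shift(q):
    row, col = q >> 2, q & 3
    return (((row + 1) % 4) << 2) | ((col + 1) % 4)

# Still a permutation, so still realizable as a single unit transfer.
assert sorted(diagonal_shift(q) for q in range(16)) == list(range(16))
assert diagonal_shift(5) == 10   # node 5 (row 1, col 1) -> node 10 (row 2, col 2)
```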
Fig. 8. Section of a hexagonal grid.
C. Other Grids
Other grid structures are sometimes useful. One is the
two-dimensional hexagonal grid, a section of which is illustrated in Fig. 8. This grid is sometimes used for the solution of two-dimensional partial differential equations
by relaxation methods. Its advantage is that each point has
only three nearest neighbors, instead of four, so that the
separate steps of the iterated calculations are somewhat
simplified. Whether or not this leads to improvement in
the total computational effort depends on the details of the
partial differential equation and its boundary conditions.
The grid of Fig. 8 has three basic transfer operations,
indicated as a, b, and c. These are all of order 2, i.e., exchange operators.
Fig. 8 can be mapped onto the 4-cube, as shown in Fig.
9. The dashed lines in Fig. 9 are the connections that make
Fig. 8 toroidally connected in both directions.
The mappings required, when put in the form of (5), are
as follows.
For the exchanges along the a edges:

y1 = x1 + 1
y2 = x2 + x3
y3 = x3
y4 = x4.   (20)

For the exchanges along the b edges:

y1 = x1 + 1
y2 = x2 + x3 + 1
y3 = x3
y4 = x4.   (21)

For the exchanges along the c edges:

y1 = x1
y2 = x2
y3 = x3 + 1
y4 = x4 + y1.   (22)

All of these maps are of the form required by Theorem 1. Therefore, they can each be executed with a single unit transfer.

The maps for larger size hexagonal grids can be developed analogously, assuming the horizontal and vertical sizes are powers of 2. The resultant maps are somewhat more involved, but there is no essential difficulty in developing them.

Other grids can be implemented in the indirect binary n-cube array. In general, the problem is to discover a method of folding the grid so that it fits on the binary n-cube. When this is possible, the array can be used to provide the communication paths required for computation on the grid.

V. CONTROL AND VIRTUAL ARRAYS

The general problem of controlling the array is considered in this section. Clearly, it is of vital importance to provide a control system that both has the required flexibility and permits the array to operate at high efficiency.
We propose a two-level control system for the microprocessors, based on variable microprogramming stored
within the microprocessors. For the switch nodes, we
propose to use a set of switch controllers, each of which
would control a set of switch nodes. The switch controllers,
which may be similar if not identical to the microprocessors, may themselves be variably microprogrammed and
controlled by the same high level controller that controls
the microprocessors. The control system is shown in block
form in Fig. 10.
The controller is visualized as issuing sequences of global
Fig. 9. The hexagonal grid mapped onto the 4-cube.
Fig. 10. Outline of control system.
commands that are broadcast to the microprocessors and
to the switch controllers. The total vocabulary of global
commands is small. In the microprocessors, each global
command is interpreted as a call on a sequence of microinstructions that is stored locally. Different microprocessors may interpret a given global command into quite
different sequences of microinstructions. The same command may also be interpreted into different sequences of
microinstructions in the same microprocessor, with the
choice determined by a previous test of the data.
Within the switch controllers, a global command is again
interpreted. Interpretation in this case may entail only the
call up of a word stored internally, the individual bits of
which are transmitted to the switch nodes controlled by
the switch controller.
It is important that the microinstruction set within a
microprocessor or switch controller can be changed between jobs. The significance of a given global command
may be altered by reprogramming the microprocessors and
switch controllers.
The purpose of organizing the control system in this way
is to obtain the required flexibility within the limitations
imposed by the microprocessor concept -in particular,
without making excessive use of the scarce resource of the
pins on the microprocessor chips.
To avoid using an excessive number of pins for the
control function, we must sharply limit the number of
commands that are recognized by a microprocessor. Yet
wide variations are needed in the operations that are actually executed. To achieve the variation needed, the device of local interpretation and reprogrammability is
used.
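The dispatch scheme can be sketched in a few lines. The class and command names below are hypothetical, chosen only to illustrate how one broadcast command can be interpreted into different locally stored microinstruction sequences.

```python
# Sketch (hypothetical names): each processor interprets a small global
# command vocabulary through its own locally stored microprogram table,
# so the same broadcast command can trigger different local behavior.
class Microprocessor:
    def __init__(self, microprograms):
        self.microprograms = microprograms   # command -> sequence of micro-ops
        self.acc = 0

    def execute(self, command):
        for micro_op in self.microprograms[command]:
            micro_op(self)

# An interior point doubles its value on "STEP"; a boundary point holds it.
interior = Microprocessor({"STEP": [lambda p: setattr(p, "acc", p.acc * 2)]})
boundary = Microprocessor({"STEP": []})      # Dirichlet point: value is fixed
interior.acc = boundary.acc = 3
for p in (interior, boundary):
    p.execute("STEP")                        # one broadcast, two interpretations
assert (interior.acc, boundary.acc) == (6, 3)
```

Reloading the microprogram tables between jobs changes the meaning of the same global vocabulary, which is the reprogrammability the text describes.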
The two-level control system permits us to introduce the
concept of a virtual array. We have described how an array
constructed using the indirect binary n-cube connection
can be made to execute the operations of, for example, a
CN1 X CN2 array. By the microprograms that are loaded
into the switch controllers and into the microprocessors,
the array can be made to act as if it actually were a CN1 X
CN2 array, when viewed from the level of the controller. In
a similar way, the array can be made to act as if it were
hexagonal (of which Fig. 8 is a section) by entering the
appropriate microprograms before the start of computations.
Once the desired virtual array has been established, it
can be modified to take account of special local conditions.
As an example, consider the solution of a two-dimensional
partial differential equation. Several special local conditions may exist. First are the boundary conditions. If the
boundary conditions are of the Dirichlet type, the values
of the function are specified on the points of the boundary.
If they are of the Neumann type, the normal derivatives
are specified. In either case, the grid points that are defined
as the boundary must be handled specially. It is plausible
to expect that the boundary points can be accommodated
by introducing special microprograms into the particular microprocessors, so that the global commands are interpreted locally into a procedure that is appropriate to the boundary condition.

Second, some applications will use a grid that is larger than the array. In such cases, the grid can be folded so as to use the memory that is part of each microprocessor. The microprocessors that handle the points at which the grid is folded must handle the data specially. Again, it is plausible to expect that this requirement can be met by programming the microprocessors.

Third, even with folding, the grid may be larger than can be held in the microprocessor memories. If so, the grid must be sectioned, and different sections paged in from the backup memory. This introduces special conditions on the grid points on the boundaries of the sections.

Fourth, using different grid sizes in different regions is often advantageous. The boundary between regions using different grid sizes again requires special handling.

Fifth, different regions may use different computational algorithms. For example, fluid flow may divide into regions of laminar and turbulent flow, or there may be shock waves. The boundaries between such regions may themselves be moving and may be determined by the data, rather than preset. In this last case, we need to test the data to determine where the boundaries are, and then execute the computational algorithm that is locally appropriate. This requires each microprocessor to hold two or more microprograms, either of which is called by the global command for computation, with the choice between the microprograms made according to a previous local test of the data.

In summary, the use of a two-level control system with variable microprogramming gives two kinds of flexibility. First, it provides global flexibility that can be used to set up a virtual array that is suitable to the type of problem being handled. Second, it appears that we can use the system to provide for local details of the computation, which are either prescribed by the specific problem or are determined as a prescribed response to the local data.

VI. APPLICATION TYPES

We have discussed how arrays of the type illustrated in Fig. 1 can be used to establish the communication paths required for computations that are conceived as being executed at the nodes of a multidimensional toroidally connected grid. The particular application that we have considered is the solution of partial differential equations. Although we have not explored the actual computational algorithms, we have shown that we can establish the necessary communication paths.

We now consider other applications, in particular, spectral analysis and matrix multiplication. These applications present diverse, important, computational requirements.

A. Discrete Spectrum Analysis and Related Processes

This category of applications includes various processes that are conveniently executed through the FFT, such as digital filtering, and the FFT itself [13], [14].

The radix-2 version of the FFT depends on the connectivity of the binary n-cube. Let the data be indexed from 0 to 2^m - 1, and let the index, q, of a datum be expressed in binary form as in (1). In the kth stage of the FFT, the pairs of data that must be combined are those that differ only in q_{m-k+1}.² This rule uses the pattern of the binary n-cube.

² This is the case when the data are kept in their original order and the spectrum is computed in digit-reversed order. When the data are entered in digit-reversed order, with the resultant spectrum in normal order, a similar rule applies, but in the kth stage it is the kth least significant bit that differs in each pair.

Although the radix-2 FFT uses the pattern of the binary n-cube, it is not sufficient simply to map the data onto the vertices of the n-cube. The computations required are executed on pairs of data. Therefore, in the type of array being considered, each microprocessor must contain an appropriate pair of data on each computation cycle. The data must be organized so that successive transfers will establish the successive stages of the FFT algorithm.

It is possible to find a mapping of the data onto the microprocessors that permits the radix-2 FFT. The procedures that are used to obtain the mapping have been discussed elsewhere [14] and will not be repeated here. The results can be illustrated with the following mapping.

Suppose we have 2^{n+1} data with 2^n microprocessors. The data are divided into two sets, indexed by q1 and q2. The q1 set is the set of data for which q has even index parity; the q2 set, the data with odd index parity. The index q1 of the data assigned to a given microprocessor p is given by

q1 = Mp   (23)

where M is the (n + 1) × n matrix

M = ( 0 0 ... 0 1
      0 0 ... 1 1
      0 0 ... 1 0
          ...
      1 1 ... 0 0
      1 0 ... 0 0 )   (24)

and by q2 obtained from q1 by complementing the most significant bit. As before, the matrix and vectors are constructed with coefficients in the field of characteristic 2. Hence, arithmetic processes are done modulo 2.
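The mapping (23) can be exercised numerically. The sketch below is an illustration only; it assumes most-significant-bit-first vectors and uses the n = 3 matrix as reconstructed here (rows 001, 011, 110, 100, an assumption consistent with (24) and the parity requirement). It checks that every q1 = Mp has even bit parity and that complementing the most significant bit yields odd parity.

```python
# Sketch: the map q1 = Mp (modulo 2) for n = 3. Bit vectors are taken
# most significant bit first; M is an assumed instance of (24).
M = [[0, 0, 1],
     [0, 1, 1],
     [1, 1, 0],
     [1, 0, 0]]          # (n + 1) x n, with n = 3

def mul_mod2(M, p_bits):
    return [sum(m * b for m, b in zip(row, p_bits)) % 2 for row in M]

seen = set()
for p in range(8):
    p_bits = [(p >> k) & 1 for k in (2, 1, 0)]   # msb first
    q1 = mul_mod2(M, p_bits)
    assert sum(q1) % 2 == 0                      # q1 has even index parity
    q2 = [1 - q1[0]] + q1[1:]                    # complement the msb
    assert sum(q2) % 2 == 1                      # q2 has odd index parity
    seen.add(tuple(q1))
assert len(seen) == 8                            # distinct processors, distinct data
```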
As an example, for n = 3,

M = ( 0 0 1
      0 1 1
      1 1 0
      1 0 0 )

giving the mapping of the data on the cube shown in Fig. 11. The data indices that are underlined are those given by q1, and are to be held in place. The others are those mapped by q2, which are to be transferred.

Fig. 11. Assignment of data to the 3-cube for execution of the radix-2 FFT.

The distribution shown in Fig. 11 is that appropriate to the first stage of the FFT. Following the first stage computations, the q2 data are transferred horizontally in Fig.
11. This gives the pairing required for the second stage. A subsequent vertical transfer sets up the pairing required for the third stage. A final transfer along the diagonal edges of Fig. 11 sets up the data for the fourth and final stage.

It is possible, then, to make efficient use of the binary n-cube connection for the radix-2 FFT. That the actual array may be the indirect binary n-cube one, so that the binary n-cube is obtained as a virtual array through microprogramming the switch nodes, does not cause any difficulties.

The particular mapping given here, and the resultant process executing the FFT, is not entirely unique. It is intended rather to illustrate the way in which the data assignment can be combined with the array structure to obtain efficient execution of a well structured algorithm.

B. Matrix Operations

An important class of problems requires the solution of

Ax = b   (25)

for the N-dimensional vector x, given the N-dimensional vector b, and the N × N matrix A. Closely related is the problem of inverting an N × N nonsingular matrix.

In an earlier paper [15], we have described algorithms for solving (25) and for inverting a matrix that depend on the group structure of the binary n-cube, or the group C2^n, the direct product of n copies of C2, the reflection group. We will not discuss the algorithm in detail here. However, we will briefly sketch the process of matrix multiplication, since it is fundamental, not only to the algorithm but also to many other operations.

Let A and B be 2^m × 2^m matrices where, in the limiting case, m = n/2, assuming n is even. Larger matrices will have to be handled by partitioning. With smaller matrices, we can operate on several simultaneously, in effect stacking them into a full-size, quasi-diagonal matrix. If we do not have several required matrix products, the matrices can be replicated through the array to make full use of the available parallelism; however, we shall not discuss these operations here.

Let the coefficients of A and B be A_ij and B_ij, with each index ranging from 0 to (2^m - 1). Let the double index ij be taken as a single index,

q = i2^m + j.   (26)

Let q, as before, be expanded in binary form as in (1). The row index is then given by the m most significant bits, the column index by the m least significant.

The simplest way of mapping a matrix into the array is to put the qth coefficient of the matrix into the qth microprocessor, where by the qth coefficient we mean the single index given by (26). We will call this the "identity map of the matrix to the array." There are other ways of mapping the matrix onto the microprocessors which appear to be useful under some circumstances, and which are indicated later. The following discussion does not present the "best" map or procedure, but only serves as an illustration of the effective use of the arrays we have considered in matrix operations.

We observe, first, that with the identity map the transpose of the matrix can be obtained relatively simply: it requires that the binary components of the index be interchanged, so that the kth component of the index and the (k + m)th, k ≤ m, are transposed. This is an instance of the type of map defined in (11), and can be obtained by the two maps given in (12) and (13). The transpose, therefore, can be accomplished by just two unit transfers, or two passes through the switching network.

Suppose, now, the matrix A is stored in the array, and the matrix B has been converted to its transpose Bt; we suppose that both the (i,j)th component of A and the (j,i)th component of B are in the (i2^m + j)th microprocessor.

By multiplying the pairs of coefficients in each microprocessor, all the terms in all the diagonal elements of AB are obtained. The various terms of the (i,i)th coefficient are in the microprocessors with indices (i2^m + j), 0 ≤ j ≤ (2^m - 1).

At this point the diagonal elements of AB could be determined with a sequence of shifts and adds. The shifts required are those which add constants to j, modulo 2^m, which are the appropriate powers of the map given in (8). As a result of Theorem 2, these are admissible maps, each realizable by a single unit transfer through the switching network.
It is better to defer the shifts and adds, however; otherwise there will be a rather rapid loss in the degree of utilization of the microprocessors. If the microprocessors have sufficient storage capacity, the individual terms should be retained at this point.

The second step, then, is to shift the components of Bt along the columns by 2^{m-1}, modulo 2^m; we shift each (j,i)th component of Bt to the ((i + 2^{m-1})2^m + j)th microprocessor. This shift is admissible as in (8), and so executable as a single unit transfer. The (i2^m + j)th microprocessor then contains the (i,j)th component of A and the (j, i + 2^{m-1})th (modulo 2^m) component of Bt. The product of these pairs gives all the terms of the (i, i + 2^{m-1})th coefficients of AB.

The process of combining terms can now be started in a way that fully utilizes the microprocessors. The principle used is to simultaneously shift and add half of the (i,i) terms computed in the first step, and half of the (i, i + 2^{m-1}) terms obtained in the second step. The (i,i) terms shifted and added are those in the microprocessors for which (j - i) > 2^{m-1} or 0 < (i - j) < 2^{m-1}. The (i, i + 2^{m-1}) terms shifted and added are those in the microprocessors for which 0 < (j - i) < 2^{m-1} or (i - j) > 2^{m-1}. In other words, for the (i,i) terms we shift a diagonal band of terms from below the main diagonal and from the upper right corner. For the (i, i + 2^{m-1}) terms we shift the complementary band. For both types of terms, the shift is 2^{m-1} in the j coordinate: j to (j + 2^{m-1}) modulo 2^m. This is an admissible shift by (8).

For the next two steps, the Bt coefficients are shifted in the i direction by plus and minus 2^{m-2}. After each shift, the coefficients are multiplied by the coefficients of A in the same microprocessor. The products are the terms of the (i, i + 2^{m-2}) and (i, i - 2^{m-2}) coefficients of AB. A simultaneous shift and add of half of each of these sets of terms can now be performed.

Another simultaneous shift and add can now be executed, shifting half of the current terms of the (i,i), the (i, i + 2^{m-1}), the (i, i + 2^{m-2}), and the (i, i - 2^{m-2}) coefficients, each of these sets having already been coalesced once.

The process can be continued until the full matrix product has been obtained.
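The term generation underlying this procedure can be verified directly: after Bt has been shifted by s in the i direction, the per-microprocessor products are exactly the terms of coefficient (i, i + s) of AB. The following sketch checks the terms only, not the clever coalescing schedule.

```python
# Sketch: microprocessor (i, j) holds A[i][j] and, after Bt is shifted by
# s in the i direction, a component of Bt equal to B[j][(i + s) % N].
# Their product is one term of coefficient (i, (i + s) % N) of AB.
import random

N = 4
A = [[random.randint(0, 9) for _ in range(N)] for _ in range(N)]
B = [[random.randint(0, 9) for _ in range(N)] for _ in range(N)]
AB = [[sum(A[i][k] * B[k][j] for k in range(N)) for j in range(N)]
      for i in range(N)]

for s in range(N):                              # shift distances 0 .. N-1
    for i in range(N):
        terms = [A[i][j] * B[j][(i + s) % N] for j in range(N)]
        assert sum(terms) == AB[i][(i + s) % N]
```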
To illustrate, consider the multiplication of 4 X 4 matrices. The first step generates the (i,i) terms, and the
second step the (i,i + 2) modulo 4 terms. After the first
shift and add, the array contains the terms whose indices
are as follows:
( (0,0) (0,0) (0,2) (0,2)
  (1,3) (1,1) (1,1) (1,3)
  (2,0) (2,0) (2,2) (2,2)
  (3,3) (3,1) (3,1) (3,3) )

The next two multiply steps, and the subsequent shift and add, generate terms in the array whose indices are

( (0,3) (0,1) (0,1) (0,3)
  (1,0) (1,0) (1,2) (1,2)
  (2,3) (2,1) (2,1) (2,3)
  (3,0) (3,0) (3,2) (3,2) )
The two sets of terms are stored in the array of microprocessors. By properly selecting the terms to be shifted
and shifting them one unit left, modulo 4, and then adding,
the matrix AB is obtained with its terms correctly distributed among the microprocessors:
( (0,0) (0,1) (0,2) (0,3)
  (1,0) (1,1) (1,2) (1,3)
  (2,0) (2,1) (2,2) (2,3)
  (3,0) (3,1) (3,2) (3,3) )
We may note that in each operation, all the microprocessors are busy. If the matrices are N X N with N a power
of 2, and if the matrices are filled, without zero coefficients,
the procedure fully utilizes an N2 parallelism. Further,
since all microprocessors are doing the same thing in each
operation, the array can be controlled by a broadcast global
command system, as proposed.
There are a number of variations of this procedure that
may be useful under special conditions. For example, it
may be advantageous to use a different procedure from
(26) to map the matrix coefficients onto the microprocessors. Budnik and Kuck [16], for example, have argued the
importance of skewing the matrix so that each row is displaced horizontally, modulo N, by an amount that is proportional to the row index. It can be shown that the map
from the distribution used above to a skewed representation is an admissible map so that one representation can
be converted to the other with a single unit transfer.
Another map with some interesting properties is given by

q = ( O  I ) p   (27)
    ( C  I )

where p is the index of a microprocessor and q the indices of a matrix coefficient, combined as in (26), and both made into vectors as in (14). C is the matrix with ones on the main contradiagonal, zeros elsewhere. As usual, the indicated operations are to be interpreted as modulo 2 arithmetic or in the field of characteristic 2.

The map is not arbitrary. It first maps the diagonal elements of the matrix onto the microprocessors indexed 0 ≤ p ≤ (2^m - 1). It then maps the coefficients with indices of the form {i, (i + 2^{m-1})}, modulo 2^m, on the next block of microprocessors. The third block is filled with the coefficients whose indices are of the following forms: first {i, (i + 2^{m-2})} with 0 ≤ i ≤ 2^{m-1} - 1, and second {(i + 2^{m-2}), i}, and so on. The reason for this map's usefulness is deeply embedded in the theory behind the inversion algorithm cited [15]. It appears to be a useful representation where the inverse of the matrix or the solution of a set of linear equations is to be obtained.
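The block structure of this map can be sketched for m = 2 (16 microprocessors). This is an illustration under the assumption that (27) has the block form q = (O I; C I)p, with C reversing the bit order of the upper half of p; the helper name is hypothetical.

```python
# Sketch (assumed block form of (27), m = 2): the first block of
# processors receives the diagonal coefficients, the next block the
# (i, i + 2) pairs, and the third block the (i, i XOR 1) pairs.
def coefficient_for(p, m=2):
    p_hi, p_lo = p >> m, p & ((1 << m) - 1)
    c_hi = int(format(p_hi, f"0{m}b")[::-1], 2)   # C: reverse the m bits
    i = p_lo                                      # i = I * p_lo
    j = c_hi ^ p_lo                               # j = C * p_hi + p_lo (mod 2)
    return i, j

for p in range(4):                 # first block: diagonal elements
    i, j = coefficient_for(p)
    assert i == j
for p in range(4, 8):              # second block: (i, i + 2) modulo 4
    i, j = coefficient_for(p)
    assert j == i ^ 2
for p in range(8, 12):             # third block: (i, i XOR 1)
    i, j = coefficient_for(p)
    assert j == i ^ 1
```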
The map (27) is not admissible, having zeros on the main
and add generates terms in the array whose indices are
473
PEASE: n-CUBE MICROPROCESSOR ARRAY
[2] J. H. Holland, "A universal computer capable of executing an arin 1959 Eastern
bitraryComputer
number ofConf.,
sub programs
simultaneously,"
AFIPS Conf.
Joint
Proc., pp. 108-113.
[3] D. L. Slotnick et al., "The SOLOMON computer," in 1962 FaU Joint
Computer Conf., AFIPS Conf. Proc., pp. 97-107.
W. T. Comfort,
modified Holland
in 1963 Fall Joint
[4] Computer
pp. 481-488.
Conf.,"AAFIPS
Conf. Proc., machine,"
diagonal. However, it can be factored as in (17). Hence, a
matrix stored as in (26) can be rearranged so as to be stored
according to the map (27) in just two unit transfers. The
cost of converting to or from this representation is not
great.
What other representations may be useful is not clear.
However, there are many possibilities that can be considered in relation to particular circumstances. There is a
great deal of flexibility in representing and manipulating
matrices.
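The blocked storage map described above can be made concrete with a short sketch. This is an illustration, not code from the paper: the function name is ours, and only the first three blocks are enumerated, following the stated pattern (diagonal elements first, then the coefficients {i, (i + 2^(m-1)) mod 2^m}, then the two half-length families {i, i + 2^(m-2)} and {i + 2^(m-2), i}).

```python
def diagonal_block_map(m):
    """Enumerate the first three blocks of the storage map for a
    2^m x 2^m matrix.  Block k is held by the block of microprocessors
    indexed k*2^m ... (k+1)*2^m - 1."""
    n = 1 << m  # matrix order, 2^m
    blocks = [
        # block 0: the diagonal elements
        [(i, i) for i in range(n)],
        # block 1: indices {i, (i + 2^(m-1)) mod 2^m}
        [(i, (i + n // 2) % n) for i in range(n)],
        # block 2: two families of 2^(m-1) pairs each
        [(i, i + n // 4) for i in range(n // 2)]
        + [(i + n // 4, i) for i in range(n // 2)],
    ]
    return blocks

# For m = 2 (a 4x4 matrix), the diagonal lands on processors 0-3:
print(diagonal_block_map(2)[0])   # [(0, 0), (1, 1), (2, 2), (3, 3)]
```

Each block contains exactly 2^m coefficients, so it fills one block of 2^m microprocessors, consistent with the counts in the text.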
VII. CONCLUSIONS
The indirect binary n-cube array is an attractive candidate for a microprocessor array to be used as a highly
parallel facility for scientific or numerical applications. It
appears to be well adapted to a variety of applications requiring a wide range of algorithmic processes. At the same
time, it permits making efficient use of the pins that are
available on the LSI chips. The regularity of its structure
is also attractive, both in terms of the building costs, and
the possibility of approaching desired sizes through a sequence of incremental steps, each of which doubles or
quadruples the previous size.
We have discussed the problem of control in general
terms and have proposed the use of a two-level system
based on global commands which are locally interpreted
into reprogrammable sequences of microinstructions. This
arrangement permits using the actual array as if it had
quite a different configuration, or setting up one of a wide
range of virtual arrays.
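The two-level control scheme can be pictured in miniature. The sketch below is hypothetical (the class name, method name, and microinstruction vocabulary are invented for illustration): each node holds a reprogrammable table that expands a broadcast global command into its own local microinstruction sequence, so the same command can mean different things on different nodes.

```python
class Node:
    """One microprocessor: a reprogrammable table mapping global
    commands to local microinstruction sequences."""
    def __init__(self, index, table):
        self.index = index
        self.table = dict(table)  # command -> list of microinstructions

    def interpret(self, command):
        # Local interpretation of a globally broadcast command.
        return self.table[command]

# Two nodes give the same global command different local meanings;
# reloading the tables lets the physical array emulate a different
# virtual array.
even = Node(0, {"EXCHANGE": ["send d2", "recv d2"]})
odd = Node(1, {"EXCHANGE": ["recv d2", "send d2"]})
for node in (even, odd):
    print(node.index, node.interpret("EXCHANGE"))
```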
We have not discussed I/O problems, either for data or
for the initial loading of the microinstruction sets. This is
an aspect that requires further study. However, we can
observe that because the array is based on the binary n-cube, there is a very simple and natural addressing inherent in the structure. Meeting I/O requirements does not appear to present any inherent difficulty.
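The natural addressing can be made concrete: in a binary n-cube, the neighbors of processor p are the addresses obtained by complementing one bit of p at a time. A minimal sketch (the function name is ours) of this standard adjacency rule:

```python
def cube_neighbors(p, n):
    """Neighbors of processor p in a binary n-cube: complement each
    of the n address bits in turn."""
    return [p ^ (1 << k) for k in range(n)]

# In a 3-cube, processor 0 connects to processors 1, 2, and 4.
print(cube_neighbors(0, 3))   # [1, 2, 4]
```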
Finally, we recognize that other types of arrays are also
candidates. We do not assert that the indirect binary n-cube array is the best. The conclusion is not that this
particular array should be the one used, but rather that a
large-scale array of microprocessors can have great computational power and flexibility for handling a wide range
of important problems.
ACKNOWLEDGMENT
The author gratefully acknowledges the assistance of J. Goldberg and Dr. W. Kautz in the preparation of this paper.
REFERENCES
[1] S. H. Unger, "A computer oriented toward spatial problems," Proc. IRE, vol. 46, pp. 1744-1750, Oct. 1958.
[2] J. H. Holland, "A universal computer capable of executing an arbitrary number of sub-programs simultaneously," in 1959 Eastern Joint Computer Conf., AFIPS Conf. Proc., pp. 108-113.
[3] D. L. Slotnick et al., "The SOLOMON computer," in 1962 Fall Joint Computer Conf., AFIPS Conf. Proc., pp. 97-107.
[4] W. T. Comfort, "A modified Holland machine," in 1963 Fall Joint Computer Conf., AFIPS Conf. Proc., pp. 481-488.
[5] R. A. Gonzalez, "A multilayer iterative circuit computer," IEEE Trans. Comput., vol. C-12, pp. 781-790, Dec. 1963.
[6] G. H. Barnes et al., "The ILLIAC IV computer," IEEE Trans. Comput., vol. C-17, pp. 746-757, Aug. 1968.
[7] D. H. Lawrie, "Access and alignment of data in an array processor," IEEE Trans. Comput., vol. C-24, pp. 1145-1155, Dec. 1975.
[8] C. Clos, "A study of non-blocking switching networks," Bell Syst. Tech. J., vol. 32, pp. 406-424, Mar. 1953.
[9] V. E. Benes, Mathematical Theory of Connecting Networks and Telephone Traffic. New York: Academic, 1965.
[10] A. E. Joel, Jr., "On permutation switching networks," Bell Syst. Tech. J., vol. 47, pp. 813-822, May-June 1968.
[11] D. C. Opferman and N. T. Tsao-Wu, "On a class of rearrangeable switching networks," Bell Syst. Tech. J., vol. 50, pp. 1579-1618, May-June 1971.
[12] H. R. Ramanujam, "Decomposition of permutation networks," IEEE Trans. Comput., vol. C-22, pp. 639-643, July 1973.
[13] E. O. Brigham, The Fast Fourier Transform. Englewood Cliffs, NJ: Prentice-Hall, 1974.
[14] M. C. Pease, "An adaptation of the fast Fourier transform for parallel processing," J. Ass. Comput. Mach., vol. 15, pp. 252-264, Apr. 1968.
[15] M. C. Pease, "The C2m-algorithm for matrix inversion," J. Ass. Comput. Mach., to be published.
[16] P. Budnik and D. J. Kuck, "The organization and use of parallel memories," IEEE Trans. Comput., vol. C-20, pp. 1566-1569, Dec. 1971.
Marshall C. Pease, III (M'47-SM'51-F'62) was born in New York, NY, on July 30, 1920. He received the B.S. degree in chemistry from Yale University, New Haven, CT, and the M.A. degree in physical chemistry from Princeton University, Princeton, NJ, in 1940 and 1943, respectively.
He was employed at the Radio Research Laboratory at Harvard University, Cambridge, MA, from 1943 to June 1945, working on countermeasures to radar. He served in the U.S. Navy from June 1945 to August 1946. He was employed by Sylvania Electric Products, Inc., from 1946 to 1960, working primarily on microwave tube devices and theory. In 1960 he joined the Stanford Research Institute, Menlo Park, CA, where he is presently a Staff Scientist. He worked initially on the theory of electron beam devices, and, in recent years, on algorithmic development and computer architecture. His recent areas of research include the study of parallel architectures and of algorithmic processes that are suitable for parallel computation. He is also doing research in the development of techniques and algorithmic processes for the application of computers to managerial problems. He is author of Methods of Matrix Algebra (New York: Academic, 1965). He also contributed to Crossed-Field Microwave Devices (E. Okress, Ed. New York: Academic, 1961), Advances in Microwaves (L. Young, Ed. New York: Academic, 1966), and Multi-Access Computing: Modern Research and Requirements (P. H. Rosenthal and R. K. Mish, Eds. New York: Hayden, 1974). He has published numerous papers in his fields of interest.
Dr. Pease is a member of Phi Beta Kappa and Sigma Xi.