United States Patent [19]
Campbell et al.

[11] Patent Number: 5,021,947
[45] Date of Patent: Jun. 4, 1991

[54] DATA-FLOW MULTIPROCESSOR ARCHITECTURE WITH THREE DIMENSIONAL MULTISTAGE INTERCONNECTION NETWORK FOR EFFICIENT SIGNAL AND DATA PROCESSING

[75] Inventors: Michael J. Campbell; Dennis J. Finn; George K. Tucker, all of Los Angeles; Michael D. Valley, Manhattan Beach; Rex W. Vedder, Playa del Rey, all of Calif.

[73] Assignee: Hughes Aircraft Company, Los Angeles, Calif.

[21] Appl. No.: 474,707

[22] Filed: Jan. 30, 1990

Related U.S. Application Data

[63] Continuation of Ser. No. 145,033, Jan. 19, 1988, abandoned, which is a continuation-in-part of Ser. No. 847,087, Mar. 31, 1986, abandoned.

[51] Int. Cl.5 .................. G06F 15/82
[52] U.S. Cl. .................. 364/200; 364/232.22; 364/280.1; 364/280.4; 364/281.6; 364/228.7; 364/281.8; 382/49
[58] Field of Search .......... 364/200 MS File, 900 MS File; 364/736; 382/49

[56] References Cited

U.S. PATENT DOCUMENTS

3,614,745 10/1971 Podvin et al. .......... 364/200
3,978,452  8/1976 Barton et al. .......... 364/200
4,145,733  3/1979 Misunas et al. ......... 364/200
4,153,933  5/1979 Dennis et al. .......... 364/200
4,379,326  4/1983 Anastas et al. ......... 364/200
4,418,383 11/1983 Doyle et al. ........... 364/200
4,447,875  5/1984 Bolton et al. .......... 364/200
4,466,061  8/1984 DeSantis et al. ........ 364/200
4,583,164  4/1986 Tolle .................. 364/200
4,644,461  2/1987 Jennings ............... 364/200
4,727,487  2/1988 Masui et al. ........... 364/200
4,922,413  5/1990 Stoughton et al. ....... 364/200

OTHER PUBLICATIONS

A Distributed VLSI Architecture for Efficient Signal and Data Processing by J. Gaudiot et al., IEEE Transactions on Computers, 12/85, pp. 1072-1086.
The Hughes Data Flow Multiprocessor: Architecture for Efficient Signal and Data Processing by R. Vedder et al., International Symposium on Computer Architecture, Conference Proceedings, Jun. 1985, pp. 324-332.
The Hughes Data Flow Multiprocessor by R. Vedder et al., Proceedings of the International Conference on Distributed Computing Systems, May 1985, pp. 2-9.
Static Allocation for a Data Flow Multiprocessor by M. L. Campbell, Proceedings of the International Conference on Parallel Processing, 8/85, pp. 511-517.

Primary Examiner--Thomas C. Lee
Attorney, Agent, or Firm--William J. Streeter; Wanda K. Denson-Low

[57] ABSTRACT

A data-flow architecture and software environment for high-performance signal and data processing. The programming environment allows applications coding in a functional high-level language 20 which a compiler 30 converts to a data-flow graph form 40 which a global allocator 50 then automatically partitions and distributes to multiple processing elements 80, or, in the case of smaller problems, coding in a data-flow graph assembly language so that an assembler 15 operates directly on an input data-flow graph file 13 and produces an output which is then sent to a local allocator 17 for partitioning and distribution. In the former case a data-flow processor description file 45 is read into the global allocator 50, and in the latter case a data-flow processor description file 14 is read into the assembler 15. The data-flow processor 70 consists of multiple processing elements 80 connected in a three-dimensional bussed packet routing network. Data enters and leaves the processor 70 via input/output devices 90 connected to the processor. The processing elements are designed for implementation in VLSI (very large scale integration) to provide realtime processing with very large throughput. The modular nature of the computer allows adding more processing elements to meet a range of throughput and reliability requirements. Simulation results have demonstrated high-performance operation, with over 64 million operations per second being attainable using only 64 processing elements.

12 Claims, 16 Drawing Sheets
FIG. 1 (Sheet 1 of 16). [Software environment block diagram: a high-level language program file 20 (with an example conditional expression) feeds the HDFL compiler, producing a data flow graph; a data flow processor description file feeds the global allocator, producing an allocated data flow graph 60 for the data flow processor and its input/output devices. An alternate path runs from data flow graph assembly language, with its own data flow processor description file, through an assembler to the local allocator.]
FIG. 2 (Sheet 2 of 16). [Schematic of the three-dimensional bussed packet routing network: stacked processing element planes (PLANE 1, PLANE 2, ..., I/O PLANE) with I/O controllers.]

FIG. 6 (Sheets 2-3). [Examples of primitive actors implemented directly in hardware, including a FORWARD actor whose value V is acknowledged (ACK) and sent to address A.]
FIG. 3 (Sheet 3 of 16). [Packet formats. 16-bit data packet: WORD 1, packet type and PE address (plane, column, row); WORD 2, template address; WORD 3, data. 32-bit data packet: WORDS 1-2 as above; WORD 3, data (LS); WORD 4, data (MS).]

FIG. 5 (Sheet 3 of 16). [Mapping of templates and arrays into physical memories: template memory (opcode and operand slots), fire detect memory, destination memory (status and result destination), template destination list overflow, result queue overflow, distributed arrays, local arrays, and associated state information.]
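For illustration, the packet layout of FIG. 3 can be modeled as below; this is a sketch in Python, and the field grouping is an assumption, since the figure gives the bit-level encoding only in part:

    # Sketch of a FIG. 3 style token packet: word 1 carries the packet
    # type and the PE address (plane, column, row), word 2 the template
    # address, and words 3-4 the data, split into least- and
    # most-significant halves for a 32-bit datum. Exact field widths are
    # deliberately omitted here.
    def make_token_packet(ptype, plane, col, row, template_addr, datum):
        data_words = [datum & 0xFFFF]            # word 3: data (LS)
        if datum > 0xFFFF:
            data_words.append(datum >> 16)       # word 4: data (MS)
        return {"type": ptype,
                "pe_address": (plane, col, row),     # word 1 fields
                "template_address": template_addr,   # word 2
                "data": data_words}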
FIG. 4 (Sheet 4 of 16). [Block diagram of a processing element 80: communication section with input and output queues, destination queues and destination tagger, result queue, processor with arithmetic logic unit, memory controller, and template memory.]
[Sheets 5-11 of 16: FIG. 7, a compiler-generated data flow graph example, and FIGS. 8-16, simulation results plots of throughput (MOPS), ALU busy time, percentage of local packets, result queue (RQ) length in packets, and communication packet latency, each versus number of processing elements.]
FIG. 18 (Sheet 12 of 16). [Compiler output: the syntactic parse tree (program and functions) linked to the data flow graph produced by the compiler.]
FIG. 19 (Sheet 13 of 16). Flowchart of the Local Allocator routine that statically assigns actors to PEs:

    INPUT: 1. DATA FLOW GRAPH PROGRAM (ACTORS)
           2. SET OF PEs = P[1] ... P[M]
    TOPOLOGICAL SORT
    TRANSITIVE CLOSURE
    I := 1
      J := 1
        CC := COMM_COST(A[I], P[J])
        PC := PROC_COST(A[I], P[J])
        AC := ARAC_COST(A[I], P[J])
        TEMP_COST := (C1 * CC) + (C2 * PC) + (C3 * AC)
        IF (TEMP_COST < MIN_COST) THEN
          MIN_COST := TEMP_COST
          MIN_PE := J
      NEXT J
      ADD A[I] TO LIST OF ACTORS ALLOCATED TO P[MIN_PE]
    RETURN
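A runnable rendering of this flowchart is sketched below in Python. The callbacks comm_cost, proc_cost, and arac_cost stand in for the flowchart's COMM_COST, PROC_COST, and ARAC_COST heuristics, whose internals the figure does not define, and c1, c2, c3 are the weighting constants:

    # Sketch of FIG. 19: greedy, cost-driven static assignment of actors
    # to processing elements. The cost callbacks are assumed interfaces,
    # not definitions taken from the patent.
    def allocate_actors(actors, pes, comm_cost, proc_cost, arac_cost,
                        c1=1.0, c2=1.0, c3=1.0):
        """actors: data flow graph actors in topologically sorted order;
        pes: the PEs P[1] ... P[M]. Returns {pe: [actors assigned]}."""
        assignment = {pe: [] for pe in pes}
        for actor in actors:
            min_cost, min_pe = float("inf"), None
            for pe in pes:                      # evaluate every candidate PE
                cost = (c1 * comm_cost(actor, pe, assignment)
                        + c2 * proc_cost(actor, pe, assignment)
                        + c3 * arac_cost(actor, pe, assignment))
                if cost < min_cost:
                    min_cost, min_pe = cost, pe
            assignment[min_pe].append(actor)    # ADD A[I] TO P[MIN_PE]
        return assignment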
FIG. 20 (Sheet 14 of 16). Flowchart of the Local Allocator routine that statically assigns arrays to the distributed memory:

    INPUT: 1. DATA FLOW GRAPH PROGRAM WITH DATA ARRAYS = AR[1] ... AR[R]
           2. SET OF PEs = P[1] ... P[M]
    FOR J := 1 TO M: INITIALIZE MEM_AVAIL[J] := MEM_PER_PE
    COUNT ACTORS THAT ACCESS AR[I] AND ASSIGN TO NUM_ACTORS
    (local case) ADD AR[I] TO LIST OF ARRAYS ALLOCATED TO NEXT AVAILABLE
      PE[J] SUCH THAT MEM_AVAIL[J] > SIZE(AR[I])
    (distributed case)
      NUM_PEs := 2 ^ FLOOR(LOG2(NUM_ACTORS))
      ELTS_PER_PE := SIZE(AR[I]) / NUM_PEs
      CHOOSE NEXT AVAILABLE BLOCK OF NUM_PEs PEs (IF ANY) SUCH THAT
        EACH PE[J] SATISFIES MEM_AVAIL[J] > ELTS_PER_PE
      IF NO BLOCK AVAILABLE: NUM_PEs := NUM_PEs / 2 AND RETRY
      FOR EACH PE[J] IN BLOCK:
        MEM_AVAIL[J] := MEM_AVAIL[J] - ELTS_PER_PE
    RETURN
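The distributed case sketches as follows in Python; find_block is a hypothetical helper standing in for the flowchart's search for an available block of PEs, and the local-versus-distributed decision is left to the caller, since that branch of the flowchart is only partly legible:

    import math

    # Sketch of FIG. 20's distribution step: spread one array over a
    # power-of-two block of PEs, halving the block size until one fits.
    def allocate_array(size, num_actors, mem_avail, find_block):
        """mem_avail: per-PE free memory, updated in place;
        find_block(n, need): assumed helper returning a list of n PE
        indices, each with mem_avail > need, or None if no block exists."""
        num_pes = 2 ** int(math.floor(math.log2(max(num_actors, 1))))
        while num_pes >= 1:
            elts_per_pe = size / num_pes
            block = find_block(num_pes, elts_per_pe)
            if block is not None:
                for j in block:                  # reserve memory on each PE
                    mem_avail[j] -= elts_per_pe
                return block
            num_pes //= 2                        # no block fits: halve, retry
        return None                              # nothing available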
FIG. 21 (Sheet 15 of 16). Flowchart of the Global Allocator routine that builds the module graph:

    INPUT: A NODE, G, OF THE PARSE TREE WITH A POINTER TO A LIST OF
      CHILDREN: G[1] ... G[N]
      (INITIALLY, G IS THE ROOT OF THE TREE, WHICH REPRESENTS THE
      ENTIRE DATA FLOW GRAPH PROGRAM)
    OUTPUT: A MODULE GRAPH WITH NODES G[1] ... G[N] AND ARCS BETWEEN
      NODES DEFINED BY THE FOLLOWING ALGORITHM
    I := 1
    FORM A LIST A[1] ... A[M] OF ACTORS UNDER NODE G[I], BY FOLLOWING
      THE POINTERS DOWN THE TREE TO THE LEAVES
    LET THE DESTINATION-LIST OF ACTOR A[J] BE ACTORS X[1] ... X[D]
      (ACTORS RECEIVING INPUTS FROM A[J])
    K := 1
    FOLLOW THE POINTERS UP THE TREE FROM ACTOR X[K] TO NODE G[L]
      FOR SOME L: 1 <= L <= N
    IF (L != I) THEN ADD AN EDGE INTO THE MODULE GRAPH FROM G[I] TO G[L]
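In Python this reads as below; the tree-navigation callbacks actors_under, owner, and destinations are assumed interfaces over the parse tree and destination lists the flowchart describes:

    # Sketch of FIG. 21: an edge G[i] -> G[l] is recorded whenever some
    # actor under G[i] sends a result to an actor under G[l], with i != l.
    def build_module_graph(children, actors_under, owner, destinations):
        """children: parse-tree nodes G[1] ... G[N];
        actors_under(g): actors in g's subtree (pointers followed down
        to the leaves); owner(x): the child node whose subtree holds
        actor x (pointers followed up the tree); destinations(a): actors
        receiving inputs from a. Returns the set of module-graph edges."""
        edges = set()
        for g in children:
            for a in actors_under(g):
                for x in destinations(a):    # actors fed by a
                    g_dest = owner(x)
                    if g_dest is not g:      # inter-module dependency
                        edges.add((g, g_dest))
        return edges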
FIG. 22 (Sheet 16 of 16). Flowchart of the Global Allocator routine that assigns modules to clusters of PEs:

    INPUT: 1. MODULE GRAPH FROM GRAPH PARTITIONER (ACTORS),
              WITH K TEMPLATES
           2. SET OF PEs = P[1] ... P[M]
    TOPOLOGICAL SORT
    TRANSITIVE CLOSURE
    I := 1
    COLLECT ONE CANDIDATE CLUSTER OF PEs FOR G[I]:
      TOTAL_TEMPLATES := 0
      CHOOSE NEXT AVAILABLE PE[J] ON LIST SUCH THAT:
        TEMPLATE_MEM_AVAIL[J] > GOAL
      ADD PE[J] TO LIST FOR THIS CANDIDATE CLUSTER
      TOTAL_TEMPLATES := TOTAL_TEMPLATES + GOAL
      REPEAT UNTIL TOTAL_TEMPLATES >= K
    EVALUATE EACH CANDIDATE CLUSTER AND CHOOSE THE LOWEST COST
    CLUSTER FOR G[I]:
      CC := COMM_COST(G[I], CLUSTER)
      PC := PROC_COST(G[I], CLUSTER)
      AC := ARAC_COST(G[I], CLUSTER)
      TEMP_COST := (C1 * CC) + (C2 * PC) + (C3 * AC)
      IF (TEMP_COST < MIN_COST) THEN
        MIN_COST := TEMP_COST
        MIN_CLUSTER := THIS CLUSTER
      NEXT CLUSTER
    ASSIGN LIST OF PEs IN MIN_CLUSTER TO G[I]
    RETURN

(This routine applies the same weighted-cost selection as FIG. 19, at the granularity of modules and clusters rather than actors and individual PEs.)
DATA-FLOW MULTIPROCESSOR ARCHITECTURE WITH THREE DIMENSIONAL MULTISTAGE INTERCONNECTION NETWORK FOR EFFICIENT SIGNAL AND DATA PROCESSING

RELATED APPLICATION

This is a continuation of application Ser. No. 145,033, filed Jan. 19, 1988, now abandoned, which is a continuation-in-part of application Ser. No. 06/847,087, filed Mar. 31, 1986, now abandoned.

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates to methods and apparatus for performing high-speed digital computations of programmed large-scale numerical and logical problems, in particular to such methods and apparatus making use of data-flow principles that allow for highly parallel execution of computer instructions and calculations.

2. Description of the Technology

In order to meet the computing requirements of future applications, it is necessary to develop architectures that are capable of performing billions of operations per second. Multiprocessor architectures are widely accepted as a class of architectures that will enable this goal to be met for applications that have sufficient inherent parallelism.

Unfortunately, the use of parallel processors increases the complexity of programming: the program must be partitioned into concurrently executable processes distributed among the multiple processors, and asynchronous control must be provided for parallel process execution and inter-process communication. An applications programmer must partition and distribute his program to multiple processors, and explicitly coordinate communication between the processors or shared memory.

Applications programming is extremely expensive even using current single-processor systems, and is often the dominant cost of a system. Software development and maintenance costs are already very high without programmers having to perform the additional tasks described above. High-performance multiprocessor systems for which software development and maintenance costs are low must perform the extra tasks required for the programmer and be programmable in a high-level language.

There are different classes of parallel processing architectures that may be used to obtain high performance. Systolic arrays, tightly coupled networks of von Neumann processors, and data flow architectures are three such classes.

Systolic arrays are regular structures of identical processing elements (PEs) with interconnection between PEs. High performance is achieved through the use of parallel PEs and highly pipelined algorithms. Systolic arrays are limited in the applications for which they may be used. They are most useful for algorithms which may be highly pipelined to use many PEs whose intercommunications may be restricted to adjacent PEs (for example, array operations). In addition, systolic arrays have limited programmability. They are "hardwired" designs in that they are extremely fast, but inflexible. Another drawback is that they are limited to using local data for processing. Algorithms that would require access to external memories between computations would not be suitable for systolic array implementation.

Tightly coupled networks of von Neumann processors typically have the PEs interconnected using a communication network, with each PE being a microprocessor having local memory. In addition, some architectures provide global memory between PEs for interprocessor communication. These systems are best suited for applications in which each parallel task consists of code that can be executed efficiently on a von Neumann processor (i.e., sequential code). They are not well suited for taking full advantage of low-level (micro) parallelism that may exist within tasks. When used for problems with low-level parallelism they typically give rise to large ALU (arithmetic and logical unit) idle times.

Data flow multiprocessor architectures based on the data flow graph execution model implicitly provide for asynchronous control of parallel process execution and inter-process communication, and when coupled with a functional high-level language can be programmed as a single PE, without the user having to explicitly identify parallel processes. They are better suited to taking advantage of low-level parallelism than von Neumann multiprocessor architectures.

The data flow approach, as opposed to the traditional control flow computational model (with a program counter), lets the data dependencies of a group of computational operations determine the sequence in which the operations are carried out. A data flow graph represents this information using nodes (actors) for the operations and directed arcs for the data dependencies between actors. The output result from an actor is passed to other actors by means of data items called tokens which travel along the arcs. The actor execution, or firing, occurs when all the actor's input tokens are present on its input arcs. When the actor fires, or executes, it uses up the tokens on its input arcs, performs its intended operation, and puts result tokens on its output arcs. When actors are implemented in an architecture they are called templates. Each template consists of slots for an opcode, operands, and destination pointers, which indicate the actors to which the results of the operation are to be sent.
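As an illustration of this execution model (a minimal sketch in Python, not the hardware mechanism of the present invention), a template fires when its last empty operand slot is filled, consumes its operands, and forwards the result to each destination slot:

    # Sketch of template firing: opcode, operand slots, and destination
    # pointers, as described above. The opcode table is an assumption.
    OPCODES = {"+": lambda a, b: a + b, "*": lambda a, b: a * b}

    class Template:
        def __init__(self, opcode, num_operands, destinations=()):
            self.opcode = opcode                  # operation to perform
            self.slots = [None] * num_operands    # operand slots awaiting tokens
            self.destinations = list(destinations)  # (template, slot) pointers

        def receive(self, slot, token):
            """Deposit a token; fire once every operand slot is full."""
            self.slots[slot] = token
            if all(s is not None for s in self.slots):
                result = OPCODES[self.opcode](*self.slots)
                self.slots = [None] * len(self.slots)   # consume input tokens
                for dest, dest_slot in self.destinations:
                    dest.receive(dest_slot, result)     # emit result tokens

    # Example: (2 + 3) * 4 as two chained templates.
    mul = Template("*", 2)
    add = Template("+", 2, [(mul, 0)])
    mul.receive(1, 4)
    add.receive(0, 2)
    add.receive(1, 3)   # add fires with 5, then mul fires with result 20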
described above. High-performance multiprocessor
The data flow graph representation of an algorithm is
the data dependency graph of the algorithm. The nodes
quired for the programmer and be programmable in a 50 in the graph represent the operators (actors) and the
systems for which software development and mainte
nance costs are low must performthe extra tasks re
directed arcs connecting the nodes represent the data
high-level language.
paths by which operands (tokens) travel between oper
There are different classes of parallel processing ar
chitectures that may be used to obtain high perfor
mance. Systolic arrays, tightly coupled networks of von
Neumann processors, and data flow architectures are 55
available, the actor may “?re" by consuming its input
tokens, performing its operation on them, and produc
three such classes.
ing some output tokens. In most definitions of data flow
ands (actors). When all the input tokens to an actor are
a restriction is placed on the arcs and actors so that an
Systolic arrays are regular structures of identical
arc may have at most one input token on it at a time.
processing elements (PEs) with interconnection be
This implies that an actor may not ?re unless all of its
tween PEs. High performance is achieved through the
use of parallel PEs and highly pipelined algorithms. 60 output arcs are empty. A more general de?nition allows
for each arc to be an in?nite queue into which tokens
Systolic arrays are limited in the applications for which
they may be used. They are most useful for algorithms
which may be highly pipelined to use many PEs whose
intercommunications may be restricted to adjacent Pes
(for example, array operations). In addition, systolic
arrays have limited programmability. They are “hard
wired" designs in that they are extremely fast, but in
?exible. Another drawback is that they are limited to
may be placed.
All data flow architectures consist of multiple pro
cessing elements that execute the actors in the data flow
65
graph. Data ?ow architectures take advantage of the
inherent parallelism in the data flow graph by executing
in separate PEs those actors that may ?re in parallel.
Data flow control is particularly attractive because it
In U.S. Pat. No. 3,962,706--Dennis et al., a data processing apparatus for the highly parallel execution of stored programs is disclosed. Unlike the present invention, the apparatus disclosed makes use of a central controller and global memory and therefore suffers from the limitations imposed by such an architecture.

U.S. Pat. No. 4,145,733--Misunas et al. discloses a more advanced version of the data processing apparatus described in U.S. Pat. No. 3,962,706. However, the apparatus disclosed still contains the central control and global memory that distinguish it from the present invention.

U.S. Pat. No. 4,153,933--Dennis et al. discloses another version of the apparatus disclosed in the previous two patents, distinguished by the addition of a new network apparently intended to facilitate expandability, but not related to the present invention.

In U.S. Pat. No. 4,418,383--Doyle et al. a large-scale integration (LSI) data flow component for processor and microprocessor systems is described. It bears no substantive relation to the processing element of the present invention, nor does it teach anything related to the data flow architecture of the present invention.

None of the inventions disclosed in the patents referred to above provides a processor designed to perform image and signal processing algorithms and related tasks that is also programmable in a high-level language which allows exploiting a maximum of low-level parallelism from the algorithms for high throughput.

The present invention is designed for efficient realization with advanced VLSI circuitry using a smaller number of distinct chips than other data flow machines. It is readily expandable and uses short communication paths that can be quickly traversed for high performance. Previous machines lack the full capability of the present invention for large-throughput realtime applications in data and signal processing in combination with the easy programmability in a high-level language.

The present invention aims specifically at providing the potential for performance of signal processing problems and the related data processing functions including tracking, control, and display processing on the same processor. An instruction-level data flow (micro data flow) approach and compile time (static) assignment of tasks to processing elements are used to get efficient runtime performance.

SUMMARY OF THE INVENTION

The present invention is a data flow architecture and software environment for high performance signal and data processing. The programming environment allows applications coding in a functional high-level language, the Hughes Data Flow Language, which is compiled to a data flow graph form which is then automatically partitioned and distributed to multiple processing elements. As an alternative for smaller scale problems and for simulation studies, a data flow graph language assembler and local allocator allow programming directly in a data flow graph form.

The data flow architecture consists of many processing elements connected by a three-dimensional bussed packet routing network. The processing elements are designed for implementation in VLSI (very large scale integration) to provide realtime processing with very large throughput. The modular nature of the data-flow processor allows adding more processing elements to meet a range of throughput and reliability requirements. Simulation results have demonstrated high-performance operation.

Accordingly, it is one object of the present invention to provide a data-flow multiprocessor that is a high-performance, fault-tolerant processor which can be programmed in a high-level language for large throughput signal and data processing applications.

It is another object of the present invention, based on data-flow principles so that its operation is data-driven rather than instruction-driven, that it allow fast, highly parallel computation in solving complex, large-scale problems.

It is a further object of the present invention to provide an interconnection network for the multiple processing elements that is split up into as many components as there are processing elements, so that the communication network is equally distributed over all the processing elements, and if there are N processing elements, each processing element has 1/N of the interconnection network to support it.

It is yet another object of the present invention to provide for static allocation (that is, at compile time) of the program to the processing elements.

It is still another object of the present invention to use processing elements which have been designed for implementation using only two distinct chips in VLSI (very large scale integration) to provide realtime processing with very large throughput.

Finally, it is an object of the present invention that the modular nature of the data-flow processor allow adding more processing elements to meet a range of throughput and reliability requirements.

An appreciation of other aims and objects of the present invention and a more complete and comprehensive understanding of this invention may be achieved by studying the following description of a preferred embodiment and by referring to the accompanying drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a schematic block diagram of the present invention, with illustrative information about some of its parts to the right of the block diagram.

FIG. 2 is a schematic representation of how the processing elements are connected together in a three-dimensional bussed packet routing network.

FIG. 3 shows how a data packet consists of a packet type, a PE and template address, and one or more data words.

FIG. 4 illustrates the organization of a processing element in schematic block diagram form.

FIG. 5 shows how templates and arrays are mapped into physical memories.

FIG. 6 gives examples of some primitive actors which are implemented directly in hardware.

FIG. 7 shows an example of a compiler generated data flow graph corresponding to the expression "if y1<=y2 then y2*2+1 else y1+x0 endif".
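For a sense of what such a graph contains (FIG. 7 itself is not reproduced here), the expression can be rendered as actors and arcs along the following lines; the switch/merge conditional convention is a common data flow idiom and an assumption, not necessarily the figure's exact actor set:

    # Hypothetical actor/arc listing for "if y1 <= y2 then y2*2+1 else
    # y1+x0 endif": a comparison actor steers each value through a
    # switch, the chosen branch computes, and a merge emits the result.
    actors = {
        "cmp":   ("<=",     ["y1", "y2"]),            # predicate token
        "sw1":   ("switch", ["y1", "cmp"]),           # route y1 by predicate
        "sw2":   ("switch", ["y2", "cmp"]),           # route y2 by predicate
        "mul":   ("*",      ["sw2.true", "const 2"]),
        "add1":  ("+",      ["mul", "const 1"]),      # then: y2*2+1
        "add2":  ("+",      ["sw1.false", "x0"]),     # else: y1+x0
        "merge": ("merge",  ["add1", "add2", "cmp"]), # single result token
    }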
FIG. 8 is a simulation results graph of throughput for the program radar3na (in millions of instructions per second) versus the number of processing elements. The curve marked "A" is for a random allocation algorithm, the curve marked "B" is for an allocation algorithm using transitive closure, and the curve marked "C" is for an allocation algorithm using nontransitive closure.
FIG. 9 is a plot of simulation results for the program radarb. The ordinate represents throughput in MIPS and the abscissa represents number of processing elements. The lower curve is for a random allocation algorithm and the upper curve is for a nontransitive closure allocation algorithm.

FIG. 10 is a simulation results graph of percentage of time the ALU is busy versus the number of processing elements for the program radar3na. The solid curves marked "D" and "G" are average ALU busy time and maximum ALU busy time, respectively, for a transitive closure allocation algorithm. The curves marked "E" and "F" are average ALU busy time and maximum ALU busy time, respectively, for a nontransitive closure allocation algorithm.

FIG. 11 shows percentage of time the ALU is busy in simulation of the program radarb versus number of processing elements, using a nontransitive closure allocation algorithm. The lower curve is for average ALU busy time and the upper curve is for maximum ALU busy time.

FIG. 12 is a graph of percentage of maximum achieved throughput versus percentage of time the average ALU is busy. The solid circles are from simulation results for the program radarb using a nontransitive closure allocation algorithm. The x-symbols and open circles are for the program radar3na using transitive closure and nontransitive closure allocation algorithms, respectively.

FIG. 13 is a plot of the percentage of packet communication that is local (intra-PE as opposed to inter-PE) versus number of processing elements for the program radar3na. The lower curve is for a transitive closure allocation algorithm and the upper curve is for a nontransitive closure allocation algorithm.

FIG. 14 is a plot of the percentage of packet communication that is local (intra-PE as opposed to inter-PE) versus number of processing elements for the program radarb using a nontransitive closure allocation algorithm.

FIG. 15 is a graph of the length (in packets) of the result queue versus number of processing elements for the nontransitive closure allocation of the program radarb. The lower curve is average queue length and the upper curve is maximum queue length.

FIG. 16 is a plot of average communication packet latency (in clock cycles) versus number of processing elements for nontransitive closure allocation of the program radarb.

FIG. 17 is a recursive flow chart that describes the global allocation algorithm.

FIG. 18 is a schematic picture of the output of the high-level language compiler which is the input to the global allocator, showing that the syntactic parse tree from the compiler is linked to the instructions produced by the compiler in the form of a data flow graph.

FIG. 19 is a flowchart of the portion of a Local Allocator that statically allocates actors to PEs for maximal performance.

FIG. 20 is a flowchart of the portion of the Local Allocator that statically allocates arrays to the distributed memory.

FIG. 21 is a flowchart of the portion of a Global Allocator that finds data dependencies between program modules (as identified by a compiler generated parse tree), and wherein dependencies between modules are represented in a module graph as a set of edges between the modules with dependencies.

FIG. 22 is a flowchart of the portion of the Global Allocator that uses the module graph of FIG. 21 and a set of cost heuristics like those of the Local Allocator of FIG. 19 to allocate modules to clusters of PEs in the same manner as the Local Allocator allocates actors to individual PEs.

DESCRIPTION OF A PREFERRED EMBODIMENT

FIG. 1 is a schematic block diagram of the present invention, a data flow architecture and software environment 10 for high-performance signal and data processing. The programming environment allows applications coding in a functional high-level language which results in a program file 20 which is input into a compiler 30 which converts it to a data flow graph form 40 which a global allocator 50 then automatically partitions and distributes to multiple processing elements 80. In the case of smaller problems, programming can be done in a data flow graph and assembled by an assembler 15 that operates directly on an input data flow graph file 13 whose output is then sent to a local allocator 17 for partitioning and distribution. In the former case a data flow processor description file 45 is read into the global allocator 50 and in the latter case a data flow processor description file 14 is read into the assembler 15. The data flow processor 70 consists of many processing elements 80 connected in a three-dimensional bussed packet routing network. Data enters or leaves the processor 70 by means of input/output devices 90 connected to the processor.

THE THREE DIMENSIONAL BUSSED NETWORK

As shown in FIG. 2, the data flow processor 70 comprises 1 to 512 identical processing elements connected by a global inter-PE communication network. This network is a three-dimensional bussed network in which the hardware implements a fault-tolerant store-and-forward packet-switching protocol which allows any PE to transfer data to any other PE. Each processing element contains queues for storing packets in the communication network, and the appropriate control for monitoring the health of the processing elements and performing the packet routing.

In the three-dimensional bussed interconnection network not all processing elements are directly connected, so a store-and-forward packet routing technique is used. This algorithm is implemented in the communication chip 81 (FIG. 4), which connects a processing chip 80 to the row, column, and plane buses 82, 84, 86. The communication chip 81 acts like a crossbar in that it takes packets from its four input ports and routes them to the appropriate output ports. In addition, it provides buffering with a number of first-in-first-out queues, including the processor input queue 112 and processor output queue 114.
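By way of illustration only (the specification does not fix the routing order, so the plane-then-column-then-row sequence below is an assumption), one store-and-forward step on this network can be sketched as:

    # Sketch of a store-and-forward routing step on the 3-D bussed
    # network: the packet advances along one bus at a time until every
    # coordinate of its PE address matches the local PE.
    def route_step(here, dest):
        """here, dest: (plane, col, row) PE addresses. Returns the bus
        to forward on, or None when the packet has arrived."""
        for axis, bus in enumerate(("plane", "column", "row")):
            if here[axis] != dest[axis]:
                return bus          # forward one hop along this bus
        return None                 # deliver to the processor input queue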
The three-dimensional bussed network is optimized for transmission of very short packets consisting of a single token. As shown in FIG. 3, each packet consists of a packet type, an address, and a piece of data. Different types of packets include normal token packets, initialization packets, and special control packets for machine reconfiguration control. The address of each packet consists of a processing element address and a template address which points to one particular actor instruction within a processing element. The data can be any of the allowed data types of the high-level data flow language.