ARCHITECTURAL SUPPORT FOR THE EFFICIENT GENERATION
OF CODE FOR HORIZONTAL ARCHITECTURES

B.R. Rau, C.D. Glaeser, E.M. Greenawalt

Advanced Processor Technology Laboratory
ESL Inc.
San Jose, California
ABSTRACT

The polycyclic architecture [3-6] is a horizontal architecture with architectural support for the scheduling task. The cause of the complexity involved in scheduling conventional horizontal processors, and the manner in which the polycyclic architecture addresses this issue, are outlined in this paper.
1.  INTRODUCTION
Horizontal architectures, such as the CDC
Advanced Flexible Processor [1] and the FPS AP-120B
[2], consist of a number of resources that can
operate in parallel, each of which is controlled by
a field in the wide instruction word.
Such architectures have been developed to perform high speed
scientific computations at a modest cost. Figure 1
displays those characteristics of horizontal architectures that are germane to the issues discussed in
this paper. The simultaneous requirements of high
performance and low cost lead to an architecture
consisting of multiple pipelined processing elements (PEs) such as adders and multipliers, a memory
(which for scheduling purposes may be viewed as yet
another PE with two operations:
a READ and a
WRITE), and an interconnect which ties them all
together. The interconnect allows the result of one
operation to be directly routed to another PE as one
of the inputs for an operation that is to be
performed there.
The required memory bandwidth is
reduced since temporary values need not be written
to and read from the memory.
The final aspect of
horizontal processors that is of interest is that
their program memories emit wide instructions which
synchronously specify the actions of the multiple
and possibly dissimilar PEs. The program memory is
sequenced by a conventional sequencer that assumes
sequential flow of control unless a branch is
explicitly specified.
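As a rough illustration of such an instruction format, the following sketch models a wide instruction word with one control field per resource and a sequencer that assumes sequential flow unless a branch is specified. All field and function names here are hypothetical, invented for illustration; they are not taken from any of the machines cited above.

```python
# Hypothetical model of a horizontal instruction word: one control
# field per resource, all issued synchronously in the same cycle.
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class HorizontalInstruction:
    adder_op: Optional[str] = None               # e.g. "ADD"; None means a no-op
    multiplier_op: Optional[str] = None          # e.g. "MUL"
    memory_op: Optional[Tuple[str, int]] = None  # ("READ", addr) or ("WRITE", addr)
    branch_target: Optional[int] = None          # sequential flow unless set

def next_pc(pc: int, instr: HorizontalInstruction, taken: bool) -> int:
    # The sequencer assumes sequential control flow unless a branch
    # is explicitly specified in the instruction word.
    if instr.branch_target is not None and taken:
        return instr.branch_target
    return pc + 1
```

The point of the sketch is only that every resource is steered by its own field of the same word, so one instruction per cycle controls the whole machine.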
2.  THE POLYCYCLIC ARCHITECTURE
The polycyclic architecture is a horizontal
architecture with two special properties, which are
described below.
2.1  The Interconnect
The interconnect of a polycyclic processor
must have the following property: a dedicated delay
element exists between every resource output and
resource input that are directly connected to each
other.
This delay element enables a datum to be
delayed by an arbitrary amount of time in transit
between the corresponding output and input.
The topology of the interconnect may be arbitrary.
It is possible to design polycyclic processors with n resources in which the number of
delay elements is O(n) (a uni- or multi-bus structure), O(n log n) (e.g., delta networks [7]), or
O(n^2) (a crossbar). The trade-offs involve cost,
interconnect bandwidth and interconnect latency.
Thus, it is possible to design polycyclic processors
lying in various cost-performance brackets.
As a consequence of the simplicity of such an
architecture, it is inexpensive relative to the
potential performance of the multiple pipelined
PEs. However, if this potential performance is to
be realized, the multiple resources of a horizontal
processor must be scheduled effectively. The scheduling task for conventional horizontal processors
is quite complex, and the construction of highly
optimizing compilers for them is a difficult and
expensive project.
A sample polycyclic processor is shown in
Figure 2.
The topology of the interconnect is a
complete crossbar with a delay element at each
cross-point (thereby providing the polycyclic property).
The interconnect has two output ports
(columns) and one input port (row) for each of the
two PEs.
Each cross-point has a delay element,
similar to that described below, which is capable of
one read and one write each cycle.
A PE can
simultaneously distribute its output to any or all
of the delay elements which are in the row of the
interconnect corresponding to its output port. A PE
can obtain its left (right) input from any delay
element in the column of the interconnect that
corresponds to its left (right) input port.
If a
value is written into a delay element at the same
time that an attempt is made to read it, the value is
transmitted through the interconnect with no delay.
Any positive delay may be obtained merely by leaving
the datum in the delay element for a suitable length
of time.
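The timing behaviour just described, zero delay for a same-cycle read and any positive delay simply by reading in a later cycle, can be mimicked by a toy model. This is a behavioural sketch only, with invented names; it is not the hardware design of [3].

```python
# Toy model of one cross-point delay element's timing behaviour:
# a write and a read in the same cycle pass the value straight
# through, and any positive delay is obtained by leaving the datum
# in the element and reading it in a later cycle.
class CrossPoint:
    def __init__(self):
        self.pending = {}  # cycle at which a value was written -> value

    def write(self, cycle, value):
        self.pending[cycle] = value

    def read(self, cycle):
        # Return the most recent value written at or before `cycle`;
        # a same-cycle write is visible immediately (zero delay).
        eligible = [t for t in self.pending if t <= cycle]
        return self.pending[max(eligible)] if eligible else None
```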
Permission to copy without fee all or part of this material is granted
provided that the copies are not made or distributed for direct
commercial advantage, the ACM copyright notice and the title of the
publication and its date appear, and notice is given that copying is by
permission of the Association for Computing Machinery. To copy
otherwise, or to republish, requires a fee and/or specific permission.

© 1982 ACM 0-89791-066-4/82/03/0096 $00.75
Iterative computations (loops) account for a
major fraction of the execution time in scientific
computations and, so, merit special attention.
Although considerable work has been performed on
developing scheduling techniques for directed acyclic graphs, e.g., [11, 12], relatively few techniques are available for automatically scheduling
loops in a near-optimal manner. Subject to the data
dependencies between them, successive iterations of
a loop may be overlapped in any way that does not
result in conflict for the use of the resources.
This can significantly improve the throughput of the
processor. Algorithms for achieving this are presented in [ 4 ] . These algorithms require that the
number of times that each resource is used per
iteration be known before the scheduling of the
operations can begin.
This requirement has been
proven to be necessary for any scheduling algorithm
which yields an identical schedule for each iteration of the loop [4].
2.2  The Delay Elements
The structure of an individual delay element
consists of a shift register file, any location of
which may be read by providing an explicit read
address. Optionally, one may specify that the value
accessed be deleted. This option would be exercised
if this were the last access to that value.
The
result of doing so is that every value with an address
greater than that of the deleted value is
simultaneously shifted down, on the same clock pulse,
to the location with the next lower address.
Consequently, all values present in the delay element
are compacted into the lowest locations.
An incoming value is written into the lowest empty
location, which is pointed to by a hardware-maintained
Write Pointer. The Write Pointer
is automatically incremented each time a value is
written and is decremented each time one is deleted.
As a consequence of deletions, a value, during its
residence in the delay element, drifts down to lower
addresses, and is read from various locations before
it is itself deleted.
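The shift-and-compact behaviour described above can be summarized in a short behavioural sketch. The class and method names below are invented for illustration; this models only the externally visible behaviour, not the logic design of [3].

```python
# Behavioural sketch of one delay element: values enter at the lowest
# empty location, reads take an explicit address, and a read-with-delete
# compacts every higher value down one location on the same cycle.
class DelayElement:
    def __init__(self, depth=8):
        self.slots = []        # index 0 is the lowest location
        self.depth = depth     # capacity of the shift register file

    @property
    def write_pointer(self):
        # Maintained by hardware: points at the lowest empty location.
        return len(self.slots)

    def write(self, value):
        assert self.write_pointer < self.depth, "delay element full"
        self.slots.append(value)   # pointer increments on every write

    def read(self, address, delete=False):
        value = self.slots[address]
        if delete:
            # All values above the deleted one shift down one location,
            # keeping the contents compacted in the lowest addresses;
            # the write pointer is thereby decremented.
            del self.slots[address]
        return value
```

Note how a value "drifts" toward lower addresses as earlier values are deleted, exactly as the text describes.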
Due to resource and dependency constraints,
the two operations that create and use a particular
value may be scheduled arbitrarily far apart in
time. In other words, the value will be transmitted
at one point in time by the resource that creates
it, but must be received after an arbitrarily long
delay by the resource which is to use it.
This
necessitates the existence of a capability to delay
values while in transit between any source and any
destination.
A value's current position at each instant
during execution must be known by the compiler so
that the appropriate read address may be specified
by the program when the value is to be read. Keeping
track of this is a simple, if somewhat tedious, task
which is easily performed by a compiler during code generation.
This task and the advantage of this
unorthodox structure for the delay elements are
discussed in greater detail in [4, 6].
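One plausible form of this bookkeeping is for the code generator to replay the schedule's writes and deletes at compile time, so that it always knows each value's current address when it emits a read. The sketch below is ours, with invented names; the actual technique is the one detailed in [4, 6].

```python
# Compile-time bookkeeping sketch: the code generator mirrors the
# delay element's shift-and-compact behaviour on a list of value
# names, so each value's current address can be looked up when a
# read is emitted.
class PositionTracker:
    def __init__(self):
        self.order = []  # value names, lowest address first

    def record_write(self, name):
        self.order.append(name)

    def address_of(self, name):
        return self.order.index(name)

    def record_delete(self, name):
        # After a value's last use it is deleted, and every value at a
        # higher address drifts down by one; removal from the list
        # models that shift.
        self.order.remove(name)
```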
In conventional horizontal machines, the only
means for obtaining the necessary delays is through
the use of scratch-pad register files.
These
scratch-pads constitute resources and their usage
must be known before the scheduling can begin.
However, the number of times that the scratch-pads
need to be used, and the points in the computation
graph at which scratch-pad operations need to be
inserted, depend upon the schedule.
As a result of this circularity, it is not possible to specify a viable schedule in a deterministic manner [2, 4, 6]. Instead, one must first guess at the scratch-pad usage and then attempt to generate a schedule based on that estimate. If a schedule is not obtained, then another guess must be made, and this process must be repeated until a schedule is obtained. Effective heuristics to guide the compiler in "guessing" at the scratch-pad usage are difficult to formulate.

The description of a VLSI building block chip for polycyclic interconnects may be found in [3]. The functional specification and logic design of this chip, in TRW's triple diffusion bipolar technology, have been performed.

3.  FACTORS AFFECTING CODE GENERATION

It is useful, conceptually, to view the compilation process as consisting of two major steps although, in practice, these steps may be inextricably interwoven [8-10]. The first step consists of analyzing the source program to extract its structure and meaning. The complexity of this step is largely determined by the nature of the source language.

The second step, code generation, is one of synthesis. It involves many sub-tasks such as register allocation and assignment, code selection, operation scheduling, peephole optimization and instruction assembly. This step is affected mainly by the nature of the target machine's architecture.
The emphasis in this paper is on the scheduling
function which assumes considerably more importance
in a horizontal architecture than in conventional
serial architectures. The scheduling task consists
of determining, for each instruction cycle, which
operations are to be performed upon which resources.
This must be performed subject to constraints such
as the functionality of each resource, the data
dependencies in the computation and the availability of resources, and with the objective of
minimizing the execution time of the computation.
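The scheduling task just described can be sketched as a greedy list scheduler: each cycle, start every operation whose predecessors have completed and whose resource is free. This is an illustrative baseline only, with invented names, assuming an acyclic computation graph and one operation per resource per cycle; it is not the algorithm of [4].

```python
# Greedy list-scheduling sketch for a horizontal machine: assign
# operations to cycles subject to data dependencies and resource
# availability. Assumes `deps` describes an acyclic graph.
def list_schedule(ops, deps, resource_of, latency):
    # ops: operation names; deps: op -> set of predecessor ops;
    # resource_of: op -> resource name; latency: op -> cycles to finish.
    finish = {}        # op -> cycle its result becomes available
    schedule = {}      # op -> cycle in which it is started
    cycle = 0
    while len(schedule) < len(ops):
        busy = set()
        for op in ops:
            if op in schedule:
                continue
            ready = all(p in finish and finish[p] <= cycle for p in deps[op])
            if ready and resource_of[op] not in busy:
                schedule[op] = cycle
                finish[op] = cycle + latency[op]
                busy.add(resource_of[op])  # one op per resource per cycle
        cycle += 1
    return schedule
```

For example, with two independent operations on an adder and a multiplier and a third operation depending on both, the dependent operation starts only once the longer (multiplier) result is available.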
For all of these reasons, code generation for
horizontal processors is an ad hoc process.
Consequently, the code generator is a complex piece of
software that is difficult to construct, to comprehend and to maintain.
In fact, it has been
suggested that the efficient programming of a horizontal processor from a high level language is not
possible [13].
The remedy for this situation lies
in first formalizing the code generation process and
forming a better understanding of how horizontal
processors should be designed to conform to the
model of code generation. Processor designers must
then discipline themselves to restrict their designs to fall within the recommended class of
architectures.
At the same time, the designer should not make
the mistake of proposing a solution which makes code
generation trivial but results in terribly cost-ineffective utilization of the hardware. It is the
overall cost-effectiveness of the hardware-software
system that matters, and this is determined by the
weakest link in the chain. Thus, the designer must
balance performance and hardware cost against the
cost of developing a compiler for his processor and
the cost of laborious assembly language programming
when the compiler is found to be unsatisfactory in
its ability to generate high quality code.
4.  SCHEDULING OF THE POLYCYCLIC ARCHITECTURE

An explicitly scheduled resource is a resource which needs to be explicitly scheduled by the compiler if simultaneous, conflicting demands are not to be placed upon it. Such resources may include PEs, memories, register files and shared buses. Implicitly scheduled resources are those which are implicitly scheduled along with some explicitly scheduled resource. The demands upon the implicitly scheduled resources are guaranteed not to conflict if the demands upon the corresponding explicitly scheduled resources do not conflict. One example of an implicitly scheduled resource is a dedicated bus which has only one source. Since there can never be a conflict in its use, it can be viewed as having been scheduled along with its source. Another example of an implicitly scheduled resource is the last stage in a pipelined adder, since it is implicitly scheduled for use a constant number of cycles after the first stage (which is explicitly scheduled). Again, there can be no conflict in the use of the last stage if there is no conflict in the use of the first stage. For scheduling purposes these constitute part of the explicitly scheduled resource and may initially be ignored without any repercussions later in the scheduling process.

The key to the ease with which the polycyclic architecture can be scheduled lies in the fact that explicitly scheduled resources perform only those operations that are explicit in the computation graph. Their usage can, therefore, be determined trivially. This is in contrast to the case of the conventional processor, where the explicitly scheduled resources (the scratch-pads) had implicit and unknown demands placed upon them. All implicit operations (delay element usages) in the case of the polycyclic architecture use implicitly scheduled resources with the guarantee that no conflict can occur in so doing. Thus, these unknown, implicit operations may be ignored while scheduling.

As a result, the scheduling process may be divided into two phases. The first phase schedules the explicitly scheduled resources, and can do so in a deterministic manner [4] since the resource usage depends only upon the explicit operations in the computation. The second phase consists of the bookkeeping task of determining which delay elements are used and at what points in time. Once the operations that produce and use a value have been assigned to PEs by the operation scheduling phase, the dedicated delay element between those two PEs must be assigned to hold (and delay) that value. There is no other option. Furthermore, since only one PE may write into any given delay element and only one PE may read from it, there is no possibility of conflicting demands being placed upon the delay element. Hence, allocation of delay elements is not an issue since, in effect, they have been pre-allocated at the time that the hardware was built.

There are other problems involved in generating code for horizontal processors that are minimized by the structure of the delay elements, especially their shift-and-compact capability. These issues are discussed in [3, 4, 6].

As a consequence of the existence and the nature of the delay elements in the polycyclic architecture, the circular problem of determining the usage of resources with implicit demands upon them is eliminated. In turn, this minimizes the back-tracking needed when scheduling the overlapped execution of iterative computations. The polycyclic architecture addresses the conventional horizontal architecture's code generation problem by the use of extra hardware. However, the manner in which this extra hardware expense is utilized has been chosen very carefully. There are many ways of adding extra register files, and many different types of register files that could have been used, which would not have solved the fundamental problem. The determination of the precise manner in which the extra hardware was to be used so as to be effective was based on first formalizing the task, understanding the essence of the problem, and then providing the hardware that exactly met the need.

The polycyclic architecture has been designed to support code generation by simplifying the task of scheduling the resources of horizontal processors. The advantages that we expect to observe are:
1) that the scheduler portion of the compiler will be easier to implement,
2) that the code generated will be of a higher quality,
3) that the compiler will execute faster,
4) that the automatic generation of code generators might be facilitated.

REFERENCES

1) CDC Advanced Flexible Processor Microcode Cross Assembler (MICA) Reference Manual, Control Data Corp., Publication No. 77900500, Apr. 1980.

2) A.E. Charlesworth, "An approach to scientific array processing: the architectural design of the AP-120B/FPS-164 family", Computer, Vol. 14, No. 9, pp. 18-27, Sep. 1981.

3) B.R. Rau, P.J. Kuekes and C.D. Glaeser, "A statically scheduled VLSI interconnect for parallel processors", Proc. CMU Conference on VLSI Systems and Computations, Carnegie-Mellon Univ., Pittsburgh, Pennsylvania, pp. 389-395, Oct. 1981.

4) B.R. Rau and C.D. Glaeser, "Some scheduling techniques and an easily schedulable horizontal architecture for high performance scientific computing", Proc. 14th Annual Workshop on Microprogramming, pp. 183-198, Oct. 1981.

5) B.R. Rau, R.L. Picard, C.D. Glaeser and E.M. Greenawalt, "The polycyclic architecture: a statically scheduled data flow architecture", Computer (to appear).

6) B.R. Rau, C.D. Glaeser and E.M. Greenawalt, "Efficient code generation for horizontal architectures: compiler techniques and architectural support", (submitted for publication).

7) J.H. Patel, "Processor-memory interconnections for multiprocessors", Proc. 6th Annual Symposium on Computer Architecture, pp. 168-177, April 1979.

8) A.V. Aho and J.D. Ullman, Principles of Compiler Design, Addison-Wesley, New York, 1977.

9) W.A. Wulf, et al., The Design of an Optimizing Compiler, Elsevier, New York, 1975.

10) F.L. Bauer and J. Eickel (Eds.), Compiler Construction, An Advanced Course, Springer-Verlag, New York, 1976.

11) D. Landskov, S. Davidson and B. Shriver, "Local microcode compaction techniques", Computing Surveys, Vol. 12, No. 3, pp. 261-294, Sep. 1980.

12) M.J. Gonzalez, "Deterministic processor scheduling", Computing Surveys, Vol. 9, No. 3, pp. 173-204, Sep. 1977.

13) W.J. Karplus and D. Cohen, "Architectural and software issues in the design and application of peripheral array processors", Computer, Vol. 14, No. 9, pp. 11-17, Sep. 1981.

[Figure 1: A typical horizontal architecture, showing the instruction sequencer, program memory, horizontal instruction word, PEs and interconnect.]

[Figure 2: Relevant features of a polycyclic processor.]