ARCHITECTURAL SUPPORT FOR THE EFFICIENT GENERATION OF CODE FOR HORIZONTAL ARCHITECTURES

B.R. Rau, C.D. Glaeser, E.M. Greenawalt

Advanced Processor Technology Laboratory
ESL Inc.
San Jose, California

1. INTRODUCTION

Horizontal architectures, such as the CDC Advanced Flexible Processor [1] and the FPS AP-120B [2], consist of a number of resources that can operate in parallel, each of which is controlled by a field in the wide instruction word. Such architectures have been developed to perform high speed scientific computations at a modest cost. Figure 1 displays those characteristics of horizontal architectures that are germane to the issues discussed in this paper. The simultaneous requirements of high performance and low cost lead to an architecture consisting of multiple pipelined processing elements (PEs), such as adders and multipliers, a memory (which for scheduling purposes may be viewed as yet another PE with two operations: a READ and a WRITE), and an interconnect which ties them all together. The interconnect allows the result of one operation to be routed directly to another PE as one of the inputs of an operation that is to be performed there. The required memory bandwidth is reduced, since temporary values need not be written to and read from the memory. The final aspect of horizontal processors that is of interest is that their program memories emit wide instructions which synchronously specify the actions of the multiple, and possibly dissimilar, PEs. The program memory is sequenced by a conventional sequencer that assumes sequential flow of control unless a branch is explicitly specified.

As a consequence of the simplicity of such an architecture, it is inexpensive relative to the potential performance of the multiple pipelined PEs. However, if this potential performance is to be realized, the multiple resources of a horizontal processor must be scheduled effectively. The scheduling task for conventional horizontal processors is quite complex, and the construction of highly optimizing compilers for them is a difficult and expensive project. The polycyclic architecture [3-6] is a horizontal architecture with architectural support for the scheduling task. The cause of the complexity involved in scheduling conventional horizontal processors, and the manner in which the polycyclic architecture addresses this issue, are outlined in this paper.

2. THE POLYCYCLIC ARCHITECTURE

The polycyclic architecture is a horizontal architecture with a couple of special properties, which are described below.

2.1 The Interconnect

The interconnect of a polycyclic processor must have the following property: a dedicated delay element exists between every resource output and resource input that are directly connected to each other. This delay element enables a datum to be delayed by an arbitrary amount of time in transit between the corresponding output and input. The topology of the interconnect may be arbitrary. It is possible to design polycyclic processors with n resources in which the number of delay elements is O(n) (a uni- or multi-bus structure), O(n log n) (e.g., delta networks [7]) or O(n²) (a crossbar). The trade-offs involve cost, interconnect bandwidth and interconnect latency. Thus, it is possible to design polycyclic processors lying in various cost-performance brackets.

A sample polycyclic processor is shown in Figure 2. The topology of the interconnect is a complete crossbar with a delay element at each cross-point (thereby providing the polycyclic property). The interconnect has two output ports (columns) and one input port (row) for each of the two PEs. Each cross-point has a delay element, similar to that described below, which is capable of one read and one write each cycle. A PE can simultaneously distribute its output to any or all of the delay elements which are in the row of the interconnect corresponding to its output port. A PE can obtain its left (right) input from any delay element in the column of the interconnect that corresponds to its left (right) input port. If a value is written into a delay element at the same time that an attempt is made to read it, the value is transmitted through the interconnect with no delay. Any positive delay may be obtained merely by leaving the datum in the delay element for a suitable length of time.
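To make the row and column addressing concrete, the following Python sketch models the sample processor's crossbar. It is our own illustration, not the published design: the names (`Crossbar`, `write`, `read`) are assumptions, and each cross-point is reduced to a time-indexed slot rather than the compacting register file described in Section 2.2 below.

```python
# Toy model of the two-PE crossbar of Figure 2: one row per PE output,
# one column per PE input (left and right), a delay slot per cross-point.

class Crossbar:
    def __init__(self, num_pes=2):
        self.num_pes = num_pes
        # Each cross-point holds pending values keyed by write time.
        self.points = {(row, col): {}
                       for row in range(num_pes)
                       for col in range(2 * num_pes)}

    def write(self, src_pe, value, t):
        # A PE may distribute its result to any or all of the delay
        # elements in its row; this simplified model broadcasts to all.
        for col in range(2 * self.num_pes):
            self.points[(src_pe, col)][t] = value

    def read(self, src_pe, dst_pe, side, t_written, t_now):
        # A PE input selects one delay element in its column.
        # side: 0 = left input, 1 = right input.
        col = 2 * dst_pe + side
        # t_now == t_written models the zero-delay pass-through; any
        # larger t_now means the datum simply waited in the element.
        assert t_now >= t_written
        return self.points[(src_pe, col)].pop(t_written)

xbar = Crossbar()
xbar.write(src_pe=0, value=3.14, t=5)
print(xbar.read(src_pe=0, dst_pe=1, side=0, t_written=5, t_now=9))
```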
2.2 The Delay Elements

The structure of an individual delay element is that of a shift register file, any location of which may be read by providing an explicit read address. Optionally, one may specify that the value accessed be deleted. This option would be exercised if this were the last access to that value. The result of doing so is that every value with an address greater than that of the deleted value is simultaneously shifted down, on the same clock pulse, to the location with the next lower address. Consequently, all values present in the delay element are compacted into the lowest locations. An incoming value is written into the lowest empty location, which is pointed to by the Write Pointer maintained by the hardware. The Write Pointer is automatically incremented each time a value is written and is decremented each time one is deleted. As a consequence of deletions, a value, during its residence in the delay element, drifts down to lower addresses, and is read from various locations before it is itself deleted.

A value's current position at each instant during execution must be known by the compiler, so that the appropriate read address may be specified by the program when the value is to be read. Keeping track of this is a simple, if somewhat tedious, task which is easily performed by a compiler during code generation. This task, and the advantage of this unorthodox structure for the delay elements, are discussed in greater detail in [4, 6].
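The following sketch, again our own construction (the name `DelayElement` is assumed, and a sequential list stands in for hardware that shifts in a single clock pulse), illustrates the read-with-delete compaction and the resulting drift of values toward lower addresses.

```python
class DelayElement:
    """Toy model of a compacting shift-register file (illustrative only)."""

    def __init__(self, depth=16):
        self.regs = []            # regs[0] is the lowest location
        self.depth = depth

    def write(self, value):
        # The hardware Write Pointer always designates the lowest empty
        # location, so a write is simply an append.
        assert len(self.regs) < self.depth, "delay element overflow"
        self.regs.append(value)

    def read(self, addr, delete=False):
        value = self.regs[addr]
        if delete:
            # Deleting compacts the file: every value above the deleted
            # address drops to the next lower address (one clock pulse
            # in hardware; modeled here by a list deletion).
            del self.regs[addr]
        return value

d = DelayElement()
d.write('a'); d.write('b'); d.write('c')   # addresses 0, 1, 2
d.read(0, delete=True)                     # last use of 'a'
print(d.read(0), d.read(1))                # 'b' and 'c' have drifted down
```

The compile-time address tracking described above amounts to replaying exactly this sequence of writes and deletions, so that the current read address of every live value is known at every cycle.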
The description of a VLSI building-block chip for polycyclic interconnects may be found in [3]. The functional specification and logic design of this chip, in TRW's triple diffusion bipolar technology, have been performed.

3. FACTORS AFFECTING CODE GENERATION

It is useful, conceptually, to view the compilation process as consisting of two major steps although, in practice, these steps may be inextricably interwoven [8-10]. The first step consists of analyzing the source program to extract its structure and meaning. The complexity of this step is largely determined by the nature of the source language. The second step, code generation, is one of synthesis. It involves many sub-tasks such as register allocation and assignment, code selection, operation scheduling, peep-hole optimization and instruction assembly. This step is affected mainly by the nature of the target machine's architecture. The emphasis in this paper is on the scheduling function, which assumes considerably more importance in a horizontal architecture than in conventional serial architectures.

The scheduling task consists of determining, for each instruction cycle, which operations are to be performed upon which resources. This must be performed subject to constraints such as the functionality of each resource, the data dependencies in the computation and the availability of resources, and with the objective of minimizing the execution time of the computation.

Iterative computations (loops) account for a major fraction of the execution time in scientific computations and, so, merit special attention. Although considerable work has been performed on developing scheduling techniques for directed acyclic graphs, e.g., [11, 12], relatively few techniques are available for automatically scheduling loops in a near-optimal manner. Subject to the data dependencies between them, successive iterations of a loop may be overlapped in any way that does not result in conflict for the use of the resources. This can significantly improve the throughput of the processor. Algorithms for achieving this are presented in [4]. These algorithms require that the number of times that each resource is used per iteration be known before the scheduling of the operations can begin. This requirement has been proven to be necessary for any scheduling algorithm which yields an identical schedule for each iteration of the loop [4].
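The role played by these per-iteration usage counts can be illustrated with the standard resource bound on loop overlap. The sketch below is not the scheduling algorithm of [4]; it merely computes, for hypothetical usage counts, a lower bound on the interval between the start times of successive iterations.

```python
from math import ceil

def min_initiation_interval(uses_per_iteration, copies_of_resource):
    """Lower bound on the number of cycles between successive overlapped
    iterations: a resource used u times per iteration, with k copies,
    forces at least ceil(u / k) cycles between iteration start times."""
    return max(ceil(uses / copies_of_resource[r])
               for r, uses in uses_per_iteration.items())

# Hypothetical loop body: 3 memory ops, 2 adds, 1 multiply per iteration.
uses = {'memory': 3, 'adder': 2, 'multiplier': 1}
copies = {'memory': 1, 'adder': 1, 'multiplier': 1}
print(min_initiation_interval(uses, copies))  # 3: memory is the bottleneck
```

Any schedule that initiates iterations faster than this bound must, by a pigeonhole argument, place conflicting demands on the bottleneck resource; this is why the usage counts must be in hand before scheduling begins.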
Due to resource and dependency constraints, the two operations that create and use a particular value may be scheduled arbitrarily far apart in time. In other words, the value will be transmitted at one point in time by the resource that creates it, but must be received after an arbitrarily long delay by the resource which is to use it. This necessitates the existence of a capability to delay values while in transit between any source and any destination. In conventional horizontal machines, the only means of obtaining the necessary delays is through the use of scratch-pad register files. These scratch-pads constitute resources, and their usage must be known before the scheduling can begin. However, the number of times that the scratch-pads need to be used, and the points in the computation graph at which scratch-pad operations need to be inserted, depend upon the schedule. As a result of this circularity, it is not possible to specify a viable schedule in a deterministic manner [2, 4, 6]. Instead, one must first guess at the scratch-pad usage and then attempt to generate a schedule based on that estimate. If a schedule is not obtained, another guess must be made, and this process must be repeated until a schedule is obtained. Effective heuristics to guide the compiler in "guessing" at the scratch-pad usage are difficult to formulate.

For all of these reasons, code generation for horizontal processors is an ad hoc process. Consequently, the code generator is a complex piece of software that is difficult to construct, to comprehend and to maintain. In fact, it has been suggested that the efficient programming of a horizontal processor from a high level language is not possible [13].

The remedy for this situation lies in first formalizing the code generation process and forming a better understanding of how horizontal processors should be designed to conform to the model of code generation. Processor designers must then discipline themselves to restrict their designs to fall within the recommended class of architectures. At the same time, the designer should not make the mistake of proposing a solution which makes code generation trivial but results in terribly cost-ineffective utilization of the hardware. It is the overall cost-effectiveness of the hardware-software system that matters, and this is determined by the weakest link in the chain. Thus, the designer must balance performance and hardware cost against the cost of developing a compiler for his processor, and the cost of laborious assembly language programming when the compiler is found to be unsatisfactory in its ability to generate high quality code.

4. SCHEDULING OF THE POLYCYCLIC ARCHITECTURE

An explicitly scheduled resource is a resource which needs to be explicitly scheduled by the compiler if simultaneous, conflicting demands are not to be placed upon it. Such resources may include PEs, memories, register files and shared buses. Implicitly scheduled resources are those which are implicitly scheduled along with some explicitly scheduled resource. The demands upon the implicitly scheduled resources are guaranteed not to conflict if the demands upon the corresponding explicitly scheduled resources do not conflict. One example of an implicitly scheduled resource is a dedicated bus which has only one source. Since there can never be a conflict in its use, it can be viewed as having been scheduled along with its source. Another example of an implicitly scheduled resource is the last stage of a pipelined adder, since it is implicitly scheduled for use a constant number of cycles after the first stage (which is explicitly scheduled). Again, there can be no conflict in the use of the last stage if there is no conflict in the use of the first stage. For scheduling purposes, such resources constitute part of the explicitly scheduled resource and may initially be ignored without any repercussions later in the scheduling process.
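As a sketch of why such resources may safely be ignored, consider a pipelined adder with an assumed fixed latency: the busy times of its implicitly scheduled last stage are a pure translation of the explicitly scheduled issue times, so they can never conflict unless the issue times already did.

```python
def last_stage_times(first_stage_times, pipeline_latency=3):
    """The last stage of a pipelined adder is busy a constant number of
    cycles after the first stage; conflict-freedom of the first stage
    therefore implies conflict-freedom of the last, and the last stage
    never needs to be scheduled separately. (Latency is hypothetical.)"""
    return [t + pipeline_latency for t in first_stage_times]

first = [0, 1, 4, 5]             # explicitly scheduled issue cycles
print(last_stage_times(first))   # [3, 4, 7, 8]: still conflict-free
```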
The key to the ease with which the polycyclic architecture can be scheduled lies in the fact that its explicitly scheduled resources perform only those operations that are explicit in the computation graph. Their usage can, therefore, be determined trivially. This is in contrast to the case of the conventional processor, where explicitly scheduled resources (the scratch-pads) had implicit and unknown demands placed upon them. All implicit operations (delay element usages) in the case of the polycyclic architecture use implicitly scheduled resources, with the guarantee that no conflict can occur in so doing. Thus, these unknown, implicit operations may be ignored while scheduling.

As a result, the scheduling process may be divided into two phases. The first phase schedules the explicitly scheduled resources, and can do so in a deterministic manner [4], since the resource usage depends only upon the explicit operations in the computation. The second phase consists of the bookkeeping task of determining which delay elements are used, and at what points in time. Once the operations that produce and use a value have been assigned to PEs by the operation scheduling phase, the dedicated delay element between those two PEs must be assigned to hold (and delay) that value. There is no other option. Furthermore, since only one PE may write into any given delay element and only one PE may read from it, there is no possibility of conflicting demands being placed upon the delay element. Hence, allocation of delay elements is not an issue since, in effect, they have been pre-allocated at the time that the hardware was built.
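The following is a minimal sketch of the second phase, under assumed data structures for the phase-one output (each operation mapped to a PE and a start time): because the delay element between the producing and consuming PEs is dedicated, the "assignment" reduces to recording residence intervals.

```python
def assign_delay_elements(schedule, values):
    """schedule: op -> (pe, start_time); values: (producer, consumer) op
    pairs. Returns value -> (delay_element, write_time, read_time). The
    delay element between the producing and consuming PEs is the only
    option, so no allocation decision is ever made here."""
    assignment = {}
    for producer, consumer in values:
        src_pe, t_write = schedule[producer]
        dst_pe, t_read = schedule[consumer]
        assignment[(producer, consumer)] = ((src_pe, dst_pe), t_write, t_read)
    return assignment

sched = {'mul1': (0, 2), 'add1': (1, 7)}     # hypothetical phase-one output
print(assign_delay_elements(sched, [('mul1', 'add1')]))
```

Note that nothing in this phase can fail or force backtracking; contrast this with the guess-and-retry loop that scratch-pad allocation imposes on a conventional horizontal processor.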
There are other problems involved in generating code for horizontal processors that are minimized by the structure of the delay elements, especially their shift-and-compact capability. The above issues are discussed in [3, 4, 6].

As a consequence of the existence and the nature of the delay elements in the polycyclic architecture, the circular problem of determining the usage of resources with implicit demands upon them is eliminated. In turn, this minimizes the back-tracking needed when scheduling the overlapped execution of iterative computations.

The polycyclic architecture addresses the conventional horizontal architecture's code generation problem by the use of extra hardware. However, the manner in which this extra hardware expense is utilized has been chosen very carefully. There are many ways of adding extra register files, and many different types of register files that could have been used, which would not have solved the fundamental problem. The determination of the precise manner in which the extra hardware was to be used so as to be effective was based on first formalizing the task, understanding the essence of the problem, and then providing the hardware that exactly met the need.

The polycyclic architecture has been designed to support code generation by simplifying the task of scheduling the resources of horizontal processors. The advantages that we expect to observe are: 1) that the scheduler portion of the compiler will be easier to implement, 2) that the code generated will be of a higher quality, 3) that the compiler will execute faster, and 4) that the automatic generation of code generators might be facilitated.

REFERENCES:

1) CDC Advanced Flexible Processor Microcode Cross Assembler (MICA) Reference Manual, Control Data Corp., Publication No. 77900500, Apr. 1980.
2) A.E. Charlesworth, "An approach to scientific array processing: the architectural design of the AP-120B/FPS-164 family", Computer, Vol. 14, No. 9, pp. 18-27, Sep. 1981.
3) B.R. Rau, P.J. Kuekes and C.D. Glaeser, "A statically scheduled VLSI interconnect for parallel processors", Proc. CMU Conference on VLSI Systems and Computations, Carnegie-Mellon Univ., Pittsburgh, Pennsylvania, pp. 389-395, Oct. 1981.
4) B.R. Rau and C.D. Glaeser, "Some scheduling techniques and an easily schedulable horizontal architecture for high performance scientific computing", Proc. 14th Annual Workshop on Microprogramming, pp. 183-198, Oct. 1981.
5) B.R. Rau, R.L. Picard, C.D. Glaeser and E.M. Greenawalt, "The polycyclic architecture: a statically scheduled data flow architecture", Computer (to appear).
6) B.R. Rau, C.D. Glaeser and E.M. Greenawalt, "Efficient code generation for horizontal architectures: compiler techniques and architectural support", (submitted for publication).
7) J.H. Patel, "Processor-memory interconnections for multiprocessors", Proc. 6th Annual Symposium on Computer Architecture, pp. 168-177, Apr. 1979.
8) A.V. Aho and J.D. Ullman, Principles of Compiler Design, Addison-Wesley, New York, 1977.
9) W.A. Wulf, et al., The Design of an Optimizing Compiler, Elsevier, New York, 1975.
10) F.L. Bauer and J. Eickel (Eds.), Compiler Construction: An Advanced Course, Springer-Verlag, New York, 1976.
11) D. Landskov, S. Davidson and B. Shriver, "Local microcode compaction techniques", Computing Surveys, Vol. 12, No. 3, pp. 261-294, Sep. 1980.
12) M.J. Gonzalez, "Deterministic processor scheduling", Computing Surveys, Vol. 9, No. 3, pp. 173-204, Sep. 1977.
13) W.J. Karplus and D. Cohen, "Architectural and software issues in the design and application of peripheral array processors", Computer, Vol. 14, No. 9, pp. 11-17, Sep. 1981.

Figure 1: A typical horizontal architecture (instruction sequencer, program memory, horizontal instruction word, PEs and interconnect).

Figure 2: Relevant features of a polycyclic processor.