A fourth-generation computer organization

by STANLEY E. LASS

Scientific Data Systems, Inc.
Santa Monica, California

INTRODUCTION

A single processor's performance is limited by its organizational efficiency and the technology available. Paralleling of processors and/or improving the organizational efficiency are the ways of obtaining greater performance with a given technology. Much research has been done on multiple processors and single processors which perform operations on vectors in parallel. However, significant portions of problems are sequential, and performance in the sequential portions is limited to that of a single processor.

This paper describes a proposed new medium- to large-scale computer organization designed to improve single-processor organizational efficiency. The basis of this approach is the separation of memory operations (fetching, storing) control from the arithmetic unit control. Each control unit executes its own programs. Memory operations programs fetch instructions, fetch operands, and store results for the arithmetic unit. Buffering allows a maximum of asynchronism between the arithmetic operations and the memory operations. To perform a given computation, each control unit executes fewer and less complex instructions than a third-generation computer control unit. The less complex instructions require less time to execute and, since fewer instructions per control unit are required, the computer can operate much faster.

Cost-performance of logic

A logic circuit delay of approximately 0.2 nsec has been achieved on an integrated circuit chip. High-speed logic circuit delay of 1.8 nsec has been achieved in the third generation. Low-cost bipolar logic with 250 gates on a chip at 5 cents/gate has been predicted for 1970. Per-gate costs are presently about 50 cents. Cost-performance of logic will thus be about two orders of magnitude better than in the third generation.

Arithmetic operation times

As a result of this cheaper and faster logic, it will be reasonable to minimize operation times by extensive use of combinational logic and separate functional units (e.g., an add/subtract/logical unit and a multiply/divide unit). The implications of this procedure can be emphasized by estimating the arithmetic operation speeds that will result. These estimates are based on extrapolations from published papers^1,2,3 and include an allowance for the additional logic levels required. Also, the estimates assume a 1-nsec delay in the environment for one level of AND/OR logic along the critical path. The critical-path distance will be minimized by a combination of staying on the integrated circuit chip and keeping the path distance between chips short.

Estimated arithmetic operation speeds are:

    Operation                               Elapsed Time   Pipelined Time/Operation
    32-bit fixed-point add/subtract             8 nsec          4 nsec
    32-bit fixed-point multiply                16 nsec          8 nsec
    32-bit fixed-point divide                  56 nsec          --
    32-bit logic functions                      8 nsec          4 nsec
    32-64 bit floating-point add/subtract      16 nsec          8 nsec
    32-64 bit floating-point multiply          20 nsec         10 nsec
    32-64 bit floating-point divide            70 nsec          --

With separate functional units, time can also be saved by using functional unit outputs directly as inputs without intervening storage. Pipelining can also be used to increase the throughput. For pipelined operation, the execution of a function is divided into two or more stages, and a set of inputs can be in execution in each stage. The time between successive inputs can be much less than the elapsed time for the execution of a function.
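The throughput gain from pipelining can be made concrete. The sketch below is illustrative only: it assumes the two-stage multiply from the table above (16-nsec elapsed time, 8 nsec between successive inputs) and applies the standard ideal-pipeline timing formula.

```c
#include <stdio.h>

/* Illustrative timing for a pipelined functional unit. The 16-nsec
 * elapsed time and the 8-nsec stage time are taken from the estimates
 * above; the calculation itself is standard pipeline arithmetic. */
int main(void) {
    const double elapsed_ns = 16.0; /* one multiply, start to finish */
    const double stage_ns = 8.0;    /* time between successive inputs */
    const int n = 1000;             /* number of multiplies in the stream */

    double serial = n * elapsed_ns;                     /* no overlap */
    double pipelined = elapsed_ns + (n - 1) * stage_ns; /* stages overlapped */

    printf("serial:    %.0f nsec\n", serial);    /* 16000 nsec */
    printf("pipelined: %.0f nsec\n", pipelined); /*  8008 nsec */
    return 0;
}
```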
Cost-performance of memory

Memory costs will be roughly halved by batch-processed fabrication. Access times on the order of 100 nsec and cycle times on the order of 200 nsec will be achieved. This represents nearly an order of magnitude improvement in cost-performance over third-generation memories.

Implications of memory technology

Logic speed is increasing relatively faster than memory speed. Cheaper logic makes it reasonable to perform the arithmetic operations in fewer logic levels. As a result, the disparity between arithmetic operation times and memory access times will increase by a factor of roughly two to three. This implies greater instruction lookahead to efficiently utilize the arithmetic unit's capacity, and increased instruction lookahead is difficult to achieve.^4 However, a partial solution to this disparity exists and is described in the sections that follow.

Associative buffer and block-organized main memory

A scratchpad memory buffers the processor and main memory. Blocks of words are transferred between the scratchpad memory and main memory. The scratchpad memory and the associative memory together comprise the associative buffer. The operation proceeds as follows: The virtual address of a requested word is associatively checked against the virtual addresses of the blocks in the scratchpad. If the word is in a block in the scratchpad memory, it is output to the processor. If not, the block containing the word is obtained from main memory and stored in the scratchpad memory, and the word is output to the processor. Similarly, when storing a word, the block must be in the scratchpad memory. This is similar to paging in third-generation time-sharing systems and it involves the same problems (e.g., which block to delete or store when room is needed for a new block). The net result is a substantial reduction in access time when the word is in the scratchpad memory.^5,6

To provide a basis for comparison, assume a block-organized main memory with each block consisting of 16 consecutive 32-bit words. Eight interleaved block-organized memories of 100-nsec access time and 200-nsec cycle time provide a combined memory bandwidth of over 2 × 10^10 bits/second. Access times from processor to memory are approximately 30 nsec for words in the scratchpad, and 150 nsec for words in main memory.

Pipelining through the associative memory and parallel scratchpads is used to achieve a high associative buffer bandwidth. Assume a fetch or store every 10 nsec, where six percent of these require accessing main memory. The six percent is based on data^5 modified to reflect the differences in computer organization. This corresponds to an instruction rate of approximately 80 million per second. It also corresponds to six blocks per microsecond from main memory, or 15 percent of bandwidth. With bandwidth usage this low, another processor could be added without severe degradation in performance due to interference. It also allows high input-output transfer rates with modest interference.

It is desirable, with this design, to group operands and sequence the addressing to minimize the number of block transfers. This lowers the average access time and lowers the main memory bandwidth usage.

Programming implications

Most programming will be in higher-level languages. The computer cost will be a smaller and smaller portion of the total costs of solving a given problem.
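As a sketch of the lookup just described, assuming details the paper leaves open (a 64-block scratchpad capacity, naive FIFO replacement, and a stub standing in for the main-memory transfer): with the quoted times, the average access works out to 0.94 × 30 + 0.06 × 150 ≈ 37 nsec.

```c
#include <stdint.h>

#define WORDS_PER_BLOCK 16 /* 16 consecutive 32-bit words per block */
#define NUM_BLOCKS      64 /* scratchpad capacity; an assumed figure */

struct block {
    int      valid;
    uint32_t tag;                    /* virtual block address */
    uint32_t words[WORDS_PER_BLOCK]; /* scratchpad copy of the block */
};

static struct block scratchpad[NUM_BLOCKS];
static unsigned next_victim;         /* naive FIFO replacement; assumed */

/* Stub standing in for the ~150-nsec main-memory block transfer. */
static void fetch_block_from_main(uint32_t tag, uint32_t words[WORDS_PER_BLOCK])
{
    for (int i = 0; i < WORDS_PER_BLOCK; i++)
        words[i] = tag * WORDS_PER_BLOCK + i;  /* dummy contents */
}

/* Return the requested word, loading its block on a miss. */
uint32_t buffer_fetch(uint32_t vaddr)
{
    uint32_t tag    = vaddr / WORDS_PER_BLOCK;
    uint32_t offset = vaddr % WORDS_PER_BLOCK;

    /* Associative check of the tag against every resident block.
     * In hardware this is one parallel compare, not a loop. */
    for (int i = 0; i < NUM_BLOCKS; i++)
        if (scratchpad[i].valid && scratchpad[i].tag == tag)
            return scratchpad[i].words[offset];  /* ~30-nsec hit path */

    /* Miss: bring in the whole 16-word block, then serve the word. */
    struct block *b = &scratchpad[next_victim];
    next_victim = (next_victim + 1) % NUM_BLOCKS;
    fetch_block_from_main(tag, b->words);        /* ~150-nsec miss path */
    b->valid = 1;
    b->tag   = tag;
    return b->words[offset];
}
```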
The main goal of the designer is to maximize the system throughput with programs written in higher-level languages. The user sees a system that executes programs written in higher-level languages.

The average job execution time does not decrease significantly when the computer speed is increased significantly. The explanation for this seems to be that the number of programmers and the number of jobs they submit each day do not change appreciably, but the jobs they do submit are longer in terms of number of instructions executed; e.g., they try more cases or parameterize in finer increments.

The number of instructions executed per job by the operating system (including compilers) will probably not increase by more than a factor of five, even if increased optimization of compiled code and decreased efficiency due to use of table-driven compiler techniques (for lower software cost) are factored in. Operating systems, compiling, and input conversions (e.g., decimal-binary) are essentially input-output functions and their volume is proportional to the number of programmers and people preparing input and reading output. If the computer speed is increased by a factor of twenty-five, then the operating system (including compilers) time will decrease by a factor of more than five, and the computer will be executing jobs more of the time. Similarly, the proportion of time devoted to byte manipulation, binary-to-decimal conversions, etc., will decrease. Byte, halfword, and shifting operations may not be included in the hardware for the above reasons. Shifting would be accomplished by multiplying by a power of two.
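A minimal illustration of the last point, not from the paper: a compiler for such a machine would emit a multiply by 2^k wherever a left shift by k is written.

```c
#include <stdio.h>
#include <stdint.h>

/* Left-shifting x by k bits equals multiplying by 2^k, so a machine
 * with a fast multiplier can omit shift hardware entirely. */
static uint32_t shift_left(uint32_t x, unsigned k) {
    uint32_t pow2 = 1;
    while (k--) pow2 *= 2;  /* 2^k, built without any shift */
    return x * pow2;        /* x << k, done on the multiplier */
}

int main(void) {
    printf("%u\n", (unsigned)shift_left(5, 3));  /* prints 40, i.e., 5 << 3 */
    return 0;
}
```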
The equivalence of logic design and programming

Both the logic designer and the programmer implement algorithms. Each has to choose a representation of the data involved. Whereas the programmer uses instructions to implement algorithms, the logic designer uses combinations of logic elements (AND, OR, NOT, and storage). In addition to verifying that the logic is correct, the designer must observe the electrical limitations of the logic elements and their connections (i.e., circuit delay, fan-in, fan-out, and wire propagation delay) in order to execute the logic function correctly within the time allotted.

Hardware instruction lookahead is, in effect, a recoding of several instructions to obtain the instantaneous control actions. The hardware recoding and the resulting asynchronism depend on conditions within the computer (e.g., variations in instantaneous memory access time due to interference). Hardware recoding operates in real time at execution time and is strictly limited in complexity by time and economic considerations.

The recoding can also be performed by software at compile time if execution-time asynchronism is sacrificed. All concurrency is planned at compile time. If an instruction or operand were not available when needed (due to memory interference), the control would halt until it became available. The recoded program, containing control timing and sequencing information, would require several times as many bits as the unrecoded program. It would resemble a microprogram with groups of microinstructions to be executed in parallel. The computer time required for recoding at compile time is proportional to the length of the program, not the number of instruction executions required to complete the program. Also, software recoding is not limited by the real-time constraint. As a result, the software recoding can economically be much more complex and more effective.

In the recoded form, operand fetches are initiated several instruction cycles before they are used. For example, the recoded form of the inner loop of a matrix multiply would be several operand fetches, followed by concurrent operand fetches and arithmetic operations, and finally by the last arithmetic operations. The same result could be obtained by an independent operand fetch loop which starts several instruction cycles before the arithmetic operation loop is started. Two separate centers of control are implied. Fewer bits are required to represent the program by specifying the two loops separately, but the number of bits is still more than third-generation instructions require.

The proposed computer organization has a separate control unit for fetching and storing (the data channel control unit) and an arithmetic control unit. For comparison, note that the CDC 6600 and the LIMAC^7 have separate instructions (but not separate programs) for arithmetic operations and memory operations.

Data channels and their control

Figure 1 shows the data channels, which are the information-flow paths in the computer.

[Figure 1 - Information flow diagram (data channel control unit, associative buffer, main memory). Arrows indicate data paths in the computer. Instructions are transmitted to the arithmetic units over paths indicated by dashed arrows. Double arrows are the data paths for each set of eight data channels. Two data paths suffice for eight data channels, since two data items at most are transferred at a time.]

Channel commands for multiple-word transfers consist of a virtual memory address, the channel number, a flag indicating load or store pushdown stack, address increment, and count. For loop control during arithmetic operations, an end-of-record marker follows the last operand. An attempt to read the end-of-record marker as data will terminate the loop.

A channel can be cleared of previous contents by flagging the first command of a new channel program. All store commands of the old channel program for which data was stored in the channel are properly executed. A channel must be cleared, or sufficient time must elapse to store the data, before subsequent commands reference that data.

Another channel capability is the capability to load a variable number of words (limited by the buffer size) in a circular register. Its use is primarily for storing instructions and constants within loops. It can be entered by flagging the channel command which specifies the last word in the loop. The first word will then follow the last word until the channel is cleared. This usage of the channel will later be referred to as circular mode.

The input-output register of the channel has a data-presence bit to indicate data availability. The register functions in four ways:

1. Nondestructive read: The presence bit is left on and the register contents remain the same.
2. Destructive read: The presence bit is turned off, the register is filled with the next data word in the channel, and the presence bit turned back on.
3. Nondestructive store: The current contents of the register are pushed down one and the presence bit is left on.
4. Destructive store: The current contents of the register are replaced.
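As a sketch, the channel command just described might be laid out as below; the field list comes from the text, while every field width and the example values are assumptions.

```c
#include <stdint.h>

/* One data channel command, per the description above: a virtual memory
 * address, a channel number, a load/store flag, an address increment,
 * and a count, plus the clear and circular-mode flags. All field widths
 * are assumptions; the paper gives none. */
struct channel_command {
    uint32_t virtual_address; /* first word of the transfer */
    uint8_t  channel;         /* which data channel */
    uint8_t  store;           /* 0 = load toward processor, 1 = store */
    uint8_t  clear_channel;   /* flagged on the first command of a new program */
    uint8_t  circular_mode;   /* flagged on the command naming the loop's last word */
    int16_t  increment;       /* added to the address between words */
    uint16_t count;           /* number of words to transfer */
};

/* Example: fetch 100 operands spaced 16 words apart (e.g., a matrix
 * column) into channel 3, clearing the channel's old contents first.
 * The base address is hypothetical. */
static const struct channel_command fetch_column = {
    .virtual_address = 0x4000,
    .channel         = 3,
    .store           = 0,
    .clear_channel   = 1,
    .circular_mode   = 0,
    .increment       = 16,
    .count           = 100,
};
```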
Provision is made for saving channel status, using the channel for another purpose, and later restoring the channel to its original status.

The master control unit, fixed-point arithmetic unit, and input-output unit each have a data channel reserved for commands. Any data stored in these data channels are transmitted to the data channel control unit for immediate execution as a command. The second source of commands is the input-output registers of specified data channels. Commands present in the specified data channels (as indicated by the presence bit) are read destructively from the input-output register and executed. The commands are executed by small, fast, special-purpose computers in the data channel control unit.

As an example, the execution of a single command (received from the master control unit) loads a data channel with commands: the first command loads another data channel with commands for fetching instructions, the second command loads still another data channel with commands for storing data, and the remaining commands fetch operands.

Channel command programs loading data can fetch ahead of arithmetic execution a number of words limited by the size of the data channel buffer. Channel command programs end by running out of commands.

Data channel buffering

Channel buffers would be implemented as circular buffers using integrated scratchpad memories. Channel action when used for loading operands is as follows: Initially, both the input and output pointers are set to 0 (see Figure 2). The first input requested goes into word 0, and a data-presence bit is set when the input arrives. Each successive input requested goes to the next higher word (modulo 8). The fetch-ahead depth of our example is limited to 8. Output can only occur if the data-presence bit is set. If the instruction turns off the presence bit, the next output comes from the next higher word. While inputs are requested in the command order, they may arrive at the buffer out of sequence, but they will go to the correct word in the scratchpad.

[Figure 2 - Data channel buffer, showing the channel-side pointer and the input-output register pointer of the circular buffer.]
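The buffering just described can be modeled in software. The sketch below assumes the eight-word example above; presence bits gate output exactly as in the text, and out-of-order arrivals land in the slot their request reserved. It is a behavioral analogy for the hardware, not an implementation from the paper.

```c
#include <stdint.h>
#include <stdbool.h>

#define DEPTH 8  /* fetch-ahead depth of the example */

struct channel_buffer {
    uint32_t word[DEPTH];
    bool     present[DEPTH]; /* data-presence bit per word */
    unsigned in_ptr;         /* next slot to assign to a fetch request */
    unsigned out_ptr;        /* slot behind the input-output register */
};

/* Reserve a slot for the next requested input; returning its index lets
 * an out-of-sequence arrival still land in the correct word. */
unsigned request_input(struct channel_buffer *b) {
    unsigned slot = b->in_ptr;
    b->in_ptr = (b->in_ptr + 1) % DEPTH;  /* word 7 wraps back to word 0 */
    return slot;
}

/* Called when main memory responds, possibly out of request order. */
void input_arrives(struct channel_buffer *b, unsigned slot, uint32_t data) {
    b->word[slot] = data;
    b->present[slot] = true;  /* presence bit set on arrival */
}

/* Destructive read: succeeds only if the presence bit is set; turns the
 * bit off and advances to the next higher word. Returns false when the
 * arithmetic control must hang up and wait. */
bool destructive_read(struct channel_buffer *b, uint32_t *out) {
    if (!b->present[b->out_ptr])
        return false;         /* data not yet available */
    *out = b->word[b->out_ptr];
    b->present[b->out_ptr] = false;
    b->out_ptr = (b->out_ptr + 1) % DEPTH;
    return true;
}

/* Nondestructive read: leaves the presence bit on and the contents unchanged. */
bool nondestructive_read(const struct channel_buffer *b, uint32_t *out) {
    if (!b->present[b->out_ptr])
        return false;
    *out = b->word[b->out_ptr];
    return true;
}
```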
The master control unit

Eight double-word data channels supply the master control unit with instructions. The control selects one of the eight data channels. Instructions present in the selected data channel (indicated by the presence bit) are read destructively from the input-output register of the data channel and executed. There are five types of instructions:

1. Arithmetic instructions are transmitted to the appropriate arithmetic unit.
2. Channel commands appearing in the instruction stream are transmitted to the channel control unit through a data channel.
3. An all-zero instruction is a no-operation.
4. An instruction is provided to conditionally switch between the instruction unit data channels.
5. An instruction is provided to conditionally skip a specified number of instructions.

Arithmetic unit control

An arithmetic instruction specifies the two inputs, destructive or nondestructive read, and the operation to be performed. The inputs are from data channels and functional unit outputs (see Figure 3). A store instruction transmits the data on an output bus or a data channel to a data channel. One data channel leads to the channel control unit for computed channel commands.

[Figure 3 - Arithmetic unit organization (adder/subtractor and logic unit, multiply-divide unit, input bus, input-output bus, and eight data channel buffers). Arrows indicate data paths and direction.]

The first stage of instruction execution is testing whether the specified inputs are present. If they are not, the control hangs up until they arrive. During the last stage, the inputs are latched in the functional unit while the operation proceeds. The output-presence bit is set when the operation is completed. The presence bit is turned off by a destructive read of the functional unit output (by an instruction), or by testing the condition code generated by the operation. For example, a compare is accomplished by testing the condition code of a subtract operation. To pipeline, two pairs of inputs must be latched in before the result of the first pair is read.

Trying to read an end-of-record marker results in switching control to the next data channel in the control unit. This is used mainly for terminating loops.

To facilitate data exchange between the separate fixed- and floating-point arithmetic units, two data channels are common to both.
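The master control unit's dispatch of the five instruction types can be sketched behaviorally. The 3-bit type field, operand fields, and instruction stream below are invented for illustration (the paper gives no encoding), and the presence-bit wait is elided by making every instruction available.

```c
#include <stdio.h>
#include <stdint.h>

/* Behavioral sketch of the master control unit's instruction dispatch.
 * The paper names five instruction types but no encoding; this 3-bit
 * type field in the top bits of a 32-bit word is an assumption. */
enum itype { NOP = 0, ARITH = 1, CHAN_CMD = 2, COND_SWITCH = 3, COND_SKIP = 4 };

static uint32_t make(enum itype t, uint32_t operand) {
    return ((uint32_t)t << 29) | operand;
}

int main(void) {
    /* A short, fully present stream standing in for one of the eight
     * instruction data channels. */
    uint32_t channel[] = {
        make(ARITH, 7),       /* would be transmitted to an arithmetic unit */
        make(COND_SKIP, 1),   /* if the condition holds, skip one instruction */
        make(ARITH, 8),       /* skipped in this run */
        make(NOP, 0),         /* all-zero word: no-operation */
        make(CHAN_CMD, 3),    /* forwarded to the data channel control unit */
        make(COND_SWITCH, 5), /* would switch control to channel 5 */
    };
    int n = sizeof channel / sizeof channel[0];
    int skip = 0;
    int condition = 1;        /* pretend the tested condition code holds */

    for (int pc = 0; pc < n; pc++) {
        uint32_t insn = channel[pc];
        if (skip > 0) { skip--; continue; }
        switch (insn >> 29) { /* hypothetical type field */
        case ARITH:       printf("arith op %u\n", insn & 0xFF); break;
        case CHAN_CMD:    printf("channel command -> channel control unit\n"); break;
        case NOP:         break;
        case COND_SKIP:   if (condition) skip = (int)(insn & 0xFF); break;
        case COND_SWITCH: printf("switch control to channel %u\n", insn & 7); return 0;
        }
    }
    return 0;
}
```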
Channel commands for instruction and data sequencing

The channel commands for a set of instructions and their data are normally located together in memory. Sequential instructions are executed by transmitting a channel command to the data channel control unit specifying the instructions and their data. Small loops are the same as sequential instructions, except that circular mode is specified in the channel command. Arithmetic data may also use circular modes. Leaving the loop is accomplished by switching to another data channel (by conditional branches) or by trying to read data and getting an end-of-record marker. If this data channel is itself in circular mode, we have a loop within a loop.

Conditional branches are handled by anticipatory loading of a channel with the successful-branch instruction stream and switching to the channel if the branch is successful. Unconditional branches are handled at the channel command level by loading the channel with the branched instruction stream.

Subroutines are handled by loading a channel with the subroutine (or at least with the beginning of it) and then switching control to that channel. Returning is accomplished by switching back to the original channel. Some subroutining can be specified at the channel command level by channel commands for an unconditional branch to the subroutine and an unconditional branch back. However, sooner or later all channels will be in use, in which case a channel's status is saved, the channel is used, and later the channel's status is restored. This is analogous to saving a program location counter, executing a subroutine, and then restoring the location counter.

Hardware design and packaging

The computer is naturally partitioned into nearly autonomous units. Repetition of parts is found in the multiple uses of data channels (which are mostly memory), the special-purpose computers in the data channel control unit, and the associative buffer (which is mostly associative and addressable memory). The hardware complexity of control needed to achieve a high level of concurrency is minimized by separate control of memory operations and of the arithmetic unit. Complex instruction-lookahead hardware is not required.

Input-output

Input-output would be controlled by a small computer with a scratchpad memory for input-output commands and buffering. The small computer is the interface between the peripherals and the data channels. It also generates commands to control input-output data transfers in the data channels.

Time-sharing and multiprogramming

Paging from a rotating memory is the currently popular solution to managing main memory in a time-sharing environment. If, in a typical third-generation system, a 50 × 10^6 instructions/sec processor is substituted and the page access time plus transfer time is not changed significantly, the processor will be waiting on pages most of the time. Access time from rotating memories cannot be improved significantly, but transfer rates with head-per-track systems can be very high. This suggests an approach based on the ability to read complete programs from the disc into main memory quickly, process them rapidly, and return them to the disc quickly. This minimizes the prorated memory usage by a program and allows high throughput without an excessively large memory.

The scheduler maximizes processor utilization within the constraint of system response times. Programs normally reside on the disc. The scheduler selects the next program to be transferred to memory. One factor in the selection is the amount of time before program transfer would begin (instantaneous access time). The program is transferred at 10^9 bits/second (e.g., a 10^6-bit program is transferred in 1 msec). The program is put on a queue of programs to be processed. Having the complete program in memory allows processing, without paging, to the next input or output by the program; the processed program is then written into available disc space (generally the first available disc space) without regard to its previous disc location. The previous disc location is added to the available disc space.

Scheduler considerations include the distribution of available space around the disc, the distribution of ready-to-be-processed programs around the disc, and the nearness of ready-to-be-processed programs to their response-time limit. Programs whose processing time exceeds a specified limit are not allowed to degrade the system response time of the other programs. Programs larger than the memory require partitioning into files or pages. Main memory (or optionally a slower, lower-cost random-access memory) and the disc buffer the input-output activity of the programs.

The system could be organized to place FORTRAN users in one group, JOSS users in another group, etc. Each user group would have its own compiler and supporting operating system. A portion of the operating system would be common to all of the groups. The computer would be a "dedicated" FORTRAN system for a fraction of a second, then a "dedicated" JOSS system for another fraction of a second, etc. As a result, operating systems would be simpler, and a change could be made in one system without affecting the other systems.
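The swapping arithmetic above is easy to verify. The sketch below computes load and unload times at the quoted 10^9 bits/second for a few assumed program sizes; the 10^6-bit case reproduces the 1-msec example in the text.

```c
#include <stdio.h>

/* Checks the whole-program swapping arithmetic: at 10^9 bits/second, a
 * complete program moves between disc and memory in about a millisecond,
 * which is what makes swapping competitive with paging here. The program
 * sizes are assumed for illustration. */
int main(void) {
    const double rate_bits_per_sec = 1e9;            /* head-per-track rate */
    const double sizes_bits[] = { 1e6, 4e6, 1.6e7 }; /* assumed program sizes */

    for (int i = 0; i < 3; i++) {
        double ms = sizes_bits[i] / rate_bits_per_sec * 1e3;
        printf("%9.0f-bit program: %.2f msec to load or unload\n",
               sizes_bits[i], ms);
    }
    return 0; /* 10^6 bits -> 1.00 msec, matching the example in the text */
}
```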
CONCLUSIONS AND OBSERVATIONS

The system described here achieves concurrency of fetching, arithmetic operations, and storing without the need for complex instruction-lookahead hardware. The complexity of control is in software. The bandwidth of the processor is over 100 million equivalent third-generation instructions per second. This rate will be achieved for some problems. However, delays due to waiting for operands or instructions in the data channels will lower the processing rate in many cases. Some types of problems seem to inherently have a great deal of delay; for example, table lookup using computed addresses. Problems in which the flow of control and addressing are not data-dependent could run near the bandwidth of the system. (An additional requirement for this is that the addressing be such that the block transfer rate between main memory and the associative buffer is reasonable.) Optimizing the code consists of (1) minimizing the delays caused by instructions and operands not being available when needed, and (2) pipelining and overlapping the arithmetic operations.

To program this processor in its machine language, a master control program (instructions) and channel programs (commands) are prepared. There are many chances to make an error and lose synchronism between instructions and commands. As a result of the difficulty of machine-language programming with this organization, even more programming would be in higher-level languages.

Initially, the compiler for this computer could be relatively crude and unsophisticated. As time passes, the subtleties and characteristics of the design would be assimilated and experience gained by the compiler writers. As a result, midway through the fourth generation the computer should average 50-80 million equivalent third-generation instructions per second.

A more powerful data channel command set than is described here may be desirable for non-numeric applications.^8

The only significant way to reduce software cost by hardware is to build a faster computer (with a lower cost per computation), which will then allow the programmer to reduce total costs by using algorithms that are simpler to program but require more computer processing.

REFERENCES

1 C S WALLACE: A suggestion for a fast multiplier. IEEE Transactions on Electronic Computers, Vol EC-13, February 1964
2 M LEHMAN, N BURLA: Skip techniques for high-speed carry propagation in binary arithmetic units. IRE Transactions on Electronic Computers, Vol EC-10, December 1961
3 S F ANDERSON, J G EARLE, R E GOLDSCHMIDT, D M POWERS: The IBM System/360 Model 91: Floating-point execution unit. IBM Journal of Research and Development, Vol 11, No 1, January 1967
4 D W ANDERSON, F J SPARACIO, R M TOMASULO: The IBM System/360 Model 91: Machine philosophy and instruction handling. IBM Journal of Research and Development, Vol 11, No 1, January 1967
5 D H GIBSON: Considerations in block-oriented systems design. Proceedings of the 1967 Spring Joint Computer Conference
6 G G SCARROTT: The efficient use of multilevel storage. Proceedings of the 1965 IFIP Congress, Spartan Books, 1965
7 H R BEELITZ, S Y LEVY, R J LINHARDT, H S MILLER: System architecture for large-scale integration. Proceedings of the 1967 Fall Joint Computer Conference
8 B CHEYDLEUR: Summary session, Proceedings of the ACM Programming Languages and Pragmatics Conference. Communications of the ACM, Vol 9, No 3, March 1966