A fourth-generation computer organization

by STANLEY E. LASS
Scientific Data Systems, Inc.
Santa Monica, California
INTRODUCTION
A single processor's performance is limited by its
organizational efficiency and the technology available. Paralleling of processors and/or improving the
organizational efficiency are the ways of obtaining
greater performance with a given technology. Much
research has been done on multiple processors and
single processors which perform operations on vectors
in parallel.
However, significant portions of problems are
sequential, and performance in the sequential portions is limited to that of a single processor. This paper
describes a proposed new medium- to large-scale computer organization designed to improve single-processor organizational efficiency. The basis of this
approach is the separation of memory operations
(fetching, storing) control from the arithmetic unit
control. Each control unit executes its own programs.
Memory operations programs fetch instructions,
fetch operands, and store results for the arithmetic
unit. Buffering allows a maximum of asynchronism
between the arithmetic operations and the memory
operations. To perform a given computation, each
control unit executes fewer and less complex instructions than a third-generation computer control unit.
The less complex instructions require less time to execute and, since fewer instructions per control unit
are required, the computer can operate much faster.
Cost-performance of logic
A logic circuit delay of approximately 0.2 nsec
has been achieved on an integrated circuit chip. High-speed logic circuit delay of 1.8 nsec has been achieved
in the third generation. Low-cost bipolar logic with
250 gates on a chip at 5 cents/gate has been predicted
for 1970. Per-gate costs are presently about 50 cents.
Cost-performance of logic will thus be about two
orders of magnitude better than in the third generation.
Arithmetic operation times
As a result of this cheaper and faster logic, it will be reasonable to minimize operation times by extensive use of combinational logic and separate functional units (e.g., an add/subtract/logical unit and a multiply/divide unit). The implications of this procedure can be emphasized by estimating the arithmetic operation speeds that will result.

These estimates are based on extrapolations from published papers [1, 2, 3] and include an allowance for the additional logic levels required. Also, the estimates assume a 1-nsec delay in the environment for one level of AND/OR logic along the critical path. The critical-path distance will be minimized by a combination of staying on the integrated circuit chip and keeping the path distance between chips short.

Estimated arithmetic operation speeds are:
Operation                                  Elapsed Time   Pipelined Time/Operation
32-bit fixed-point add/subtract            8 nsec         4 nsec
32-bit fixed-point multiply                16 nsec        8 nsec
32-bit fixed-point divide                  56 nsec        -
32-bit logic functions                     8 nsec         4 nsec
32-64 bit floating-point add/subtract      16 nsec        8 nsec
32-64 bit floating-point multiply          20 nsec        10 nsec
32-64 bit floating-point divide            70 nsec        -
With separate functional units, time can also be
saved by using functional unit outputs directly as
inputs without intervening storage. Pipelining can
also be used to increase the throughput. For pipelined
operation, the execution of a function is divided into
two or more stages, and a set of inputs can be in execution in each stage. The time between successive
inputs can be much less than the elapsed time for
the execution of a function.
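The throughput gain from pipelining can be illustrated with the table's figures. The sketch below is a hypothetical calculation, not part of the original design; it treats the 16-nsec multiply as two 8-nsec stages:

```python
# Illustrative figures from the table above: a 32-bit multiply with a
# 16-nsec elapsed time split into two 8-nsec pipeline stages.
ELAPSED_NS = 16                      # time for one multiply, start to finish
STAGES = 2
STAGE_NS = ELAPSED_NS // STAGES      # 8 nsec between successive inputs

def pipelined_finish_times(n_ops):
    """Completion time of each of n_ops multiplies fed back to back."""
    # A new input enters every STAGE_NS; each takes ELAPSED_NS overall.
    return [i * STAGE_NS + ELAPSED_NS for i in range(n_ops)]
```

Ten back-to-back multiplies finish in 16 + 9 x 8 = 88 nsec, versus 10 x 16 = 160 nsec without pipelining; the time between successive inputs is much less than the elapsed time of any one operation.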
Cost-performance of memory
Memory costs will be roughly halved by batch-processed fabrication. Access times on the order of 100 nsec and cycle times on the order of 200 nsec will be achieved. This represents nearly an order of magnitude improvement in cost-performance over third-generation memories.
From the collection of the Computer History Museum (www.computerhistory.org)
Spring Joint Computer Conference, 1968
Implications of memory technology
Logic speed is increasing relatively faster than
memory speed. Cheaper logic makes it reasonable to
perform the arithmetic operations in fewer logic levels.
As a result, the disparity between arithmetic operation times and memory access times will increase by a factor of roughly two to three. This implies greater instruction lookahead to efficiently utilize the arithmetic unit's capacity, and increased instruction lookahead is difficult to achieve [4].
However, a partial solution to this disparity exists
and is described in the sections that follow.
Associative buffer and block-organized main memory
A scratchpad memory buffers the processor and
main memory. Blocks of words are transferred between the scratchpad memory and main memory. The
scratchpad memory and the associative memory
together comprise the associative buffer. The operation proceeds as follows:
The virtual address of a requested word is associatively checked with the virtual addresses of the
blocks in the scratchpad. If the word is in a block in
the scratchpad memory, it is output to the processor.
If not, the block containing the word is obtained from
main memory and stored in the scratchpad memory,
and the word is output to the processor. Similarly,
when storing a word, the block must be in the scratchpad memory.
This is similar to paging in third-generation time-sharing systems and it involves the same problems
(e.g., which block to delete or store when room is
needed for a new block). The net result is a substantial
reduction in access time when the word is in the
scratchpad memory [5, 6].
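The lookup sequence above can be sketched in modern pseudocode. The class and method names below are illustrative, not from the paper, and the block-replacement policy is deliberately omitted since the paper leaves it open:

```python
# A minimal sketch of the associative-buffer lookup described above.
# Names and data structures are illustrative, not the paper's hardware.
BLOCK_WORDS = 16  # 16 consecutive 32-bit words per block

class AssociativeBuffer:
    def __init__(self, main_memory):
        self.main = main_memory      # dict: block number -> list of words
        self.blocks = {}             # scratchpad: resident blocks by number

    def fetch(self, virtual_address):
        block_no, offset = divmod(virtual_address, BLOCK_WORDS)
        if block_no not in self.blocks:      # associative check misses
            # Obtain the whole block from main memory (replacement
            # policy for a full scratchpad omitted for brevity).
            self.blocks[block_no] = list(self.main[block_no])
        return self.blocks[block_no][offset]

    def store(self, virtual_address, word):
        block_no, offset = divmod(virtual_address, BLOCK_WORDS)
        if block_no not in self.blocks:      # block must be resident
            self.blocks[block_no] = list(self.main[block_no])
        self.blocks[block_no][offset] = word
```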
To provide a basis for comparison, assume a block-organized main memory with each block consisting
of 16 consecutive 32-bit words. Eight interleaved
block-organized memories of 100-nsec access time
and 200-nsec cycle time provide a combined memory
bandwidth of over 2 x 10^10 bits/second.
Access times from processor to memory are approximately 30 nsec for words in the scratchpad, and
150 nsec for words in main memory. Pipelining
through the associative memory and parallel scratchpads is used to achieve a high associative buffer bandwidth.
Assume a fetch or store every 10 nsec, where six
percent of these require accessing main memory. The
six percent is based on data [5] modified to reflect the
differences in computer organization. This corresponds to an instruction rate of approximately 80
million per second. This also corresponds to six
blocks per microsecond from main memory or 15 per-
cent of bandwidth. With bandwidth usage this low,
another processor could be added without severe
degradation in performance due to interference.
It also allows high input-output transfer rates with
modest interference.
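The figures above can be checked with a short calculation. This is a back-of-envelope sketch of the arithmetic, not a hardware specification:

```python
# Checking the bandwidth arithmetic above (illustrative calculation only).
WORD_BITS = 32
BLOCK_WORDS = 16
CYCLE_S = 200e-9          # 200-nsec cycle time per memory
MEMORIES = 8              # eight interleaved block-organized memories

# Each memory can deliver one 16-word block per cycle.
bandwidth = MEMORIES * BLOCK_WORDS * WORD_BITS / CYCLE_S   # bits/second
assert bandwidth > 2e10            # "over 2 x 10^10 bits/second"

accesses_per_s = 1 / 10e-9         # a fetch or store every 10 nsec
miss_traffic = 0.06 * accesses_per_s * BLOCK_WORDS * WORD_BITS
assert round(miss_traffic / bandwidth, 2) == 0.15   # 15 percent of bandwidth
```

Six percent of 10^8 accesses/second is six block transfers per microsecond, each moving a 512-bit block, which is the 15 percent figure quoted above.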
It is desirable, with this design, to group operands
and sequence the addressing to minimize the number
of block transfers. This lowers the average access
time and lowers the main memory bandwidth usage.
Programming implications
Most programming will be in higher-level languages.
The computer cost will be a smaller and smaller portion of the total costs of solving a given problem.
The main goal of the designer is to maximize the
system throughput with programs written in higher-level languages. The user sees a system that executes
programs written in higher-level languages.
The average job execution time does not decrease
significantly when the computer speed is increased
significantly. The explanation for this seems to be
that the number of programmers and the number of
jobs they submit each day do not change appreciably,
but the jobs they do submit are longer in terms of
number of instructions executed; e.g., they try more
cases or parameterize in finer increments. The number of instructions executed per job by the operating
system (including compilers) will probably not increase by more than a factor of five, even if increased
optimization of compiled code and decreased efficiency due to use of table-driven compiler techniques
(for lower software cost) are factored in.
Operating systems, compiling, and input conversions (e.g., decimal-binary) are essentially input-output functions and their volume is proportional to
the number of programmers and people preparing input and reading output. If the computer speed is increased by a factor of twenty-five, then the operating
system (including compilers) time will decrease by a
factor of more than five; and the computer will be executing jobs more of the time. Similarly, the proportion
of time devoted to byte manipulation, binary-to-decimal conversions, etc., will decrease.
Byte, halfword, and shifting operations may not be
included in the hardware for the above reasons. Shifting would be accomplished by multiplying by a power
of two.
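As a worked example of the suggestion above: a left shift of k bits is the same as a multiply by 2^k, reduced modulo the 32-bit word size. The function name is illustrative:

```python
# Shifting replaced by multiplication, as proposed above: a left shift of
# k bits is a multiply by 2**k, wrapped at the 32-bit word boundary.
def shift_left(x, k, width=32):
    return (x * 2**k) % 2**width

assert shift_left(3, 4) == 3 << 4           # same result as a hardware shift
assert shift_left(2**31, 1, width=32) == 0  # high bit shifted out of the word
```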
The equivalence of logic design and programming
Both the logic designer and the programmer implement algorithms. Each has to choose a representation of the data involved. Whereas the programmer
uses instructions to implement algorithms, the logic
designer uses combinations of logic elements (AND,
OR, NOT, and storage). In addition to verifying that
the logic is correct, the designer must observe the
electrical limitations of the logic elements and their
connections (i.e., circuit delay, fan-in, fan-out, and
wire propagation delay) in order to execute the logic
function correctly within the time allotted.
Hardware instruction lookahead is, in effect, a
recoding of several instructions to obtain the instantaneous control actions. The hardware recoding and
the resulting asynchronism depend on conditions
within the computer (e.g., variations in instantaneous memory access time due to interference). Hardware recoding operates in real time at execution time
and is strictly limited in complexity by time and economic considerations.
The recoding can also be performed by software
at compile time if execution-time asynchronism is
sacrificed. All concurrency is planned at compile
time. If an instruction or operand were not available
when needed (due to memory interference), the control would halt until it became available. The recoded
program, containing control timing and sequencing
information, would require several times as many bits
as the unrecoded program. It would resemble a
microprogram with groups of microinstructions to be
executed in parallel. The computer time required
for recoding at compile time is proportional to the
length of the program, not the number of instruction
executions required to complete the program. Also,
software recoding is not limited by the real-time constraint. As a result, the software recoding can economically be much more complex and more effective. In
the recoded form, operand fetches are initiated several
instruction cycles before they are used.
For example, the recoded form of the inner loop of
a matrix multiply would be several operand fetches,
followed by concurrent operand fetches and arithmetic
operations and finally by the last arithmetic operations. The same result could be obtained by an independent operand fetch loop which starts several instruction cycles before the arithmetic operation loop
is started. Two separate centers of control are implied. Fewer bits are required to represent the program
by specifying the two loops separately, but the number
of bits is still more than third-generation instructions
require.
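The two-loop form above can be sketched in software. In the sketch below, which is illustrative rather than the paper's encoding, a queue stands in for the data channel and an end-of-record marker terminates the arithmetic loop as described earlier; in hardware the two loops would run concurrently, with the fetch loop several cycles ahead:

```python
from collections import deque

# A software sketch of the two-loop recoding of a matrix-multiply inner
# loop: an operand-fetch loop fills the "channel", and an arithmetic loop
# consumes it, terminating on the end-of-record marker.
def inner_product(a_row, b_col):
    channel = deque()

    # Operand fetch loop: independent of the arithmetic loop.
    for a, b in zip(a_row, b_col):
        channel.append((a, b))
    channel.append(None)        # end-of-record marker after the last operand

    # Arithmetic loop: an attempt to read the marker terminates the loop.
    total = 0
    while True:
        item = channel.popleft()
        if item is None:
            break
        a, b = item
        total += a * b
    return total
```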
The proposed computer organization has a separate
control unit for fetching and storing (the data channel
control unit) and an arithmetic control unit.
For comparison, note that the CDC 6600 and the
LIMAC [7] have separate instructions (but not separate programs) for arithmetic operations and memory operations.
Data channels and their control
Figure 1 shows the data channels which are the
information-flow paths in the computer.
Figure 1 - Information flow diagram (data channel control unit, associative buffer, main memory). Arrows indicate data paths in the computer. Instructions are transmitted to the arithmetic units over paths indicated by dashed arrows. Double arrows are the data paths for each set of eight data channels. Two data paths suffice for eight data channels, since two data items at most are transferred at a time.
Channel commands for multiple-word transfers
consist of a virtual memory address, the channel
number, a flag indicating load or store pushdown
stack, address increment, and count.
For loop control during arithmetic operations, an
end-of-record marker follows the last operand. An
attempt to read the end-of-record marker as data will
terminate the loop.
A channel can be cleared of previous contents by
flagging the first command of a new channel program.
All store commands of the old channel program for
which data was stored in the channel are properly
executed. A channel must be cleared or sufficient
time must elapse to store the data before subsequent
commands reference that data.
Another channel capability is the capability to load
a variable number of words (limited by the buffer size)
in a circular register. Its use is primarily for storing
instructions and constants within loops. It can be
entered by flagging the channel command which specifies the last word in the loop. The first word will then
follow the last word until the channel is cleared. This usage of the channel will later be referred to as circular mode.
The input-output register of the channel has a data-presence bit to indicate data availability. The register functions in four ways:

1. Nondestructive read: The presence bit is left on and the register contents remain the same.
2. Destructive read: The presence bit is turned off, the register is filled with the next data word in the channel, and the presence bit turned back on.
3. Nondestructive store: The current contents of the register are pushed down one and the presence bit is left on.
4. Destructive store: The current contents of the register are replaced.

Provision is made for saving channel status, using the channel for another purpose, and later restoring the channel to its original status.

The master control unit, fixed-point arithmetic unit, and input-output unit each have a data channel reserved for commands. Any data stored in these data channels are transmitted to the data channel control unit for immediate execution as a command.

The second source of commands is the input-output registers of specified data channels. Commands present in the specified data channels (as indicated by the presence bit) are read destructively from the input-output register and executed. The commands are executed by small, fast, special-purpose computers in the data channel control unit.

As an example, the execution of a single command (received from the master control unit) loads a data channel with commands; the first command loads another data channel with commands for fetching instructions, the second command loads still another data channel with commands for storing data, and the remaining commands fetch operands.

Channel command programs loading data can fetch ahead of arithmetic execution a number of words limited by the size of the data channel buffer. Channel command programs end by running out of commands.

The master control unit

Eight double-word data channels supply the master control unit with instructions. The control is selected to one of the eight data channels. Instructions present in the selected data channel (indicated by the presence bit) are read destructively from the input-output register of the data channel and executed. There are five types of instructions:

1. Arithmetic instructions are transmitted to the appropriate arithmetic unit.
2. Channel commands appearing in the instruction stream are transmitted to the channel control unit through a data channel.
3. An all-zero instruction is a no-operation.
4. An instruction is provided to conditionally switch between the instruction unit data channels.
5. An instruction is provided to conditionally skip a specified number of instructions.

Data channel buffering

Channel buffers would be implemented as circular buffers using integrated scratchpad memories. Channel action when used for loading operands is as follows:

Initially, both input and output pointers are set to 0 (see Figure 2). The first input requested goes into word 0, and a data-presence bit is set when the input arrives. Each successive input requested goes to the next higher word (modulo 7). The fetch-ahead depth of our example is limited to 8. Output can only occur if the data-presence bit is set. If the instruction turns off the presence bit, the next output comes from the next higher word. While inputs are requested in the command order, they may arrive at the buffer out of sequence, but they will go to the correct word in the scratchpad.

Figure 2 - Data channel buffer (channel-side pointer, input-output register pointer)
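The buffer behavior just described can be sketched as a small model. The class below is illustrative, not the paper's hardware: eight words, wrap-around pointers, and a presence bit per word, with destructive and nondestructive reads:

```python
# A sketch of the circular data-channel buffer of Figure 2: eight words
# (word 0 through word 7), wrap-around input/output pointers, and a
# data-presence bit per word. All names are illustrative.
class DataChannelBuffer:
    SIZE = 8

    def __init__(self):
        self.words = [None] * self.SIZE
        self.present = [False] * self.SIZE
        self.in_ptr = 0       # channel-side pointer
        self.out_ptr = 0      # input-output register pointer

    def put(self, word):
        """An input arriving from memory; sets the data-presence bit."""
        assert not self.present[self.in_ptr], "fetch-ahead depth exceeded"
        self.words[self.in_ptr] = word
        self.present[self.in_ptr] = True
        self.in_ptr = (self.in_ptr + 1) % self.SIZE

    def read(self, destructive=True):
        """Output can only occur if the data-presence bit is set."""
        assert self.present[self.out_ptr], "control hangs until data arrives"
        word = self.words[self.out_ptr]
        if destructive:       # the next output comes from the next word
            self.present[self.out_ptr] = False
            self.out_ptr = (self.out_ptr + 1) % self.SIZE
        return word
```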
Arithmetic unit control

An arithmetic instruction specifies the two inputs, destructive or nondestructive read, and the operation to be performed. The inputs are from data channels and functional unit outputs (see Figure 3).

Figure 3 - Arithmetic unit organization (adder/subtractor/logic unit, multiply-divide unit, input-output bus, input bus, eight data channel buffers). Arrows indicate data paths and direction.

A store instruction transmits the data on an output bus or a data channel to a data channel. One data channel leads to the channel control unit for computed channel commands.

The first stage of instruction execution is testing whether the specified inputs are present. If they are not, the control hangs up until they arrive. During the last stage, the inputs are latched in the functional unit while the operation proceeds. The output-presence bit is set when the operation is completed. The presence bit is turned off by a destructive read of the functional unit output (by an instruction), or by testing the condition code generated by the operation. For example, a compare is accomplished by testing the condition code of a subtract operation.

To pipeline, two pairs of inputs must be latched in before the result of the first pair is read. Trying to read an end-of-record marker results in switching control to the next data channel in the control unit. This is used mainly for terminating loops.

To facilitate data exchange between the separate fixed- and floating-point arithmetic units, two data channels are common to both.

Channel commands for instruction and data sequencing

The channel commands for a set of instructions and their data are normally located together in memory.

Sequential instructions are executed by transmitting a channel command to the data channel control unit specifying the instructions and their data.

Small loops are the same as sequential instructions, except that circular mode is specified in the channel command. Arithmetic data may also use circular modes. Leaving the loop is accomplished by switching to another data channel (by conditional branches) or by trying to read data and getting an end-of-record marker. If this data channel is itself in circular mode, we have a loop within a loop.

Conditional branches are handled by anticipatory loading of a channel with the successful branch instruction stream and switching to the channel if the branch is successful.

Unconditional branches are handled at the channel command level by loading the channel with the branched instruction stream.

Subroutines are handled by loading a channel with the subroutine (or at least with the beginning of it) and then switching control to that channel. Returning is accomplished by switching back to the original channel. Some subroutining can be specified at the channel command level by channel commands for an unconditional branch to the subroutine and an unconditional branch back.

However, sooner or later all channels will be in use, in which case a channel status is saved, the channel is used, and later the channel status is restored. This is analogous to saving a program location counter, executing a subroutine, and then restoring the location counter.
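The channel-switching and status save/restore mechanisms above can be sketched as a small model. Everything here is illustrative; the paper does not give a concrete format for channel status:

```python
# A sketch of channel-level control flow: branches switch between
# preloaded channels, and a busy channel can have its status saved and
# later restored, much like saving and restoring a location counter.
class ChannelControl:
    def __init__(self, n_channels=8):
        self.position = [0] * n_channels  # each channel's place in its stream
        self.saved = []                   # stack of saved channel statuses
        self.current = 0                  # channel now supplying instructions

    def switch(self, channel):
        """Branch: take instructions from another, preloaded channel."""
        self.current = channel

    def save_status(self, channel):
        """Reuse a busy channel (e.g., for a subroutine)."""
        self.saved.append((channel, self.position[channel]))
        self.position[channel] = 0

    def restore_status(self):
        """Return: put the channel back in its original status."""
        channel, pos = self.saved.pop()
        self.position[channel] = pos
```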
Hardware design and packaging

The computer is naturally partitioned into nearly
autonomous units. Repetition of parts is found in
the multiple uses of data channels (which are mostly
memory), the special-purpose computers in the data
channel control unit, and the associative buffer (which
is mostly associative and addressable memory).
The hardware complexity of control needed to
achieve a high level of concurrency is minimized by
separate control of memory operations and of the arithmetic unit. Complex instruction-lookahead hardware is not required.
Input-output
Input-output would be controlled by a small computer with a scratchpad memory for input-output commands and buffering.
The small computer is the interface between the
peripherals and the data channels. It also generates
commands to control input-output data transfers in the data channels.
Time-sharing and multiprogramming
Paging from a rotating memory is the currently
popular solution to managing main memory in a timesharing environment. If, in a typical third-generation
system, a 50 x 10^6 instructions/sec processor is substituted and the page access time plus transfer time
not changed significantly, the processor will be waiting
on pages most of the time. Access time from rotating
memories cannot be improved significantly, but transfer rates with head-per-track systems can be very
high.
This suggests an approach based on the ability to
read complete programs from the disc into main memory quickly, process them rapidly, and return them to
the disc quickly. This minimizes the prorated memory usage by a program and allows high throughput without an excessively large memory.
The scheduler maximizes processor utilization within the constraint of system response times. Programs
normally reside on the disc. The scheduler selects the
next program to be transferred to memory. One factor
in the selection is the amount of time before program
transfer would begin (instantaneous access time).
The program is transferred at 10^9 bits/second (e.g., a 10^6-bit program is transferred in 1 msec). The program is put on a queue of programs to be processed.
Having the complete program in memory allows processing without paging to the next input or output by
the program, and then writing the processed program
into available disc space (generally the first available
disc space) without regard to its previous disc location.
The previous disc location is added to the available
disc space.
Scheduler considerations include the distribution
of available space around the disc, distribution of
ready-to-be processed programs around the disc, and
nearness of ready-to-be-processed programs to their
response-time limit. Programs whose processing
time exceeds a specified limit are not allowed to degrade the system response time of the other programs.
Programs larger than the memory require partitioning into files or pages.
Main memory (or optionally a slower, lower-cost
random-access memory) and the disc buffer the input-output activity of the programs.
The system could be organized to place FORTRAN users in one group, JOSS users in another
group, etc. Each user group would have its own compiler and supporting operating system. A portion of the
operating system would be common to all of the
groups. The computer would be a "dedicated"
FORTRAN system for a fraction of a second, then
a "dedicated" JOSS system for another fraction of
a second, etc. As a result, operating systems would
be simpler and a change could be made in one system
without affecting the other systems.
CONCLUSIONS AND OBSERVATIONS
The system described here achieves concurrency of
fetching, arithmetic operations, and storing without
the need for complex instruction lookahead hardware.
The complexity of control is in software.
The bandwidth of the processor is over 100 million
equivalent third-generation instructions per second.
This rate will be achieved for some problems. However, delays due to waiting for operands or instructions in the data channels will lower the processing rate in many cases. Some types of problems seem to
inherently have a great deal of delay - for example,
table-lookup using computed addresses.
Problems in which the flow of control and addressing are not data-dependent could run near the bandwidth of the system. (An additional requirement for
this is that the addressing be such that the block transfer rate between main memory and the associative
buffer is reasonable.) Optimizing the code consists
of (1) minimizing the delays caused by instructions and operands not being available when needed, and (2) pipelining and overlapping the arithmetic
operations.
To program for this processor in its machine language, a master control program (instructions) and
channel programs (commands) are prepared. There
are many chances to make an error and lose synchronism between instructions and commands. As a
result of the difficulty in machine-language programming with this organization, even more programming
would be in higher-level languages.
Initially, the compiler for this computer could be
relatively crude and unsophisticated. As time passes
the subtleties and characteristics of the design would
be assimilated and experience gained by the compiler
writers. As a result, midway through the fourth generation the computer should average 50-80 million
equivalent third-generation instructions per second.
A more powerful data channel command set than
is described here may be desirable for non-numeric
applications [8].
The only significant way to reduce software cost by
hardware is to build a faster computer (with a lower
cost per computation), which will then allow the programmer to reduce total costs by using algorithms
that are simpler to program but require more computer
processing.
REFERENCES
1 C S WALLACE
A suggestion for a fast multiplier
IEEE Transactions on Electronic Computers Vol EC-13
February 1964
2 M LEHMAN N BURLA
Skip techniques for high-speed carry propagation in binary
arithmetic units
IRE Transactions on Electronic Computers Vol EC-10
December 1961
3 S F ANDERSON J G EARLE
R E GOLDSCHMIDT D M POWERS
The IBM system/360 model 91: Floating-point execution unit
IBM Journal of Research and Development Vol 11 No 1
January 1967
4 D W ANDERSON F J SPARACIO
R M TOMASULO
The IBM system/360 model 91: Machine philosophy and instruction handling
IBM Journal of Research and Development Vol 11 No 1
January 1967
5 D H GIBSON
Considerations in block-oriented system design
Proceedings of the 1967 Spring Joint Computer Conference
6 G G SCARROTT
The efficient use of multilevel storage
Proceedings of the IFIP Congress Spartan Books 1965
7 H R BEELITZ S Y LEVY R J LINHARDT
H S MILLER
System architecture for large-scale integration
Proceedings of the 1967 Fall Joint Computer Conference
8 B CHEYDLEUR
Summary session, proceedings of the ACM programming languages and pragmatics conference
Communications of the ACM Vol 9 No 3 March 1966