Functionally Parallel Architecture for Array Processors

Edmund U. Cohler and James E. Storer
CSPI

Based on the natural division of mathematical problems, functional
parallelism becomes the architectural key for improving speed/cost
ratios for array processors.

Modern array processors can give more floating-point calculations per
dollar than conventional computers by the efficient use of parallel
equipment. At the same time, they conform quite well to programmability
characteristics found in conventional computers. This article describes
the general design philosophy and some architectural features of the
CSPI MAP-200, a modern array processor that achieves these desirable
characteristics by using asynchronous functional parallelism.
Parallelism: efficiency and types

Since the consistent escalation of component speed characterizing
earlier decades did not occur in the last decade, one must seek greater
speed through the parallel use of conventional components. Several
architectural techniques have been proposed for realizing parallelism
with efficiency and ease of programming.

Functional parallelism arose from the recognition that, while a
conventional computer was well modularized by function for programming
a wide range of problems, there was latent parallelism in the hardware
that was usually not available to the user because the controller did
not permit such parallel operation. For example, a STORE to memory
could logically be accomplished at the same time as a JUMP to a branch
routine. The hardware was there, but the controller did not allow this
conjunction in a single instruction.

A microcoded controller presented the programmer with an instruction
field for each piece of equipment in the structure. The more fields he
could fill on a given line of code, the more efficient the usage of the
parallel hardware. Gains in efficiency were made, but generally only
for those algorithms for which the machine was specifically adapted.
Moreover, the hardware efficiency gains were offset by the burden of
programming in microcode, which was like doing logical design in
programs. Although one present-day array processor, the AP-120B, has a
Fortran compiler that targets to microcode, the compiled code runs more
slowly than equivalent microcoded routines.

The efficient use of the multiplier units was our measure for good
architectural efficiency. The actual speed of a multiplier is a
function of the money and skill applied to designing and realizing the
multiplier, but it does not measure the machine's architecture. The
ratio of the multiply rate achieved to the multiply rate possible is a
measure of architecture that is independent of the amount of money
spent on the realization. This observation led to the MAP architecture.

Balance in parallelism

For a functionally parallel array processor, we can "parse" a program
into the following hardware functions:

* Floating-point arithmetic calculations,*
* Data-address calculations (integer) and loop counting,
* Instruction fetching,
* Data-memory transfers,
* Program module parametrization, executive processor, and
* I/O communications and addressing.

*We are treating processors for real and double-precision arrays only.
While it makes sense to have similar processors for logical, character,
and integer arrays, array processor technology has only begun to attack
these sorts of problems. Therefore, we will not treat these
architectures.
In the MAP, an executive processor, the CSPU, handles interpreting host
commands, binding programs to the specific buffers, and sequencing
commands to the other processors. A number of processors, called I/O
scrolls, may be provided to handle the communication between peripheral
devices and MAP memory. A similar processor, the host interface module,
handles communications between the host and the MAP memory. The major
part of the array processor job is handled by the floating-point
arithmetic calculator and the addresser, the targets of this
discussion.
Problem analysis showed that we could further divide these functions
into parallel hardware to provide an optimum balance among functions.
Just as the capacity of each production machine in a factory should
match that of the others to prevent bottlenecks, so calculation
facilities must have commensurate capabilities to avoid computational
bottlenecks. For example, a study of various algorithms revealed a
ratio of adds to multiplies that varied from one to four, with
important algorithms clustering around 1.5:1. Thus, separate add and
multiply units were included, with the add unit twice as fast as the
multiplier. The add unit was also given the power to accomplish other
less frequently used but very useful instructions: approximate
reciprocal, MAX, FLOAT, etc.
Besides the resulting improved arithmetic speed, we found that
interregister moves of data were consuming a comparable amount of time,
so they too were given parallel hardware which could be separately
controlled. Furthermore, memory access could be divided into input and
output transfers whose times were comparable to arithmetic time. Thus,
the floating-point arithmetic unit was suitably divided into
controllable units which, in a conventional machine, would have to act
sequentially. If the division were perfect for all problems, a 4:1
speedup over a conventional architecture with the same unit speeds (the
memory accesses are single-ported) could be achieved. Similar
"balancing acts" led to the following subdivision of the major
functional divisions mentioned above:

* Floating-point arithmetic unit:
    Multiplier
    Adder/miscellaneous instructions
    Internal register transfers
    Input data queue
    Output data queue
* Address calculation and loop counting unit:
    Calculation and counting
    Memory transfer controller
    Input address FIFO
    Output address FIFO
While the division of functions was made on the basis of an ensemble of
problems, the fast Fourier transform, or FFT, was given the greatest
weight. Our experience had shown this algorithm to be tough, important,
and varied in requirements. How good this decision was for the MAP-200
design can be seen in Table 1. In a perfect balance, the dominant
interval would cover all others, equaling the measured time. The
architectural balance is measured by the ratio of dominant interval to
measured time. In this sense, the balance of the MAP-200 is better than
70 percent for a wide variety of algorithms.

Despite its prominence in design considerations, the FFT is not at the
top of the range. However, since this ratio is only one measure of how
well the balance was accomplished, it is inadequate if used alone. If
the dominant interval is the time of a very weak sister in a family of
parallel equipment, it would be quite simple to have it consume most of
the time, regardless of the achievement of parallelism.

To measure the extent to which there has been genuine improvement in
the overall time, the sum of the times for the individual operations
may be compared to the measured time. If the individual units are
perfectly balanced in time, then the best achievable ratio for this
architecture would be 4:1, an indication that the task has been broken
down into four equal parts now being accomplished completely in
parallel. This is very difficult to achieve over a variety of
algorithms. Nevertheless, it can be seen that the architecture permits
over half of the facilities to be employed on a wide range of arbitrary
mathematical algorithms.
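The two figures of merit just described are easy to state in code. The
Python sketch below is added for illustration only; the unit times and
measured time are hypothetical placeholders, not values from Table 1.

    # Sketch: the two balance metrics used in the text, for one algorithm.
    # All times are hypothetical, in microseconds per output.

    unit_times = {
        "memory transfers": 6.0,
        "floating-point arithmetic": 5.0,
        "program memory cycles": 4.0,
        "address calculations": 4.5,
    }
    measured = 8.0    # hypothetical measured execution time per output

    dominant_unit, dominant = max(unit_times.items(), key=lambda kv: kv[1])
    dominant_ratio = dominant / measured               # 1.0 = perfect coverage
    sum_ratio = sum(unit_times.values()) / measured    # 4.0 = perfect 4-way overlap

    print(f"dominant process: {dominant_unit}, {dominant_ratio:.0%} of measured time")
    print(f"sum of unit times / measured time: {sum_ratio:.2f} (out of a possible 4)")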
Decoupling the units

Achieving the desired efficiency and programming ease depends on the
choice of functional breakdown and on the decoupling of these functions
so that each can proceed at its own pace without waiting for another
unit to complete its task. The MAP architecture uses several techniques
for decoupling the balanced units already described: data and access
queues, automatic memory-access sequencing, loop-completion semaphores,
and sequential-instruction overlap.
Table 1. Balance of hardware in various problems, using the MAP-200
with 300-ns memory. (The table compares five problems: a 1K complex
FFT, a 1K two-dimensional coordinate transform, histogram percentiles
on 1K samples, a 5 x 5 tri-block diagonal equation solution, and a
100 x 100 matrix inverse. For each problem it lists the time spent in
memory transfers, floating-point arithmetic, program memory cycles, and
address calculations, the measured execution time, the dominant
process, the ratio of the dominant interval to the measured time, and
the ratio of the sum of the unit times to the measured time. The
dominant-interval ratios range from 71 to 95 percent, and the sum
ratios from 1.9 to 2.9. Times are per output component except for the
histogram, which is per input component.)
The MAP-200 has an asynchronous integer addresser separate from its
floating-point arithmetic unit. In this architecture, data queues and
address FIFOs decouple the data-stream accesses from the address-stream
creation. To illustrate the effects of this decoupling, let us start
with a rather typical instruction sequence for a minicomputer. In such
a sequence, if we wished to add a pair of vectors, each from a separate
buffer, and put them in a third buffer, we might see the following set
of instructions:

LOOP:  FETCH R1, X(R2)              FETCH. (R1) <- X(I)
       ADD R1, Y(R2)                FETCH and ADD. (R1) <- (R1) + Y(I)
       STORE R1, Z(R2)              STORE. Z(I) <- (R1)
       DECR AND JUMP IF+ R2, LOOP   Count, test, and loop

The instructions that determine the address stream structure are
interleaved with or are actually part of the instructions that do the
arithmetic operations or access the operands. One has to consider them
simultaneously, making the programming difficult wherever the
addressing structure is the least bit complicated. The synchronization
becomes even more difficult when the computer is microprogrammed, since
specific timing signals and parallel operation of equipment are under
programmer control, and the sequential operation of specific
instructions becomes his responsibility.

Now, compare the same arithmetic process programmed in the MAP-200
processor, as shown below:

LOOP:  MOV(IQA, A1)                 FETCH A1 <- X
       MOV(IQA, A2)                 FETCH A2 <- Y
       ADD(A1, A2)                  RESULT <- X + Y
       MOV(S, OQ)                   STORE RESULT
       JUMPC(LOOP, FI)              JUMP TO LOOP UNLESS INPUT IS FINISHED

It should be noted that there are no addresses in this program. When an
input is desired, the arithmetic processor unit fetches it from the
input queue. The arithmetic program takes no cognizance of where the
data is coming from or where it is going. It simply assumes that data
is coming in the desired order. The only synchronization with the
address processor is the JUMP at the end of the program. That JUMP is a
test to see whether or not the addresser has said that the arithmetic
processor has finished its task because all data has been properly
handled.

Similarly, the program for the address processor takes no cognizance of
the actual mathematics involved in the arithmetic computation, but only
of the number, types, and sequences of mathematical entities to be
treated. Thus, for example, the function Y = [exp(A * X) + log(B + Y)] * C
would use the same addresser program as the function Y = A * X - B * U + C.
In other words, the arithmetic operation can be programmed first
without consideration of addressing; then, the addressing can be
treated in the absence of arithmetic considerations.
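The same separation can be mimicked in a conventional language. In the
Python sketch below (added for illustration; the function names and the
generator protocol are ours, not part of the MAP-200 software), one
routine produces only the index stream and the arithmetic kernels only
consume values, so either kernel can be paired with the one address
program.

    # Sketch: the address program knows nothing about the arithmetic, and the
    # arithmetic program knows nothing about addresses.  Both kernels below
    # reuse the same address generator, as in the Y = [exp(A*X)+log(B+Y)]*C
    # versus Y = A*X - B*U + C example in the text.
    import math

    def address_stream(n):
        """Produce the read indices for X (and, implicitly, the write indices for Y)."""
        for j in range(n):
            yield j

    def kernel_simple(a, b, c, x, u, y, j):
        return a * x[j] - b * u[j] + c

    def kernel_fancy(a, b, c, x, u, y, j):
        return (math.exp(a * x[j]) + math.log(b + y[j])) * c

    def run(kernel, a, b, c, x, u, y):
        # The "addresser" decides which elements are touched and in what order;
        # the "arithmetic unit" only decides what is computed on them.
        for j in address_stream(len(x)):
            y[j] = kernel(a, b, c, x, u, y, j)

    x = [0.1, 0.2, 0.3]
    u = [1.0, 2.0, 3.0]
    y = [1.0, 1.0, 1.0]
    run(kernel_fancy, a=0.5, b=2.0, c=1.5, x=x, u=u, y=y)   # same address program,
    run(kernel_simple, a=0.5, b=2.0, c=1.5, x=x, u=u, y=y)  # different arithmetic
    print(y)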
Basic mode of queue and FIFO operation

A block diagram of the floating-point calculator, or AP, portion of a
MAP-200 system is presented in Figure 1. Three processing units are
evident: the addresser, or APS; the memory transfer controller, or MTC;
and the arithmetic processing unit, or APU. Connecting them are the
read address FIFO, or RAF; the write address FIFO, or WAF; the input
queue, or IQ; and the output queue, or OQ.

These units operate asynchronously from each other. Thus, the
addresser's objective is to produce data addresses and try to keep the
RAF and the WAF full. The MTC, seeing a read address in the RAF and
space in the IQ, will execute a read from memory. Alternatively, a
write address in the WAF and data in the OQ will cause the MTC to
execute a memory write. The APU takes the data from the IQ, executes
the described arithmetic operation, and places the result in the OQ. In
the MAP-200 system, the APS and APU both have program memories, whereas
the MTC is a fixed set of logic.

It should be emphasized that these three units operate at their own
speeds, independently of each other. The FIFOs and queues provide
elastic coupling. The basic philosophy of operation is "do it if you
can; if not, wait." The basic status of the FIFOs (full, not full, not
empty, and empty) is used to communicate the need to delay execution of
an operation.
To understand the operation of this system, consider the following
program segment:

      COMMON/BUS 2/ Y(1000), X(2000)
      N = 757
      DO 1 J = 1, N
    1 Y(J) = A * X(J)

The integer parts of this program in the DO loop, along with the
preceding COMMON statement, which defines data areas in memory, are
assigned to the APS; the floating-point data operations are assigned to
the APU. A straightforward linear program for the MAP-200 to execute
this function is as follows:

APU:
A:     MOV(IQA, M0)        Input A to Register M0
A+1:   MOV(IQA, M4)        Input X(J) to Register M4
A+2:   MUL(M0, M4)         A*X(J)
A+3:   MOV(P, OQ)          Move Product, Y(J), to OQ
A+4:   JUMPC(A+1, FI)      Test if input finished; if not, go to A+1
A+5:   CLEAR(RA)           Halt APU

where: JUMPC = JUMP if flag "Finished Input" is not set.
APS:
B:     LOAD(BR0, ABASE, TF)        Put address, A Base, into RAF
B+1:   LOAD(BR0, XBASE)            Load Register BR0 with Base Address, X
B+2:   LOAD(BW4, YBASE)            Load Register BW4 with Base Address, Y
B+3:   LOAD(BR1, 756)              Load Register BR1 with N-1
B+4:   SET(RA)                     Start Arithmetic
B+5:   ADD(BR0, XSEP, TF)          BR0 <- BR0 + XSEP; result also placed
                                   in RAF as read address
B+6:   ADD(BW0, YSEP, TF)          BW0 <- BW0 + YSEP; result also placed
                                   in WAF as write address
B+7:   SUBL(BR1, 1), JUMPP(B+5)    Decrement count; if positive, go back
                                   to B+5
B+8:   CLEAR(RI)                   Halt APS

where: JUMPP = JUMP if BR1 is positive.
The instruction MOV(IQA, M0) removes the contents of the bottom
register of the input queue to the multiply register, M0. The execution
of this instruction will automatically wait until the input queue has
data available. Similarly, the instruction A+3: MOV(P, OQ) will wait
until the product is available. The loop test at A+4 is based on "input
finished"; that is, no more input addresses are being generated (APS
halted), the RAF is empty, and the IQ is empty. This state flag, and
others used for sequencing, are discussed below.

In the APS program, the instructions with an argument TF supply
addresses to the RAF and the WAF for memory reads and writes. An
instruction such as ADD(BR0, XSEP, TF) will delay execution if the RAF
is full; that is, it will automatically wait until the "RAF not full"
state appears. This program would initiate execution by the executive
processor "setting" RI, that is, turning the APS on. During execution,
the APS will automatically lead the memory transfer controller,
essentially keeping a few addresses ahead. The memory transfer
controller will do its best to keep the input queue full of data and
the output queue empty. However, the amount of lead will vary
automatically as looping takes place. Thus, in the program above, the
inner loop of the addresser has two useful commands, (B+5) and (B+6),
and one command, (B+7), related to iterating the loop. The execution
time of this instruction will automatically be covered since the supply
of addresses stored in the RAF and the WAF can be used during its
execution time.

It should be emphasized that the entire operation is of a data-driven
type. Thus, each processing unit executes its part of the task until a
roadblock is encountered; then it automatically waits until the
roadblock is removed.
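This data-driven behavior is easy to model with ordinary bounded
queues. The Python sketch below mimics the linear scalar-multiply
example above; the thread structure, queue depths, and names are
illustrative assumptions and not MAP-200 software. An "APS" thread
produces read and write addresses, an "MTC" thread services them
against a simulated memory, and an "APU" thread computes A*X(J); every
put or get blocks exactly in the spirit of "do it if you can; if not,
wait."

    # Sketch: data-driven APS/MTC/APU cooperation modeled with blocking queues.
    # Depths (2 for the address FIFOs, 3 for the data queues) follow the text;
    # everything else is an illustrative assumption.
    import queue
    import threading

    N = 757
    A = 2.5
    X = [float(j) for j in range(N)]
    memory = {("X", j): X[j] for j in range(N)}   # simulated data memory
    Y = {}

    raf = queue.Queue(maxsize=2)   # read address FIFO
    waf = queue.Queue(maxsize=2)   # write address FIFO
    iq = queue.Queue(maxsize=3)    # input data queue
    oq = queue.Queue(maxsize=3)    # output data queue

    def aps():
        # Addresser: produces only addresses; knows nothing about the arithmetic.
        for j in range(N):
            raf.put(("X", j))      # blocks if the RAF is full
            waf.put(("Y", j))      # blocks if the WAF is full
        raf.put(None)              # "finished input" marker
        waf.put(None)

    def mtc():
        # Memory transfer controller: moves data between memory and the queues.
        def do_reads():
            while (addr := raf.get()) is not None:
                iq.put(memory[addr])          # blocks if the IQ is full
            iq.put(None)
        threading.Thread(target=do_reads).start()
        while (addr := waf.get()) is not None:
            Y[addr[1]] = oq.get()             # blocks until the APU produces data

    def apu():
        # Arithmetic unit: consumes values in arrival order; never sees an address.
        while (x := iq.get()) is not None:
            oq.put(A * x)                     # blocks if the OQ is full

    threads = [threading.Thread(target=f) for f in (aps, mtc, apu)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(Y[0], Y[756])    # 0.0 and 2.5 * 756.0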
Figure 1. Architecture of MAP-200 system.

Program optimization

While we do not intend to thoroughly discuss techniques of program
optimization, a few comments are in order. In any architecture,
software pipelining to improve execution times is machine-dependent. In
other machines, it is quite complicated because the time sequencing of
the addressing must be properly intermixed with the arithmetic
operations. The separation of the addressing from the arithmetic means
that this problem has been divided into two parts, each much simpler,
which can be dealt with separately.
For example, a little thought will indicate that the code given above
for the APU is not optimum. The multiply command at A+2 is executed in
sequence with the loop-testing command at A+4. In other words, the
multiplier is not being used for some portion of the loop. The
following code sequence has pipelined this operation, and the multiply
now covers the jump:
APU:
C:     MOV(IQA, M0)                M0 = A
C+1:   MOV(IQA, M4)                M4 = X(1)
C+2:   MUL(M0, M4)                 A*X(1)
C+3:   JUMPS(FI, C+10)             TEST IF DONE
C+4:   MOV(IQA, M5)                M5 = X(2K)
C+5:   MOV(P, OQ), MUL(M0, M5)     OQ = Y(2K-1), A*X(2K)
C+6:   JUMPS(FI, C+10)             TEST IF DONE
C+7:   MOV(IQA, M4)                M4 = X(2K+1)
C+8:   MOV(P, OQ), MUL(M0, M4)     OQ = Y(2K), A*X(2K+1)
C+9:   JUMPC(FI, C+4)              TEST IF DONE
C+10:  MOV(P, OQ)                  OQ = Y(N)
C+11:  CLEAR(RA)                   Halt APU
The linear APU program required the following sequence of memory
transfers, where R => read and W => write:

R,A : R,X(1) : W,Y(1) : R,X(2) : ... : R,X(N) : W,Y(N)

The pipelined version above, however, requires this sequencing:

R,A : R,X(1) : R,X(2) : W,Y(1) : R,X(3) : W,Y(2) : ... : R,X(N) :
W,Y(N-1) : W,Y(N)
The modified APS program is as follows:

APS:
D:     LOAD(BR0, ABASE, TF)
D+1:   LOAD(BR0, XBASE)
D+2:   LOAD(BW4, YBASE)
D+3:   SET(RA)
D+4:   ADD(BR0, XSEP, TF)          Input X(1)
D+5:   LOAD(BR1, 755)
D+6:   ADD(BR0, XSEP, TF)          Input X(K)
D+7:   ADD(BW0, YSEP, TF)          Output Y(K-1)
D+8:   SUBL(BR1, 1), JUMPP(D+6)
D+9:   ADD(BW0, YSEP, TF)          Output Y(757)
D+10:  CLEAR(RI)
In summary, the separation of the addresser from the arithmetic vastly
simplifies program optimization since one can deal with two separable
pieces, each with simpler constraints. Most software pipelining of the
MAP-200 is similar to the example given above, in that it simply
results in a delay (usually one loop's worth) of the write addresses
behind the read addresses.

Table 2 summarizes the performance results of this elementary example.
Note that there is one output per loop.

Table 2. Performance of scalar multiply.

                          LINEAR CODE     OPTIMIZED
                          NS/OUTPUT       NS/OUTPUT
APS                       330 ns          330 ns
APU                       630 ns          430 ns
MULTIPLIES                450 ns          450 ns
MEM TRANSFERS             600 ns          600 ns
DOMINANCE                 (APU) 630 ns    (MEM) 600 ns
MEASURED                  920 ns          670 ns
DOMINANCE/MEASURED        69%             90%
FIFO depth
The four FIFOs which connect the processors together (that is, the RAF,
the WAF, the IQ, and the OQ) provide an interesting problem in design
optimization. The basic tradeoffs which must be dealt with in
determining their size are:

* Making them deep (e.g., 16 or more) provides very loose coupling and,
  hence, maximum ability to permit each of the processors to optimize
  its throughput.
* Deep FIFOs have, however, the disadvantage of either a large number
  of components or a relatively long time for drop-through. Long
  drop-through will adversely impact start-up time and, consequently,
  the processing of very short vectors.
The initial design for the MAP-200 was done by studying
performance as a function of FIFO depth for six selected
algorithms. This resulted in a choice of a depth of two for
the RAF and the WAF, and three for the IQ and the OQ. It
should be pointed out that the address register (see Figure 1)
is also a functional part of the RAF and the WAF. Frequently, we refer to the RAF and the WAF as having a
depth of 2 1/2. This choice of sizes was essentially made by
observing that adding another element did not appreciably
improve throughput for the six selected algorithms.
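A toy simulation makes the "local smoothing" effect of small FIFO
depths visible. The Python sketch below is illustrative only; the stall
pattern and timings are invented and are not MAP-200 measurements. It
models a producer that loses one cycle out of every few to loop
overhead but can otherwise produce in small bursts, and a consumer that
wants one item per cycle, and it reports how often the consumer starves
for several FIFO depths.

    # Sketch: effect of FIFO depth on smoothing a producer's periodic stalls.
    # The producer stalls every 4th cycle (loop overhead) and otherwise
    # produces up to two items; the consumer wants one item per cycle.
    # All numbers are illustrative assumptions.

    def simulate(depth, items=1000, stall_every=4, burst=2):
        fifo = 0
        produced = consumed = starved = 0
        cycle = 0
        while consumed < items:
            cycle += 1
            # Producer: stalls completely every Nth cycle; otherwise produces
            # up to `burst` items, limited by the space left in the FIFO.
            if cycle % stall_every != 0:
                made = min(burst, depth - fifo, items - produced)
                fifo += made
                produced += made
            # Consumer: takes exactly one item per cycle if one is available.
            if fifo > 0:
                fifo -= 1
                consumed += 1
            else:
                starved += 1
        return cycle, starved

    for depth in (1, 2, 3, 8, 16):
        cycles, starved = simulate(depth)
        print(f"depth {depth:2d}: finished in {cycles} cycles, "
              f"consumer starved {starved} cycles")

In this toy model a depth of two already hides the periodic stall, and
deeper FIFOs buy nothing further, which is the qualitative behavior the
six-algorithm study found.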
Since the MAP-200 has been in existence, this study has been extended
to include over 50 different algorithms. With one exception, the
original choices were confirmed. The exception was the OQ, where
several algorithms had the property of rapidly dumping several complex
numbers (that is, four or more data values) into the OQ at one point in
the loop. For these algorithms, an increase in OQ depth to four would
have somewhat improved performance. The FFT "butterfly" is an example
where the adder places values in the OQ in four successive operations.
The basic size arrived at as a result of the optimization, essentially
two or three, points out the precise benefit obtainable from this
decoupling: local smoothing.
One has essentially provided the capability of smoothing
a local burst of activity over a small loop. Thus, looping in
the addresser can be smoothed, or covered, by the simple
process of storing a few addresses in the RAF and the
WAF. Similarly, getting ahead by a few data values
smoothes out the first part of most loops where several data
points are often required in rapid succession. The attempt
to smooth over larger segments is seldom effective, because
of the nature of algorithms as well as the fact that local
smoothing has already achieved a throughput close to the
maximum possible.
Sequencing and synchronization
As previously mentioned, one assigns the integer
operation to the addresser and the data operations to the
APU. For most algorithms, this results in the APS determining the branching sequence. For example, in an FFT
or a matrix factorization, the control of the process is
completely determined by the integer arithmetic related to
the DO loops. The APS must be able to communicate this
sequencing structure to the APU.
An example of such communication between the APS
and the APU is given in the linear program sample discussed earlier. There, the state "input finished" was used
as a criterion for the APU to break out of the loop. This is a
typical means of providing termination.
In many instances, however, the program must continue on to other operations. The logical evolution of this
is to simply let the APS "wait." Thus, one has the flag WI
available, which when set causes the addresser to wait.
With it is the state variable FWI, finished with input,
which implies that the APS is waiting, the RAF is empty,
and the IQ is empty; that is, all available inputs in this group have
been processed.
With these, the sequencing can be established with the structure:

APS:
A:     Loop Instructions
A+1:   SUBL(BR1, 1), JUMPP( )      Test for loop end
A+2:   SET(WI)                     Wait
A+3:   Continue

APU:
B:     Loop Instructions
B+1:   JUMPC(B, FWI)               Test for loop end
B+2:   CLEAR(WI)                   Turn on APS
B+3:   Continue
Thus, at the end of the loop, that is, when the contents of BR1 have
gone negative, the sequence falls through and the APS executes SET(WI),
i.e., it waits. The APU uses the FWI flag as the criterion for loop
completion. Upon falling through, it releases the APS.

When the APU executes CLEAR(WI), the two processors are synchronized;
that is, the instructions A+3 in the APS and B+3 in the APU will be
executed simultaneously. This fact can be utilized when an algorithm
requires certain branching communication. When an arithmetic
calculation in the APU governs a branching decision, it must be
communicated to the APS. This is achieved by using one of the system
flags in conjunction with WI. For example:
APU:
A:     ...
A+1:   ...
A+2:   SET(AF3)                    APU decides the branch
A+3:   CLEAR(WI)                   Release APS

APS:
B:     ...
B+1:   SET(WI)                     Wait for decision
B+2:   JUMPS(AF3, B+N)             Branch if AF3 set
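The WI and AF3 flags behave like a wait flag plus a shared condition
bit, and the handshake is easy to model with threading events. The
Python sketch below is our illustrative stand-in, not MAP-200 software;
the names proceed and af3 and the thread bodies are assumptions. The
"APU" records its branch decision and releases the "APS," which then
branches on the flag.

    # Sketch: APU-to-APS branching communication, modeled with threading events.
    # "proceed" plays the role of the APU clearing WI; "af3" carries the
    # branch decision.  Names and structure are illustrative only.
    import threading

    af3 = threading.Event()       # decision flag, set by the APU
    proceed = threading.Event()   # set when the APU executes CLEAR(WI)

    def apu(take_branch):
        # ... arithmetic that determines the branch would go here ...
        if take_branch:
            af3.set()             # A+2: SET(AF3)   APU decides the branch
        proceed.set()             # A+3: CLEAR(WI)  release the APS

    def aps():
        proceed.wait()            # B+1: SET(WI)    wait for the decision
        if af3.is_set():          # B+2: JUMPS(AF3, B+N)  branch if AF3 set
            print("APS takes the branch")
        else:
            print("APS falls through")

    t_aps = threading.Thread(target=aps)
    t_apu = threading.Thread(target=apu, args=(True,))
    t_aps.start(); t_apu.start()
    t_aps.join(); t_apu.join()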
An interesting commentary on the architecture is to
note that once one has become used to the decoupling of
the APS and the APU, the need to synchronize, as in the
examples above, becomes quite disturbing. For example,
one can see in the process above how the APS must wait
for the APU to catch up; then, after SET(WI), the APU
will in most cases be waiting until the APS gets the first address out and the IQ has data. Clearly, both of these waits
represent idle hardware and the resulting inefficiency.
Memory transfer controller
Because the APS and the APU are directly related to
actual programming, and hence come under close scrutiny, one tends to forget there is really a third processor in
the system: the memory transfer controller, or MTC, as
shown in Figure 1. The MTC surveys the status of the
RAF, the WAF, the IQ, and the OQ, and executes memory reads and writes as expeditiously as possible.
Obviously, the MTC will have to make a decision regarding whether to do a read or a write when the RAF and
the WAF both contain addresses, there is space in the IQ,
and data is in the OQ. Since reads tend to be needed
before writes, the MAP-200 was designed to give read
priority in such circumstances. It would also have been
possible to give the write priority or to make the decision
alternate. However, an examination of the performance
obtained by giving reads priority over a wide base of
algorithms did not turn up any situation where decision
alternation would have improved performance. Nevertheless, examples did turn up where giving output (writes)
preference would have improved throughput.
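The read-priority rule amounts to a one-line policy. The Python sketch
below (illustrative only; the data structures are ours and not the
MTC's actual logic) shows one memory-cycle decision of a controller
that services a read whenever it can and otherwise a write.

    # Sketch: one arbitration step of an MTC-like controller with read priority.
    # FIFOs are modeled as simple deques; this is an illustration, not the MTC logic.
    from collections import deque

    def mtc_step(raf, waf, iq, oq, memory, iq_depth=3):
        """Perform at most one memory transfer, giving reads priority over writes."""
        if raf and len(iq) < iq_depth:          # read address ready, room in the IQ
            addr = raf.popleft()
            iq.append(memory[addr])
            return "read"
        if waf and oq:                          # write address ready, data in the OQ
            addr = waf.popleft()
            memory[addr] = oq.popleft()
            return "write"
        return "idle"                           # nothing can proceed this cycle

    memory = {0: 1.5, 1: 2.5, 10: 0.0}
    raf, waf = deque([0, 1]), deque([10])
    iq, oq = deque(), deque([42.0])
    print(mtc_step(raf, waf, iq, oq, memory))   # "read": the read wins the tie
    print(mtc_step(raf, waf, iq, oq, memory))   # "read"
    print(mtc_step(raf, waf, iq, oq, memory))   # "write": only now does the write go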
These cases are directly related to a type of sequencing
problem occasionally encountered. Consider an algorithm
which uses the main memory as a working buffer; that is,
data is written into it and then read back from it during the
execution of the algorithm. By examining Figure 1, we can
see how the data to be written back into memory may still
be in the OQ when the MTC decides to execute the read. In
other words, a missequencing can occur with the read done
before the write.
Clearly, the small FIFO depths minimize the occurrence of this problem. However, in some algorithms,
prevention must occur by programming. For this purpose, the addresser has an instruction that permits it to
wait on WAF empty, which ensures that the transfer into
memory of a given word is taking place. In these instances, it is clear that giving writes priority improves perCOMPUTER
formance. This can be viewed as a particular algorithm
forcing a certain type of synchronization on the three processors. When it occurs, one can only note with regret that
certain of the processors are waiting and that efficiency
has dropped back to that of a synchronous system for a
short interval.
Speed/cost ratios for array processors may be improved over those of
conventional computers by using functional
parallelism. This architectural approach is based on the
natural division of a wide range of mathematical problems into their component parts. It results in a modularization of the programming effort which simplifies
the programming and, in certain ways, eliminates redundant programming efforts.
To achieve efficient functional parallelism, not only
must the separation be relatively pervasive in a wide range
of mathematical problems but the functional hardware
units also must be balanced in the sense that each hardware division must achieve about the same throughput.
To be able to truly operate the various pieces of equipment simultaneously, thereby achieving optimal efficiency in the system, it is necessary to decouple the operations
in simple ways that do not involve programmer ingenuity.
In particular, the use of queues and FIFOs is one technique which can be used to decouple arithmetic and ad-
dressing functions effectively. The result of proper balance and decoupling is more uniformity in the efficient use
of parallelism over a variety of algorithms than is found
with pure pipeline or iterative architectures. While this subdivision into functions does not eliminate the need for software, it does seem to substantially simplify the process of
achieving it.
Edmund U. Cohler is chairman of the board of directors for CSPI of
Billerica, Massachusetts, a company he cofounded in 1968. In 1970,
under his guidance, CSPI introduced the first minicomputer with 100-ns
instructions and bipolar memory. It was used for array processing in a
number of military and commercial operations. In 1975, the MAP line of
32-bit floating-point array processors was announced, and more recently
the first 64-bit floating-point array processor was added to the line.

While with Sylvania from 1956 to 1968, he supervised the design and
development of Sylvania's military computers and other digital systems,
including the first 5-µs core memory, the first high-speed
transistorized digital computer for military applications, the first
1-µs magnetic core memory, and the first digital processor for signal
processing.

Cohler holds 11 US patents on computer and peripheral circuits and has
authored a number of articles on digital technology. He is a member of
the IEEE, the ACM, the Acoustical Society of America, the Society of
Exploration Geophysicists, Sigma Xi, Eta Kappa Nu, Pi Mu Epsilon, and
Tau Beta Pi. He received the BS, MS, and PhD in electrical engineering
from Northwestern University in 1949, 1951, and 1953.
James E. Storer is a member of the board of directors and chief
scientist for CSPI. He is also a member of the board of directors for
Mutron and Megapulse. Storer is an Atomic Energy Commission fellow, a
John Simon Guggenheim fellow, and a fellow of the IEEE. In 1970-1971,
he served as a member of the IEEE board of directors. He has been a
member of the Defense Science Board ASW Task Force and the Naval
Warfare Panel of the President's Scientific Advisory Committee.

Previously he served as project engineer and technical program manager
on major military electronic systems and as director of Sylvania's
Applied Research Laboratory. At Sylvania, Storer was a member of
technical/management teams on a number of major programs, including
UHF/VHF communications antennas, direction-finding antennas, security
and intrusion systems, intelligence and reconnaissance systems, and
communications and switching systems. Besides being a codeveloper of
many products, he has participated in programs for the construction and
testing of several high-speed digital processors for use in signal
processing areas such as speech recognition, communications, waveform
analysis, and acoustic signature analysis.

Storer received the BA from Cornell University in 1947, and the MA and
PhD from Harvard University in 1948 and 1951.