Based on the natural division of mathematical problems, functional parallelism becomes the architectural key for improving speed/cost ratios for array processors.

Functionally Parallel Architecture for Array Processors

Edmund U. Cohler and James E. Storer, CSPI

Modern array processors can give more floating-point calculations per dollar than conventional computers by the efficient use of parallel equipment. At the same time, they conform quite well to programmability characteristics found in conventional computers. This article describes the general design philosophy and some architectural features of the CSPI MAP-200, a modern array processor that achieves these desirable characteristics by using asynchronous functional parallelism.

Parallelism: Efficiency and types

Since the consistent escalation of component speed characterizing earlier decades did not occur in the last decade, one must seek greater speed through the parallel use of conventional components. Several architectural techniques have been proposed for realizing parallelism with efficiency and ease of programming.

Functional parallelism arose from the recognition that, while a conventional computer was well modularized by function for programming a wide range of problems, there was latent parallelism in the hardware that was not available to the user because the controller did not permit such parallel operation. For example, a STORE to memory could logically be accomplished at the same time as a JUMP to a branch routine. The hardware was there, but the controller did not allow this conjunction in a single instruction.

A microcoded controller presented the programmer with an instruction field for each piece of equipment in the structure. The more fields he could fill on a given line of code, the more efficient the usage of the parallel hardware. Gains in efficiency were made, but generally only for those algorithms for which the machine was specifically adapted. Moreover, the hardware efficiency gains were offset by the burden of programming in microcode, which was like doing logical design in programs. Although one present-day array processor, the AP-120B, has a Fortran compiler that targets to microcode, the compiled code runs more slowly than equivalent microcoded routines.

The efficient use of the multiplier units was our measure of good architectural efficiency. The actual speed of a multiplier is a function of the money and skill applied to designing and realizing the multiplier, but it does not measure the machine's architecture. The ratio of the multiply rate achieved to the multiply rate possible is a measure of architecture that is independent of the amount of money spent on the realization. This observation led to the MAP architecture.

Balance in parallelism

For a functionally parallel array processor, we can "parse" a program into the following hardware functions:

* Floating-point arithmetic calculations,*
* Data-address calculations (integer) and loop counting,
* Instruction fetching,
* Data-memory transfers,
* Program module parametrization (executive processing), and
* I/O communications and addressing.

*We are treating processors for real and double-precision arrays only. While it makes sense to have similar processors for logical, character, and integer arrays, array processor technology has only begun to attack these sorts of problems. Therefore, we will not treat those architectures.
In the MAP, an executive processor, the CSPU, handles interpreting host commands, binding programs to the specific buffers, and sequencing commands to the other processors. A number of processors, called I/O scrolls, may be provided to handle the communication between peripheral devices and MAP memory. A similar processor, the host interface module, handles communications between the host and the MAP memory. The major part of the array processor job is handled by the floating-point arithmetic calculator and the addresser, the targets of this discussion.

Problem analysis showed that we could further divide these functions into parallel hardware to provide an optimum balance among functions. Just as the capacity of each production machine in a factory should match that of the others to prevent bottlenecks, so calculation facilities must have commensurate capabilities to avoid computational bottlenecks. For example, a study of various algorithms revealed a ratio of adds to multiplies that varied from one to four, with important algorithms clustering around 1.5:1. Thus, separate add and multiply units were included, with the add unit twice as fast as the multiplier. The add unit was also given the power to accomplish other less frequently used but very useful instructions: approximate reciprocal, MAX, FLOAT, etc.

Besides the resulting improved arithmetic speed, we found that interregister moves of data were consuming a time comparable to the arithmetic itself and could be handled by parallel hardware that could be separately controlled. Furthermore, memory access could be divided into input and output transfers whose times were comparable to arithmetic time. Thus, the floating-point arithmetic unit was suitably divided into controllable units which, in a conventional machine, would have to act sequentially. If the division were perfect for all problems, a 4:1 speedup over a conventional architecture with the same unit speeds (the memory accesses are single-ported) could be achieved.

Similar "balancing acts" led to the following subdivision of the major functional divisions mentioned above:

* Floating-point arithmetic unit:
    Multiplier
    Adder/miscellaneous instructions
    Internal register transfers
    Input data queue
    Output data queue
* Address calculation and loop counting unit:
    Calculation and counting
    Memory transfer controller
    Input address FIFO
    Output address FIFO

While the division of functions was made on the basis of an ensemble of problems, the fast Fourier transform, or FFT, was given the greatest weight. Our experience had shown this algorithm to be tough, important, and varied in requirements. How good this decision was for the MAP-200 design can be seen in Table 1. In a perfect balance, the dominant interval would cover all others, equaling the measured time. The architectural balance is measured by the ratio of dominant interval to measured time. In this sense, the balance of the MAP-200 is better than 70 percent for a wide variety of algorithms. Despite its prominence in design considerations, the FFT is not at the top of the range.

However, since this ratio is only one measure of how well the balance was accomplished, it is inadequate if used alone. If the dominant interval is the time of a very weak sister in a family of parallel equipment, it would be quite simple to have it consume most of the time, regardless of the achievement of parallelism. To measure the extent to which there has been genuine improvement in the overall time, the sum of the times for the individual operations may be compared to the measured time. If the individual units were perfectly balanced in time, the best achievable ratio for this architecture would be 4:1, an indication that the task has been broken down into four equal parts now being accomplished completely in parallel. This is very difficult to achieve over a variety of algorithms. Nevertheless, it can be seen that the architecture permits over half of the facilities to be employed on a wide range of arbitrary mathematical algorithms.
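The two measures just described reduce to simple arithmetic. The sketch below is not part of the original article; it computes both ratios for a set of hypothetical per-output unit times. The four unit names follow the article's breakdown, while the numbers are invented for illustration.

    def balance_metrics(unit_times_us, measured_us):
        """Return the two balance measures described in the text:
        dominant interval / measured time, and sum of unit times / measured time
        (the latter approaches 4.0 when four units overlap perfectly)."""
        dominant = max(unit_times_us.values())
        return {
            "dominant_unit": max(unit_times_us, key=unit_times_us.get),
            "dominant_over_measured": dominant / measured_us,
            "sum_over_measured": sum(unit_times_us.values()) / measured_us,
        }

    # Hypothetical per-output-component times (microseconds) for one algorithm.
    times = {
        "memory": 2.0,
        "floating_point": 1.6,
        "program_memory_cycles": 1.4,
        "address_calculation": 1.5,
    }
    print(balance_metrics(times, measured_us=2.6))
    # -> dominant unit "memory", dominant/measured about 0.77, sum/measured about 2.5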
Table 1. Balance of hardware in various problems using the MAP-200 with 300-ns memory. For six benchmark problems (5 x 5 coordinate transform; 1K complex FFT; 1K percentiles; 2-D histogram, 1K samples; tri-block diagonal equation solution; 100 x 100 matrix inverse), the table lists the memory time, floating-point arithmetic time, arithmetic program memory cycles, and address-calculation time, together with the measured execution time, the dominant process, the ratio of dominant interval to measured time, and the ratio of the sum of the component times to the measured time. Times are per output component, except for the histogram, which is per input component. Measured execution times for the tabulated problems range from about 1.1 µs to 98 µs per component; the dominant process is the memory in three of the tabulated problems, the multiplier/APU in one, and the address processor in one; the dominant-interval/measured-time ratios range from 71 to 95 percent; and the sum/measured ratios range from 1.9 to 2.9.

Decoupling the units

Achieving the desired efficiency and programming ease depends on the choice of functional breakdown and on the decoupling of these functions so that each can proceed at its own pace without waiting for another unit to complete its task. The MAP architecture uses several techniques for decoupling the balanced units already described: data and access queues, automatic memory access sequencing, loop completion semaphores, and sequential-instruction overlap.

The MAP-200 has an asynchronous integer addresser separate from its floating-point arithmetic unit. In this architecture, data queues and address FIFOs decouple the data stream accesses from the address stream creation. To illustrate the effects of this decoupling, let us start with a rather typical instruction sequence for a minicomputer.

In such a sequence, if we wished to add a pair of vectors, each from a separate buffer, and put them in a third buffer, we might see the following set of instructions:

    LOOP:  FETCH R1, X(R2)              ; Fetch. (R1) <- X(I)
           ADD R1, Y(R2)                ; Fetch and add. (R1) <- (R1) + Y(I)
           STORE R1, Z(R2)              ; Store. Z(I) <- (R1)
           DECR AND JUMP IF+ R2, LOOP   ; Count, test, and loop
The instructions that determine the address stream are interleaved with, or are actually part of, the instructions that do the arithmetic operations or access the operands. One has to consider them simultaneously, making the programming difficult wherever the addressing structure is the least bit complicated. The synchronization becomes even more difficult when the computer is microprogrammed, since specific timing signals and parallel operation of equipment are under programmer control, and the sequential operation of specific instructions becomes his responsibility.

Now, compare the same arithmetic process programmed in the MAP-200 processor, as shown below:

    LOOP:  MOV(IQA, A1)     ; FETCH A1 <- X
           MOV(IQA, A2)     ; FETCH A2 <- Y
           ADD(A1, A2)      ; RESULT <- X + Y
           MOV(S, OQ)       ; STORE RESULT
           JUMPC(LOOP, FI)  ; JUMP TO LOOP UNLESS INPUT IS FINISHED

It should be noted that there are no addresses in this program. When an input is desired, the arithmetic processing unit fetches it from the input queue. The arithmetic program takes no cognizance of where the data is coming from or where it is going. It simply assumes that data is coming in the desired order. The only synchronization with the address processor is the JUMP at the end of the program. That JUMP is a test to see whether or not the addresser has said that the arithmetic processor has finished its task because all data has been properly handled.

Similarly, the program for the address processor takes no cognizance of the actual mathematics involved in the arithmetic computation, but only of the number, types, and sequences of mathematical entities to be treated. Thus, for example, the function Y = [exp(A*X) + log(B + Y)]*C would use the same addresser program as the function Y = A*X - B*U + C. In other words, the arithmetic operation can be programmed first without consideration of addressing; then, the addressing can be treated in the absence of arithmetic considerations.

Basic mode of queue and FIFO operation

A block diagram of the floating-point calculator, or AP, portion of a MAP-200 system is presented in Figure 1. Three processing units are evident: the addresser, or APS; the memory transfer controller, or MTC; and the arithmetic processing unit, or APU. Connecting them are the read address FIFO, or RAF; the write address FIFO, or WAF; the input queue, or IQ; and the output queue, or OQ. These units operate asynchronously from each other. Thus, the addresser's objective is to produce data addresses and try to keep the RAF and the WAF full. The MTC, seeing a read address in the RAF and space in the IQ, will execute a read from memory. Alternatively, a write address in the WAF and data in the OQ will cause the MTC to execute a memory write. The APU takes the data from the IQ, executes the described arithmetic operation, and places the result in the OQ. In the MAP-200 system, the APS and APU both have program memories, whereas the MTC is a fixed set of logic.

Figure 1. Architecture of MAP-200 system.

It should be emphasized that these three units operate at their own speeds, independently of each other. The FIFOs and queues provide elastic coupling. The basic philosophy of operation is "do it if you can; if not, wait." The basic status of the FIFOs (full, not full, not empty, and empty) is used to communicate the need to delay execution of an operation.
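As a rough illustration of this data-driven, "do it if you can; if not, wait" behavior, here is a small Python sketch that is not from the article. Three routines standing in for the APS, MTC, and APU are stepped round-robin and act only when their bounded FIFOs permit; the computation is the vector add Z = X + Y from the listing above. The step-based scheduling, the dictionary standing in for memory, and the data values are illustrative assumptions; the FIFO depths of two and three are the values the article later reports choosing for the MAP-200.

    from collections import deque

    # Bounded FIFOs provide the "elastic coupling" described in the text.
    RAF, WAF = deque(), deque()            # read- and write-address FIFOs
    IQ, OQ = deque(), deque()              # input and output data queues
    DEPTH = {"RAF": 2, "WAF": 2, "IQ": 3, "OQ": 3}

    N = 6
    memory = {("X", j): float(j) for j in range(N)}
    memory.update({("Y", j): 10.0 * j for j in range(N)})

    def aps_step(state):
        """Addresser: per loop pass, queue the reads of X(J), Y(J) and the write of Z(J)."""
        j = state["j"]
        if j < N and len(RAF) + 2 <= DEPTH["RAF"] and len(WAF) < DEPTH["WAF"]:
            RAF.extend([("X", j), ("Y", j)])
            WAF.append(("Z", j))
            state["j"] += 1

    def mtc_step():
        """Memory transfer controller: one transfer per step, reads preferred."""
        if RAF and len(IQ) < DEPTH["IQ"]:
            IQ.append(memory[RAF.popleft()])        # read into the input queue
        elif WAF and OQ:
            memory[WAF.popleft()] = OQ.popleft()    # write from the output queue

    def apu_step():
        """Arithmetic unit: Z = X + Y, with no knowledge of any addresses."""
        if len(IQ) >= 2 and len(OQ) < DEPTH["OQ"]:
            OQ.append(IQ.popleft() + IQ.popleft())

    state = {"j": 0}
    for _ in range(20 * N):                         # step the three units round-robin
        aps_step(state)
        mtc_step()
        apu_step()

    print([memory[("Z", j)] for j in range(N)])     # -> [0.0, 11.0, 22.0, 33.0, 44.0, 55.0]

Each routine simply tests its queues and either acts or does nothing on that step, which is the whole synchronization story; no routine ever looks at another routine's program counter.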
To understand the operation of this system, consider the following program segment:

          COMMON/BUS2/ Y(1000), X(2000)
          N = 757
          DO 1 J=1,N
        1 Y(J) = A*X(J)

The integer parts of this program (the DO loop, along with the preceding COMMON statement, which defines data areas in memory) are assigned to the APS; the floating-point data operations are assigned to the APU. A straightforward linear program for the MAP-200 to execute this function is as follows:

    APU:
    A:     MOV(IQA, M0)              ; Input A to register M0
    A+1:   MOV(IQA, M4)              ; Input X(J) to register M4, advancing the index
    A+2:   MUL(M0, M4)               ; A*X(J)
    A+3:   MOV(P, OQ)                ; Move product, Y(J), to OQ
    A+4:   JUMPC(A+1, FI)            ; Test if input finished; if not, go back to A+1
    A+5:   CLEAR(RA)                 ; Halt APU

    where: JUMPC = JUMP if the flag "Finished Input" is not set.

    APS:
    B:     LOAD(BR0, ABASE, TF)      ; Put address, A base, into RAF
    B+1:   LOAD(BR0, XBASE)          ; Load register BR0 with base address, X
    B+2:   LOAD(BW0, YBASE)          ; Load register BW0 with base address, Y
    B+3:   LOAD(BR1, 756)            ; Load register BR1 with N-1
    B+4:   SET(RA)                   ; Start arithmetic
    B+5:   ADD(BR0, XSEP, TF)        ; BR0 <- BR0 + XSEP, result also placed in RAF as read address
    B+6:   ADD(BW0, YSEP, TF)        ; BW0 <- BW0 + YSEP, result also placed in WAF as write address
    B+7:   SUBL(BR1, 1), JUMPP(B+5)  ; Decrement count; if positive, go back to B+5
    B+8:   CLEAR(RI)                 ; Halt APS

    where: JUMPP = JUMP if BR1 is positive.

In the APS program, the instructions with an argument TF supply addresses to the RAF and the WAF for memory reads and writes. An instruction such as ADD(BR0, XSEP, TF) will delay execution if the RAF is full; that is, it automatically waits until the "RAF not full" state appears.

This program would initiate execution by the executive processor "setting" RI, that is, turning the APS on. During execution, the APS will automatically lead the memory transfer controller, essentially keeping a few addresses ahead. The memory transfer controller will do its best to keep the input queue full of data and the output queue empty. However, the amount of lead will vary automatically as looping takes place. Thus, in the program above, the inner loop of the addresser has two useful commands, (B+5) and (B+6), and one command, (B+7), related to iterating the loop. The execution time of this instruction will automatically be covered, since the supply of addresses stored in the RAF and the WAF can be used during its execution time.

The instruction MOV(IQA, M0) removes the contents of the bottom register of the input queue to the multiply register, M0. The execution of this instruction will automatically wait until the input queue has data available. Similarly, the instruction A+3: MOV(P, OQ) will wait until the product is available. The loop test at A+4 is based on "input finished"; that is, no more input addresses are being generated (APS halted), the RAF is empty, and the IQ is empty. This state flag, and others used for sequencing, are discussed below.

It should be emphasized that the entire operation is of a data-driven type. Thus, each processing unit executes its part of the task until a roadblock is encountered; then it automatically waits until the roadblock is removed.

Program optimization

While we do not intend to thoroughly discuss techniques of program optimization, a few comments are in order. In any architecture, software pipelining to improve execution times is machine-dependent. In other machines, it is quite complicated because the time sequencing of the addressing must be properly intermixed with the arithmetic operations. The separation of the addressing from the arithmetic means that this problem has been divided into two parts, each much simpler, which can be dealt with separately.

For example, a little thought will indicate that the code given above for the APU is not optimum. The multiply command at A+2 is executed in sequence with the loop-testing command at A+4. In other words, the multiplier is not being used for some portion of the loop. The following code sequence has pipelined this operation, and the multiply now covers the jump:

    APU:
    C:     MOV(IQA, M0)              ; M0 = A
    C+1:   MOV(IQA, M4)              ; M4 = X(1)
    C+2:   MUL(M0, M4)               ; A*X(1)
    C+3:   JUMPS(FI, C+10)           ; Test if done
    C+4:   MOV(IQA, M5)              ; M5 = X(2K)
    C+5:   MOV(P, OQ), MUL(M0, M5)   ; OQ = Y(2K-1), A*X(2K)
    C+6:   JUMPS(FI, C+10)           ; Test if done
    C+7:   MOV(IQA, M4)              ; M4 = X(2K+1)
    C+8:   MOV(P, OQ), MUL(M0, M4)   ; OQ = Y(2K), A*X(2K+1)
    C+9:   JUMPC(FI, C+4)            ; Test if done; if not, loop
    C+10:  MOV(P, OQ)                ; OQ = Y(N)
    C+11:  CLEAR(RA)                 ; Halt APU

The original APU program required the following sequence, where R => read and W => write:

    R,A: R,X(1): W,Y(1): R,X(2): W,Y(2): ... : R,X(N): W,Y(N)

The pipelined version above, however, requires this sequencing:

    R,A: R,X(1): R,X(2): W,Y(1): R,X(3): W,Y(2): ... : R,X(N): W,Y(N-1): W,Y(N)
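To make the reordering concrete, the short sketch below, which is not part of the article, prints the two memory-access orders for the scalar-multiply loop. The R/W notation follows the text; the generator functions themselves are purely illustrative.

    def linear_order(n):
        """Access order of the straight-line code: each read is followed by its write."""
        seq = ["R,A"]
        for j in range(1, n + 1):
            seq += [f"R,X({j})", f"W,Y({j})"]
        return seq

    def pipelined_order(n):
        """Access order of the software-pipelined code: writes lag the reads by one loop."""
        seq = ["R,A", "R,X(1)"]
        for j in range(2, n + 1):
            seq += [f"R,X({j})", f"W,Y({j - 1})"]
        return seq + [f"W,Y({n})"]

    print(linear_order(3))     # ['R,A', 'R,X(1)', 'W,Y(1)', 'R,X(2)', 'W,Y(2)', 'R,X(3)', 'W,Y(3)']
    print(pipelined_order(3))  # ['R,A', 'R,X(1)', 'R,X(2)', 'W,Y(1)', 'R,X(3)', 'W,Y(2)', 'W,Y(3)']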
The modified APS program is as follows:

    APS:
    D:     LOAD(BR0, ABASE, TF)
    D+1:   LOAD(BR0, XBASE)
    D+2:   LOAD(BW0, YBASE)
    D+3:   SET(RA)
    D+4:   ADD(BR0, XSEP, TF)        ; Input X(1)
    D+5:   LOAD(BR1, 755)
    D+6:   ADD(BR0, XSEP, TF)        ; Input X(K)
    D+7:   ADD(BW0, YSEP, TF)        ; Output Y(K-1)
    D+8:   SUBL(BR1, 1), JUMPP(D+6)
    D+9:   ADD(BW0, YSEP, TF)        ; Output Y(757)
    D+10:  CLEAR(RI)

In summary, the separation of the addresser from the arithmetic vastly simplifies program optimization, since one can deal with two separable pieces, each with simpler constraints. Most software pipelining on the MAP-200 is similar to the example given above, in that it simply results in a delay (usually one loop's worth) of write addresses relative to read addresses. Table 2 summarizes the performance results of this elementary example. Note that there is one output per loop.

Table 2. Performance of scalar multiply.

    CODE                      APS      APU      MULTIPLIES   MEM TRANSFERS   DOMINANCE       MEASURED   DOMINANCE/MEASURED
    LINEAR CODE, NS/OUTPUT    330 ns   630 ns   450 ns       600 ns          630 ns (APU)    920 ns     69%
    OPTIMIZED, NS/OUTPUT      330 ns   430 ns   450 ns       600 ns          600 ns (MEM)    670 ns     90%

FIFO depth

The four FIFOs which connect the processors together (that is, the RAF, WAF, IQ, and OQ) pose an interesting problem in design optimization. The basic tradeoffs which must be dealt with in determining their size are:

* Making them deep (e.g., 16 or more) provides very loose coupling and, hence, maximum ability to permit each of the processors to optimize its throughput.
* Deep FIFOs have, however, the disadvantage of either a large number of components or a relatively long drop-through time. Long drop-through will adversely impact start-up time and, consequently, the processing of very short vectors.

The initial design of the MAP-200 was done by studying performance as a function of FIFO depth for six selected algorithms. This resulted in a choice of a depth of two for the RAF and the WAF, and three for the IQ and the OQ. It should be pointed out that the address register (see Figure 1) is also a functional part of the RAF and the WAF; frequently, we refer to the RAF and the WAF as having a depth of 2 1/2. This choice of sizes was essentially made by observing that adding another element did not appreciably improve throughput for the six selected algorithms.

Since the MAP-200 has been in existence, this study has been extended to include over 50 different algorithms. With one exception, the original choices were confirmed. The exception was the OQ, where several algorithms had the property of rapidly dumping several complex numbers into the OQ, that is, four or more data values at one point in the loop. For these algorithms, an increase in OQ depth to four would have somewhat improved performance. The FFT "butterfly" is an example where the adder places values in the OQ in four successive operations.

The basic size arrived at as a result of the optimization, essentially two or three, points out the precise benefit obtainable from this decoupling, namely, local smoothing. One has essentially provided the capability of smoothing a local burst of activity over a small loop. Thus, looping in the addresser can be smoothed, or covered, by the simple process of storing a few addresses in the RAF and the WAF. Similarly, getting ahead by a few data values smoothes out the first part of most loops, where several data points are often required in rapid succession. The attempt to smooth over larger segments is seldom effective, because of the nature of algorithms as well as the fact that local smoothing has already achieved a throughput close to the maximum possible.
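A depth study of this kind can be sketched with a toy model, which is not from the article: a bursty producer and a steady consumer coupled by one bounded FIFO, both averaging one item per cycle. Only the qualitative result, that throughput stops improving after a small depth, mirrors the article's observation; the burst pattern and cycle counts are invented.

    from collections import deque

    def throughput(depth, cycles=10_000):
        """Toy model of two units coupled by one bounded FIFO.  The producer is bursty
        (two items, but only on every other cycle); the consumer is steady (one item
        per cycle).  Both average one item per cycle, so any shortfall comes from the
        FIFO being too shallow to smooth the local burst."""
        fifo = deque()
        completed = 0
        for cycle in range(cycles):
            if cycle % 2 == 0:                       # producer's burst cycle
                for _ in range(2):
                    if len(fifo) < depth:
                        fifo.append(cycle)
            if fifo:                                 # steady consumer
                fifo.popleft()
                completed += 1
        return completed / cycles

    for depth in (1, 2, 3, 4, 8, 16):
        print(f"depth {depth:2d}: {throughput(depth):.2f} results/cycle")
    # depth 1 loses half the throughput; depths of 2 and beyond are indistinguishable.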
Sequencing and synchronization

As previously mentioned, one assigns the integer operations to the addresser and the data operations to the APU. For most algorithms, this results in the APS determining the branching sequence. For example, in an FFT or a matrix factorization, the control of the process is completely determined by the integer arithmetic related to the DO loops. The APS must be able to communicate this sequencing structure to the APU.

An example of such communication between the APS and the APU is given in the linear program sample discussed earlier. There, the state "input finished" was used as a criterion for the APU to break out of the loop. This is a typical means of providing termination. In many instances, however, the program must continue on to other operations. The logical evolution of this is to simply let the APS "wait." Thus, one has the flag WI available, which when set causes the addresser to wait. With it is the state variable FWI, finished with input, which implies that the APS is waiting, the RAF is empty, and the IQ is empty; that is, all available inputs in this group have been processed. With these, the sequencing can be established with the structure:

    APS:
    A:     Loop instructions
    A+1:   SUBL(BR1, 1), JUMPP( )    ; Test for loop end
    A+2:   SET(WI)                   ; Wait
    A+3:   Continue

    APU:
    B:     Loop instructions
    B+1:   JUMPC(B, FWI)             ; Test for loop end
    B+2:   CLEAR(WI)                 ; Turn on APS
    B+3:   Continue

Thus, at the end of the loop, that is, when the contents of BR1 have gone negative, the sequence falls through and the APS executes SET(WI), i.e., it waits. The APU uses the FWI flag as the criterion for loop completion. Upon falling through, it releases the APS. When the APU executes CLEAR(WI), the two processors are synchronized; that is, the instructions A+3 in the APS and B+3 in the APU will be executed simultaneously. This fact can be utilized when an algorithm requires certain branching communication. When an arithmetic calculation in the APU governs a branching decision, it must be communicated to the APS. This is achieved by using one of the system flags in conjunction with WI. For example:

    APU:
    A:
    A+1:   ---                       ; APU decides the branch
    A+2:   SET(AF3)
    A+3:   CLEAR(WI)                 ; Release APS

    APS:
    B:
    B+1:   SET(WI)                   ; Wait for decision
    B+2:   JUMPS(AF3, B+N)           ; Branch if AF3 set

An interesting commentary on the architecture is to note that once one has become used to the decoupling of the APS and the APU, the need to synchronize, as in the examples above, becomes quite disturbing. For example, one can see in the process above how the APS must wait for the APU to catch up; then, after SET(WI), the APU will in most cases be waiting until the APS gets the first address out and the IQ has data. Clearly, both of these waits represent idle hardware, and resulting inefficiency.
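The WI/FWI mechanism is essentially a one-bit semaphore handshake. As a loose model, not taken from the article, the sketch below uses two Python threads with threading.Event objects standing in for the SET and CLEAR of WI and for the AF3 decision flag; the thread structure, the second event used to represent "WI cleared," and the print statements are illustrative assumptions rather than MAP-200 behavior.

    import threading

    wi_set = threading.Event()       # APS has executed SET(WI) and is now waiting
    wi_cleared = threading.Event()   # APU has executed CLEAR(WI), releasing the APS
    af3 = threading.Event()          # AF3 flag: the APU's branch decision

    def aps():
        # ... addresser loop instructions would run here ...
        wi_set.set()                 # SET(WI): announce that the addresser is waiting
        wi_cleared.wait()            # blocked until the APU executes CLEAR(WI)
        if af3.is_set():             # JUMPS(AF3, ...): branch on the APU's decision
            print("APS: taking the AF3 branch")

    def apu():
        # ... arithmetic loop and the deciding calculation would run here ...
        wi_set.wait()                # be sure the APS has reached its wait point
        af3.set()                    # SET(AF3): publish the branch decision
        wi_cleared.set()             # CLEAR(WI): release the APS; both sides proceed
        print("APU: continuing past the synchronization point")

    threads = [threading.Thread(target=f) for f in (aps, apu)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()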
Memory transfer controller

Because the APS and the APU are directly related to actual programming, and hence come under close scrutiny, one tends to forget there is really a third processor in the system: the memory transfer controller, or MTC, as shown in Figure 1. The MTC surveys the status of the RAF, the WAF, the IQ, and the OQ, and executes memory reads and writes as expeditiously as possible.

Obviously, the MTC will have to make a decision regarding whether to do a read or a write when the RAF and the WAF both contain addresses, there is space in the IQ, and data is in the OQ. Since reads tend to be needed before writes, the MAP-200 was designed to give reads priority in such circumstances. It would also have been possible to give writes priority or to make the decision alternate. However, an examination of the performance obtained by giving reads priority over a wide base of algorithms did not turn up any situation where alternating the decision would have improved performance. Nevertheless, examples did turn up where giving output (writes) preference would have improved throughput. These cases are directly related to a type of sequencing problem occasionally encountered.

Consider an algorithm which uses the main memory as a working buffer; that is, data is written into it and then read back from it during the execution of the algorithm. By examining Figure 1, we can see how the data to be written back into memory may still be in the OQ when the MTC decides to execute the read. In other words, a missequencing can occur, with the read done before the write. Clearly, the small FIFO depths minimize the occurrence of this problem. However, in some algorithms, it must be prevented by programming. For this purpose, the addresser has an instruction that permits it to wait on "WAF empty," which ensures that the transfer of a given word into memory has taken place. In these instances, it is clear that giving writes priority would improve performance. This can be viewed as a particular algorithm forcing a certain type of synchronization on the three processors. When it occurs, one can only note with regret that certain of the processors are waiting and that efficiency has dropped back to that of a synchronous system for a short interval.
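The MTC's choice can be phrased as a small priority rule. The sketch below is not from the article; it encodes the read-preferring policy described in the text and, purely for illustration, a check for the working-buffer hazard. On the real machine that hazard is handled by the programmer with the wait-on-WAF-empty instruction rather than by the controller, so the data structures and the hazard check here are assumptions.

    from collections import deque

    def mtc_action(raf, waf, iq_free, oq):
        """Decide the memory transfer controller's next operation.  Reads are
        preferred whenever both a read and a write are possible, as in the MAP-200."""
        can_read = bool(raf) and iq_free > 0
        can_write = bool(waf) and bool(oq)
        if can_read and can_write:
            # Working-buffer hazard from the text: the next read targets an address
            # whose write is still queued.  Here the rule simply flips to the write;
            # the MAP-200 itself leaves this case to the programmer, who makes the
            # addresser wait on "WAF empty" instead.
            if raf[0] in waf:
                return "write"
            return "read"                      # default policy: reads before writes
        if can_read:
            return "read"
        if can_write:
            return "write"
        return "wait"

    # A read and a write are both possible and do not conflict: the read wins.
    print(mtc_action(deque([0x100]), deque([0x200]), iq_free=2, oq=deque([3.14])))  # read
    # The pending read targets an address whose write is still queued: write first.
    print(mtc_action(deque([0x200]), deque([0x200]), iq_free=2, oq=deque([3.14])))  # write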
Speed/cost ratios for array processors may be improved over those of conventional computers by using functional parallelism. This architectural approach is based on the natural division of a wide range of mathematical problems into their component parts. It results in a modularization of the programming effort which simplifies the programming and, in certain ways, eliminates redundant programming efforts. To achieve efficient functional parallelism, not only must the separation be relatively pervasive in a wide range of mathematical problems, but the functional hardware units must also be balanced, in the sense that each hardware division must achieve about the same throughput. To be able to truly operate the various pieces of equipment simultaneously, thereby achieving optimal efficiency in the system, it is necessary to decouple the operations in simple ways that do not involve programmer ingenuity. In particular, the use of queues and FIFOs is one technique which can be used to decouple arithmetic and addressing functions effectively.

The result of proper balance and decoupling is more uniformity in the efficient use of parallelism over a variety of algorithms than is found with pure pipeline or iterative architectures. While this subdivision into functions does not eliminate the need for software, it does seem to substantially simplify the process of achieving it.

Edmund U. Cohler is chairman of the board of directors of CSPI of Billerica, Massachusetts, a company he cofounded in 1968. In 1970, under his guidance, CSPI introduced the first minicomputer with 100-ns instructions and bipolar memory. It was used for array processing in a number of military and commercial operations. In 1975, the MAP line of 32-bit floating-point array processors was announced, and more recently the first 64-bit floating-point array processor was added to the line. While with Sylvania from 1956 to 1968, he supervised the design and development of Sylvania's military computers and other digital systems, including the first 5-µs core memory, the first high-speed transistorized digital computer for military applications, the first 1-µs magnetic core memory, and the first digital processor for signal processing. Cohler holds 11 US patents on computer and peripheral circuits and has authored a number of articles on digital technology. He is a member of the IEEE, the ACM, the Acoustical Society of America, the Society of Exploration Geophysicists, Sigma Xi, Eta Kappa Nu, Pi Mu Epsilon, and Tau Beta Pi. He received the BS, MS, and PhD in electrical engineering from Northwestern University in 1949, 1951, and 1953.

James E. Storer is a member of the board of directors and chief scientist for CSPI. He is also a member of the board of directors for Mutron and Megapulse. Storer is an Atomic Energy Commission fellow, a John Simon Guggenheim fellow, and a fellow of the IEEE. In 1970-1971, he served as a member of the IEEE board of directors. He has been a member of the Defense Science Board ASW Task Force and the Naval Warfare Panel of the President's Scientific Advisory Committee. Previously he served as project engineer and technical program manager on major military electronic systems and as director of Sylvania's Applied Research Laboratory.
At Sylvania, Storer was a member of technical/management teams on a number of major programs, including UHF/VHF communications antennas, direction-finding antennas, security and intrusion systems, intelligence and reconnaissance systems, and communications and switching systems. Besides being a codeveloper of many products, he has participated in programs for the construction and testing of several high-speed digital processors for use in signal processing areas such as speech recognition, communications, waveform analysis, and acoustic signature analysis. Storer received the BA from Cornell University in 1947, and the MA and PhD from Harvard University in 1948 and 1951.