IEEE TRANSACTIONS ON COMPUTERS, VOL. C-30, NO. 9, SEPTEMBER 1981
Design of High-Speed Digital Divider Units
J. H. P. ZURAWSKI, MEMBER, IEEE, AND J. B. GOSLING
Abstract-The division operation has proved to be a much more
difficult function to generate efficiently than the other elementary
arithmetic operations. This is due primarily to the need to test the result
of one iteration before proceeding to the next. The technique described
in this contribution reduces the iteration time by the use of a redundant
quotient representation, which avoids the need to complete the arithmetic operation. A borrow-save subtractor (analogous to a carry-save
adder) can then be used for the arithmetic. Further improvement is
obtained by use of a lookahead decoding technique. Cost reductions
are obtained either by use of uncommitted logic arrays, or by a novel
borrow-save system using commercially available adder circuitry. A
comparison of a number of divider units with a wide range of cost and
speed is included.
Index Terms-Borrow-save subtraction, carry-save addition, digital arithmetic, digital division, group subtractor, iterative division, uncommitted logic arrays (gate arrays).

I. INTRODUCTION

DIVISION is the least common of the four basic arithmetic operations, and is generally used less than one tenth as frequently as either addition/subtraction or multiplication. Part of the reason for this is the relative slowness of the operation, which causes users to design their programs to use multiplication wherever possible. In a "good" machine division may be a factor of two slower than multiplication, and may be as much as a factor of ten slower in others.

Division may be defined as the process of computing Q, and possibly R, in the equation

N = QD + R

where R has the same sign as N, |R| < |D|, and D is nonzero [1]. In the pencil-and-paper approach with positive operands, the result is obtained by first aligning the most significant digits of N and D and then performing successive subtract, test, and shift operations until sufficient digits of Q have been formed. In the decimal case some multiple of D is subtracted, but in the binary case only one nonzero multiple is possible, namely one times D. If the result of the subtraction is positive, a Q-bit of one is recorded, and D shifted down and subtracted again from the new partial remainder (the value left after previous subtractions). If the result of the subtraction is negative, the Q-bit is zero, and the old partial remainder is used again in the next cycle.

With multiplication, which consists of repeated additions, a number of procedures can be used to speed up the process. First, the carry need not be propagated, since successive cycles are dependent only on data available from the start of the operation. With division it is necessary to test the result of each cycle in order to determine what to do in the next cycle, and hence the borrow must be fully propagated. It is this need to complete one calculation and test its result which limits the speed of division, and which prevents it being made as fast as multiplication. Most of the efforts to produce fast dividers are aimed at circumventing this problem.

The second technique used in multiplication is to use several multiplier bits in each cycle. To generate several quotient bits in each cycle it is necessary to be able to select the multiple of D closest to the partial remainder. This again is a nontrivial procedure.

Manuscript received February 4, 1980; revised April 16, 1981.
The authors are with the Department of Computer Science, University of Manchester, Manchester, England.

II. REDUNDANCY

Both of these problems can be partially solved by the use of
redundancy in a nonrestoring type of division. In a nonrestoring
procedure, if the subtraction of D from the partial remainder
causes the result to become negative, then in the next cycle D
is added to the partial remainder (PR). Thus, if PR is positive,
D is subtracted and vice versa. This can be illustrated by the
Robertson diagram of Fig. 1 [2]. The unit of length on each
axis is |D|, the divisor. To start a division the most significant
bits of the operands are aligned, and hence |N| < 2|D|. Either
D or zero is to be subtracted from N, and the result will be
multiplied by 2 (shift left). Lines of slope 2 are therefore drawn
from the origin to represent subtraction of zero, and from D
on the X axis to represent subtraction of D.
Suppose N = 11 D/8 [Fig. 1 (a)]. Since this is greater than
D, D is subtracted. This is represented by the vertical to the
"1" line and a horizontal to the 2PR axis. 2PR is shown as
3D/4, since it is in fact the left-shifted version due to the slope
of the obliques. This 3D/4 is transferred to the horizontal axis
by means of a quadrant of a circle. A vertical drawn from this
point meets the "zero" line, so the next quotient bit is zero. The
new PR is 3D/2, as is seen by following the arrows. The next
subtraction would give a PR of D. This is the pencil-and-paper
procedure.
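The restoring procedure traced on the Robertson diagram can be modeled in a few lines (an illustration added here, not part of the original design; the partial remainder is tracked as a multiple of D, the unit length on each axis):

```python
from fractions import Fraction

# Illustrative model of the restoring procedure of Fig. 1(a); the
# partial remainder is tracked as a multiple of the divisor D.
def restoring_divide(pr, cycles):
    """pr = N/D with 1 <= pr < 2; returns one quotient bit per cycle."""
    bits = []
    for _ in range(cycles):
        if pr >= 1:               # subtraction leaves a positive result
            bits.append(1)
            pr = 2 * (pr - 1)     # subtract D, then shift left
        else:                     # result would go negative: keep old PR
            bits.append(0)
            pr = 2 * pr
    return bits

# N = 11D/8, as in the worked example: the quotient bits are 1, 0, 1, 1.
assert restoring_divide(Fraction(11, 8), 4) == [1, 0, 1, 1]
```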
The nonrestoring procedure is shown in Fig. 1 (b). A subtraction must now be performed if PR is greater than zero, and
an addition otherwise. A subtraction of D is performed in two
successive cycles, leaving a PR of -D/2. In the following cycle
an addition of D is performed to obtain a PR of +D, as before.
This is illustrated using two oblique lines, one for subtraction,
through D, and one for addition, through -D.
Redundancy is introduced into the method by including the
oblique line for zero as well, thus requiring two bits to represent
the quotient values of +1, -1, and 0. The previous example is repeated in Fig. 1(c), but in this case the operations performed are -D, 0, and -D, rather than the -D, -D, and +D of Fig. 1(b). The same result is achieved.

0018-9340/81/0900-0691$00.75 © 1981 IEEE

Fig. 1. Robertson diagrams. (a) Restoring division. (b) Nonrestoring division. (c) Nonrestoring division with redundancy.

Mathematically, the
three operations have weight 4, 2, and 1 due to the shifting,
and
(-4 + 0 - 1)D = (-4 -2 + 1)D = -5D.
The ability to choose one of two possible operations makes it
possible to be slightly "wrong" in the choice of the multiple of
the divisor and/or to use a result that is not known to full accuracy. It will be appreciated that the choice of operations is
determined by the overlap of the areas subtended by the obliques on the horizontal axis.
The quotient can be recovered from the actions given by
recording 1 for a subtraction, -1 for an addition, and zero for
a shift. Allowing for one further subtraction in the above example the result is 1011 or 1 1 -1 1. In the latter case we have

  1101
- 0010
------
  1011

which is as expected.
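The recovery rule can be sketched as a short routine (an illustration added here: the +1 digits and the -1 digits form two binary strings whose difference is the quotient):

```python
# Illustrative sketch of recovering a conventional quotient from
# redundant digits in {-1, 0, 1}.
def assimilate_quotient(digits):
    q_pos = int("".join("1" if d == 1 else "0" for d in digits), 2)
    q_neg = int("".join("1" if d == -1 else "0" for d in digits), 2)
    return q_pos - q_neg

# The digits 1 1 -1 1 give 1101 - 0010 = 1011,
# matching the plain recording 1011.
assert assimilate_quotient([1, 1, -1, 1]) == 0b1011
assert assimilate_quotient([1, 0, 1, 1]) == 0b1011
```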
III. USE OF BORROW-SAVE SUBTRACTION
The process of borrow-save subtraction is analogous to
carry-save addition. A subtraction is performed, and the result
is left in redundant form of a difference d and a borrow b. The
true result is obtained by performing d - b in conventional
"propagate" mode. Thus
          101010
        - 011011
d         110001
b         010001
Result    001111

(Note: b must be shifted left one place before the subtraction.)
With the result in a redundant form, it is not possible to
determine the sign of the result (and thus the "correct" operation in the next cycle) in all cases without performing the full
subtraction. However, as shown in the previous section, this
will not matter provided that some subsequent operation introduces a suitable correction factor.
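The step can be modeled bitwise (an illustration added here, using the usual rule: difference = a XOR b, borrow generated = NOT a AND b):

```python
# Illustrative model of one borrow-save subtraction step.
def borrow_save_subtract(a, b):
    d = a ^ b        # difference digits, no borrow propagation
    bo = ~a & b      # borrow generated in each bit position
    return d, bo

def assimilate(d, bo):
    # Full "propagate" step: the borrows, shifted left one place,
    # are subtracted from the difference.
    return d - (bo << 1)

d, bo = borrow_save_subtract(0b101010, 0b011011)
assert (d, bo) == (0b110001, 0b010001)   # as in the worked example
assert assimilate(d, bo) == 0b001111     # true result 101010 - 011011
```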
With this technique determination of the action requires
inspection of the most significant few bits of the result. To do
this a full propagate is performed over these bits. The result
can thus be declared as follows.
1) Definitely Positive, e.g., 0.10: Any borrow waiting will
not cause a change of sign.
2) Definitely Negative, e.g., 1.01: A borrow waiting will
make it more negative.
3) Close to Zero, e.g., 0.00: A borrow waiting will cause this
result to become negative. No borrow leaves it positive.
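The three cases reduce to a small decision rule (illustrative only; the value is taken in units of the least significant assimilated bit, and the borrows still waiting below are assumed to subtract at most one such unit):

```python
# Illustrative decision rule for the sign of a borrow-save result,
# given the value of its assimilated top bits.
def classify(assimilated_top):
    if assimilated_top >= 1:
        return "definitely positive"   # a waiting borrow cannot flip the sign
    if assimilated_top < 0:
        return "definitely negative"   # a waiting borrow only lowers it further
    return "close to zero"             # sign depends on borrows not yet seen

assert classify(2) == "definitely positive"    # e.g., 0.10
assert classify(-3) == "definitely negative"   # e.g., 1.01
assert classify(0) == "close to zero"          # e.g., 0.00
```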
Consider now the pencil-and-paper division corresponding to Fig. 1(a). We are required to perform one of two actions.
Either shift the PR left one place, or subtract the divisor and
shift the result one place. Suppose that both operations are
performed in parallel, as shown in Fig. 2 [3]. The shift result
is placed in register A, and the subtract and shift result in
register B. When the sign of B is determined, it is used to select
A or B for the next cycle. Fig. 2(a) illustrates the procedure when the subtract result is negative, and Fig. 2(b) when it is positive. Note that a negative result is never used.
The above procedure still requires the system to wait for
determination of the result of the subtraction, and the fan-out
of the selection signals. Suppose, however, that we permit both
operations to be performed on each of the registers A and B
in cycle P + 1 without waiting for this decoding to be performed. The result of the decode is now used to select which
of the two results is used at the end of the cycle P + 1. This is
illustrated in Fig. 3. The arithmetic of cycle P + 1 is now overlapped in time with the operation decode from the end of
cycle P.
Decode time is still the limiting factor however. A further
improvement can be achieved by decoding the PR at the start
of cycle P to select the data source for cycle P + 1. This means
that the multiplexer selection can also be overlapped in time
with the arithmetic. A few extra bits must be examined to allow
this to be achieved.
Fig. 2. Possible actions in cycle P + 1. (a) Shift result correct in cycle P. (b) Subtract result correct in cycle P.
Fig. 3. Result of cycle P used late in cycle P + 1.
The above discussion presumes a complete knowledge of the
sign of PR at the start of cycle P by the time that the selection
of results at the end of cycle P + 1 is performed. If the subtraction is a borrow-save subtraction, then the case of near zero
partial remainder requires further consideration. Since only
positive partial remainders are acceptable, both result routes
must be preserved until the true result has been determined.
Consider first the case where the true PR is positive and
"close to zero" [Fig. 4(a)]. (The lookahead will be ignored for
the moment to simplify the argument.) This is defined as a case
where a shift is required by the next cycle. The source is B (a
subtraction has just been performed) and hence the route required is B to A. Now take the case where the true PR is negative and close to zero [Fig. 4(b)]. This means that the previous
operation should have been a shift, and hence the source for
the next operation is the A register. However, as the result was
close to zero, the original PR was only just less than D. Hence,
the shifted PR in A will be definitely greater than D, and a
subtraction is required. The route is thus A to B. Since in fact
it is not known yet which of the two is required, both are implemented. However, at the end of cycle P + 1, the result of
the subtraction that was uncertain after cycle P is now in
register A, and hence this result is the one which must be examined for subsequent cycles.
The result in register A at the end of cycle P + 1 may now
be definitely positive or definitely negative. In this case the
Fig. 4. Crossover route. (a) Required when true PR is positive. (b) Required when true PR is negative.
system proceeds as before, again using the result in register B
to determine further action. However, if the result in register
A is still uncertain, then it is necessary to continue shifting this
result until a definite decision can be made. Thus, A is shifted
to itself (with borrow propagation in the top few bits).
Again there are two cases, as illustrated in Fig. 5. Fig. 5(a)
shows the case of a positive (true) PR. Not until the end of cycle P + 3 does the sign of A become certain (any borrows
waiting will leave the result positive). Thus, the A register is
examined in each cycle until the end of cycle P + 4.
The problem now arises as to what happens if the PR of
cycle P had been negative. This is illustrated in Fig. 5(b). At
the end of cycle P + 2 the PR in register A is definitely negative, and since there is no means of recovery the required result
must be in register B. This has been found by repeated subtractions following the crossover of cycle P + 1, as shown in
the figure. This follows from an argument similar to that used
for the redundant nonrestoring description earlier. The procedure being adopted can be expressed as
new PR = 2(old PR - qD)
where q = 0 for the shift route and 1 for the subtract route.
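The same recurrence, run with the full redundant digit set of Section II, can be modeled as follows (illustrative only; the thresholds on the PR estimate are assumptions for the sketch, not the exact decode used by the hardware):

```python
from fractions import Fraction

# Illustrative model of the recurrence new PR = 2(old PR - qD), with
# PR measured in units of D and redundant digits q in {-1, 0, 1}.
def redundant_divide(pr, cycles):
    """pr = N/D; returns the quotient digits and the final scaled PR."""
    digits = []
    for _ in range(cycles):
        if pr >= Fraction(1, 2):       # a real unit inspects only top bits
            q = 1
        elif pr <= Fraction(-1, 2):
            q = -1
        else:
            q = 0
        digits.append(q)
        pr = 2 * (pr - q)              # new PR = 2(old PR - qD)
    return digits, pr

digits, rem = redundant_divide(Fraction(11, 8), 4)
assert digits == [1, 1, -1, 1]   # assimilates to 1011, as in Section II
```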
Considering Fig. 5(b), the correct result in cycle P + 2 can be
obtained by performing a restoring addition. We thus have an
addition in cycle P + 2, a shift in cycle P + 1, and a subtraction
in cycle P. Thus

                  cycle P-1    P     P+1    P+2
PR(cycle P+2)  =     PRA      -D     -0     +D/4
               =     PRA - 3D/4.

However, we do not wish to perform an addition at the end of cycle P + 2. The same result is achieved by the alternative route of Fig. 5(b), viz.

                  cycle P-1    P     P+1    P+2
PR(cycle P+2)  =     PRA      -0    -D/2   -D/4
               =     PRA - 3D/4.
Although this description has ignored the lookahead, by
examining more bits in the decoder, the lookahead can still be
incorporated.
Fig. 5. Crossover and parallel routes. (a) True PR is positive. (b) True PR is negative.
IV. QUOTIENT FORMATION
Since the "wrong" result is never used, the quotient bit in
cycle P will be zero whenever register A is used as the result
register (i.e., a shift is performed) and one when register B is
used for the result. It should be noted, however, that this information is not available until almost the end of cycle P + 1.
Difficulty also arises in cases like those illustrated in Fig. 5,
when the correct determination is unknown for several cycles.
In Fig. 5(a) we require the quotient bits 1 0 0 0, and in Fig. 5(b) 0 1 1. The same effect can be obtained in the latter case by recording 1 0 -1.
This requires the quotient to be in redundant form. Whenever
the crossover route is selected, the quotient is recorded as a one.
Subsequent quotient bits are recorded as zero until a final
decision is made on register A. If A is correct, the final quotient
bit of this string is zero. If B is correct, the final quotient bit
of this string is -1.
There is one case in which this is unsatisfactory, namely that
in which the true sign of A has not been determined by the last
cycle of the division. In this case a full propagate subtraction
must be performed on register A in order to determine the
correct result. It is then possible to round the result in accordance with the floating point standard [ 1]. However, if speed
is more important it may be sufficient to perform one or two
extra cycles, and accept the remaining error.
V. IMPLEMENTATION USING COMMERCIAL MSI
Fig. 6 shows the hardware necessary to implement the algorithm. Registers A and B are pairs of registers AD, AB and
BD, BB, the difference and borrow registers, respectively. Each
register pair may be shifted to a multiplexer in front of the A
register pair, or subtracted in a borrow-save subtractor unit
(BSSU), the results of which are multiplexed in front of the
B register pair. An assimilation of the d and b results is performed on the most significant bits, so that the difference
registers are slightly longer than the borrow. A decoder examines the assimilated bits and the most significant divisor bits
to produce the multiplexer control signals. These are latched
in registers SA and SB, since they are required for the "next"
cycle rather than the current one. Register pair A is used for
the initial dividend, and so an extra multiplexer entry is provided. All registers are master-slave types to avoid signals
"chasing their tails" round the loop. Normal operation causes routes 1 and 3 or 2 and 4 to be selected. The "crossover" selects routes 2 and 3, and the "parallel" case selects routes 1 and 4.
To perform the correct selection of the next operation the
initial operands are bit-normalized [i.e., in the range (1/2, 1)]
and the assimilation takes place over 5 bits, of which two are
integral and three fractional. This is because a shift can give
a result greater than one (see Fig. 5), and an extra bit is then
needed for the sign. The proof of this is not difficult, and can
be found in [3]. The time taken by the main data cycle is
closely matched to the delays in the decoding network, so that
the design might be said to be close to the optimum for this
type. Using F100K ECL circuitry, a division time of around
385 ns for 64 bits is estimated ("typical" figures). The cost
would be in the region of 360 ICs. The availability of a borrow-save subtract module (2 bits per package), or the use of
a group subtractor unit (see Section VII), would halve this
figure and improve the division time.
Fig. 6. BSS divider implementation.
A two-bit at a time version of the algorithm was considered
and rejected. In fact, the algorithm can be thought of as a two
bit algorithm anyway, since the decoding determines the action
one cycle ahead. Consequently, to extend the algorithm to
generate two quotient bits each cycle would require a four-bit
at a time decoding capability. The decode time of such a
scheme would be more than double the cycle time of the proposed design, giving no improvement in performance.
VI. SIGNED DIVISION
Although the algorithm is described for positive numbers,
it can be extended to signed numbers quite easily. The method
is based on complementing a negative operand, and the quotient (and remainder) as necessary.
The two's complement of the dividend can be formed in one
of two ways.
Either a negative dividend is placed in the A borrow register
initially, rather than in the difference register. Thus, the first
PR is 0 - N as required. Alternatively, the 1's complement of the dividend is loaded to the A difference register, and the borrow register forced to all ones, i.e.,

-N = N̄ - (-1)

which is correct. Which method is selected is determined by
the available circuitry. Three input multiplexers are not usually
available, but four inputs are, favoring the second method.
However, forcing a register to all ones and all zeros is not
usually possible in a reasonably high scale of integration. The
first method, however, requires three input multiplexers on
both borrow and difference registers. Thus, the choice of
method is not simple.
For the divisor, the 1's complement of a negative number
is used in a subtraction. The extra one in the least significant
bit is placed in the space in the borrow register that is created
by shifting the borrow up by one bit with respect to the difference. This is acceptable since the divisor is being subtracted.
The quotient is formed in a redundant representation, one section representing +1s and the other -1s. The complement of the result may be obtained by reversing the roles of the two registers, i.e., we perform Q(-1) - Q(1) rather than Q(1) - Q(-1).
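This register swap can be sketched directly (an illustration added here, assuming the two-register form just described):

```python
# Sketch (assumed two-register form): the redundant quotient is held as
# a string of +1 digits (q_pos) and a string of -1 digits (q_neg).
def quotient_value(q_pos, q_neg):
    return q_pos - q_neg              # Q = Q(1) - Q(-1)

def negated_quotient(q_pos, q_neg):
    # Complementing the result merely swaps the roles of the registers.
    return q_neg - q_pos              # -Q = Q(-1) - Q(1)

q_pos, q_neg = 0b1101, 0b0010         # the digits 1, 1, -1, 1 of Section II
assert quotient_value(q_pos, q_neg) == 0b1011
assert negated_quotient(q_pos, q_neg) == -0b1011
```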
VII. THE GROUP SUBTRACTOR UNIT

The main cost of this divider unit lies in the borrow-save subtract units, since borrow-save subtract units are not available in high-speed circuit ranges (ECL 10K and 100K). It would be possible to put four bits of such a unit in a 24-pin package with less than 0.5 W dissipation. Since this has not been done, an alternative is to make use of an arithmetic unit in a borrow-save mode. In practice the -181 unit is very slow in relation to the simpler borrow-save units, and would not be suitable. The F100180, however, is relatively fast, but as it contains no carry-out signal, it must be used as a 5-bit unit rather than 6, the "sum 6" being used as "carry 5."

The problem with using a device such as this in a borrow-save mode is that there are only two inputs to most bits, rather than the three required by a borrow-save technique. However, the procedure requires the PR to be shifted between cycles, so that after a few cycles the borrow appears opposite the "borrow in" input of the arithmetic unit. Fig. 7 illustrates a group subtractor division using a four-bit arithmetic unit. The first subtraction results in a difference, and a borrow for each group of 4 bits. These are then shifted left. The difference is shifted one place, but the borrow must be shifted one place with respect to the difference, i.e., 2 places. These borrow bits are not opposite the borrow-in inputs of the arithmetic unit, so take no part in the arithmetic in the second cycle. They are transferred to the new borrow register with a one-bit shift. Notice that there are now two significant bits in each group. This procedure continues until a borrow bit is aligned with the least significant bit of a group of 4 bits, when it can take part in the arithmetic via the borrow input. This leaves an empty space in the borrow register, which is filled by the borrow from the next less significant group of bits.

Fig. 7. Group subtractor division.

This system built with the F100180 has a subtract time similar to that of a borrow-save subtractor, but contains five bits in a chip. This is much cheaper than building the unit with full borrow-save subtractors, which would have to use small scale integrated circuits (2 per bit).

VIII. SOME OTHER DIVISION TECHNIQUES

The simple nonrestoring divider using a full propagate subtraction and redundancy, but developing two bits per cycle, turns out to be quite cost effective relative to other designs (see below). Strictly speaking, one requires to subtract multiples of the divisor up to three times to generate the two quotient bits. However, by placing a more severe restriction on the range of values that the PR can take, it is possible to achieve the result using only D and 2D. In this case the new partial remainder NPR is given by

NPR = 4(OPR - sD)

where OPR is the old partial remainder, and s is 0, ±1, or ±2. If gD is the maximum value NPR can take, then it is also the maximum value of OPR, and s will be at its maximum value of 2. Hence

gD = 4gD - 8D
g = 8/3.

Fig. 8 shows the Robertson diagram for this algorithm. The slope of the obliques is now 4, and there are five possible values of s. There is still a range of overlap between the various possible selections, so that decoding of the procedure for the next cycle requires only a few bits.

Fig. 8. Robertson diagram for 2-bits-at-a-time division.

The overlap of the ranges of s suggests a further improvement. A special subtractor operating on only the most significant few bits of PR and D will produce a result much faster than the full propagate subtraction. Provided sufficient bits are included (8) [3], the decoding logic can work with these only, the decode time then being completely overlapped with the subtraction. This is in fact a significant improvement in time for very little cost. A similar line of reasoning can lead to higher radix algorithms. The comparison later includes figures for a radix 8 divider, but the value of 3D must be generated, which is a disadvantage.

Two forms of iterative technique are also available for performing division. These are based upon the relation

Q = x/y = (x ∏ Ri)/(y ∏ Ri)

where y ∏ Ri tends to unity.

Fig. 9. Iterative division using shift-and-add.

In the first of these methods the factors Ri are chosen such that the denominator approaches unity quadratically, and is based on the Newton-Raphson iterative technique. It is used in the IBM 360/91 [5], Cray-1 [6], and MU5 [7]. The first factor is found from a look-up table, and successive iterations double the number of bits of accuracy. Usually 2 or 3 iterations are used, so that the division time is equivalent to 6-8 multiplications (2 multiplications and a subtraction per iteration, plus an initializing cycle).
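The multiplicative scheme can be sketched in the Goldschmidt form associated with the IBM 360/91 [5] (an illustration added here; it assumes the divisor is bit-normalized into [0.5, 1) and uses a plain 2 - d factor in place of the look-up table seed, so more iterations are needed than the 2 or 3 quoted):

```python
# Hedged sketch of the multiplicative iteration: multiply numerator and
# denominator by factors R_i so the denominator tends to unity.
def goldschmidt_divide(x, y, iterations=5):
    """Assumes 0.5 <= y < 1; each step costs two multiplications."""
    n, d = x, y
    for _ in range(iterations):
        r = 2.0 - d          # factor R_i chosen so d*r moves toward 1
        n, d = n * r, d * r  # denominator error squares each iteration
    return n                 # as d -> 1, n -> x / y

q = goldschmidt_divide(0.75, 0.6)
assert abs(q - 1.25) < 1e-9
```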
The second iterative technique chooses rather simpler factors which require only an add and a shift in each iteration. The method has been discussed by a number of authors, notably de Lugish [8] for base 2 and Rodrigues [4], [12] for base 4. The algorithms tend to a solution by only 1 or 2 bits, respectively, in each iteration. In the Rodrigues algorithm the ith iteration gives

q_{i+1} = q_i (1 + s_i 4^{-i})

and

U_{i+1} = 4(U_i + s_i + s_i U_i 4^{-i}).

The s_i are chosen so that |U_i| <= 8/3. The variable q_i tends towards the quotient by two bits in each iteration.

A hardware implementation of this is shown in Fig. 9. The U register is examined to determine the value of s_i, and at the same time U is shifted 2i places to produce U_i 4^{-i}. A "mini adder" of a few of the more significant bits adds s_i to U_i, and s_i is used to select U_i 4^{-i}, 2U_i 4^{-i}, or zero to be fed to the main adder. The result of the main addition is then returned to U. A similar procedure operates for q. By placing a master-slave flip-flop before the main adder it would be possible to use just the "U" section of this hardware in pipelined form.
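The iteration can be modeled exactly as printed (an illustration added here, taking U_i = 4^i (y_i - 1) so that the recurrence for U holds; the s_i selection is a plain rounding rule assumed for the sketch, which converges more slowly than the two bits per iteration the real decode achieves):

```python
from fractions import Fraction

# Hedged sketch of the Rodrigues-style base-4 iteration.
def rodrigues_divide(x, y, iterations=12):
    q = Fraction(x)
    U = Fraction(y) - 1                   # U_0 = y - 1
    for i in range(iterations):
        s = max(-2, min(2, -round(U)))    # keep U + s near zero (assumed rule)
        q = q + s * q / 4 ** i            # q_{i+1} = q_i (1 + s_i 4^-i)
        U = 4 * (U + s + s * U / 4 ** i)  # U_{i+1} = 4(U_i + s_i + s_i U_i 4^-i)
    return q                              # tends to x / y

q = rodrigues_divide(3, Fraction(3, 4))
assert abs(q - 4) < Fraction(1, 2 ** 20)
```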
The primary advantage of this technique is that similar algorithms are known for a number of elementary mathematical
functions [4], [8], and these can use the same hardware, thus
improving the cost effectiveness.
The above division method can be extended into an on-line algorithm [13]. In such a scheme the division process is started as soon as a number of most significant bits of the divisor and the dividend are available. The division then continues as more operand bits become known. These algorithms are ideal in a real-time environment where the operands are generated in a serial manner. However, these algorithms are more complicated and consequently of little use in a conventional computer.
IX. A COMPARISON OF DIVIDER UNITS
Table I lists a number of dividers with their cost and operation times. The first column gives an approximate cost per
bit, where n is the number of bits. As a means of comparison,
the next column gives the cost for 64 bit operands, where an
operand is the mantissa part of a floating point number. The
cost is in terms of number of IC's, since cost is proportional to
mounting area, rather than to the price of the IC's themselves.
All the dividers are designed assuming F100K logic, and the
times are the "typical" ones quoted by the manufacturers. The
cycle time is the time to perform one division iteration and
generate one quotient digit. The total division time is the
product of the number of cycles with the cycle time, to which
is added any delay due to assimilation of a redundant quotient
or the formation of divisor multiples. Cost times time represents a figure of merit giving a measure of cost effectiveness,
the "best" unit being that with the smallest figure of merit.
The first point to notice is that the simple and cheap nonrestoring radix 2 divider is one of the more cost-effective designs. Indeed the first three entries in the table represent three
of the best choices, the radix-8 unit also being the fastest shown
using commercial MSI.
The borrow-save subtract divider can be seen to be quite
fast, but also very expensive. As was indicated earlier, this is
largely due to the lack of a suitable MSI. The KDF-9 [9]
computer made use of a BSS unit without look-ahead, sharing
its equipment with a twin-beat multiplier [10] in order to reduce actual costs. The technique which has been explored here
TABLE I
COMPARISON OF DIVIDER UNITS

TYPE                                     COST PER BIT   COST FOR     CYCLE      NUMBER      TIME    COST x TIME
                                         (ICs)          64 BITS      TIME       OF CYCLES   (ns)    FOR 64 BITS
                                                        (ICs)        (ns)                           (10^3 IC.ns)
Simple nonrestoring, radix 2             -              60           15.1       65          982     58.9
Nonrestoring, radix 4, parallel decode   20 + 13n/8     121          13.2       33          450     54.5
Nonrestoring, radix 8, parallel decode   30 + 17n/8     166          13.7       23          344     57.1
BSS system with lookahead                15 + 11n/2     360          5.9        64          385     138.6
Above using ULAs                         20 + 9n/8      92           4.1        64          270     24.8
Above using 5-bit group subtractor       15 + 5n/2      177          5.5        64          360     63.7
Rodrigues iterations, radix 4            15 + 5n/2      179          21.8       32          700     125.3
Newton-Raphson, 2 iterations             -              -            -          -           -       -
is to put a 2-bit slice onto an uncommitted logic array (ULA)
(gate array, masterslice) [3], [10] of a type made by the
Plessey Company. The speed of this unit is about 0.5 ns per
basic gate (similar to ECL 100K). As can be seen, this has a
dramatic effect on cost and speed.
The group subtractor unit was also introduced to cut down
on the package count of the BSS unit, but by using commercially available IC's. The improvement over the basic BSS unit
is not as dramatic as for the ULA version.
The Rodrigues iterative divider is not very cost effective,
mainly because of its speed. The primary advantage of this type
of unit is that similar algorithms are known for computing a
number of elementary mathematical functions (square root,
sine, exp, etc.), and hence the cost can be offset partly against
them. This would then make such a unit more worthwhile.
The speed of the Newton-Raphson divider is based on the
availability of two multiplication units using ULA's [10]. A
single unit using F100K-speed technology might give similar
figures. The cost is not quoted, since these multipliers can be
used for multiplication functions also. However, the cost of
controlling the algorithm will be a few tens of IC's, and there
is then a strong argument for building a simple dedicated unit
instead if the operand size is small (64 bits or less). To obtain
results accurate to within 1 bit will also require the multiplications and additions to be performed to about 4 bits more
accuracy than is normally required. As a further comparison,
the Cray-1 reciprocal approximation unit requires 29 beats of
12.5 ns or 362.5 ns. A further multiplication is then needed to
form a quotient.
In some computations the remainder is required rather than,
or as well as, the quotient. Neither of the iterative techniques
will provide this directly; it may be obtained by a further
multiplication and a subtraction. It is also to be noticed that
the iterative techniques will not give the correct answer in a
fixed point division, unless an appropriate rounding technique
is also applied. This further reduces their effectiveness.
The figures quoted indicate that with modern technology
it is possible to design a cost effective divider unit. The basic
problem of the design is such that a factor of two at least is
likely to exist between equivalent multiplication and division
algorithms. This is still much better than some factors that
have been found in the past. The possibility of further developments in ULA's suggests that even cheaper designs will soon
be possible. A design for a combined multiply-divide-square
root system has been worked out [3], and will be reported
later.
ACKNOWLEDGMENT
Thanks are due to Prof. D. B. G. Edwards of the Department of Computer Science, University of Manchester, for
provision of the facilities to carry out the work described. Dr.
Zurawski wishes to acknowledge the support of the SRC.
REFERENCES
[1] J. B. Gosling, Hardware for Computer Arithmetic. London: Macmillan, 1980.
[2] J. E. Robertson, "A new class of digital division methods," IRE Trans.
Electron. Comput., vol. EC-7, pp. 218-222, 1958.
[3] J. H. P. Zurawski, "High performance evaluation of division and other
elementary functions," Ph.D. dissertation, Univ. of Manchester,
Manchester, England, 1980.
[4] M. R. D. Rodrigues, "Algorithms for the fast hardware evaluation of
mathematical functions," M.Sc. thesis, Univ. of Manchester, Manchester, England, 1978.
[5] S. F. Anderson, J. G. Earle, R. E. Goldschmidt, and D. M. Powers, "The IBM System/360 Model 91: Floating-point execution unit," IBM J. Res. Develop., vol. 11, pp. 34-53, 1967.
[6] Cray Research Co., Cray-1 Data.
[7] D. Morris and D. N. Ibbett, The MU5 Computer System. London:
Macmillan, 1979.
[8] B. G. de Lugish, "A class of algorithms for automatic evaluation of
certain elementary functions in a binary computer," Ph.D. dissertation,
Univ. of Illinois, Urbana-Champaign, 1970.
[9] J. R. Ellison, "The KDF-9 multiplier divider unit," English Elec. Co.
Ltd., Rep. K/GD.y.99, 1963.
[10] J. B. Gosling, D. J. Kinniment, and D. B. G. Edwards, "Uncommitted
logic array which provides cost effective multiplication even for long
words," CDT, vol. 2, pp. 113-120, 1979.
[11] J. Coonen, W. Kahan, J. Palmer, T. Pittman, and D. Stevenson, "A
proposed standard for floating point arithmetic," SIGNUM Newsletter,
Oct. 1979.
[12] M. R. D. Rodrigues, J. H. P. Zurawski, and J. B. Gosling, "The hardware
evaluation of mathematical functions," to be published.
[13] R. M. Owens and M. J. Irwin, "On-line algorithms for the design of
pipeline architectures," presented at 6th Symp. on Comput. Arch.,
1979.
J. H. P. Zurawski (M'81) received the M.Sc. and
Ph.D. degrees in computer science in 1977 and
1980, respectively, from the University of Man-
chester, Manchester, England.
Since 1979 he has been with the Department of
Computer Science, University of Manchester,
where he is involved in the commissioning of the
arithmetic unit of the new MU6-G computer.
J. B. Gosling received the B.Sc. degree in electrical
engineering in 1960 and the Ph.D. degree in computer science in 1969, from the University of Man-
chester, Manchester, England.
Currently, he is a Senior Lecturer in the Department of Computer Science, University of
Manchester. Previously, he worked as a Development Engineer with AEI (Woolwich) Ltd. He has
had some experience in the design of computer
graphic systems. He is now involved in the design
of advanced mathematical computing machines,
in particular, the design of hardware assistance for large problems.
Regular Correspondence.
Synthesis of Generalized Parallel Counters
S. DORMIDO AND M. A. CANTO
Abstract-Synthesis of generalized equal column parallel counters from
smaller ones is presented. The notation used for a general counter is (n ×
N; d), where n is the number of input columns, N is the number of input bits in each
column, and d = s · n (s = 2, 3, …) is the number of bits in the output word.
This method makes possible the construction of an (n₂ × N₂; d₂) counter from a given
(n₁ × N₁; d₁) counter (the primitive network), where n₂ = an₁ and a is an integer.
The total number of elements C and the number of levels q are bounded by

C ≤ ⌈a(N₂ − s₂)/(N₁ − s₁)⌉ C₁ and q ≤ ⌈log_{N₁} N₂⌉ q₁

where C₁ and q₁ are the element and level counts of the primitive network.
Index Terms-Digital counters, fast multipliers, multiple-input adders,
parallel-counter networks, parallel counters.
Manuscript received April 2, 1979; revised November 26, 1979, December
15, 1980, and May 7, 1981.
The authors are with the Departamento de Informática y Automática,
Universidad Complutense, Ciudad Universitaria, Madrid, Spain.
I. INTRODUCTION
Parallel counters were introduced by Dadda [1] as a basic unit in
the realization of parallel multipliers. A parallel (p; d) counter is a
combinational network with d outputs and p ≤ 2^d − 1 inputs, where
the binary number represented by the d outputs is the number of
"ones" present at the inputs. Since then, considerable attention has
been paid to methods for generating large parallel counters from
smaller ones.
Foster and Stockton [2] have described a method for constructing
large parallel counters with a network of full adders ((3; 2) counters).
Swartzlander [3] and Kobayashi and Ohara [4] expand this method
in order to form large parallel counters from smaller ones.
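As an illustration of the Foster-Stockton approach (our own sketch, not taken from [2]), a (7; 3) counter can be assembled from four full adders, each of which is itself a (3; 2) counter:

```python
# A (7; 3) counter built from four full adders ((3; 2) counters).

def full_adder(a, b, c):
    """A (3; 2) counter: sum bit and carry bit of three input bits."""
    return a ^ b ^ c, (a & b) | (a & c) | (b & c)

def counter_7_3(x):
    """Count the ones among seven bits; returns (b2, b1, b0)."""
    s1, c1 = full_adder(x[0], x[1], x[2])   # first group of three inputs
    s2, c2 = full_adder(x[3], x[4], x[5])   # second group of three inputs
    s3, c3 = full_adder(s1, s2, x[6])       # b0: the weight-1 column
    s4, c4 = full_adder(c1, c2, c3)         # b1 and b2: the weight-2 column
    return c4, s4, s3

print(counter_7_3((1, 1, 0, 1, 1, 1, 0)))  # five ones -> (1, 0, 1)
```

The longest path here runs through three full-adder delays, which suggests why networks built only from (3; 2) counters grow in depth and why larger building blocks, as in [3] and [4], are attractive.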
This type of counter can be considered as a particular case of a
more general class of arithmetic networks defined by Meo [5]. As
Whereas parallel counters have all inputs of weight 2⁰, these more general elements may
accept several successively weighted input columns and produce their
sum, taking the weighting into consideration. Stenzel et al. [6] and
Dadda [7] give schemes of parallel digital multipliers based on generalized parallel counters.
Counters of this type are represented by the following notation:
(pₙ₋₁, pₙ₋₂, …, p₀; d)
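A behavioural model of this notation (our own sketch, purely to fix ideas) sums the p_j input bits of weight 2^j into a d-bit output word:

```python
# Behavioural model of a generalized parallel counter (p_{n-1}, ..., p_0; d):
# columns[j] holds the p_j input bits of weight 2**j.

def generalized_counter(columns, d):
    """Weighted count of ones, returned as d output bits, LSB first."""
    total = sum((1 << j) * sum(col) for j, col in enumerate(columns))
    assert total < (1 << d), "the sum must fit in the d-bit output word"
    return [(total >> k) & 1 for k in range(d)]

# A (3, 3; 4) counter: three weight-2 bits and three weight-1 bits.
bits = generalized_counter([[1, 1, 0], [1, 0, 1]], d=4)  # sum = 2 + 4 = 6
```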
0018-9340/81/0900-0699$00.75
© 1981 IEEE