
Floating-Point Division and Square Root:
Choosing the Right Implementation
Peter Soderquist
School of Electrical Engineering
Cornell University
Ithaca, New York 14853
E-mail: [email protected]
Miriam Leeser
Dept. of Electrical and Computer Engineering
Northeastern University
Boston, Massachusetts 02115
E-mail: [email protected]
To appear in IEEE Micro
Copyright 1996 IEEE. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional
purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other
works must be obtained from the IEEE.
Floating-point support has become a mandatory feature of new microprocessors. Over the past few years, the leading architectures have seen several generations of floating-point units (FPUs). While addition and multiplication implementations have become increasingly efficient, division and square root support has remained uneven. There is considerable variation in the types of algorithms employed, as well as in the quality and performance of the implementations. This situation originates in skepticism about the importance of division and square root, and an insufficient understanding of the design alternatives. The purpose of this paper is to clarify and evaluate the implementation tradeoffs at the FPU level, thus enabling designers to make informed decisions.

Importance of division and square root

Division and square root have long been considered minor, bothersome members of the floating-point family. Microprocessor designers frequently perceive them as infrequent, low-priority operations, barely worth the trouble of implementing; design effort and chip resources are allocated accordingly. The survey of microprocessor FPU performance in Table 1 shows some of the uneven results of this philosophy. While multiplication requires from 2 to 5 machine cycles, division latencies range from 9 to 60. The variation is even greater for square root, which is not supported in hardware in several cases. This data hints at, but mostly conceals, the significant variation in algorithms and topologies among the different implementations.

The error in the Intel Pentium floating-point unit, and the accompanying publicity and $475 million write-off, illustrate some of the hazards of an incorrect division implementation [8]. But correctness is not enough; low performance causes enough problems of its own. Even though divide and square root are relatively infrequent operations in most applications, they are indispensable, particularly in many scientific programs. Compiler optimizations tend to increase the frequency of these operations [11]; poor implementations disproportionately penalize code which uses them at all [10]. Furthermore, as the latency gap grows between addition and multiplication on the one hand and divide/square root on the other, the latter increasingly become performance bottlenecks [14]. Programmers have attempted to get around this problem by rewriting algorithms to avoid divide/square root operations, but the resulting code generally suffers from poor numerical properties, such as instability or overflow [9, 14, 7]. In short, division and square root are natural components of many algorithms, which are best served by implementations with good performance.

Quantifying what constitutes "good performance" is challenging. One rule of thumb, for example, states that the latency of division should be three times that of multiplication; this figure is based on division frequencies in a selection of typical scientific applications [1]. Even if one accepts this doctrine at face value, implementing division and square root involves much more than relative latencies. Area, throughput, complexity, and the interaction with other operations must be considered as well. This article explores the various tradeoffs involved and illuminates the consequences of different design choices.
Add-multiply configurations
Add-multiply circuits are the foundation of any floating-point unit. Addition (which subsumes subtraction) and
multiplication are not only the most frequently occurring
arithmetic operations, but together they are sufficient to
support all other operations required by the IEEE 754
floating-point standard [6]. All other functions, including
division and square root, may be regarded as additions to
or enhancements of the add-multiply hardware. For these
reasons, the implementation of addition and multiplication
largely determines the overall performance of the floating-point unit.
Table 1: Arithmetic performance of recent microprocessor FPUs for double-precision operands; latency/throughput in machine cycles (* = inferred from available information; † = not supported in hardware)

Design                 Cycle Time [ns]   Add    Multiply   Divide         Square Root
DEC 21164 Alpha AXP    3.33              4/1    4/1        22-60/22-60*   †
Hal Sparc64            6.49              4/1    4/1        8-9/7-8        †
HP PA7200              7.14              2/1    2/1        15/15          15/15
HP PA8000              5.00              3/1    3/1        31/31          31/31
IBM RS/6000 POWER2     13.99             2/1    2/1        16-19/15-18*   25/24*
Intel Pentium          6.02              3/1    3/2        39/39          70/70
Intel Pentium Pro      7.52              3/1    5/2        30*/30*        53*/53*
MIPS R8000             13.33             4/1    4/1        20/17          23/20
MIPS R10000            3.64              2/1    2/1        18/18          32/32
PowerPC 604            10.00             3/1    3/1        31/31          †
PowerPC 620            7.50              3/1    3/1        18/18          22/22
Sun SuperSPARC         16.67             3/1    3/1        9/7            12/10
Sun UltraSPARC         4.00              3/1    3/1        22/22          22/22
The add-multiply configurations in the latest generation
of microprocessors fall into two basic categories. The first
variety, which we have termed the independent configuration, is illustrated in Figure 1. It features add and multiply
units which operate independently, reading operands from
the register file and returning correctly rounded results. Although the illustration shows sharing of inputs and outputs
between functional units, some implementations give each
unit individual access to the register file. Figure 1 also shows
the typically high performance figures for this type of configuration. Frequently, the adder is actually a floating-point
ALU, performing comparisons, complementation, and format conversions as well as addition and subtraction. Processors which feature independent add-multiply configurations include the DEC 21164, HP PA7200, Intel Pentium Pro, and Sun UltraSPARC.
[Figure 1: Typical independent add-multiply configuration. Separate add and multiply units read operands from the register file. Typical performance: addition, 2-cycle latency, 1-cycle throughput; multiplication, 2-cycle latency, 1-cycle throughput.]

The second type of add-multiply structure, known as the multiply-accumulate configuration, is shown in Figure 2, along with typical performance figures. It combines both functions into a single atomic operation on three operands, with the multiplication of two operands followed by addition of the product and the third operand. Typically, the product is maintained at a very high precision and added without intermediate rounding. The multiply-accumulate configuration is motivated by the fact that a large proportion of scientific computations, such as series and matrix operations, involve many instances of multiplication followed immediately by addition. The IBM RS/6000 series, the MIPS R8000, the HP PA8000, and the Hal Sparc64 utilize add-multiply structures of this type.
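The defining property of the multiply-accumulate operation, a wide product added to a third operand with a single rounding at the end, can be observed in software with the C99 fma() function. The sketch below is only an arithmetic illustration of those semantics, not a model of any particular pipeline, and the operand values are chosen arbitrarily (an assumption of this sketch) so that the one-rounding and two-rounding results visibly differ.

#include <math.h>
#include <stdio.h>

int main(void)
{
    /* a*b = 1 + 2^-26 + 2^-54 exactly, which needs more than the 53
     * significant bits of a double, so rounding the product loses the
     * 2^-54 term.  (Assumes the compiler does not itself contract
     * a*b + c into an fma.) */
    double a = 1.0 + 0x1.0p-27;
    double b = 1.0 + 0x1.0p-27;
    double c = -1.0;

    double separate = a * b + c;     /* product rounded, then added       */
    double fused    = fma(a, b, c);  /* product kept wide, single rounding */

    printf("separate: %.17g\n", separate);   /* 2^-26 only                */
    printf("fused:    %.17g\n", fused);      /* 2^-26 + 2^-54             */
    return 0;
}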
[Figure 2: Typical multiply-accumulate configuration. A single multiply-accumulate unit reads three operands from the register file. Typical performance: multiply-accumulate, 2-cycle latency, 1-cycle throughput.]

Design issues

Given the add-multiply infrastructure, there are many ways to incorporate division and square root functionality into an FPU. The following is a brief introduction to the major design decisions, which will be explored in greater detail in subsequent sections. The primary issues may be split into the following categories:

Algorithm. The choice of divide/square root algorithm is by far the most important design decision. It involves a series of smaller decisions, and once made, largely determines the subsequent range of choices and the area and performance of the implementation in general.

Issue rate. Although some chips execute floating-point instructions strictly in series, many of the most recent microprocessors can issue up to two floating-point instructions per cycle. Therefore, the next issue worthy of attention is the impact of floating-point issue rate on the efficiency of divide/square root implementations.

Connectivity. Closely related to issue rate is the manner in which divide/square root hardware is interfaced with the rest of the FPU, either with independent access to the register file, or coupled to or integrated with one of the other functional units.

Replication. Finally, it is becoming not uncommon for microprocessors to duplicate floating-point functional units, including the divide/square root hardware. The performance and area effects of such replication deserve closer consideration.

These issues are not independent but interact with each other in complex and non-obvious ways. We will explain the connections between them while trying to isolate and quantify their effects on the tradeoffs as much as possible.

Simulation

To provide a uniform, quantitative foundation for the analysis of the different design tradeoffs, we have implemented an FPU-level simulator which executes a benchmark based on Givens rotations (see the sidebar included at the end of this paper). Given an FPU configuration, including its structure and performance on individual operations, the simulator calculates the number of cycles required to transform an arbitrary matrix into triangular form using Givens rotations. The simulations use a standard set of test matrices ranging from 10-by-10 elements to 200-by-100 elements in size [14].

The Givens rotation algorithm has been chosen because it is rich in divide and square root operations, and because it is a real program with important numerical applications. The primary task of many scientific programs is the solution of partial differential equations. Givens rotations are a central component of several frequently used solution techniques. They also feature in signal processing algorithms, and the rotation and projection algorithms common in graphics and solid modeling use similar combinations of operations.

The Givens rotation algorithm is not "typical" in its use of divide/square root operations. In fact, there are probably few other real applications whose performance is so dependent on the efficiency of division and square root. Nevertheless, what it lacks in representativeness, it makes up for in importance. Finally, as a kind of divide/square root "torture test," Givens rotations clearly illustrate the effects of different implementation choices.

Divide/square root algorithms

There are many possible algorithms for computing division and square root, but only a subset are practical for microprocessor implementation at the present time. These fall into the two categories of multiplicative and subtractive algorithms [13]. Multiplicative methods use hardware integrated with the floating-point multiplier and have low to moderate latencies, while subtractive methods generally employ separate circuitry and have low to high latencies.

We will consider only the case where division and square root are implemented in hardware using the same algorithm. The arguments for this are covered in more detail in [14]. In summary, division and square root tend to have very similar implementations and allow for extensive hardware sharing. Furthermore, this is the most common case in actual practice.

Multiplicative techniques. Multiplicative algorithms compute an estimate of the quotient or square root using iterative refinement of an initial guess. They reduce divide and square root operations to a series of multiplications, subtractions, and bit shifts. In addition, multiplicative algorithms typically converge at a quadratic rate, which means the number of accurate digits in the estimate doubles at each iteration. The two techniques used in recent microprocessors are the Newton-Raphson method and Goldschmidt's algorithm. With many features in common, they differ primarily in the ordering of operations, which affects their mapping onto pipelined hardware.

The Newton-Raphson method. An algorithm with a long history, the Newton-Raphson method [12, 5] has been implemented in many machines, including the IBM RS/6000 series. To find a/b, set the seed value x_0 to an estimate of 1/b and iterate over

    x_{i+1} = x_i (2 - b x_i)

until x_{i+1} is within the desired precision of 1/b. Then the product a x_{i+1} approximates a/b. For square root computation, set the seed to approximate 1/sqrt(a) and refine it with the iteration

    x_{i+1} = (1/2) x_i (3 - a x_i^2).

The product a x_{i+1} yields an approximation to sqrt(a).
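As a concrete illustration, the following C sketch runs both recurrences behaviorally in double precision. The exponent-derived seeds and the iteration counts are assumptions of this sketch, standing in for the seed lookup table and fixed iteration count a hardware implementation would use.

#include <math.h>
#include <stdio.h>

/* a/b: iterate x -> x*(2 - b*x), which converges quadratically to 1/b,
 * then multiply by the numerator a. */
static double nr_divide(double a, double b, int iterations)
{
    int e;
    frexp(b, &e);                    /* b = m * 2^e with m in [0.5, 1)  */
    double x = ldexp(1.5, -e);       /* crude seed for 1/b              */
    for (int i = 0; i < iterations; i++)
        x = x * (2.0 - b * x);
    return a * x;
}

/* sqrt(a): iterate x -> 0.5*x*(3 - a*x*x), which converges to 1/sqrt(a),
 * then multiply by a to obtain sqrt(a). */
static double nr_sqrt(double a, int iterations)
{
    int e;
    frexp(a, &e);
    double x = ldexp(1.0, -e / 2);   /* crude seed for 1/sqrt(a)        */
    for (int i = 0; i < iterations; i++)
        x = 0.5 * x * (3.0 - a * x * x);
    return a * x;
}

int main(void)
{
    printf("%.17g %.17g\n", nr_divide(355.0, 113.0, 6), 355.0 / 113.0);
    printf("%.17g %.17g\n", nr_sqrt(2.0, 6), sqrt(2.0));
    return 0;
}

Note the serial dependence in each loop: every multiplication needs the result of the previous one, which is the property that makes the Newton-Raphson recurrence awkward to pipeline.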
Goldschmidt’s algorithm. Recent implementations of
Goldschmidt’s algorithm [5] have included the Sun SuperSPARC and arithmetic coprocessors from Texas Instruments. To compute a=b = x0=y0 , iteratively calculate
x +1 = x r
i
i
and
i
y +1 = y r
i
i
i
where ri = 2 ? yi ; then yi ! 1 and xi ! a=b. Both x0
and y0 should be prescaled by a seed value approximating
p
1=b to insure rapid convergence. To compute a, set x0 =
p
y0 = a, prescale by an estimate of 1= a and iterate over
x +1 = x r2
i
where ri
= (3
i
and
i
y +1 = y r
i
i
i
? y )=2, such that
i
x +1=y2+1 = x =y2 = 1=a:
i
Then xi
i
i
i
! 1, and hence y ! pa.
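For comparison with the Newton-Raphson sketch, here is a behavioral C rendering of both Goldschmidt recurrences in double precision. Again the crude exponent-derived seeds are assumptions of this sketch; a real unit reads its seed from a table and carries the intermediate values at extended precision. The point to notice is that x and y are updated by independent multiplications, which is what lets a pipelined multiplier overlap them.

#include <math.h>
#include <stdio.h>

/* a/b: scale numerator and denominator by r = 2 - y until y reaches 1. */
static double gs_divide(double a, double b, int iterations)
{
    int e;
    frexp(b, &e);
    double seed = ldexp(1.5, -e);       /* crude estimate of 1/b          */
    double x = a * seed;                /* x -> a/b                       */
    double y = b * seed;                /* y -> 1                         */
    for (int i = 0; i < iterations; i++) {
        double r = 2.0 - y;
        x *= r;                         /* independent of the y update,   */
        y *= r;                         /* so the two can be pipelined    */
    }
    return x;
}

/* sqrt(a): keep x/y^2 = 1/a while driving x to 1, so that y -> sqrt(a). */
static double gs_sqrt(double a, int iterations)
{
    int e;
    frexp(a, &e);
    double seed = ldexp(1.0, -e / 2);   /* crude estimate of 1/sqrt(a)    */
    double x = a * seed * seed;         /* prescaled x_0, close to 1      */
    double y = a * seed;                /* prescaled y_0, close to sqrt(a) */
    for (int i = 0; i < iterations; i++) {
        double r = 0.5 * (3.0 - x);
        x *= r * r;
        y *= r;
    }
    return y;
}

int main(void)
{
    printf("%.17g %.17g\n", gs_divide(355.0, 113.0, 6), 355.0 / 113.0);
    printf("%.17g %.17g\n", gs_sqrt(2.0, 6), sqrt(2.0));
    return 0;
}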
Implementations. Multiplicative divide/square root implementations generally take the form of modifications to the floating-point multiplier. This is because divide and square root functionality alone do not justify the area required for a second multiplier. The designers of the IBM RS/6000 floating-point unit have chosen the Newton-Raphson method as being best suited to the multiply-accumulate structure. In independent add-multiply configurations, Goldschmidt's algorithm is a natural choice for pipelined multipliers. Since the numerator and denominator operations are independent, clever scheduling can allow the multiplier to run at maximum efficiency. The Newton-Raphson method suffers from dependencies between subsequent operations, which hobbles pipelining.

[Figure 3: A floating-point multiplier enhanced for multiplicative divide/square root computation. Around the multiplier array and rounder/normalizer, the enhanced datapath adds a seed lookup table, multiplexers, temporary and pipeline registers, and a subtracter/shifter; operands a and b enter from operand registers and the quotient or root is delivered to the result register.]

The block diagram in Figure 3 shows one possible implementation, namely an independent floating-point multiplier modified for Goldschmidt's algorithm. The new components and routing needed for divide/square root support are shown with dashed lines. Although the details will vary from case to case, most multiplicative implementations feature one or more of the following hardware enhancements:

Extra routing and storage. These are required to make divide/square root into atomic operations independent of the register file.

Seed value lookup tables. The execution time of the algorithm is directly related to the accuracy of the initial guess. An accurate double-precision value can be produced in 3 iterations with an 8-bit seed, while a 16-bit seed can bring the number down to 2. Unfortunately, the lookup table will have to store 512 times as many bits in the latter case [14].

Constant subtraction/shifting logic. For independent multipliers, supporting the non-multiply operations of the algorithm directly improves performance and maintains independence from the floating-point adder. Multiply-accumulate structures have these capabilities built in.

Last-digit rounding support. Values derived from reciprocal estimates, as in the Newton-Raphson method and Goldschmidt's algorithm, have an inherent error which can lead to inaccuracies in the last result digit. There are several strategies for solving this nontrivial problem, with associated area and performance tradeoffs [14]. Multiply-accumulate units can avoid extra hardware at the expense of extra operations.

Multiplicative divide/square root implementations can incur a penalty beyond the area needed for extra logic, storage, and routing. Since some of the required modifications lie on the critical path of the multiplier, they can negatively impact the latency of multiplication itself.
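The lookup-table figures quoted in the list above follow from a rough back-of-the-envelope calculation (ignoring guard bits and table-compression techniques, so it is only an estimate): quadratic convergence roughly doubles the number of accurate bits per iteration, so an 8-bit seed yields about 16, 32, and then 64 bits after three iterations, enough for the 53-bit double-precision significand, while a 16-bit seed reaches 64 bits after only two. A direct table indexed by k operand bits and storing k-bit entries, however, grows exponentially:

    2^8 × 8 = 2,048 bits    versus    2^16 × 16 = 1,048,576 bits,

a factor of 512, which is the ratio cited above.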
Subtractive techniques. The subtractive divide/square root algorithms employed in current microprocessors can generally be classified as SRT methods, named for D. Sweeney, J.E. Robertson, and K.D. Tocher, who independently developed very similar algorithms for division [12]. Subtractive division and square root compute the quotient q or square root s directly, one digit at a time. The conventional procedure for long division is an algorithm of this type. In practice, subtractive algorithms typically use redundant representations of values and treat groups of consecutive bits as higher-radix digits to enhance performance [3]. For example, a radix-4 divider scans successive pairs of bits as single digits, interpreting two radix-2 values as a single radix-4 one.

For division, let q[j] be the partial quotient at step j (where q[n] = q), and w[j] the residual or partial remainder. The goal of the algorithm is to find the sequence of quotient digits which minimizes the residual at each step. To compute q = x/d for n-digit, radix-r values, set w[0] = x and evaluate the recurrence

    w[j+1] = r w[j] - d q_{j+1},

where q_{j+1} is the (j+1)st quotient digit. For square root computation, let the jth partial root be denoted by s[j]. To find s = s[n] = sqrt(x), set w[0] = x and evaluate

    w[j+1] = r w[j] - 2 s[j] s_{j+1} - s_{j+1}^2 r^{-(j+1)}.

Define f[j] = 2 s[j] + s_{j+1} r^{-(j+1)}; then

    w[j+1] = r w[j] - f[j] s_{j+1}

has the same form as the division recurrence. In practice, f[j] is simple to generate, which facilitates combined division and square root implementations.
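The digit recurrence can be exercised with a small behavioral model. The C sketch below uses radix 2 and the signed digit set {-1, 0, 1} so that digit selection reduces to two comparisons; the use of doubles for the residual and the assumed operand normalization (0 <= x < d, 0.5 <= d < 1) are simplifications of this sketch, whereas a unit like the one in Figure 5 works in radix 4 with a carry-save residual and a table-driven selection function.

#include <stdio.h>

/* Radix-2 SRT division: w[j+1] = 2*w[j] - q[j+1]*d, digits in {-1,0,1}.
 * Requires 0 <= x < d and 0.5 <= d < 1, which keeps |w| <= d throughout. */
static double srt2_divide(double x, double d, int n)
{
    double w = x;          /* residual (partial remainder)                 */
    double q = 0.0;        /* quotient assembled from signed digits        */
    double weight = 0.5;   /* value of the next digit position, 2^-(j+1)   */

    for (int j = 0; j < n; j++) {
        w *= 2.0;                        /* shift the residual left         */
        int digit;
        if (w >= 0.5)        digit = 1;  /* only a few leading bits of w    */
        else if (w < -0.5)   digit = -1; /* are examined, which is what     */
        else                 digit = 0;  /* keeps each SRT stage fast       */
        w -= digit * d;                  /* subtract the selected multiple  */
        q += digit * weight;             /* accumulate the signed digit     */
        weight *= 0.5;
    }
    return q;              /* accurate to roughly n bits: x = q*d + w*2^-n  */
}

int main(void)
{
    printf("%.17g\n", srt2_divide(0.3, 0.75, 60));   /* 0.3/0.75 = 0.4 */
    return 0;
}

The redundancy of the digit set is what tolerates the truncated comparison: a slightly pessimistic digit choice is absorbed by later digits of opposite sign, which is the same property that lets real units keep the selection table small.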
Implementations. Subtractive divide/square root implementations generally rely on dedicated hardware which executes in parallel with the remainder of the FPU, as shown in Figure 4. A minority of designs, such as the IBM/Motorola PowerPC line and MIPS R4000 series, use enhancements of the addition hardware. However, a separate functional unit offers superior performance, since the multiplication and addition hardware can continue to operate on independent instructions while the quotient or root is being computed. If divide and square root functionality is supported by, for example, the multiplication hardware, then these operations, which typically have long latencies, tie up the unit, holding up subsequent multiply instructions until the quotient or root is available.

[Figure 4: FPU topologies with subtractive divide/square root functional units and (a) independent add-multiply or (b) multiply-accumulate configurations. In both cases the divide/square root unit reads its operands from the register file and operates in parallel with the other functional units.]

While radix-2 divide/square root units operate on one bit at a time, radix-4 implementations retire 2 bits of the result at every iteration. Higher-radix units process even larger groups of bits at once. Unfortunately, for radices greater than 8, the latency and area requirements of result-digit selection and factor generation become prohibitive. To circumvent these effects, lower-radix stages can be combined into a single higher-radix unit. Radix-16 division, for example, can be obtained by overlapping two radix-4 dividers, with only a modest increase in area and iteration delay over radix-4 alone [3]. There are methods for achieving even higher radix operations, the majority of which are restricted to theoretical treatment or experimental implementations. Several of these approaches are summarized and compared by Ercegovac, Lang, and Montuschi [2].

One way to improve the performance of any subtractive implementation is to boost the number of iterations per machine cycle, if possible. In the HP PA7200, for example, the low iteration delay of the divide/square root unit means that it can be cycled at twice the system clock rate. Another technique which offers extremely low latencies is self-timing, whereby a functional unit operates asynchronously with respect to the rest of the FPU, completing each iteration at the highest rate permitted by its internal logic [15]. The Hal Sparc64 implements a version of self-timed division.

Figure 5 shows a basic radix-4 divide/square root unit, including most of the features critical to its high performance. The residual is stored in redundant form as vectors of sum and carry bits; this enables the use of a low-latency carry-save adder to calculate the subtraction in the recurrence. The result-digit selection table returns the next quotient/square root digit on the basis of the residual and divisor/partial root values; the redundant result-digit set allows truncated values to be used, keeping the table small and fast. Factor generation logic keeps all possible multiples of d or f[j] and every result digit (q_j or s_j in {-2, -1, 0, 1, 2}) available at all times for immediate multiplexer selection. Finally, the on-the-fly conversion logic [3], which operates concurrently with quotient computation and off of the critical path, maintains updated values of s[j] and f[j] and incrementally converts the partial result from redundant into conventional representation.

[Figure 5: Radix-4 SRT divide/square root unit. The residual is held in redundant form in sum and carry registers; result-digit selection examines a few bits of the residual and of the divisor/partial root, factor generation supplies -d q_{j+1} or -f[j] s_{j+1} to a carry-save adder, and on-the-fly conversion assembles the quotient q = x/d or root s = sqrt(x).]
Performance impact. To determine how the algorithms compare in terms of performance, we have matched each add-multiply configuration with a selected set of divide/square root implementations and executed the Givens rotation benchmark on the standard matrix data set. For each configuration (independent and multiply-accumulate) we have assumed a floating-point issue rate of one instruction per cycle. FPU structures and performance figures for individual operations are closely modeled on real examples. The selected divide/square root implementations are as follows:

8-bit seed multiplicative. This is a baseline multiplicative implementation with an 8-bit seed lookup table, as used in many actual chips. The independent configuration uses Goldschmidt's algorithm, while the multiply-accumulate structure implements the Newton-Raphson method.

16-bit seed multiplicative. As before, the algorithm is matched to the add-multiply configuration; the larger lookup table reduces the latency of divide/square root computation.

Radix-4 SRT. The subtractive equivalent of the 8-bit seed case, this is a basic, practical implementation using a separate functional unit.

Radix-16 SRT. Overlapping radix-4 units provide lower-latency operations.

The first set of experiments matches the independent configuration of Figure 1, which is modeled on the HP PA7200, with the divide/square root implementations indicated above. Division and square root performance for each case is given in Table 2. The radix-4 case uses the actual latency figures from the HP PA7200. This implementation computes 4 quotient digits per machine cycle by clocking the divide/square root unit at double speed. For radix-16 operations, the figures assume that the slightly longer cycle time of the overlapped implementation can be accommodated without difficulty. The Goldschmidt's algorithm latencies are based on the Texas Instruments implementations, parameterized by the latency of multiplication and the accuracy of the seed value.

Table 2: Divide/square root performance of independent implementations; latency in cycles

Algorithm                 Divide   Square Root
8-bit seed Goldschmidt    9        13
16-bit seed Goldschmidt   7        10
radix-4 SRT               15       15
radix-16 SRT              8        8

The relative performance of the different implementations on the Givens rotation benchmark is summarized in Table 3, normalized to the 8-bit seed Goldschmidt case. The subtractive implementations clearly dominate the multiplicative ones, even in cases where the latter have superior latencies. It is the ability to execute in parallel with the rest of the FPU which gives the radix-4 and radix-16 units their decisive performance advantage.

Table 3: Improvement in execution time [%] on the Givens rotation benchmark, by implementation, for the independent configuration

Algorithm                 Max    Min   Avg
8-bit seed Goldschmidt    0.0    0.0   0.0
16-bit seed Goldschmidt   9.9    1.6   5.0
radix-4 SRT               20.9   7.2   15.2
radix-16 SRT              46.0   7.2   23.4

The second set of experiments pairs the multiply-accumulate structure shown in Figure 2 with the standard set of divide/square root implementations. This configuration and the first divide/square root implementation in Table 4 are based on the IBM RS/6000 series FPU, which uses specially adapted versions of the Newton-Raphson method. The 16-bit seed version uses these same algorithms but assumes a reduced number of iterations. As for the subtractive implementations, they have simply been carried over from the HP PA7200 case. Not only does this make for a more uniform comparison, but it seems like a feasible implementation, since the RS/6000 actually has a longer cycle time than the PA7200 while using a comparable fabrication technology.

Table 4: Divide/square root performance of multiply-accumulate implementations; latency in cycles

Algorithm                    Divide   Square Root
8-bit seed Newton-Raphson    19       22
16-bit seed Newton-Raphson   14       17
radix-4 SRT                  15       15
radix-16 SRT                 8        8

The results, shown in Table 5, display the same pattern as for the independent add-multiply case, albeit with even greater contrast. The longer latencies of the Newton-Raphson operations mean that the subtractive implementations perform even better by comparison. Note also the significant improvement afforded by the radix-16 implementation over the radix-4 case.
Table 5: Improvement in execution time [%] on the Givens rotation benchmark, by implementation, for the multiply-accumulate configuration

Algorithm                    Max     Min    Avg
8-bit seed Newton-Raphson    0.0     0.0    0.0
16-bit seed Newton-Raphson   19.7    4.9    11.6
radix-4 SRT                  69.4    22.4   48.3
radix-16 SRT                 125.7   22.5   68.8

Area considerations. There are vast differences in the implementation styles of multiplicative and subtractive methods. The former are primarily enhancements of existing circuitry, while the latter inhabit completely new hardware; furthermore, the core operations are entirely different. In the interest of an objective evaluation, it is essential to compare the circuit area required by these very distinct approaches.

Estimating area in a way that yields a fair comparison between chips is difficult because of basic differences in the implementation technologies. Nevertheless, it is possible to give some basis for evaluating the different implementations. Table 6 compares the size of the hardware dedicated to division in the Weitek 3364 and Texas Instruments 8847 arithmetic coprocessors; the figures are based on measurements of microphotographs [5]. The chips have similar die sizes and device densities, and we can assume the feature sizes are comparable as well. Although the multiplication algorithms are different, both have two-pass arrays which take up approximately 22% of the chip area. In short, apart from their divide/square root implementations, these two chips have a lot in common. However, the area devoted to division hardware is little more than 6% of the chip size in either case. Also, the relative area requirements differ by 1.5%. These figures represent only two particular designs whose implementation technology is by now somewhat out of date. However, they suggest that 8-bit seed Goldschmidt and radix-4 SRT implementations are both economical, and that the area differences between them can be kept small.

Table 6: Area comparison of two divide/square root implementations

Algorithm                Device        Chip Area [mm^2]   Transistor Count   Div./Sqrt. Area [mm^2]
radix-4 SRT              Weitek 3364   95.2               165,000            4.8
8-bit seed Goldschmidt   TI 8847       100.8              180,000            6.3

An alternative approach uses standard-cell technology to estimate the areas of different implementations, including the more sophisticated options [14]. Table 7 shows the results of this study, with the values normalized to the size of the 8-bit seed Goldschmidt variant. According to these results, a radix-16 SRT unit need only be 45% larger than a radix-4 design in the same technology. A 16-bit seed Goldschmidt implementation, by contrast, could be over 20 times larger than the 8-bit seed implementation, and 10 times larger than the radix-16 SRT unit. This calls into question the practicality of such a solution, especially since it leads to a relatively small improvement in latency, as evidenced by Tables 2 and 4.

Table 7: Relative cost of different divide/square root implementations

Algorithm                    Area Factor
8-bit seed multiplicative    1.0
16-bit seed multiplicative   22
radix-4 SRT                  1.5
radix-16 SRT                 2.2

Summary. Multiplicative implementations can provide low latency and lower cost than subtractive ones. However, their overall performance on our benchmark is inferior, mainly because the multiplication hardware is responsible for additional non-pipelined, multicycle operations. In the case of the multiply-accumulate configuration, addition, multiplication, division, and square root must all be accommodated by the same pipeline. Although 8-bit implementations tend to require less area than radix-4 hardware, 16-bit lookup tables dwarf even radix-16 implementations. Furthermore, the incredibly expensive upgrade from an 8-bit seed table to a 16-bit one offers only a modest reduction in divide/square root latency, and a downright meager improvement in benchmark execution time.

Subtractive divide/square root implementations provide superior benchmark performance at a reasonable cost, for both independent and multiply-accumulate configurations. This is primarily due to the fact that subtractive hardware operates independently and does not tie up other FPU resources. The parallel operation also means that improvements in divide/square root latency have a more decisive impact than in multiplicative implementations. Upgrading from radix-4 to radix-16 can improve performance by as much as 21% for the independent case or 56% for the multiply-accumulate structure, at a cost of 33% more area. Subtractive implementations also fit in nicely with the current trend toward decoupled superscalar processors with high instruction issue rates and growing transistor budgets. We conclude that implementations of subtractive divide/square root algorithms are the most sensible choice, and will focus on them for the remainder of the discussion.
Issue rate

The simulations in the previous section assumed an issue rate of one floating-point instruction per machine cycle. This situation holds in such processors as the PowerPC series, where the floating-point unit is a single pipeline, and the Intel P6, where the arithmetic functional units share a single set of inputs and outputs. However, many recent machines, including the MIPS R10000 and Sun UltraSPARC, can issue up to two floating-point instructions at once.

Dual issue only makes sense when there are at least two independent functional units. Furthermore, if one of the two functional units is a dedicated divide/square root unit, dual issue is not worthwhile, since these instructions are not frequent enough to keep a separate unit busy. By this reasoning, dual floating-point instruction issue is appropriate for an FPU with an independent adder, multiplier, and divide/square root unit, but not for one with a multiply-accumulate structure and separate divide/square root unit.

Area considerations. The choice of how many floating-point instructions to dispatch per cycle is a much larger issue than divide/square root implementation alone, and probably not entirely up to the floating-point designer. Nevertheless, if the option is available, there are several issues to consider. Obviously, there must be at least as many functional units as the maximum number of instructions issued per cycle. Also, allocating one instruction per cycle to divide/square root operations alone is a waste of resources. Finally, there is a cost to boosting the floating-point issue rate which is not readily quantified. Routing must be added to deliver operands and return results. The register file needs extra ports, which adds to its area and access time. Finally, there are the necessary changes to the dispatch logic, which affect the processor as a whole.

Performance impact. To explore the interaction of higher issue rates and divide/square root performance, we have run a series of experiments using the Givens rotation benchmark. For the reasons given above, the focus is restricted to independent add-multiply configurations with independent, subtractive divide/square root implementations. The variables are the instruction issue rate (single or dual) and the algorithm of the divide/square root implementation (radix-4 or radix-16). Performance figures for individual operations are the same as in the previous experiments.

Table 9: Effect of FP instruction issue rate on benchmark performance; execution time improvement [%] for an independent configuration

Algorithm      Issue Rate     Max    Min    Avg
radix-4 SRT    single issue   0.0    0.0    0.0
radix-16 SRT   single issue   26.9   0.0    7.0
radix-4 SRT    dual issue     50.0   5.5    35.4
radix-16 SRT   dual issue     52.1   22.3   44.4

From the results in Table 9, it is apparent that increasing the floating-point instruction issue rate leads to a genuine performance boost. This effect tends to overshadow the effects of divide/square root latency. The dual issue radix-4 case outperforms the single issue radix-16 one, even though the latter is faster on individual operations. Also, the difference between the dual issue radix-4 and radix-16 cases is much less significant than the contrast between cases with the same algorithm but differing issue rates. This is another instance, not unlike the contest between multiplicative and subtractive algorithms, where parallelism wins over latency. By keeping all of the units busier, especially the adder and multiplier, dual instruction issue softens the impact of division latency.

We have also performed a set of simulations to determine how dual instruction issue affects the performance balance between multiplicative and subtractive methods. As shown in Table 8, multiple instruction issue strengthens the performance advantage of subtractive implementations. This is what one would expect given the increased parallelism of operations afforded by a separate functional unit for divide/square root operations. These data support the claim that subtractive methods are better suited to superscalar implementations.

Table 8: Effect of algorithm choice on benchmark performance; execution time improvement [%] for an independent, dual issue configuration

Algorithm                 Issue Rate   Max    Min    Avg
8-bit seed Goldschmidt    dual issue   0.0    0.0    0.0
16-bit seed Goldschmidt   dual issue   11.4   2.3    6.2
radix-4 SRT               dual issue   37.2   4.5    20.1
radix-16 SRT              dual issue   46.7   14.1   28.9

Connectivity

In an FPU with a dual issue, independent add-multiply configuration, there are several ways to connect subtractive divide/square root hardware. One can package the divide/square root circuitry as an independent functional unit, as in the Sun UltraSPARC; share ports with the adder, as in the DEC 21164; or share ports with the multiplier, as in the MIPS R10000. Figure 6 illustrates the possible configurations. In every case, division and square root are computed concurrently with other FPU operations. However, certain configurations lead to contention for shared input and/or output ports.

Multiplicative divide/square root hardware is always integrated with the multiplier, so the connectivity issue is moot for those types of implementations. Sharing subtractive divide/square root unit ports with a multiply-accumulate unit is equivalent to a single issue FPU, a case which has already been covered.
[Figure 6: Dual issue, independent add-multiply configurations with (a) independent, (b) adder-coupled, and (c) multiplier-coupled divide/square root units. In (a) the divide/square root unit has its own register file ports; in (b) and (c) it shares ports with the adder or the multiplier, respectively.]
Area considerations. The primary cost of a separate, independent divide/square root unit is the extra routing required, including the extra buses and the multiplexing/selection logic. Compare the routing complexity of the FPU with the independently connected divide/square root unit in Figure 6(a) with the adder- and multiplier-coupled cases in Figures 6(b) and 6(c). When the divide/square root unit is coupled to one of the other functional units, the two sets of ports on the register file can be readily split between the two clusters of components. Connecting two sets of ports to three independent units is considerably more complicated. Similarly, independent divide/square root units complicate the instruction dispatch logic by increasing the number of possible paths and destinations for instructions.

Performance impact. On the face of it, sharing functional unit ports between divide/square root and other operations seems sensible, since addition and multiplication occur much more frequently. It is important, however, to make sure that collisions between divide/square root and the host operations are minimized. These simulations explore the impact of different interconnection schemes on benchmark performance. In every case, an independent add-multiply configuration and dual floating-point instruction issue have been assumed. The variables are the algorithm (radix-4 or radix-16) and the connectivity (independent, adder-coupled, or multiplier-coupled). In cases where functional units are shared, divide and square root operations have been pushed back one cycle in the event of a port conflict.

Table 10: Effect of divide/square root connectivity on benchmark performance; execution time improvement [%] for an independent configuration

Algorithm      Connectivity         Max    Min   Avg
radix-4 SRT    multiplier-coupled   0.0    0.0   0.0
radix-4 SRT    adder-coupled        4.4    1.4   2.8
radix-4 SRT    independent          4.4    1.4   2.8
radix-16 SRT   multiplier-coupled   12.1   0.1   3.8
radix-16 SRT   adder-coupled        26.3   2.3   10.6
radix-16 SRT   independent          26.3   2.3   10.6

The simulation results, given in Table 10, are somewhat counter-intuitive. First of all, for both algorithms, the independent and adder-coupled cases show the exact same performance. This is a feature of the Givens rotations algorithm, which has approximately twice as many multiplies as adds and, therefore, many free issue slots allocated to the adder. The adder-coupled case can take advantage of these openings, yielding the same performance as the independent case. The multiplier-coupled configuration is hampered by frequent collisions between multiply and divide/square root operations, but the effects are relatively mild, especially for the radix-4 cases. With the shorter latency of radix-16 operations, the collisions have comparatively greater impact on performance, but still less than 7% on average.

Although the Givens rotation benchmark favors the adder-coupled design, its near 2:1 ratio of multiply to add operations should not be taken as typical. In a survey of floating-point operations in the SPEC92fp benchmark suite, the multiply:add ratio is closer to 2:3 [11]. This supports the intuitive suggestion that many applications are dominated by addition rather than multiplication, and that multiplier-coupled divide/square root is therefore the most appropriate choice. In every other design decision covered so far, improving the performance of the benchmark is not in conflict with enhancing the speed of all programs, this case being an exception.

Replication

Several microprocessor manufacturers have recently attempted to boost numerical performance by replicating floating-point functional units. The IBM RS/6000 POWER2 and MIPS R8000 feature dual multiply-accumulate units, the MIPS R10000 couples its independent multiplier with two other units for divide and square root, and the HP PA8000 FPU consists of two multiply-accumulate units and two divide/square root units. This trend is facilitated by increasingly compact technology and higher maximum issue rates.

There are many different ways to replicate floating-point functionality and connect it with the rest of the system. For processors with FPUs organized as single, atomic pipelines or blocks, such as the PowerPC and IBM RS/6000 series, the most natural alternative is to replicate the entire unit. Decoupled superscalar designs like the Intel P6 and Sun SuperSPARC are more at liberty to add individual units, such as an extra divide/square root unit.
Performance impact. There are many potential ways to compare the effects of replicating floating-point hardware on divide/square root performance. We considered one example particularly worthy of examination, namely, the duplication of divide/square root hardware in a fully independent FPU. However, from a preliminary investigation, we concluded that the addition would have no effect on Givens rotation benchmark performance unless the method of scheduling operations was heavily modified. Even with a much more aggressive schedule, the expected effects would be insignificant. Addressing the performance impact of replicating floating-point functionality in general is beyond the scope of this investigation. However, since the Givens rotation benchmark represents an unusually high level of divide/square root utilization, replicating the hardware for these functions alone seems to offer little potential benefit.

Area considerations. It is a more efficient use of area to improve a slow divide/square root unit than to replicate it. For example, in a dual-issue independent FPU, upgrading from radix-4 to radix-16 gives a maximum improvement of 22.5% and an average improvement of 7.5% on the Givens rotation benchmark, while duplication of the radix-4 unit gives no appreciable performance benefit. The upgrade to radix-16 costs only 45% of the radix-4 area, as opposed to 100% for replication, not including all of the extra routing and any changes to the register file or instruction issue logic.
Conclusions

We have argued that floating-point division and square root are important operations which warrant fast, efficient implementations, and explained the most common algorithms and their corresponding hardware structures. Our discussion of implementations has focused on four design choices: algorithm, issue rate, connectivity, and replication. Table 11 summarizes the design decisions made in recent microprocessors.

Table 11: Summary of divide/square root design decisions for different architectures

Design                 Add-Multiply Config.   Algorithm        FP Issue Rate   Connectivity            Units
DEC 21164 Alpha AXP    independent            SRT              2 instr/cycle   adder-coupled           1
Hal Sparc64            multiply-acc.          SRT              2 instr/cycle   independent             1
HP PA7200              independent            SRT              2 instr/cycle   independent             1
HP PA8000              multiply-acc.          SRT              2 instr/cycle   multiply-acc.-coupled   2
IBM RS/6000 POWER2     multiply-acc.          Newton-Raphson   2 instr/cycle   integrated              2
Intel Pentium          independent            SRT              1 instr/cycle   adder-coupled           1
Intel Pentium Pro      independent            SRT              1 instr/cycle   independent             1
MIPS R8000             multiply-acc.          multiplicative   2 instr/cycle   integrated              2
MIPS R10000            independent            SRT              2 instr/cycle   multiplier-coupled      2
PowerPC 604            multiply-acc.          SRT              1 instr/cycle   integrated              1
PowerPC 620            multiply-acc.          SRT              1 instr/cycle   integrated              1
Sun SuperSPARC         independent            Goldschmidt      1 instr/cycle   multiplier-integrated   1
Sun UltraSPARC         independent            SRT              2 instr/cycle   independent             1

Subtractive divide/square root implementations offer latencies which are competitive with the fastest multiplicative ones, but with a considerably lower consumption of chip area. More significantly, by operating concurrently with the remainder of the FPU, subtractive hardware provides a potentially significant boost in performance, as evidenced by simulations on the Givens rotation benchmark. This performance advantage holds true for both multiply-accumulate and independent add-multiply configurations. Unlike multiplicative divide/square root hardware, which serializes computation, subtractive implementations are well suited to exploit decoupled superscalar microarchitectures and issue rates of multiple floating-point instructions per cycle.
For subtractive implementations, increasing the instruction issue rate from one to two instructions per cycle can
also dramatically improve performance, by over 35% on
average for the Givens rotation benchmark. As long as the
extra instruction is not squandered on divide/square root
operations alone, this seems like a worthwhile improvement, taking into account the cost of routing and upgrading
the register file and instruction issue logic.
The cost and complexity of adding a fully independent
divide/square root unit to a dual-issue, independent add-multiply configuration are considerable. Therefore, coupling this functionality to one of the existing structures by
sharing connections to the register file seems like a reasonable economizing measure. The worst case performance
impact seems modest compared with the possible explosion in routing costs, and the additional scheduling complications incurred. The balance of the evidence suggests
that the multiplier is the best candidate for sharing ports
with the divide/square root unit.
The merits of replicating floating point functional units
in general remain to be seen. Certainly, even with highly
parallel algorithms, one can expect significantly less than
twice the performance from a doubling of hardware. In
particular, duplicating divide/square root hardware alone
appears to hold few advantages, even for the Givens rotation benchmark with its heavy utilization of these operations.
Acknowledgements

This research is supported in part by the National Science Foundation under contract CCR-9257280. Peter Soderquist was supported by a Fellowship from the National Science Foundation. Miriam Leeser is supported in part by an NSF Young Investigator Award.

References

[1] B. K. Bose, L. Pei, G. S. Taylor, and D. A. Patterson. Fast multiply and divide for a VLSI floating-point unit. In Proceedings of the 8th IEEE Symposium on Computer Arithmetic, pages 87–94. IEEE, May 1987.

[2] Miloš D. Ercegovac, Tomás Lang, and Paolo Montuschi. Very high radix division with selection by rounding and prescaling. In Proceedings of the 11th IEEE Symposium on Computer Arithmetic, pages 112–119. IEEE, June 1993.

[3] Miloš D. Ercegovac and Tomás Lang. Division and Square Root: Digit Recurrence Algorithms and Implementations. Kluwer Academic Publishers; Norwell, MA, 1994.

[4] Gene H. Golub and Charles F. Van Loan. Matrix Computations. The Johns Hopkins University Press; Baltimore, second edition, 1989.

[5] John L. Hennessy and David A. Patterson. Computer Architecture: A Quantitative Approach. Morgan Kaufmann Publishers; San Mateo, CA, 1990. Appendix A: Computer Arithmetic, by David Goldberg.

[6] IEEE standard for binary floating-point arithmetic. New York, ANSI/IEEE Std. 754-1985, August 1985.

[7] W. Kahan. Using MathCAD 3.1 on a Mac, August 1994.

[8] Mathworks, Inc., Natick, MA. The Pentium Papers, November 1994. Extensive online documentation of the Pentium FDIV bug and related publicity. Available on the World Wide Web at http://www.mathworks.com/README.html.

[9] S. E. McQuillan and J. V. McCanny. A VLSI architecture for multiplication, division, and square root. In Proceedings of the 1991 International Conference on Acoustics, Speech and Signal Processing, pages 1205–1208. IEEE, May 1991.

[10] S. E. McQuillan, J. V. McCanny, and R. Hamill. New algorithms and VLSI architectures for SRT division and square root. In Proceedings of the 11th IEEE Symposium on Computer Arithmetic, pages 80–86. IEEE, June 1993.

[11] Stuart F. Oberman and Michael J. Flynn. Design issues in floating-point division. Technical Report CSL-TR-94-647, Stanford University Departments of Electrical Engineering and Computer Science, Stanford, CA, December 1994.

[12] Norman R. Scott. Computer Number Systems and Arithmetic. Prentice Hall; Englewood Cliffs, NJ, 1985.

[13] Peter Soderquist and Miriam Leeser. An area/performance comparison of subtractive and multiplicative divide/square root implementations. In Proceedings of the 12th IEEE Symposium on Computer Arithmetic, pages 132–139. IEEE, July 1995.

[14] Peter Soderquist and Miriam Leeser. Area and performance tradeoffs in floating-point division and square root implementations. ACM Computing Surveys, 28(3):518–564, September 1996.

[15] Ted E. Williams. A zero-overhead self-timed 160-ns 54-b CMOS divider. IEEE Journal of Solid-State Circuits, 26(11):1651–1661, November 1991.
[Figure 7: Triangularization of a 4-by-3 matrix using Givens rotations. A sequence of rotations applied to pairs of rows introduces zeros below the diagonal one element at a time, reducing the matrix A to the upper triangular matrix R.]
Sidebar: Givens Rotations and FPU-Level Simulation
Many engineering and scientific applications use matrices and numerical analysis methods to model and simulate systems. These matrices must frequently take on a specific shape, such as diagonal, upper triangular, or lower triangular, assumed by a particular algorithm. Givens rotation [4] provides a tool for shaping arbitrary matrices into these and other more complicated forms as needed.

For any two scalars a and b, Givens rotation performs the operation

    [ c  -s ]T [ a ]   [ r ]
    [ s   c ]  [ b ] = [ 0 ]        (1)

where r = sqrt(a^2 + b^2). The rotation coefficients c and s can be computed using the function in Figure 8.
function: [c, s] = givens(a, b)
  if b = 0
    c = 1; s = 0
  else
    if |b| > |a|
      t = a/b; s = 1/sqrt(1 + t^2); c = s*t
    else
      t = b/a; c = 1/sqrt(1 + t^2); s = c*t
    end
  end
end givens

Figure 8: Function for computing Givens rotation coefficients
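A direct C transcription of the function in Figure 8, together with the application of Equation 1 to one pair of elements, looks as follows. The helper names are ours, and this is only an arithmetic illustration; the simulator described in this sidebar counts machine cycles and performs none of this arithmetic.

#include <math.h>
#include <stdio.h>

/* Compute c and s such that [c -s; s c]^T [a b]^T = [r 0]^T. */
static void givens(double a, double b, double *c, double *s)
{
    if (b == 0.0) {
        *c = 1.0; *s = 0.0;
    } else if (fabs(b) > fabs(a)) {
        double t = a / b;               /* one divide ...              */
        *s = 1.0 / sqrt(1.0 + t * t);   /* ... and one square root     */
        *c = *s * t;
    } else {
        double t = b / a;
        *c = 1.0 / sqrt(1.0 + t * t);
        *s = *c * t;
    }
}

/* Apply the rotation of Equation 1 to one vertical pair of elements. */
static void rotate_pair(double *x, double *y, double c, double s)
{
    double xr =  c * *x + s * *y;       /* only multiplies and adds,   */
    double yr = -s * *x + c * *y;       /* hence the add/multiply bias */
    *x = xr;                            /* of the benchmark overall    */
    *y = yr;
}

int main(void)
{
    double c, s, a = 3.0, b = 4.0;
    givens(a, b, &c, &s);
    rotate_pair(&a, &b, c, s);          /* a becomes 5, b becomes 0    */
    printf("c=%g s=%g a=%g b=%g\n", c, s, a, b);
    return 0;
}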
Let [a b]^T be a submatrix of A, an m-by-n matrix representing the coefficients of a system model. Let the rotation be applied to all vertical pairs of elements in the rows of A containing a and b, in the manner of Equation 1. The system of equations represented by A remains the same, but now there is a zero in the position of b. Givens rotations can therefore be used to shape matrices by introducing zeros. Figure 7 shows how to shape an arbitrary 4-by-3 matrix into upper triangular form. All that is needed is one computation of coefficients as in Figure 8 and repeated application of Equation 1 for each new zero produced. Note that despite the centrality of divide and square root operations, Givens rotation typically involves computing many matrix-vector products and therefore a predominance of additions and multiplications.

The FPU-level simulator used in this study models the transformation of an m-by-n matrix into upper triangular form using Givens rotations. It accepts as input the dimensions of the matrix and a function describing the structure and performance of a given floating-point unit configuration. The computed result is the simulated performance of the particular FPU. Note that the simulator does not actually compute the values needed for triangularization, but rather the performance of different floating-point configurations in machine cycles.