Floating-Point Division and Square Root: Choosing the Right Implementation

Peter Soderquist, School of Electrical Engineering, Cornell University, Ithaca, New York 14853. E-mail: [email protected]

Miriam Leeser, Department of Electrical and Computer Engineering, Northeastern University, Boston, Massachusetts 02115. E-mail: [email protected]

To appear in IEEE Micro. Copyright 1996 IEEE.

Floating-point support has become a mandatory feature of new microprocessors. Over the past few years, the leading architectures have seen several generations of floating-point units (FPUs). While addition and multiplication implementations have become increasingly efficient, division and square root support has remained uneven. There is considerable variation in the types of algorithms employed, as well as in the quality and performance of the implementations. This situation originates in skepticism about the importance of division and square root, and in an insufficient understanding of the design alternatives. The purpose of this paper is to clarify and evaluate the implementation tradeoffs at the FPU level, thus enabling designers to make informed decisions.

Importance of division and square root

Division and square root have long been considered minor, bothersome members of the floating-point family. Microprocessor designers frequently perceive them as infrequent, low-priority operations, barely worth the trouble of implementing; design effort and chip resources are allocated accordingly. The survey of microprocessor FPU performance in Table 1 shows some of the uneven results of this philosophy. While multiplication requires from 2 to 5 machine cycles, division latencies range from 9 to 60.
The variation is even greater for square root, which is not supported in hardware in several cases. This data hints at, but mostly conceals, the significant variation in algorithms and topologies among the different implementations.

Table 1: Arithmetic performance of recent microprocessor FPUs for double-precision operands, in machine cycles, given as latency/throughput (* = inferred from available information; † = not supported in hardware)

Design                  Cycle Time [ns]   Add   Multiply   Divide         Square Root
DEC 21164 Alpha AXP     3.33              4/1   4/1        22-60/22-60*   †
Hal Sparc64             6.49              4/1   4/1        8-9/7-8        †
HP PA7200               7.14              2/1   2/1        15/15          15/15
HP PA8000               5                 3/1   3/1        31/31          31/31
IBM RS/6000 POWER2      13.99             2/1   2/1        16-19/15-18*   25/24*
Intel Pentium           6.02              3/1   3/2        39/39          70/70
Intel Pentium Pro       7.52              3/1   5/2        30*/30*        53*/53*
MIPS R8000              13.33             4/1   4/1        20/17          23/20
MIPS R10000             3.64              2/1   2/1        18/18          32/32
PowerPC 604             10                3/1   3/1        31/31          †
PowerPC 620             7.5               3/1   3/1        18/18          22/22
Sun SuperSPARC          16.67             3/1   3/1        9/7            12/10
Sun UltraSPARC          4                 3/1   3/1        22/22          22/22

The error in the Intel Pentium floating-point unit, and the accompanying publicity and $475 million write-off, illustrate some of the hazards of an incorrect division implementation [8]. But correctness is not enough; low performance causes enough problems of its own. Even though divide and square root are relatively infrequent operations in most applications, they are indispensable, particularly in many scientific programs. Compiler optimizations tend to increase the frequency of these operations [11]; poor implementations disproportionately penalize code which uses them at all [10]. Furthermore, as the latency gap grows between addition and multiplication on the one hand and divide/square root on the other, the latter increasingly become performance bottlenecks [14]. Programmers have attempted to get around this problem by rewriting algorithms to avoid divide/square root operations, but the resulting code generally suffers from poor numerical properties, such as instability or overflow [9, 14, 7]. In short, division and square root are natural components of many algorithms, which are best served by implementations with good performance.

Quantifying what constitutes "good performance" is challenging. One rule of thumb, for example, states that the latency of division should be three times that of multiplication; this figure is based on division frequencies in a selection of typical scientific applications [1]. Even if one accepts this doctrine at face value, implementing division and square root involves much more than relative latencies. Area, throughput, complexity, and the interaction with other operations must be considered as well. This article explores the various tradeoffs involved and illuminates the consequences of different design choices.

Add-multiply configurations

Add-multiply circuits are the foundation of any floating-point unit. Addition (which subsumes subtraction) and multiplication are not only the most frequently occurring arithmetic operations, but together they are sufficient to support all other operations required by the IEEE 754 floating-point standard [6]. All other functions, including division and square root, may be regarded as additions to or enhancements of the add-multiply hardware. For these reasons, the implementation of addition and multiplication largely determines the overall performance of a floating-point unit.

The add-multiply configurations in the latest generation of microprocessors fall into two basic categories. The first variety, which we have termed the independent configuration, is illustrated in Figure 1. It features add and multiply units which operate independently, reading operands from the register file and returning correctly rounded results. Although the illustration shows sharing of inputs and outputs between functional units, some implementations give each unit individual access to the register file. The accompanying table shows the typically high performance figures for this type of configuration. Frequently, the adder is actually a floating-point ALU, performing comparisons, complementation, and format conversions as well as addition and subtraction. Processors which feature independent add-multiply configurations include the DEC 21164, HP PA7200, Intel Pentium Pro, and Sun UltraSPARC.

The second type of add-multiply structure, known as the multiply-accumulate configuration, is shown in Figure 2, along with typical performance figures. It combines both functions into a single atomic operation on three operands, with the multiplication of two operands followed by addition of the product and the third operand. Typically, the product is maintained at a very high precision and added without intermediate rounding.
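The practical difference between a multiply-accumulate (fused multiply-add) operation and separate multiply and add steps is the absence of an intermediate rounding of the product. The following C fragment is a minimal sketch of that distinction, using the standard C99 fma() routine as a stand-in for a hardware multiply-accumulate unit; it illustrates the numerical behavior only, not any particular processor's implementation, and the operand values are chosen purely for demonstration.

    #include <math.h>
    #include <stdio.h>

    int main(void) {
        double a = 1.0 + 0x1p-27;        /* operands chosen so the exact product      */
        double b = 1.0 - 0x1p-27;        /* 1 - 2^-54 cannot be held in a double      */
        double c = -1.0;

        /* product rounded to double first, then added; note that some compilers
         * contract this into an fma unless contraction is disabled                   */
        double separate = a * b + c;

        /* product kept at full precision before the add                              */
        double fused = fma(a, b, c);

        printf("separate: %a\nfused:    %a\n", separate, fused);
        return 0;
    }

With a true fused multiply-add, the two results differ: the separate form returns 0.0 because the product rounds up to 1.0, while the fused form preserves the residual -2^-54 by adding c to the unrounded product.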
The multiply-accumulate configuration is motivated by the fact that a large proportion of scientific computations, such as series and matrix operations, involve many instances of multiplication followed immediately by addition. The IBM RS/6000 series, the MIPS R8000, the HP PA8000, and the Hal Sparc64 utilize add-multiply structures of this type.

[Figure 1: Typical independent add-multiply configuration. Separate add and multiply units share access to the register file; typical latency is 2 cycles and throughput is 1 result per cycle for both addition and multiplication.]

[Figure 2: Typical multiply-accumulate configuration. A single multiply-accumulate unit is connected to the register file; typical latency is 2 cycles and throughput is 1 result per cycle.]

Design issues

Given the add-multiply infrastructure, there are many ways to incorporate division and square root functionality into an FPU. The following is a brief introduction to the major design decisions, which will be explored in greater detail in subsequent sections. The primary issues may be split into the following categories:

Algorithm. The choice of divide/square root algorithm is by far the most important design decision. It involves a series of smaller decisions, and once made, largely determines the subsequent range of choices and the area and performance of the implementation in general.

Issue rate. Although some chips execute floating-point instructions strictly in series, many of the most recent microprocessors can issue up to two floating-point instructions per cycle. Therefore, the next issue worthy of attention is the impact of floating-point issue rate on the efficiency of divide/square root implementations.

Connectivity. Closely related to issue rate is the manner in which divide/square root hardware is interfaced with the rest of the FPU, either with independent access to the register file, or coupled to or integrated with one of the other functional units.

Replication. Finally, it is becoming not uncommon for microprocessors to duplicate floating-point functional units, including the divide/square root hardware. The performance and area effects of such replication deserve closer consideration.

These issues are not independent but interact with each other in complex and non-obvious ways. We will explain the connections between them while trying to isolate and quantify the effects of the tradeoffs as much as possible.

Simulation

To provide a uniform, quantitative foundation for the analysis of the different design tradeoffs, we have implemented an FPU-level simulator which executes a benchmark based on Givens rotations (see the sidebar at the end of this paper). Given an FPU configuration, including its structure and performance on individual operations, the simulator calculates the number of cycles required to transform an arbitrary matrix into triangular form using Givens rotations. The simulations use a standard set of test matrices ranging from 10-by-10 elements to 200-by-100 elements in size [14].

The Givens rotation algorithm has been chosen because it is rich in divide and square root operations, and because it is a real program with important numerical applications. The primary task of many scientific programs is the solution of partial differential equations. Givens rotations are a central component of several frequently used solution techniques. They also feature in signal processing algorithms, and the rotation and projection algorithms common in graphics and solid modeling use similar combinations of operations. The Givens rotation algorithm is not "typical" in its use of divide/square root operations. In fact, there are probably few other real applications whose performance is so dependent on the efficiency of division and square root. Nevertheless, what it lacks in representativeness, it makes up for in importance. Finally, as a kind of divide/square root "torture test," Givens rotations clearly illustrate the effects of different implementation choices.

Divide/square root algorithms

There are many possible algorithms for computing division and square root, but only a subset are practical for microprocessor implementation at the present time. These fall into the two categories of multiplicative and subtractive algorithms [13]. Multiplicative methods use hardware integrated with the floating-point multiplier and have low to moderate latencies, while subtractive methods generally employ separate circuitry and have low to high latencies.

We will consider only the case where division and square root are implemented in hardware using the same algorithm. The arguments for this are covered in more detail in [14]. In summary, division and square root tend to have very similar implementations and allow for extensive hardware sharing. Furthermore, this is the most common case in actual practice.

Multiplicative techniques. Multiplicative algorithms compute an estimate of the quotient or square root using iterative refinement of an initial guess. They reduce divide and square root operations to a series of multiplications, subtractions, and bit shifts. In addition, multiplicative algorithms typically converge at a quadratic rate, which means the number of accurate digits in the estimate doubles at each iteration. The two techniques used in recent microprocessors are the Newton-Raphson method and Goldschmidt's algorithm. With many features in common, they differ primarily in the ordering of operations, which affects their mapping onto pipelined hardware.

The Newton-Raphson method. An algorithm with a long history, the Newton-Raphson method [12, 5] has been implemented in many machines, including the IBM RS/6000 series. To find a/b, set the seed value x_0 to an estimate of 1/b and iterate over

    x_{i+1} = x_i (2 - b x_i)

until x_{i+1} is within the desired precision of 1/b. Then a x_{i+1} ≈ a/b. For square root computation, set the seed to approximate 1/√a and refine it with the iteration

    x_{i+1} = (1/2) x_i (3 - a x_i^2).

The product a x_{i+1} yields an approximation to √a.
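As a concrete illustration of the iterations just described, the following C sketch computes a/b and √a in double precision. It is a minimal software model under simplifying assumptions: the seed values (0.333 and 0.707) and the fixed iteration count stand in for a hardware lookup table and convergence logic, and no final rounding correction is applied. It illustrates the arithmetic, not any particular processor's divide unit.

    #include <stdio.h>

    /* Newton-Raphson reciprocal: x_{i+1} = x_i * (2 - b*x_i), x_0 seeded near 1/b. */
    static double nr_divide(double a, double b, double seed, int iters) {
        double x = seed;                       /* x approximates 1/b                 */
        for (int i = 0; i < iters; i++)
            x = x * (2.0 - b * x);             /* quadratic convergence toward 1/b   */
        return a * x;                          /* a/b = a * (1/b)                    */
    }

    /* Newton-Raphson reciprocal square root: x_{i+1} = 0.5 * x_i * (3 - a*x_i^2).  */
    static double nr_sqrt(double a, double seed, int iters) {
        double x = seed;                       /* x approximates 1/sqrt(a)           */
        for (int i = 0; i < iters; i++)
            x = 0.5 * x * (3.0 - a * x * x);
        return a * x;                          /* a * (1/sqrt(a)) = sqrt(a)          */
    }

    int main(void) {
        /* Seeds accurate to roughly 8 bits, standing in for a small lookup table.   */
        printf("7/3     ~ %.17g\n", nr_divide(7.0, 3.0, 0.333, 3));
        printf("sqrt(2) ~ %.17g\n", nr_sqrt(2.0, 0.707, 3));
        return 0;
    }

Note that the result is accurate but not guaranteed to be correctly rounded in the last bit; this is exactly the last-digit rounding issue discussed under the implementation options below.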
Goldschmidt's algorithm. Recent implementations of Goldschmidt's algorithm [5] have included the Sun SuperSPARC and arithmetic coprocessors from Texas Instruments. To compute a/b = x_0/y_0, iteratively calculate

    x_{i+1} = x_i r_i   and   y_{i+1} = y_i r_i,   where r_i = 2 - y_i;

then y_i → 1 and x_i → a/b. Both x_0 and y_0 should be prescaled by a seed value approximating 1/b to insure rapid convergence. To compute √a, set x_0 = y_0 = a, prescale by an estimate of 1/√a, and iterate over

    x_{i+1} = x_i r_i^2   and   y_{i+1} = y_i r_i,   where r_i = (3 - x_i)/2,

such that

    x_{i+1}/y_{i+1}^2 = x_i/y_i^2 = 1/a.

Then x_i → 1, and hence y_i → √a.
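The following C sketch is a minimal software model of the Goldschmidt division iteration, under the same simplifying assumptions as the Newton-Raphson sketch above (an illustrative seed value, a fixed iteration count, and no final rounding correction). The point to notice is that the two multiplications in each iteration, x_i*r_i and y_i*r_i, are independent of one another, which is what allows a pipelined multiplier to overlap them.

    #include <stdio.h>

    /* Goldschmidt division: x -> a/b and y -> 1, with r = 2 - y each iteration.    */
    static double goldschmidt_divide(double a, double b, double seed, int iters) {
        double x = a * seed;                   /* prescale numerator by seed ~ 1/b   */
        double y = b * seed;                   /* prescale denominator; y starts near 1 */
        for (int i = 0; i < iters; i++) {
            double r = 2.0 - y;
            x = x * r;                         /* these two products are independent, */
            y = y * r;                         /* so a pipelined multiplier can overlap them */
        }
        return x;                              /* y has converged to 1, so x ~ a/b   */
    }

    int main(void) {
        printf("7/3 ~ %.17g\n", goldschmidt_divide(7.0, 3.0, 0.333, 3));
        return 0;
    }

By contrast, each Newton-Raphson step needs the result of the previous multiplication before the next one can begin, which is the dependency referred to below.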
Implementations. Multiplicative divide/square root implementations generally take the form of modifications to the floating-point multiplier. This is because divide and square root functionality alone do not justify the area required for a second multiplier. The designers of the IBM RS/6000 floating-point unit have chosen the Newton-Raphson method as being best suited to the multiply-accumulate structure. In independent add-multiply configurations, Goldschmidt's algorithm is a natural choice for pipelined multipliers. Since the numerator and denominator operations are independent, clever scheduling can allow the multiplier to run at maximum efficiency. The Newton-Raphson method suffers from dependencies between subsequent operations, which hobbles pipelining.

[Figure 3: A floating-point multiplier enhanced for multiplicative divide/square root computation. Operand registers feed the multiplier array through multiplexers; a seed lookup table, temporary and pipeline registers, a subtracter/shifter, and a rounder/normalizer surround the array, with the new divide/square root components and routing shown with dashed lines.]

The block diagram in Figure 3 shows one possible implementation, namely an independent floating-point multiplier modified for Goldschmidt's algorithm. The new components and routing needed for divide/square root support are shown with dashed lines. Although the details will vary from case to case, most multiplicative implementations feature one or more of the following hardware enhancements:

Extra routing and storage. These are required to make divide/square root into atomic operations independent of the register file.

Seed value lookup tables. The execution time of the algorithm is directly related to the accuracy of the initial guess. An accurate double-precision value can be produced in 3 iterations with an 8-bit seed, while a 16-bit seed can bring the number down to 2. Unfortunately, the lookup table will have to store 512 times as many bits in the latter case [14].

Constant subtraction/shifting logic. For independent multipliers, supporting the non-multiply operations of the algorithm directly improves performance and maintains independence from the floating-point adder. Multiply-accumulate structures have these capabilities built in.

Last-digit rounding support. Values derived from reciprocal estimates, as in the Newton-Raphson method and Goldschmidt's algorithm, have an inherent error which can lead to inaccuracies in the last result digit. There are several strategies for solving this nontrivial problem, with associated area and performance tradeoffs [14]. Multiply-accumulate units can avoid extra hardware at the expense of extra operations.

Multiplicative divide/square root implementations can incur a penalty beyond the area needed for extra logic, storage, and routing. Since some of the required modifications lie on the critical path of the multiplier, they can negatively impact the latency of multiplication itself.

Subtractive techniques. The subtractive divide/square root algorithms employed in current microprocessors can generally be classified as SRT methods, named for D. Sweeny, J.E. Robertson, and K.D. Tocher, who independently developed very similar algorithms for division [12]. Subtractive division and square root compute the quotient q or square root s directly, one digit at a time. The conventional procedure for long division is an algorithm of this type. In practice, subtractive algorithms typically use redundant representations of values and treat groups of consecutive bits as higher-radix digits to enhance performance [3]. For example, a radix-4 divider scans successive pairs of bits as single digits, interpreting two radix-2 values as a single radix-4 one.

For division, let q[j] be the partial quotient at step j (where q[n] = q), and w[j] the residual or partial remainder. The goal of the algorithm is to find the sequence of quotient digits which minimizes the residual at each step. To compute q = x/d for n-digit, radix-r values, set w[0] = x and evaluate the recurrence

    w[j+1] = r w[j] - d q_{j+1},

where q_{j+1} is the (j+1)st quotient digit. For square root computation, let the jth partial root be denoted by s[j]. To find s = s[n] = √x, set w[0] = x and evaluate

    w[j+1] = r w[j] - 2 s[j] s_{j+1} - s_{j+1}^2 r^{-(j+1)}.

Define f[j] = 2 s[j] + s_{j+1} r^{-(j+1)}; then

    w[j+1] = r w[j] - f[j] s_{j+1}

has the same form as the division recurrence. In practice, f[j] is simple to generate, which facilitates combined division and square root implementations.
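The following C sketch is a minimal, non-redundant radix-4 instance of the division recurrence above, computing one base-4 quotient digit per step. It is an illustration of the recurrence only: a real SRT unit differs in that it selects digits from the signed set {-2, ..., 2} by table lookup on a few truncated bits of the residual and divisor, and keeps the residual in carry-save form.

    #include <stdio.h>
    #include <stdint.h>

    /* Radix-4 digit recurrence: computes ndig base-4 digits of x/d (with 0 <= x < d)
     * using w[j+1] = 4*w[j] - d*q[j+1]. Digit selection here is the simple,
     * non-redundant choice q = floor(4*w/d) in {0..3}.                              */
    static void radix4_divide(uint64_t x, uint64_t d, int ndig, int q[]) {
        uint64_t w = x;                        /* residual, rescaled by 4 each step  */
        for (int j = 0; j < ndig; j++) {
            w *= 4;                            /* r * w[j]                           */
            int qd = (int)(w / d);             /* digit selection (exact here)       */
            q[j] = qd;
            w -= (uint64_t)qd * d;             /* w[j+1] = r*w[j] - d*q[j+1]         */
        }
    }

    int main(void) {
        int q[8];
        radix4_divide(2, 7, 8, q);             /* 2/7 = 0.102102... in base 4        */
        printf("2/7 in radix 4: 0.");
        for (int j = 0; j < 8; j++) printf("%d", q[j]);
        printf("\n");
        return 0;
    }

The redundant digit set is what allows an actual SRT unit to select each digit from truncated, carry-save values without the full-width comparison performed here.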
Implementations. Subtractive divide/square root implementations generally rely on dedicated hardware which executes in parallel with the remainder of the FPU, as shown in Figure 4. A minority of designs, such as the IBM/Motorola PowerPC line and MIPS R4000 series, use enhancements of the addition hardware. However, a separate functional unit offers superior performance, since the multiplication and addition hardware can continue to operate on independent instructions while the quotient or root is being computed. If divide and square root functionality is supported by, for example, the multiplication hardware, then these operations, which typically have long latencies, tie up the unit, holding up subsequent multiply instructions until the quotient or root is available.

[Figure 4: FPU topologies with subtractive divide/square root functional units and (a) independent add-multiply or (b) multiply-accumulate configurations. In both cases the divide/square root unit sits alongside the other functional units and shares the register file.]

Figure 5 shows a basic radix-4 divide/square root unit, including most of the features critical to its high performance. The residual is stored in redundant form as vectors of sum and carry bits; this enables the use of a low-latency carry-save adder to calculate the subtraction in the recurrence. The result-digit selection table returns the next quotient/square root digit on the basis of the residual and divisor/partial root values; the redundant result digit set allows truncated values to be used, keeping the table small and fast. Factor generation logic keeps all possible multiples of d or f[j] and every result digit (q_j or s_j in {-2, -1, 0, 1, 2}) available at all times for immediate multiplexer selection. Finally, the on-the-fly conversion logic [3], which operates concurrently with quotient computation and off of the critical path, maintains updated values of s[j] and f[j] and incrementally converts the partial result from redundant into conventional representation.

[Figure 5: Radix-4 SRT divide/square root unit. Residual sum and carry registers feed a carry-save adder; result-digit selection examines truncated residual and divisor/partial-root values to produce q_{j+1} or s_{j+1}; factor generation supplies -d q_{j+1} or -f[j] s_{j+1}; on-the-fly conversion assembles q = x/d or s = √x from the redundant digits.]

While radix-2 divide/square root units operate on one bit at a time, radix-4 implementations retire 2 bits of the result at every iteration. Higher-radix units process even larger groups of bits at once. Unfortunately, for radices greater than 8, the latency and area requirements of result-digit selection and factor generation become prohibitive. To circumvent these effects, lower-radix stages can be combined into a single higher-radix unit. Radix-16 division, for example, can be obtained by overlapping two radix-4 dividers, with only a modest increase in area and iteration delay over radix-4 alone [3]. There are methods for achieving even higher radix operations, the majority of which are restricted to theoretical treatment or experimental implementations. Several of these approaches are summarized and compared by Ercegovac, Lang, and Montuschi [2].

One way to improve the performance of any subtractive implementation is to boost the number of iterations per machine cycle, if possible. In the HP PA7200, for example, the low iteration delay of the divide/square root unit means that it can be cycled at twice the system clock rate. Another technique which offers extremely low latencies is self-timing, whereby a functional unit operates asynchronously with respect to the rest of the FPU, completing each iteration at the highest rate permitted by its internal logic [15]. The Hal Sparc64 implements a version of self-timed division.

Performance impact. To determine how the algorithms compare in terms of performance, we have matched each add-multiply configuration with a selected set of divide/square root implementations and executed the Givens rotation benchmark on the standard matrix data set. For each configuration (independent and multiply-accumulate) we have assumed a floating-point issue rate of one instruction per cycle. FPU structures and performance figures for individual operations are closely modeled on real examples. The selected divide/square root implementations are as follows:

8-bit seed multiplicative. This is a baseline multiplicative implementation with an 8-bit seed lookup table, as used in many actual chips. The independent configuration uses Goldschmidt's algorithm, while the multiply-accumulate structure implements the Newton-Raphson method.

16-bit seed multiplicative. As before, the algorithm is matched to the add-multiply configuration; the larger lookup table reduces the latency of divide/square root computation.

Radix-4 SRT. The subtractive equivalent of the 8-bit seed case, this is a basic, practical implementation using a separate functional unit.

Radix-16 SRT. Overlapping radix-4 units provide lower-latency operations.

The first set of experiments matches the independent configuration of Figure 1, which is modeled on the HP PA7200, with the divide/square root implementations indicated above. Division and square root performance for each case is given in Table 2. The radix-4 case uses the actual latency figures from the HP PA7200; this implementation computes 4 quotient digits per machine cycle by clocking the divide/square root unit at double speed. For radix-16 operations, the figures assume that the slightly longer cycle time of the overlapped implementation can be accommodated without difficulty. The Goldschmidt latencies are based on the Texas Instruments implementations, parameterized by the latency of multiplication and the accuracy of the seed value.

Table 2: Divide/square root performance of independent implementations, latency in cycles

Algorithm                  Divide   Square Root
8-bit seed Goldschmidt     9        13
16-bit seed Goldschmidt    7        10
radix-4 SRT                15       15
radix-16 SRT               8        8

The relative performance of the different implementations on the Givens rotation benchmark is summarized in Table 3, normalized to the 8-bit seed Goldschmidt case. The subtractive implementations clearly dominate the multiplicative ones, even in cases where the latter have superior latencies. It is the ability to execute in parallel with the rest of the FPU which gives the radix-4 and radix-16 units their decisive performance advantage.

Table 3: Improvement in execution time [%] on the Givens rotation benchmark, by implementation, for the independent configuration

Algorithm                  Max    Min   Avg
8-bit seed Goldschmidt     0.0    0.0   0.0
16-bit seed Goldschmidt    9.9    1.6   5.0
radix-4 SRT                20.9   7.2   15.2
radix-16 SRT               46.0   7.2   23.4

The second set of experiments pairs the multiply-accumulate structure shown in Figure 2 with the standard set of divide/square root implementations. This configuration and the first divide/square root implementation in Table 4 are based on the IBM RS/6000 series FPU, which uses specially adapted versions of the Newton-Raphson method. The 16-bit seed version uses these same algorithms but assumes a reduced number of iterations. As for the subtractive implementations, they have simply been carried over from the HP PA7200 case. Not only does this make for a more uniform comparison, but it seems like a feasible implementation, since the RS/6000 actually has a longer cycle time than the PA7200 while using a comparable fabrication technology.

Table 4: Divide/square root performance of multiply-accumulate implementations, latency in cycles

Algorithm                     Divide   Square Root
8-bit seed Newton-Raphson     19       22
16-bit seed Newton-Raphson    14       17
radix-4 SRT                   15       15
radix-16 SRT                  8        8

The results, shown in Table 5, display the same pattern as for the independent add-multiply case, albeit with even greater contrast. The longer latencies of the Newton-Raphson operations mean that the subtractive implementations perform even better by comparison. Note also the significant improvement afforded by the radix-16 implementation over the radix-4 case.

Table 5: Improvement in execution time [%] on the Givens rotation benchmark, by implementation, for the multiply-accumulate configuration

Algorithm                     Max     Min    Avg
8-bit seed Newton-Raphson     0.0     0.0    0.0
16-bit seed Newton-Raphson    19.7    4.9    11.6
radix-4 SRT                   69.4    22.4   48.3
radix-16 SRT                  125.7   22.5   68.8

Area considerations. There are vast differences in the implementation styles of multiplicative and subtractive methods. The former are primarily enhancements of existing circuitry, while the latter inhabit completely new hardware; furthermore, the core operations are entirely different. In the interest of an objective evaluation, it is essential to compare the circuit area required by these very distinct approaches. Estimating area in a way that yields a fair comparison between chips is difficult because of basic differences in the implementation technologies. Nevertheless, it is possible to give some basis for evaluating the different implementations.

Table 6 compares the size of the hardware dedicated to division in the Weitek 3364 and Texas Instruments 8847 arithmetic coprocessors; the figures are based on measurements of microphotographs [5]. The chips have similar die sizes and device densities, and we can assume the feature sizes are comparable as well. Although the multiplication algorithms are different, both have two-pass arrays which take up approximately 22% of the chip area. In short, apart from their divide/square root implementations, these two chips have a lot in common. However, the area devoted to division hardware is little more than 6% of the chip size in either case, and the relative area requirements differ by 1.5%. These figures represent only two particular designs whose implementation technology is by now somewhat out of date. However, they suggest that 8-bit seed Goldschmidt and radix-4 SRT implementations are both economical, and that the area differences between them can be kept small.

Table 6: Area comparison of two divide/square root implementations

Algorithm                Device        Chip Area [mm^2]   Transistor Count   Div./Sqrt. Area [mm^2]
radix-4 SRT              Weitek 3364   95.2               165,000            4.8
8-bit seed Goldschmidt   TI 8847       100.8              180,000            6.3

An alternative approach uses standard-cell technology to estimate the areas of different implementations, including the more sophisticated options [14]. Table 7 shows the results of this study, with the values normalized to the size of the 8-bit seed Goldschmidt variant. According to these results, a radix-16 SRT unit need only be 45% larger than a radix-4 design in the same technology. A 16-bit seed Goldschmidt implementation, by contrast, could be over 20 times larger than the 8-bit seed implementation, and 10 times larger than the radix-16 SRT unit. This calls into question the practicality of such a solution, especially since it leads to a relatively small improvement in latency, as evidenced by Tables 2 and 4.

Table 7: Relative cost of different divide/square root implementations

Algorithm                    Area Factor
8-bit seed multiplicative    1.0
16-bit seed multiplicative   22
radix-4 SRT                  1.5
radix-16 SRT                 2.2

Summary

Multiplicative implementations can provide low latency and lower cost than subtractive ones. However, their overall performance on our benchmark is inferior, mainly because the multiplication hardware is responsible for additional non-pipelined, multicycle operations. In the case of the multiply-accumulate configuration, addition, multiplication, division, and square root must all be accommodated by the same pipeline. Although 8-bit seed implementations tend to require less area than radix-4 hardware, 16-bit lookup tables dwarf even radix-16 implementations. Furthermore, the incredibly expensive upgrade from an 8-bit seed table to a 16-bit one offers only a modest reduction in divide/square root latency, and a downright meager improvement in benchmark execution time.

Subtractive divide/square root implementations provide superior benchmark performance at a reasonable cost, for both independent and multiply-accumulate configurations. This is primarily due to the fact that subtractive hardware operates independently and does not tie up other FPU resources. The parallel operation also means that improvements in divide/square root latency have a more decisive impact than in multiplicative implementations. Upgrading from radix-4 to radix-16 can improve performance by as much as 21% for the independent case or 56% for the multiply-accumulate structure, at a cost of 33% more area. Subtractive implementations also fit in nicely with the current trend toward decoupled superscalar processors with high instruction issue rates and growing transistor budgets. We conclude that implementations of subtractive divide/square root algorithms are the most sensible choice, and will focus on them for the remainder of the discussion.
Issue rate

The simulations in the previous section assumed an issue rate of one floating-point instruction per machine cycle. This situation holds in such processors as the PowerPC series, where the floating-point unit is a single pipeline, and the Intel P6, where the arithmetic functional units share a single set of inputs and outputs. However, many recent machines, including the MIPS R10000 and Sun UltraSPARC, can issue up to two floating-point instructions at once. Dual issue only makes sense when there are at least two independent functional units. Furthermore, if one of the two functional units is a dedicated divide/square root unit, dual issue is not worthwhile, since these instructions are not frequent enough to keep a separate unit busy. By this reasoning, dual floating-point instruction issue is appropriate for an FPU with an independent adder, multiplier, and divide/square root unit, but not for one with a multiply-accumulate structure and separate divide/square root unit.

Area considerations. The choice of how many floating-point instructions to dispatch per cycle is a much larger issue than divide/square root implementation alone, and probably not entirely up to the floating-point designer. Nevertheless, if the option is available, there are several issues to consider. Obviously, there must be at least as many functional units as the maximum number of instructions issued per cycle. Also, allocating one instruction per cycle to divide/square root operations alone is a waste of resources. Finally, there is a cost to boosting the floating-point issue rate which is not readily quantified. Routing must be added to deliver operands and return results. The register file needs extra ports, which adds to its area and access time. And there are the necessary changes to the dispatch logic, which affect the processor as a whole.

Performance impact. To explore the interaction of higher issue rates and divide/square root performance, we have run a series of experiments using the Givens rotation benchmark. For the reasons given above, the focus is restricted to independent add-multiply configurations with independent, subtractive divide/square root implementations. The variables are the instruction issue rate (single or dual) and the algorithm of the divide/square root implementation (radix-4 or radix-16). Performance figures for individual operations are the same as in the previous experiments.

Table 9: Effect of FP instruction issue rate on benchmark performance; execution time improvement [%] for an independent configuration

Algorithm      Issue Rate     Max    Min    Avg
radix-4 SRT    single issue   0.0    0.0    0.0
radix-16 SRT   single issue   26.9   0.0    7.0
radix-4 SRT    dual issue     50.0   5.5    35.4
radix-16 SRT   dual issue     52.1   22.3   44.4

From the results in Table 9, it is apparent that increasing the floating-point instruction issue rate leads to a genuine performance boost. This effect tends to overshadow the effects of divide/square root latency. The dual issue radix-4 case outperforms the single issue radix-16 one, even though the latter is faster on individual operations. Also, the difference between the dual issue radix-4 and radix-16 cases is much less significant than the contrast between cases with the same algorithm but differing issue rates. This is another instance, not unlike the contest between multiplicative and subtractive algorithms, where parallelism wins over latency. By keeping all of the units busier, especially the adder and multiplier, dual instruction issue softens the impact of division latency.

We have also performed a set of simulations to determine how dual instruction issue affects the performance balance between multiplicative and subtractive methods. As shown in Table 8, multiple instruction issue strengthens the performance advantage of subtractive implementations. This is what one would expect given the increased parallelism of operations afforded by a separate functional unit for divide/square root operations. These data support the claim that subtractive methods are better suited to superscalar implementations.

Table 8: Effect of algorithm choice on benchmark performance; execution time improvement [%] for an independent, dual issue configuration

Algorithm                  Issue Rate   Max    Min    Avg
8-bit seed Goldschmidt     dual issue   0.0    0.0    0.0
16-bit seed Goldschmidt    dual issue   11.4   2.3    6.2
radix-4 SRT                dual issue   37.2   4.5    20.1
radix-16 SRT               dual issue   46.7   14.1   28.9

Connectivity

In an FPU with a dual issue, independent add-multiply configuration, there are several ways to connect subtractive divide/square root hardware. One can either package the divide/square root circuitry as an independent functional unit as in the Sun UltraSPARC, share ports with the adder as in the DEC 21164, or share ports with the multiplier as in the MIPS R10000. Figure 6 illustrates the possible configurations. In every case, division and square root are computed concurrently with other FPU operations. However, certain configurations lead to contention for shared input and/or output ports. Multiplicative divide/square root hardware is always integrated with the multiplier, so the connectivity issue is moot for those types of implementations. Sharing subtractive divide/square root unit ports with a multiply-accumulate unit is equivalent to a single issue FPU, a case which has already been covered.

[Figure 6: Dual issue, independent add-multiply configurations with (a) independent, (b) adder-coupled, and (c) multiplier-coupled divide/square root units.]

Performance impact. On the face of it, sharing functional unit ports between divide/square root and other operations seems sensible, since addition and multiplication occur much more frequently. It is important, however, to make sure that collisions between divide/square root and the host operations are minimized. These simulations explore the impact of different interconnection schemes on benchmark performance. In every case, an independent add-multiply configuration and dual floating-point instruction issue have been assumed. The variables are the algorithm (radix-4 or radix-16) and the connectivity (independent, adder-coupled, or multiplier-coupled). In cases where functional units are shared, divide and square root operations have been pushed back one cycle in the event of a port conflict.
Table 10: Effect of divide/square root connectivity on benchmark performance; execution time improvement [%] for an independent configuration

Algorithm      Connectivity         Max    Min   Avg
radix-4 SRT    multiplier-coupled   0.0    0.0   0.0
radix-4 SRT    adder-coupled        4.4    1.4   2.8
radix-4 SRT    independent          4.4    1.4   2.8
radix-16 SRT   multiplier-coupled   12.1   0.1   3.8
radix-16 SRT   adder-coupled        26.3   2.3   10.6
radix-16 SRT   independent          26.3   2.3   10.6

The simulation results, given in Table 10, are somewhat counter-intuitive. First of all, for both algorithms, the independent and adder-coupled cases show the exact same performance. This is a feature of the Givens rotations algorithm, which has approximately twice as many multiplies as adds and, therefore, many free issue slots allocated to the adder. The adder-coupled case can take advantage of these openings, yielding the same performance as the independent case. The multiplier-coupled configuration is hampered by frequent collisions between multiply and divide/square root operations, but the effects are relatively mild, especially for the radix-4 cases. With the shorter latency of radix-16 operations, the collisions have comparatively greater impact on performance, but still less than 7% on average.

Although the Givens rotation benchmark favors the adder-coupled design, its near 2:1 ratio of multiply to add operations should not be taken as typical. In a survey of floating-point operations in the SPEC92fp benchmark suite, the multiply:add ratio is closer to 2:3 [11]. This supports the intuitive suggestion that many applications are dominated by addition rather than multiplication, and that multiplier-coupled divide/square root is therefore the most appropriate choice. In every other design decision covered so far, improving the performance of the benchmark is not in conflict with enhancing the speed of all programs; this case is the exception.

Area considerations. The primary cost of a separate, independent divide/square root unit is the extra routing required, including the extra buses and the multiplexing/selection logic. Compare the routing complexity of the FPU with the independently connected divide/square root unit in Figure 6(a) with the adder- and multiplier-coupled cases in Figures 6(b) and 6(c). When the divide/square root unit is coupled to one of the other functional units, the two sets of ports on the register file can be readily split between the two clusters of components. Connecting two sets of ports to three independent units is considerably more complicated. Similarly, independent divide/square root units complicate the instruction dispatch logic by increasing the number of possible paths and destinations for instructions.

Replication

Several microprocessor manufacturers have recently attempted to boost numerical performance by replicating floating-point functional units. The IBM RS/6000 POWER2 and MIPS R8000 feature dual multiply-accumulate units, the MIPS R10000 couples its independent multiplier with two other units for divide and square root, and the HP PA8000 FPU consists of two multiply-accumulate units and two divide/square root units. This trend is facilitated by increasingly compact technology and higher maximum issue rates.

There are many different ways to replicate floating-point functionality and connect it with the rest of the system. For processors with FPUs organized as single, atomic pipelines or blocks, such as the PowerPC and IBM RS/6000 series, the most natural alternative is to replicate the entire unit. Decoupled superscalar designs like the Intel P6 and Sun SuperSPARC are more at liberty to add individual units, such as an extra divide/square root unit.

Performance impact. There are many potential ways to compare the effects of replicating floating-point hardware on divide/square root performance. We considered one example particularly worthy of examination, namely, the duplication of divide/square root hardware in a fully independent FPU. However, from a preliminary investigation, we concluded that the addition would have no effect on Givens rotation benchmark performance unless the method of scheduling operations was heavily modified. Even with a much more aggressive schedule, the expected effects would be insignificant. Addressing the performance impact of replicating floating-point functionality in general is beyond the scope of this investigation. However, since the Givens rotation benchmark represents an unusually high level of divide/square root utilization, replicating the hardware for these functions alone seems to offer little potential benefit.

Area considerations. It is a more efficient use of area to improve a slow divide/square root unit than to replicate it. For example, in a dual-issue independent FPU, upgrading from radix-4 to radix-16 gives a maximum improvement of 22.5% and an average improvement of 7.5% on the Givens rotation benchmark, while duplication of the radix-4 unit gives no appreciable performance benefit. The upgrade to radix-16 costs only 45% of the radix-4 area, as opposed to 100% for replication, not including all of the extra routing and any changes to the register file or instruction issue logic.

Conclusions

We have argued that floating-point division and square root are important operations which warrant fast, efficient implementations, and explained the most common algorithms and their corresponding hardware structures. Our discussion of implementations has focussed on four design choices: algorithm, issue rate, connectivity, and replication. Table 11 summarizes the design decisions made in recent microprocessors.

Table 11: Summary of divide/square root design decisions for different architectures

Design                Add-Multiply Configuration   Algorithm        FP Issue Rate   Connectivity            Units
DEC 21164 Alpha AXP   independent                  SRT              2 instr/cycle   adder-coupled           1
Hal Sparc64           multiply-acc.                SRT              2 instr/cycle   independent             1
HP PA7200             independent                  SRT              2 instr/cycle   independent             1
HP PA8000             multiply-acc.                SRT              2 instr/cycle   multiply-acc.-coupled   2
IBM RS/6000 POWER2    multiply-acc.                Newton-Raphson   2 instr/cycle   integrated              2
Intel Pentium         independent                  SRT              1 instr/cycle   adder-coupled           1
Intel Pentium Pro     independent                  SRT              1 instr/cycle   independent             1
MIPS R8000            multiply-acc.                multiplicative   2 instr/cycle   integrated              2
MIPS R10000           independent                  SRT              2 instr/cycle   multiplier-coupled      2
PowerPC 604           multiply-acc.                SRT              1 instr/cycle   integrated              1
PowerPC 620           multiply-acc.                SRT              1 instr/cycle   integrated              1
Sun SuperSPARC        independent                  Goldschmidt      1 instr/cycle   multiplier-integrated   1
Sun UltraSPARC        independent                  SRT              2 instr/cycle   independent             1

Subtractive divide/square root implementations offer latencies which are competitive with the fastest multiplicative ones, but with a considerably lower consumption of chip area. More significantly, by operating concurrently with the remainder of the FPU, subtractive hardware provides a potentially significant boost in performance, as evidenced by simulations on the Givens rotation benchmark. This performance advantage holds true for both multiply-accumulate and independent add-multiply configurations. Unlike multiplicative divide/square root hardware, which serializes computation, subtractive implementations are well suited to exploit decoupled superscalar microarchitectures and issue rates of multiple floating-point instructions per cycle.

For subtractive implementations, increasing the instruction issue rate from one to two instructions per cycle can also dramatically improve performance, by over 35% on average for the Givens rotation benchmark. As long as the extra instruction is not squandered on divide/square root operations alone, this seems like a worthwhile improvement, taking into account the cost of routing and upgrading the register file and instruction issue logic.

The cost and complexity of adding a fully independent divide/square root unit to a dual-issue, independent add-multiply configuration are considerable. Therefore, coupling this functionality to one of the existing structures by sharing connections to the register file seems like a reasonable economizing measure. The worst case performance impact seems modest compared with the possible explosion in routing costs and the additional scheduling complications incurred. The balance of the evidence suggests that the multiplier is the best candidate for sharing ports with the divide/square root unit.

The merits of replicating floating-point functional units in general remain to be seen. Certainly, even with highly parallel algorithms, one can expect significantly less than twice the performance from a doubling of hardware. In particular, duplicating divide/square root hardware alone appears to hold few advantages, even for the Givens rotation benchmark with its heavy utilization of these operations.
Acknowledgements

This research is supported in part by the National Science Foundation under contract CCR-9257280. Peter Soderquist was supported by a Fellowship from the National Science Foundation. Miriam Leeser is supported in part by an NSF Young Investigator Award.

References

[1] B. K. Bose, L. Pei, G. S. Taylor, and D. A. Patterson. Fast multiply and divide for a VLSI floating-point unit. In Proceedings of the 8th IEEE Symposium on Computer Arithmetic, pages 87-94. IEEE, May 1987.

[2] Miloš D. Ercegovac, Tomás Lang, and Paolo Montuschi. Very high radix division with selection by rounding and prescaling. In Proceedings of the 11th IEEE Symposium on Computer Arithmetic, pages 112-119. IEEE, June 1993.

[3] Miloš D. Ercegovac and Tomás Lang. Division and Square Root: Digit Recurrence Algorithms and Implementations. Kluwer Academic Publishers; Norwell, MA, 1994.

[4] Gene H. Golub and Charles F. Van Loan. Matrix Computations. The Johns Hopkins University Press; Baltimore, second edition, 1989.

[5] John L. Hennessy and David A. Patterson. Computer Architecture: A Quantitative Approach. Morgan Kaufmann Publishers; San Mateo, CA, 1990. Appendix A: Computer Arithmetic, by David Goldberg.

[6] IEEE standard for binary floating-point arithmetic. ANSI/IEEE Std. 754-1985, New York, August 1985.

[7] W. Kahan. Using MathCAD 3.1 on a Mac, August 1994.

[8] Mathworks, Inc., Natick, MA. The Pentium Papers, November 1994. Extensive online documentation of the Pentium FDIV bug and related publicity. Available on the World Wide Web at http://www.mathworks.com/README.html.

[9] S. E. McQuillan and J. V. McCanny. A VLSI architecture for multiplication, division, and square root. In Proceedings of the 1991 International Conference on Acoustics, Speech and Signal Processing, pages 1205-1208. IEEE, May 1991.

[10] S. E. McQuillan, J. V. McCanny, and R. Hamill. New algorithms and VLSI architectures for SRT division and square root. In Proceedings of the 11th IEEE Symposium on Computer Arithmetic, pages 80-86. IEEE, June 1993.

[11] Stuart F. Oberman and Michael J. Flynn. Design issues in floating-point division. Technical Report CSL-TR-94-647, Stanford University, Departments of Electrical Engineering and Computer Science, Stanford, CA, December 1994.

[12] Norman R. Scott. Computer Number Systems and Arithmetic. Prentice Hall; Englewood Cliffs, NJ, 1985.

[13] Peter Soderquist and Miriam Leeser. An area/performance comparison of subtractive and multiplicative divide/square root implementations. In Proceedings of the 12th IEEE Symposium on Computer Arithmetic, pages 132-139. IEEE, July 1995.

[14] Peter Soderquist and Miriam Leeser. Area and performance tradeoffs in floating-point division and square root implementations. ACM Computing Surveys, 28(3):518-564, September 1996.

[15] Ted E. Williams. A zero-overhead self-timed 160-ns 54-b CMOS divider. IEEE Journal of Solid-State Circuits, 26(11):1651-1661, November 1991.

Sidebar: Givens Rotations and FPU-Level Simulation

Many engineering and scientific applications use matrices and numerical analysis methods to model and simulate systems. These matrices must frequently take on a specific shape, such as diagonal, upper triangular, or lower triangular, assumed by a particular algorithm. Givens rotation [4] provides a tool for shaping arbitrary matrices into these and other more complicated forms as needed. For any two scalars a and b, Givens rotation performs the operation

    [ c  -s ]^T [ a ]   =   [ r ]
    [ s   c ]   [ b ]       [ 0 ]                                (1)

where r = √(a^2 + b^2). The rotation coefficients c and s can be computed using the function in Figure 8.

    function: [c, s] = givens(a, b)
        if b = 0
            c = 1; s = 0
        else
            if |b| > |a|
                τ = a/b; s = 1/√(1 + τ^2); c = s·τ
            else
                τ = b/a; c = 1/√(1 + τ^2); s = c·τ
            end
        end
    end givens

    Figure 8: Function for computing Givens rotation coefficients

Let [a b]^T be a submatrix of A, an m-by-n matrix representing the coefficients of a system model. Let the rotation be applied to all vertical pairs of elements in the rows of A containing a and b, in the manner of Equation 1. The system of equations represented by A remains the same, but now there is a zero in the position of b. Givens rotations can therefore be used to shape matrices by introducing zeros. Figure 7 shows how to shape an arbitrary 4-by-3 matrix into upper triangular form. All that is needed is one computation of coefficients as in Figure 8 and repeated application of Equation 1 for each new zero produced.

[Figure 7: Triangularization of a 4-by-3 matrix using Givens rotations. Each rotation, labeled by the pair of rows it acts on, introduces one zero below the diagonal; after six rotations the matrix A has been reduced to the upper triangular matrix R.]

Note that despite the centrality of divide and square root operations, Givens rotation typically involves computing many matrix-vector products and therefore a predominance of additions and multiplications.

The FPU-level simulator used in this study models the transformation of an m-by-n matrix into upper triangular form using Givens rotations. It accepts as input the dimensions of the matrix and a function describing the structure and performance of a given floating-point unit configuration. The computed result is the simulated performance of the particular FPU. Note that the simulator does not actually compute the values needed for triangularization, but rather the performance of different floating-point configurations in machine cycles.
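For readers who prefer to see the triangularization spelled out, the following C sketch applies the coefficient function of Figure 8 to reduce a small matrix to upper triangular form. The 4-by-3 matrix values and the loop structure are illustrative assumptions only; as noted above, the paper's simulator does not compute these values at all, but only counts the machine cycles the operations would take on a given FPU configuration.

    #include <math.h>
    #include <stdio.h>

    #define M 4
    #define N 3

    /* Rotation coefficients as in Figure 8: chosen so that the pair (a, b) maps to (r, 0). */
    static void givens(double a, double b, double *c, double *s) {
        if (b == 0.0) { *c = 1.0; *s = 0.0; }
        else if (fabs(b) > fabs(a)) {
            double t = a / b;
            *s = 1.0 / sqrt(1.0 + t * t);
            *c = (*s) * t;
        } else {
            double t = b / a;
            *c = 1.0 / sqrt(1.0 + t * t);
            *s = (*c) * t;
        }
    }

    int main(void) {
        /* An arbitrary 4-by-3 test matrix (values chosen for illustration only).  */
        double A[M][N] = { {4, 1, 2}, {3, 5, 1}, {2, 2, 6}, {1, 3, 3} };

        /* Introduce one zero at a time below the diagonal, column by column.      */
        for (int k = 0; k < N; k++) {
            for (int i = M - 1; i > k; i--) {
                double c, s;
                givens(A[k][k], A[i][k], &c, &s);
                for (int j = k; j < N; j++) {  /* apply Equation 1 across rows k and i */
                    double upper =  c * A[k][j] + s * A[i][j];
                    double lower = -s * A[k][j] + c * A[i][j];
                    A[k][j] = upper;
                    A[i][j] = lower;
                }
            }
        }

        for (int i = 0; i < M; i++) {
            for (int j = 0; j < N; j++) printf("%9.4f ", A[i][j]);
            printf("\n");
        }
        return 0;
    }

Each inner rotation zeros one subdiagonal element, and because the elements to the left of the current column are already zero, the rotation leaves them untouched; the divide and square root operations are confined to the coefficient computation, while the updates are dominated by multiplies and adds, as the sidebar observes.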