High-Speed Inverse Square Roots

Michael J. Schulte and Kent E. Wires
Computer Architecture and Arithmetic Laboratory
Electrical Engineering and Computer Science Department
Lehigh University, Bethlehem, PA 18015, USA
[email protected] and [email protected]

Abstract

Inverse square roots are used in several digital signal processing, multimedia, and scientific computing applications. This paper presents a high-speed method for computing inverse square roots. This method uses a table lookup, operand modification, and multiplication to obtain an initial approximation to the inverse square root. This is followed by a modified Newton-Raphson iteration, consisting of one square, one multiply-complement, and one multiply-add operation. The initial approximation and Newton-Raphson iteration employ specialized hardware to reduce the delay, area, and power dissipation. Application of this method is illustrated through the design of an inverse square root unit for operands in the IEEE single precision format. An implementation of this unit with a 4-layer metal, 2.5 Volt, 0.25 micron CMOS standard cell library has a cycle time of 6.7 ns, an area of 0.41 mm², a latency of five cycles, and a throughput of one result per cycle.

1. Introduction

Square roots and inverse square roots are important in several digital signal processing, multimedia, and scientific computing applications [1], [2], [3], [4]. For many computations, including vector normalization [1], least squares lattice filters [2], Cholesky decomposition [4], and Givens rotations [5], a square root is first computed and then used as the divisor in a subsequent divide operation. A more efficient method for performing this computation is to first compute the inverse square root and then use it as the multiplier in a subsequent multiply operation [1]. Because of its usefulness in 3D graphics applications, special instructions for the inverse square root have been added to the Motorola AltiVec [6] and Advanced Micro Devices 3DNow! [7] instruction set extensions.

Although several algorithms have been developed for computing square roots or inverse square roots, these algorithms typically have either long latencies or high memory requirements [5], [8], [9]. Digit recurrence algorithms, such as those presented in [10], [11], [12], need less area than other methods, but have linear convergence and often require a large number of iterations. On the other hand, methods that employ parallel polynomial approximations, such as those presented in [4], [13], [14], [15], have short latencies yet require large amounts of memory and area.

This paper presents a high-speed method for computing inverse square roots. This method uses a variation of the algorithm presented in [16] to obtain an initial approximation to the inverse square root. The initial approximation requires a table lookup, operand modification, and multiplication. After the initial approximation, a modified Newton-Raphson iteration is used to produce an accurate inverse square root. The initial approximation and modified Newton-Raphson iteration are implemented using specialized hardware to reduce the delay, area, and power dissipation. The method presented in this paper is similar to the method presented in [1], except it uses a more accurate initial approximation, requires only a single Newton-Raphson iteration, and employs truncated multipliers and a specialized squaring unit. Consequently, it requires significantly less memory and area.
Section 2 describes the method for computing inverse square roots. Section 3 presents the design of a hardware unit that uses this method to compute inverse square roots for numbers in the IEEE single precision format [17]. Section 4 gives our conclusions. A similar method for computing high-speed reciprocal approximations is described in [18].

2. Inverse square root method

The method presented in this paper produces an approximation Y ≈ 1/√X to the inverse square root of a number X. The method for calculating Y consists of the following steps:

1. Compute an initial approximation R ≈ 1/√X using a variation of the method described in [16].
2. Perform a modified Newton-Raphson iteration to produce a more accurate approximation Y ≈ 1/√X.

To reduce the hardware requirements, the inverse square root method uses truncated multipliers and a specialized squaring unit.

2.1. Truncated multipliers

In the discussion to follow, it is assumed that an unsigned n-bit multiplicand A is multiplied by an unsigned n-bit multiplier B to produce an unsigned 2n-bit product P. For fractional numbers, the values for A, B, and P are

  A = Σ_{i=1}^{n} a_i·2^{−i},  B = Σ_{i=1}^{n} b_i·2^{−i},  P = Σ_{i=1}^{2n} p_i·2^{−i}  (1)

The multiplication matrix for P = A·B is shown in Figure 1a. To avoid excessive growth in word size, P is often rounded to n bits.

[Figure 1. Multiplication matrices: (a) standard multiplication matrix; (b) truncated multiplication matrix.]

With truncated multiplication, only the n + k most significant columns of the multiplication matrix are used to compute the product, which is then truncated to n bits [19], [20]. Using a method similar to the one presented in [20], the partial product bits in column n + k + 1 are added to column n + k to compensate for the error that occurs by eliminating columns n + k + 1 to 2n. To compensate for truncating the (n + k)-bit result to only n bits, ones are added to columns n + 2 to n + k, as shown in Figure 1b. To add these ones, k − 1 half adders are changed to specialized half adders. Specialized half adders are equivalent to full adders that have one input set to one and require roughly the same number of transistors as regular half adders [21]. As described in [22], this method of truncated multiplication results in a maximum absolute error that is bounded by

  E_trunc ≤ 0.5 + Σ_{i=1}^{⌊(n−k)/2⌋} (n − k + 2 − 2i)·2^{−k−2i−1}  (2)

units in the last place (ulps). Table 1 shows values of k that limit the maximum absolute error due to truncated multiplication to less than one ulp for different ranges of n [22].

Table 1. Values of k for less than one ulp error.

  k | range of n
  0 | n ≤ 3
  1 | 4 ≤ n ≤ 7
  2 | 8 ≤ n ≤ 14
  3 | 15 ≤ n ≤ 27
  4 | 28 ≤ n ≤ 52
  5 | 53 ≤ n ≤ 101
  6 | 102 ≤ n ≤ 198

Truncated multipliers require significantly less hardware than conventional parallel multipliers. A conventional n by n array multiplier requires n² AND gates, n² − 2n full adders, and n half adders [19]. A truncated array multiplier, with t = n − k, requires t(t − 1)/2 fewer AND gates, (t − 1)(t − 2)/2 fewer full adders, and (t − 1) fewer half adders than an equivalent conventional array multiplier [22].
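To make the scheme concrete, here is a small Python model (our own sketch, not from the paper) of an n-by-n truncated multiplier with the correction just described; it keeps columns up through n + k, folds column n + k + 1 into column n + k, adds the correction ones, and then measures the worst-case error in ulps against the exact product.

```python
def truncated_multiply(A, B, n, k):
    """Bit-level model of an n-by-n truncated multiplier with correction.

    A and B are n-bit unsigned fractions stored as integers (A represents
    A / 2**n).  The accumulator 'total' uses 2**-(n+k) as its unit, so a
    partial product bit a_i*b_j in column i+j carries weight 2**(n+k-i-j).
    """
    a = [(A >> (n - i)) & 1 for i in range(1, n + 1)]   # a[0] holds a_1 (MSB)
    b = [(B >> (n - j)) & 1 for j in range(1, n + 1)]
    total = 0
    for i in range(1, n + 1):
        for j in range(1, n + 1):
            col = i + j
            if col <= n + k:                       # columns that are kept
                total += (a[i - 1] & b[j - 1]) << (n + k - col)
            elif col == n + k + 1:                 # folded into column n+k
                total += a[i - 1] & b[j - 1]
    for c in range(n + 2, n + k + 1):              # the k-1 correction ones
        total += 1 << (n + k - c)
    return total >> k                              # truncate to n bits

# Worst-case error in ulps (1 ulp = 2**-n) over all 8-bit operand pairs;
# Table 1 predicts that k = 2 keeps this below one ulp.
n, k = 8, 2
worst = max(abs((truncated_multiply(A, B, n, k) << n) - A * B)
            for A in range(1 << n) for B in range(1 << n)) / 2.0**n
print(worst)
```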
This reduction in hardware leads to a significant decrease in delay, area, and power dissipation. For example, a 32 by 32 truncated array multiplier with k = 4 has 17% less delay, 39% less area, and 41% less average power dissipation than a conventional 32 by 32 array multiplier [23]. Similarly, a conventional n by n Parallel Reduced Area Multiplier [21] requires n² AND gates, n² − 4n + 3 + S full adders, n − 1 half adders, and a (2n − 2 − S)-bit carry propagate adder, where S is the number of full adder stages required to reduce the partial products. A truncated Parallel Reduced Area Multiplier requires approximately t(t − 1)/2 fewer AND gates, (t − 2)(t − 3)/2 + S fewer full adders, and t fewer half adders. The size of the carry-propagate adder is reduced by approximately (t − S) bits [22].

With minor modifications, the technique of truncated multiplication can be extended to handle two's complement multipliers, multiply-add and multiply-complement units, and multipliers for which A, B, and P are different sizes [22]. These modifications do not significantly increase the area or delay of truncated multiplication.

2.2. Specialized squaring unit

The modified Newton-Raphson iteration uses the square of the initial approximation. Rather than having a parallel multiplier compute the square, a specialized squaring unit, which requires significantly less hardware, is employed. Figure 2a shows an 8 by 8 partial product matrix for S = A². Since a_i·a_j = a_j·a_i, the matrix is symmetric with respect to the antidiagonal. This allows the number of partial products to be reduced by using the identities a_i·a_j + a_j·a_i = 2·a_i·a_j and a_i·a_i = a_i. As shown in Figure 2b, the original matrix can be replaced by an equivalent matrix consisting of the partial products on the antidiagonal, plus the partial products above the antidiagonal shifted one position to the left [24]. To improve the regularity of the matrix and further reduce its maximum height, a technique presented in [25] is employed. This technique uses the identity

  a_i + a_i·a_{i+1} = 2·a_i·a_{i+1} + a_i·ā_{i+1}  (3)

This allows the partial products a_i and a_i·a_{i+1} in one column to be replaced by a_i·ā_{i+1} in the same column and a_i·a_{i+1} in the next column to the left, as shown in Figure 2c.

[Figure 2. 8 by 8 squaring matrices: (a) original matrix; (b) reduced matrix; (c) optimized matrix.]

An n-bit squaring unit that uses Dadda's method [26] for reducing the partial products requires (n² + n − 2)/2 AND gates, (n² − 7n + 10)/2 full adders, 2⌈n/2⌉ − 3 half adders, n − 1 inverters, and a (2n − 3)-bit carry propagate adder [27]. In comparison, a conventional n by n Dadda tree multiplier requires n² AND gates, n² − 4n + 3 full adders, n − 1 half adders, and a (2n − 2)-bit carry propagate adder [21].
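The folding behind Figure 2b is easy to check in software. The following sketch (our own encoding of the matrices) builds the reduced matrix, with each off-diagonal pair contributing once at doubled weight and each diagonal term collapsing to a_i, and confirms that it reproduces the exact square.

```python
def square_reduced(A, n):
    """Square an n-bit operand using the reduced matrix of Figure 2b.

    A holds the bits a_1 (MSB) .. a_n; the fractional weight of a_i is
    2**-i, so in integer form the bit a_i*a_j carries weight 2**(2n-i-j).
    Off-diagonal pairs appear once, shifted one position left (doubled);
    diagonal terms use a_i*a_i = a_i.
    """
    a = [(A >> (n - i)) & 1 for i in range(1, n + 1)]
    s = 0
    for i in range(1, n + 1):
        s += a[i - 1] << (2 * n - 2 * i)            # antidiagonal: a_i*a_i = a_i
        for j in range(i + 1, n + 1):               # pairs above the antidiagonal
            s += (a[i - 1] & a[j - 1]) << (2 * n - (i + j) + 1)
    return s

# The reduced matrix is exact for every 8-bit operand:
assert all(square_reduced(A, 8) == A * A for A in range(1 << 8))
```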
2.3. Initial approximation

A variation of the method presented in [16] is used to obtain an initial approximation R ≈ 1/√X. As in [16], it is assumed that X = [1.x1 x2 … xn] (x_i ∈ {0, 1}). With this method, X is divided into two parts: X1 = [1.x1 x2 … xm] and X2 = [x_{m+1} x_{m+2} … xn]·2^{−m}. The initial approximation is computed as

  R = X′·C′  (4)

where X′ = [1.x1 x2 … xm x̄_{m+1} x̄_{m+1} x̄_{m+2} … x̄_{n_X′}] (with x̄_i denoting the complement of x_i), C′ = X1^{−3/2} − 3·2^{−m−2}·X1^{−5/2} + 33·2^{−2m−6}·X1^{−7/2}, and n_X′ denotes the number of bits in the fractional part of X′. The values for X′ are obtained from X by complementing n_X′ − m bits of X. The values for C′ are precomputed, rounded to nearest, and stored in a table (e.g., a RAM or ROM) that is addressed with the bits [x1 x2 … xm]. For floating point numbers, the least significant bit of the exponent is also used to index the table. If the exponent is odd, the value for C′ is multiplied by 1/√2 before it is rounded and stored in the table.

This method varies slightly from the method presented in [16], because only the n_X′ most significant fractional bits of X′ are used and R is computed using truncated multiplication. Although these modifications result in a small amount of additional error, they significantly reduce the amount of hardware required for the multiplication. The maximum absolute error in the initial approximation is bounded by

  ε_R < 3·2^{−2m−6}·X1^{−5/2} + X′·ε_{C′} + C′·ε_{X′} + ε_{C′}·ε_{X′} + ε_{Rt}  (5)

where the first term is the approximation error [16], ε_{C′} is the error due to rounding C′, ε_{X′} is the error due to using only the n_X′ most significant fractional bits of X′, and ε_{Rt} is the error due to computing R with a truncated multiplier.

2.4. Modified Newton-Raphson iteration

After the initial approximation R is computed, a modified iteration of the Newton-Raphson inverse square root algorithm is used to improve its accuracy. As presented in [28] and [29], the standard Newton-Raphson inverse square root algorithm uses the iterative equation

  R_{i+1} = R_i·(3 − X·R_i²)/2  (6)

Implementing a standard Newton-Raphson iteration requires a square, two multiplies, a subtract, and a one bit right shift. The method presented in this paper uses a modified Newton-Raphson iteration, which can be expressed as

  W = R²
  D = 1 − WX
  Y = R + RD/2

Implementing this iteration requires one square (W = R²), one multiply-complement operation (D = 1 − WX), and one multiply-add operation (Y = R + RD/2). Dividing RD by 2 is implemented by shifting RD one extra position to the right before adding it to R. To further reduce the area, delay, and power dissipation, the following optimizations are used:

- W = R² is computed by a specialized squaring unit.
- D = 1 − WX is computed by complementing the bits of WX. Since W ≈ 1/X, D is close to zero and its most significant bits are all zeros (when D ≥ 0) or all ones (when D < 0). Since only one of these bits is required to indicate the sign, the other bits are not computed.
- Since D has least significant bits that are negligible, WX is computed using truncated multiplication.
- Since RD/2 is close to zero and its least significant bits are negligible when added to R, RD/2 is computed using truncated multiplication.

These optimizations are illustrated in Section 3.
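For intuition, here is a floating-point sketch of one modified iteration (our own demonstration; the hardware versions of W, D, and Y are truncated fixed-point operations, as described above):

```python
import math

def modified_iteration(X, R):
    """One modified Newton-Raphson step: W = R^2, D = 1 - W*X,
    Y = R + R*D/2.  Algebraically identical to Equation (6); the
    hardware computes W with a squaring unit, D with a truncated
    multiply-complement, and Y with a truncated multiply-add."""
    W = R * R
    D = 1.0 - W * X          # close to zero when R is accurate
    return R + R * D / 2.0   # the divide by 2 is a wired right shift

X = 2.7
R = (1 + 1e-4) / math.sqrt(X)    # initial approximation, ~6e-5 absolute error
print(abs(modified_iteration(X, R) - 1 / math.sqrt(X)))  # ~9e-9, per Eq. (7)
```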
If the maximum absolute error in R is ε_R and the iteration is performed with infinite precision, then Y will have a maximum absolute error of

  (3·√X·ε_R² + X·ε_R³)/2  (7)

If the maximum absolute errors due to truncating W, D, and Y are denoted as ε_{Wt}, ε_{Dt}, and ε_{Yt}, respectively, then the error in the final approximation is bounded by

  ε_Y < (3·√X·ε_R² + X·ε_R³ + √X·ε_{Wt} + X·ε_R·ε_{Wt} + ε_R·ε_{Dt} + ε_{Dt}/√X)/2 + ε_{Yt}  (8)

By controlling each of these errors, Y is computed with a maximum absolute error of less than one ulp.

3. Inverse square root unit

Based on the method and error analysis presented in the previous section, the design of an inverse square root unit is developed. The results of the error analysis establish an initial hardware design. After this, exhaustive simulations determine where modifications can be made to further reduce the design's hardware requirements, while still guaranteeing that all results are accurate to one ulp.

The inverse square root unit provides approximations for numbers in the IEEE single precision floating point format [17]. With this format, a floating point number V is represented using a sign bit s_v, an 8-bit biased exponent e_v, and a 23-bit stored significand f_v. If V is a normalized number, it has the value

  V = (−1)^{s_v}·(1 + f_v)·2^{e_v − 127}  (9)

where 1 ≤ (1 + f_v) < 2 and 1 ≤ e_v ≤ 254. An exponent of zero is used to represent denormalized numbers and zero, and an exponent of 255 is used to represent quantities that are infinite or not-a-number (NaN) [17]. The inverse square root unit takes V and computes Z ≈ 1/√V, which has a sign bit s_z, an 8-bit biased exponent e_z, and a 23-bit stored significand f_z.

To handle IEEE single precision numbers with both even and odd exponents, the method presented in the previous section is modified slightly and hardware is added to take care of special cases. For normalized positive numbers, if the biased exponent e_v is even, X = 2(1 + f_v); otherwise, X = 1 + f_v. Since 1 ≤ X < 4, 0.5 < Y ≤ 1. As is done in [1], additional hardware is used to handle the special case X = Y = 1 (i.e., f_v = 0 and e_v is odd). Unless X = 1 or V is not a positive normalized number, f_z is obtained by shifting Y left one bit to normalize it and extracting the leading one, which can be expressed mathematically as f_z = 2Y − 1.

Additional hardware is also needed to handle negative numbers and numbers that are not normalized. Table 2 shows how the values for e_z and f_z are determined based on the values of e_v and f_v, for V ≥ 0. If V < 0, the result is set to NaN, and an invalid operation exception flag i is set. If V is a denormalized number (Case 2), a denormalized input flag d is set, which causes a software trap to occur. If f_z should not be set to f_v (Cases 6 and 7), the approximate flag a is set. Since V and Z always have the same sign for V ≥ 0 and the sign of Z is not defined for V < 0, s_z is always set to s_v.

Table 2. Determining e_z and f_z from e_v and f_v.

  Case | V          | e_v            | f_v | Z          | e_z           | f_z
  1    | +0         | 0              | 0   | +∞         | 255           | f_v
  2    | Denorm     | 0              | ≠0  | (trap for denormalized input)
  3    | +∞         | 255            | 0   | +0         | 0             | f_v
  4    | NaN        | 255            | ≠0  | NaN        | 255           | f_v
  5    | Normalized | 1 to 253, odd  | 0   | Normalized | (381 − e_v)/2 | f_v
  6    | Normalized | 1 to 253, odd  | ≠0  | Normalized | (379 − e_v)/2 | 2/√(1 + f_v) − 1
  7    | Normalized | 2 to 254, even | any | Normalized | (380 − e_v)/2 | 2/√(2(1 + f_v)) − 1
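A behavioral sketch of this case selection (our own Python; the actual unit implements it with a 9-bit subtractor, multiplexors, and a small amount of combinational logic, as described in Section 3.1):

```python
def exponent_and_flags(e_v, f_v):
    """Select e_z and the a/d/i flags for V >= 0, following Table 2.
    Returns (e_z, a, d, i); a = 1 means f_z must be computed as 2Y - 1,
    a = 0 means f_z = f_v.  Negative inputs (s_v = 1) are handled
    separately: the result is set to NaN and i is set."""
    if e_v == 0:
        return (255, 0, 0, 0) if f_v == 0 else (0, 0, 1, 0)  # +0 -> +inf; denorm traps
    if e_v == 255:
        return (0, 0, 0, 0) if f_v == 0 else (255, 0, 0, 0)  # +inf -> +0; NaN -> NaN
    if e_v % 2 == 1:                                         # odd e_v: X = 1 + f_v
        if f_v == 0:
            return ((381 - e_v) // 2, 0, 0, 0)               # exact case X = Y = 1
        return ((379 - e_v) // 2, 1, 0, 0)
    return ((380 - e_v) // 2, 1, 0, 0)                       # even e_v: X = 2(1 + f_v)

assert exponent_and_flags(127, 0) == (127, 0, 0, 0)          # 1/sqrt(1.0) = 1.0
```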
The block diagram of an inverse square root unit for single-precision floating point numbers is shown in Figure 3. Busses are shown in bold and the width of each bus is given. This unit performs an inverse square root approximation in only five cycles. It is pipelined so that a new result can be produced each cycle. In the first cycle, the value of C′ is obtained by a table lookup. In parallel with this, the exponent/exception logic computes e_z and sets the values of the d, i, and a flags. In the second cycle, a modified multiplier computes R = C′X′. In the third cycle, a squaring unit computes W = R². In the fourth cycle, a multiply-complement unit computes D = 1 − WX. Rather than passing both X and f_v in the pipeline, only f_v is passed, and X is determined from f_v and the least significant bit of e_v, which is denoted as o. In the fifth cycle, the multiply-add unit computes Y = R + RD/2 and f_z is selected as either f_v or 2Y − 1, based on a.

[Figure 3. Inverse square root unit (Z ≈ 1/√V): table lookup and exponent/exception logic, followed by the modified multiplier, squaring unit, multiply-complement unit, multiply-add unit, and output multiplexor.]

3.1. Exponent logic and initial approximation

The exponent/exception logic takes as inputs s_v, e_v, and f_v, and computes e_z, a, d, and i. A 9-bit subtractor and multiplexors are used to compute e_z, as shown in Table 2. The unit also requires a modest amount of combinational logic to determine special cases and set the values of the a, d, and i flags.

The table lookup for the initial approximation is implemented as a 64-word by 16-bit read-only memory (ROM). The ROM takes as inputs the 5 most significant bits of f_v, which correspond to [x1 x2 … x5], and the least significant bit of e_v, which determines if the exponent is even or odd. The table lookup produces C′, as described in Section 2.3.

The modified multiplier, which computes R = C′X′, is implemented as a 16 by 16 truncated multiplier, in which k = 3 and the product is truncated to 16 bits. Prior to performing the multiplication, the modified operand

  X′ = [1.x1 x2 … x5 x̄6 x̄6 x̄7 … x̄14]  (10)

is obtained from the 14 most significant bits of f_v, which correspond to [x1 … x14]. Eliminating the least significant columns of the multiplier reduces its delay and area by approximately 11% and 31%, respectively. Based on Equation 5, the maximum absolute error in the initial approximation is bounded by

  ε_R < 3·2^{−16} + 2^{−16} + 2^{−15} + 2^{−32} + 2^{−16} < 2^{−13}  (11)

since m = 5, 1 < X′ < 2, 0.25 < C′ < 1, ε_{X′} < 2^{−15}, ε_{C′} < 2^{−17}, and ε_{Rt} < 2^{−16}. Exhaustive simulation of the initial approximation reveals that this conservative upper bound on the error never occurs and that the maximum absolute error is bounded by ε_R < 2^{−14}.
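The operand modification of Equation 10 is pure bit manipulation, as the following sketch shows (our own encoding; it builds X′ as an exact fraction and models neither the C′ table nor the truncated multiplier):

```python
from fractions import Fraction

def modified_operand(frac14):
    """Build X' of Equation 10 from the 14 most significant significand
    bits x_1..x_14: bits x_6..x_14 are complemented, with the complement
    of x_6 duplicated, yielding 15 fractional bits."""
    bits = [(frac14 >> (14 - i)) & 1 for i in range(1, 15)]   # x_1 .. x_14
    frac_bits = bits[:5]                                      # x_1..x_5 pass through
    frac_bits += [1 - bits[5]] + [1 - b for b in bits[5:]]    # ~x_6, then ~x_6..~x_14
    return 1 + sum(Fraction(b, 2 ** p) for p, b in enumerate(frac_bits, start=1))

print(modified_operand(0))   # Fraction(33791, 32768) = 1 + 2**-5 - 2**-15
```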
3.2. Newton-Raphson iteration hardware

A specialized 16-bit squaring unit computes W = R² and truncates W to 25 bits. The maximum absolute error from truncating the result is bounded by ε_{Wt} < 2^{−25}. Compared to a conventional 16 by 16 multiplier, the 16-bit squaring unit has approximately 14% less delay and 47% less area.

The multiply-complement unit performs the operation D = 1 − WX, which corresponds to a 25 by 25 truncated multiplication, in which the bits of the product are inverted. If o = 1, the unbiased exponent is even and X = 1 + f_v. Otherwise, X = 2(1 + f_v) and a one bit left shift of 1 + f_v is required. For the truncated multiply-complement unit, k = 3 and the result is first truncated to 26 bits and then complemented. Figure 4 shows the multiplication matrix for this unit. With truncated multiplication, the 20 least significant columns of the multiplication matrix are eliminated. Furthermore, simulations show that |D| < 2^{−13}, which means that the 13 most significant bits are identical. Therefore, the 12 most significant bits of D are not needed and the columns of the multiplication matrix for computing these bits are omitted. Since the result is truncated to 26 fractional bits and the 12 most significant columns of the multiplication matrix are omitted, only 14 bits of WX are complemented and passed to the next stage of computation. The non-shaded portion of the matrix in Figure 4 corresponds to columns that are omitted. Omitting these columns reduces the delay and area of the multiply-complement unit by approximately 22% and 42%, respectively. The maximum absolute error introduced by the truncated multiply-complement unit is bounded by ε_{Dt} < 2^{−26}.

[Figure 4. Multiplication matrix for the multiply-complement unit, D = 1 − WX; the omitted columns are shown unshaded.]

The multiply-add unit computes Y = R + RD/2, which corresponds to a 16 plus 16 by 14 multiply-add operation. The multiplication matrix for this unit is shown in Figure 5. Since Y is always positive and D can be either positive or negative, the product is sign extended, as described in [15]. Since |D/2| ≤ 2^{−14}, the product RD is right-shifted 14 bits when adding it to R. For this unit, k = 3 and the final result is truncated to 24 bits. Omitting the least significant columns of the multiply-add unit reduces its delay and area by 8% and 63%, respectively. Since 0.5 < Y < 1.0, Y is normalized by shifting it left one bit and removing the leading one to produce f_z. The maximum absolute error introduced by the truncated multiply-add unit is bounded by ε_{Yt} ≤ 2^{−24}.

[Figure 5. Multiplication matrix for the multiply-add unit, Y = R + RD/2.]
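The "complement instead of subtract" step rests on the two's complement identity −v = ~v + 1: complementing alone yields 1 − v minus one ulp, and the omitted "+1" is precisely the truncation error ε_{Dt} < 2^{−26} accounted for above. A small fixed-point sketch (our own model; the two integer bits of register width are an assumption):

```python
FRAC = 26                          # fractional bits kept for W*X (per Section 3.2)
WIDTH = FRAC + 2                   # assumed register width: 2 integer bits

def one_minus_by_complement(v_fx):
    """Approximate 1 - v using only a bitwise complement.
    Since -v = ~v + 1 in two's complement, (1 << FRAC) + ~v_fx equals
    (1 - v) - 2**-26; omitting the '+ 1' is the truncation error eps_Dt."""
    d = ((1 << FRAC) + ~v_fx) & ((1 << WIDTH) - 1)
    return d - (1 << WIDTH) if d >= (1 << (WIDTH - 1)) else d   # sign-extend

v_fx = int(0.999993 * (1 << FRAC))                  # W*X close to 1, as in the unit
print(one_minus_by_complement(v_fx) / 2.0**FRAC)    # ~7.0e-6, within 2**-26 of 1 - v
```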
Based on Equation 8, the maximum absolute error in Y is bounded by

  ε_Y < 3·2^{−28} + 2^{−41} + 2^{−26} + 2^{−41} + 2^{−41} + 2^{−27} + 2^{−24} < 2^{−23}  (12)

since 1 < X < 4, ε_R < 2^{−14}, ε_{Wt} < 2^{−26}, ε_{Dt} < 2^{−26}, and ε_{Yt} ≤ 2^{−24}. Exhaustive simulation over 1 < X < 4 shows that this conservative upper bound on the error never occurs and that the maximum absolute error is bounded by ε_Y < 2^{−24}. Thus, the maximum absolute error in the final result is less than one ulp.

3.3. Implementation and Comparison

The inverse square root unit was synthesized using the Synopsys Module Compiler and a 2.5 Volt, 0.25 micron CMOS standard cell library, which has four levels of metal. The area and critical path delay of each component, and the total area and cycle time for the unit, are given in Table 3. The truncated multipliers use a variation of the Reduced Area Multiplier reduction scheme [21], followed by a carry lookahead adder. The table lookup uses a high-speed ROM. The inverse square root unit has an area of 0.41 mm² and a cycle time of 6.7 ns.

Table 3. Area and delay.

  Component               | Area (mm²) | Delay (ns)
  Exponent/Exception Unit | 0.03       | 1.8
  Table Lookup            | 0.04       | 1.4
  Modified Multiplier     | 0.07       | 4.7
  Squaring Unit           | 0.03       | 2.6
  Multiply-Complement     | 0.13       | 5.6
  Multiply-Add            | 0.08       | 6.0
  Total Area/Cycle Time   | 0.41       | 6.7

For comparison purposes, Table 4 gives area and delay estimates for the inverse square root unit presented in [1]. The unit in [1] is similar to the unit presented in this paper, except it uses a less accurate initial approximation, requires two Newton-Raphson iterations, and does not employ specialized truncated multiplication and squaring units. Furthermore, it only works for normalized single precision numbers. To allow for a fair comparison, the area and delay estimates given in Table 4 are normalized using data from the same standard cell library. They also assume that the unit has been modified to produce one result per cycle and handle all types of single precision numbers.

Table 4. Area and delay for the unit in [1].

  Component               | Area (mm²) | Delay (ns)
  Exponent/Exception Unit | 0.03       | 1.8
  Table Lookup            | 0.10       | 2.0
  Multiply-Complement     | 0.07       | 5.1
  1st Multiplier          | 0.06       | 4.9
  2nd Multiplier          | 0.10       | 6.0
  3rd Multiplier          | 0.23       | 6.6
  Multiply-Add            | 0.11       | 6.3
  Total Area/Cycle Time   | 0.72       | 7.3

Compared to the inverse square root unit presented in this paper, the inverse square root unit presented in [1] has a cycle time that is approximately 9% longer and requires approximately 77% more area. To achieve the cycle time given in Table 4, it has a latency of six cycles.

4. Conclusions

This paper presents a high-speed method for computing inverse square roots. The method uses an accurate initial approximation technique, followed by a modified Newton-Raphson iteration.
Based on this method, the design of an inverse square root unit for IEEE single precision floating point numbers is developed. This unit has a cycle time of 6.7 ns, an area of 0.41 mm², a latency of five cycles, and a throughput of one result per cycle. By utilizing truncated multipliers and specialized squaring units, the area, delay, and power consumption of the design are significantly reduced. Error analysis and exhaustive behavioral-level simulation over the significand range ensure that the inverse square root unit produces results with a maximum error of less than one ulp.

Acknowledgments

This material is based upon work supported by the National Science Foundation under Grant No. MIP-9703421.

References

[1] H. Kwan, R. L. Nelson, and E. E. Swartzlander, Jr., "Cascaded Implementation of an Iterative Inverse-Square-Root Algorithm with Overflow Lookahead," in Proceedings of the 12th Symposium on Computer Arithmetic, pp. 114–123, 1995.
[2] R. W. Stewart, R. Chapman, and T. Durrani, "The Square Root in Signal Processing," in Proceedings of Real-Time Signal Processing XII, pp. 89–100, 1989.
[3] T. R. Halfhill, "Beyond MMX x86 Extensions," Byte, vol. 22, no. 12, pp. 87–92, 1997.
[4] V. K. Jain, G. E. Perez, and J. M. Wills, "Novel Reciprocal and Square-Root VLSI Cell: Architecture and Application to Signal Processing," in International Conference on Acoustics, Speech and Signal Processing, vol. 2, pp. 1201–1204, 1991.
[5] P. Soderquist and M. Leeser, "Area and Performance Tradeoffs in Floating-Point Divide and Square Root Implementations," ACM Computing Surveys, vol. 28, no. 3, pp. 518–564, 1996.
[6] AltiVec Technology Programming Environments Manual. Motorola, Inc., 1998.
[7] S. Oberman, F. Weber, N. Juffa, and G. Favor, "AMD 3DNow! Technology and the K6-2 Microprocessor," in Proceedings of Hot Chips 10, pp. 245–254, 1998.
[8] P. Montuschi and P. M. Mezzalama, "Survey of Square Rooting Algorithms," IEE Proceedings E (Computers and Digital Techniques), vol. 137, no. 1, pp. 31–40, 1990.
[9] I. Koren, Computer Arithmetic Algorithms. Prentice Hall, 1993.
[10] M. D. Ercegovac and T. Lang, Division and Square Root: Digit-Recurrence Algorithms and Implementations. Kluwer Academic Publishers, 1994.
[11] S. Majerski, "Square-Rooting Algorithms for High-Speed Digital Circuits," IEEE Transactions on Computers, vol. C-34, pp. 724–733, August 1985.
[12] M. D. Ercegovac and T. Lang, "Radix-4 Square Root Without Initial PLA," IEEE Transactions on Computers, vol. C-39, pp. 1016–1024, 1990.
[13] V. K. Jain and L. Lin, "Square-Root, Reciprocal, Sine/Cosine, Arctangent Cell for Signal and Image Processing," in International Conference on Acoustics, Speech and Signal Processing, vol. 2, pp. 521–524, 1994.
[14] J. Cao and B. Wei, "High-Performance Hardware for Function Generation," in Proceedings of the 13th Symposium on Computer Arithmetic, pp. 184–189, 1997.
[15] M. J. Schulte and E. E. Swartzlander, Jr., "Hardware Designs for Exactly Rounded Elementary Functions," IEEE Transactions on Computers, vol. C-44, pp. 964–973, 1994.
[16] N. Takagi, "Generating a Power of an Operand by a Table Look-Up and a Multiplication," in Proceedings of the 13th Symposium on Computer Arithmetic, pp. 126–131, 1997.
[17] IEEE Std 754-1985, IEEE Standard for Binary Floating-Point Arithmetic. Standards Committee of the IEEE Computer Society, New York, NY, 1985.
[18] M. J. Schulte, J. E. Stine, and K. E. Wires, "High-Speed Reciprocal Approximations," in Proceedings of the Thirty-First Asilomar Conference on Signals, Circuits and Systems, pp. 1178–1182, 1998.
[19] M. J. Schulte and E. E. Swartzlander, Jr., "Truncated Multiplication with Correction Constant," in VLSI Signal Processing, VI, pp. 388–396, 1993.
[20] E. J. King and E. E. Swartzlander, Jr., "Data Dependent Truncated Scheme for Parallel Multiplication," in Proceedings of the Thirty-First Asilomar Conference on Signals, Circuits and Systems, pp. 1178–1182, 1998.
[21] K. Bickerstaff, M. J. Schulte, and E. E. Swartzlander, Jr., "Parallel Reduced Area Multipliers," Journal of VLSI Signal Processing, vol. 9, pp. 181–192, 1995.
[22] M. J. Schulte, J. G. Jansen, and J. E. Stine, "Implementing Truncated Multipliers," to be submitted to the IEEE Transactions on Circuits and Systems II: Analog and Digital Signal Processing, 1999.
[23] M. J. Schulte, J. G. Jansen, and J. E. Stine, "Reduced Power Dissipation Through Truncated Multiplication," in IEEE Alessandro Volta Memorial International Workshop on Low Power Design, 1999 (in press).
[24] T. C. Chen, "A Binary Multiplication Scheme Based on Squaring," IEEE Transactions on Computers, vol. C-20, pp. 678–680, 1971.
[25] R. H. Strandberg et al., "Efficient Realizations of Squaring Circuit and Reciprocal Used in Adaptive Sample Rate Notch Filters," Journal of VLSI Signal Processing, vol. 14, no. 3, pp. 303–309, 1996.
[26] L. Dadda, "Some Schemes for Parallel Multipliers," Alta Frequenza, vol. 34, pp. 349–356, 1965.
[27] M. J. Schulte, K. E. Wires, and L. P. Marquette, "Implementing Squaring Units," to be submitted to the International Conference on Computer Design, 1999.
[28] W. James and P. Jaratt, "The Generation of Square Roots on a Computer With Rapid Multiplication Compared with Division," Mathematics of Computation, vol. 19, pp. 497–500, 1965.
[29] C. V. Ramamoorthy and J. Goodman, "Some Properties of Iterative Square-Rooting Methods Using High-Speed Multiplication," IEEE Transactions on Computers, vol. C-21, pp. 837–847, 1972.