
High-Speed Inverse Square Roots
Michael J. Schulte and Kent E. Wires
Computer Architecture and Arithmetic Laboratory
Electrical Engineering and Computer Science Department
Lehigh University
Bethlehem, PA 18015, USA
[email protected] and [email protected]
Abstract
Inverse square roots are used in several digital signal
processing, multimedia, and scientific computing applications. This paper presents a high-speed method for computing inverse square roots. This method uses a table lookup, operand modification, and multiplication to obtain an initial approximation to the inverse square root. This is followed by a modified Newton-Raphson iteration, consisting of one square, one multiply-complement, and one multiply-add operation. The initial approximation and Newton-Raphson iteration employ specialized hardware to reduce
the delay, area, and power dissipation. Application of this
method is illustrated through the design of an inverse square
root unit for operands in the IEEE single precision format.
An implementation of this unit with a 4-layer metal, 2.5 Volt,
0.25 micron CMOS standard cell library has a cycle time of
6.7 ns, an area of 0.41 mm², a latency of five cycles, and a
throughput of one result per cycle.
1. Introduction
Square roots and inverse square roots are important in
several digital signal processing, multimedia, and scientific
computing applications [1], [2], [3], [4]. For many computations, including vector normalization [1], least squares
lattice filters [2], Cholesky decomposition [4], and Givens
rotations [5], a square root is first computed and then used
as the divisor in a subsequent divide operation. A more efficient method for performing this computation is to first
compute the inverse square root and then use it as the multiplier in a subsequent multiply operation [1]. Because of its
usefulness in 3D graphics applications, special instructions
for inverse square root have been added to the Motorola AltiVec [6] and Advanced Micro Devices 3DNow! [7] Instruction Set Extensions.
Although several algorithms have been developed for
computing square roots or inverse square roots, these algorithms typically either have long latencies or high memory requirements [5], [8], [9]. Digit recurrence algorithms,
such as those presented in [10], [11], [12], need less area
than other methods, but have linear convergence and often
require a large number of iterations. On the other hand,
methods that employ parallel polynomial approximations,
such as those presented in [4], [13], [14], [15], have short
latencies yet require large amounts of memory and area.
This paper presents a high-speed method for computing inverse square roots. This method uses a variation of
the algorithm presented in [16] to obtain an initial approximation to the inverse square root. The initial approximation requires a table lookup, operand modification, and
multiplication. After the initial approximation, a modified
Newton-Raphson iteration is used to produce an accurate
inverse square root. The initial approximation and modified
Newton-Raphson iteration are implemented using specialized hardware to reduce the delay, area, and power dissipation.
The method presented in this paper is similar to the
method presented in [1], except it uses a more accurate initial approximation, requires only a single Newton-Raphson
iteration, and employs truncated multipliers and a specialized squaring unit. Consequently, it requires significantly
less memory and area. Section 2 describes the method for
computing inverse square roots. Section 3 presents the design of a hardware unit that uses this method to compute
inverse square roots for numbers in the IEEE single precision format [17]. Section 4 gives our conclusions. A similar method for computing high-speed reciprocal approximations is described in [18].
2. Inverse square root method
The method presented in this paper produces an approximation Y to the inverse square root of a number X. The method for calculating Y consists of the following steps:

1. Compute an initial approximation $R \approx 1/\sqrt{X}$ using a variation of the method described in [16].

2. Perform a modified Newton-Raphson iteration to produce a more accurate approximation $Y \approx 1/\sqrt{X}$.
To reduce the hardware requirements, the inverse square
root method uses truncated multipliers and a specialized
squaring unit.
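To make the two steps concrete, the C sketch below models them in double precision. It is only an illustration under simplifying assumptions: the seed comes from a generic 64-entry midpoint table rather than the operand-modification method of Section 2.3, and the input value is arbitrary.

```c
/* Minimal model of the two-step method: table-based seed, then one
   modified Newton-Raphson step.  The 64-entry midpoint table is a
   stand-in for the operand-modification seed of Section 2.3. */
#include <stdio.h>
#include <math.h>

static double seed(double x) {                 /* x in [1, 4) */
    int idx = (int)((x - 1.0) * 64.0 / 3.0);   /* crude 64-entry table */
    double mid = 1.0 + (idx + 0.5) * 3.0 / 64.0;
    return 1.0 / sqrt(mid);                    /* entry: 1/sqrt at midpoint */
}

int main(void) {
    double x = 1.7;
    double r = seed(x);                        /* step 1: R ~ 1/sqrt(X) */
    double w = r * r;                          /* W = R^2               */
    double d = 1.0 - w * x;                    /* D = 1 - WX            */
    double y = r + r * d / 2.0;                /* Y = R + RD/2          */
    printf("seed error %.2e, final error %.2e\n",
           fabs(r - 1.0 / sqrt(x)), fabs(y - 1.0 / sqrt(x)));
    return 0;
}
```

Because the iteration converges quadratically, a seed accurate to about $2^{-14}$ is enough for a single step to approach the $2^{-24}$ level, as the error analysis in Section 2.4 quantifies.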
range of n      | k
n ≤ 3           | 0
4 ≤ n ≤ 7       | 1
8 ≤ n ≤ 14      | 2
15 ≤ n ≤ 27     | 3
28 ≤ n ≤ 52     | 4
53 ≤ n ≤ 101    | 5
102 ≤ n ≤ 198   | 6

Table 1. Values of k for less than one ulp error.
2.1. Truncated multipliers
In the discussion to follow, it is assumed that an unsigned n-bit multiplicand A is multiplied by an unsigned
n-bit multiplier B to produce an unsigned 2n-bit product
P. For fractional numbers, the values for A, B, and P are

$$A = \sum_{i=1}^{n} a_i 2^{-i}, \quad B = \sum_{i=1}^{n} b_i 2^{-i}, \quad P = \sum_{i=1}^{2n} p_i 2^{-i} \qquad (1)$$
The multiplication matrix for $P = A \cdot B$ is shown in Figure 1a. To avoid excessive growth in word size, P is often rounded to n bits.
[Figure 1. Multiplication matrices: (a) standard multiplication matrix; (b) truncated multiplication matrix.]
With truncated multiplication, only the n + k most significant columns of the multiplication matrix are used to compute the product, which is then truncated to n bits [19], [20]. Using a method similar to the one presented in [20], the partial product bits in column n + k + 1 are added to column n + k to compensate for the error that occurs by eliminating columns n + k + 1 to 2n. To compensate for truncating the (n + k)-bit result to only n bits, ones are added to columns n + 2 to n + k, as shown in Figure 1b. To add these ones, k − 1 half adders are changed to specialized half adders. Specialized half adders are equivalent to full adders that have one input set to one and require roughly the same number of transistors as regular half adders [21].
As described in [22], this method of truncated multiplication results in a maximum absolute error that is bounded by

$$E_{trunc} \leq 0.5 + \sum_{i=1}^{\lfloor (n-k)/2 \rfloor} (n - k + 2 - 2i)\, 2^{-k-2i-1} \qquad (2)$$
units in the last place (ulps). Table 1 shows values of k that
limit the maximum absolute error due to truncated multiplication to less than one ulp for different ranges of n [22].
Truncated multipliers require significantly less hardware than conventional parallel multipliers. A conventional n by n array multiplier requires n² AND gates, n² − 2n full adders, and n half adders [19]. A truncated array multiplier, with t = n − k, requires t(t − 1)/2 fewer AND gates, (t − 1)(t − 2)/2 fewer full adders, and (t − 1) fewer half adders than an equivalent conventional array multiplier [22]. This reduction in hardware leads to a significant decrease in delay, area, and power dissipation. For example, a 32 by 32 truncated array multiplier with k = 4 has 17% less delay, 39% less area, and 41% less average power dissipation than a conventional 32 by 32 array multiplier [23]. Similarly, a conventional n by n Parallel Reduced Area Multiplier [21] requires n² AND gates, n² − 4n + 3 + S full adders, n − 1 half adders, and a (2n − 2 − S)-bit carry propagate adder, where S is the number of full adder stages required to reduce the partial products. A truncated Parallel Reduced Area Multiplier requires approximately t(t − 1)/2 fewer AND gates, (t − 2)(t − 3)/2 + S fewer full adders, and t fewer half adders. The size of the carry-propagate adder is reduced by approximately (t − S) bits [22]. With minor modifications, the technique of truncated multiplication can be extended to handle two's complement multipliers, multiply-add and multiply-complement units, and multipliers for which A, B, and P are different sizes [22]. These modifications do not significantly increase the area or delay of truncated multiplication.
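The array-multiplier counts quoted above can be tallied directly, as in the C sketch below for the 32 by 32, k = 4 example. Note that the 17%, 39%, and 41% figures in the text are synthesized delay, area, and power results from [23]; simple gate counting only reproduces the component savings.

```c
/* Tally of the component counts for a conventional vs. truncated
   n x n array multiplier, using the formulas quoted above. */
#include <stdio.h>

int main(void) {
    int n = 32, k = 4, t = n - k;
    int and_full  = n * n;                   /* n^2 AND gates        */
    int fa_full   = n * n - 2 * n;           /* n^2 - 2n full adders */
    int and_saved = t * (t - 1) / 2;
    int fa_saved  = (t - 1) * (t - 2) / 2;
    int ha_saved  = t - 1;
    printf("AND gates:   %4d -> %4d (%.0f%% fewer)\n", and_full,
           and_full - and_saved, 100.0 * and_saved / and_full);
    printf("full adders: %4d -> %4d (%.0f%% fewer)\n", fa_full,
           fa_full - fa_saved, 100.0 * fa_saved / fa_full);
    printf("half adders: %4d -> %4d\n", n, n - ha_saved);
    return 0;
}
```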
2.2. Specialized squaring unit

The modified Newton-Raphson iteration uses the square of the initial approximation. Rather than having a parallel multiplier compute the square, a specialized squaring unit, which requires significantly less hardware, is employed. Figure 2a shows an 8 by 8 partial product matrix for $S = A^2$. Since $a_i a_j = a_j a_i$, the matrix is symmetric with respect to the antidiagonal. This allows the number of partial products to be reduced by using the identities $a_i a_j + a_j a_i = 2 a_i a_j$ and $a_i a_i = a_i$. As shown in Figure 2b, the original matrix can be replaced by an equivalent matrix consisting of the partial products on the antidiagonal, plus the partial products above the antidiagonal shifted one position to the left [24].

To improve the regularity of the matrix and further reduce its maximum height, a technique presented in [25] is employed. This technique uses the identity

$$a_i + a_i a_{i+1} = 2 a_i a_{i+1} + a_i \overline{a_{i+1}} \qquad (3)$$

This allows the partial products $a_i$ and $a_i a_{i+1}$ in one column to be replaced by $a_i \overline{a_{i+1}}$ in the same column and $a_i a_{i+1}$ in the next column to the left, as shown in Figure 2c.

[Figure 2. 8 by 8 squaring matrices: (a) original matrix; (b) reduced matrix; (c) optimized matrix.]

An n-bit squaring unit that uses Dadda's method [26] for reducing the partial products requires (n² + n − 2)/2 AND gates, (n² − 7n + 10)/2 full adders, 2⌈n/2⌉ − 3 half adders, n − 1 inverters, and a (2n − 3)-bit carry propagate adder [27]. In comparison, a conventional n by n Dadda tree multiplier requires n² AND gates, n² − 4n + 3 full adders, n − 1 half adders, and a (2n − 2)-bit carry propagate adder [21].
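The reduction from Figure 2a to Figure 2b can be checked exhaustively for small operands, as the C sketch below does for 8-bit values (the further rearrangement of Figure 2c redistributes the same bits and is omitted here).

```c
/* Exhaustive check that the reduced squaring matrix of Figure 2b equals
   A^2: each pair a_i a_j + a_j a_i becomes one partial product a_i a_j
   shifted left one place, and a_i a_i collapses to a_i. */
#include <stdio.h>

static unsigned square_reduced(unsigned a, int n) {
    unsigned s = 0;
    for (int i = 0; i < n; i++) {            /* i indexes bits, LSB first */
        unsigned ai = (a >> i) & 1;
        s += ai << (2 * i);                  /* a_i a_i = a_i (diagonal)  */
        for (int j = i + 1; j < n; j++)
            s += (ai & ((a >> j) & 1)) << (i + j + 1);  /* 2 a_i a_j     */
    }
    return s;
}

int main(void) {
    for (unsigned a = 0; a < 256; a++)       /* all 8-bit operands */
        if (square_reduced(a, 8) != a * a) {
            printf("mismatch at %u\n", a);
            return 1;
        }
    printf("reduced matrix matches A*A for all 8-bit A\n");
    return 0;
}
```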
2.3. Initial approximation
A variation of the method presented in [16] is used to obtain an initial approximation $R \approx 1/\sqrt{X}$. As in [16], it is assumed that $X = [1.x_1 x_2 \ldots x_n]$ ($x_i \in \{0, 1\}$). With this method, X is divided into two parts: $X_1 = [1.x_1 x_2 \ldots x_m]$ and $X_2 = [x_{m+1} x_{m+2} \ldots x_n] \cdot 2^{-m}$. The initial approximation is computed as

$$R = X' C' \qquad (4)$$

where $X' = [1.x_1 x_2 \ldots x_m\, \overline{x_{m+1}}\, x_{m+1}\, \overline{x_{m+2}} \ldots \overline{x_{n_{X'}-1}}]$, $C' = X_1^{-3/2} - 3 \cdot 2^{-m-2} X_1^{-5/2} + 33 \cdot 2^{-2m-6} X_1^{-7/2}$, and $n_{X'}$ denotes the number of bits in the fractional part of X'. The values for X' are obtained from X by complementing $n_{X'} - m$ bits of X. The values for C' are precomputed,
rounded to nearest, and stored in a table (e.g., a RAM or ROM) that is addressed with the bits $[x_1 x_2 \ldots x_m]$. For floating point numbers, the least significant bit of the exponent is also used to index the table. If the exponent is odd, the value for C' is multiplied by $1/\sqrt{2}$ before it is rounded and stored in the table.
This method varies slightly from the method presented in [16], because only the $n_{X'}$ most significant fractional bits of X' are used and R is computed using truncated multiplication. Although these modifications result in a small amount of additional error, they significantly reduce the amount of hardware required for the multiplication.
The maximum absolute error in the initial approximation is bounded by

$$\epsilon_R < 3 \cdot 2^{-2m-6} X_1^{-5/2} + \epsilon_{C'} X' + \epsilon_{X'} C' + \epsilon_{C'} \epsilon_{X'} + \epsilon_{Rt} \qquad (5)$$

where the first term is the approximation error [16], $\epsilon_{C'}$ is the error due to rounding C', $\epsilon_{X'}$ is the error due to using only the $n_{X'}$ most significant fractional bits of X', and $\epsilon_{Rt}$ is the error due to computing R with a truncated multiplier.
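As a numerical illustration, the C sketch below evaluates this seed for m = 5 entirely in double precision, so only the approximation error is visible, not the word-size truncations. It relies on the forms of X' and C' given above, replacing the bit-level operand modification with the equivalent arithmetic expression $X' \approx X_1 + 3 \cdot 2^{-m-2} - X_2/2$; the measured maximum error should be on the order of the first term of Equation 5.

```c
/* Double-precision sketch of the Section 2.3 seed for m = 5.  X' is
   formed arithmetically (X' ~ X1 + 3*2^-(m+2) - X2/2, matching the bit
   modification above), and C' comes from the series for C'. */
#include <stdio.h>
#include <math.h>

int main(void) {
    const int m = 5;
    double max_err = 0.0;
    for (long i = 0; i < (1L << 18); i++) {
        double X  = 1.0 + (double)i / (1L << 18);    /* X in [1, 2)  */
        double X1 = floor(X * (1 << m)) / (1 << m);  /* [1.x1...xm]  */
        double X2 = X - X1;
        double Xp = X1 + 3.0 * ldexp(1.0, -m - 2) - X2 / 2.0;
        double Cp = pow(X1, -1.5)
                  - 3.0  * ldexp(pow(X1, -2.5), -m - 2)
                  + 33.0 * ldexp(pow(X1, -3.5), -2 * m - 6);
        double err = fabs(Xp * Cp - 1.0 / sqrt(X));
        if (err > max_err) max_err = err;
    }
    printf("max seed error = %.3g (3*2^-16 = %.3g)\n",
           max_err, 3.0 * ldexp(1.0, -16));
    return 0;
}
```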
2.4. Modified Newton-Raphson iteration
After the initial approximation R is computed, a modified iteration of the Newton-Raphson inverse square root
algorithm is used to improve its accuracy. As presented
in [28] and [29], the standard Newton-Raphson inverse square root algorithm uses the iterative equation

$$R_{i+1} = R_i (3 - X R_i^2)/2 \qquad (6)$$

Implementing a standard Newton-Raphson iteration requires a square, two multiplies, a subtract, and a one-bit right shift.
The method presented in this paper uses a modified Newton-Raphson iteration, which can be expressed as

$$W = R^2$$
$$D = 1 - WX$$
$$Y = R + RD/2$$

Implementing this iteration requires one square ($W = R^2$), one multiply-complement operation ($D = 1 - WX$), and one multiply-add operation ($Y = R + RD/2$). Dividing RD by 2 is implemented by shifting RD one extra position to the right before adding it to R.
To further reduce the area, delay, and power dissipation, the following optimizations are used.

- $W = R^2$ is computed by a specialized squaring unit.

- $D = 1 - WX$ is computed by complementing the bits of WX.

- Since $W \approx 1/X$, D is close to zero and its most significant bits are all zeros (when $D \geq 0$) or all ones (when $D < 0$). Since only one of these bits is required to indicate the sign, the other bits are not computed.

- Since D has least significant bits that are negligible, WX is computed using truncated multiplication.

- Since RD/2 is close to zero and its least significant bits are negligible when added to R, RD/2 is computed using truncated multiplication.

These optimizations are illustrated in Section 3.
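A fixed-point C sketch of the optimized iteration follows. The Q-format scalings echo the word sizes used in Section 3 (16-bit R, 25-bit W, 26-bit D, 24-bit Y), but they are choices of this sketch: truncations are done arithmetically rather than by removing matrix columns, the seed is synthetic, and the bitwise complement stands in for the subtraction at the cost of one ulp, which the error analysis absorbs.

```c
/* Fixed-point sketch of the modified iteration with the optimizations
   above.  Word sizes are in the spirit of Section 3; the seed error is
   injected artificially. */
#include <stdio.h>
#include <stdint.h>
#include <math.h>

int main(void) {
    double Xr = 1.7;                              /* sample input, 1 <= X < 4 */
    double Yr = 1.0 / sqrt(Xr);
    uint64_t x = (uint64_t)(Xr * (1 << 22));      /* X as Q22                 */
    uint64_t r = (uint64_t)((Yr + ldexp(1.0, -15)) * (1 << 16));
                                                  /* R as Q16, seed ~2^-15 off */
    uint64_t w  = (r * r) >> 7;                   /* W = R^2, Q32 -> Q25      */
    uint64_t wx = (w * x) >> 21;                  /* WX, Q47 -> Q26 (~ 1.0)   */
    int64_t  d  = (int64_t)(~wx & ((1ULL << 27) - 1)) - ((int64_t)1 << 26);
                  /* D = 1 - WX via bit complement: ~z = (1 - ulp) - z in Q26 */
    int64_t  rd = (int64_t)r * d;                 /* RD, Q42                  */
    int64_t  y  = ((int64_t)r << 8) + (rd >> 19); /* Y = R + RD/2, Q24        */
                  /* >> 19 assumes arithmetic shift of the signed product     */
    printf("D = %.3g (leading bits all %s)\n",
           ldexp((double)d, -26), d < 0 ? "ones" : "zeros");
    printf("|Y - 1/sqrt(X)| = %.3g\n", fabs(ldexp((double)y, -24) - Yr));
    return 0;
}
```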
If the maximum absolute error in R is $\epsilon_R$ and the iteration is performed with infinite precision, then Y will have a maximum absolute error of

$$(3\sqrt{X}\,\epsilon_R^2 + X \epsilon_R^3)/2 \qquad (7)$$
If the maximum absolute errors due to truncating W, D, and Y are denoted as $\epsilon_{Wt}$, $\epsilon_{Dt}$, and $\epsilon_{Yt}$, respectively, then the error in the final approximation is bounded by

$$\epsilon_Y < (3\sqrt{X}\,\epsilon_R^2 + X \epsilon_R^3 + \sqrt{X}\,\epsilon_{Wt} + X \epsilon_R \epsilon_{Wt} + \epsilon_R \epsilon_{Dt} + \epsilon_{Dt}/\sqrt{X})/2 + \epsilon_{Yt} \qquad (8)$$

By controlling each of these errors, Y is computed with a maximum absolute error of less than one ulp.
3. Inverse square root unit
Based on the method and error analysis presented in the
previous section, the design of an inverse square root unit
is developed. The results of the error analysis establish an
initial hardware design. After this, exhaustive simulations
determine where modifications can be made to further reduce the design’s hardware requirements, while still guaranteeing that all results are accurate to one ulp.
The inverse square root unit provides approximations for numbers in the IEEE single precision floating point format [17]. With this format, a floating point number V is represented using a sign bit $s_v$, an 8-bit biased exponent $e_v$, and a 23-bit stored significand $f_v$. If V is a normalized number, it has the value

$$V = (-1)^{s_v} (1 + f_v) \cdot 2^{e_v - 127} \qquad (9)$$

where $1 \leq (1 + f_v) < 2$ and $1 \leq e_v \leq 254$. An exponent of zero is used to represent denormalized numbers and zero, and an exponent of 255 is used to represent quantities that are infinite or not-a-number (NaN) [17]. The inverse square root unit takes V and computes $Z \approx 1/\sqrt{V}$, which has a sign bit $s_z$, an 8-bit biased exponent $e_z$, and a 23-bit stored significand $f_z$.
To handle IEEE single precision numbers with both even and odd exponents, the method presented in the previous section is modified slightly and hardware is added to take care of special cases. For normalized positive numbers, if the biased exponent $e_v$ is even, $X = 2(1 + f_v)$; otherwise, $X = 1 + f_v$. Since $1 \leq X < 4$, $0.5 < Y \leq 1$. As is done in [1], additional hardware is used to handle the special case $X = Y = 1$ (i.e., $f_v = 0$ and $e_v$ is odd). Unless $X = 1$ or V is not a positive normalized number, $f_z$ is obtained by shifting Y left one bit to normalize it and extracting the leading one, which can be expressed mathematically as $f_z = 2Y - 1$. Additional hardware is also needed to handle negative numbers, and numbers that are not normalized. Table 2 shows how the values for $e_z$ and $f_z$ are determined based on the values of $e_v$ and $f_v$, for $V \geq 0$. If $V < 0$, the result is set to NaN, and an invalid operation exception flag i is set. If V is a denormalized number (Case 2), a denormalized input flag d is set, which causes a software trap to occur. If $f_z$ should not be set to $f_v$ (Cases 6 and 7), the approximate flag a is set. Since V and Z always have the same sign for $V \geq 0$ and the sign of Z is not defined for $V < 0$, $s_z$ is always set to $s_v$.
Case | V          | $e_v$          | $f_v$ | Z          | $e_z$           | $f_z$
1    | 0          | 0              | 0     | +∞         | 255             | $f_v$
2    | Denorm     | 0              | ≠ 0   | Trap for denormalized input
3    | +∞         | 255            | 0     | +0         | 0               | $f_v$
4    | NaN        | 255            | ≠ 0   | NaN        | 255             | $f_v$
5    | Normalized | 1 to 253, odd  | 0     | Normalized | $(381 - e_v)/2$ | $f_v$
6    | Normalized | 1 to 253, odd  | ≠ 0   | Normalized | $(379 - e_v)/2$ | $2/\sqrt{1 + f_v} - 1$
7    | Normalized | 2 to 254, even | any   | Normalized | $(380 - e_v)/2$ | $2/\sqrt{2(1 + f_v)} - 1$

Table 2. Determining $e_z$ and $f_z$ from $e_v$ and $f_v$.

The block diagram of an inverse square root unit for single-precision floating point numbers is shown in Figure 3. Busses are shown in bold and the width of each bus is given. This unit performs an inverse square root approximation in only five cycles. It is pipelined so that a new result can be produced each cycle. In the first cycle, the value of C' is obtained by a table lookup. In parallel with this, the exponent/exception logic computes $e_z$ and sets the values of the d, i, and a flags. In the second cycle, a modified multiplier computes $R = C'X'$. In the third cycle, a squaring unit computes $W = R^2$. In the fourth cycle, a multiply-complement unit computes $D = 1 - WX$. Rather than passing both X and $f_v$ in the pipeline, only $f_v$ is passed, and X is determined from $f_v$ and the least significant bit of $e_v$, which is denoted as o. In the fifth cycle, the multiply-add unit computes $Y = R + RD/2$ and $f_z$ is selected as either $f_v$ or $2Y - 1$, based on a.

[Figure 3. Inverse square root unit ($Z \approx 1/\sqrt{V}$): pipelined datapath with table lookup and exponent/exception logic, modified multiplier, squaring unit, multiply-complement unit, multiply-add unit, and output multiplexor.]
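For illustration, the C function below mirrors the case analysis of Table 2 for $V \geq 0$. The structure fields and flag names are hypothetical stand-ins for the exponent/exception logic described in Section 3.1.

```c
/* Behavioural mirror of Table 2's case analysis for V >= 0.  The
   hardware uses a 9-bit subtractor and multiplexors (Section 3.1). */
#include <stdio.h>

typedef struct {
    unsigned ez;         /* biased result exponent                     */
    int trap_d;          /* denormalized-input trap (case 2)           */
    int approx_a;        /* result significand comes from the datapath */
} zinfo;

static zinfo isqrt_cases(unsigned ev, unsigned fv) {
    zinfo z = {0, 0, 0};
    if (ev == 0 && fv == 0)        z.ez = 255;            /* 1: 1/sqrt(+0) = +inf */
    else if (ev == 0)              z.trap_d = 1;          /* 2: denormal: trap    */
    else if (ev == 255 && fv == 0) z.ez = 0;              /* 3: 1/sqrt(inf) = +0  */
    else if (ev == 255)            z.ez = 255;            /* 4: NaN in, NaN out   */
    else if ((ev & 1) && fv == 0)  z.ez = (381 - ev) / 2; /* 5: exact, fz = fv    */
    else if (ev & 1) { z.ez = (379 - ev) / 2; z.approx_a = 1; }  /* 6 */
    else             { z.ez = (380 - ev) / 2; z.approx_a = 1; }  /* 7 */
    return z;
}

int main(void) {
    zinfo z = isqrt_cases(127 + 4, 0);   /* V = 2^4 = 16, case 5 */
    printf("ez = %u, i.e. Z = 2^%d = 0.25\n", z.ez, (int)z.ez - 127);
    return 0;
}
```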
3.1. Exponent logic and initial approximation
The exponent/exception logic takes as inputs $s_v$, $e_v$, and $f_v$, and computes $e_z$, a, d, and i. A 9-bit subtractor and multiplexors are used to compute $e_z$, as shown in Table 2. The unit also requires a modest amount of combinational logic to determine special cases and set the values of the a, d, and i flags.
The table lookup for the initial approximation is implemented as a 64-word by 16-bit read-only memory (ROM). The ROM takes as inputs the 5 most significant bits of $f_v$, which correspond to $[x_1 x_2 \ldots x_5]$, and the least significant bit of $e_v$, which determines if the exponent is even or odd. The table lookup produces C', as described in Section 2.3.
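One way to generate the ROM contents is sketched below in C. The C' series is the one reconstructed in Section 2.3; the Q16 word encoding and the address packing (exponent bit in the low address position) are assumptions of this sketch. The $1/\sqrt{2}$ scaling applies when the unbiased exponent is odd, i.e., when the biased exponent $e_v$ is even.

```c
/* Sketch of generating the 64-word x 16-bit C' ROM.  Address packing
   (low bit = LSB of e_v) and the Q16 output encoding are assumptions. */
#include <stdio.h>
#include <math.h>

int main(void) {
    const int m = 5;
    for (int addr = 0; addr < 64; addr++) {
        int o = addr & 1;                      /* LSB of e_v   */
        double X1 = 1.0 + (addr >> 1) / 32.0;  /* [1.x1...x5]  */
        double Cp = pow(X1, -1.5)
                  - 3.0  * ldexp(pow(X1, -2.5), -m - 2)
                  + 33.0 * ldexp(pow(X1, -3.5), -2 * m - 6);
        /* e_v even <=> unbiased exponent odd <=> X = 2(1 + f_v), so the
           stored entry absorbs an extra factor of 1/sqrt(2) */
        if (!o) Cp *= 1.0 / sqrt(2.0);
        unsigned word = (unsigned)floor(Cp * 65536.0 + 0.5); /* Q16, round */
        printf("%2d: 0x%04x\n", addr, word);
    }
    return 0;
}
```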
The modified multiplier, which computes $R = C'X'$, is implemented as a 16 by 16 truncated multiplier, in which k = 3 and the product is truncated to 16 bits. Prior to performing the multiplication, the modified operand

$$X' = [1.x_1 x_2 \ldots x_5\, \overline{x_6}\, x_6\, \overline{x_7} \ldots \overline{x_{14}}] \qquad (10)$$

is obtained from the 14 most significant bits of $f_v$, which correspond to $[x_1 \ldots x_{14}]$. Eliminating the least significant columns of the multiplier reduces its delay and area by approximately 11% and 31%, respectively.

Based on Equation 5, the maximum absolute error in the initial approximation is bounded by

$$\epsilon_R < 3 \cdot 2^{-16} + 2^{-16} + 2^{-15} + 2^{-32} + 2^{-16} < 2^{-13} \qquad (11)$$

since m = 5, $1 < X' < 2$, $0.25 < C' < 1$, $\epsilon_{X'} < 2^{-15}$, $\epsilon_{C'} < 2^{-17}$, and $\epsilon_{Rt} < 2^{-16}$. Exhaustive simulation of the initial approximation reveals that this conservative upper bound on the error never occurs and that the maximum absolute error is bounded by $\epsilon_R < 2^{-14}$.
3.2. Newton-Raphson iteration hardware
A specialized 16-bit squaring unit computes $W = R^2$ and truncates W to 25 bits. The maximum absolute error from truncating the result is bounded by $\epsilon_{Wt} < 2^{-25}$. Compared to a conventional 16 by 16 multiplier, the 16-bit squaring unit has approximately 14% less delay and 47% less area.
The multiply-complement unit performs the operation $D = 1 - WX$, which corresponds to a 25 by 25 truncated multiplication, in which the bits of the product are inverted. If o = 1, the unbiased exponent is even and $X = 1 + f_v$. Otherwise, $X = 2(1 + f_v)$ and a one-bit left shift of $1 + f_v$ is required. For the truncated multiply-complement unit, k = 3 and the result is first truncated to 26 bits and then complemented. Figure 4 shows the multiplication matrix for this unit. With truncated multiplication, the 20 least significant columns of the multiplication matrix are eliminated. Furthermore, simulations show that $|D| < 2^{-13}$, which means that the 13 most significant bits are identical. Therefore, the 12 most significant bits of D are not needed and the columns of the multiplication matrix for computing these bits are omitted. Since the result is truncated to 26 fractional bits and the 12 most significant columns of the multiplication matrix are omitted, only 14 bits of WX are complemented and passed to the next stage of computation. The non-shaded portion of the matrix in Figure 4 corresponds to columns that are omitted. Omitting these columns reduces the delay and area of the multiply-complement unit by approximately 22% and 42%, respectively. The maximum absolute error introduced by the truncated multiply-complement unit is bounded by $\epsilon_{Dt} < 2^{-26}$.
The multiply-add unit computes $Y = R + RD/2$, which corresponds to a 16 plus 16 by 14 multiply-add operation. The multiplication matrix for this unit is shown in Figure 5. Since Y is always positive, and D can be either positive or negative, the product is sign extended, as described in [15]. Since $|D/2| < 2^{-14}$, the product RD is right-shifted 14 bits when adding it to R. For this unit, k = 3 and the final result is truncated to 24 bits. Omitting the least significant columns of the multiply-add unit reduces its delay and area by 8% and 63%, respectively. Since $0.5 < Y < 1.0$, Y is normalized by shifting it left one bit and removing the leading one to produce $f_z$. The maximum absolute error introduced by the truncated multiply-add unit is bounded by $\epsilon_{Yt} \leq 2^{-24}$.
[Figure 4. Multiplication matrix for the multiply-complement unit, $D = 1 - WX$ (omitted columns shown non-shaded).]

[Figure 5. Multiplication matrix for the multiply-add unit, $Y = R + RD/2$.]

Based on Equation 8, the maximum absolute error in Y is bounded by

$$\epsilon_Y < 3 \cdot 2^{-28} + 2^{-41} + 2^{-26} + 2^{-41} + 2^{-41} + 2^{-27} + 2^{-24} < 2^{-23} \qquad (12)$$

since $1 < X < 4$, $\epsilon_R < 2^{-14}$, $\epsilon_{Wt} < 2^{-26}$, $\epsilon_{Dt} < 2^{-26}$, and $\epsilon_{Yt} \leq 2^{-24}$. Exhaustive simulation over $1 < X < 4$ shows that this conservative upper bound on the error never occurs and that the maximum absolute error is bounded by $\epsilon_Y < 2^{-24}$. Thus, the maximum absolute error in the final result is less than one ulp.
3.3. Implementation and Comparison
The inverse square root unit was synthesized using the
Synopsys Module Compiler and a 2.5 Volt, 0.25 micron
CMOS standard cell library, which has four levels of metal.
The area and critical path delay of each component, and
the total area and cycle time for the unit are given in Table 3. The truncated multipliers use a variation of the Reduced Area Multiplier reduction scheme [21], followed by a
carry lookahead adder. The table lookup uses a high-speed
ROM. The inverse square root unit has an area of 0.41 mm2
and a cycle time of 6.7 ns.
Component               | Area (mm²) | Delay (ns)
Exponent/Exception Unit | 0.03       | 1.8
Table-Lookup            | 0.04       | 1.4
Modified Multiplier     | 0.07       | 4.7
Squaring Unit           | 0.03       | 2.6
Multiply-Complement     | 0.13       | 5.6
Multiply-Add            | 0.08       | 6.0
Total Area/Cycle Time   | 0.41       | 6.7

Table 3. Area and delay.

Component               | Area (mm²) | Delay (ns)
Exponent/Exception Unit | 0.03       | 1.8
Table-Lookup            | 0.10       | 2.0
Multiply-Complement     | 0.07       | 5.1
1st Multiplier          | 0.06       | 4.9
2nd Multiplier          | 0.10       | 6.0
3rd Multiplier          | 0.23       | 6.6
Multiply-Add            | 0.11       | 6.3
Total Area/Cycle Time   | 0.72       | 7.3

Table 4. Area and delay for unit in [1].

For comparison purposes, Table 4 gives area and delay estimates for the inverse square root unit presented in [1]. The unit in [1] is similar to the unit presented in this paper, except it uses a less accurate initial approximation, requires
two Newton-Raphson iterations, and does not employ specialized truncated multiplication and squaring units. Furthermore, it only works for normalized single precision
numbers. To allow for a fair comparison, the area and delay
estimates given in Table 4 are normalized using data from
the same standard cell library. They also assume that the
unit has been modified to produce one result per cycle and
handle all types of single precision numbers. Compared to
the inverse square root unit presented in this paper, the inverse square root unit presented in [1] has a cycle time that
is approximately 9% longer and it requires approximately
77% more area. To achieve the cycle time given in Table 4,
it has a latency of six cycles.
4. Conclusions
This paper presents a high-speed method for computing inverse square roots. The method uses an accurate initial approximation technique, followed by a modified Newton-Raphson iteration. Based on this method, the design of an inverse square root unit for IEEE single precision floating point numbers is developed. This unit has a cycle time of 6.7 ns, an area of 0.41 mm², a latency of five cycles, and a throughput of one result per cycle. By utilizing truncated multipliers and specialized squaring units, the area, delay, and power consumption of the design are significantly reduced. Error analysis and exhaustive behavioral-level simulation over the significand range ensure that the inverse square root unit produces results with a maximum error of less than one ulp.
Acknowledgments
This material is based upon work supported by the National Science Foundation under Grant No. MIP-9703421.
References
[1] H. Kwan, R. L. Nelson, and E. E. Swartzlander, Jr., "Cascaded Implementation of an Iterative Inverse-Square-Root Algorithm with Overflow Lookahead," in Proceedings of the 12th Symposium on Computer Arithmetic, pp. 114–123, 1995.
[2] R. W. Stewart, R. Chapman, and T. Durrani, “The
Square Root in Signal Processing,” in Proceedings of
Real-Time Signal Processing XII, pp. 89–100, 1989.
[3] T. R. Halfhill, “Beyond MMX x86 Extensions,” Byte,
vol. 22, no. 12, pp. 87–92, 1997.
[4] V. K. Jain, G. E. Perez, and J. M. Wills, “Novel Reciprocal and Square-Root VLSI Cell: Architecture and
Application to Signal Processing,” in International
Conference on Acoustics, Speech and Signal Processing, vol. 2, pp. 1201–1204, 1991.
[5] P. Soderquist and M. Leeser, “Area and Performance
Tradeoffs in Floating-Point Divide and Square Root
Implementations,” ACM Computing Surveys, vol. 28,
no. 3, pp. 518–564, 1996.
[6] AltiVec Technology Programming Environments Manual. Motorola, Inc., 1998.
[7] S. Oberman, F. Weber, N. Juffa, and G. Favor, “AMD
3DNow! Technology and the K6-2 Microprocessor,”
in Proceedings of Hot Chips 10, pp. 245–254, 1998.
[8] P. Montuschi and P. M. Mezzalama, “Survey of Square
Rooting Algorithms,” IEE Proceedings E (Computers
and Digital Techniques), vol. 137, no. 1, pp. 31–40,
1990.
[9] I. Koren, Computer Arithmetic Algorithms. Prentice
Hall, 1993.
[10] M. D. Ercegovac and T. Lang, Division and Square
Root: Digit-Recurrence Algorithms and Implementations. Kluwer Academic Publishers, 1994.
[11] S. Majerski, "Square-Rooting Algorithms for High-Speed Digital Circuits," IEEE Transactions on Computers, vol. C-34, pp. 724–733, August 1985.
[12] M. D. Ercegovac and T. Lang, "Radix-4 Square Root Without Initial PLA," IEEE Transactions on Computers, vol. C-39, pp. 1016–1024, 1990.
[13] V. K. Jain and L. Lin, “Square-Root, Reciprocal,
Sine/Cosine, Arctangent cell for Signal and Image
Processing,” in International Conference on Acoustics, Speech and Signal Processing, vol. 2, pp. 521–
524, 1994.
[14] J. Cao and B. Wei, “High-Performance Hardware for
Function Generation,” in Proceedings of the 13th Symposium on Computer Arithmetic, pp. 184–189, 1997.
[15] M. J. Schulte and E. E. Swartzlander, Jr., “Hardware
Designs for Exactly Rounded Elementary Functions,”
IEEE Transactions on Computers, vol. C-44, pp. 964–
973, 1994.
[16] N. Takagi, “Generating a Power of an Operand by
a Table Look-Up and a Multiplication,” in Proceedings of the 13th Symposium on Computer Arithmetic,
pp. 126–131, 1997.
[17] IEEE, Std 754-1985 for Binary Floating-Point Arithmetic. Standards Committee of The IEEE Computer
Society. 345 East 47th Street, New York, NY 10017,
USA, 1985.
[18] M. J. Schulte, J. E. Stine, and K. E. Wires, "High-Speed Reciprocal Approximations," in Proceedings of the Thirty First Asilomar Conference on Signals, Circuits and Systems, pp. 1178–1182, 1998.
[19] M. J. Schulte and E. E. Swartzlander, Jr., "Truncated Multiplication with Correction Constant," in VLSI Signal Processing, VI, pp. 388–396, 1993.
[20] E. J. King and E. E. Swartzlander, Jr., “Data Dependent Truncated Scheme for Parallel Multiplication,”
in Proceedings of the Thirty First Asilomar Conference on Signals, Circuits and Systems, pp. 1178–1182,
1998.
[21] K. Bickerstaff, M. J. Schulte, and E. E. Swartzlander,
Jr., “Parallel Reduced Area Multipliers,” Journal of
VLSI Signal Processing, vol. 9, pp. 181–192, 1995.
[22] M. J. Schulte, J. G. Jansen, and J. E. Stine, "Implementing Truncated Multipliers," to be submitted to the IEEE Transactions on Circuits and Systems II: Analog and Digital Signal Processing, 1999.
[23] M. J. Schulte, J. G. Jansen, and J. E. Stine, "Reduced Power Dissipation Through Truncated Multiplication," in IEEE Alessandro Volta Memorial International Workshop on Low Power Design, 1999 (in press).
[24] T. C. Chen, “A Binary Multiplication Scheme Based
on Squaring,” IEEE Transactions on Computers,
vol. C-20, pp. 678–680, 1971.
[25] R. H. Strandberg et al., “Efficient Realizations of
Squaring Circuit and Reciprocal used in Adaptive
Sample Rate Notch Filters,” Journal of VLSI Signal
Processing, vol. 14, no. 3, pp. 303–309, 1996.
[26] L. Dadda, “Some Schemes for Parallel Multipliers,”
Alta Frequenza, vol. 34, pp. 349–356, 1965.
[27] M. J. Schulte, K. E. Wires, and L. P. Marquette, “Implementing Squaring Units,” to be submitted to the International Conference on Computer Design, 1999.
[28] W. James and P. Jaratt, “The Generation of Square
Roots on a Computer With Rapid Multiplication Compared with Division,” Mathematics of Computation,
vol. 19, pp. 497–500, 1965.
[29] C. V. Ramamoorthy and J. Goodman, "Some Properties of Iterative Square-Rooting Methods Using High-Speed Multiplication," IEEE Transactions on Computers, vol. C-21, pp. 837–847, 1972.