An Efficient Hardware Implementation for a Reciprocal Unit
Andreas Habegger, Andreas Stahel, Josef Goette, and Marcel Jacomet
Bern University of Applied Sciences, MicroLab
CH-2501 Biel-Bienne, Switzerland
Abstract
The computation of the reciprocal of a numerical value is an important ingredient of many algorithms. We present a compact hardware architecture that computes reciprocals by two or three Newton-Raphson iterations to obtain the accuracy of the IEEE 754 single- and double-precision standard, respectively. We estimate the initialization value by a specially designed second-order polynomial approximating the reciprocal. By using a second-order polynomial, we succeed in using one single hardware architecture for both the polynomial-approximation computations and the Newton-Raphson iterations. We therefore obtain a very compact hardware implementation for the complete reciprocal computation.
Keywords: Arithmetic inversion, reciprocal, Newton-Raphson, polynomial initialization, Nelder-Mead,
hardware algorithm.
1. Introduction
Division is the most complicated of the four basic arithmetic operations. Nevertheless, hardware algorithms often need fast and compact division units, because division is time-, chip-area-, and power-consuming. Division can either be done directly, N/D, or by first computing the reciprocal of the denominator, 1/D, followed by a multiplication with the numerator N, a method that is especially useful if different numerators are to be divided by the same denominator. We concentrate on computing the reciprocal. An overview and comparison of various division algorithms is given in [1].
Division approximation algorithms are recursive in nature and can be grouped into algorithms with linear convergence and algorithms with quadratic convergence [2]. Non-restoring, restoring, SRT (Sweeney, Robertson, and Tocher [3, 4, 5]), and CORDIC [6, 7] are examples of linear convergence; reciprocation by Newton-Raphson and Goldschmidt's division by convergence [8] are examples of quadratic convergence. The algorithms with linear convergence suffer from a high latency; the algorithms with quadratic convergence are costly in terms of chip area and computational complexity.
Many efforts have been made to improve reciprocal division algorithms with quadratic convergence. A polynomial-based division algorithm is proposed in [9] for low resolutions. An improvement of Goldschmidt's division by convergence has recently been shown in [10]. Different improvements of the Newton-Raphson method have been published, some concentrating on so-called modified Newton-Raphson algorithms, others focusing on the initial start-value approximation of the Newton-Raphson algorithm. Look-up table solutions for the initial-value approximation are common [11], but by nature need quite large memories if high accuracy combined with a low iteration count is required. To reduce the look-up table memory size, [12] introduces the Symmetric Bipartite Table Method.
2. Architecture
The Newton-Raphson method is an iterative approximation algorithm. In iterative algorithms, high hardware efficiency with respect to chip area is achieved by using all hardware elements in every iteration cycle; no sleeping hardware block should be present. We distinguish two phases in the Newton-Raphson method: the computation of the initial start value and the iterative Newton-Raphson solution approximation. The hardware architectures presented so far use different hardware units for these two phases, so one of them is always inactive. Our approach to an efficient hardware implementation of a reciprocal unit is to use a single hardware unit for both the initial-value approximation and the Newton-Raphson approximation. With such an approach, the hardware unit is never partially in sleep mode, resulting in an optimally efficient use of the chip area without additional delay or latency.
The Newton-Raphson method for computing
the reciprocal of a numerical value is
\[
  x_{i+1} = x_i \cdot (2 - x_i \cdot D) = -D \cdot x_i^2 + 2 \cdot x_i + 0 \,.
  \tag{1}
\]
Observe that the right-hand side of the above equation is a second-order polynomial in x_i with the constant term being zero. Therefore, using a second-order polynomial for the initial-value computation, we can use the same hardware element for the polynomial computations as for the Newton-Raphson iterations; only a minor modification is needed, see Figure 1.
The second-order polynomial used to calculate the initial value x_0, approximating the reciprocal of D, is

\[
  x_0 = a \cdot D^2 + b \cdot D + c \,.
  \tag{2}
\]
As both equations (1) and (2) evaluate second-order polynomials, a hardware unit combining both of them is easily found, as Figure 1 shows. Setting the multiplexors to their right positions, the hardware unit calculates the initial-value approximation using the polynomial with coefficients a, b, and c. If the multiplexors are in their left positions, it computes Newton-Raphson iterations using the data on the feedback path.
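A minimal Python sketch of this shared evaluation (plain floating point, illustrative only, not the fixed-point datapath itself; the coefficients a, b, c are those derived in Section 3) makes the reuse explicit: both phases evaluate the same form α·y² + β·y + γ, with the multiplexors merely switching the operands.

```python
def shared_poly(alpha, beta, gamma, y):
    """Evaluate alpha*y^2 + beta*y + gamma, the form shared by (1) and (2)."""
    return (alpha * y + beta) * y + gamma  # Horner form: two MULs, two ADDs

def reciprocal(D, iterations=2, a=2.58586, b=-5.81818, c=4.24242):
    """Approximate 1/D for 0.5 < D <= 1 on the shared evaluator."""
    x = shared_poly(a, b, c, D)           # first cycle: initial value, eq. (2)
    for _ in range(iterations):
        x = shared_poly(-D, 2.0, 0.0, x)  # Newton-Raphson step, eq. (1)
    return x

print(reciprocal(0.75))  # about 1.3333333, i.e., 1/0.75
```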
Figure 1. Efficient hardware implementation for a reciprocal unit, combining initial-value and Newton-Raphson approximation in a single hardware element. (The diagram shows the multiplexed multiply-add datapath with its clocked register and the bit widths discussed below.)

Our goal is to achieve the approximation precision needed for single-precision IEEE 754 floating-point values in two Newton-Raphson iterations, and the precision needed for double-precision floating-point values in three Newton-Raphson iterations. The Newton-Raphson method doubles its precision with every iteration; thus the 24 bits needed for a 32-bit single-precision floating-point number require an initial-value precision of more than 6 bits, and the 53 bits needed for a 64-bit double-precision floating-point number require an even higher initial-value precision of more than 6.625 bits. The goal is thus to find a second-order polynomial that achieves the requested initial precision of 6.625 bits (see the discussion in Section 3).
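A quick plausibility check of the doubling rule (our own arithmetic, using only the numbers above: a start of b₀ bits yields b₀ · 2ᵏ bits after k iterations):

```python
b0 = 6.625        # initial-value precision in bits
print(b0 * 2**2)  # 26.5 -> covers the 24 bits of single precision
print(b0 * 2**3)  # 53.0 -> covers the 53 bits of double precision
```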
According to the above arguments, the bit widths indicated in Figure 1 are valid for single-precision accuracy using two Newton-Raphson iterations; to realize double-precision accuracy, the indicated widths must be replaced as follows: 14 → 27, 18 → 18, 25 → 54, 26 → 55, and 28 → 57. As mentioned above, we end up with a hardware unit that calculates the initial guess in a first cycle and then the final value in two or three further steps, depending on the desired accuracy.
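For reference, the width substitution can be captured as a small mapping (our notation; the text does not name the individual signals, so only the width pairs are given):

```python
# single-precision width -> double-precision replacement, per the text above
WIDTH_MAP = {14: 27, 18: 18, 25: 54, 26: 55, 28: 57}
```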
3. Theory
To determine the value of 1/D for some 1/2 < D ≤ 1, we solve the equation

\[
  f(x) = \frac{1}{x} - D = 0
\]

with the help of the Newton-Raphson iteration

\[
  x_{i+1} = x_i - \frac{f(x_i)}{f'(x_i)}
          = x_i - \frac{1/x_i - D}{-1/x_i^2}
          = x_i \,(2 - x_i \cdot D) \,.
\]
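The quadratic convergence can be made explicit by a short standard calculation: writing the iterate as x_i = (1 − ε_i)/D with relative error ε_i, one iteration squares the error,

\[
  x_{i+1} = x_i \,(2 - x_i D)
          = \frac{(1 - \varepsilon_i)(1 + \varepsilon_i)}{D}
          = \frac{1 - \varepsilon_i^2}{D} \,,
  \qquad \text{so } \varepsilon_{i+1} = \varepsilon_i^2 \,.
\]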
Once the value of x_i is close to the desired result 1/D, the iteration converges quadratically; therefore, the critical part is to start with a good initial value x_0. Since the hardware for the iteration must also be used for the initialization, we approximate the function 1/x in the interval 1/2 < x ≤ 1 with a second-order polynomial g(x) = a · x² + b · x + c, such that the maximal value of the relative error

\[
  E(a, b, c) = \left( a \cdot x^2 + b \cdot x + c - \frac{1}{x} \right) \cdot x
\]

is minimized.
As a good initial guess for the coefficients, we use the standard Chebyshev polynomial approximation of order 2, see, e.g., [13]. Next, a Nelder-Mead simplex optimization [14] generates the optimal values. We find

\[
  g(x) = a \cdot x^2 + b \cdot x + c \approx \frac{1}{x} \,,
\]

with a = 2.58586, b = −5.81818, and c = 4.24242.
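This optimization may be sketched as follows (a minimal reconstruction with NumPy/SciPy rather than our actual tooling; the dense grid approximates the maximum over the interval, and the rounded starting point stands in for the Chebyshev coefficients):

```python
import numpy as np
from scipy.optimize import minimize

xs = np.linspace(0.5, 1.0, 4001)  # dense sampling of the interval (1/2, 1]

def max_rel_err(coeffs):
    """Maximal relative error of g(x) = a*x^2 + b*x + c against 1/x."""
    a, b, c = coeffs
    return np.max(np.abs((a * xs**2 + b * xs + c - 1.0 / xs) * xs))

res = minimize(max_rel_err, x0=[2.6, -5.8, 4.2], method='Nelder-Mead')
print(res.x)                         # close to (2.58586, -5.81818, 4.24242)
print(-np.log2(max_rel_err(res.x)))  # at least the required 6.625 bits
```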
The optimality may be verified by the fact that the maximal relative error is attained at the two boundary points and at two interior points, as is also seen in Figure 2 (solid curve). Since

\[
  -\log_2\bigl( \lvert g(x) - 1/x \rvert \cdot x \bigr) \ge 6.625 \,,
\]

see the dotted level in Figure 2, we achieve the required accuracy of more than 6 bits for the initial values x_0 ≈ 1/D of the Newton-Raphson method, leading to a final 24-bit accuracy with two iterations; starting with the 6.625 bits and using three Newton-Raphson iterations, we achieve 53-bit accuracy.
Note that the above argumentation is based on floating-point arithmetic, whereas the resulting polynomial g(x) and the Newton-Raphson iterations are implemented in our hardware in fixed-point arithmetic. Therefore, we need an additional accuracy margin to minimize the bit widths in the hardware of Figure 1.
We have carried out the calculations above with the cost function (g(x) − 1/x) · x, the relative error, because each subsequent Newton-Raphson iteration squares this relative error. Because bit precision is not measured as a relative error but, instead, as an absolute number of bits, we have defined an alternative cost function. It uses the absolute error of the final result, taking into account two Newton-Raphson iterations; instead of the 6.625 · 2 · 2 = 26.5 bits, we now get 27.8 bits. Simple calculations lead to this improved cost function: (g(x) · x − 1)⁴ / x. The coefficients of the resulting polynomial g(x) are shown in Table 1. The dashed curve in Figure 2 shows the relative error of the initial polynomial approximation with the new cost function. Compared to the solid curve of the first cost function, we see that the leftmost local minimum at x = 0.5 is higher, whereas the remaining three local minima are lower. For the relevant absolute error of the final inversion result, the leftmost minimum has the largest influence.¹

¹ At x = 0.5, relative and absolute errors are identical, because both divide the error by 1/x = 1/0.5 = 2; at x = 1, the relative error divides only by 1/x = 1, whereas the absolute error still divides by 2.
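Switching the Nelder-Mead sketch from above to the improved cost is essentially a one-line change (again a reconstruction, not our original code; the exponent 4 models the two error-squaring iterations, and 8 models three):

```python
import numpy as np
from scipy.optimize import minimize

xs = np.linspace(0.5, 1.0, 4001)  # same sampling as in the previous sketch

def max_final_abs_err(coeffs, p=4):
    """Improved cost: absolute error of the final result; p = 4 models two
    error-squaring iterations (single precision), p = 8 three (double)."""
    a, b, c = coeffs
    g = a * xs**2 + b * xs + c
    return np.max(np.abs((g * xs - 1.0)**p / xs))

res = minimize(max_final_abs_err, x0=[2.6, -5.8, 4.2], method='Nelder-Mead')
print(res.x)                               # close to (2.65548, -5.92781, 4.28387)
print(-np.log2(max_final_abs_err(res.x)))  # about 27.8 bits
```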
Accordingly, we find the improved cost function for the three Newton-Raphson iterations needed for double precision: (g(x) · x − 1)⁸ / x. Again, the coefficients of the resulting polynomial g(x) are shown in Table 1. Here, instead of the 6.625 · 2 · 2 · 2 = 53 bits, we now achieve 53.5 bits.
  coefficient | single precision | double precision
  ------------+------------------+-----------------
  a           |  2.65548         |  2.60477
  b           | -5.92781         | -5.84702
  c           |  4.28387         |  4.25284

Table 1. Polynomial coefficients for the improved cost functions needed for single- and double-precision arithmetic, respectively.
Figure 2. Relative errors of the initial values obtained by the polynomial approximation (input value on the x-axis, relative error in bits on the y-axis). The solid curve is the precision for an optimized relative error; the dashed and dash-dotted curves are for the improved cost functions, targeted at single and double precision, respectively; the dotted line marks the 6.625-bit level. The inset zooms in on the vicinity of the left border.

Figure 3. Performance at various stages of our reciprocal hardware (input value on the x-axis, resolution in bits on the y-axis, with the 6-, 12-, and 24-bit levels marked). The lowest curve shows the resolution obtained by our polynomial initialization algorithm; the middle curve shows the resolution obtained with one Newton-Raphson step; and the top curve shows the resolution obtained after a second Newton-Raphson step.
4. Verification
To verify the correct functioning of the proposed reciprocal hardware unit, we have simulated its performance with Matlab/Simulink in fixed point. Figure 1 contains the needed number of bits at various positions inside the hardware structure. The shown parameterization is sufficient to obtain the resolution of IEEE 754 single-precision floating-point numbers.
Figure 3 gives the following results: the lowest curve shows the performance of the hardware when used to compute the initial value for the Newton-Raphson algorithm with our improved polynomial as described in Section 3;² the middle curve gives the reciprocal approximation with one Newton-Raphson step and said initial value; finally, the top curve is the reciprocal after two Newton-Raphson steps. Note, first, that we have simulated all input values between 0.5 exclusive and 1 inclusive that are representable with 24 bits; and, second, that the obtained 1/x approximations all have a resolution better than the targeted 24 bits.

² To be specific, this is the polynomial with the coefficients a = 2.62463, b = −5.87755, and c = 4.26392.
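A floating-point stand-in for this fixed-point simulation (a sketch only; it samples the interval instead of enumerating all 24-bit inputs, and it uses the coefficients from footnote 2):

```python
import numpy as np

a, b, c = 2.62463, -5.87755, 4.26392   # coefficients from footnote 2
D = np.linspace(0.5, 1.0, 100001)[1:]  # sample of the interval (0.5, 1]
x = a * D**2 + b * D + c               # first cycle: initialization polynomial
for _ in range(2):                     # two Newton-Raphson steps
    x = x * (2.0 - x * D)
# Resolution in bits from the relative error; clip exact zeros to avoid log2(0).
bits = -np.log2(np.maximum(np.abs(x * D - 1.0), 2.0**-60))
print(bits.min())                      # should exceed the targeted 24 bits
```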
5. Conclusions
We have presented a compact hardware unit for the computation of the reciprocal 1/x of a number x ∈ (0.5, 1]. The number x to be inverted is represented either by a 24-bit or by a 53-bit two's-complement representation, meaning that the hardware is useful as a subsystem for IEEE 754 single- and double-precision arithmetic.³ The hardware is used three times in succession (four times in the case of the double-precision target): in a first usage, it computes by polynomial approximation the initial value for the subsequent Newton-Raphson iterations; in a second usage, it performs the first Newton-Raphson step; and in a third usage, it computes the second Newton-Raphson step, giving the 24-bit representation of 1/x (a fourth usage of the hardware targeted at 53 bits computes the 1/x representation with a third Newton-Raphson step). As compared to many alternative reciprocal algorithms, our approach distinguishes itself in that it determines the initial value of the Newton-Raphson computations by a second-order polynomial approximation that uses the same hard-wired logic as the subsequent Newton-Raphson iterations themselves.

³ Note that we have two different hardware units, one for single and the other for double precision; these two hardware units have the same structure but use different bit-width parameterizations and different polynomial coefficients.
References
[1] S. F. Obermann and M. J. Flynn. Division algorithms and implementations. IEEE Transactions on Computers, 46(8):833–854, Aug. 1997.
[2] I. Koren. Computer Arithmetic Algorithms, 2nd Edition. A K Peters, Ltd., Natick, Mass., Jul. 2002.
[3] O. L. MacSorley. High-speed arithmetic in binary computers. Proceedings of the IRE, 49(1):67–91, Jan. 1961.
[4] James E. Robertson. A new class of digital division methods. IRE Transactions on Electronic Computers, EC-7:218–222, 1958.
[5] K. D. Tocher. Techniques of multiplication and division for automatic binary computers. The Quarterly Journal of Mechanics and Applied Mathematics, 11(3):364–384, 1958.
[6] Jack E. Volder. The birth of CORDIC. The Journal of VLSI Signal Processing, 25(2):101–105, June 2000.
[7] John Stephen Walther. The story of unified CORDIC. The Journal of VLSI Signal Processing, 25(2):107–112, June 2000.
[8] Robert E. Goldschmidt. Applications of division by convergence. Master's thesis, MIT, Cambridge, Mass., 1964.
[9] R. Hägglund, P. Löwenborg, and M. Vesterbacka. A polynomial-based division algorithm. In IEEE International Symposium on Circuits and Systems, ISCAS, volume 3, pages III-571–III-574, 2002.
[10] R. Goldberg, G. Even, and P. M. Seidel. An FPGA implementation of pipelined multiplicative division with IEEE rounding. In 15th Annual IEEE Symposium on Field-Programmable Custom Computing Machines, FCCM, pages 185–196, April 2007.
[11] U. Küçükkabak and A. Akkas. Design and implementation of reciprocal unit using table look-up and Newton-Raphson iteration. In Euromicro Symposium on Digital System Design, DSD, pages 249–253, 2004.
[12] M. J. Schulte, J. E. Stine, and K. E. Wires. High-speed reciprocal approximations. In Conference Record of the Thirty-First Asilomar Conference on Signals, Systems & Computers, volume 2, pages 1183–1187, Nov. 1997.
[13] Theodore J. Rivlin. An Introduction to the Approximation of Functions. Blaisdell Publishing Company, 1969. Reprinted by Dover.
[14] John A. Nelder and R. Mead. A simplex method for function minimization. Computer Journal, 7:308–313, 1965.