An Efficient Hardware Implementation for a Reciprocal Unit

Andreas Habegger, Andreas Stahel, Josef Goette, and Marcel Jacomet
Bern University of Applied Sciences, MicroLab, CH-2501 Biel-Bienne, Switzerland

Abstract

The computation of the reciprocal of a numerical value is an important ingredient of many algorithms. We present a compact hardware architecture that computes reciprocals by two or three Newton-Raphson iterations to obtain the accuracy of the IEEE 754 single- and double-precision standards, respectively. We estimate the initial value by a specially designed second-order polynomial approximating the reciprocal. By using a second-order polynomial, we succeed in using one single hardware architecture for both the polynomial-approximation computations and the Newton-Raphson iterations. We thereby obtain a very compact hardware implementation for the complete reciprocal computation.

Keywords: Arithmetic inversion, reciprocal, Newton-Raphson, polynomial initialization, Nelder-Mead, hardware algorithm.

1. Introduction

Division is the most complicated of the four basic arithmetic operations: it is time-, chip-area-, and power-consuming. Hardware algorithms therefore often need fast and compact division units. Division can either be done directly, N/D, or by first computing the reciprocal of the denominator, 1/D, followed by a multiplication with the numerator N; the latter method is especially useful if several numerators are to be divided by the same denominator. We concentrate on computing the reciprocal.

In [1], an overview and comparison of various division algorithms is given. Division approximation algorithms are recursive in nature and can be grouped into algorithms with linear convergence and algorithms with quadratic convergence [2]. Non-restoring, restoring, SRT (Sweeney, Robertson, and Tocher [3, 4, 5]), and CORDIC [6, 7] are examples of linear convergence; reciprocation by Newton-Raphson and Goldschmidt's division by convergence [8] are examples of quadratic convergence. Algorithms with linear convergence suffer from high latency; algorithms with quadratic convergence are costly in terms of chip area and computational complexity.

Many efforts have been made to improve reciprocal division algorithms with quadratic convergence. A polynomial-based division algorithm for low resolutions is proposed in [9]. An improvement of Goldschmidt's division-by-convergence algorithm has recently been shown in [10]. Various improvements of the Newton-Raphson method have been published, some concentrating on so-called modified Newton-Raphson algorithms, others focusing on the approximation of the initial start value. Look-up-table solutions for the initial value approximation are common [11], but they inherently need quite large memories if high accuracy combined with a low iteration count is required. To reduce the look-up-table memory size, [12] introduces the Symmetric Bipartite Table Method.

2. Architecture

The Newton-Raphson method is an iterative approximation algorithm. High hardware efficiency with respect to chip area is achieved in iterative algorithms by using all hardware elements in every iteration cycle; no sleeping hardware block should be present. We distinguish two phases in the Newton-Raphson method: the computation of the initial start value and the iterative Newton-Raphson solution approximation.
The hardware architectures presented so far use different hardware units for these two phases, so one of them is always inactive. Our approach to an efficient hardware implementation of a reciprocal unit is to use a single hardware unit for both the initial value approximation and the Newton-Raphson approximation. With such an approach, the hardware unit is never partially in sleep mode, resulting in an optimally efficient use of the chip area without additional delay or latency.

The Newton-Raphson iteration for computing the reciprocal of a numerical value D is

    x_{i+1} = x_i · (2 − x_i · D) = −D · x_i² + 2 · x_i + 0 .    (1)

Observe that the right-hand side of this equation is a second-order polynomial whose constant term is zero. Therefore, by also using a second-order polynomial for the initial value computation, we can use the same hardware element for the polynomial computation as for the Newton-Raphson iterations; only a minor modification is needed, see Figure 1. The second-order polynomial used to calculate the initial value x_0, approximating the reciprocal of D, is

    x_0 = a · D² + b · D + c .    (2)

As both equations, (1) and (2), evaluate second-order polynomials, a hardware unit combining them is easily found, as Figure 1 shows. With the multiplexors set to their right positions, the hardware unit calculates the initial value approximation, using the polynomial with coefficients a, b, and c; with the multiplexors in their left positions, it computes Newton-Raphson iterations using the data on the feedback path.

[Figure 1. Efficient hardware implementation for a reciprocal unit, combining the initial-value and Newton-Raphson approximations in a single hardware element. The datapath consists of multiplexors (MUX), multipliers (MUL), adders (ADD), and a clocked register (REG); the multiplexors select between the initialization coefficients and the Newton-Raphson operands, and the bit widths are annotated at the various positions.]

Our goal is to achieve the approximation precision needed for single-precision IEEE 754 floating-point values in two Newton-Raphson iterations, and the precision needed for double-precision floating-point values in three Newton-Raphson iterations. The Newton-Raphson method doubles its precision with every iteration; thus the 24 bits needed for a 32-bit single-precision floating-point number require an initial value precision of more than 6 bits, and the 53 bits needed for a 64-bit double-precision floating-point number require an even higher initial value precision of more than 6.625 bits. The goal is thus to find a second-order polynomial achieving the requested initial precision of 6.625 bits (see the discussion in Section 3). The number widths indicated in Figure 1 are valid for single-precision accuracy using two Newton-Raphson iterations; to realize double-precision accuracy, the indicated number widths must be replaced as follows: 14 → 27, 18 → 18, 25 → 54, 26 → 55, and 28 → 57. We thus end up with a hardware unit that calculates the initial guess in a first cycle and then, depending on the desired accuracy, the final value in two or three further steps, respectively.
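To make the shared-datapath idea concrete, the following small software model sketches the coefficient multiplexing. It is a floating-point sketch only (the hardware itself works in fixed point); the function name poly2 and the example input D = 0.7 are our own illustrations, and the coefficients a, b, c are those derived in Section 3.

    #include <stdio.h>

    /* One shared second-order-polynomial element, as in Figure 1:
     * evaluates p(t) = A*t^2 + B*t + C in Horner form. */
    static double poly2(double A, double B, double C, double t)
    {
        return (A * t + B) * t + C;
    }

    int main(void)
    {
        /* Coefficients of the initialization polynomial (Section 3). */
        const double a = 2.58586, b = -5.81818, c = 4.24242;
        const double D = 0.7;              /* example input, 0.5 < D <= 1 */

        double x = poly2(a, b, c, D);      /* cycle 1: Eq. (2), initial value  */
        x = poly2(-D, 2.0, 0.0, x);        /* cycle 2: Eq. (1), first NR step  */
        x = poly2(-D, 2.0, 0.0, x);        /* cycle 3: Eq. (1), second NR step */

        printf("approx 1/%g = %.10f (exact %.10f)\n", D, x, 1.0 / D);
        return 0;
    }

The same three-coefficient evaluator is reused in every cycle; only its operands are switched, which is precisely the role of the multiplexors in Figure 1.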
3. Theory

To determine the value of 1/D for some 1/2 < D ≤ 1, we solve the equation

    f(x) = 1/x − D = 0

with the help of the Newton-Raphson iteration

    x_{i+1} = x_i − f(x_i)/f′(x_i) = x_i − (1/x_i − D)/(−1/x_i²) = x_i · (2 − x_i · D) .

Once the value of x_i is close to the desired result 1/D, the iteration converges quadratically; the critical part is therefore to start with a good initial value x_0.

Since the hardware for the iteration must also be used for the initialization, we approximate the function 1/x in the interval 1/2 < x ≤ 1 with a second-order polynomial g(x) = a · x² + b · x + c such that the maximal value of the relative error is minimized; i.e., we minimize

    E(a, b, c) = max_{1/2 < x ≤ 1} | (a · x² + b · x + c − 1/x) · x | .

As a good initial guess for the coefficients we use the standard Chebyshev polynomial approximation of order 2, see, e.g., [13]. A Nelder-Mead simplex optimization [14] then generates the optimal values. We find

    g(x) = a · x² + b · x + c ≈ 1/x ,

with a = 2.58586, b = −5.81818, and c = 4.24242. The optimality may be verified by the fact that the maximal relative error is attained at the two boundary points and at two interior points, as is also seen in Figure 2 (the solid curve). Since

    −log₂( |g(x) − 1/x| · x ) ≥ 6.625 ,

see the dotted level in Figure 2, we achieve the more-than-6-bit accuracy required for the initial values x_0 ≈ 1/D of the Newton-Raphson method, leading to a final 24-bit accuracy with two iterations; starting with 6.625 bits and using three Newton-Raphson iterations, we achieve 53-bit accuracy. Note that the above argumentation is based on floating-point arithmetic, whereas the resulting polynomial g(x) and the Newton-Raphson iterations are implemented in our hardware in fixed-point arithmetic. Therefore, we need an additional accuracy margin to minimize the bit widths in the hardware of Figure 1.

We have carried out the above calculations with the cost function (g(x) − 1/x) · x, the relative error, because each subsequent Newton-Raphson iteration squares this relative error. Since bit precision is not measured as a relative error but as an absolute number of bits, we have defined an alternative cost function. It uses the absolute error of the final result, taking two Newton-Raphson iterations into account; instead of the 6.625 · 2 · 2 = 26.5 bits, we now get 27.8 bits. Simple calculations lead to the improved cost function (g(x) · x − 1)⁴ / x. The coefficients of the resulting polynomial g(x) are shown in Table 1. The dashed curve in Figure 2 shows the relative error of the initial polynomial approximation with this new cost function. Compared to the solid curve of the first cost function, the leftmost local minimum at x = 0.5 is higher, whereas the remaining three local minima are lower. For the relevant absolute error of the final inversion result, the leftmost minimum has the largest influence.¹ Accordingly, we find the improved cost function for the three Newton-Raphson iterations needed for double precision: (g(x) · x − 1)⁸ / x. Again, the coefficients of the resulting polynomial g(x) are shown in Table 1. Here we achieve, instead of the 6.625 · 2 · 2 · 2 = 53 bits, now 53.5 bits.

¹ At x = 0.5, relative and absolute errors are identical, because both divide the error by 1/x = 1/0.5 = 2; at x = 1 the relative error divides only by 1/x = 1, whereas the absolute error still divides by 2.

Table 1. Polynomial coefficients for the improved cost functions needed for single- and double-precision arithmetic, respectively.

    precision | a       | b        | c
    single    | 2.65548 | −5.92781 | 4.28387
    double    | 2.60477 | −5.84702 | 4.25284

[Figure 2. Relative errors (in bits, versus the input value) of the initial values obtained by the polynomial approximation. The solid curve is the precision for the optimized relative error; the dashed and dash-dotted curves are for the improved cost functions, targeted at single and double precision, respectively; the dotted level marks 6.625 bits. The inset zooms into the vicinity of the left border.]
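The claimed bound can be checked numerically. The following sketch (our own grid search, not the Nelder-Mead optimization itself) evaluates −log₂(|g(x) − 1/x| · x) for the minimax coefficients on a dense grid over (1/2, 1]:

    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        /* Minimax coefficients from the relative-error cost function. */
        const double a = 2.58586, b = -5.81818, c = 4.24242;
        const long   N = 1000000;          /* grid points over (1/2, 1] */
        double worst_bits = INFINITY;

        for (long k = 1; k <= N; ++k) {
            double x    = 0.5 + 0.5 * (double)k / (double)N;
            double g    = (a * x + b) * x + c;     /* g(x) in Horner form */
            double rel  = fabs(g - 1.0 / x) * x;   /* relative error      */
            double bits = -log2(rel);
            if (bits < worst_bits) worst_bits = bits;
        }
        /* Expected: slightly above the 6.625-bit level of Figure 2. */
        printf("worst-case initial precision: %.3f bits\n", worst_bits);
        return 0;
    }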
4. Verification

To verify the correct functioning of the proposed reciprocal hardware unit, we have simulated its performance with Matlab/Simulink in fixed point. Figure 1 contains the needed number of bits at the various positions inside the hardware structure. The shown parameterization is sufficient to obtain the resolution of IEEE 754 single-precision floating-point numbers.

Figure 3 gives the following results: the lowest curve shows the performance of the hardware when used to compute the initial value for the Newton-Raphson algorithm with our improved polynomial as described in Section 3.² The middle curve gives the reciprocal approximation after one Newton-Raphson step with this initial value. Finally, the top curve is the reciprocal after two Newton-Raphson steps. Note, first, that we have simulated all input values between 0.5 exclusive and 1 inclusive that are representable with 24 bits; and, second, that the obtained 1/x approximations all have a resolution better than the targeted 24 bits.

² To be specific, it is the polynomial with the coefficients a = 2.62463, b = −5.87755, and c = 4.26392.

[Figure 3. Performance at various stages of our reciprocal hardware (resolution in bits versus the input value, with levels at 6, 12, and 24 bits marked). The lowest curve shows the resolution obtained by our polynomial initialization algorithm; the middle curve shows the resolution obtained with one Newton-Raphson step; the top curve shows the resolution obtained after a second Newton-Raphson step.]
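The following double-precision sketch mimics this sweep in software; it is our own approximation of the fixed-point Simulink model, it uses the coefficients of footnote 2, and it subsamples the 2²³ inputs with a stride for brevity.

    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        /* Initialization polynomial used in the simulation (footnote 2). */
        const double a = 2.62463, b = -5.87755, c = 4.26392;
        double worst_bits = INFINITY;

        /* All 24-bit representable inputs in (0.5, 1] are k = 1 .. 2^23. */
        for (long k = 1; k <= (1L << 23); k += 97) {
            double D = 0.5 + (double)k / (double)(1L << 24);
            double x = (a * D + b) * D + c;    /* initial value   */
            x = x * (2.0 - x * D);             /* first NR step   */
            x = x * (2.0 - x * D);             /* second NR step  */
            double bits = -log2(fabs(x - 1.0 / D) * D);
            if (bits < worst_bits) worst_bits = bits;
        }
        /* Expected: above the targeted 24 bits for every sampled input. */
        printf("worst-case resolution: %.2f bits\n", worst_bits);
        return 0;
    }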
5. Conclusions

We have presented a compact hardware for the computation of the reciprocal 1/x of a number x ∈ (0.5, 1]. The number x to be inverted is represented either by a 24-bit or by a 53-bit two's-complement representation, meaning that the hardware is useful as a subsystem for IEEE 754 single- and double-precision arithmetic.³ The hardware is used three times in succession (four times in the case of the double-precision target): in a first usage it computes, by polynomial approximation, the initial value for the subsequent Newton-Raphson iterations; in a second usage it performs the first Newton-Raphson step; and in a third usage it computes the second Newton-Raphson step, giving the 24-bit representation of 1/x (a fourth usage of the hardware targeted at 53 bits computes the 1/x representation with a third Newton-Raphson step). Compared to many alternative reciprocal algorithms, our approach distinguishes itself in that it determines the initial value of the Newton-Raphson computations by a second-order polynomial approximation that uses the same hard-wired logic as the subsequent Newton-Raphson iterations themselves.

³ Note that we have two different hardware units, one for single and the other for double precision; the two units have the same structure but use different bit-width parameterizations and different polynomial coefficients.

References

[1] S. F. Obermann and M. J. Flynn. Division algorithms and implementations. IEEE Transactions on Computers, 46(8):833–854, Aug. 1997.
[2] I. Koren. Computer Arithmetic Algorithms, 2nd edition. A K Peters, Natick, Mass., Jul. 2002.
[3] O. L. MacSorley. High-speed arithmetic in binary computers. Proceedings of the IRE, 49(1):67–91, Jan. 1961.
[4] James E. Robertson. A new class of digital division methods. IRE Transactions on Electronic Computers, EC-7:218–222, 1958.
[5] K. D. Tocher. Techniques of multiplication and division for automatic binary computers. The Quarterly Journal of Mechanics and Applied Mathematics, 11(3):364–384, 1958.
[6] Jack E. Volder. The birth of CORDIC. The Journal of VLSI Signal Processing, 25(2):101–105, Jun. 2000.
[7] John Stephen Walther. The story of unified CORDIC. The Journal of VLSI Signal Processing, 25(2):107–112, Jun. 2000.
[8] Robert E. Goldschmidt. Applications of division by convergence. Master's thesis, MIT, Cambridge, Mass., 1964.
[9] R. Hägglund, P. Löwenborg, and M. Vesterbacka. A polynomial-based division algorithm. In IEEE International Symposium on Circuits and Systems (ISCAS), volume 3, pages III-571–III-574, 2002.
[10] R. Goldberg, G. Even, and P. M. Seidel. An FPGA implementation of pipelined multiplicative division with IEEE rounding. In 15th Annual IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM), pages 185–196, Apr. 2007.
[11] U. Küçükkabak and A. Akkas. Design and implementation of reciprocal unit using table look-up and Newton-Raphson iteration. In Euromicro Symposium on Digital System Design (DSD), pages 249–253, 2004.
[12] M. J. Schulte, J. E. Stine, and K. E. Wires. High-speed reciprocal approximations. In Conference Record of the Thirty-First Asilomar Conference on Signals, Systems & Computers, volume 2, pages 1183–1187, Nov. 1997.
[13] Theodore J. Rivlin. An Introduction to the Approximation of Functions. Blaisdell Publishing Company, 1969; reprinted by Dover.
[14] John A. Nelder and R. Mead. A simplex method for function minimization. Computer Journal, 7:308–313, 1965.