SINGLE PRECISION FLOATING POINT SQUARE ROOT COMPUTATION
Najib Ghatte, Shilpa Patil, Deepak Bhoir
Fr. Conceicao Rodrigues College of Engineering, Fr. Agnel Ashram, Bandstand, Bandra (W), Mumbai: 400 050, India
Abstract- The square root operation has found prominence in many digital signal processing applications, but it is elusive to implement on FPGA due to its complicated computation. Many iterative algorithms, including the restoring, non-restoring, and SRT algorithms, have been proposed. Most of them are implemented with slow or large components, which are less suitable for real-time applications than addition or multiplication components. This paper deals with a novel algorithm for square root computation of single precision floating point numbers. Verilog code is written and implemented on the Virtex-5 FPGA series. Significant enhancement in performance is observed in a comparative analysis with conventional algorithms.

Keywords- Single Precision, Binary Square Root, Vedic, Virtex, FPGA, Dvanda, IEEE-754.
I. INTRODUCTION

The term floating point implies that there is no fixed number of digits before and after the decimal point; i.e., the decimal point can float. Floating-point representations are slower and less accurate than fixed-point representations, but they can handle a larger range of numbers. Because arithmetic with floating-point numbers requires a great deal of computing power, many microprocessors include a dedicated chip, called a floating point unit (FPU), specialised for performing floating-point arithmetic. FPUs are also called math coprocessors and numeric coprocessors. Floating-point representation has an encoding scheme with three basic components: mantissa, exponent, and sign. The use of binary numeration and powers of 2 resulted in floating point numbers being represented as single precision (32-bit) and double precision (64-bit) floating-point numbers. Both single and double precision formats, as illustrated in Fig. 1, are defined by the IEEE 754 standard.

In the single precision format, 8 bits are reserved for the exponent, giving a bias value of +127, and 23 bits are reserved for the mantissa. When the sign bit is 1, it indicates a negative number; when it is 0, a positive number.

Fig. 1: IEEE-754 Floating-Point Representation Standards

A similar explanation extends to the double precision format, where exponents are biased by +1023. Division and square root are important operators in many digital signal processing (DSP) applications, including matrix inversion, vector normalization, and Cholesky decomposition. Floating-point divide and square root operators support many different floating-point formats, including the IEEE standard formats. Both modules demonstrate a good trade-off between area, latency, and throughput, and are fully pipelined to aid the designer in implementing fast, complex, pipelined designs. The most important goal of a designer is to enhance the performance of the ALU while reducing its design complexity, so as to obtain a better figure of merit. Due to the latency gap between addition/multiplication and division/square root, the latter operations increasingly become performance bottlenecks. As this performance gap widens, many signal and image processing algorithms involving floating-point operations have been rewritten to avoid the use of division and square root. Moreover, square root is difficult to implement in hardware. Poor implementations of floating-point division and square root therefore result in severe performance degradation.

Vedic mathematics is known to be more optimised and efficient than algorithms based on conventional logic. The sutras it defines can be used in digital design to improve the performance of ALUs based on conventional logic. The Dvanda Yoga sutra deals with square root computation. These sutras find their limitations when the number of bits increases, as in the IEEE-754 floating-point representation this paper deals with.

II. VARIOUS SQUARE ROOT ALGORITHMS

Researchers have proposed many algorithms and architectures to carry out square root computation in order to reduce the computational time and thus enhance performance.

A. Radix-2 SRT Algorithm
Named for its creators (Sweeney, Robertson, and Tocher), radix-2 SRT is an iterative method to compute the square root of a number. Each iteration involves a left shift and the addition of a digit. The algorithm is rather complex, especially at higher precision, and it may generate a wrong result value.

Proceedings of Fifth IRF International Conference, 10th August 2014, Goa, India, ISBN: 978-93-84209-45-2
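For concreteness, the IEEE-754 single-precision fields described in the Introduction (1 sign bit, 8 exponent bits biased by +127, 23 mantissa bits) can be unpacked in software. The following Python sketch is our own illustration of the format, not part of the paper's Verilog design:

```python
import struct

def decompose_single(x):
    """Split a float into IEEE-754 single-precision fields."""
    bits = struct.unpack('>I', struct.pack('>f', x))[0]
    sign = bits >> 31                  # 1 bit
    exponent = (bits >> 23) & 0xFF     # 8 bits, biased by +127
    mantissa = bits & 0x7FFFFF         # 23 bits, implicit leading 1
    return sign, exponent, mantissa

# 256.0 = 1.0 x 2^8, so the biased exponent is 8 + 127 = 135
sign, exp, man = decompose_single(256.0)
print(sign, exp, man)   # 0 135 0
```

The bit pattern for 256.0 here matches the input pattern used later in the simulation section.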
B. Use of Look-up Table
Algorithms such as the Newton-Raphson method, depicted in Eq. 1, can be used to carry out the computation and converge quickly, but the single precision division they require makes them tedious and degrades performance.

x_{n+1} = x_n − f(x_n) / f′(x_n)    (1)

To get rid of the complex division, a look-up table is incorporated. The Taylor/Maclaurin series expansion for the square root about a point x0 is considered, as formulated in Eq. 2:

√x ≈ √x0 + (x − x0) / (2√x0)    (2)

Then the range of possible values of the fraction f (0 to ~1) is divided into n sub-ranges by using log2(n) bits of f as an index into a table which contains the first two coefficients of the Taylor expansion of the square root of the mantissa (1.0 to ~2) over that sub-range.
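As an illustration of Eq. 1 (our own sketch, not code from the paper), specialising f(x) = x² − a gives the classic iteration x ← (x + a/x)/2; the division performed on every step is precisely the cost the look-up-table approach seeks to avoid:

```python
def newton_sqrt(a, x0=1.0, iters=8):
    """Newton-Raphson on f(x) = x^2 - a: x <- x - f(x)/f'(x) = (x + a/x)/2."""
    x = x0
    for _ in range(iters):
        x = 0.5 * (x + a / x)   # one costly division per iteration
    return x
```

Convergence is quadratic, so a few iterations from a rough initial guess suffice; e.g. newton_sqrt(2.0) ≈ 1.4142135.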
C. Vedic Approach: Dvanda Yoga Sutra
The Dvanda Yoga sutra is a Vedic sutra used as an alternative for simplified mathematical computation. Fig. 2 shows the computational steps to evaluate the square root of 17689, which comes out as 133:

1. Group the digits in pairs from right to left. Since 17689 has 5 digits, the first group has a single digit (1). Write the groups down as shown alongside.
2. Find the perfect square of the leftmost group (1), which is 1. To the left of the first vertical line write 1 and add the same to it, giving the divisor 2. Below the 1, write its square root (1) and carry forward the difference between the number above and the perfect square (1 − 1 = 0).
3. Divide the underlined 07 by the obtained 2: quotient = 3, remainder = 1. Write the quotient (3) below and carry forward the remainder (1).
4. Calculate the Dvanda of the digits after the second vertical line, D(3) = 9. Subtract 9 from the underlined 16, giving 7. Divide 7 by 2: quotient = 3, remainder = 1. Write the quotient (3) below and carry forward the remainder (1).
5. Calculate the Dvanda of the digits after the second vertical line, D(33) = 18. Subtract 18 from the underlined 18, giving 0. Divide 0 by 2: quotient = 0, remainder = 0. Write the quotient (0) below and carry forward the remainder (0).

Fig. 2: Dvanda Yoga Sutra for Square Root: Stepwise Computation

Due to the increased complexity as the number of bits grows (23-bit mantissa), this sutra has some limitations in its implementation.

D. Paper and Pencil Method
This is one of the most orthodox and traditional methods of computing a square root. Fig. 3 illustrates the square root of 16 (10000 in binary), for which the answer comes out as 4 (100 in binary). The stepwise algorithm for binary numbers is as follows:

1. Group the bits in pairs from right to left. Since the number has 5 bits, the first group is a single bit (1).
2. Find the perfect square of the leftmost group (1), which is 1. Write 1 in the quotient as well as down below, and add it to get 10.
3. Find the bit to suffix to 10 such that the resulting product is less than or equal to the partial difference together with the next group of bits.
4. Repeat the procedure until all groups of bits are done.

Fig. 3: Paper and Pencil Method for Square Root: Stepwise Computation
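The grouped-bit procedure above is the classic digit-by-digit method. A compact software sketch of the same idea (our own Python illustration, not the paper's Verilog) is:

```python
def binary_isqrt(n):
    """Digit-by-digit binary square root of a non-negative integer,
    mirroring the paper-and-pencil grouping of bits in pairs."""
    bit = 1
    while bit * 4 <= n:         # highest power of 4 not exceeding n
        bit <<= 2
    root = 0
    while bit:
        if n >= root + bit:     # trial subtraction for the current bit pair
            n -= root + bit
            root = (root >> 1) + bit
        else:
            root >>= 1
        bit >>= 2
    return root

print(binary_isqrt(0b10000))    # 4, as in Fig. 3
print(binary_isqrt(17689))      # 133, as in Fig. 2
```

Each pass through the loop consumes one pair of bits and produces one result bit, exactly as in the stepwise method.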
III. PROPOSED SQUARE ROOT ALGORITHM

This paper deals with an efficient algorithm which incorporates the positive attributes of the square root computation methods mentioned above. Those algorithms find their limitations as the number of bits increases widely, and this paper deals with the floating point representation, where the mantissa is 23 bits wide. The basic algorithm for the proposed design is as follows:

1) Sign Bit
The sign bit of the result is the same as the sign bit of the original number.

2) Exponent Computation
The exponent of the result depends on the biased exponent E of the number. If the biased exponent is odd, 127 is added to it and the final sum is right shifted (a divide-by-2 operation):

Er = (E + 127) >> 1

If the biased exponent is even, 126 is added and the final sum is right shifted. In addition, a shift flag is set to indicate that the mantissa should be shifted to the left by 1 bit before computing its square root:

Er = (E + 126) >> 1

3) Mantissa (Square Root) Evaluation
The block of code which carries out the square root computation is based on an iterative approach using two registers, TEMP and ANS, each 27 bits wide, and Mr, 26 bits wide.

Consider the example of evaluating the square root of 16 (10000 in binary). Initialise TEMP as 0000…0000 and ANS as 0100…0000, and the iteration process is carried out. On the first iteration, TEMP is loaded with the mantissa and compared with ANS; a shift operation is carried out depending on the comparison result. Note that in each iteration a pointer points to the current bit that will be replaced in ANS. If TEMP is greater than or equal to ANS, the contents of ANS are subtracted from those of TEMP, and the difference is shifted to the left and stored in TEMP. The contents of ANS are shifted to the left by one bit starting from the current pointer position, and then a 1 is inserted in the current bit position in ANS. If TEMP is less than ANS, its contents are shifted to the left and stored in TEMP. The contents of ANS are shifted to the right by one bit starting from the current pointer position, and then a 0 is inserted in the current bit position in ANS. After the last iteration, the contents of ANS, with the exception of the two least significant bits, are the bits considered as the final result. Fig. 4 shows the architectural block diagram of the square root computation.

The Verilog HDL code is broken down into modules which deal with the exponent computation (8-bit) and the mantissa evaluation (23-bit). A top module connects them all, as shown in Fig. 5. Various sets of inputs are fed to the top module block to get the results. The further part of the document deals with simulation and synthesis results.

Fig. 5: Single Precision Square Root Computation: Block Diagram

IV. DESIGN IMPLEMENTATION

Verilog HDL code for square root computation of IEEE-754 single precision numbers is developed and then simulated using ModelSim SE Plus 6.5.

A. ModelSim Simulation
Consider the square root computation of a number a = 256 (01000011100000000000000000000000), which is fed to the algorithm to get the desired output b = 16 (01000001100000000000000000000001), as shown in Fig. 6.

B. Xilinx ISE Synthesis
The Verilog HDL code for square root computation of IEEE-754 single precision (32-bit) numbers is then synthesized for device XC5VLX30, package FF324, of the Virtex-5 FPGA family. From the device datasheet [9], this device has the attributes manifested in Table I.

TABLE I
XILINX VIRTEX-5 XC5VLX30 ATTRIBUTES

Device   | CLB Array (Rows × Columns) | Slices | LUTs   | Max User I/Os
xc5vlx30 | 80 × 30                    | 4,800  | 19,200 | 220

Table II shows the device utilisation summary of the Verilog HDL code so written; it is observed that the number of device resources used is very small. Hence, an optimum device utilisation is obtained.

From the timing report obtained, it is found that the maximum combinational path delay is 56.487 ns. The maximum combinational path delay applies only to paths that start at an input to the design and go to an output of the design without being clocked along the way.
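Putting the three steps of Section III together, the overall flow (sign passed through, exponent halved via the odd/even bias rule, square root taken on the scaled significand) can be sketched in software. This is our own Python model, not the paper's Verilog: it assumes a normalised, non-negative input, truncates rather than IEEE-rounds the mantissa, and its names are illustrative:

```python
import math
import struct

def float_to_bits(x):
    return struct.unpack('>I', struct.pack('>f', x))[0]

def bits_to_float(b):
    return struct.unpack('>f', struct.pack('>I', b))[0]

def fp32_sqrt(x):
    """Sign kept, exponent halved by the odd/even bias rule,
    square root taken on the (possibly left-shifted) significand."""
    bits = float_to_bits(x)
    sign = bits >> 31
    E = (bits >> 23) & 0xFF
    sig = (1 << 23) | (bits & 0x7FFFFF)   # 1.m scaled by 2^23
    if E & 1:                              # odd biased exponent
        Er = (E + 127) >> 1
    else:                                  # even: pre-shift mantissa left by 1
        Er = (E + 126) >> 1
        sig <<= 1
    # integer square root of sig * 2^23 yields a 24-bit significand
    root = math.isqrt(sig << 23)
    return bits_to_float((sign << 31) | (Er << 23) | (root & 0x7FFFFF))

print(fp32_sqrt(256.0))   # 16.0
```

For a = 256.0 the biased exponent is 135 (odd), giving Er = (135 + 127) >> 1 = 131 and the result 16.0, matching the ModelSim example.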
TABLE II
FLOATING POINT SQUARE ROOT COMPUTATION (SINGLE PRECISION): DEVICE UTILISATION SUMMARY

V. PERFORMANCE ANALYSIS: VEDIC VERSUS CONVENTIONAL

It can be easily deduced from Table III, which compares the proposed square root computation with a square root computational block based on a conventional algorithm, that the proposed algorithm is more optimal.

TABLE III
PERFORMANCE ANALYSIS: PROPOSED WORK (VEDIC) VERSUS CONVENTIONAL

The number of slice LUTs used in the design is quite small, 964, whereas square root evaluators based on other algorithms typically take about 4,989. One can also see that the proposed algorithm consumes only 324 slices, compared to other algorithms which utilise 351 slices. Thus, there is a nearly 7.69% improvement in the utilisation of resources, thereby enhancing the packing density. The maximum combinational delay incurred is a mere 56.487 ns, whereas the square root computational block based on conventional logic suffers a time delay of about 104 ns. Thus, the proposed design requires less time to carry out the computation, thereby enhancing the speed of operation.

CONCLUSION

The importance and usefulness of the floating point format nowadays is beyond discussion. Any computer or electronic device which operates on real numbers implements this type of representation and its operations. Square root is one of the most important arithmetic operations and is difficult to implement in hardware. Various architectures have been proposed, including recursive iterations, to increase the computational performance of the ALU. The proposed design uses an algorithm which makes the system more efficient, as it occupies less device area and takes less time for computation. Thus, nearly 8% improvement is observed when the algorithm is synthesised on the Virtex-5 platform. The proposed design is therefore more efficient than traditional ALUs and serves as a better optimisation technique for today's needs.

REFERENCES
[1] Alex N. D. Zamfirescu, "Floating Point Type for Synthesis", CA, USA, 2000.
[2] Xiaojun Wang, "Variable Precision Floating-Point Divide and Square Root for Efficient FPGA Implementation of Image and Signal Processing Algorithms", Ph.D. thesis, Department of Electrical and Computer Engineering, Northeastern University, December 2007.
[3] Tole Sutikno, Aiman Zakwan Jidin, Auzani Jidin, Nik Rumzi Nik Idris, "Simplified VHDL Coding of Modified Non-Restoring Square Root Calculator", International Journal of Reconfigurable and Embedded Systems, Vol. 1, pp. 37-42, Mar. 2012.
[4] K. Piromsopa, C. Aporntewan, P. Chongsatitvatana, "An FPGA Implementation of a Fixed-Point Square Root Operation", Thailand, Feb. 2002.
[5] Deepa and Sanal (2005), Vedic Mathematics [Online]. Available: http://www.sanalnair.org/articles/vedmath/intro.htm
[6] Rahul Bhangale (2012), Dvanda Yoga [Online]. Available: http://mathlearners.com/vedicmathematics/squares/dvanda-yoga/
[7] Guillaume Bedard, Frederic Leblanc, Yohan Plourde, Pierre Marchand, "Fast Square Root Calculation", MacTech | The Journal of Apple Technology, Vol. 14, pp. 1-5, 1998.
[8] Padala Nandeeswara Rao, "FPGA Implementation of Double Precision Floating Point Square Root with BIST Capability", M.Tech thesis, Department of Electronics and Communication Engineering, Thapar University, July 2009.
[9] Virtex-5 Family Overview datasheet, Xilinx, 2009.
[10] Taek-Jun Kwon, Jeff Sondeen, Jeff Draper, "Floating-Point Division and Square Root Implementation using a Taylor-Series Expansion Algorithm", IEEE, pp. 702-705, 2008.
Fig. 4: Floating Point Square Root Computation (Single Precision): Architectural Block Diagram