International Journal of Emerging Technology and Advanced Engineering
Website: www.ijetae.com (ISSN 2250-2459, Volume 2, Issue 7, July 2012)
Parallel Squarer Design Using Pre-Calculated Sum of Partial
Products
Manasa S.N1, S.L.Pinjare2, Chandra Mohan Umapthy3
1 Manasa S.N, Student, Dept. of E&C, NMIT College
2 S.L. Pinjare, HOD of E&C, NMIT College
3 Chandra Mohan Umapthy, Assistant Professor, NMIT College
By using dedicated resources one can save a
considerable amount of power which allows designers to
remain inside their power budgets.
Recently, a lot of research has been conducted to develop
different methodologies for implementing squarers, with
emphasis on improving delay and reducing area. As a result,
a new scheme, called the hybrid squarer, was developed to
balance these trade-offs. Greater emphasis is given here to
hybrid squarers, which comprise memory elements and
computing logic.
The remainder of this paper is organized as follows.
Section II presents a brief description of existing
algorithms used in the design of squaring units for
unsigned/signed numbers, followed by designs for the
multiplication of two unsigned/signed binary numbers.
Section III presents a way to use quarter squaring units to
perform multiplication of two binary numbers. Section IV
details the implementation and experimental results,
followed by a conclusion in Section V.
Abstract—Power is becoming a precious resource in
modern VLSI design, even more than area. With a large
number of applications requiring support for functional units
like squares, cubes and other higher-order units, it becomes
imperative that such functions be implemented in hardware.
This paper proposes a novel architecture for a modular,
scalable and reusable hybrid squaring circuit. A comparison
is made between different implementations of the squaring
circuit. The implementation results show a significant
improvement in performance in terms of area, power and
timing.
Keywords—Squarer, Squaring Circuit, Multiplier, Low
Power.
I. INTRODUCTION
With the advances in VLSI technology, more and more
functional complexity has been integrated into digital
designs to better support target applications. With many
applications requiring support for floating point arithmetic,
complex arithmetic modules like multipliers and powering
units are now being used extensively in designs. With
technology scaling, the goal has been to operate designs at
the fastest possible frequency to achieve high performance.
The problem with these complex arithmetic blocks like
multipliers and squaring units is that they require longer
cycle times for computation. In order to achieve the
frequency requirements, these designs invariably end up
being pipelined, which results in increases in area and thus
incurs a power penalty for operating at higher clock speeds.
In many applications a higher power penalty cannot be
tolerated and designers have to budget the power associated
with individual resources.
Multiplier designs require large area and consume a
considerable amount of power per computation. For
powering operations where a general-purpose multiplier is
not necessary, this results in power being wasted. We
propose to use dedicated powering units which perform a
specific function in place of a multiplier which has been
designed for general-purpose computation. The advantage
of using dedicated squaring units is that they consume
less power than general-purpose multipliers.
Squaring is a special case of multiplication.
II. BINARY MULTIPLICATION AND SQUARING
A. Binary Squarers
Squaring is a special case of multiplication where both
inputs are identical. Since the two inputs are identical,
many optimizations can be made in the implementation of
a dedicated squaring unit [3]. Such a squaring unit requires
less area than a multiplier, as nearly half of the partial
products can be combined using the equivalence
$A_i A_j + A_j A_i = 2 A_i A_j$, which can be represented by
adding $A_i A_j$ to the next column to the left. This reduces
the depth, defined as the number of partial products to be
added together in a column. With a reduction in depth, the
design can operate faster, as the number of terms on the
critical path is reduced. Fig. 1 shows a 4-bit unsigned
squaring unit. We can observe from Fig. 1 that the two
$A_1 A_0$ terms in column 2 are reduced to a single $A_1 A_0$
term in column 3. Other partial products can be reduced
similarly. The property $A_0 A_0 = A_0$ also allows terms in
the final partial products to be reduced.
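The folding of partial products described above can be modelled in software. The following Python function is a minimal sketch, not the authors' hardware design; the name `square_folded` and the parameter `n_bits` are introduced here for illustration only. It adds each diagonal term once and each off-diagonal pair once, moved one column to the left.

```python
def square_folded(a: int, n_bits: int = 4) -> int:
    """Square an unsigned n_bits-wide integer using folded partial products.

    Illustrative software model only; names are hypothetical.
    """
    bits = [(a >> i) & 1 for i in range(n_bits)]
    total = 0
    for i in range(n_bits):
        # Diagonal term: A_i * A_i = A_i, placed at weight 2^(2i).
        total += bits[i] << (2 * i)
        for j in range(i + 1, n_bits):
            # Off-diagonal pair: A_i A_j + A_j A_i = 2 A_i A_j, so a single
            # A_i A_j term is added one column to the left (weight 2^(i+j+1)).
            total += (bits[i] & bits[j]) << (i + j + 1)
    return total
```

For a 4-bit input this uses 4 diagonal terms and 6 folded terms instead of the 16 partial products of a general 4 × 4 multiplier.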
The square of a 4-bit number can be computed by adding
the rows at the bottom part of Fig. 1. From Fig. 1 we
observe that the depth has also reduced; an initial depth of
four for a multiplier configuration was reduced to three for
squaring[4]. Fig 3 shows the proposed architecture[8].
Summing the three results of the example X = 12345 with
their positional weights,
$15129 \times 10^{4} + 11070 \times 10^{2} + 2025$, gives the
square of X = 152399025.
The above theory can be extended to any given number X.
Hence, by mathematical inspection, the proposed algorithm
is shown to hold for any arbitrary number X.
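The extension to an arbitrary X can be checked algebraically. As a sketch, assume X is split into an upper part P and a lower part Q of k decimal digits; the symbols P, Q and k are introduced here for illustration and do not appear in the original description.

```latex
\begin{aligned}
X     &= P \cdot 10^{k} + Q, \\
X^{2} &= P^{2} \cdot 10^{2k} + 2PQ \cdot 10^{k} + Q^{2}.
\end{aligned}
```

With X = 12345, k = 2, P = 123 and Q = 45 this gives 15129 × 10^4 + 11070 × 10^2 + 2025 = 152399025. The same expansion holds with a power of two in place of 10^k, which is the form a binary implementation would presumably use.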
B. Binary Multiplication Using Mux
The techniques for performing binary multiplication
involve three basic steps: generation of partial products,
reduction of partial products, and addition of the final two
rows of partial products. An M×N-bit multiplication can be
viewed as forming N partial products, each of M bits, and
adding them together according to their weights.
Multiplication is performed either by using a shift-add
algorithm or by using parallel multiplication techniques.
The shift-add method requires M cycles to perform an
M×N-bit multiplication.
In this method we use 2-Mux to generate the partial
products; the select lines of the Mux are controlled by a
counter. The output of the Mux is given to a multiplier, and
the result of the multiplier is stored in a register
controlled by the clock. When the clock enables the
register, the shift-add step is performed. All the recoded
bit arrays are then added together according to their
weights to obtain the final product.
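The shift-add scheme can be illustrated with a short behavioural model. This Python sketch is one assumed reading of the Mux-based description, not the authors' Verilog: the loop index `cycle` stands in for the counter, and a conditional expression stands in for a 2-to-1 multiplexer that selects either the multiplicand or zero. The function name and parameters are hypothetical.

```python
def shift_add_multiply(multiplicand: int, multiplier: int, n_bits: int) -> int:
    """Multiply two unsigned integers, one shift-add step per multiplier bit.

    Behavioural sketch only; one loop iteration models one clock cycle.
    """
    accumulator = 0  # models the clocked result register
    for cycle in range(n_bits):
        # The counter value picks the multiplier bit that drives the select line.
        select = (multiplier >> cycle) & 1
        # 2-to-1 mux: pass the multiplicand when select is 1, otherwise zero.
        partial_product = multiplicand if select else 0
        # Add the partial product according to its weight (shift by cycle).
        accumulator += partial_product << cycle
    return accumulator
```

The model takes `n_bits` iterations, which mirrors the cycle count of the shift-add method, whereas a parallel multiplier would form all partial products at once.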
The architecture of the multiplier designed using Mux is
shown in Fig. 4.
Fig 1. Partial Product Reduction in Unsigned Squaring Operation
Fig 3. Block diagram of the Proposed Architecture
Proposed Algorithm:
The algorithm consists of the following steps:
The given input is partitioned into two parts; each part is
treated as a separate unit and processed individually by
further units.
· Find the square of each part.
· Find twice the product of the two parts.
· Add the above results, weighted by digit position, to get
the final result.
Suppose X is a five-digit number whose square has to be
computed:
X = abcde.
Find the square of abc = $(abc)^2$.
Find the square of de = $(de)^2$.
Find twice the product of abc and de = $2(abc)(de)$.
Find the weighted sum of the above results to get the
square of X:
$X^2 = (abc)^2 \times 10^{4} + 2(abc)(de) \times 10^{2} + (de)^2$.
Ex:
1. Let X = 12345, so abc = 123 and de = 45.
Find the square of abc = $(123)^2$ = 15129.
Find the square of de = $(45)^2$ = 2025.
Find twice the product of abc and de = 2 × 123 × 45 =
11070.
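The steps of the proposed algorithm can be sketched in a few lines of Python. This is an illustrative model of the decimal example only; the function name `split_square` and the parameter `low_digits` are assumptions made here, and a hardware version would presumably split on bit positions rather than decimal digits.

```python
def split_square(x: int, low_digits: int = 2) -> int:
    """Square x by splitting it into an upper and a lower part.

    Illustrative sketch; low_digits is the number of decimal digits
    assigned to the lower part (hypothetical parameter).
    """
    base = 10 ** low_digits
    high, low = divmod(x, base)      # 12345 -> (123, 45) for low_digits = 2
    high_sq = high * high            # square of the upper part
    low_sq = low * low               # square of the lower part
    cross = 2 * high * low           # twice the product of the two parts
    # Add the three results according to their positional weights.
    return high_sq * base * base + cross * base + low_sq
```

For X = 12345 the intermediate values are 15129, 2025 and 11070, and the weighted sum is 152399025, matching the worked example.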
Fig 4. Multiplier designed using Mux.
C. Digital Multiplier:
A basic LUT-based multiplier is simply a lookup table
with the addresses arranged so that part of the address is the
multiplicand and the other part is the multiplier. The data
width should be set to the sum of the widths of the two
address parts to accommodate the product.
Implementing a Basic/Digital Multiplier:
In the case where a four-bit value is multiplied by a
four-bit value, you will need a memory block that is eight bits
wide and 256 words deep. The first four bits of the address
can be configured as the multiplicand and the second four
bits can be configured as the multiplier.
The memory will store the appropriate product values.
To multiply the upper four bits by the lower four bits, feed
both values into the address and clock the memory. The
appropriate product value will appear on the RAM output.
A diagram of this LUT-based multiplier implementation is
shown in Figure 1 on page 2. Since the memory block is
synchronous, this configuration will result in a synchronous
multiplier, whose clock frequency is only limited by the
data access time of the memory. While this approach is
more efficient than implementing multipliers in gates, it
can consume a large amount of memory. The number of
words required grows exponentially with the operand bit
width. The example above demonstrates a 4 x 4 bit
multiplier with 256 eight-bit words of storage required. For
an 8 x 8 bit multiplier, 65,536 16-bit words must be stored
using this technique.
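The addressing scheme of the LUT-based multiplier can be modelled as follows. This is a minimal Python sketch under the assumption that the multiplicand occupies the upper four address bits and the multiplier the lower four; the names `PRODUCT_LUT` and `lut_multiply` are hypothetical, and a Python list stands in for the memory block.

```python
OPERAND_BITS = 4
OPERAND_MASK = (1 << OPERAND_BITS) - 1

# 256-word memory: the word at each address holds the product of the
# upper four address bits and the lower four address bits.
PRODUCT_LUT = [
    (address >> OPERAND_BITS) * (address & OPERAND_MASK)
    for address in range(1 << (2 * OPERAND_BITS))
]


def lut_multiply(multiplicand: int, multiplier: int) -> int:
    """Look up the product of two 4-bit unsigned values (0..15)."""
    # Concatenate the two operands to form the 8-bit memory address.
    address = ((multiplicand & OPERAND_MASK) << OPERAND_BITS) | (multiplier & OPERAND_MASK)
    return PRODUCT_LUT[address]
```

The table has 2^8 = 256 entries and the largest stored value is 15 × 15 = 225, which fits in eight bits. Doubling the operand width to eight bits would raise the depth to 2^16 = 65,536 words.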
Characteristics of Basic Multiplier:
A. Iterative shift add routine
B. N clock cycles to complete
C. Very compact design
D. Serial input can be MSB or LSB first depending on
direction of shift in accumulator
E. Parallel output
Partial Product Multipliers
One way to mitigate the amount of memory required is
to use partial product multiplication. This technique
combines the lookup table approach with elements of
longhand multiplication. For example, to multiply 24 x 43
= 1,032 using longhand, simplify the problem into the sum
of four multiply functions and three add functions (Figure
2).
Using a basic lookup table technique, an eight-bit by
eight-bit multiply would require 128 KB of storage (65,536
words of 16 bits). As shown in Figure 3 on page 3, using
partial product multipliers, the same procedure can be
accomplished using 1 KB of storage. To accomplish this in
logic, using A as the multiplicand and B as the multiplier,
take the lower four bits of A and multiply them by the lower
four bits of B using the lookup table technique. Then take
the upper four bits of A, multiply them by the lower four
bits of B, and shift the partial product result to the left
by four. Add the two results together for the first part of
the product. For the second part of the product, multiply
the lower four bits of A by the upper four bits of B. Then
do the same with the upper four bits of both A and B and
shift this partial product value to the left by four. Add
the two values of the previous calculation and shift the
whole result to the left by four. Then add the first part of
the product to the second part of the product for the final
result. While this technique is not as fast as implementing
the entire multiply as a single memory element, it greatly
reduces the amount of memory required, at the expense of
using more core tiles.
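The sequence of lookups, shifts and additions described above can be written out as a short Python sketch. It is illustrative only: the names `LUT_4X4` and `partial_product_multiply` are assumptions, and a single shared 4 × 4 table is reused for all four lookups, whereas a hardware design might instantiate four separate 256 × 8-bit memories (4 × 256 bytes = 1 KB).

```python
# 4 x 4 bit product table, indexed as LUT_4X4[a][b] for a, b in 0..15.
LUT_4X4 = [[a * b for b in range(16)] for a in range(16)]


def partial_product_multiply(a: int, b: int) -> int:
    """Multiply two 8-bit unsigned values using four 4 x 4 table lookups."""
    a_lo, a_hi = a & 0xF, (a >> 4) & 0xF
    b_lo, b_hi = b & 0xF, (b >> 4) & 0xF
    # First part: A_lo * B_lo plus (A_hi * B_lo) shifted left by four.
    first = LUT_4X4[a_lo][b_lo] + (LUT_4X4[a_hi][b_lo] << 4)
    # Second part: A_lo * B_hi plus (A_hi * B_hi) shifted left by four.
    second = LUT_4X4[a_lo][b_hi] + (LUT_4X4[a_hi][b_hi] << 4)
    # The whole second part is shifted left by four before the final add.
    return first + (second << 4)
```

The structure is the binary analogue of the longhand example 24 x 43: four small multiplies and three additions, with shifts by four bits playing the role of the factors of 10.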
III. QUARTER SQUARE TECHNIQUE
Since squaring units require less area and power than
multipliers, it is interesting to assess the use of squaring
units to perform multiplication. There are various methods
to obtain the product of two numbers using squares instead
of multipliers. One of the most widely used methods in
algebra is the quarter square method [5]. In mathematical
terms, the quarter square algorithm can be expressed as
$A \times B = \frac{1}{4}\left\{(A+B)^2 - (A-B)^2\right\}$
In this method, to obtain the product of two numbers, we
obtain their sum and difference. The sum and difference are
squared, and the difference of these two squares, divided by
4, provides the result. In binary arithmetic, the divide-by-4
operation can be easily accomplished by shifting right by
two bit positions. The quarter square technique is
illustrated in Fig. 6.
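The quarter square identity translates directly into code. The following Python function is a minimal sketch, with the hypothetical name `quarter_square_multiply`; ordinary integer multiplication stands in for the two dedicated squaring units, and Python integers are unbounded, so the extra bit needed for the sum and difference in hardware is not modelled.

```python
def quarter_square_multiply(a: int, b: int) -> int:
    """Multiply two integers using two squarings, three add/subtract steps
    and a shift. Illustrative sketch of the quarter square method."""
    total = a + b                  # adder 1: sum of the operands
    diff = a - b                   # adder 2: difference of the operands
    sq_sum = total * total         # squaring unit 1
    sq_diff = diff * diff          # squaring unit 2
    # adder 3: (A+B)^2 - (A-B)^2 = 4AB, which is always divisible by 4,
    # so the right shift by two is exact, including for negative products.
    return (sq_sum - sq_diff) >> 2
```

The two squarings and three add/subtract operations correspond to the two squaring units and three adders that the quarter square multiplier requires for every multiplication.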
(4 × 3 + ((2 × 3) × 10)) + (((4 × 4) + ((2 × 4) × 10)) × 10) = 1,032
Fig 5. Partial Product Multiplier technique
Implementing a Partial Product Multiplier
In logic this same technique can be used to reduce the
amount of memory required to perform a multiply function.
Fig 6. Quarter Square Technique
From Fig. 6 we observe that if we have two 8-bit
unsigned numbers, the sum can result in a carry; similarly,
with two 8-bit signed numbers, the difference can generate
an overflow. In order to produce a correct result we need an
(8+1)-bit adder for computation of the sum and difference,
and hence one would need at least (n+1)-bit squaring units
to correctly perform an n-bit multiplication.
IV. EXPERIMENTS AND RESULTS
An 8/16/32/64-bit multiplier performing signed/unsigned
operations, based on the multiplier-using-Mux algorithm,
has been described in Verilog. As multipliers support signed
operations, we use squaring units designed for signed
operations for all the results and comparisons. We
implemented the quarter square algorithm using the squaring
unit designs to perform signed/unsigned multiplication. We
designed 8/16/32/64-bit squaring units to support 8-, 16-,
32- and 64-bit multiplication, respectively. The performance
of all squaring circuits is evaluated on the same devices,
Spartan xc3s400 and Virtex XC2vp30, with speed grades of
4 and 7. The results suggest that the proposed architecture
is faster than the multiplier.
TABLE I
Area Requirement Using Spartan3
Bits   Squarer   Multiplier using Mux   Quarter Multiplier   Digital Multiplier
4      4         13                     7                    4
8      6         19                     11                   10
16     8         35                     17                   40
TABLE II
Delay (ns) Requirement using Spartan3
Bits   Squarer   Multiplier using Mux   Quarter Multiplier   Digital Multiplier
4      13.48     19.6                   18.93                13.68
8      16.44     21.12                  19.91                18.13
16     16.66     21.89                  21.25                22.32
Delay = Input delay + Output delay
From Tables I & II, we can conclude that the proposed
scheme is more efficient in terms of area, timing and power.
The above results can be further improved by using the
Look Up Table (LUT) approach to calculate the
intermediate squaring values.
TABLE III
Area Requirement using Spartan3
Bits   With LUT   Without LUT
2      2          3
4      4          6
6      7          10
8      20         26
Fig 7. Area Requirements of various designs.
Fig 8. Area Requirements of various designs.
TABLE IV
Delay (ns) Requirement using Spartan3
Bits   Squarer with LUT   Squarer without LUT
2      2.69               3.43
4      2.73               6.52
6      4.71               7.73
8      7.35               10.9
Fig 9. Delay Requirements of various designs.
TABLE V
Area Requirement using Virtex-II Pro
Bits   Squarer   Multiplier using Mux   Quarter Multiplier   Digital Multiplier
4      3         14                     7                    3
8      5         19                     9                    10
16     8         35                     11                   40
32     13        67                     85                   157
64     87        167                    421                  534
Fig 10. Area Requirements of various designs.
TABLE VI
Delay (ns) Requirement using Virtex-II Pro
Bits   Squarer   Multiplier using Mux   Digital Multiplier   Quarter Multiplier
4      7         10.39                  7.6                  10.2
8      8.95      11.8                   10.09                11.09
16     9.29      12.24                  12.47                12.89
32     11.32     13.10                  15.5                 17.92
64     16.52     20.92                  19.61                24.13
Delay = Input delay + Output delay.
Simulation Result:
Fig 11. 8-bit Squaring Circuit
Fig 12. 16-bit Squaring Circuit
Fig 13. 64-bit Squaring Circuit
We compare four designs based on their area or the
maximum number of partial products in a column. Table I
shows the area requirements for the multiplier types and the
squaring units. As seen from Table II, the maximum delay
of the multiplier units is greater than that of the squarer.
From the tables we infer that a reduction in area is
accompanied by a reduction in power.
Figs. 7 & 10 plot the area requirements for the various
designs under the same constraints. From the results we can
observe that the squaring units require only about 55% of
the multiplier area. Designing multipliers with the quarter
square technique results in an area penalty of about
20-60% over the Digital multiplier. From the area
requirements of the quarter square multiplier and the
squaring units, we find that the area overhead of adders in
the multiplier design is about 20-30% of the area of the
squaring unit.
The delay required for each design is shown in Table II.
We observe that the squaring units consume about 50% of
the delay consumed by the Digital multiplier to perform
squaring. However, when a multiplier is built using the
quarter square technique, it consumes more power than the
Digital multiplier, as the design requires two squaring
units and three adders for every multiplication. The adder
overhead significantly affects the overall power.
Tables III & IV show the area and delay requirements
without LUT for the multiplier and with LUT for the squarer
architecture. The design with LUT consumes less area and
delay than the design without LUT. Tables V & VI show the
area and delay requirements using Virtex.
Figs. 11, 12 and 13 show the simulation results of the 8-,
16- and 64-bit squarers. The results remain the same for all
other types of multiplier.
We provide results for area and power requirements of
unsigned/signed squaring units and quarter square
multipliers for 8/16/32 bits. The low area and power
required per computation provide significant advantages
when dedicated squaring units are used in a design instead
of a general-purpose multiplier. The salient features are a
modular and scalable architecture, easy and simple
implementation, low power consumption, less area and
better timing.
REFERENCES
[1] Risojevic, V.; Avramovic, A.; Babic, Z.; Bulic, P., "A simple
pipelined squaring circuit for DSP", IEEE 29th International
Conference on Computer Design (ICCD), 2011, pp. 162-167.
[2] Kuan Jen Lin; Yu Chan Chiu; Tzu-Hao Lin, "A decimal squarer with
efficient partial product generation", 18th IEEE/IFIP VLSI System
on Chip Conference, 2010, pp. 213-218.
[3] Garofalo, V.; Coppola, M.; De Caro; Napoli, E.; Petra, N.; Strollo,
A.G.M., "A novel truncated squarer with linear compensation
function", IEEE International Symposium on Circuits and Systems
(ISCAS), Proceedings, 2010, pp. 4157-4160.
[4] Oberman, Stuart F.; Flynn, Michael J., "Division Algorithms and
Implementations", IEEE Transactions on Computers, 1997,
pp. 833-854.
[5] Datla, S.R.; Thornton, M.A.; Matula, D.W., "A Low Power High
Performance Radix-4 Approximate Squaring Circuit", 20th IEEE
International Conference on Application-specific Systems,
Architectures and Processors, pp. 91-97, 7-9 July 2009.
[6] Taek-Jun Kwon; Jeff Sondeen; Jeffrey Draper, "Floating-Point
Division and Square Root using a Taylor-Series Expansion
Algorithm", 50th IEEE International Midwest Symposium on
Circuits and Systems, August 2007, pp. 305-308.
[7] Cho, K.-J.; Chung, J.-G., "A parallel squarer design using
pre-calculated sum of partial product", Electronics Letters, 2007,
vol. 43, pp. 1414-1416.
[8] Hong, Sun-Ah; Kim, Yong-Eun; Chung, Jin-Gyun; Lee, Sung-Chul,
"Efficient Squarer Design Using Group Partial Products", IEEE
Workshop on Signal Processing Systems, 2007, pp. 146-150.
[9] Cho, K.-J.; Chung, J.-G., "Low error fixed-width two's complement
squarer design using Booth-folding technique", IET Computers &
Digital Techniques, 2007, pp. 414-422.
[10] Shuli Gao; Chabini, N.; Al-Khalili, D.; Langlois, P., "Efficient
Realization of Large Integer Multipliers and Squarers", IEEE
North-East Workshop on Circuits and Systems, 2006, pp. 237-240.
[11] Chandra Mohan Umapathy, "High speed squarer", 20th IEEE/IFIP
VLSI System on Chip Conference, 2010.
V. CONCLUSION
The paper presents a case for the use of dedicated
squaring units in applications where squares are required in
large numbers, which otherwise would be implemented
using general purpose multipliers. A method of using
squaring units to perform multiplications is presented, and
the tradeoffs as compared to conventional multipliers are
presented.