Low Power and High Speed Multiplication
Design
Through Mixed Number Representations
Menghui Zheng and Alexander Albicki
Department of Electrical Engineering
University of Rochester
Rochester, NY 14627, USA
Abstract
A low power multiplication algorithm and its VLSI
architecture using a mixed number representation are
proposed. The reduced switching activity and low power
dissipation are achieved through the Sign-Magnitude (SM)
notation for the multiplicand and through a novel design
of the Redundant Binary (RB) adder and Booth decoder.
The high speed operation is achieved through the
Carry-Propagation-Free (CPF) accumulation of the Partial
Products (PP) by using the RB notation. Analysis
showed that the switching activity in the PP generation
process can be reduced on average by 90%. Compared to
the same type of multipliers [1, 2, 3], the proposed
design dissipates much less power and is 18% faster on
average.
in terms of the number conversions, it is more energy
efficient and has an operating speed close to the Wallace
tree architecture [4] and faster than the multipliers
proposed in [1, 2, 3].
The paper is organized as follows. The ESA in a
multiplication unit is addressed in Section 2. Then, in
Section 3, a novel method to reduce the ESA and increase
the operation speed is presented. The corresponding
multiplication algorithm and the VLSI architecture are
discussed in Section 4. Finally, some concluding remarks
are drawn in Section 5.
2: The ESA in 2’C representation
2’C numbers and the radix-4 Booth’s algorithm are
predominantly used for multiplier design, since
arithmetic operations are easily carried out with 2’C
numbers and Booth’s algorithm greatly reduces the
number of PPs. However, Booth’s algorithm often requires
the negation of the multiplicand, and the negation of a 2’C
number requires many bits to be switched, which results in
high switching activity. Without loss of generality, we
use the radix-4 Booth’s algorithm to determine the
probability that a negation of the multiplicand must be
generated and how many bits have to be switched on
average. This gives us the ESA during the PP
generation.
As shown in Table I, the radix-4 Booth’s algorithm
requires -Y and -2Y, where Y is the multiplicand. For the 2’C
representation, $-Y = \bar{Y} + 1$; to generate -Y from Y,
all the bits of Y have to be switched and then a ‘1’ has to be
added to get the correct 2’C result. The same operations
are needed to generate -2Y, except that a left shift is needed
before the bit complementation takes place. The negation
process is highly energy consuming, as it requires the
charging and discharging of all the nodes associated with
the PP.
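The cost of negating a 2’C multiplicand can be illustrated with a small sketch. The helper names `toggled_bits` and `negate_2c` are hypothetical, and the model only counts which bit positions change; it illustrates the argument above and is not a circuit-level power model.

```python
def toggled_bits(a: int, b: int) -> int:
    """Number of bit positions in which two bit patterns differ."""
    return bin(a ^ b).count("1")


def negate_2c(y: int, n: int) -> tuple[int, int]:
    """Negate an n-bit 2'C bit pattern y (0 <= y < 2**n).

    Returns (complemented pattern, negated pattern), i.e. the pattern
    after the bitwise complement and after the subsequent 'add 1'.
    """
    mask = (1 << n) - 1
    complemented = ~y & mask             # every one of the n bits is switched
    negated = (complemented + 1) & mask  # 'add 1' completes the negation
    return complemented, negated


y = 0b00010110                           # +22 as an 8-bit 2'C pattern
comp, neg = negate_2c(y, 8)
print(f"{y:08b} -> {comp:08b} -> {neg:08b}")
print(toggled_bits(y, comp))             # bits switched by the complement step
```

Whatever the operand value, the complement step switches all n bits, which is the quantity the ESA analysis counts.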
Indeed, let an n-bit multiplier be
$X = x_{n-1}x_{n-2}\ldots x_1x_0$. The radix-4 Booth’s algorithm
takes a triplet $x_{2k+1}x_{2k}x_{2k-1}$ as input and generates a PP
1: Introduction
We shall show that by the use of the SM notation for
the multiplicand, the use of Two’s Complement (2’C)
representation for the multiplier, and the use of RB
representation for the PP accumulation, the Expected
Switching Activity (ESA), and therefore the power
dissipation, can be significantly reduced. The ESA
reduction occurs whenever the negation of the multiplicand
is needed to generate the PPs under the radix-4
Booth’s algorithm. High speed operation is sustained
through the RB notations for accumulating the PPs, since
a CPF addition can be executed with RB numbers. The
inputs and outputs of the multiplication unit are assumed
to be in 2’C notation. Although we only consider integer
multiplications and radix-4 Booth’s algorithm here, the
proposed techniques can be easily extended to floating
point multiplications and higher radix Booth’s algorithms.
It is interesting to point out that although the
proposed algorithm and its VLSI architecture are complex
This work was supported in part by the National Science Foundation
under award No. MIP-9300936.
1063-6404/95 $4.00 © 1995 IEEE
Proceedings of the International Conference on Computer Design: VLSI in computers & Processor (ICCD '95)
1063-6404/95 $10.00 © 1995 IEEE
$P_k$ according to Table I, where $k = 0, 1, \ldots$ and
$x_{-1} \equiv 0$. So, it scans 3 bits for one PP with a one-bit
overlap between two adjacent triplets. If $n$ is odd, the
largest index $2k+1$ equals $n$, so the multiplier must be sign
extended by one bit to complete the top triplet. If $n$ is even,
the largest index $2k+1$ equals $n-1$; therefore, multiplier
$X$ can be exactly grouped into $n/2$ triplets and no sign
extension is needed. For parallel multiplication, all
triplets can be scanned at the same time.
Case 2: n is odd. An extra bit $x_n = x_{n-1}$ (sign extension)
must be appended to the left of $x_{n-1}$ to make the triplet
$x_n x_{n-1} x_{n-2}$, so $(n+1)/2$ triplets are
needed to cover all the bits of the multiplier. Based on the
sign extension rule, the triplet $x_n x_{n-1} x_{n-2}$ has four
possible patterns: 000, 001, 110, 111. Among them there
is just one NEG. So, the probability of a NEG occurring
in the triplet $x_n x_{n-1} x_{n-2}$ is 1/4. For $x_1 x_0$, as in the case
when $n$ is even, the probability of a NEG occurring in the
triplet $x_1 x_0 x_{-1}$ is 1/2. For the remaining $(n-3)$ bits, the
probability of a NEG occurring in a triplet $x_{2k+1}x_{2k}x_{2k-1}$
is 3/8. Therefore, the average probability of a NEG
appearing in a triplet $x_{2k+1}x_{2k}x_{2k-1}$ is:
$$N_{p\text{-odd}} = \frac{1}{4}\cdot\frac{2}{n+1} + \frac{1}{2}\cdot\frac{2}{n+1} + \frac{3}{8}\cdot\frac{n-3}{n+1} = \frac{3}{8} \qquad (3)$$
Combining cases 1 and 2, the average probability for
a NEG to appear in a triplet $x_{2k+1}x_{2k}x_{2k-1}$ is
$$N_p = \begin{cases} \dfrac{3}{8} + \dfrac{1}{4n}, & \text{if } n \text{ is even} \\[4pt] \dfrac{3}{8}, & \text{if } n \text{ is odd} \end{cases} \qquad (4)$$
Since, for 2’C numbers, $-Y = \bar{Y} + 1$ and the generation
of $\bar{Y}$ requires the complementation of every bit of Y, the
ESA in the PP generation process (see footnote 1) is:
$$ESA_{2'C} = \frac{N_p \times n}{n} = N_p \qquad (5)$$
From Table I, when the radix-4 Booth’s algorithm
encounters the multiplier patterns ‘110’, ‘101’ and ‘100’, it
has to generate -Y or -2Y. These patterns, which will be
referred to hereafter as NEG (the negation patterns),
are directly related to the ESA in the Booth PP
generator. The average probability of a NEG pattern
occurring in any given triplet $x_{2k+1}x_{2k}x_{2k-1}$ of the multiplier
can be analyzed as follows.
Assume an n-bit 2’C number $X = x_{n-1}x_{n-2}\ldots x_1x_0$,
and that the probability of being ‘1’ for each bit of the
multiplier is 0.5.
Table II: Average ESA in Booth partial product generation

Operand length   4-bit    8-bit    16-bit   32-bit   64-bit
ESA              0.4375   0.4063   0.3906   0.3828   0.3789
Table II lists the ESA values for some typical operand
lengths. On average, the ESA in the partial product
generation process is about 0.40, which results in a large
power dissipation.
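The closed-form probabilities and the Table II entries can be cross-checked by brute force. The sketch below enumerates every n-bit multiplier, counts the NEG triplets, and compares the result with the closed form $3/8 + 1/(4n)$ for even $n$ and $3/8$ for odd $n$. It assumes uniformly distributed multiplier bits, as the text does; the function names are hypothetical.

```python
from fractions import Fraction

NEG_PATTERNS = {(1, 0, 0), (1, 0, 1), (1, 1, 0)}   # triplets needing -Y or -2Y


def bit(x: int, i: int) -> int:
    """Bit i of a signed integer; x_{-1} = 0, high indices sign-extend."""
    return 0 if i < 0 else (x >> i) & 1


def neg_probability_bruteforce(n: int) -> Fraction:
    """Average NEG probability per triplet over all n-bit 2'C multipliers."""
    triplets = (n + 1) // 2
    hits = 0
    for x in range(-(1 << (n - 1)), 1 << (n - 1)):
        for k in range(triplets):
            t = (bit(x, 2 * k + 1), bit(x, 2 * k), bit(x, 2 * k - 1))
            hits += t in NEG_PATTERNS
    return Fraction(hits, triplets * (1 << n))


def neg_probability(n: int) -> Fraction:
    """Closed form: 3/8 + 1/(4n) for even n, 3/8 for odd n."""
    return Fraction(3, 8) + (Fraction(1, 4 * n) if n % 2 == 0 else 0)


def esa_2c(n: int) -> Fraction:
    """ESA for a 2'C multiplicand: N_p * n switched bits out of n."""
    return neg_probability(n) * n / n


for n in range(3, 11):
    assert neg_probability_bruteforce(n) == neg_probability(n)

for n in (4, 8, 16, 32, 64):
    print(n, float(esa_2c(n)))
```

The printed values agree with Table II to the four decimal places shown there.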
Case 1: n is even. $n/2$
triplets are needed to cover all the bits of the multiplier
and sign extension is not needed. For $x_1x_0$, since
Booth’s algorithm assumes bit $x_{-1}$ to be always zero,
there are only four choices for the triplet $x_1x_0x_{-1}$: 000,
010, 100 and 110. Two of them are NEGs. Hence, the
probability of a NEG appearing in the $x_1$, $x_0$ and $x_{-1}$
positions is 1/2. For the remaining $(n-2)$ bits, each triplet
$x_{2k+1}x_{2k}x_{2k-1}$ has 8 possible patterns and 3 of them are
NEGs. So, the probability of a NEG appearing in the
remaining $(n-2)$ bits is 3/8. Therefore, the average
probability of a NEG appearing in a triplet
$x_{2k+1}x_{2k}x_{2k-1}$ is:
$$N_{p\text{-even}} = \frac{1}{2}\cdot\frac{2}{n} + \frac{3}{8}\cdot\frac{n-2}{n} = \frac{3}{8} + \frac{1}{4n} \qquad (2)$$
3: Reducing the switching activity
Clearly, the high switching activity in the Booth PP
generator is caused by the generation of -Y and -2Y and the
fact that the 2’C representation is chosen for the
multiplicand Y. The latter holds because the negation of a
given 2’C number is equivalent to complementing
all its bits and then adding ‘1’. On the other hand, the
Footnote 1: In this paper we only consider the ESA associated with the
complementation process. The ESA associated with the ‘adding 1’
process is not included here, since in the VLSI implementation the
‘adding 1’ process is implemented through the adder tree.
and SM notations are identical, so no conversion is needed.
For negative numbers, the conversion from 2’C to SM can
be implemented by complementing all the bits except the
sign bit $y_{n-1}$ and adding ‘1’ to the result. If one
assumes a uniform distribution of positive and negative
numbers, then the probability that the number has to be
converted is 0.5. Although the conversion adds some
delay, it does not offset the power dissipation gain due
to the SM representation of the multiplicand. Indeed, if
the multiplicand is in 2’C notation, one has to execute the
negation process for about 40% of all the PPs needed, and
the number of negation processes increases as the
operand length increases, while the conversion from 2’C
to SM takes place only once for any operand length. For
the ‘add 1’ operation, instead of using an n-bit adder, which
introduces delay and power overhead, we generate a
correction term associated with each PP and then add this
correction term to all PPs through the binary addition tree,
as shown in Figure 3. In this manner, only one more
input for the addition tree is added while the whole n-bit
addition operation is avoided. The correction term can be
generated according to Table IV. The logic for $C_1$ and $C_2$
is trivial: $C_1 = y_{n-1} \cdot 1Y$ and $C_2 = y_{n-1} \cdot 2Y$. The block
diagram, shown in Figure 1, indicates that the 2’C-to-SM
conversion adds only one inverter delay, or about 0.5
gate-delay (see footnote 2), which comes from the complementation
of the 2’C number. The correction term does
not introduce extra power overhead compared to the
traditional 2’C implementation, since the traditional
2’C implementation also needs a similar correction
term generator (‘adding 1’) to generate the negation of the
multiplicand.
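A minimal sketch of the 2’C-to-SM conversion described above is given here. For simplicity it performs the ‘add 1’ immediately, whereas the architecture in the text defers it to a correction term fed into the addition tree. The function names are hypothetical.

```python
def twos_complement_to_sm(y: int, n: int) -> tuple[int, int]:
    """Convert an n-bit 2'C bit pattern (0 <= y < 2**n) to (sign, magnitude).

    Positive numbers pass through unchanged. For negative numbers all
    bits except the sign bit are complemented and 1 is added.
    """
    low_mask = (1 << (n - 1)) - 1
    sign = (y >> (n - 1)) & 1
    low = y & low_mask
    if sign == 0:
        return 0, low
    # Note: for the most negative value the magnitude is 2**(n-1),
    # which does not fit in n-1 magnitude bits.
    return 1, (~low & low_mask) + 1


def sm_value(sign: int, magnitude: int) -> int:
    """Integer value of a sign-magnitude pair."""
    return -magnitude if sign else magnitude


pattern = -5 & 0xFF                      # 8-bit 2'C pattern of -5
print(twos_complement_to_sm(pattern, 8))
print(twos_complement_to_sm(0b00010110, 8))
```

The conversion is applied once per multiplication, independent of how many partial products later require a negated multiplicand.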
negation of an SM number is simple: just
complement the sign bit. Hence, if one uses the SM
representation instead of 2’C for the multiplicand Y, a
significant reduction of the ESA during the Booth PP
generation process should be expected. Consequently, we
propose the SM representation for the multiplicand Y, yet
keep the multiplier X in the 2’C form. The correctness of
the radix-4 Booth’s algorithm applied to this mixed
number representation can be argued as follows: according
to [5], the radix-4 Booth’s algorithm gives correct results
when applied to 2’C numbers, and the validity of the Booth
coding depends exclusively on the pattern of the
multiplier. Since the multiplier is kept in 2’C notation,
the radix-4 Booth’s algorithm remains valid for our mixed
number representation.
Now, let us evaluate the ESA of SM numbers. Since
the multiplier is in its 2’C form, the average probability
of a NEG pattern appearing in any triplet $x_{2k+1}x_{2k}x_{2k-1}$ of
an n-bit multiplier is the same as in (4). Also, negating
an SM number only complements the sign bit;
therefore, the ESA for SM numbers in the Booth PP
generation process is:
$$ESA_{SM} = \frac{N_p \times 1}{n} = \begin{cases} \dfrac{3}{8n} + \dfrac{1}{4n^2}, & \text{if } n \text{ is even} \\[4pt] \dfrac{3}{8n}, & \text{if } n \text{ is odd} \end{cases} \qquad (6)$$
A comparison of the ESA for SM and 2’C numbers is
reported in Table III. The reduction of the ESA is
significant, ranging from 87.5% for 8-bit operands to
98.4% for 64-bit operands. As the operand length
increases, the ESA for even-length 2’C numbers decreases
toward the asymptotic value of 3/8, and the ESA for
odd-length 2’C numbers is a constant 3/8. For SM
numbers, the ESA decreases at the rate of $O(1/n)$ and
asymptotically reaches zero. Thus, for longer operands the
ESA reduction, and therefore the power saving, is more
pronounced.
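The quoted reductions follow directly from the two ESA expressions, since $ESA_{SM} / ESA_{2'C} = 1/n$. A short numerical check, with hypothetical function names, is sketched below.

```python
from fractions import Fraction


def neg_probability(n: int) -> Fraction:
    """Average NEG probability per triplet: 3/8 + 1/(4n) (even n) or 3/8 (odd n)."""
    return Fraction(3, 8) + (Fraction(1, 4 * n) if n % 2 == 0 else 0)


def esa_2c(n: int) -> Fraction:
    """2'C multiplicand: all n bits are complemented on a negation."""
    return neg_probability(n) * n / n


def esa_sm(n: int) -> Fraction:
    """SM multiplicand: only the sign bit is complemented on a negation."""
    return neg_probability(n) * 1 / n


def reduction(n: int) -> Fraction:
    """Relative ESA reduction obtained by switching from 2'C to SM."""
    return 1 - esa_sm(n) / esa_2c(n)


for n in (8, 16, 32, 64):
    print(n, float(esa_2c(n)), float(esa_sm(n)), f"{float(reduction(n)):.1%}")
```

For 8-bit and 64-bit operands the reduction evaluates to 7/8 and 63/64, matching the 87.5% and 98.4% figures in the text.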
Table III: ESA for 2’C and SM
Table IV: Correction terms, 2’C to SM conversion
4: The algorithm and architecture
4.1: Conversion from 2’C to SM notation
Figure 1: 2’C-to-SM conversion
An SM number can be expressed as
$Y = (-1)^{y_{n-1}} \sum_{i=0}^{n-2} y_i 2^i$ and a 2’C number can be expressed as
$Y = -y_{n-1} 2^{n-1} + \sum_{i=0}^{n-2} y_i 2^i$.
For positive numbers, the 2’C
Footnote 2: We refer to the delay of a 2-input NAND gate as ‘one gate-delay’.
pair fashion, $(x_{n-1}x_{n-2}), (x_{n-1}x_{n-3}), \ldots, (x_{n-1}x_1), (x_{n-1}x_0)$,
and interpret the pairs according to the SM coding rule
shown in Table V. Clearly, we do not need any
operations except some wiring.
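The pairing can be sketched in a few lines. Table V is not reproduced in this copy, so the sketch assumes that a pair (sign bit, magnitude bit) encodes the digit $(-1)^{s} \cdot m$, which follows the description of a sign bit and a magnitude bit per digit; under this assumption the pair (1, 0) is a second encoding of zero. All names are hypothetical.

```python
def sm_to_rb(sign: int, magnitude: int, n: int) -> list[tuple[int, int]]:
    """Pair the sign bit with each of the n-1 magnitude bits (MSB first).

    No arithmetic is involved: each RB digit is just (sign, x_i).
    """
    return [(sign, (magnitude >> i) & 1) for i in range(n - 2, -1, -1)]


def rb_digit_value(pair: tuple[int, int]) -> int:
    """Assumed SM coding of one RB digit: (s, m) -> (-1)**s * m."""
    s, m = pair
    return -m if s else m


def rb_value(pairs: list[tuple[int, int]]) -> int:
    """Integer value of an RB number given MSB first."""
    value = 0
    for pair in pairs:
        value = 2 * value + rb_digit_value(pair)
    return value


digits = sm_to_rb(1, 0b101, 4)           # SM number with sign 1, magnitude 5
print(digits)
print(rb_value(digits))
```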
4.2: Speeding up the PP accumulation
We have substantially reduced the ESA in the PP
generation, but SM numbers are hard to manipulate in
arithmetic operations, since the signs of the operands have
to be identified separately through a sequence of decisions,
costing extra control logic, execution time and power
dissipation.
On the other hand, RB numbers,
represented in the form $R = \sum_{i=0}^{n-1} r_i 2^i$ with digit
$r_i \in \{\bar{1}, 0, 1\}$, are more suitable for high speed parallel arithmetic
computations [1, 6]. Due to the redundancy in RB
numbers, one can perform CPF addition by selecting
among different representations of the same value. Hence,
we further convert the PPs into the RB representation.
We adopt the selection rule proposed by Takagi in
[1] to perform CPF addition for the PP accumulation.
The rule is shown in Table VI. As an example,
the CPF addition of $X = \bar{1}010\bar{1}110 = -98$ and $Y = 001\bar{1}\bar{1}111 = 15$
is shown in Figure 2. One can see that the carry is
confined to adjacent digits and there is no global carry
propagation.
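$$\begin{array}{lrl}
\text{Augend} & \bar{1}\;0\;1\;0\;\bar{1}\;1\;1\;0 & = -98 \\
\text{Addend} & +\;\;0\;0\;1\;\bar{1}\;\bar{1}\;1\;1\;1 & = 15 \\ \hline
\text{Intermediate Sum} & \bar{1}\;0\;0\;1\;0\;0\;0\;\bar{1} & \\
\text{Intermediate Carry} & +\;\;0\;0\;1\;\bar{1}\;\bar{1}\;1\;1\;1\;\phantom{0} & \\ \hline
\text{Sum} & 0\;\bar{1}\;1\;\bar{1}\;0\;1\;1\;1\;\bar{1} & = -83
\end{array}$$
Figure 2: CPF RB addition.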
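The addition in Figure 2 can be reproduced with a short sketch. Table VI is not legible in this copy, so the rule coded below is the commonly cited carry-propagation-free signed-digit rule, in which the intermediate sum and carry at position i depend on the digits at position i and on the signs of the digits at position i-1. It is assumed to match Table VI because it reproduces the intermediate sum and carry rows of Figure 2. The function names are hypothetical.

```python
def cpf_add(x: list[int], y: list[int]) -> list[int]:
    """Add two equal-length RB numbers (digits in {-1, 0, 1}, MSB first).

    Returns the sum with one extra leading digit. The carry out of each
    position depends only on positions i and i-1, so it never ripples.
    """
    n = len(x)
    xs, ys = x[::-1], y[::-1]            # work least significant digit first
    s = [0] * n                          # intermediate sum digits
    c = [0] * (n + 1)                    # c[i] is the carry into position i
    for i in range(n):
        p = xs[i] + ys[i]
        lower_nonneg = i == 0 or (xs[i - 1] >= 0 and ys[i - 1] >= 0)
        if p == 2:
            c[i + 1], s[i] = 1, 0
        elif p == 1:
            c[i + 1], s[i] = (1, -1) if lower_nonneg else (0, 1)
        elif p == -1:
            c[i + 1], s[i] = (0, -1) if lower_nonneg else (-1, 1)
        elif p == -2:
            c[i + 1], s[i] = -1, 0
    z = [s[i] + c[i] for i in range(n)] + [c[n]]
    return z[::-1]


def rb_value(digits: list[int]) -> int:
    """Integer value of an RB number given MSB first."""
    value = 0
    for d in digits:
        value = 2 * value + d
    return value


X = [-1, 0, 1, 0, -1, 1, 1, 0]           # -98
Y = [0, 0, 1, -1, -1, 1, 1, 1]           # 15
Z = cpf_add(X, Y)
print(Z, rb_value(Z))
```

The choice between the two encodings of an intermediate sum of +1 or -1 is what guarantees that the incoming carry never pushes a digit outside the set of -1, 0 and 1.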
Table V: Conversion rules for SM to RB
4.3: Converting the RB number into 2’C number
The summation of the PPs is in RB form and has
to be converted back into 2’C form. This conversion is
carried out easily in the following manner: from Table V,
every digit $x_{RB,i} = (r_i^s r_i^m)$ of the RB number $X_{RB}$ is
composed of two bits. The left bit $r_i^s$ represents the sign
and the right bit $r_i^m$ represents the magnitude. One can
easily form a number $X_{RB}^{+}$ from the positive digits of
$X_{RB}$ and another number $X_{RB}^{-}$ from the negative
digits of $X_{RB}$. Then, subtracting $X_{RB}^{-}$ from $X_{RB}^{+}$, one
gets the result in 2’C form. The process can be
implemented using a fast adder. Since a fast adder is
essential in all multiplication algorithms to produce
the final result, the RB-to-2’C conversion does not
introduce any extra overhead.
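The RB-to-2’C conversion amounts to one binary subtraction, as the following sketch shows. Digits are given as integers in {-1, 0, 1} rather than as bit pairs, and the function names are hypothetical.

```python
def rb_to_int(digits: list[int]) -> int:
    """Convert an RB number (MSB first) to an integer.

    X_plus collects the positions holding +1, X_minus the positions
    holding -1; the result is the ordinary binary difference.
    """
    x_plus = x_minus = 0
    for d in digits:
        x_plus = (x_plus << 1) | (1 if d == 1 else 0)
        x_minus = (x_minus << 1) | (1 if d == -1 else 0)
    return x_plus - x_minus


def to_2c_pattern(value: int, n: int) -> int:
    """n-bit 2'C bit pattern of a signed integer."""
    return value & ((1 << n) - 1)


total = [0, -1, 1, -1, 0, 1, 1, 1, -1]   # sum digits from Figure 2
print(rb_to_int(total))
print(f"{to_2c_pattern(rb_to_int(total), 16):016b}")
```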
The conversion of SM-to-RB can be carried out as
follows: as the RB representation uses a digit set of
$\{\bar{1}, 0, 1\}$, one needs two bits $r_i^s r_i^m$ to represent one digit $r_i$. If
we use an SM coding to represent an RB digit, that is, $r_i^s$ to
represent the sign and $r_i^m$ to represent the magnitude, we
can easily convert an SM number into an RB number. For
an SM number $X = x_{n-1}x_{n-2}\ldots x_i\ldots x_1x_0$, the sign of the
number is decided by the sign bit $x_{n-1}$. Therefore, we can
group the sign bit $x_{n-1}$ with all the remaining bits in a pair by
4.4: The algorithm and its VLSI architecture
THE ALGORITHM:
Step 1: Convert the multiplicand from 2’C into
        the SM representation and keep the
        multiplier in 2’C form.
Step 2: Apply the radix-4 Booth’s algorithm to
        generate all the PPs represented in SM
        notation.
Step 3: Convert all the partial products from SM
        into RB representation.
Step 4: Sum up all the PPs through an RB adder
        tree.
Step 5: Convert the final result from RB into 2’C
        notation.
The corresponding VLSI architecture for the algorithm
is shown in Figure 3. It is composed of two major parts:
the PP generator and the redundant binary addition tree.
The key components in this architecture are the RB adder
in the addition tree and the Booth decoder in the PP
generator. A novel design for the RB adder and Booth
decoder based on SM coding has been developed. The new
multiplication schemes, ours is on average 18% faster in
terms of gate-delays.
RB adder has a critical path delay of 4 gate-delays, while the
fastest previously reported RB adder has a critical path
delay of 5 gate-delays [3]. The new Booth decoder needs
only 3 transistors and 3 control lines instead of the 9
transistors and 5 control lines of the 2’C based Booth
decoder [7].
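The five steps of the algorithm can be put together in a value-level sketch. It models the data flow only (SM multiplicand, Booth PP generation by sign flipping, SM-to-RB wiring, a 2-to-1 RB adder tree and the final conversion), not the gate-level RB adder or Booth decoder, which are not described in enough detail here to reproduce. Step 1 is modelled on signed Python integers, so no separate correction term appears; the CPF rule is the commonly cited one and is assumed to match Table VI. All names are hypothetical.

```python
def bit(x: int, i: int) -> int:
    """Bit i of a signed integer; x_{-1} = 0, high indices sign-extend."""
    return 0 if i < 0 else (x >> i) & 1


def booth_digits(x: int, n: int) -> list[int]:
    """Radix-4 Booth digits of the 2'C multiplier, least significant first."""
    return [-2 * bit(x, 2 * k + 1) + bit(x, 2 * k) + bit(x, 2 * k - 1)
            for k in range((n + 1) // 2)]


def partial_products_sm(x: int, y: int, n: int) -> list[tuple[int, int]]:
    """Steps 1 and 2: SM multiplicand, PPs as (sign, magnitude) pairs."""
    sign, mag = (1 if y < 0 else 0), abs(y)
    pps = []
    for k, d in enumerate(booth_digits(x, n)):
        pp_sign = sign ^ (1 if d < 0 else 0)   # negation flips the sign bit only
        pps.append((pp_sign, (abs(d) * mag) << (2 * k)))
    return pps


def sm_to_rb(sign: int, mag: int, width: int) -> list[int]:
    """Step 3: RB digits (LSB first); each digit pairs sign with one bit."""
    return [(-1 if sign else 1) * ((mag >> i) & 1) for i in range(width)]


def cpf_add(xs: list[int], ys: list[int]) -> list[int]:
    """Carry-propagation-free RB addition, digits LSB first."""
    n = len(xs)
    s, c = [0] * n, [0] * (n + 1)
    for i in range(n):
        p = xs[i] + ys[i]
        lower_nonneg = i == 0 or (xs[i - 1] >= 0 and ys[i - 1] >= 0)
        if abs(p) == 2:
            c[i + 1] = p // 2
        elif p == 1:
            c[i + 1], s[i] = (1, -1) if lower_nonneg else (0, 1)
        elif p == -1:
            c[i + 1], s[i] = (0, -1) if lower_nonneg else (-1, 1)
    return [s[i] + c[i] for i in range(n)] + [c[n]]


def rb_tree_sum(operands: list[list[int]]) -> list[int]:
    """Step 4: 2-to-1 reduction tree of CPF additions."""
    while len(operands) > 1:
        nxt = []
        for i in range(0, len(operands) - 1, 2):
            a, b = operands[i], operands[i + 1]
            w = max(len(a), len(b))
            nxt.append(cpf_add(a + [0] * (w - len(a)), b + [0] * (w - len(b))))
        if len(operands) % 2:
            nxt.append(operands[-1])
        operands = nxt
    return operands[0]


def rb_to_int(digits: list[int]) -> int:
    """Step 5: subtract the negative-digit number from the positive one."""
    plus = sum(1 << i for i, d in enumerate(digits) if d == 1)
    minus = sum(1 << i for i, d in enumerate(digits) if d == -1)
    return plus - minus


def multiply(x: int, y: int, n: int) -> int:
    """Multiply n-bit signed x (multiplier) by y (multiplicand)."""
    pps = [sm_to_rb(s, m, 2 * n) for s, m in partial_products_sm(x, y, n)]
    return rb_to_int(rb_tree_sum(pps))


print(multiply(-98, 15, 8))
```

The sketch keeps the multiplier in 2’C form throughout, so the Booth recoding is unchanged and only the representation of the multiplicand and of the partial products differs from a conventional design.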
[Table VII: A comparison of the speed (gate-delays) of different architectures. Operand lengths (bits): 8, 16, 32, 64, 128; rows include Makino and Wallace. Surviving cell values, row assignment unclear: 22; 33, 38.5; 12, 18, 24, 30.]
[Figure 3: Low power and high speed multiplier architecture.]
5: Concluding remarks
A low power multiplication algorithm and its VLSI
architecture are proposed. The reduced switching activity
and low power dissipation are achieved through the SM
representation of the multiplicand and through a novel
design of the RB adder and the Booth decoder for SM
numbers. The high speed operation is achieved through
the CPF accumulation of the PPs by using RB numbers.
The SM-to-RB conversion is carried out by grouping the
sign bit with all other bits, which does not require any
operation except some wiring. Analytical study indicates
that the ESA in the multiplicand negation process for PP
generation can be reduced on average by 90 percent.
The design of a low power redundant binary addition
tree is the subject of further research.
4.5: Comparisons
Since speed is always at a premium, let us compare
our algorithm with some reported fast
multiplication algorithms [1, 2, 3] and the Wallace tree
architecture, which is commonly thought to be the fastest
architecture for multipliers [4]. The comparison in Table
VII is made in terms of the equivalent gate delays along
the critical path required by the partial product addition tree
for different operand lengths. The extra 0.5 gate-delay
overhead introduced by the 2’C-to-SM conversion is
included. It is assumed that the delays of the partial product
generator and the final fast adder are the same for all the
architectures. The delay of a full adder, which is used in
the Wallace tree, is assumed to be 3 gate-delays. From
Table VII, the gate-delay count of our architecture is close to
that of the Wallace tree when the operand length is less than 64
bits. When the operand length exceeds 64 bits, our
architecture becomes faster than the Wallace tree
architecture. Furthermore, the 2-to-1 binary reduction tree
of our architecture implies much simpler layout and
routing than the 3-to-2 Wallace tree; this advantage will be
more pronounced as the technology moves into deep
sub-micron. As compared with other reported RB binary tree
References
[1] N. Takagi, et al., “High-Speed VLSI Multiplication
Algorithm with a Redundant Binary Addition Tree,” IEEE
Trans. on Computers, Vol. C-34, No. 9, pp. 789-796,
September 1985.
[2] H. Makino, et al., “A 8.8-ns 54x54-bit Multiplier Using
New Redundant Binary Architecture,” Proceedings of 1993
International Conference on Computer Design,
Cambridge, MA, USA, pp. 202-205, October 3-6, 1993.
[3] X. Huang, et al., “A High-Performance CMOS Redundant
Binary Multiplication-and-Accumulation (MAC) Unit,”
IEEE Trans. on Circuits and Systems-I: Fundamental
Theory and Applications, Vol. 41, No. 1, pp. 33-39,
January 1994.
[4] C. Wallace, “A Suggestion for a Fast Multiplier,” IEEE
Trans. on Electronic Computers, Vol. EC-13, pp. 14-17,
February 1964.
[5] L. P. Rubinfield, “A Proof of the Modified Booth’s
Algorithm for Multiplication,” IEEE Trans. on Computers,
Vol. C-24, No. 10, pp. 1014-1015, October 1975.
[6] A. Avizienis, “Signed-Digit Number Representations for
Fast Parallel Arithmetic,” IRE Trans. on Electronic
Computers, Vol. EC-10, pp. 389-400, September 1961.
[7] N. Weste and K. Eshraghian, Principles of CMOS VLSI
Design: A Systems Perspective, 2nd Edition, p. 555,
Addison-Wesley Publishing Company, 1993.