Modified Booth Algorithm Carry Save Adder for
High-Speed Multiplier
Mahyar Shahsavari
July 2012
Abstract
Designing an optimized processor has been a main concern of computer and
hardware designers in recent decades. Many approaches have been tested and
implemented. Different methods of addition lead to different methods of
multiplication and division. High-speed multipliers are critical for the
performance of processors. In fact, 8.72% of all instructions in a typical
scientific program are multiplications (Kyoung, H.L., 2003). In this report,
we present a parallel multiplier that applies the modified Booth algorithm
together with a Carry Save Adder (CSA). We enhanced the LEON3 soft-core
processor from Gaisler. The implementation is based on a Xilinx Virtex-4 board.
1 Introduction
For multiplication, the conventional iterative add-shift methods are inexpensive
to implement in terms of hardware, but the resulting execution speeds are too
low to satisfy the increasing demand for high-speed computing. Since the speed
of CPUs has increased tremendously in recent years, parallel multipliers can
be implemented to meet these high-speed requirements. Among the different
multiplication techniques, the modified Booth algorithm is very prominent.
This technique, combined with a Carry Save Adder (CSA), increases the
performance of parallel multipliers. In this paper, we work on enhancing the
performance of the 32-bit multiplier of the open-source LEON3 processor. The
idea was to integrate computer architecture and computer arithmetic concepts
and use them to design the required functional units and optimize the overall
processor performance. Furthermore, these functional units and the overall
processor needed to be implemented on a Virtex-4 ML410 FPGA board in order to
run the benchmark and measure the performance. The most important principle
of computer design is to focus on the common case: when making a design
trade-off, favor the frequent case over the infrequent case. For instance, the
instruction fetch and decode unit of a processor may be used much more
frequently than a multiplier or divider. Similarly, multiplication is performed
more often than division; thus, more performance can be gained by improving
the multiplier than the divider.
2 Present schemes used
There are three methods of multiplication:
• Binary Multiplication
• Array Multiplier
• Multiplier and Accumulator Unit (Tree Multiplier)
Binary multiplication is a software method, used when the processor does not
have a hardware multiplier. It works, but it is slow. The entire process
consists of three steps: partial product generation, partial product reduction,
and final addition. The next choice, the array multiplier, is a vast improvement
in speed over traditional bit-serial multipliers. The array multiplier is very
regular in its structure and uses only short wires to connect to the next full
adder, so it has a very simple and efficient layout in VLSI. It is still not
fast enough, however, and its area and power consumption are its most obvious
shortcomings. The tree multiplier, in other words the Multiplier and
Accumulator (MAC), can be fully parallelized. Efficiency can be improved
dramatically by using a high-performance CSA and a higher-radix multiplier.
Higher-radix multiplication schemes handle more than one bit of the multiplier
in each cycle. A higher representation radix leads to fewer digits, so the
multiplication algorithm requires fewer cycles, which means fewer partial
products.
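The three steps named above (partial product generation, partial product
reduction with a CSA, and final addition) can be sketched as a small
behavioural model in Python. This is only an illustration under stated
assumptions: the operands are unsigned, one partial product is generated per
multiplier bit (radix 2), and the names carry_save_add and multiply_csa are
hypothetical and do not come from the LEON3 sources.

```python
def carry_save_add(x, y, z):
    """3:2 compression on whole words: returns (s, c) with x + y + z == s + c."""
    s = x ^ y ^ z                               # bitwise sum, no carry propagation
    c = ((x & y) | (x & z) | (y & z)) << 1      # carries, moved one position left
    return s, c


def multiply_csa(a, b, n=8):
    """Unsigned n-bit multiply (assumption): generate, reduce, then add once."""
    # Step 1: partial product generation, one shifted copy of a per set bit of b.
    partials = [(a << i) if (b >> i) & 1 else 0 for i in range(n)]
    # Step 2: partial product reduction, keeping the result as a (sum, carry) pair.
    s, c = 0, 0
    for pp in partials:
        s, c = carry_save_add(s, c, pp)
    # Step 3: final addition, the only place where carries propagate.
    return s + c


if __name__ == "__main__":
    print(multiply_csa(13, 11))
```

The point of the sketch is that the loop never propagates a carry; only the
single addition at the end does, which is why a CSA shortens the critical path
of the reduction step.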
3 Overview of our design
The multiplication algorithm has four steps, and improving each one benefits
the whole process. These steps are partial product generation, partial product
addition, final addition, and accumulation. Techniques [3] such as Baugh-Wooley
(BW), the Booth Algorithm (BA), and the Modified Booth Algorithm (MBA) make
partial product generation faster and more efficient. For an n-bit multiplier,
the number of summands is n, at most n/2, and n/2 for BW, BA, and MBA,
respectively. In addition to the encoding step, the BA and MBA algorithms also
require generation of the two's complement of the multiplicand, which
introduces extra delay. The delay for two's-complement generation is not
trivial, but it has been consistently neglected in most of the designs proposed
in the literature. Partial product addition is improved by choosing a suitable
adder. In our design we applied a Carry-Save Adder, as illustrated in Figure 1.
A 2n-bit accumulator is required to store the final product. The Modified Booth
Algorithm [2] is the method we have chosen for producing the partial products.
In the conventional MBA, three-bit strings of the multiplier are scanned and
the appropriate operations are carried out on the multiplicand. We express the
n-bit numbers A and B by the sequences $a_{n-1} a_{n-2} \ldots a_0$ and
$b_{n-1} b_{n-2} \ldots b_0$, respectively. The product of the two numbers can
be written as
$$P = A \times B = \sum_{i=0}^{n-1} a_i 2^i \times \sum_{j=0}^{n-1} b_j 2^j = \sum_{i=0}^{n-1} \sum_{j=0}^{n-1} a_i b_j 2^{i+j} \tag{1}$$
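Equation (1) can be checked numerically with a few lines of Python. The sketch
below assumes that A and B are read as unsigned n-bit numbers, which is the
interpretation under which the plain double sum holds; the helper names bits
and double_sum_product are hypothetical.

```python
def bits(x, n):
    """Return the n low-order bits of x, least significant first."""
    return [(x >> i) & 1 for i in range(n)]


def double_sum_product(a, b, n):
    """Evaluate the double sum of a_i * b_j * 2^(i+j) from equation (1)."""
    a_bits, b_bits = bits(a, n), bits(b, n)
    return sum((a_bits[i] * b_bits[j]) << (i + j)
               for i in range(n)
               for j in range(n))


if __name__ == "__main__":
    print(double_sum_product(13, 11, 4))
```

Each term of the double sum is a single AND of one multiplicand bit with one
multiplier bit, weighted by its position, which is exactly the bit matrix that
a parallel multiplier has to add up.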
Figure 1: A partial schematic of the 32-bit CSA adder

In a straightforward parallel multiplication of two n-bit numbers, all the
partial products are generated simultaneously. Since a parallel hardware
implementation lends itself only to a fixed number of partial products, the
algorithm was modified by MacSorley [1] to encode 3-bit strings of the
multiplier at a time, with one overlapping bit. The multiplier can be written as
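$$B = \sum_{\substack{i=0 \\ i\ \mathrm{even}}}^{n} \left(-2 b_{i+1} + b_i + b_{i-1}\right) 2^i = \sum_{i=0}^{n/2} Q_i 4^i \tag{2}$$

where

$$Q_i = -2 b_{2i+1} + b_{2i} + b_{2i-1}$$

with $b_{-1} = 0$ and $Q_i \in \{-2, -1, 0, +1, +2\}$. The product of the
multiplication can be written as

$$P = \sum_{i=0}^{n/2} A Q_i 4^i \tag{3}$$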
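The recoding in equations (2) and (3) can be illustrated with a short Python
model. It is a sketch under stated assumptions, not the hardware itself: the
multiplier B is taken to be an n-bit two's-complement number with n even, in
which case the digits Q_0 to Q_{n/2-1} already represent B exactly, so the
model stops at index n/2 - 1. The function names booth_radix4_digits and
booth_multiply are hypothetical.

```python
def booth_radix4_digits(b, n):
    """Recode a signed n-bit multiplier b (n even) into n/2 digits in {-2..+2}."""
    assert n % 2 == 0, "this sketch assumes an even word length"
    pattern = b & ((1 << n) - 1)        # two's-complement bit pattern of b

    def bit(k):
        # b_{-1} is defined as 0; all other bits come from the pattern.
        return 0 if k < 0 else (pattern >> k) & 1

    # Q_i = -2*b_{2i+1} + b_{2i} + b_{2i-1}; neighbouring triples share one bit.
    return [-2 * bit(2 * i + 1) + bit(2 * i) + bit(2 * i - 1)
            for i in range(n // 2)]


def booth_multiply(a, b, n):
    """Sum the partial products A * Q_i * 4^i, as in equation (3)."""
    return sum(a * q * 4 ** i for i, q in enumerate(booth_radix4_digits(b, n)))


if __name__ == "__main__":
    print(booth_radix4_digits(-3, 4))
    print(booth_multiply(7, -3, 4))
```

For a 32-bit multiplier this gives 16 digits, so only 16 partial products have
to be added instead of 32, and every one of them is a simple multiple of A
(0, A, 2A or their negatives).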
An encoder accepts three-bit strings of the multiplier as input and outputs
the appropriate control signals, as shown in Figure 2. The truth table for the
encoder and the operation selected by each three-bit sequence of the multiplier
are shown in Table 1. The control signals generated by the encoder are Z, ADD,
2ADD, 2SUB, SUB and NEG. Z is the signal for which the multiplexer outputs zero
instead of the multiplicand. ADD and 2ADD are signals for which the multiplexer
produces the multiplicand and twice the multiplicand, respectively. The SUB and
2SUB control signals make the multiplexer generate the complement of the
multiplicand and the complement of twice the multiplicand, respectively.
Finally, NEG generates a 0 or a 1 depending upon whether
the multiplexer generates a positive or a complemented number. Subtraction
can be carried out using 2's complement addition. This involves adding one to
the complement of the multiplicand at the LSB for SUB and 2SUB operations.
The extra one is generated by the encoding logic.

Figure 2: The radix-4 schematic using the Booth encoding method

Table 1: Modified Booth algorithm

   Multiplier bits     |          Booth modified outputs
b2i+1   b2i   b2i-1    |  Z   ADD   2ADD   2SUB   SUB   NEG   Mux Out
  0      0      0      |  1    0     0      0      0     0      0
  0      0      1      |  0    1     0      0      0     0     +A
  0      1      0      |  0    1     0      0      0     0     +A
  0      1      1      |  0    0     1      0      0     0     +2A
  1      0      0      |  0    0     0      1      0     1     -2A
  1      0      1      |  0    0     0      0      1     1     -A
  1      1      0      |  0    0     0      0      1     1     -A
  1      1      1      |  1    0     0      0      0     0      0
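The truth table and the remark about adding one at the LSB can be modelled in
Python as follows. This is a hedged sketch: the dictionary BOOTH_TABLE simply
transcribes Table 1, while the function names mux_out and partial_product and
the width parameter are hypothetical. The width is assumed to be large enough
to hold 2A and -2A as signed values.

```python
# (b_{2i+1}, b_{2i}, b_{2i-1}) -> (Z, ADD, 2ADD, 2SUB, SUB, NEG), as in Table 1
BOOTH_TABLE = {
    (0, 0, 0): (1, 0, 0, 0, 0, 0),
    (0, 0, 1): (0, 1, 0, 0, 0, 0),
    (0, 1, 0): (0, 1, 0, 0, 0, 0),
    (0, 1, 1): (0, 0, 1, 0, 0, 0),
    (1, 0, 0): (0, 0, 0, 1, 0, 1),
    (1, 0, 1): (0, 0, 0, 0, 1, 1),
    (1, 1, 0): (0, 0, 0, 0, 1, 1),
    (1, 1, 1): (1, 0, 0, 0, 0, 0),
}


def mux_out(a, triple, width):
    """Return (word, neg): the multiplexer word before the +1 and the NEG bit."""
    z, add, add2, sub2, sub, neg = BOOTH_TABLE[triple]
    mask = (1 << width) - 1
    if z:
        word = 0
    elif add:
        word = a
    elif add2:
        word = 2 * a
    elif sub2:
        word = ~(2 * a)      # bitwise complement of 2A
    else:
        word = ~a            # SUB: bitwise complement of A
    return word & mask, neg


def partial_product(a, triple, width):
    """Add NEG at the LSB and read the result as a signed width-bit number."""
    word, neg = mux_out(a, triple, width)
    total = (word + neg) & ((1 << width) - 1)
    return total - (1 << width) if total >> (width - 1) else total


if __name__ == "__main__":
    print(partial_product(5, (1, 0, 0), 6))
```

Adding NEG to the complemented word is what turns the one's complement into
the two's complement, so the multiplexer itself never needs an adder.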
4 Implement Multiplier
We designed a multiplier which can multiply two 32-bit signed numbers in 3
cycles. The 32 bits of the multiplicand and 16 bits of the multiplier are fed
to the multiple generation block. The 16 outputs of the multiple generation
block are combined with the sum and carry from the previous cycle. The lower
16 bits of the sum and the lower 15 bits of the carry are fed into a 16-bit
carry-propagate adder (CPA) to produce the lower 16 bits of the product, and
after 2 iterations of this process the lower 32 bits of the product are
obtained. After choosing suitable algorithms, the first step was to write the
code for the multiplier and a testbench to verify the correctness of our
design. After verifying the multiplier on its own, we integrated it into the
complete LEON3 project. The VHDL code for the Booth encoder is
two_a     <= a(30 downto 0) & '0';       -- shift left to produce 2a
a_bar     <= not a;                      -- complement of a; cin = 1 completes -a
two_a_bar <= a_bar(30 downto 0) & '1';   -- complement of 2a; cin = 1 completes -2a
aa <= a              when b="001" or b="010"  -- select the Booth output from the 3 bits of b
      else two_a     when b="011"
      else two_a_bar when b="100"             -- cin = 1
      else a_bar     when b="101" or b="110"  -- cin = 1
      else x"00000000";
cin <= '1' when b="100" or b="101" or b="110"
       else '0';
topbit <= a(31)          when b="001" or b="010" or b="011"
          else a_bar(31) when b="100" or b="101" or b="110"
          else '0';
Figure 3: Simulation results of testbench (Modelsim)
Looking again at Table 1 makes it easier to see how this code checks the three
bits of b and, based on them, selects the Booth encoder output as a partial
product. Running the testbench of the designed multiplier (Figure 3) shows
results which confirm that the multiplication is correct.
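The selection logic in the listing above can also be exercised in software.
The following Python sketch mirrors the signals aa, cin and topbit and
accumulates the 16 partial products of a 32-bit signed multiplication. It
rests on a few assumptions that are not stated in the text: topbit is treated
as bit 32 of a 33-bit partial product, cin is added at the LSB of that partial
product, and the -2a case is modelled as the bitwise complement of 2a so that
the carry-in completes the two's complement. The function names booth_select
and multiply_signed32 are hypothetical.

```python
MASK32 = (1 << 32) - 1


def booth_select(a, b3):
    """a: 32-bit pattern, b3: the bits b_{2i+1} b_{2i} b_{2i-1} as an int 0..7.

    Returns (aa, cin, topbit) following the structure of the VHDL listing.
    """
    two_a = (a << 1) & MASK32
    a_bar = ~a & MASK32
    two_a_bar = ~two_a & MASK32          # complement of 2a (assumption, see text)
    if b3 in (0b001, 0b010):
        return a, 0, (a >> 31) & 1
    if b3 == 0b011:
        return two_a, 0, (a >> 31) & 1
    if b3 == 0b100:
        return two_a_bar, 1, (a_bar >> 31) & 1
    if b3 in (0b101, 0b110):
        return a_bar, 1, (a_bar >> 31) & 1
    return 0, 0, 0                        # "000" and "111" select zero


def multiply_signed32(a, b):
    """Signed 32 x 32 multiply built from 16 radix-4 Booth partial products."""
    a_pat = a & MASK32
    b_ext = (b & MASK32) << 1             # append b_{-1} = 0 below the LSB
    acc = 0
    for i in range(16):
        b3 = (b_ext >> (2 * i)) & 0b111   # overlapping 3-bit window
        aa, cin, topbit = booth_select(a_pat, b3)
        word = (topbit << 32) | aa        # 33-bit partial product
        if topbit:
            word -= 1 << 33               # interpret it as a signed value
        acc += (word + cin) << (2 * i)    # weight 4^i, carry-in at the LSB
    return acc


if __name__ == "__main__":
    print(multiply_signed32(-7, 9))
```

Python integers are unbounded, so the model does not reproduce the carry-save
datapath or the three-cycle schedule; it only checks that the selected
multiples and carry-ins add up to the signed product.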
5 Timing Report
After implementing our design in ISE, the timing summaries in Table 2 were
obtained. In order to run the Dhrystone benchmark, we had to implement the
modified processor on the Virtex-4 FPGA, which required placement and routing
of our design. The actual clock of the design is not the one given in the
synthesis report; the actual speed at which the design can run is known only
after placing and routing the design on the target FPGA. By taking advantage
of the radix-4 Booth encoder, the number of logic levels decreased in addition
to the improvement in minimum period. For instance, for the slack (setup path)
with source l3.cpu[0].u0/p0/iu0/r.x.result3 (FF) and destination
l3.cpu[0].u0/cmem0/dme.dtags1.dt1.dt0[1].dtags0/xc2v.x0/a9.x[0].r0 (RAM), the
number of logic levels decreased from 15 to 13, which could also save a notable
amount of area.
Table 2: Timing Summary

Processor   Constraints (paths)   Constraints (connections)   Min period   Max freq
Baseline    7435292               62271                       12.418 ns    80.528 MHz
Modified    6775941               61821                       12.408 ns    80.593 MHz
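The two timing columns of Table 2 are related by f_max = 1 / T_min. The short
Python sketch below reproduces the maximum frequencies from the reported
minimum periods; the function name max_frequency_mhz is hypothetical and the
only inputs are the values from the table.

```python
def max_frequency_mhz(min_period_ns):
    """Convert a minimum clock period in ns to a maximum frequency in MHz."""
    return 1000.0 / min_period_ns


baseline_mhz = max_frequency_mhz(12.418)   # baseline processor, Table 2
modified_mhz = max_frequency_mhz(12.408)   # modified processor, Table 2

if __name__ == "__main__":
    print(f"baseline: {baseline_mhz:.3f} MHz, modified: {modified_mhz:.3f} MHz")
```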
Figure 4: Device Utilization Summary for Baseline Processor
6 Performance Results
After these modifications and the implementation, we compare the Device
Utilization Summaries produced by ISE for the baseline soft core and for the
modified one. Figures 4 and 5 are snapshots of the design summary of the Xilinx
ISE tools, version 13.3.
Figure 5: Device Utilization Summary for Modified Processor

7 Summary and Conclusion
In this report, we have presented an algorithm for faster and more efficient
multiplication. Multiplication is used frequently by the processor; therefore,
we expected better processor performance. We did our implementation on the
LEON3 32-bit open-source soft core. Our platform was a Xilinx Virtex-4 board,
and the frequency we applied was the same as the one used for the original
LEON3 (80 MHz). With this modification, the number of occupied slices
decreased, which saves area and consequently reduces power consumption. Our
multiplier takes 3 clock cycles, so the processor is now faster, as shown in
the timing summary section.
For future work, I plan to work on the divider and apply a new, efficient
algorithm to it. I am also considering a higher clock frequency for this core.
In this work, I could not fully investigate the power reduction due to clock
gating and other techniques because of time limitations, but I am planning to
work on it in the summer. It is possible to use the low power intelligent tool
environment (LITE) with back annotation to investigate the power consumption
further, but due to lack of time I could not work on that.
As a final comment, I would like to mention that this course project was very
interesting and I learnt many things from it. I understood the concept of a
soft core, how to use Modelsim and Xilinx ISE, how to write VHDL code, how to
check new arithmetic algorithms and ideas, and many other technical things
related to computer arithmetic. Through this exercise, I gained practical
insight into the role of different factors in processor performance, the
realization of arithmetic circuits, and their improvement. The only limitation,
which cost me more time, was working alone without enough feedback.
References
[1] Algirdas Avizienis. Binary-compatible signed-digit arithmetic. In Proceedings of the October 27-29, 1964, fall joint computer conference, part I,
AFIPS ’64 (Fall, part I), pages 663–672, New York, NY, USA, 1964. ACM.
[2] Shiann-Rong Kuang, Jiun-Ping Wang, and Cang-Yuan Guo. Modified booth
multipliers with a regular partial product array. Trans. Cir. Sys., 56(5):404–
408, May 2009.
[3] Behrooz Parhami. Computer arithmetic: algorithms and hardware designs.
Oxford University Press, Oxford, UK, 2000.