A Simple High-Speed Multiplier Design

D.S. Dawoud, University of KwaZulu Natal, email: [email protected]
Peter D. Dawoud, University of KwaZulu Natal, email: [email protected]
Ndagije Charles, National University of Rwanda, email: [email protected]
ABSTRACT
The performance of multiplication is crucial for
multimedia applications such as 3D graphics and signal
processing systems, which depend on the execution of
large numbers of multiplications. Many algorithms have
been proposed to implement high-speed parallel multipliers.
These algorithms mainly focus on rapidly reducing
the partial product rows down to the final sums and
carries used for the final accumulation, and they
mostly rely on circuit optimization and
minimization of the critical paths. This paper focuses
on reducing the number of generated partial product
rows before any partial product reduction
technique is applied. Fewer partial product rows
mean a shorter overall operation time.
The paper introduces two techniques for
reducing the number of generated partial product rows.
The first technique uses an algorithm and a circuit that
compute the 2’s complement quickly. The
second technique uses the conventional way of getting
the 2’s complement (complement the binary number and
add 1 to the complemented number), but a simple
hardware circuit is proposed to implement the “add 1”
operation without any carry propagation.
In addition to the speed improvement, our algorithms
result in a true diamond-shape for the partial product
tree, which is more efficient in terms of
implementation.
Simulation of the proposed techniques
showed a large improvement in speed and in power
consumption compared to conventional
multiplication algorithms.
Index Terms—Multiplier, Booth Algorithm, modified
Booth encoding (MBE), partial products.
1. INTRODUCTION
Many applications, e.g. digital signal processing
systems and 3D graphics, are highly multiplication-intensive.
The performance of such systems strongly
depends on the performance of the multiplier and the
algorithm used for implementing the multiplication
operations. Therefore, there has been much work on
advanced multiplication algorithms and architectures
during the last decade [1], [2], [3], [4], [5], [6], [7], [8],
[9], [10], [11], [12], [13], [14], [15], [16], [17], [18],
[19].
Multiplication consists of three major steps. In
the first step, the partial products are generated. In the
second step, the partial products are reduced to one row
of final sums and one row of carries. In the third step,
the final sums and carries are added to generate the
result. For the first step, most of the authors employ the
Modified Booth Encoding (MBE) approach [3], [4],
[5], [11], [12], [19] because of its ability to cut the
number of partial product rows in half. In the second
step, authors normally use some variation of the
known partial product reduction schemes, such as
Wallace trees [4], [6], [21], [26] or compressor trees
[6], [12], [13], [14], [15], to
rapidly reduce the number of partial product rows to
the final two (sums and carries). In the third step, they
use an advanced adder such as a
carry-lookahead or carry-select adder [6], [12], [18]
to add the final two rows, resulting in the final product.
The main focus of recent multiplier papers [3],
[9], [10], [14], [16], [15] has been on rapidly reducing
the partial product rows by using some kind of circuit
optimization and identifying the critical paths and
signal races. In other words, the goals have been to
optimize the second step of the multiplication
described above.
It is easy to deduce that in all these techniques the
time required to reduce the partial product rows
depends on the number of rows to be reduced. For
example, when using a Wallace tree or a compressor tree,
the propagation time depends on the depth of the tree
(the number of levels in the tree). This means that to
improve the performance of the multiplier we have to
start by looking for algorithms and techniques that
produce fewer partial product rows. This paper
concentrates on the first step, i.e. forming the partial
product array. Its goal is to find a
multiplication algorithm and circuits that
produce fewer partial product rows. With fewer
partial product rows, the reduction tree can be smaller
in size and faster in speed.
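As a rough illustration of this dependence, the sketch below counts reduction levels under the simplifying assumption that every level is built from 3:2 carry-save adders only, so each full group of three rows becomes two. The function name csa_levels is ours, and trees built from 4:2 compressors or other cells will give different counts.

```python
def csa_levels(rows):
    """Number of 3:2 carry-save levels needed to reduce `rows` rows to two."""
    levels = 0
    while rows > 2:
        # Each full group of three rows becomes two; leftover rows pass through.
        rows = 2 * (rows // 3) + rows % 3
        levels += 1
    return levels


if __name__ == "__main__":
    for row_count in (4, 5, 8, 9):
        print(row_count, "rows ->", csa_levels(row_count), "levels")
```

Under this assumption, four rows need two levels and five rows need three, so in this simplified model the saving appears whenever the extra row pushes the count across a level boundary.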
A well-defined approach was presented by
Jung-Yup Kang and Jean-Luc Gaudiot in [11]. Their
paper describes a scheme to eliminate the sign
extension problem produced in MBE Wallace
multipliers. The authors of [11] showed that taking
the 2’s complement of the multiplicand removes one
row of partial products and a negative signal, which
reduces the carry save adder tree (CSAT). The
reduction of the CSAT also reduces the delay and power
consumption. They also presented a fast
technique to compute the 2’s complement of a binary
number. The main problem with their proposal is that it
is practical only for multiplying words of limited width
(they recommended 8 x 16). Increasing the word length
results in hardware complexity and also increases the
multiplication time. These limitations are
avoided in our proposals, which are
suitable for any word length.
The paper is organized as follows. In Section-II,
the conventional multiplication method is described
with an emphasis on its weaknesses. In Section-III, a
step-by-step procedure to prevent the adverse effects of
some conventional multiplication algorithms is
presented. In Section-IV, the effectiveness of our method
and its implementation are evaluated and analysed
and, finally, a summary is presented.
2. THE CONVENTIONAL
MULTIPLICATION ALGORITHMS AND
THE OVERHEAD
Modified Booth Encoding (MBE) is one of the most
efficient techniques that can be used to reduce the
number of partial products. Applying MBE, the
number of partial product rows to be accumulated is
reduced from n to n/2, where n is the multiplier width.
This explains why it is used in implementing many
multipliers [5], [7], [8], [14], [15], [26]. However, it is
important to note that there are two unavoidable
consequences of using MBE: sign extension and
negative encoding. Together, these two
consequences result in the formation of
one additional partial product row, which requires not
only more hardware but also, and more importantly,
more time (to add this extra row of partial products).
Let us look at the benefit and the overhead of
MBE by an example. For an 8-bit x 8-bit
multiplication, a multiplier without MBE will generate
eight partial product rows (because there is one partial
product row for each bit of the multiplier). However,
with MBE, only n/2 (= 4) partial product rows are
generated, as shown in the example of Figure 1.
Overhead due to negative Partial Product
terms:
The partial products may take one of the values 2Y, Y,
0, -Y, and -2Y. Because negative values can be
generated, the neg signals
(neg0, neg1, neg2, and neg3) must be added. This means
that the actual number of partial product rows is
[(n/2) + 1] and not n/2 (see neg3 in Figure 1).
Having one more partial product row adds at
least one more EXOR-delay to the time to reduce the
partial products. The delay of this additional row is
even more critical for operands with a small word
length (e.g. 8 x 16) than for longer operands, because
its relative contribution to the total delay is larger.
There is also an extra hardware cost, since one
more carry-save adder stage is necessary.
Overhead due to Sign Extension:
The second overhead of using MBE is the sign
extension (Figure 1). When adding the n/2 partial
products Pi, the (i+1)th partial product Pi+1 is placed
two bits to the left of the ith partial product Pi.
However, all Pi’s should be sign extended to the m.n
binary position.
Fig. 1 The array of partial products for signed multiplication with MBE
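The row count discussed above can be reproduced with a small behavioural model. The sketch below is an arithmetic illustration only, not the authors' circuit: the names booth_digits, mbe_partial_products and mbe_multiply are hypothetical, operands are ordinary Python integers rather than fixed-width bit vectors, and a negative row is modelled as a bitwise complement plus a separate neg bit.

```python
def booth_digits(multiplier, m):
    """Radix-4 Booth digits (each in -2..2) of an m-bit two's-complement value, m even."""
    digits, prev = [], 0              # prev models the implicit bit y[-1] = 0
    for i in range(0, m, 2):
        low = (multiplier >> i) & 1
        high = (multiplier >> (i + 1)) & 1
        digits.append(-2 * high + low + prev)
        prev = high
    return digits


def mbe_partial_products(multiplicand, multiplier, m):
    """Return (rows, negs): m/2 rows plus the neg bits that complete negative rows."""
    rows, negs = [], []
    for digit in booth_digits(multiplier, m):
        magnitude = abs(digit) * multiplicand   # 0, Y or 2Y
        if digit < 0:
            rows.append(~magnitude)             # bitwise complement only
            negs.append(1)                      # the deferred "+1"
        else:
            rows.append(magnitude)
            negs.append(0)
    return rows, negs


def mbe_multiply(multiplicand, multiplier, m):
    """Sum the rows and neg bits, each row shifted two positions per Booth digit."""
    rows, negs = mbe_partial_products(multiplicand, multiplier, m)
    return sum((row + neg) << (2 * i) for i, (row, neg) in enumerate(zip(rows, negs)))


if __name__ == "__main__":
    rows, negs = mbe_partial_products(19, -7, 8)
    print(len(rows), "rows, neg bits:", negs, "product:", mbe_multiply(19, -7, 8))
```

For an 8-bit multiplier the model yields four rows, and the neg bit of the last row is the one that has no free slot in the array and therefore forms the extra row.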
Sign Extension prevention
The sign extension can be avoided if we accumulate
the partial products Pi one at a time. In
parallel multipliers, e.g. those using a Wallace tree, all
of the partial products are added in parallel, so
sign extension becomes very costly. We consider
here one way to prevent sign extension.
This technique assumes that all the partial products are
negative. In this case, the sum of all sign extensions
can be pre-calculated:
$$\text{signs} = \sum_{i=0}^{M/2-1} \left((-1)\cdot 2^{N}\right)\cdot 4^{i} = 2^{N}\,(-1)\left(\frac{2^{M}-1}{3}\right)$$
The above relation can be interpreted to mean that a
fixed number, (-1)[(2^M - 1)/3], should be added to the
(unextended) partial products, starting from the Nth
binary position leftwards. This number, expressed in
binary, is equal to 1010101…01011, where there are
M/2 – 1 zeros. If it turns out that a partial product, Pi =
di.Y (di = 2, 1, 0, -1, -2), is indeed positive, we simply
replace its sign bit with a “1” to undo the effect of our
earlier assumption about the negativeness of the partial
product Pi. The technique is shown in Figure 2 and it
can be easily modified to take the form shown in
Figure 3. The modified form, Figure 3, means that the
additional word that is used for sign extension
prevention does not increase the number of rows that
result when using MBE. The first three bits of the word
are taken into consideration by the first partial product
row, and the remaining bits are added to the other
partial product rows. Concerning the execution time,
the signal neg3 is the only overhead.
The sign extension results in a non-regular shaped
partial product array. A regular shaped partial product
array is a more efficient configuration for VLSI
implementation.
3. PREVENTING THE ADDITIONAL
PARTIAL PRODUCT ROW
To prevent the extra partial product row and, thus, save
the time of one additional carry-save adder stage and
the hardware required for it,
it is necessary to find ways to remove the last
neg signal (neg3 in Figure 3). Removing the last neg
signal also gives a more regularly shaped partial
product array, making it a more efficient configuration
for VLSI implementation. In this paper we
propose two techniques to remove the last neg. The
first proposed technique is based on stopping the
generation of the neg signal by introducing a fast
method to find the 2’s complement. In the second
proposal, the signal neg is generated as usual but a
suitable circuit is used to take it into consideration
without making it a separate partial product row.
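The pre-calculated sign-extension word used in Figures 2 and 3 can be checked numerically. The sketch below is only a sanity check of the closed form and of the bit pattern quoted in the text; the function names are ours, and the sum is taken over the M/2 partial products (i = 0 to M/2 - 1), which is the range for which the closed form holds.

```python
def sum_of_assumed_signs(n_pos, m):
    """Sum of the sign extensions when all M/2 partial products are taken as negative."""
    return sum(-(1 << n_pos) * 4 ** i for i in range(m // 2))


def closed_form(n_pos, m):
    """2^N * (-1) * (2^M - 1) / 3; the division is exact for even M."""
    return -(1 << n_pos) * (((1 << m) - 1) // 3)


def correction_word(m):
    """The fixed number -(2^M - 1)/3 written as an M-bit two's-complement string."""
    value = (-(((1 << m) - 1) // 3)) % (1 << m)
    return format(value, "0{}b".format(m))


if __name__ == "__main__":
    print(sum_of_assumed_signs(8, 8) == closed_form(8, 8))
    print(correction_word(8))
```

For M = 8 the word is 10101011, which has M/2 - 1 = 3 zeros, in line with the pattern 1010101…01011 given in the text.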
Fig. 2 Multiplication algorithm with negative encoding and sign extension prevention.
Fig. 3 Modified form of Figure 2.
3.1 Removing the Last neg Signal by Using
a Fast Method to Find 2’s Complement
As mentioned before, if the encoding technique
generates only the signals +2Y, +Y, and 0 (where Y is
the multiplicand) the neg signals would not be
necessary, i.e. there would be no need for the
additional overhead partial product row (neg3 in our
examples). MBE encoding generates the signals +2Y,
+Y, 0, -2Y, -Y, which necessitates the
neg signals to calculate the negative values (to get the
2’s complement).
Taking into consideration the fact that neg
exists to produce the 2’s complement, we can say that
if we could somehow produce the two’s complement of
the multiplicand while the other partial products were
produced, there would be no need for the last neg
because this neg signal would have already been
applied when generating the two’s complement of the
multiplicand. In such a case, we “only” need to find a
faster method to calculate the two’s complement of a
binary number. The first proposal is, thus, based on
finding an efficient way of finding the two’s
complement of a binary number.
A Fast Method to Find Two’s
Complements
The conventional method of getting the 2’s
complement, by first complementing the binary number
and then adding “1” to the complemented number (this is
the way used effectively in Figures 1 to 3), cannot be
used in the case of parallel multiplication. This is because
the propagation delay of the carry increases linearly
with the word size and would be much greater than
the delay to generate the partial products.
In the second well-known algorithm for getting the 2’s
complement, which we are going to use in our
proposal, all the bits after the rightmost “1” in the word
are complemented but all the other bits are unchanged.
The two’s complement of the binary number 01010100 is
10101100 (Figure 4). For this number, the rightmost
“1” occurs in bit position 2 (the check mark position
in Figure 4). Therefore, values in bit positions 3 to 7
can simply be complemented, while values in bit
positions 0, 1 and 2 are kept unchanged. Using this
algorithm, two’s complementation comes down
to finding the conversion signals that are used for
selectively complementing some of the input bits. If
the conversion signal at any position is “0” (the crosses
in Figure 4), then the value is kept unchanged and, if the
conversion signal is “1” (the checks in Figure 4), then
the value is complemented. The conversion signals
to the left of the rightmost “1” are always “1”. They are “0”
otherwise. Once a lower order bit has been found to be
a “1”, the conversion signals for the higher order bits to
the left of that bit position should all be “1”.
Bit position                     7 6 5 4 3 2 1 0
Input binary                     0 1 0 1 0 1 0 0
First 1’s Appearance from LSB              ✓
Complement                       ✓ ✓ ✓ ✓ ✓ X X X
2’s Complement                   1 0 1 0 1 1 0 0
Fig. 4 2’s Complement conversion example
The main problem with this technique is how to find a
fast algorithm for locating the
rightmost “1”. Searching for the rightmost “1”
could be as time consuming as rippling a carry through
to the MSB, since the information from the lower bits must be
transferred to the MSB. In this section we
propose a method to expedite the detection of the
rightmost “1”.
In a recent paper [11], the authors proposed a
technique that can achieve the search for the rightmost
“1” in logarithmic time by using a binary search
tree-like structure. This technique is practical for small
word lengths.
The proposed algorithm and its implementation are
shown in Figure 5. The output of the ith stage, si, is
given by equation (1).
$$s_i = a_i \oplus (a_{i-1} + a_{i-2} + \cdots + a_1 + a_0) = \begin{cases} a_i & \text{if } (a_{i-1} + a_{i-2} + \cdots + a_0) = 0 \\ \overline{a_i} & \text{if } (a_{i-1} + a_{i-2} + \cdots + a_0) = 1 \end{cases} \qquad (1)$$
In this expression “+” means logic “OR”, “⊕” means
EXOR and (ai-1 + ai-2 + … + a0) represents the
conversion signal.
This algorithm achieves all the requirements for getting
the 2’s complement:
• (ai-1 + ai-2 +…+ a0) = 0 means that the conversion
signal is “0” and the value of ai is kept unchanged.
This happens only when all the bits to the right of
ai are zeros.
• (ai-1 + ai-2 +…+ a0) = 1 means that the conversion
signal is “1” and the value of ai is complemented
($s_i = \overline{a_i}$). This happens only when at least one of the
bits to the right of ai is “1”.
This algorithm is suitable for any word length without
any significant increase in the time required to get the
2’s complement. To limit the delay that may arise from
the propagation of the signals between the OR gates,
the word can be divided into groups of, for example,
4 bits, with an “OR” gate between each two groups to stop
any possible propagation, as shown in Figure 5(b).
The proposed partial products after removing the last
neg (neg3 in Figure 3) are shown in Figure 6.
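Equation (1) can be exercised directly in software. The following is a minimal behavioural sketch, with hypothetical function names, in which the conversion signal of bit i is the OR of all lower-order bits; the loop evaluates these ORs one after another, whereas the proposed circuit forms them in parallel hardware.

```python
def conversion_signals(a, width):
    """Return c[0..width-1], where c[i] is the OR of bits a[i-1] .. a[0] (c[0] = 0)."""
    signals, seen_one = [], 0
    for i in range(width):
        signals.append(seen_one)
        seen_one |= (a >> i) & 1
    return signals


def twos_complement(a, width):
    """Apply s_i = a_i XOR c_i to every bit position of a width-bit word."""
    result = 0
    for i, c in enumerate(conversion_signals(a, width)):
        s_i = ((a >> i) & 1) ^ c
        result |= s_i << i
    return result


if __name__ == "__main__":
    print(format(twos_complement(0b01010100, 8), "08b"))
```

With the input 01010100 of Figure 4, the conversion signals are 0 for bit positions 0 to 2 and 1 for bit positions 3 to 7, and the result is 10101100.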
Fig. 5 Finding 2’s Complement using the proposed algorithm (panels (a) and (b))
Fig. 6 Proposed partial products after removing the last neg
Fig. 7 Partial products when m > (n+4)
By applying the first proposal for
2’s complementation, the last PPR (in Figure 3) is
correctly replaced without the last neg (Figure 6). The
multiplication now has a shorter critical path, because
one extra carry-save adder stage is avoided.
This reduces the time to find the product and
saves the hardware corresponding to that
stage.
The only difference between our multiplier
architecture and the conventional multiplier
architectures is that, for the last partial product row, our
architecture has no partial product generation but
partial product selection with a two’s complement unit.
A 3-5 decoder is needed to select the correct
value from five possible inputs (2 x X, 1 x X, 0, -1 x X,
-2 x X), which come either from the two’s
complement logic or from the multiplicand itself and are
fed into the row of 5-1 selectors. Unlike the other rows,
which use a PPRG (Partial Product Row Generator), the
two’s complement logic does not have to wait for MBE
to finish: the two’s complementation is performed in
parallel with MBE and the 3-5 decoder.
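The effect of selecting the last row from complete two's complements can be checked with an integer-level model. This is a sketch under stated assumptions rather than the authors' circuit: the function names are hypothetical, Python integers stand in for fixed-width words, and the two's complement unit is modelled simply as arithmetic negation.

```python
def booth_digits(multiplier, m):
    """Radix-4 Booth digits of an m-bit two's-complement multiplier (m even)."""
    digits, prev = [], 0
    for i in range(0, m, 2):
        low = (multiplier >> i) & 1
        high = (multiplier >> (i + 1)) & 1
        digits.append(-2 * high + low + prev)
        prev = high
    return digits


def select_last_row(multiplicand, digit):
    """5-to-1 selection: the negative candidates are complete two's complements."""
    candidates = {
        2: 2 * multiplicand,
        1: multiplicand,
        0: 0,
        -1: -multiplicand,        # supplied by the two's complement unit
        -2: -2 * multiplicand,    # same unit, shifted left by one position
    }
    return candidates[digit]


def multiply_without_last_neg(multiplicand, multiplier, m):
    """All rows but the last use complement-plus-neg; the last row needs no neg bit."""
    digits = booth_digits(multiplier, m)
    total = 0
    for i, digit in enumerate(digits[:-1]):
        magnitude = abs(digit) * multiplicand
        row, neg = (~magnitude, 1) if digit < 0 else (magnitude, 0)
        total += (row + neg) << (2 * i)
    last_index = len(digits) - 1
    total += select_last_row(multiplicand, digits[-1]) << (2 * last_index)
    return total


if __name__ == "__main__":
    print(multiply_without_last_neg(-100, -128, 8))
```

Because the last row already carries its own "+1", no neg bit is left over for it, which is what removes the extra row in the array.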
3.2 Removing the last neg signal by
using a Simple Look-up Table
Figure 3 given before represents the shape of the partial
product rows (PPR) in the case when the multiplier
length (m) equals the multiplicand length (n). Figure
7 represents the corresponding shape when the
multiplier length (m) is greater than the multiplicand
length by 4 bits or more. From these two figures, it is
easy to note the following:
- In the case when n = m (Figure 3), if we add
the last neg signal to the first partial product
row, this will have an effect on the last five
bits of this row (the first row).
- In the case of m > n+4 (Figure 7), the last neg
signal can be included in the first row without
affecting any of the original bits in that row.
In other words, if a Wallace tree or any similar
scheme is used for reducing the partial product
rows, the last neg signal (negn/2-1) will not
represent any additional row.
The above two observations form the basis of the
second proposal. Figure 8 shows the proposed circuit in
the case of m = n. The least significant n - 2 bits of the first
row stay without any change. The last five bits in the
first row together with the neg signal of the last row are
used as inputs (address) to a small lookup table of
capacity 2^6 words x 6 bits. The output of the lookup
table is identical to the input when neg = 0 and is
the input incremented by one when neg = 1.
Fig. 8 Proposed way of adding neg to first PPR (Partial Product Row 0, pp80 /pp80 /pp80 pp70 pp60 pp50 ... pp00, and negn/2-1 feed a 2^6 x 6-bit look-up table that outputs Modified Partial Product Row 0)
The value of the last neg signal negn/2-1 depends on the
most significant three bits of the multiplier (see the
MBE truth table):
$$\text{neg}_{m/2-1} = y_{m-1} \cdot \left(\overline{y_{m-2}} + \overline{y_{m-3}}\right) \qquad (2)$$
Equation (2) shows that it is possible to form the last
neg signal while forming the first partial product row.
This means that the circuit proposed in Figure 8 can be
used to generate the modified six bits of the first partial
product row without any extra delay.
The proposed multiplier architecture in the case of m = n is
shown in Figure 9.
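The look-up table of Figure 8 can be modelled in a few lines. The sketch below makes several assumptions that are ours, not the paper's: the address is formed as the five row bits followed by neg as the least significant address bit, the first row is handled as an unsigned (n + 3)-bit pattern, and the last neg signal is taken to be 1 exactly when the top Booth group encodes -Y or -2Y, following the MBE truth table. All function names are hypothetical.

```python
def build_increment_lut():
    """64-entry table: address = (five row bits << 1) | neg, data = 6-bit sum."""
    return [(address >> 1) + (address & 1) for address in range(64)]


def last_neg(multiplier, m):
    """1 when the top three multiplier bits are 100, 101 or 110 (digit -2Y or -Y)."""
    top = (multiplier >> (m - 1)) & 1
    mid = (multiplier >> (m - 2)) & 1
    low = (multiplier >> (m - 3)) & 1
    return top & ((mid & low) ^ 1)


def fold_neg_into_row0(row0, neg, n, lut):
    """Add neg at bit position n - 2 of the first row using only a table look-up."""
    low_bits = row0 & ((1 << (n - 2)) - 1)     # least significant n - 2 bits, unchanged
    top_five = (row0 >> (n - 2)) & 0b11111     # the last five bits of the row
    return (lut[(top_five << 1) | neg] << (n - 2)) | low_bits


if __name__ == "__main__":
    lut = build_increment_lut()
    print(bin(fold_neg_into_row0(0b11111000001, 1, 8, lut)))
```

Since the table output depends only on five row bits and on neg, which is available directly from the multiplier, no carry has to ripple through the rest of the row.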
4. CRITICAL PATH AND
PERFORMANCE ANALYSIS
The performance of our multiplier architecture when
using the first proposal clearly depends on the speed
of generating the last PPR (in other words, the speed
of the two’s complementation step). If we
can generate the last partial product row of our
multiplier architecture within the time in which the
other partial product rows are generated, the
performance will be improved as predicted
because of the removal of the additional partial product
row. Similarly, in the case of the second proposal the
performance depends on the speed of getting the
modified PPR0.
The time delays required to generate the last PPR (or
the modified first PPR in the second case) must be
calculated and compared to the delays of conventional
partial product generation methods. We ran such a
comparison using normalized gate delays. We also
investigated the overall performance (in terms of
speed, area, and power) of our multiplier
architectures compared to the conventional methods.
The most important result of the analysis is
that the delay required to generate the last PPR (or the
modified first PPR) is almost constant and does not depend on
the word length. This compares favourably
with the technique proposed in [11]. The details of the
analysis and the table of comparison will be published
in the complete text of the paper on the web.
5. CONCLUSIONS
In this paper, we have presented simple, high-speed,
and well-structured multiplication algorithms. Our
multiplication algorithms focus on the first step of
multiplication, the partial product
generation step, and reduce the number of partial
product rows generated from n/2 + 1 to n/2. By doing so,
the structure of the partial product array becomes more
regular and easier to implement. Even more
importantly, the product is found faster because of the
smaller number of partial product rows to add. As we
have shown, we can achieve this using less hardware.
We introduced two techniques to achieve that. The two
techniques are suitable for any word length. The results
showed an improvement of 15% in speed, and the
improvement increases with the word length.
Fig. 9 Proposed multiplier architecture (the multiplier and Multiplicand X feed encoders MBE0, MBE1, ..., MBE n/2; MBE0 together with negn/2-1 produces PP0modified, MBE1 produces PP1, and so on)
6. REFERENCES
[1] A.D. Booth, “A Signed Binary Multiplication
Technique,” Quarterly J. Mechanical and Applied
Math., vol. 4, pp. 236-240, 1951.
[2] L. Dadda, “Some Schemes for Parallel Multiplier,”
Alta Frequenza, vol. 34, pp. 349-356, 1965.
[3] F. Elguibaly, “A Fast Parallel Multiplier-Accumulator
Using the Modified Booth Algorithm,”
IEEE Trans. Circuits and Systems, vol. 47, no. 9, pp.
902-908, 2000.
[4] J. Fadavi-Ardekani, “M x N Booth Encoded
Multiplier Generator Using Optimized Wallace Trees,”
IEEE Trans. Very Large Scale Integration, vol. 1, no.
2, pp. 120-125, 1993.
[5] A. Farooqui and V. Oklobdzija, “General Data-Path
Organization of a MAC Unit for VLSI Implementation
of DSP Processors,” Proc. 1998 IEEE Int’l Symp.
Circuits and Systems, vol. 2, pp. 260-263, 1998.
[6] Dawoud D.S., “Design of Fast Parallel Multiplier
Using an Algorithm Approach,” Proc. IEEE
International Conference on Electronic, Circuits &
Systems, EGYPT, Dec. 15-18, 1997, pp. 919- 924
[7] Dawoud D.S., “Carry-Save Multiplier with Recoded
Operands,” Al-Azhar University Engineering Journal,
pp. 159-168, September 2000 (International Refereed
Journal, ISSN: 1110-6409).
[8] Dawoud D.S., “Novel Serial-Parallel Multiplier”
Proc. Joint Conference of 5th World Multiconference
on Systemics, Cybernetics and Information (SCI 2001)
and the 7th International Conference on Information
Systems Analysis and Synthesis (ISAS2001),
ORLANDO – USA, July 2001.
[9] Dawoud D.S., “Very High Speed Pipelined
Serial-Parallel Multiplier,” Proc. Joint Conference of 5th
World Multiconference on Systemics, Cybernetics and
Information, ORLANDO-USA, July 2001.
[10] N. Itoh, Y. Naemura, H. Makino, Y. Nakase, T.
Yoshihara, and Y. Horiba, “A 600-MHz 54x54-bit
Multiplier with Rectangular-Styled Wallace Tree,”
IEEE J. Solid-State Circuits, vol. 36, no. 2, pp. 249-257, 2001.
[11] J.-Y. Kang and J.-L. Gaudiot, “A Fast and
Well-Structured Multiplier,” EUROMICRO Symp. Digital
System Design, pp. 508-515, Aug. 2004.
[12] J.-Y. Kang, W.-H. Lee, and T.-D. Han, “A Design
of a Multiplier Module Generator Using 4-2
Compressor,” Proc. Korea Inst. of Telematics and
Electronics (KITE) Fall Conf, vol. 16, pp. 388-392,
1993.
[13] M. Nagamatsu, S. Tanaka, J. Mori, T. Noguchi,
and K. Hatanaka, “A 15ns 32x32-bit CMOS Multiplier
with an Improved Parallel Structure,” Digest of
Technical Papers, IEEE Custom Integrated Circuits
Conf., 1989.
[14] V.G. Oklobdzija, D. Villeger, and S. S. Liu, “A
Method for Speed Optimized Partial Product Reduction
and Generation of Fast Parallel Multipliers Using an
Algorithmic Approach,” IEEE Trans. Computers vol.
45, no. 3, pp. 294-306, Mar. 1996.
[15] M. Santoro and M. Horowitz, “SPIM: A Pipelined
64x64-bit Iterative Multiplier,” IEEE Trans. Circuits
and Systems, vol. 24, no. 2, pp. 487-493, 1989.
[16] P.F. Stelling, C.U. Martel, V.G. Oklobdzija, and R.
Ravi, “Optimal Circuits for Parallel Multipliers,” IEEE
Trans. Computers, vol. 47, no. 3, pp. 273-285, Mar.
1998.
[17] Moises E. Robinson, Earl Swartzlander “A
reduction scheme to optimize the Wallace multiplier”
IEEE International Conference on Computer Design
(ICCD) 1998.
[18] A. Weinberger, “4:2 Carry-Save Adder Module,”
IBM Technical Disclosure Bull., vol. 23, 1981.
[19] W.-C. Yeh and C.-W. Jen, “High-Speed Booth
Encoded Parallel Multiplier Design,” IEEE Trans.
Computers, vol. 49, no. 7, pp. 692-701, July 2000.
[20] M.D. Ercegovac and T. Lang, Digital Arithmetic.
Los Altos, Calif.: Morgan Kaufmann, 2003.
[21] M.J. Liao, C.F. Su, Chang and Allen Wu, “A carry
select adder optimization technique for
high-performance Booth-encoded Wallace tree multipliers,”
IEEE International Symposium on Circuits and
Systems 2002.
[22] D.A. Patterson and J.L. Hennessy, Computer
Architecture: A Quantitative Approach. San Mateo,
Calif.: Morgan Kaufmann, 1996.
[23] D. Gajski, Principles of Digital Design. Prentice
Hall, 1997.
[24] R. Hashemian and C. P. Chen, “A New Parallel
Technique for Design of Decrement/Increment and
Two’s Complement Circuits,” Proc. 34th Midwest
Symp. Circuits and Systems, vol. 2, pp. 887-890, 1991.
[25] Z. Huang and M. Ercegovac, “High-Performance
Left-to-Right Array Multiplier Design,” Proc. 16th
Symp. Computer Arithmetic, pp. 4-11, June 2003.
[26] D. Bakalis, E. Kalligeros, D. Nikolos, “Low power
BIST for Wallace tree based fast multipliers,” First
International Symposium on Quality of Electronic
design, 2000.
[27] King Fai Pang, “Architectures for pipelined
Wallace tree multiplier-accumulators,” IEEE
International Conference on VLSI in computers and
processors 1990 (ICCD’90).
This paper is based on research work conducted by
Peter Dawoud (University of KwaZulu Natal) under
the supervision of Prof. D.S. Dawoud (Head of
Computer Engineering Program, University of
KwaZulu Natal, South Africa).
Peter Dawoud’s research work aims at designing a
high-performance arithmetic unit and its VLSI
implementation.