To Booth or not to Booth
Wolfgang J. Paul^a,*, Peter-Michael Seidel^b

^a Computer Science Department, University of the Saarland, 66041 Saarbruecken, Germany
^b Computer Science and Engineering Department, Southern Methodist University, Dallas TX 75275, USA

*Corresponding author. E-mail addresses: [email protected] (W.J. Paul), [email protected] (P.-M. Seidel).
Abstract
Booth recoding is a commonly used technique to recode one of the operands in binary multiplication. In this way the implementation of a multiplier's adder tree can be improved in both cost and delay. The improvement due to Booth recoding is usually attributed to improvements in the layout of the adder tree, especially regarding the lengths of wire connections, and thus cannot be analyzed with a simple gate model. Although conventional VLSI models consider wires in layouts, they usually neglect wires when modeling delay. To make the layout improvements due to Booth recoding tractable in a technology-independent way, we introduce a VLSI model that also considers wire delays and constant factors. Based on this model we consider the layouts of binary multipliers in a parametric analysis, providing answers to the question whether to use Booth recoding or not.
We formalize and prove the folklore theorems that Booth recoding improves the cost and cycle time of 'standard' multipliers by certain constant factors. We also analyze the number of full adders in certain 4/2-trees.
© 2002 Elsevier Science B.V. All rights reserved.
Keywords: Multiplier design; Booth recoding; VLSI model; Layout analysis; Delay and area complexity; Wire effects
1. Introduction
Addition trees are the central part in the design of fixed point multipliers. They produce from a
sequence of partial products a carry-save representation of the final product [1–8]. Booth recoding
[9–13] is a classical method which cuts down the number of partial products in an addition tree by
a factor of 2 or 3 at the expense of a more complex generation of partial products.
In general, VLSI designers tend to implement Booth recoding, and a widely accepted rule of thumb says that Booth recoding improves both the cost and the speed of multipliers by some constant factor around $1/4$. In [9,10,14-16], however, the usefulness of Booth recoding is challenged
altogether. Finally, in [15] an ambitious case study of around 1000 concrete designs was performed, and for some technologies with small wire delays the fastest multipliers turned out not to use Booth recoding.
In spite of the obvious importance of the concept, the (potential) benefits of Booth recoding have apparently received no theoretical treatment yet. This is probably due to the following facts:

* The standard models of circuit complexity [17] and VLSI complexity [18,19] are only trusted to be exact up to constant factors.
* The standard theoretical VLSI model ignores wire delays.
* Conway/Mead style rules [20,21] do consider wire delays, but do not invite paper-and-pencil analysis due to their level of detail.
* Trivial addition trees with linear circuit depth have nice and regular layouts. But Wallace trees [2,6,7] and even 4/2-trees [3,4,22,23] have more irregular layouts. In the published literature, these layouts tend not to be specified at a level of detail which permits, say, to read off the lengths of wires readily.

In this paper, we introduce a VLSI model which accounts, in a hopefully meaningful way, for constant factors as well as wire delays, and which is at the same time simple enough to permit the analysis of large circuits. The model depends on a parameter $\nu$ which specifies the influence of the wire delays. In this model, we study two standard designs of partial product generation and addition trees: the simple linear-depth construction and a carefully chosen variant of 4/2-trees. We show that Booth recoding improves, for all reasonable values of $\nu$, both the delay and the area of the resulting VLSI layouts by constant factors between 26% and 50% minus low-order terms which depend on $\nu$ and the length $n$ of the operands. For practical operand lengths in the range $8 \le n \le 64$ we specify the gain exactly.
The paper is organized in the following way. In Section 2, we review Booth recoding as explained in [10,24]. Section 3 provides a combinatorial lemma which subsequently permits us to count the number of full adders in certain 4/2-trees. In Section 4, we analyze the circuit complexity of Booth recoding. In Section 5, we introduce the VLSI model; the detailed circuit model from [25] is combined with a linear delay model for nets of wires. Section 6 contains the specification and analysis of layouts. The layouts for 4/2-trees follow partly a suggestion from [26]. In Section 7, the savings due to Booth recoding are evaluated; we analyze both the asymptotic behavior and the situation for practical $n$. We conclude in Section 8 and list some further work.
2. Preliminaries

For $a = a[n-1:0] = (a[n-1], \ldots, a[0]) \in \{0,1\}^n$ we denote by
$$\langle a \rangle = \sum_{i=0}^{n-1} a[i] \, 2^i$$
the number represented by $a$ if we interpret it as a binary number. Conversely, for $p \in \{0, \ldots, 2^n - 1\}$ we denote the $n$-bit binary representation of $p$ by $\mathrm{bin}_n(p)$. For $x \in \{0,1\}^n$ and $s \in \{0,1\}$ we define
$$x \oplus s = (x[n-1] \oplus s, \ldots, x[0] \oplus s).$$
An $(n,m)$-multiplier is a circuit with $n$ inputs $a = a[n-1:0]$, $m$ inputs $b = b[m-1:0]$ and $n+m$ outputs $p = p[n+m-1:0]$ such that $\langle a \rangle \langle b \rangle = \langle p \rangle$ holds.
2.1. Multiplication

For $j \in \{0, \ldots, m-1\}$ and $k, h \in \{1, \ldots, m\}$ we define the partial sums
$$S_{j,k} = \sum_{t=j}^{j+k-1} \langle a \rangle b[t] \, 2^t = \langle a \rangle \langle b[j+k-1:j] \rangle \, 2^j \le (2^n - 1)(2^k - 1) \, 2^j < 2^{n+k+j}.$$
Then, we have
$$S_{j,1} = \langle a \rangle b[j] \, 2^j, \qquad S_{j,k+h} = S_{j,k} + S_{j+k,h}$$
and
$$\langle a \rangle \langle b \rangle = S_{0,m} = S_{0,m-1} + S_{m-1,1}.$$
Because $S_{j,k}$ is a multiple of $2^j$ it has a binary representation with $j$ trailing zeros. Because $S_{j,k}$ is smaller than $2^{j+n+k}$ it has a binary representation of length $n + j + k$ (see Fig. 1).

Fig. 1. $S_{j,k}$ in carry-save representation.
2.2. Booth recoding

In the simplest form of Booth recoding (called Booth-2 recoding or Booth recoding radix 4) the multiplicator is recoded as suggested in Fig. 2. With $b[m+1] = b[m] = b[-1] = 0$ and $m' = \lceil (m+1)/2 \rceil$ one writes
$$\langle b \rangle = 2\langle b \rangle - \langle b \rangle = \sum_{j=0}^{m'-1} B_{2j} \, 4^j,$$
where
$$B_{2j} = 2b[2j] + b[2j-1] - 2b[2j+1] - b[2j] = -2b[2j+1] + b[2j] + b[2j-1].$$

Fig. 2. Booth digits $B_{2j}$.

The numbers $B_{2j} \in \{-2, -1, 0, 1, 2\}$ are called Booth digits and we define their sign bits $s_{2j}$ by
$$s_{2j} = \begin{cases} 0 & \text{if } B_{2j} \ge 0, \\ 1 & \text{if } B_{2j} < 0. \end{cases}$$
With
$$C_{2j} = \langle a \rangle B_{2j} \in \{-2^{n+1}+2, \ldots, 2^{n+1}-2\}, \qquad D_{2j} = \langle a \rangle |B_{2j}| \in \{0, \ldots, 2^{n+1}-2\}, \qquad d_{2j} = \mathrm{bin}_{n+1}(D_{2j}),$$
the product can be computed from the sums
$$\langle a \rangle \langle b \rangle = \sum_{j=0}^{m'-1} \langle a \rangle B_{2j} \, 4^j = \sum_{j=0}^{m'-1} C_{2j} \, 4^j = \sum_{j=0}^{m'-1} (-1)^{s_{2j}} D_{2j} \, 4^j.$$
In order to avoid negative numbers $C_{2j}$ one sums the positive $E_{2j}$ instead:
$$E_{2j} = C_{2j} + 3 \cdot 2^{n+1}, \qquad e_{2j} = \mathrm{bin}_{n+3}(E_{2j}),$$
$$E_0 = C_0 + 4 \cdot 2^{n+1}, \qquad e_0 = \mathrm{bin}_{n+4}(E_0).$$
This is illustrated in Fig. 3. The additional terms sum to
$$2^{n+1} \left( 1 + 3 \sum_{j=0}^{m'-1} 4^j \right) = 2^{n+1} \left( 1 + 3 \, \frac{4^{m'}-1}{3} \right) = 2^{n+1+2m'}.$$
Because $2m' > m$ these terms are congruent to zero modulo $2^{n+m}$. Thus
$$\langle a \rangle \langle b \rangle \equiv \sum_{j=0}^{m'-1} E_{2j} \, 4^j \mod 2^{n+m}.$$

Fig. 3. Summation of the $E_{2j}$.

With respect to the standard subtraction algorithm for binary numbers, the $e_{2j}$ can be computed by
$$\langle e_{2j} \rangle = \langle 1 \, \overline{s_{2j}}, \; d_{2j} \oplus s_{2j} \rangle + s_{2j}, \qquad \langle e_0 \rangle = \langle \overline{s_0} \, s_0 \, s_0, \; d_0 \oplus s_0 \rangle + s_0.$$
Based on these equations, the computation of the numbers
$$F_{2j} = E_{2j} - s_{2j}, \qquad f_{2j} = \mathrm{bin}_{n+3}(F_{2j}), \qquad f_0 = \mathrm{bin}_{n+4}(F_0)$$
is easy, namely
$$f_{2j} = (1 \, \overline{s_{2j}}, \; d_{2j} \oplus s_{2j}), \qquad f_0 = (\overline{s_0} \, s_0 \, s_0, \; d_0 \oplus s_0).$$
Instead of adding the sign bits $s_{2j}$ to the numbers $F_{2j}$ one incorporates them at the proper position into the representation of $F_{2j+2}$ as suggested in Fig. 4. The last sign bit does not create a problem because $B_{2m'-2}$ is always nonnegative. Formally, let
$$g_{2j} = (f_{2j}, \; 0 \, s_{2j-2}) \in \{0,1\}^{n+5}, \qquad g_0 = (f_0, \; 00) \in \{0,1\}^{n+6}.$$
Then
$$\langle g_{2j} \rangle = 4 \langle f_{2j} \rangle + s_{2j-2}$$
and hence we get the product by
$$\langle a \rangle \langle b \rangle \equiv \sum_{j=0}^{m'-1} E_{2j} \, 4^j = \sum_{j=0}^{m'-1} \langle g_{2j} \rangle \, 4^{j-1} \mod 2^{n+m}$$
with the $g_{2j}$ as partial products. We define
$$S'_{2j,2k} = \sum_{t=j}^{j+k-1} \langle g_{2t} \rangle \, 4^{t-1}.$$
Then, we have analogously to the nonBooth case
$$S'_{2j,2} = \langle g_{2j} \rangle \, 4^{j-1}, \qquad S'_{2j,2(k+h)} = S'_{2j,2k} + S'_{2(j+k),2h}.$$
One can easily show that $S'_{2j,2k}$ is a multiple of $2^{2j-2}$ and $S'_{2j,2k} < 2^{n+2j+2k+2}$. Therefore, at most $n + 2k + 4$ nonzero positions are necessary to represent $S'_{2j,2k}$ in both carry-save and binary form.

Fig. 4. Partial product construction.
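The recoding identity $\langle b \rangle = \sum_{j} B_{2j} 4^j$ and the digit range $B_{2j} \in \{-2,-1,0,1,2\}$ are easy to check mechanically. The following sketch is not part of the paper; it is a plain Python illustration of the definitions above, with the bit vector $b$ represented as a little-endian list.

```python
import random

def booth2_digits(b_bits):
    """Booth-2 digits B_{2j} = -2*b[2j+1] + b[2j] + b[2j-1],
    with b[-1] = b[m] = b[m+1] = 0 and m' = ceil((m+1)/2) digits."""
    m = len(b_bits)
    m_prime = (m + 2) // 2          # = ceil((m+1)/2)
    def bit(i):
        return b_bits[i] if 0 <= i < m else 0
    return [-2 * bit(2 * j + 1) + bit(2 * j) + bit(2 * j - 1) for j in range(m_prime)]

def value(bits):
    """<x> = sum x[i] * 2^i for a little-endian bit list."""
    return sum(x << i for i, x in enumerate(bits))

if __name__ == "__main__":
    random.seed(0)
    for _ in range(1000):
        m = random.randint(1, 16)
        b = [random.randint(0, 1) for _ in range(m)]
        B = booth2_digits(b)
        assert all(d in (-2, -1, 0, 1, 2) for d in B)
        assert sum(d * 4 ** j for j, d in enumerate(B)) == value(b)
    print("Booth-2 recoding identity verified")
```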
3. A combinatorial lemma

Let $T$ be a complete binary tree with depth $\mu$. We number the levels $\ell$ from the leaves to the root from 0 to $\mu$. Each leaf $u$ has a weight $W(u)$. For some natural number $k$ we have $W(v) \in \{k, k+1\}$ for all leaves and the weights are non-decreasing from left to right. Let $m$ be the sum of the weights of the leaves. For $\mu = 4$, $m = 53$ and $k = 3$ the leaves could, for example, have the weights 3333333333344444. For each subtree $t$ of $T$ we define $W(t) = \sum W(u)$ where $u$ ranges over all leaves of $t$. For each interior node $v$ of $T$ we define $L(v)$ (resp. $R(v)$) as the weight of the subtree rooted in the left (resp. right) son of $v$. We are interested in the sums
$$H_\ell = \sum_{\text{level } \ell} L(v),$$
where $v$ ranges over all nodes of level $\ell$, and in
$$H = \sum_{\ell=0}^{\mu-1} H_\ell.$$
We show

Lemma 1. $H \le (m\mu)/2$.

Proof. By induction on the levels of $T$ one shows that in each level the weights are nondecreasing from left to right and their sum is $m$. Hence
$$2H_\ell \le \sum_{\text{level } \ell} L(v) + \sum_{\text{level } \ell} R(v) = \sum_{\text{level } \ell} W(v) = m$$
and the lemma follows. $\Box$

In the above estimate we have replaced each weight $L(v)$ by the arithmetic mean of $L(v)$ and $R(v)$, hereby overestimating $L(v)$ by
$$h(v) = (L(v) + R(v))/2 - L(v) = (R(v) - L(v))/2.$$
For each node $v$ in level $\ell$ the $2^\ell$ leaves of the subtree rooted in $v$ form a continuous subsequence of the leaves of $T$. Hence, all nodes in level $\ell$ except at most one have weights in $\{k 2^\ell, (k+1) 2^\ell\}$. Therefore, in each level $\ell$ there is at most one node $v_\ell$ such that $h(v_\ell) \ne 0$. We set $h(\ell) = h(v_\ell)$ if it exists. Otherwise, we set $h(\ell) = 0$. It follows that
$$H = (m\mu)/2 - \sum_{\ell=0}^{\mu-1} h(\ell).$$
In the above example, we have $h_0 = \frac{1}{2}$, $h_1 = \frac{1}{2}$, $h_2 = \frac{3}{2}$, $h_3 = \frac{5}{2}$ and $H = 101$. For $\mu = 3$, $m = 27$ and weights 33333444 we have $h_0 = \frac{1}{2}$, $h_1 = \frac{1}{2}$, $h_2 = \frac{3}{2}$ and $H = 38$.
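For a concrete weight sequence the quantities $H_\ell$ and $H$ can be computed directly. The sketch below is only an illustration of the definitions above (it assumes the leaf weights are already non-decreasing and that their number is a power of two); it reproduces the two examples and checks the bound of Lemma 1.

```python
def H_of(weights):
    """H = sum over all interior levels of the left-subtree weights L(v)
    of a complete binary tree whose leaf weights are given left to right."""
    assert len(weights) & (len(weights) - 1) == 0, "number of leaves must be a power of two"
    level = list(weights)
    H = 0
    while len(level) > 1:
        # every even-indexed entry is the weight of a left son on this level
        H += sum(level[0::2])
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
    return H

if __name__ == "__main__":
    w1 = [3] * 11 + [4] * 5          # mu = 4, m = 53: H = 101
    w2 = [3] * 5 + [4] * 3           # mu = 3, m = 27: H = 38
    for w in (w1, w2):
        m, mu = sum(w), len(w).bit_length() - 1
        print(m, mu, H_of(w), m * mu / 2)
        assert H_of(w) <= m * mu / 2  # Lemma 1
```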
4. Gates

In this section, we use a simple gate model [25] to analyze delay and cost of the multiplication circuits considering only the influence of gates. This model does not consider fanout restrictions or wire effects. The delay is the maximum gate delay over all paths from input bits $a[i]$ and $b[j]$ to output bits $p[k]$. The cost is computed as the cumulative cost of all gates used. For this purpose the basic gate delays and costs can be extracted from any design system. We will consider the Motorola technology [27] (Fig. 5), but the reader might choose his favorite technology in our parametric analysis.
Fig. 5. Motorola technology parameters.
4.1. Partial product generation
4.1.1. NonBooth

One has to compute and shift the binary representations of the numbers
$$\langle a \rangle b[j] = \langle a[n-1] \wedge b[j], \ldots, a[0] \wedge b[j] \rangle.$$
The $nm$ AND gates have a cost of $2nm$. As they are used in parallel, they have a total delay of 2.
4.1.2. Booth
The binary representations of the numbers $S'_{2j,2}$ must be computed (see Fig. 4). These are
$$g_{2j} = (1 \, \overline{s_{2j}}, \; d_{2j} \oplus s_{2j}, \; 0 \, s_{2j-2}), \qquad g_0 = (\overline{s_0} \, s_0 \, s_0, \; d_0 \oplus s_0, \; 00),$$
shifted by $2j - 2$ bit positions.
The $d_{2j} = \mathrm{bin}_{n+1}(\langle a \rangle |B_{2j}|)$ are easily determined from $B_{2j}$ and $a$ by
$$d_{2j} = \begin{cases} (0, \ldots, 0) & \text{if } B_{2j} = 0, \\ (0, a) & \text{if } |B_{2j}| = 1, \\ (a, 0) & \text{if } |B_{2j}| = 2. \end{cases}$$
For this computation two signals indicating $|B_{2j}| = 1$ and $|B_{2j}| = 2$ are necessary. We denote these by
$$b1_{2j} = \begin{cases} 1 & \text{if } |B_{2j}| = 1, \\ 0 & \text{otherwise,} \end{cases} \qquad b2_{2j} = \begin{cases} 1 & \text{if } |B_{2j}| = 2, \\ 0 & \text{otherwise,} \end{cases}$$
and calculate them by the Booth decoder logic $BD$ from Fig. 6(b), which can be developed from Fig. 6(a). Such a Booth decoder has cost $C_{BD} = 11$ and delay $D_{BD} = 3$.
The selection logic $SL$ from Fig. 6(c) directs either $a[i]$ or $a[i+1]$ or 0 to position $[i+1]$ of $d_{2j}$, and the inversion depending on $s_{2j}$ yields $g_{2j}[i+3]$. In the first 2 (3 for the rightmost partial product) and last 2 bit positions this logic is replaced by the simple signal of a sign bit, its inverse, a zero or a one. The selection logic has cost $C_{SL} = 10$ and delay $D_{SL} = 4$.
As $m'$ Booth decoders are necessary, the decoding logic costs $m' C_{BD} = 11 m'$. The selection logic occurs $(n+1)$ times for each partial product. Therefore, the selection logics have altogether a cost of
$$m'(n+1) C_{SL} = 10 m'(n+1).$$

Fig. 6. (a) Booth digit signals, (b) Booth decoder and (c) selection logic.
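At the word level, the selection just described can be mimicked in a few lines. The following sketch is not code from the paper; it recomputes the Booth digits, forms the values $\langle g_{2j} \rangle$ according to the formulas of Section 2.2, and checks that $\sum_j \langle g_{2j} \rangle 4^{j-1} \equiv \langle a \rangle \langle b \rangle \bmod 2^{n+m}$.

```python
import random

def booth2_digits(b_bits):
    m = len(b_bits)
    mp = (m + 2) // 2                     # m' = ceil((m+1)/2)
    bit = lambda i: b_bits[i] if 0 <= i < m else 0
    return [-2 * bit(2 * j + 1) + bit(2 * j) + bit(2 * j - 1) for j in range(mp)]

def booth_partial_products(a_val, b_bits, n):
    """Return the values <g_2j> of the Booth-2 partial products (Section 2.2)."""
    B = booth2_digits(b_bits)
    s = [1 if d < 0 else 0 for d in B]    # sign bits s_2j
    g = []
    for j, d in enumerate(B):
        C = a_val * d                                      # C_2j = <a> B_2j
        E = C + (4 if j == 0 else 3) * 2 ** (n + 1)        # E_0, E_2j
        F = E - s[j]                                       # F_2j = E_2j - s_2j
        s_prev = s[j - 1] if j > 0 else 0
        g.append(4 * F + s_prev)                           # <g_2j> = 4<f_2j> + s_{2j-2}
    return g

if __name__ == "__main__":
    random.seed(1)
    for _ in range(200):
        n = m = random.randint(4, 16)
        a = random.randrange(2 ** n)
        b_bits = [random.randint(0, 1) for _ in range(m)]
        b = sum(x << i for i, x in enumerate(b_bits))
        g = booth_partial_products(a, b_bits, n)
        total = sum(gj * 4 ** j for j, gj in enumerate(g)) // 4   # sum <g_2j> 4^(j-1)
        assert total % 2 ** (n + m) == (a * b) % 2 ** (n + m)
    print("Booth-2 partial products reproduce the product mod 2^(n+m)")
```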
4.2. Redundant partial product addition

We study two standard constructions for the reduction of the partial products to a carry-save representation of the product: plain multiplication arrays and 4/2-trees. Each construction uses $k$-bit carry-save adders ($k$-CSA, 3/2-adders) [3-5]. They are composed of $k$ full adders working in parallel and hence they have cost $k C_{FA}$ and delay $D_{FA}$. Cascading two $k$-CSAs one constructs a $k$-bit 4/2-adder, i.e. a circuit with $4k$ inputs $a[k-1:0]$, $b[k-1:0]$, $c[k-1:0]$, $d[k-1:0]$ and $2(k+1)$ outputs $s[k:0]$, $t[k:0]$ such that
$$\langle a \rangle + \langle b \rangle + \langle c \rangle + \langle d \rangle \equiv \langle s \rangle + \langle t \rangle \mod 2^{k+1}$$
holds. The $k$-bit 4/2-adders constructed in this way have cost $2k C_{FA}$ and delay $2 D_{FA}$. We consider full adder implementations with cost $C_{FA} = 14$ and delay $D_{FA} = 6$. There are optimized implementations of 4/2-adders like the one described in [23]; we do not use them in this analysis to keep the presentation simple.
In order to introduce techniques of analysis we review quite formally the construction of standard multiplication arrays [3,21]. A carry-save representation of $S_{0,3}$ can be computed by a single $n$-CSA (see Fig. 7(a)). Exploiting
$$S_{0,t} = S_{0,t-1} + S_{t-1,1},$$
one can compute a carry-save representation of $S_{0,t}$ from a carry-save representation of $S_{0,t-1}$ and the binary representation of $S_{t-1,1}$ by an $n$-CSA. This works because both $S_{0,t-1}$ and $S_{t-1,1}$ can be represented with $n + t - 1$ bits and because the binary representation of $S_{t-1,1}$ has $t - 1$ trailing zeros (see Fig. 7(c)). The $m - 2$ many $n$-CSAs which are cascaded this way have cost $n(m-2) C_{FA}$ and delay $(m-2) D_{FA}$. The combined cost and delay for (nonBooth) partial product generation and the multiplication array are
$$C_{Array} = n(m-2) C_{FA} + nm \, C_{AND} = 16nm - 28n,$$
$$D_{Array} = (m-2) D_{FA} + D_{AND} = 6m - 10.$$

Fig. 7. Partial compression of (a) $S_{i,3}$; (b) $S_{i,4}$; (c) $S_{0,t}$ from a carry-save representation of $S_{0,t-1}$ and $S_{t-1,1}$; and (d) $S_{i,k+h}$ from carry-save representations of $S_{i,k}$ and $S_{i+k,h}$.
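A word-level model of the 3/2- and 4/2-addition step may help to see why the reduction works. The sketch below is only an illustration (it operates on non-negative Python integers rather than on the $k$-bit circuits of the paper, so the sum is preserved exactly instead of modulo $2^{k+1}$).

```python
import random

def csa(x, y, z):
    """Carry-save (3/2) adder on non-negative integers:
    returns (sum_word, carry_word) with x + y + z = sum_word + carry_word."""
    s = x ^ y ^ z                              # bitwise sum
    c = ((x & y) | (x & z) | (y & z)) << 1     # carries, shifted one position left
    return s, c

def four_to_two(a, b, c, d):
    """4/2-adder built from two cascaded carry-save adders."""
    s1, c1 = csa(a, b, c)
    return csa(s1, c1, d)

if __name__ == "__main__":
    random.seed(2)
    k = 16
    for _ in range(1000):
        a, b, c, d = (random.randrange(1 << k) for _ in range(4))
        s, t = four_to_two(a, b, c, d)
        assert s + t == a + b + c + d          # exact at the word level
    print("4/2 reduction preserves the sum")
```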
With Booth recoding one has only to sum the $m'$ (representations of) partial products $g_{2j}$. Each partial product has length $n' = n + 5$ except $g_0$, which has $n' + 1$ bits. Arguing with the sums $S'_{0,2t}$ in place of the sums $S_{0,t}$ one shows that $(m' - 2)$ many $n'$-CSAs suffice to sum the partial products in a multiplication array with Booth-2 recoding. Taking into account the partial product generation one obtains cost and delay
$$C'_{Array} = n'(m'-2) C_{FA} + (n+1) m' C_{SL} + m' C_{BD} = 24nm' + 91m' - 28n - 140,$$
$$D'_{Array} = (m'-2) D_{FA} + D_{SL} + D_{BD} = 6m' - 5.$$
For $n = m = 53$ (double precision) we get
$$C'_{Array}/C_{Array} = 35457/43460 = 81.6\%, \qquad D'_{Array}/D_{Array} = 157/308 = 50.9\%.$$
For $n = m = 24$ (single precision) we get
$$C'_{Array}/C_{Array} = 8139/8544 = 95.2\%, \qquad D'_{Array}/D_{Array} = 73/134 = 54.5\%.$$
Asymptotically, $C'_{Array}/C_{Array}$ tends to $\frac{12}{16} = 75\%$ and $D'_{Array}/D_{Array}$ tends to $\frac{3}{6} = 50\%$.
Thus, if multiplication arrays would yield the best known multipliers, the case for Booth recoding would be easy and convincing (and hardly worth a paper).
4.3. Analysis of 4/2-trees
The situation changes in two respects when we consider addition trees with logarithmic delay like Wallace trees [2,6,7] or 4/2-trees [1,3,4,22,23]. First, counting gates in such a tree becomes a somewhat nontrivial problem. Second, Booth recoding only yields a marginal saving in time.

4.3.1. NonBooth

We construct a specific family of 4/2-trees which happens to be accessible to analysis. The nodes of the tree are either 3/2-adders or 4/2-adders. The representations of the $m$ partial products $S_{0,1}, \ldots, S_{m-1,1}$ will be fed into the leaves of the tree from right to left. Let $M = 2^{\lceil \log m \rceil}$ be the smallest power of two greater or equal to $m$. Let $\mu = \log(M/4) = \lceil \log(m) \rceil - 2$. We construct the entire addition tree $T$ from two parts as shown in Fig. 8.
The lower regular part is a complete binary tree of depth $\mu - 1$ consisting entirely of 4/2-adders. It has $M/8$ many 4/2-adders as leaves. For the top level of the tree we distinguish two cases. If $3M/4 \le m \le M$, we use $a = m - 3M/4$ many 4/2-adders and $M/4 - a$ many 3/2-adders. We arrange the 3/2-adders of the top level of the tree at the left of the 4/2-adders.
If $M/2 < m < 3M/4$ we use in the top level only $b = m - M/2$ many 3/2-adders (and we feed $m - 3b$ representations of partial products directly into the lower part). We arrange the 3/2-adders at the right end of the tree. For $m = 24$ we have $m \ge 24 = 3M/4$ and use 8 3/2-adders as leaves. For $m = 53$ we have $m \ge 48 = 3M/4$ and we use 11 3/2-adders and 5 4/2-adders as leaves. We only analyze the first case explicitly.

Fig. 8. Adder tree construction.

If all 3/2- and 4/2-adders in the tree would consist of exactly $n$ resp. $2n$ full adders, then the tree would have a total of
$$F = n(m-2)$$
full adders, because every 3/2-adder reduces the number of partial products by 1 and every 4/2-adder reduces the number of partial products by 2. It remains to estimate the number of excess full adders in the tree.
Every leaf in the tree which is a 3/2-adder computes a sum $S_{i,3}$. Thus, $n$ full adders suffice (see Fig. 7(a)). A leaf of the tree which is a 4/2-adder computes a sum $S_{i,4}$. It can be simplified as shown in Fig. 7(b) such that $2n$ full adders suffice. Thus, in the top level of the tree there are no excess full adders.
Each 4/2-adder $u$ in the lower portion of the tree performs a computation of the form
$$S_{i,k+h} = S_{i,k} + S_{i+k,h},$$
where a carry-save representation of $S_{i,k}$ is provided by the right son of $u$ and $S_{i+k,h}$ is provided by the left son of $u$. By the results of Section 3 we are in the situation of Fig. 7(d). Hence node $u$ has $2h$ excess full adders.
Referring to Section 3 we label each leaf $u$ of the tree with the number of partial products it sums, i.e. with $W(u) = 3$ if $u$ is a 3/2-adder and with $W(u) = 4$ if it is a 4/2-adder; then $h = L(u)$, and for the number $E = 2H$ of excess full adders in the tree we have
$$E = (m\mu) - 2 \sum_{\ell=0}^{\mu-1} h_\ell \le m\mu.$$
This implies that the 4/2-trees constructed above have $nm + o(nm)$ full adders.
The proof only depends on the fact that for each interior node $u$ the left son of $u$ sums fewer partial products than the right son of $u$. Hence, it applies to many other balanced addition trees. One hardly dares to state or prove such a folklore result because it is 'obviously' known. Unfortunately, we have not been able to locate it in the literature and we need it later.
With partial product generation we get cost and delay
$$C_{Tree} = (n(m-2) + E) C_{FA} + nm \, C_{AND} = 16nm + o(nm),$$
$$D_{Tree} = 2(\mu+1) D_{FA} + D_{AND} = 12\mu + 14 = 12 \lceil \log(m) \rceil - 10.$$
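The bookkeeping behind these counts can be replayed numerically. The following sketch is my own illustration (not code from the paper); for the first case $3M/4 \le m \le M$ it computes $M$, $\mu$ and the top-level split into 3/2- and 4/2-adders, the number of excess full adders $E = 2H$, and the resulting gate cost, which for $n = m = 53$ reproduces the value $C_{Tree} = 46288$ quoted in Section 4.3.2.

```python
from math import ceil, log2

def tree_parameters(m):
    """Top-level composition of the 4/2-tree for 3M/4 <= m <= M (Section 4.3.1)."""
    M = 1 << ceil(log2(m))
    mu = int(log2(M // 4))
    assert 3 * M // 4 <= m <= M, "only the first case is handled here"
    a = m - 3 * M // 4                 # number of 4/2-adders in the top level
    return M, mu, M // 4 - a, a        # (M, mu, #3/2-adders, #4/2-adders)

def excess_full_adders(num32, num42):
    """E = 2H for leaf weights 3 (3/2-adders, left) and 4 (4/2-adders, right)."""
    level = [3] * num32 + [4] * num42
    H = 0
    while len(level) > 1:
        H += sum(level[0::2])
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
    return 2 * H

if __name__ == "__main__":
    n = m = 53
    M, mu, num32, num42 = tree_parameters(m)       # M = 64, mu = 4, 11 and 5 leaves
    E = excess_full_adders(num32, num42)            # 202 excess full adders
    FA = n * (m - 2) + E                            # 2905 full adders in total
    C_FA, C_AND = 14, 2
    print(M, mu, num32, num42, E, FA, FA * C_FA + n * m * C_AND)   # cost 46288
```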
4.3.2. Booth

Let $M' = 2^{\lceil \log m' \rceil}$ be the smallest power of two greater or equal to $m'$ and let $\mu' = \log(M'/4)$. For $m \in \{24, 53\}$ we have $\frac{3}{4} M' \le m' < M'$. We proceed as above and only analyze this case. The standard length of 3/2-adders and 4/2-adders is now $n' = n + 5$ bits; longer operands require excess full adders. Let $E'$ be the number of excess full adders.
Considering the sums $S'$ instead of the sums $S$ one shows that the top level of the tree has no excess full adders. Let $H'$ be the sum of labels of left sons in the resulting tree and for all $\ell$ let $h'_\ell$ be the correction term for level $\ell$. Because with Booth recoding successive partial products are shifted by 2 positions, we now have
$$E' = 4H'$$
and we get
$$E' = 2(m'\mu') - 4 \sum_{\ell} h'_\ell \le 2(m'\mu').$$
Taking into account the partial product generation one obtains cost and delay
$$C'_{Tree} = (n'(m'-2) + E') C_{FA} + (n+1) m' C_{SL} + m' C_{BD} = 24nm' + o(nm'),$$
$$D'_{Tree} = 2(\mu'+1) D_{FA} + D_{BD} + D_{SL} = 12(\mu'+1) + 7.$$
For $n = m = 53$ we get
$$C'_{Tree}/C_{Tree} = 37305/46288 = 80.6\%, \qquad D'_{Tree}/D_{Tree} = 55/62 = 88.7\%,$$
and for $n = m = 24$ we get
$$C'_{Tree}/C_{Tree} = 8531/9552 = 89.3\%, \qquad D'_{Tree}/D_{Tree} = 43/50 = 86.0\%.$$
Asymptotically, $C'_{Tree}/C_{Tree}$ tends again to $\frac{12}{16} = 75\%$. Unless $m$ is a power of two, we have $\mu = \mu' + 1$ and $D_{Tree} - D'_{Tree} = 7$. Hence, $D'_{Tree}/D_{Tree}$ tends to one as $n$ grows large.
This contradicts the common opinion that Booth recoding saves a constant fraction of the delay (independent of $n$). We try to resolve this contradiction in the next section by considering layouts.
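The vanishing relative delay gain can be read off directly from the two delay formulas. A small sketch (an illustration of the formulas above, valid for the analyzed case only):

```python
from math import ceil, log2

def d_tree(m):
    """D_Tree = 12*ceil(log2(m)) - 10."""
    return 12 * ceil(log2(m)) - 10

def d_tree_booth(m):
    """D'_Tree = 12*(mu' + 1) + 7 with mu' = ceil(log2(m')) - 2 and m' = ceil((m+1)/2)."""
    m_prime = (m + 2) // 2
    mu_prime = ceil(log2(m_prime)) - 2
    return 12 * (mu_prime + 1) + 7

if __name__ == "__main__":
    for m in (24, 53, 1025, 2 ** 20 + 1):
        ratio = d_tree_booth(m) / d_tree(m)
        print(m, d_tree(m), d_tree_booth(m), round(ratio, 3))   # ratio approaches 1
```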
5. VLSI model
In the circuit complexity model we could only show that a constant fraction of the cost is saved by Booth recoding. In order to explain why Booth recoding also saves a constant fraction of the run time we have to consider wire delays in VLSI layouts.
The layouts that we consider consist of simple rectangular circuits $S$, connected by nets $N$. For circuits $S$, we denote by $b(S)$ the breadth, $h(S)$ the height, $C(S)$ the gate count and $D(S)$ the combinatorial delay of $S$. We do not consider different delays for different inputs and outputs of the same simple circuit in order to keep the analysis simple. We require
$$b(S) \, h(S) = C(S),$$
i.e. area equals gate count, and
$$h(S)/2 \le b(S) \le 2 h(S),$$
i.e. layouts of simple circuits are not too slim or too flat. In order to keep drawings simple we will place pins quite liberally at the borders of the circuits. Input pins are at one side of the rectangle, output pins are at the opposite side.
Nets consist of horizontal and vertical lines (wires). They must have a minimal distance $\delta$ from each other and from circuits. Thus, a wire channel for $t$ lines has width $(t+1)\delta$. The size $|N|$ of a net $N$ is the sum of the lengths of the lines that constitute the net.
We define the delay of a circuit $S$ driving a net $N$ as
$$\mathrm{TIME}(S, N) = D(S) + \nu |N|.$$
The parameter $\nu$ weights the influence of wire delay on the total delay. If $\nu = 0$ only gate delays count. For a square inverter NOT with $C(S) = 1$ we have $b(\mathrm{NOT}) = h(\mathrm{NOT}) = 1$. Suppose we connect the output of such a gate with a single wire $N$ of length $h(\mathrm{NOT}) = 1$ and we have $\nu \ge 1$. Then
$$\nu |N| \ge 1 = D(\mathrm{NOT}),$$
i.e. a wire which is as long as the gate contributes to the delay as much as the propagation delay of the gate or more. This does not seem reasonable. Therefore, we restrict the range of $\nu$ to the interval $[0, 1]$. We do not restrict the fanout of circuits, but within limits the parameter $\nu$ can also be used to model fanout restrictions.
We will consider only four types of simple circuits $S$, namely AND gates, full adders $FA$, Booth decoders $BD$ and the selection logic $SL$. We will use two geometries for the selection logic. The geometries from Fig. 9(a) happen to make the layouts of addition trees particularly simple. We place the pins of the full adder and the other basic circuits as specified in Fig. 9(b).¹ The exact position of the input pins of AND gates and the selection logic contributes only marginally to the run time of addition trees and will not be considered in the analysis. We will use $\delta = 0.1$.

Fig. 9. (a) Table of basic circuit geometries; (b) shapes of basic circuits.

¹ Strictly speaking we use three layouts for the selection logic. Two of the layouts only differ in the position of the output pin.
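The model is simple enough to write down directly. The sketch below is an illustration only (the square full adder geometry used here is an assumption for the example; the paper uses the geometries of Fig. 9(a)); it encodes a simple circuit by gate count, delay and breadth, checks the two shape constraints, and evaluates $\mathrm{TIME}(S,N) = D(S) + \nu|N|$.

```python
from dataclasses import dataclass
import math

@dataclass
class SimpleCircuit:
    name: str
    gates: int        # C(S)
    delay: float      # D(S)
    breadth: float    # b(S)

    @property
    def height(self):
        # area equals gate count: b(S) * h(S) = C(S)
        return self.gates / self.breadth

    def check_shape(self):
        # layouts must not be too slim or too flat: h/2 <= b <= 2h
        h = self.height
        return h / 2 <= self.breadth <= 2 * h

def time(circuit, net_length, nu):
    """TIME(S, N) = D(S) + nu * |N| with 0 <= nu <= 1."""
    assert 0 <= nu <= 1
    return circuit.delay + nu * net_length

if __name__ == "__main__":
    # a (hypothetical) square full adder with the cost/delay values of Section 4
    fa = SimpleCircuit("FA", gates=14, delay=6, breadth=math.sqrt(14))
    assert fa.check_shape()
    for nu in (0.0, 0.1, 0.5, 1.0):
        print(nu, time(fa, net_length=10.0, nu=nu))
```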
6. Layouts and their analysis
First, we specify and analyze the layout of a plain 4/2-tree $T$ with $M/4$ leaves where partial products are generated with simple AND gates. The tree $T$ has depth $\mu = \log(M/4)$ as well as $\mu$ levels of interior nodes. The number of interior nodes is $M/4 - 1$. This will then be compared with the layout of a tree $T'$ with $M'/4$ leaves where partial products are generated by circuits $SL$ controlled by Booth decoders $BD$.
All our layouts consist of a matrix of full adders plus extra circuits and wires. Every node in the tree is either a 3/2-adder and occupies one row of the matrix, or it is a 4/2-adder and occupies 2 consecutive rows of the matrix. Every bit position occupies a column of the matrix. Between neighboring full adders of the same row we leave a wire channel of appropriate width. Inputs $a[i]$ are fed into the layout at the top with indices increasing from right to left. Inputs $b[j]$ are fed into the layout from the left with indices increasing from top to bottom. Outputs are produced at the right border and at the bottom of the layout. Let $T_l$ (resp. $T_r$) be the subtree rooted in the left (resp. right) son of a node $v$. Then, we lay out $T_r$ on top of $T_l$. This is followed by a row for $v$ (see Fig. 10). Between the layouts of the two subtrees we leave space for one wire. This space will be filled with boxes 4/2a specified below. It only remains to specify where to place the extra circuits and how to lay out the wires. For tree $T$ it suffices to specify the three types of boxes in Fig. 11(a)-(c):

* Box 3: this is one full adder which is part of a leaf $v$ of the tree with weight $L = 3$. The whole 3/2-adder $v$ has as inputs 3 bits of $b$ which are routed in the b-channel $b[2:0]$. Every bit $a[i]$ is needed in three consecutive bit positions of $v$; then it is routed to the next leaf down the tree and one position to the left. Thus, 3 lines of an a-channel $a[2:0]$ suffice to accommodate the a-inputs for $v$. For all leaves $v$ and all $i$ we feed input $a[i]$ into a-channel $a[i \bmod 3]$. Before $a[i]$ can be fed into the channel, bit $a[i-3]$ has to be removed and routed down to the next leaf of the tree. This, unfortunately, consumes horizontal distance $2\delta$.
* Box 4: this is one pair of full adders which is part of a leaf $v$ of the tree with weight $L = 4$. The whole 4/2-adder $v$ has as inputs 4 bits of $b$ which are routed in the b-channel $b[2:0]$ of the first 3/2-adder and in one b-channel of the second 3/2-adder. Similarly, 4 a-channels are used. Each signal $a[i]$ is fed into 3 full adders of consecutive bit positions in the first 3/2-adder level and then into one full adder in the next bit position of the second 3/2-adder. After that $a[i]$ is routed down to the next leaf of the tree.
* Box 4/2: this is strictly speaking a pair of boxes which is part of an interior node $v$ of the tree with subtrees $T_r$ and $T_l$. Box 4/2a is inserted between the layouts of $T_r$ and $T_l$. Box 4/2b is added at the bottom of the layout of $T_l$. The box consumes two wires in each vertical wire channel, one for a carry output and one for the sum output of one full adder of the root of $T_r$. We place the sum lines in the left and the carry lines in the right half of the wire channel. It will not matter in the analysis where we place them exactly.

Fig. 10. Recursive partition of the tree layout.

Fig. 11. Basic boxes for the non-Booth multiplier construction.
Box 4/2a is placed on top of the b-channel of the rightmost leaf of $T_l$. Thus, it contributes only $\delta$ to the height, and we have
$$h(\mathrm{box4/2}) = 2h(FA) + 7\delta, \qquad h(\mathrm{box3}) = h(FA) + h(\mathrm{AND}) + 7\delta, \qquad h(\mathrm{box4}) = 2h(FA) + 2h(\mathrm{AND}) + 11\delta.$$
All boxes have the same width $b(\mathrm{box})$. Because each level of interior nodes contributes 2 lines to the global wire channel,
$$b(\mathrm{box}) = b(FA) + (\mu + 3)\delta.$$
The height of the whole layout is
$$h(T) = (M/4 - 1) \, h(\mathrm{box4/2}) + (M/4) \, h(\mathrm{box3}) + a \, (h(\mathrm{box4}) - h(\mathrm{box3})).$$
For the size of the nets $a[i]$ we consider

* the horizontal extension $m \, b(\mathrm{box})$;
* the vertical extension, which extends from the very top of the layout to the leftmost leaf; below the leftmost leaf we have $\mu$ many boxes 4/2b;
* the $m$ connections from the a-channels to the AND gates, each of length up to $3\delta$.

Thus, for the largest ones of the nets $a[i]$ we have
$$|a[i]| = m \, b(\mathrm{box}) + h(T) - \mu (h(\mathrm{box4/2}) - \delta) - h(FA) - h(\mathrm{AND}) + 3m\delta.$$
For $m \le n$ the size of the nets $b[i]$ is dominated by the size of the nets $a[i]$. The carry in box 4 travels horizontally over a full box minus $\frac{2}{3}$ of a full adder. Thus, we have
$$|\mathrm{carry4}| = b(\mathrm{box}) - 2b(FA)/3 + h(\mathrm{AND}) + 4\delta.$$
Next we determine the accumulated delay from a leaf to the root, which follows in each box 4/2 the following path: sum bit in $T_r$, first full adder in $v$, carry output, second full adder in the box to the left, sum bit of that adder. We estimate the accumulated length of the wires separately:

* All wires connected to carry outputs:
$$\mathrm{Carry}_{4/2} = \mu \, (b(\mathrm{box}) - 2b(FA)/3 + 3\delta).$$
* All wires connected to sum outputs, not counting vertical displacement in the global wire channels. In the wire channels every one of the distances $2k\delta$ occurs exactly once:
$$\mathrm{Sum}_{4/2} = \mu \, (4b(FA)/3 + 4\delta) + \sum_{k=1}^{\mu} 2k\delta = \mu \, (4b(FA)/3 + 4\delta) + \delta(\mu^2 + \mu).$$
* Total vertical displacement of the wires connected to sum outputs in the wire channels, from the bottom of the rightmost leaf (a box 4 if $m = 53$) to the very bottom of the layout, not counting the boxes 4/2 on the path:
$$\mathrm{Channel} = h(T) - h(\mathrm{box4}) - \log(M/4) \, h(\mathrm{box4/2}).$$

The competing paths from the carry output of $T_r$ to the second full adder are faster than the path considered above for realistic values of $\nu$.
We now imagine that all a- and b-inputs are the outputs of drivers whose propagation delay we ignore. Then the total delay of the tree equals
$$\mathrm{TIME}_{Tree}(n, \nu) = D_{AND} + 2D_{FA}(1 + \mu) + \nu \, (|a[i]| + |\mathrm{carry4}| + \mathrm{Carry}_{4/2} + \mathrm{Sum}_{4/2} + \mathrm{Channel}).$$
Exactly along the same lines one specifies and analyzes the layout of the addition tree $T'$ with Booth recoding. The tree has $M'/4$ leaves and depth $\mu' = \log(M'/4)$. One uses the boxes specified in Fig. 12(a)-(c). The selection logic becomes more complex and has to be placed in two rows in order not to exceed the full adder width. For this organization an additional horizontal input wire must be routed around the full adder in box 4'. Also the channel width changes a bit and becomes $\mu'\delta$:
$$b(\mathrm{box}') = b(FA) + \mu'\delta + 4\delta.$$
This change of width is the only change for box 4/2'; its height stays the same. In the other two boxes there must be 5 input wires per selection logic. In box 3' the top selection logic is rotated in order not to waste area.

Fig. 12. Basic boxes for the Booth multiplier construction.

From the figures one obtains the following equations:
$$h(\mathrm{box4/2'}) = 2h(FA) + 7\delta,$$
$$h(\mathrm{box3'}) = h(FA) + h(SL) + b(SL) + 17\delta,$$
$$h(\mathrm{box4'}) = 2h(FA) + 2h(SL) + 26\delta.$$
Most of the other equations only change slightly:
$$h(T') = (M'/4 - 1) \, h(\mathrm{box4/2'}) + (M'/4) \, h(\mathrm{box3'}) + a' \, (h(\mathrm{box4'}) - h(\mathrm{box3'})),$$
$$|a'[i]| = m' \, b(\mathrm{box}') + h(T') - \mu' (h(\mathrm{box4/2'}) - \delta) - h(FA) - h(SL) + 4m'\delta,$$
$$|\mathrm{carry4'}| = b(\mathrm{box}') - 2b(FA)/3 + 3\delta,$$
$$\mathrm{Carry}'_{4/2} = \mu' \, (b(\mathrm{box}') - 2b(FA)/3 + 3\delta),$$
$$\mathrm{Sum}'_{4/2} = \mu' \, (4b(FA)/3 + 4\delta) + \sum_{k=1}^{\mu'} 2k\delta = \mu' \, (4b(FA)/3 + 4\delta) + \delta(\mu'^2 + \mu'),$$
$$\mathrm{Channel}' = h(T') - h(\mathrm{box4'}) - \log(M'/4) \, h(\mathrm{box4/2'}).$$
Additionally to these wires we must consider the wire between one selection logic output in the upper row and the full-adder input in box 4':
$$|\mathrm{sum4'}| = h(SL) + b(SL)/2 + 12\delta.$$
Also the nets $b'[i]$ can become important. Their delay, in sequence with the Booth decoder delay, has to compete against the nets $a'[i]$, and it depends on the constant $\nu$ which one is slower. Therefore, we have to consider the length of the longest $|b'[i]|$. It has to reach $n'$ bit positions and for each connection it crosses, in the worst case, all the 9 other selection logic wires:
$$|b'[i]| = n' \, (b(\mathrm{box}') + 10\delta).$$
With this we can evaluate the delay of the tree $T'$:
$$\mathrm{TIME}'_{Tree}(n, \nu) = D_{SL} + 2D_{FA}(1 + \mu') + \nu \left( \max\left\{ |a'[i]|, \; |b'[i]| + \frac{D_{BD}}{\nu} \right\} + |\mathrm{sum4'}| + |\mathrm{carry4'}| + \mathrm{Carry}'_{4/2} + \mathrm{Sum}'_{4/2} + \mathrm{Channel}' \right).$$
Fig. 13. Relative TIME improvement $1 - \mathrm{TIME}'_{Tree}/\mathrm{TIME}_{Tree}$ (%) by Booth recoding for $n = m = 53$, as a function of $\nu$.

In Fig. 13 the relative improvement $(\mathrm{TIME}_{Tree} - \mathrm{TIME}'_{Tree})/\mathrm{TIME}_{Tree}$ is plotted for $n = m = 53$ as a function of $\nu$. This figure is surprisingly complex and counterintuitive. In particular, we see that the gains due to Booth recoding decrease with increasing $\nu$, i.e. with slower wires. For small values of $\nu$, until $\nu \approx 0.1$, the delay of the Booth decoder $D_{BD} = 3$ plus the delay of the b-nets is larger than the delay of the a-nets. For small $\nu$ we have
$$\mathrm{TIME}'_{Tree} \approx 55 + 537\nu,$$
whereas for large $\nu$ we have
$$\mathrm{TIME}'_{Tree} \approx 52 + 614\nu.$$
We always have
$$\mathrm{TIME}_{Tree} = 62 + 710\nu.$$
This explains why the graph has two branches. For $\nu = 0$ only gate delays count and we have
$$\mathrm{TIME}'_{Tree}/\mathrm{TIME}_{Tree} \approx 55/62 \approx 89\%.$$
This explains why the branch for the b-nets starts around $100\% - 89\% = 11\%$. It remains to explain why the branch for the a-nets is falling with $\nu$. For $\nu = 0$ that branch starts at $1 - \frac{52}{62} \approx 100\% - 84\% = 16\%$. For large $\nu$ the savings are dominated by wire delays and approach $1 - (52 + 614)/(62 + 710) \approx 14\% < 16\%$. Thus, the branch falls because the wire delays have not fallen much due to Booth recoding. The reason for this is the geometry of our particular layouts, which are quite wide and not very high. Denote by $\mathrm{wire}(T)$ resp. $\mathrm{wire}(T')$ the wire delays for $T$ and $T'$. Then, we have for $\nu = 1$:
$$\mathrm{wire}(T) \approx w(T)/2 + 2h(T), \qquad \mathrm{wire}(T') \approx w(T')/2 + 2h(T'),$$
because a-nets travel horizontally over half the layout and vertically over almost the full layout; the longest paths from the leaves to the root travel vertically over almost the full layout, whereas the horizontal displacement is logarithmic. We have
$$w(T) \approx w(T') \approx 2n \, b(\mathrm{box}) \approx 2n \cdot 5.7.$$
If we would use as leaves only 4/2-adders, then the tree $T$ would have $n/4$ leaves, each of height $h(\mathrm{box4}) \approx 9.9$, and around $n/4$ interior nodes, each of height $h(\mathrm{box4/2}) \approx 6.7$. Thus,
$$h(T) \approx (n/4)(9.9 + 6.7) = 4.15n.$$
Similarly, we have $h(T') \approx (n/8)(h(\mathrm{box4'}) + h(\mathrm{box4/2'})) \approx (n/8)(17.5 + 6.7) = 3.025n$. For $n = 53$ we get $\mathrm{wire}(T') \approx 623$ and $\mathrm{wire}(T) \approx 742$. Note that we have $h(T)/w(T) \approx \frac{1}{3}$ and $h(T')/w(T') \approx \frac{1}{4}$. If we make the layouts more square,² say by doubling $h$ and halving $w$, then the savings due to Booth recoding increase but the layouts become slower! This incidentally shows that the common practice of making the layouts of addition trees roughly square is not always a good idea.

² This is common practice.
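The two branches of Fig. 13 can be reproduced from the linear approximations just given. The sketch below is only an illustration based on the approximate expressions $55 + 537\nu$, $52 + 614\nu$ and $62 + 710\nu$ for $n = m = 53$; it takes the slower of the two Booth paths as the tree delay.

```python
def time_tree(nu):
    """Non-Booth tree delay for n = m = 53 (approximation from the text)."""
    return 62 + 710 * nu

def time_tree_booth(nu):
    """Booth tree delay: b-net/decoder path for small nu, a-net path for larger nu."""
    return max(55 + 537 * nu, 52 + 614 * nu)

if __name__ == "__main__":
    for nu in (0.0, 0.05, 0.1, 0.2, 0.5, 1.0):
        saving = 1 - time_tree_booth(nu) / time_tree(nu)
        print(f"nu = {nu:4.2f}   relative TIME saving = {saving:.1%}")
```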
7. Evaluations
In this section, we evaluate the savings due to Booth recoding for a wide range of $\nu$ and also analyze the asymptotic behaviour. The evaluations are presented in two parts: in the first part the gate model from Section 4 is considered, in the second part we consider the layout model from Section 6. To be able to compare the four different designs for every particular $\nu$, on the one hand layouts had to be analyzed also for the multiplication arrays. On the other hand, the delay, cost and area formulae of the multiplication tree designs had to be extended to the cases where $M/2 \le m < \frac{3}{4}M$. These extensions have been done following the presented designs in a straightforward way.
We do not only focus on the savings in cost, area and delay, but we consider the more general quality metrics, the InverseGateQuality (IGQ) and the InverseLayoutQuality (ILQ), that are defined below.

Definition 2. Depending on a quality parameter $0 \le q \le 1$, we define the InverseGateQuality (IGQ) of a design $X$ that has the cost $C_X(n,m)$ and the delay $D_X(n,m)$ by
$$\mathrm{IGQ}_X(n, q) = C_X(n,n)^q \, D_X(n,n)^{(1-q)}.$$
Note that we get the delay by $\mathrm{IGQ}_X(n, 0) = D_X(n,n)$ and the cost by $\mathrm{IGQ}_X(n, 1) = C_X(n,n)$. Because the delay $D_X(n,n)$ of a design $X$ is the inverse of the performance of the design, $P_X(n,n) = 1/D_X(n,n)$, for $q = \frac{1}{2}$ the IGQ metric relates to the cost-performance ratio:
$$\mathrm{IGQ}_X(n, 0.5) = \sqrt{C_X(n,n) \, D_X(n,n)} = \sqrt{C_X(n,n)/P_X(n,n)}.$$

Definition 3. We define the IGQ of the best among our designs that do not use Booth recoding by
$$\mathrm{bestIGQ}(n, q) = \min(\mathrm{IGQ}_{Array}(n,q), \mathrm{IGQ}_{Tree}(n,q)).$$
Analogously, the IGQ of the best among our designs using Booth recoding is defined by
$$\mathrm{bestIGQ}'(n, q) = \min(\mathrm{IGQ}'_{Array}(n,q), \mathrm{IGQ}'_{Tree}(n,q)).$$
Moreover, we define the relative savings of the IGQ due to Booth recoding:
$$\mathrm{IGQsave}(n, q) = 1 - \mathrm{bestIGQ}'(n,q)/\mathrm{bestIGQ}(n,q).$$

Definition 4. Depending on a quality parameter $0 \le q \le 1$, we define the InverseLayoutQuality (ILQ) of a design $X$ that requires the area $\mathrm{AREA}_X(n,m)$ and has the delay $\mathrm{TIME}_X(n,m,\nu)$ by
$$\mathrm{ILQ}_X(n, \nu, q) = \mathrm{AREA}_X(n,n)^q \, \mathrm{TIME}_X(n,n,\nu)^{(1-q)}.$$
Note that we get the TIME metric by $\mathrm{ILQ}_X(n,\nu,0)$ and the AREA metric by $\mathrm{ILQ}_X(n,\nu,1)$. Moreover, for $q = \frac{1}{2}$ and $q = \frac{2}{3}$ the ILQ metric relates to the metrics commonly known as $AT$ and $AT^2$, respectively.

Definition 5. We define the ILQ of the best among our designs that do not use Booth recoding by
$$\mathrm{bestILQ}(n, \nu, q) = \min(\mathrm{ILQ}_{Array}(n,\nu,q), \mathrm{ILQ}_{Tree}(n,\nu,q)).$$
Accordingly, the ILQ of the best among our designs using Booth recoding is defined by
$$\mathrm{bestILQ}'(n, \nu, q) = \min(\mathrm{ILQ}'_{Array}(n,\nu,q), \mathrm{ILQ}'_{Tree}(n,\nu,q)).$$
Moreover, we define the relative savings of the ILQ due to Booth recoding:
$$\mathrm{ILQsave}(n, \nu, q) = 1 - \mathrm{bestILQ}'(n,\nu,q)/\mathrm{bestILQ}(n,\nu,q).$$
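Both metrics are one-liners once cost, delay, area and time of a design are known. The sketch below is an illustration only; as an example it evaluates the relative IGQ savings of the Booth multiplication array over the non-Booth array for $n = m = 53$, using the cost and delay values quoted in Section 4.2.

```python
def igq(cost, delay, q):
    """InverseGateQuality: IGQ_X(n, q) = C_X(n,n)^q * D_X(n,n)^(1-q)."""
    return cost ** q * delay ** (1 - q)

def ilq(area, time, q):
    """InverseLayoutQuality: ILQ_X(n, nu, q) = AREA_X(n,n)^q * TIME_X(n,n,nu)^(1-q)."""
    return area ** q * time ** (1 - q)

if __name__ == "__main__":
    # gate model, n = m = 53: array cost/delay without and with Booth recoding
    C, D = 43460, 308
    Cb, Db = 35457, 157
    for q in (0.0, 0.5, 1.0):
        saving = 1 - igq(Cb, Db, q) / igq(C, D, q)
        print(f"q = {q:3.1f}   relative IGQ saving (array designs) = {saving:.1%}")
```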
7.1. Gate analysis
7.1.1. Asymptotic considerations

The following lemma states that for large $n$ the multiplication arrays have the best IGQ for $q = 1$, whereas the multiplication trees have the best IGQ for $q < 1$.

Lemma 6. For large $n$:
$$\mathrm{bestIGQ}(n, q) = \begin{cases} \mathrm{IGQ}_{Array}(n,q) & \text{if } q = 1, \\ \mathrm{IGQ}_{Tree}(n,q) & \text{if } 0 \le q < 1, \end{cases} \qquad \mathrm{bestIGQ}'(n, q) = \begin{cases} \mathrm{IGQ}'_{Array}(n,q) & \text{if } q = 1, \\ \mathrm{IGQ}'_{Tree}(n,q) & \text{if } 0 \le q < 1. \end{cases}$$

Proof. We separate the proof into the two cases: (a) $q = 1$, and (b) $0 \le q < 1$.
(a) For $q = 1$, the quality metric becomes the cost metric $\mathrm{IGQ}_X(n,1) = C_X(n,n)$, so that from the cost formulae $C_{Tree}(n,n) = C_{Array}(n,n) + E \, C_{FA}$ and $C'_{Tree}(n,n) = C'_{Array}(n,n) + E' \, C_{FA}$ part (a) of the lemma already follows.
(b) For $0 \le q < 1$ and large $n$, we consider the fraction
$$\frac{\mathrm{IGQ}_{Array}(n,q)}{\mathrm{IGQ}_{Tree}(n,q)} = \frac{C_{Array}(n,n)^q \, D_{Array}(n,n)^{(1-q)}}{C_{Tree}(n,n)^q \, D_{Tree}(n,n)^{(1-q)}} = \left( \frac{n}{2\log(n)} \right)^{1-q} + o\!\left( \left( \frac{n}{2\log(n)} \right)^{1-q} \right) \ge 1.$$
The same argumentation can be used for the Booth designs, so that also part (b) of the proof is completed. $\Box$

Theorem 7. For large $n$, Booth recoding improves the best IGQ by
$$\mathrm{IGQsave}(n, q) = 1 - \left( \frac{C_{FA} + C_{SL}}{2(C_{FA} + C_{AND})} \right)^q \pm o(1).$$
Proof. We separate the proof into the two cases: (a) $q = 1$, and (b) $0 \le q < 1$.
(a) For $q = 1$, it follows from Lemma 6 that for large $n$
$$\mathrm{IGQsave}(n, 1) = 1 - \mathrm{IGQ}'_{Array}(n,1)/\mathrm{IGQ}_{Array}(n,1).$$
Because $\mathrm{IGQ}_{Array}(n,1) = C_{Array}(n,n) = (C_{FA} + C_{AND}) n^2 + o(n^2)$ and
$$\mathrm{IGQ}'_{Array}(n,1) = C'_{Array}(n,n) = (C_{FA} + C_{SL}) \, n \, \frac{n+1}{2} + o(n^2) = \frac{C_{FA} + C_{SL}}{2} \, n^2 + o(n^2),$$
we get $\mathrm{IGQ}'_{Array}(n,1)/\mathrm{IGQ}_{Array}(n,1) = (C_{FA} + C_{SL})/(2(C_{FA} + C_{AND})) \pm o(1)$, so that for large $n$
$$\mathrm{IGQsave}(n, 1) = 1 - \frac{C_{FA} + C_{SL}}{2(C_{FA} + C_{AND})} \pm o(1).$$
(b) For $0 \le q < 1$, it follows from Lemma 6 that for large $n$
$$\mathrm{IGQsave}(n, q) = 1 - \mathrm{IGQ}'_{Tree}(n,q)/\mathrm{IGQ}_{Tree}(n,q).$$
Using $\mathrm{IGQ}'_{Tree}(n,q) = C'_{Tree}(n,n)^q \, D'_{Tree}(n,n)^{(1-q)}$ and $\mathrm{IGQ}_{Tree}(n,q) = C_{Tree}(n,n)^q \, D_{Tree}(n,n)^{(1-q)}$, we get
$$\mathrm{IGQsave}(n, q) = 1 - \frac{C'_{Tree}(n,n)^q \, D'_{Tree}(n,n)^{(1-q)}}{C_{Tree}(n,n)^q \, D_{Tree}(n,n)^{(1-q)}}.$$
Because for large $n$, $D'_{Tree}(n,n)^{(1-q)}/D_{Tree}(n,n)^{(1-q)} = (\mu'/\mu)^{(1-q)} \pm o(1) = 1 \pm o(1)$, the equation for $\mathrm{IGQsave}(n,q)$ can be simplified to
$$\mathrm{IGQsave}(n, q) = 1 - \frac{C'_{Tree}(n,n)^q}{C_{Tree}(n,n)^q} + o\!\left( \frac{C'_{Tree}(n,n)^q}{C_{Tree}(n,n)^q} \right) = 1 - \left( \frac{(C_{FA} + C_{SL}) \, n \lceil (n+1)/2 \rceil}{(C_{FA} + C_{AND}) \, n^2} \right)^q \pm o(1) = 1 - \left( \frac{C_{FA} + C_{SL}}{2(C_{FA} + C_{AND})} \right)^q \pm o(1).$$
This completes the proof of the theorem. $\Box$

Corollary 8. Asymptotically, Booth recoding saves 25% of the cost of partial product generation and reduction (Theorem 7 with $q = 1$ and the presented basic circuit designs). The asymptotic relative IGQ savings are depicted in Fig. 14 as a function of the quality parameter $q$.
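With the basic circuit costs $C_{FA} = 14$, $C_{SL} = 10$ and $C_{AND} = 2$ used throughout the paper, the asymptotic expression of Theorem 7 is easy to tabulate (sketch only):

```python
C_FA, C_SL, C_AND = 14, 10, 2
ratio = (C_FA + C_SL) / (2 * (C_FA + C_AND))     # = 24/32 = 0.75

for q in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(f"q = {q:4.2f}   IGQsave(inf, q) = {1 - ratio ** q:.1%}")
# q = 1 gives the 25% asymptotic cost saving of Corollary 8
```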
7.1.2. Considerations for practical n

The relative savings of the best IGQ due to the use of Booth recoding are depicted in Fig. 15 for $8 \le n \le 64$ and $0 \le q \le 1$. This figure shows that Booth recoding is useful in most practical situations, except for small $n$ and $q \approx 1$. For $q = 1$ the savings of the IGQ due to Booth recoding are analyzed in detail. The following lemma states that, regarding the gate count ($\mathrm{IGQ}_X(n,1)$), Booth recoding is useful iff $n = 13$, $n = 15$ or $n \ge 17$.

Fig. 14. Asymptotic relative savings (%) of the best IGQ (regarding the gate model) due to Booth recoding as a function of $q$: $\mathrm{IGQsave}(\infty, q)$.

Fig. 15. Relative savings (%) of the best IGQ (regarding the gate model) due to Booth recoding: $\mathrm{IGQsave}(n, q)$.

Lemma 9.
$$\mathrm{bestIGQ}'(n, 1) < \mathrm{bestIGQ}(n, 1) \iff ((n = 13) \text{ OR } (n = 15) \text{ OR } (n \ge 17)).$$

Proof. From Lemma 6 we get $\mathrm{bestIGQ}'(n,1) = C'_{Array}(n,n)$ and $\mathrm{bestIGQ}(n,1) = C_{Array}(n,n)$. We distinguish two cases: (a) $n$ is even, and (b) $n$ is odd.
(a) If $n = m$ is even, then we can substitute $m'$ by $m' = n/2 + 1$ in $C'_{Array}(n,n)$. The condition $C'_{Array}(n,n) \le C_{Array}(n,n)$ is then equivalent to $n^2 - \frac{139}{8} n + \frac{49}{4} \ge 0$, so that for even $n > 0$ we get the solution $n \ge 16.63$, which is equivalent to even $n \ge 18$. (b) If $n = m$ is odd, then we can substitute $m'$ by $m' = n/2 + \frac{1}{2}$ in $C'_{Array}(n,n)$. The condition $C'_{Array}(n,n) \le C_{Array}(n,n)$ is then equivalent to $n^2 - \frac{115}{8} n + \frac{189}{8} \ge 0$, so that for odd $n > 0$ we get the solution $n \ge 12.48$, which is equivalent to odd $n \ge 13$.
Thus, the solutions of part (a) and part (b) combine to the lemma. The IGQ savings due to Booth recoding, $\mathrm{IGQsave}(n, 1)$, are depicted in Fig. 16 for $8 \le n \le 20$. $\Box$

Fig. 16. Relative savings (%) of the best IGQ due to Booth recoding for $q = 1$: $\mathrm{IGQsave}(n, 1)$.

Finally, Fig. 17 depicts which of the 4 different designs should be chosen according to the IGQ for $0 \le q \le 1$, $8 \le n \le 64$ and $10 \le n \le 10^{10}$.

Fig. 17. Value ranges of $q$ and $n$ for which the 4 designs have the best IGQ (regarding the gate model).
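The two quadratic conditions in the proof of Lemma 9 can be checked numerically; the following sketch simply computes their larger roots.

```python
import math

def larger_root(b, c):
    """Larger root of n^2 + b*n + c = 0."""
    return (-b + math.sqrt(b * b - 4 * c)) / 2

# even n:  n^2 - (139/8) n + 49/4  >= 0  ->  approx n >= 16.6  ->  even n >= 18
# odd  n:  n^2 - (115/8) n + 189/8 >= 0  ->  approx n >= 12.5  ->  odd  n >= 13
print(round(larger_root(-139 / 8, 49 / 4), 1))
print(round(larger_root(-115 / 8, 189 / 8), 1))
```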
7.2. Layout analysis

7.2.1. Asymptotic considerations
Lemma 10. With the definition of the condition $\mathrm{COAR} \iff ((0 \le q < 1) \text{ AND } (\nu \ne 0)) \text{ OR } (q = 1)$, the multiplication array designs have the asymptotically best ILQ for $\mathrm{COAR} = 1$. For $\mathrm{COAR} = 0$ the multiplication tree designs have the asymptotically best ILQ, so that for large $n$
$$\mathrm{bestILQ}(n, \nu, q) = \begin{cases} \mathrm{ILQ}_{Array}(n,\nu,q) & \text{if COAR}, \\ \mathrm{ILQ}_{Tree}(n,\nu,q) & \text{otherwise}, \end{cases} \qquad \mathrm{bestILQ}'(n, \nu, q) = \begin{cases} \mathrm{ILQ}'_{Array}(n,\nu,q) & \text{if COAR}, \\ \mathrm{ILQ}'_{Tree}(n,\nu,q) & \text{otherwise}. \end{cases}$$

Proof. We separate the proof into the two cases: (a) $\nu = 0$, and (b) $0 < \nu \le 1$.
(a) The proof for $\nu = 0$ is again partitioned into two parts for the cases (i) $q = 1$ and (ii) $0 \le q < 1$.
(i) From $q = 1$ and
$$\mathrm{ILQ}_{Tree}(n, 0, 1) = \mathrm{AREA}_{Tree}(n,n) = \Theta(n^2 \log(M)), \qquad \mathrm{ILQ}'_{Tree}(n, 0, 1) = \mathrm{AREA}'_{Tree}(n,n) = \Theta(n^2 \log(M)),$$
$$\mathrm{ILQ}_{Array}(n, 0, 1) = \mathrm{AREA}_{Array}(n,n) = \Theta(n^2), \qquad \mathrm{ILQ}'_{Array}(n, 0, 1) = \mathrm{AREA}'_{Array}(n,n) = \Theta(n^2),$$
it follows that for large $n$
$$\mathrm{bestILQ}(n, 0, 1) = \mathrm{ILQ}_{Array}(n, 0, 1), \qquad \mathrm{bestILQ}'(n, 0, 1) = \mathrm{ILQ}'_{Array}(n, 0, 1).$$
(ii) For $\nu = 0$,
$$\mathrm{ILQ}_{Array}(n, 0, q) = \mathrm{AREA}_{Array}(n,n)^q \, \mathrm{TIME}_{Array}(n,n,0)^{(1-q)} = \Theta(n^{q+1}),$$
$$\mathrm{ILQ}_{Tree}(n, 0, q) = \mathrm{AREA}_{Tree}(n,n)^q \, \mathrm{TIME}_{Tree}(n,n,0)^{(1-q)} = \Theta(n^{2q} \log(n)).$$
Because for $0 \le q < 1$
$$\frac{n^{q+1}}{n^{2q} \log(n)} = \frac{n^{1-q}}{\log(n)} \ge 1$$
for large $n$, we get for large $n$
$$\mathrm{bestILQ}(n, 0, q) = \mathrm{ILQ}_{Tree}(n, 0, q).$$
In the same way we can show for $0 \le q < 1$ that for large $n$
$$\mathrm{bestILQ}'(n, 0, q) = \mathrm{ILQ}'_{Tree}(n, 0, q).$$
This completes part (a) of the proof.
(b) In the following we assume that $0 < \nu \le 1$. Because $\mathrm{TIME}_{Tree}(n,n,\nu) = \Theta(n \log(n))$, $\mathrm{TIME}'_{Tree}(n,n,\nu) = \Theta(n \log(n))$, $\mathrm{TIME}_{Array}(n,n,\nu) = \Theta(n)$ and $\mathrm{TIME}'_{Array}(n,n,\nu) = \Theta(n)$, we get for large $n$
$$\mathrm{bestILQ}(n, \nu, 0) = \mathrm{TIME}_{Array}(n,n,\nu), \qquad \mathrm{bestILQ}'(n, \nu, 0) = \mathrm{TIME}'_{Array}(n,n,\nu).$$
Accordingly, from $\mathrm{AREA}_{Tree}(n,n) = \Theta(n^2 \log(n))$, $\mathrm{AREA}'_{Tree}(n,n) = \Theta(n^2 \log(n))$, $\mathrm{AREA}_{Array}(n,n) = \Theta(n^2)$ and $\mathrm{AREA}'_{Array}(n,n) = \Theta(n^2)$, it follows for large $n$ that
$$\mathrm{bestILQ}(n, \nu, 1) = \mathrm{AREA}_{Array}(n,n), \qquad \mathrm{bestILQ}'(n, \nu, 1) = \mathrm{AREA}'_{Array}(n,n).$$
Because for $\nu > 0$ the multiplication arrays have both the asymptotically smallest AREA and the asymptotically fastest TIME, we get for large $n$
$$\mathrm{bestILQ}(n, \nu, q) = \mathrm{AREA}_{Array}(n,n)^q \, \mathrm{TIME}_{Array}(n,n,\nu)^{(1-q)} = \mathrm{ILQ}_{Array},$$
$$\mathrm{bestILQ}'(n, \nu, q) = \mathrm{AREA}'_{Array}(n,n)^q \, \mathrm{TIME}'_{Array}(n,n,\nu)^{(1-q)} = \mathrm{ILQ}'_{Array}.$$
This completes case (b) of the proof. $\Box$
Theorem 11. We use the previous definition of the condition $\mathrm{COAR} \iff ((0 \le q < 1) \text{ AND } (\nu \ne 0)) \text{ OR } (q = 1)$. Then, for large $n$, the use of Booth recoding relatively saves the best ILQ by
$$\mathrm{ILQsave}(n, \nu, q) = \begin{cases} 1 - 0.641^q \left( \dfrac{11.66\nu + 3}{13.75\nu + 6} \right)^{(1-q)} \pm o(1) & \text{if COAR}, \\[2mm] 1 - 0.728^q \pm o(1) & \text{otherwise}. \end{cases}$$
The asymptotic relative ILQ savings are depicted in Fig. 18 as a function of $q$ and $\nu$.

Fig. 18. Asymptotic relative savings (%) of the best ILQ (regarding the layout model) due to Booth recoding for $0 < \nu \le 1$ and $0 \le q \le 1$.
Proof. First, we summarize the TIME and AREA formulae for the different designs:
$$\mathrm{AREA}_{Array}(n,n) = 23.92 n^2 + o(n^2),$$
$$\mathrm{AREA}'_{Array}(n,n) = 15.49 n^2 + o(n^2),$$
$$\mathrm{AREA}_{Tree}(n,n) = 1.73 n^2 \log(M) + o(n^2 \log(M)),$$
$$\mathrm{AREA}'_{Tree}(n,n) = 1.26 n^2 \log(M) + o(n^2 \log(M)),$$
$$\mathrm{TIME}_{Array}(n,n,\nu) = 13.75 \nu m + o(\nu m) + 6m + o(m),$$
$$\mathrm{TIME}'_{Array}(n,n,\nu) = 11.66 \nu m + o(\nu m) + 3m + o(m),$$
$$\mathrm{TIME}_{Tree}(n,n,\nu) = 0.2 \nu n \log(M) + o(\nu n \log(M)) + 12 \log(M) + o(\log(M)),$$
$$\mathrm{TIME}'_{Tree}(n,n,\nu) = 0.2 \nu n \log(M) + o(\nu n \log(M)) + 12 \log(M) + o(\log(M)).$$
From these equations, we can easily derive
$$\frac{\mathrm{ILQ}'_{Array}(n,\nu,q)}{\mathrm{ILQ}_{Array}(n,\nu,q)} = \frac{\mathrm{AREA}'_{Array}(n,n)^q \, \mathrm{TIME}'_{Array}(n,n,\nu)^{(1-q)}}{\mathrm{AREA}_{Array}(n,n)^q \, \mathrm{TIME}_{Array}(n,n,\nu)^{(1-q)}} = \left( \frac{15.49}{23.92} \right)^q \left( \frac{11.66\nu + 3}{13.75\nu + 6} \right)^{(1-q)} \pm o(1),$$
$$\frac{\mathrm{ILQ}'_{Tree}(n,\nu,q)}{\mathrm{ILQ}_{Tree}(n,\nu,q)} = \frac{\mathrm{AREA}'_{Tree}(n,n)^q \, \mathrm{TIME}'_{Tree}(n,n,\nu)^{(1-q)}}{\mathrm{AREA}_{Tree}(n,n)^q \, \mathrm{TIME}_{Tree}(n,n,\nu)^{(1-q)}} = \left( \frac{1.26}{1.73} \right)^q \cdot 1^{(1-q)} \pm o(1).$$
Finally, by the use of Lemma 10, we get that for large $n$
$$\mathrm{ILQsave}(n, \nu, q) = \begin{cases} 1 - \mathrm{ILQ}'_{Array}(n,\nu,q)/\mathrm{ILQ}_{Array}(n,\nu,q) & \text{if COAR}, \\ 1 - \mathrm{ILQ}'_{Tree}(n,\nu,q)/\mathrm{ILQ}_{Tree}(n,\nu,q) & \text{otherwise} \end{cases} = \begin{cases} 1 - 0.641^q \left( \dfrac{11.66\nu + 3}{13.75\nu + 6} \right)^{(1-q)} \pm o(1) & \text{if COAR}, \\ 1 - 0.728^q \pm o(1) & \text{otherwise}. \end{cases} \quad \Box$$
Corollary 12. From Theorem 11 we can extract the AREA savings ($q = 1$) for large $n$:
$$\mathrm{ILQsave}(n, \nu, 1) = 35.9\% \pm o(1).$$
Moreover, we can extract the TIME savings ($q = 0$) for large $n$:
$$\mathrm{ILQsave}(n, \nu, 0) = \begin{cases} 1 - \dfrac{11.66\nu + 3}{13.75\nu + 6} \pm o(1) & \text{if } \nu > 0, \\ \pm o(1) & \text{otherwise}. \end{cases}$$
The asymptotic TIME savings are depicted in Fig. 19.

Fig. 19. Asymptotic TIME savings due to Booth recoding.

7.2.2. Considerations for practical n

The relative savings of the best ILQ due to the use of Booth recoding are depicted in Fig. 20(a) for $8 \le n \le 64$, $0 \le q \le 1$ and $\nu = 0.05$, and in Fig. 20(b) for $8 \le n \le 64$, $q = 0$ and $0 \le \nu \le 1$. These figures show that Booth recoding is useful for most practical parameters regarding the ILQ.
Fig. 21 depicts which of the 4 different layouts should be chosen according to the ILQ for $0 \le q \le 1$ and $8 \le n \le 64$. Because the best practical ILQ layouts in this range do not seem to converge to the best asymptotic ILQ layouts from Lemma 10, the choice of the best ILQ layout is also depicted for a larger range of $n$ with $10 \le n \le 10^{10}$ and an even larger range of $n$ with $10^4 \le n \le 10^{80}$ in Fig. 21.

Fig. 20. Relative savings (%) of the best ILQ (regarding the layout model) due to Booth recoding for (a) $0 < q \le 1$, $8 \le m \le 64$ and $\nu = 0.05$, and (b) $0 \le \nu \le 1$, $8 \le m \le 64$ and $q = 0$.

Fig. 21. Value ranges of $q$ and $\nu$ for which the 4 designs have the best ILQ (regarding the layout model).
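The asymptotic TIME savings of Corollary 12 can be tabulated directly from the closed form (sketch only; the constants are those of Theorem 11):

```python
def asymptotic_time_saving(nu):
    """ILQsave(n, nu, 0) for large n and nu > 0: 1 - (11.66*nu + 3) / (13.75*nu + 6)."""
    return 1 - (11.66 * nu + 3) / (13.75 * nu + 6)

for nu in (0.01, 0.05, 0.1, 0.25, 0.5, 1.0):
    print(f"nu = {nu:5.2f}   asymptotic TIME saving = {asymptotic_time_saving(nu):.1%}")
# the savings range from about 50% (nu -> 0) down to about 26% (nu = 1), cf. Fig. 19
```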
8. Conclusions
We formally investigate the complexity of Booth recoding for fixed point multiplication. As the main contribution of this investigation we provide a formal version of the folklore theorem that Booth recoding saves the delay and the area of the multiplier's adder tree by constant factors between 26% and 50% minus low-order terms which depend on the length $n$ of the operands and a parameter $\nu$ for the wire delay. For the cost analysis we provide a formal version of the folklore theorem that balanced addition trees in $n$-bit multipliers have $n^2 + O(n \log n)$ full adders. This required the combinatorial argument from Section 3. In our investigations we did not only analyze the costs and delays considering gates, but we also considered layouts including the effects of wire delays. For practical $n$ in the range $8 \le n \le 64$ we specify the delay and the area gain due to Booth recoding exactly for the proposed designs. Parts of our gate analysis are already used and referenced by [28].
Our analyses are parametric not only in the operand size $n$; also the relative delay of the wires, the costs and delays of basic gates and the geometries of the basic circuits can be adjusted by parameters, so that the interested reader may choose his favorite technology/implementation in the analysis. Regarding the wire delay parameter $\nu$, we have demonstrated the surprising effect that for particular multiplier designs the relative delay savings due to Booth recoding might decrease as wires become slower. Empirical studies on a limited number of designs so far have suggested that Booth recoding should be particularly helpful if wires become slow [16]. We have analyzed this effect in some more detail.
Technically, the main contribution of the paper is the VLSI model from Section 5 and the techniques of analysis of the same section. The detailed yet tractable nature of this model opens the way for many further investigations. We list just a few:

* One can systematically study where drivers should be placed in order to make nets smaller and hence speed up signal propagation.
* One can analyze hybrid layouts (arrays of small trees) and the layouts of many other multiplication designs [2,6,8-10,22,23,29-32].
* The layouts of various adder and shifter designs can be analyzed in a quite realistic way.
* It is common practice to 'fold' layouts of addition trees into more square layouts. But the formula
$$\mathrm{wire}(T) \approx w(T)/2 + 2h(T)$$
suggests that trivial folding of layouts produces slower designs. This clearly needs closer investigation.

Acknowledgements

For helpful and inspiring discussions the authors thank Michael Bosch and Guy Even.
References
[1] L. Dadda, Some schemes for parallel multipliers, Alta Frequenza 34 (1965) 349–356.
[2] B.C. Drerup, E.E. Swartzlander, Fast multiplier bit-product matrix reduction using bit-ordering and parity
generation, J. VLSI Signal Process. 7 (1994) 249–257.
[3] J. Keller, W.J. Paul, Hardware Design, 2nd Edition, Teubner Verlagsgesellschaft, Stuttgart, Leipzig, 1997.
[4] I. Koren, Computer Arithmetic Algorithms, Prentice-Hall International, Englewood Cliffs, NJ, 1993.
[5] A.R. Omondi, Computer Arithmetic Systems; Algorithms, Architecture and Implementations, Series in Computer
Science, Prentice-Hall International, Englewood Cliffs, NJ, 1994.
[6] N. Takagi, H. Yasuura, S. Yajima, High-speed VLSI multiplication algorithm with a redundant binary addition
tree, IEEE Trans. Comput. C-34 (9) (1985) 217–220.
[7] C.S. Wallace, A suggestion for parallel multipliers, IEEE Trans. Electron. Comput. EC-13 (1964) 14–17.
[8] Z. Wang, A. Jullien, C. Miller, A new design technique for column compression multipliers, IEEE Trans. Comput.
44 (8) (1995) 962–970.
[9] H. Al-Twaijry, Area and performance optimized CMOS multipliers, Ph.D. Thesis, Stanford University, August
1997.
[10] G.W. Bewick, Fast multiplication: algorithms and implementation, Ph.D. Thesis, Stanford University, March
1994.
[11] A.D. Booth, A signed binary multiplication technique, Q. J. Mech. Appl. Math. 4 (2) (1951) 236–240.
[12] P.E. Madrid, B. Millar, E.E. Swartzlander, Modified Booth algorithm for high radix multiplication, IEEE
Computer Design Conference, 1992, pp. 118–121.
[13] L.P. Rubinfeld, A proof of the modified Booth’s algorithm for multiplication, IEEE Trans. Comput. (October
1975), pp. 1014–1015.
[14] H. Al-Twaijry, M. Flynn, Multipliers and datapaths, Technical Report CSL-TR-94-654, Stanford University,
December 1994.
[15] H. Al-Twaijry, M. Flynn, Performance/area tradeoffs in Booth multipliers, Technical Report CSL-TR-95-684,
Stanford University, November 1995.
[16] H. Al-Twaijry, M. Flynn, Technology scaling effects on multipliers, Technical Report CSL-TR-96-698, Stanford
University, July 1996.
[17] I. Wegener, The Complexity of Boolean Functions, Wiley, New York, 1987.
[18] C.D. Thompson, Area-time complexity for VLSI. Proceedings of the 11th Annual ACM Symposium on the
Theory of Computing, Vol. 11, 1979, pp. 81–88.
[19] J.D. Ullman, Computational Aspects of VLSI, Computer Science Press, Rockville, MD, 1984.
[20] L.A. Glasser, D.W. Dobberpuhl, The Design And Analysis Of VLSI Circuits, Addison-Wesley, Reading, MA,
1985.
[21] C. Mead, L. Conway, Introduction to VLSI Systems, Addison-Wesley, Reading, MA, 1980.
[22] L. Kuehnel, H. Schmeck, A closer look at VLSI multiplication, INTEGRATION, VLSI J. 6 (1988) 345–359.
[23] R.K. Yu, G.B. Zyner, 167 MHz Radix-4 floating point multiplier, Proceedings of the 12th Symposium on
Computer Arithmetic, Vol. 12, 1995, pp. 149–154.
[24] P. Kornerup, private communication.
[25] S.M. Mueller, W.J. Paul, The Complexity of Simple Computer Architectures, in: Lecture Notes in Computer
Science, Vol. 995, Springer, Berlin, 1995.
[26] J. Vuillemin, A very fast multiplication algorithm for VLSI implementation, INTEGRATION, VLSI J. 1 (1983)
39–52.
[27] C. Nakata, J. Brock, H4C Series: Design Reference Guide, CAD, 0.7 Micron Leff, Motorola Ltd., 1993, preliminary.
[28] S.M. Mueller, W.J. Paul, Computer Architecture: Complexity and Correctness, Springer, Berlin, 2000, ISBN 3-540-67481-0.
[29] Z.-J. Mou, F. Jutand, Overturned-stairs adder trees and multiplier design, IEEE Trans. Comput. 41 (8) (1992)
940–948.
[30] V.G. Oklobdzija, D. Villeger, S.S. Liu, A method for speed optimized partial product reduction and generation of
fast parallel multipliers using an algorithmic approach, IEEE Trans. Comput. 45 (3) (1996) 294–306.
[31] R.M. Owens, R.S. Bajwa, M.J. Irwin, Reducing the number of counters needed for integer multiplication,
Proceedings 12th Symposium on Computer Arithmetic, Vol. 12, 1995, pp. 38–41.
[32] K.F. Pang, R. Soong, H.-W. Sexton, P.H. Ang, Generation of high speed CMOS multiplier-accumulators,
Proceedings IEEE International Conference on Computer Design: VLSI in Computers and Processors, 1988, pp.
217–220.
1