INTEGRATION, the VLSI journal

To Booth or not to Booth

Wolfgang J. Paul^a,*, Peter-Michael Seidel^b

^a Computer Science Department, University of the Saarland, 66041 Saarbruecken, Germany
^b Computer Science and Engineering Department, Southern Methodist University, Dallas TX 75275, USA

Abstract

Booth recoding is a commonly used technique to recode one of the operands in binary multiplication. In this way the implementation of a multiplier's adder tree can be improved in both cost and delay. The improvement due to Booth recoding is attributed to improvements in the layout of the adder tree, especially regarding the lengths of wire connections, and thus cannot be analyzed with a simple gate model. Although conventional VLSI models consider wires in layouts, they usually neglect wires when modeling the delay. To make the layout improvements due to Booth recoding tractable in a technology-independent way, we introduce a VLSI model that also considers wire delays and constant factors. Based on this model we consider the layouts of binary multipliers in a parametric analysis, providing answers to the question whether to use Booth recoding or not. We formalize and prove the folklore theorems that Booth recoding improves the cost and cycle time of 'standard' multipliers by certain constant factors. We also analyze the number of full adders in certain 4/2-trees. © 2002 Elsevier Science B.V. All rights reserved.

Keywords: Multiplier design; Booth recoding; VLSI model; Layout analysis; Delay and area complexity; Wire effects

1. Introduction

Addition trees are the central part in the design of fixed point multipliers. They produce from a sequence of partial products a carry-save representation of the final product [1–8].
Booth recoding [9–13] is a classical method which cuts down the number of partial products in an addition tree by a factor of 2 or 3 at the expense of a more complex generation of partial products. In general, VLSI designers tend to implement Booth recoding, and a widely accepted rule of thumb says that Booth recoding improves both the cost and the speed of multipliers by some constant factor around 1/4. In [9,10,14–16], however, the usefulness of Booth recoding is challenged altogether. Finally, in [15] an ambitious case study of around 1000 concrete designs was performed, and for some technologies with small wire delays the fastest multipliers turned out not to use Booth recoding.

In spite of the obvious importance of the concept, the (potential) benefits of Booth recoding have apparently received no theoretical treatment yet. This is probably due to the following facts:

* The standard models of circuit complexity [17] and VLSI complexity [18,19] are only trusted to be exact up to constant factors.
* The standard theoretical VLSI model ignores wire delays.
* Conway/Mead style rules [20,21] do consider wire delays, but do not invite paper and pencil analysis due to their level of detail.
* Trivial addition trees with linear circuit depth have nice and regular layouts. But Wallace trees [2,6,7] and even 4/2-trees [3,4,22,23] have more irregular layouts. In the published literature, these layouts tend not to be specified at a level of detail which permits, say, to read off the lengths of wires readily.

In this paper, we introduce a VLSI model which accounts, in a hopefully meaningful way, for constant factors as well as wire delays and which is at the same time simple enough to permit the analysis of large circuits. The model depends on a parameter ν which specifies the influence of the wire delays. In this model, we study two standard designs of partial product generation and addition trees: the simple linear depth construction and a carefully chosen variant of 4/2-trees. We show that Booth recoding improves, for all reasonable values of ν, both the delay and the area of the resulting VLSI layouts by constant factors between 26% and 50%, minus low-order terms which depend on ν and the length n of the operands. For practical operand lengths in the range 8 ≤ n ≤ 64 we specify the gain exactly.

The paper is organized in the following way. In Section 2, we review Booth recoding as explained in [10,24]. Section 3 provides a combinatorial lemma which will subsequently permit us to count the number of full adders in certain 4/2-trees. In Section 4, we analyze the circuit complexity of Booth recoding. In Section 5, we introduce the VLSI model. The detailed circuit model from [25] is combined with a linear delay model for nets of wires. Section 6 contains the specification and analysis of layouts. The layouts for 4/2-trees follow partly a suggestion from [26]. In Section 7, the savings due to Booth recoding are evaluated. We analyze both the asymptotical behavior and the situation for practical n. We conclude in Section 8 and list some further work.

*Corresponding author. E-mail addresses: [email protected] (W.J. Paul), [email protected] (P.-M. Seidel).

2. Preliminaries

For a = a[n−1 : 0] = (a[n−1], …, a[0]) ∈ {0,1}^n we denote by

⟨a⟩ = Σ_{i=0}^{n−1} a[i]·2^i

the number represented by a, if we interpret it as a binary number. Conversely, for p ∈ {0, …, 2^n − 1} we denote the n-bit binary representation of p by bin_n(p). For x ∈ {0,1}^n and s ∈ {0,1} we define

x ⊕ s = (x[n−1] ⊕ s, …, x[0] ⊕ s).
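For concreteness, the notational conventions just introduced can be written as a few lines of Python; the helper names are ours, not the paper's.

```python
# Sketch of the conventions of Section 2 (helper names are ours).
# Bit vectors are Python lists with index i holding bit x[i].

def val(a):
    """<a>: the number represented by bit vector a."""
    return sum(bit << i for i, bit in enumerate(a))

def bin_n(p, n):
    """bin_n(p): the n-bit binary representation of p, for 0 <= p < 2**n."""
    return [(p >> i) & 1 for i in range(n)]

def xor_s(x, s):
    """x (+) s: XOR every bit of x with the single bit s."""
    return [bit ^ s for bit in x]

assert val(bin_n(13, 5)) == 13
assert xor_s([1, 0, 1], 1) == [0, 1, 0]
```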
An (n,m)-multiplier is a circuit with n inputs a = a[n−1 : 0], m inputs b = b[m−1 : 0] and n+m outputs p = p[n+m−1 : 0] such that ⟨a⟩·⟨b⟩ = ⟨p⟩ holds.

2.1. Multiplication

For j ∈ {0, …, m−1} and k, h ∈ {1, …, m} we define the partial sums

S_{j,k} = Σ_{t=j}^{j+k−1} ⟨a⟩·b[t]·2^t = ⟨a⟩·⟨b[j+k−1 : j]⟩·2^j ≤ (2^n − 1)(2^k − 1)·2^j < 2^{n+k+j}.

Then, we have

S_{j,1} = ⟨a⟩·b[j]·2^j,
S_{j,k+h} = S_{j,k} + S_{j+k,h}

and

⟨a⟩·⟨b⟩ = S_{0,m} = S_{0,m−1} + S_{m−1,1}.

Because S_{j,k} is a multiple of 2^j it has a binary representation with j trailing zeros. Because S_{j,k} is smaller than 2^{j+n+k} it has a binary representation of length n+j+k (see Fig. 1).

Fig. 1. S_{j,k} in carry-save representation.

2.2. Booth recoding

In the simplest form of Booth recoding (called Booth-2 recoding or Booth recoding radix 4) the multiplicator is recoded as suggested in Fig. 2. With b[m+1] = b[m] = b[−1] = 0 and m' = ⌈(m+1)/2⌉ one writes

⟨b⟩ = 2⟨b⟩ − ⟨b⟩ = Σ_{j=0}^{m'−1} B_{2j}·4^j,

where

B_{2j} = 2b[2j] + b[2j−1] − 2b[2j+1] − b[2j] = −2b[2j+1] + b[2j] + b[2j−1].

The numbers B_{2j} ∈ {−2, −1, 0, 1, 2} are called Booth digits, and we define their sign bits s_{2j} by

s_{2j} = 0 if B_{2j} ≥ 0, and s_{2j} = 1 if B_{2j} < 0.

Fig. 2. Booth digits B_{2j}.

With

C_{2j} = ⟨a⟩·B_{2j} ∈ {−2^{n+1}+2, …, 2^{n+1}−2},
D_{2j} = ⟨a⟩·|B_{2j}| ∈ {0, …, 2^{n+1}−2},
d_{2j} = bin_{n+1}(D_{2j}),

the product can be computed from the sums

⟨a⟩·⟨b⟩ = Σ_{j=0}^{m'−1} ⟨a⟩·B_{2j}·4^j = Σ_{j=0}^{m'−1} C_{2j}·4^j = Σ_{j=0}^{m'−1} (−1)^{s_{2j}}·D_{2j}·4^j.

In order to avoid negative numbers C_{2j} one sums the positive E_{2j} instead:

E_{2j} = C_{2j} + 3·2^{n+1},  e_{2j} = bin_{n+3}(E_{2j}),
E_0 = C_0 + 4·2^{n+1},  e_0 = bin_{n+4}(E_0).

This is illustrated in Fig. 3. The additional terms sum to

2^{n+1}·(1 + 3·Σ_{j=0}^{m'−1} 4^j) = 2^{n+1}·(1 + 3·(4^{m'} − 1)/3) = 2^{n+1+2m'}.

Because 2m' > m these terms are congruent to zero modulo 2^{n+m}. Thus

⟨a⟩·⟨b⟩ ≡ Σ_{j=0}^{m'−1} E_{2j}·4^j  (mod 2^{n+m}).

Fig. 3. Summation of the E_{2j}.

With respect to the standard subtraction algorithm for binary numbers, the e_{2j} can be computed by

⟨e_{2j}⟩ = ⟨1 s̄_{2j}, d_{2j} ⊕ s_{2j}⟩ + s_{2j},
⟨e_0⟩ = ⟨s̄_0 s_0 s_0, d_0 ⊕ s_0⟩ + s_0.

Based on these equations, the computation of the numbers

F_{2j} = E_{2j} − s_{2j},  f_{2j} = bin_{n+3}(F_{2j}),  f_0 = bin_{n+4}(F_0)

is easy, namely

f_{2j} = (1 s̄_{2j}, d_{2j} ⊕ s_{2j}),
f_0 = (s̄_0 s_0 s_0, d_0 ⊕ s_0).

Instead of adding the sign bit s_{2j} to the number F_{2j} one incorporates it at the proper position into the representation of F_{2j+2}, as suggested in Fig. 4. The last sign bit does not create a problem because B_{2m'−2} is always nonnegative.

Fig. 4. Partial product construction.
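The radix-4 recoding above is easy to check exhaustively for small m; the following sketch (function names are ours) verifies that the Booth digits indeed represent ⟨b⟩.

```python
# Sketch: Booth-2 digits B_2j = -2*b[2j+1] + b[2j] + b[2j-1] (Section 2.2),
# with b[-1] = b[m] = b[m+1] = 0 and m' = ceil((m+1)/2). Names are ours.

def booth_digits(b, m):
    """Booth-2 digits of the m-bit vector b."""
    def bit(i):
        return b[i] if 0 <= i < m else 0   # b[-1] = b[m] = b[m+1] = 0
    m_prime = (m + 2) // 2                 # ceil((m+1)/2)
    return [-2 * bit(2*j + 1) + bit(2*j) + bit(2*j - 1) for j in range(m_prime)]

# <b> = sum of B_2j * 4**j, checked for every 6-bit operand.
m = 6
for value in range(2 ** m):
    b = [(value >> i) & 1 for i in range(m)]
    digits = booth_digits(b, m)
    assert all(d in (-2, -1, 0, 1, 2) for d in digits)
    assert sum(d * 4**j for j, d in enumerate(digits)) == value
```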
Formally, let

g_{2j} = (f_{2j}, 0 s_{2j−2}) ∈ {0,1}^{n+5} for j > 0,
g_0 = (f_0, 00) ∈ {0,1}^{n+6}.

Then, we have

⟨g_{2j}⟩ = 4·⟨f_{2j}⟩ + s_{2j−2},

and hence we get the product by

⟨a⟩·⟨b⟩ ≡ Σ_{j=0}^{m'−1} E_{2j}·4^j = Σ_{j=0}^{m'−1} ⟨g_{2j}⟩·4^{j−1}  (mod 2^{n+m})

with the g_{2j} as partial products. We define

S'_{2j,2k} = Σ_{t=j}^{j+k−1} ⟨g_{2t}⟩·4^{t−1}.

Then, we have analogously to the nonBooth case

S'_{2j,2} = ⟨g_{2j}⟩·4^{j−1},
S'_{2j,2(k+h)} = S'_{2j,2k} + S'_{2(j+k),2h}.

One can easily show that S'_{2j,2k} is a multiple of 2^{2j−2} and that S'_{2j,2k} < 2^{n+2j+2k+2}. Therefore, at most n + 2k + 4 nonzero positions are necessary to represent S'_{2j,2k} in both carry-save and binary form.

3. A combinatorial lemma

Let T be a complete binary tree with depth μ. We number the levels ℓ from the leaves to the root from 0 to μ. Each leaf u has a weight W(u). For some natural number k we have W(v) ∈ {k, k+1} for all leaves, and the weights are non-decreasing from left to right. Let m be the sum of the weights of the leaves. For μ = 4, m = 53 and k = 3 the leaves could, for example, have the weights 3333333333344444. For each subtree t of T we define W(t) = Σ W(u), where u ranges over all leaves of t. For each interior node v of T we define L(v) (resp. R(v)) as the weight of the subtree rooted in the left (resp. right) son of v. We are interested in the sums

H_ℓ = Σ L(v),

where v ranges over all nodes of level ℓ, and in

H = Σ_{ℓ=0}^{μ−1} H_ℓ.

We show

Lemma 1. H ≤ (μ·m)/2.

Proof. By induction on the levels of T one shows that in each level the weights are nondecreasing from left to right and their sum is m. Hence

2H_ℓ ≤ Σ (L(v) + R(v)) = Σ W(v) = m,

and the lemma follows. □

In the above estimate we have replaced each weight L(v) by the arithmetic mean of L(v) and R(v), hereby overestimating L(v) by

h(v) = (L(v) + R(v))/2 − L(v) = (R(v) − L(v))/2.

For each node v in level ℓ the 2^ℓ leaves of the subtree rooted in v form a continuous subsequence of the leaves of T. Hence, all nodes in level ℓ except at most one have weights in {k·2^ℓ, (k+1)·2^ℓ}. Therefore, in each level ℓ there is at most one node v_ℓ such that h(v_ℓ) ≠ 0. We set h(ℓ) = h(v_ℓ) if it exists; otherwise, we set h(ℓ) = 0. It follows that

H = (μ·m)/2 − Σ_{ℓ=0}^{μ−1} h(ℓ).
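Lemma 1 and the exact formula for H can be checked by brute force on small weight sequences; the code below (our naming) computes H directly from the definition, level by level.

```python
# Sketch: compute H = sum over levels of the left-subtree weights L(v)
# for a complete binary tree whose leaf weights are listed left to right,
# and check Lemma 1 (H <= mu*m/2). Function name is ours.

def tree_H(weights):
    """H for a complete binary tree with the given leaf weights."""
    assert len(weights) & (len(weights) - 1) == 0, "need 2**mu leaves"
    level, H = list(weights), 0
    while len(level) > 1:
        # L(v) is the weight of the left son; sum them over this level.
        H += sum(level[i] for i in range(0, len(level), 2))
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
    return H

# mu = 4, m = 53, weights 3333333333344444: H = 101 = (4*53)/2 - 5.
assert tree_H([3]*11 + [4]*5) == 101
# mu = 3, m = 27, weights 33333444: H = 38 = (3*27)/2 - 5/2.
assert tree_H([3]*5 + [4]*3) == 38
```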
In the above example, we have h_0 = 1/2, h_1 = 1/2, h_2 = 3/2, h_3 = 5/2 and H = 101. For μ = 3, m = 27 and weights 33333444 we have h_0 = 1/2, h_1 = 1/2, h_2 = 3/2 and H = 38.

4. Gates

In this section, we use a simple gate model [25] to analyze the delay and cost of the multiplication circuits, considering only the influence of gates. This model does not consider fanout restrictions or wire effects. The delay is the maximum gate delay of all paths from input bits a[i] and b[j] to output bits p[k]. The cost is computed as the cumulative cost of all gates used. For this purpose the basic gate delays and costs can be extracted from any design system. We will consider the Motorola technology [27] (Fig. 5), but the reader might choose his favorite technology in our parametrical analysis.

Fig. 5. Motorola technology parameters.

4.1. Partial product generation

4.1.1. NonBooth
One has to compute and shift the binary representations of the numbers

⟨a⟩·b[j] = ⟨a[n−1] ∧ b[j], …, a[0] ∧ b[j]⟩.

The nm AND gates have a cost of 2nm. As they are used in parallel, they have a total delay of 2.

4.1.2. Booth
The binary representations of the numbers S'_{2j,2} must be computed (see Fig. 4). These are

g_{2j} = (1 s̄_{2j}, d_{2j} ⊕ s_{2j}, 0 s_{2j−2}),
g_0 = (s̄_0 s_0 s_0, d_0 ⊕ s_0, 00),

shifted by 2j−2 bit positions. The d_{2j} = bin_{n+1}(⟨a⟩·|B_{2j}|) are easily determined from B_{2j} and a by

d_{2j} = (0, …, 0) if B_{2j} = 0,
d_{2j} = (0, a) if |B_{2j}| = 1,
d_{2j} = (a, 0) if |B_{2j}| = 2.

For this computation two signals indicating |B_{2j}| = 1 and |B_{2j}| = 2 are necessary. We denote these by

b1_{2j} = 1 if |B_{2j}| = 1, and 0 otherwise,
b2_{2j} = 1 if |B_{2j}| = 2, and 0 otherwise,

and calculate them by the Booth decoder logic BD from Fig. 6(b), which can be developed from Fig. 6(a). Such a Booth decoder has cost C_BD = 11 and delay D_BD = 3. The selection logic SL from Fig. 6(c) directs either a[i] or a[i+1] or 0 to position [i+1] of d_{2j}, and the inversion depending on s_{2j} yields g_{2j}[i+3]. In the first 2 (3 for the rightmost partial product) and last 2 bit positions this logic is replaced by the simple signal of a sign bit, its inverse, a zero or a one. The selection logic has cost C_SL = 10 and delay D_SL = 4.

Fig. 6. (a) Booth digit signals, (b) Booth decoder and (c) selection logic.

As m' Booth decoders are necessary, the decoding logic costs m'·C_BD = 11m'. The selection logic occurs (n+1) times for each partial product. Therefore, the selection logics have altogether a cost of

m'·(n+1)·C_SL = 10m'(n+1).

4.2. Redundant partial product addition

We study two standard constructions for the reduction of the partial products to a carry-save representation of the product: plain multiplication arrays and 4/2-trees. Each construction uses k-bit carry-save adders (k-CSAs, 3/2-adders) [3–5]. They are composed of k full adders working in parallel, and hence they have cost k·C_FA and delay D_FA. Cascading two k-CSAs one constructs a k-bit 4/2-adder, i.e. a circuit with 4k inputs a[k−1 : 0], b[k−1 : 0], c[k−1 : 0], d[k−1 : 0] and 2(k+1) outputs s[k : 0], t[k : 0] such that

⟨a⟩ + ⟨b⟩ + ⟨c⟩ + ⟨d⟩ ≡ ⟨s⟩ + ⟨t⟩  (mod 2^{k+1})

holds. The k-bit 4/2-adders constructed in this way have cost 2k·C_FA and delay 2·D_FA. We consider full adder implementations with cost C_FA = 14 and delay D_FA = 6. There are optimized implementations of 4/2-adders like the ones described in [23]; we do not use them in this analysis in order to keep the presentation simple.

In order to introduce techniques of analysis we review quite formally the construction of standard multiplication arrays [3,21]. A carry-save representation of S_{0,3} can be computed by a single n-CSA (see Fig. 7(a)). Exploiting

S_{0,t} = S_{0,t−1} + S_{t−1,1},

one can compute a carry-save representation of S_{0,t} from a carry-save representation of S_{0,t−1} and the binary representation of S_{t−1,1} by an n-CSA. This works because both S_{0,t−1} and S_{t−1,1} can be represented with n+t−1 bits and because the binary representation of S_{t−1,1} has t−1 trailing zeros (see Fig. 7(c)). The m−2 many n-CSAs which are cascaded this way have cost n(m−2)·C_FA and delay (m−2)·D_FA. The combined cost and delay for (nonBooth) partial product generation and the multiplication array are

C_Array = n(m−2)·C_FA + nm·C_AND = 16nm − 28n,
D_Array = (m−2)·D_FA + D_AND = 6m − 10.

Fig. 7. Partial compression of (a) S_{i,3}; (b) S_{i,4}; (c) S_{0,t} from a carry-save representation of S_{0,t−1} and S_{t−1,1}; and (d) S_{i,k+h} from carry-save representations of S_{i,k} and S_{i+k,h}.

With Booth recoding one has only to sum the m' (representations of) partial products g_{2j}. Each partial product has length n' = n+5, except g_0 which has n'+1 bits. Arguing with the sums S'_{0,2t} in place of the sums S_{0,t} one shows that (m'−2) many n'-CSAs suffice to sum the partial products in a multiplication array with Booth-2 recoding. Taking into account the partial product generation one obtains cost and delay

C'_Array = n'(m'−2)·C_FA + (n+1)m'·C_SL + m'·C_BD = 24nm' + 91m' − 28n − 140,
D'_Array = (m'−2)·D_FA + D_SL + D_BD = 6m' − 5.

For n = m = 53 (double precision) we get

C'_Array/C_Array = 35457/43460 = 81.6%,
D'_Array/D_Array = 157/308 = 50.9%.
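The closed forms above are easy to evaluate; the following sketch (our code, using the component values C_FA = 14, D_FA = 6, C_AND = D_AND = 2, D_SL = 4, D_BD = 3) reproduces the array cost and delay figures for both precisions.

```python
# Sketch: evaluate the multiplication-array formulas of Section 4.2.
# Function names are ours.

def m_prime(m):
    return (m + 2) // 2          # m' = ceil((m+1)/2)

def c_array(n, m):               # nonBooth array cost: 16nm - 28n
    return 16 * n * m - 28 * n

def d_array(m):                  # nonBooth array delay: 6m - 10
    return 6 * m - 10

def d_array_booth(m):            # Booth array delay: 6m' - 5
    return 6 * m_prime(m) - 5

assert c_array(53, 53) == 43460 and c_array(24, 24) == 8544
assert (d_array_booth(53), d_array(53)) == (157, 308)   # ratio 50.9%
assert (d_array_booth(24), d_array(24)) == (73, 134)    # ratio 54.5%
```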
For the top level of the tree we distinguish two cases. If 3M=4pmpM; we use a ¼ m 3M=4 many 42-adders and M=4 a many 32-adders. We arrange the 32-adders of the top level of the tree at the left of the 42-adders. If M=2omo3M=4 we use in the top level only b ¼ m M=2 many 32-adders (and we feed m 3b representations of partial products directly into the lower part). We arrange the 32-adders at the right end of the tree. For m ¼ 24 we have mX24; and use 832 adders as leaves. For m ¼ 53 we have mX48; and we use 1132-adders and 542-adders as leaves. We only analyze the first case explicitly. PR 15 D 13 The situation changes in two respects when we consider addition trees with logarithmic delay like Wallace trees [2,6,7] or 42-trees [1,3,4,22,23]. First, counting gates in such a tree becomes a somewhat nontrivial problem. Second, Booth recoding only yields a marginal saving in time. TE 11 EC 5 (M/4-a)-times Sm-2,1 Sm-1,1 Sm-3,1 R 33 35 O S m-3,3 43 N S4a+1,1 S4a-1,1 S4a-3,1 S4a+2,1 S4a,1 S4a-2,1 S4a-4,1 S 4a+4,3 S 4a,4 complete 4:2 adder tree U 41 C 37 39 a-times R 31 Fig. 8. Adder tree construction. S3,1 S1,1 S2,1 S 4a,4 S0,1 VLSI : 601 ARTICLE IN PRESS W.J. Paul, P.-M. Seidel / INTEGRATION, the VLSI journal ] (]]]]) ]]]–]]] 12 13 15 17 19 Si;kþh ¼ Si;k þ Siþk;h ; where a carry-save representation of Si;k is provided by the right son of u and Siþk;h is provided by the left son of u: By the results of Section 3 we are in the situation of Fig. 7(d). Hence node u has 2h excess full adders. Referring to Section 3 we label each leaf u of the tree with the number of partial products it sums, i.e. with W ðuÞ ¼ 3 if u is a 32-adder and with W ðuÞ ¼ 4 if it is a 42-adder, then h ¼ LðuÞ and for the number E ¼ 2H of excess full adders in the tree we have 21 E ¼ ðmmÞ 2 27 29 31 hc c¼0 p ðmmÞ: TE 25 m1 X This implies, that the 42-trees constructed above have nm þ oðnmÞ full adders. 
The proof only depends on the fact that for each interior node u the left son of u sums less partial products than the right son of u: Hence, it applies to many other balanced addition trees. One hardly dares to state or prove such a folklore result because it is ‘obviously’ known. Unfortunately, we have not been able to locate it in the literature and we need it later. With partial product generation we get cost and delay EC 23 F 11 O 9 O 7 full adders, because every 32-adder reduces the number of partial products by 1 and every 42-adder reduces the number of partial products by 2. It remains to estimate the number of excess full adders in the tree. Every leaf in the tree which is a 32-adder computes a sum Si;3 : Thus, n full adders suffice (see Fig. 7(a)). A leaf of the tree which is a 42-adder computes a sum Si;4 : It can be simplified as shown in Fig. 7(b) such that 2n full adders suffice. Thus, in the top level of the tree there are no excess full adders. Each 42-adder u in the lower portion of the tree performs a computation of the form PR 5 F ¼ nðm 2Þ D 3 If all 32- and 42-adders in the tree would consist of exactly n resp. 2n full adders, then the tree would have a total of CTree ¼ ðnðm 2Þ þ EÞCFA þ nmCAND R 1 ¼ 16nm þ oðnmÞ; 35 DTree ¼ 2ðm þ 1ÞDFA þ DAND O R 33 ¼ 12m þ 14 41 43 C N 39 ¼ 12JlogðmÞn 10: 4.3.2. Booth 0 Let M 0 ¼ 2Jlog m n the smallest power of two greater or equal m0 and let m0 ¼ logðM 0 =4Þ: For mAf24; 53g we have 34M 0 pm0 oM 0 : We proceed as above and only analyze this case. The standard length of 32-adders and 42-adders is now n0 ¼ n þ 5 bits; longer operands require excess full adders. Let E 0 be the number of excess full adders. U 37 VLSI : 601 ARTICLE IN PRESS W.J. Paul, P.-M. 
Seidel / INTEGRATION, the VLSI journal ] (]]]]) ]]]–]]] and we get E 0 ¼ 2ðm0 m0 Þ 4 15 17 19 21 23 25 27 29 31 Taking into account the partial product generation one obtains cost and delay F 0 ¼ ðn0 ðm0 2Þ þ E 0 ÞCFA þ ðn þ 1Þm0 CSL þ m0 CBD CTree ¼ 24nm0 þ oðnm0 Þ; O 13 p 2ðm0 m0 Þ: D0Tree ¼ 2ðm0 þ 1ÞDFA þ DBD þ DSL ¼ 12ðm0 þ 1Þ þ 7: For n ¼ m ¼ 53 we get 0 CTree =CTree ¼ 37305=46288 ¼ 80:6%; D0Tree =DTree ¼ 55=62 ¼ 88:7% and for n ¼ m ¼ 24 we get 0 =CTree ¼ 8531=9552 ¼ 89:3%; CTree D0Tree =DTree ¼ 43=50 ¼ 86:0%: 0 =CTree tends again to 12 Asymptotically, CTree 16 ¼ 75%: Unless m is a power of two, we have m ¼ 0 0 m þ 1 and DTree DTree ¼ 7: Hence, D0Tree =DTree tends to one as n grows large. This contradicts the common opinion that Booth recoding saves a constant fraction of the delay (independent of n). We try to resolve this contradiction in the next section by considering layouts. EC 11 ch0c R 9 X O 7 E 0 ¼ 4H 0 PR 5 D 3 Considering the sums S0 instead of the sums S one shows that the top level of the tree has no excess full adders. Let H 0 be the sum of labels of left sons in the resulting tree and for all c let h0c be the correction term for level c: Because with Booth recoding successive partial products are shifted by 2 positions, we now have TE 1 13 5. VLSI model 35 In the circuit complexity model we could only show, that a constant fraction of the cost is saved by Booth recoding. In order to explain, why Booth recoding also saves a constant fraction of the run time we have to consider wire delays in VLSI layouts. The layouts that we consider consist of simple rectangular circuits S; connected by nets N: For circuits S; we denote by bðSÞ the breadth, hðSÞ the height, CðSÞ the gate count and DðSÞ the combinatorial delay of S: We do not consider different delays for different inputs and outputs of the same simple circuit in order to keep the analysis simple. We require 41 43 O C N 39 U 37 R 33 bðSÞhðSÞ ¼ CðSÞ; i.e. 
We require

b(S)·h(S) = C(S),

i.e. area equals gate count, and

h(S)/2 ≤ b(S) ≤ 2h(S),

i.e. layouts of simple circuits are not too slim or too flat. In order to keep drawings simple we will place pins quite liberally at the borders of the circuits. Input pins are at one side of the rectangle, output pins are at the opposite side.

Nets consist of horizontal and vertical lines (wires). They must have a minimal distance δ from each other and from circuits. Thus, a wire channel for t lines has width (t+1)δ. The size |N| of a net N is the sum of the lengths of the lines that constitute the net. We define the delay of a circuit S driving a net N as

TIME(S, N) = D(S) + ν·|N|.

The parameter ν weights the influence of wire delay on the total delay. If ν = 0 only gate delays count. For a square inverter NOT with C(NOT) = 1 we have b(NOT) = h(NOT) = 1. Suppose we connect the output of such a gate with a single wire N of length h(NOT) = 1 and we have ν ≥ 1. Then

ν·|N| ≥ 1 = D(NOT),

i.e. a wire which is as long as the gate contributes to the delay as much as the propagation delay of the gate or more. This does not seem reasonable. Therefore, we restrict the range of ν to the interval [0, 1]. We do not restrict the fanout of circuits, but within limits the parameter ν can also be used to model fanout restrictions.

We will consider only four types of simple circuits S, namely AND gates, full adders FA, Booth decoders BD and the selection logic SL. We will use two geometries for the selection logic. The geometries from Fig. 9(a) happen to make the layouts of addition trees particularly simple. We place the pins of the full adder and the other basic circuits as specified in Fig. 9(b).¹ The exact position of the input pins of AND gates and the selection logic contributes only marginally to the run time of addition trees and will not be considered in the analysis. We will use δ = 0.1.

Fig. 9. (a) Table of basic circuit geometries; (b) shapes of basic circuits.

¹ Strictly speaking we use three layouts for the selection logic. Two of the layouts only differ in the position of the output pin.

6. Layouts and their analysis

First, we specify and analyze the layout of a plain 4/2-tree T with M/4 leaves, where partial products are generated with simple AND gates. The tree T has depth μ = log(M/4) as well as μ levels of interior nodes. The number of interior nodes is M/4 − 1. This will then be compared with the layout of a tree T' with M'/4 leaves, where partial products are generated by circuits SL controlled by Booth decoders BD.

All our layouts will consist of a matrix of full adders plus extra circuits and wires. Every node in the tree is either a 3/2-adder and occupies one row of the matrix, or it is a 4/2-adder and occupies 2 consecutive rows of the matrix. Every bit position occupies a column of the matrix. Between neighboring full adders of the same row we leave a wire channel of appropriate width. Inputs a[i] are fed into the layout at the top with indices increasing from right to left. Inputs b[j] are fed into the layout from the left with indices increasing from top to bottom. Outputs are produced at the right border and at the bottom of the layout.

Let T_l (resp. T_r) be the subtree rooted in the left (resp. right) son of a node v. Then, we lay out T_r on top of T_l. This is followed by a row for v (see Fig. 10). Between the layouts of the two subtrees we leave space for one wire. This space will be filled with boxes4/2a specified below. It only remains to specify where to place the extra circuits and how to lay out the wires.

Fig. 10. Recursive partition of the tree layout.
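As a toy illustration of the delay model just introduced (the net length below is an invented placeholder; only D_FA = 6 and the range of ν come from the text):

```python
# Sketch: the delay model TIME(S, N) = D(S) + nu*|N| of Section 5,
# evaluated for a full adder (D_FA = 6) driving a net of assumed length.

D_FA = 6          # full adder delay from Section 4

def time(d_gate, net_length, nu):
    """TIME(S, N) = D(S) + nu * |N| for 0 <= nu <= 1."""
    assert 0.0 <= nu <= 1.0
    return d_gate + nu * net_length

assert time(D_FA, 20, 0.0) == 6.0    # nu = 0: only gate delays count
assert time(D_FA, 20, 0.1) == 8.0    # the wire adds nu*|N| = 2
```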
For tree T it suffices to specify the three types of boxes in Fig. 11(a)–(c):

* Box3: this is one full adder which is part of a leaf v of the tree with weight L = 3. The whole 3/2-adder v has as inputs 3 bits of b which are routed in the b-channel b[2:0]. Every bit a[i] is needed in three consecutive bit positions of v; then it is routed to the next leaf down the tree and one position to the left. Thus, 3 lines of an a-channel a[2:0] suffice to accommodate the a-inputs for v. For all leaves v and all i we feed input a[i] into a-channel a[i mod 3]. Before a[i] can be fed into the channel, bit a[i−3] has to be removed and routed down to the next leaf of the tree. This, unfortunately, consumes horizontal distance 2δ.

* Box4: this is one pair of full adders which is part of a leaf v of the tree with weight L = 4. The whole 4/2-adder v has as inputs 4 bits of b which are routed in the b-channel b[2:0] of the first 3/2-adder and in one b-channel of the second 3/2-adder. Similarly, 4 a-channels are used. Each signal a[i] is fed into 3 full adders of consecutive bit positions in the first 3/2-adder level and then into one full adder in the next bit position of the second 3/2-adder. After that, a[i] is routed down to the next leaf of the tree.

* Box4/2: this is, strictly speaking, a pair of boxes which is part of an interior node v of the tree with subtrees T_r and T_l. Box4/2a is inserted between the layouts of T_r and T_l; Box4/2b is added at the bottom of the layout of T_l. The box consumes two wires in each vertical wire channel, one for a carry output and one for the sum output of one full adder of the root of T_r. We place the sum lines in the left half and the carry lines in the right half of the wire channel. It will not matter in the analysis where we place them exactly.

Fig. 11. Basic boxes for nonBooth multiplier construction.

All boxes have the same width b(box). Because each level of interior nodes contributes 2 lines to the global wire channel,

b(box) = b(FA) + (μ+3)δ.

The heights of the boxes are

h(box4/2) = 2h(FA) + 7δ,
h(box3) = h(FA) + h(AND) + 7δ,
h(box4) = 2h(FA) + 2h(AND) + 11δ.

The height of the whole layout is

h(T) = (M/4 − 1)·h(box4/2) + (M/4)·h(box3) + α·(h(box4) − h(box3)).

For the size of the nets a[i] we consider:

* The horizontal extension μ·b(box).
* The vertical extension. This extends from the very top of the layout to the leftmost leaf. Below the leftmost leaf we have μ many boxes4/2b.
The m connections from the a-channels to the AND gates, each of length up to 3d: TE 11 Box4/2a is placed on top of the b-channel of the rightmost leaf of Tl : Thus, it contributes only d to the height, and we have 27 EC Thus, for the largest ones of the nets a½i we have 29 39 R R O 37 Next we determine the accumulated delay from a leaf to the root which follows in each box4/2 the following path: sum bit in Tr —first full adder in v—carry output—second full adder in the box to the left—sum bit of that adder. We estimate the accumulated length of the wires separately: * All wires connected to carry outputs: Carry4=2 ¼ mðbðboxÞ 2bðFAÞ=3 þ 3dÞ: 41 43 C 35 jcarry4j ¼ bðboxÞ 2bðFAÞ=3 þ hðANDÞ þ 4d: N 33 For mpn the size of the nets b½i is dominated by the size of the nets a½i: The carry in box4 travels horizontally over a full box minus 23 of a full adder. Thus, we have * U 31 ja½ij ¼ mbðboxÞ þ hðTÞ mðhðbox4=2Þ dÞ hðFAÞ hðANDÞ þ 3md: All wires connected to sum outputs not counting vertical displacement in the global wire channels. In the wire channels every one of the distances 2kd occurs exactly once VLSI : 601 ARTICLE IN PRESS W.J. Paul, P.-M. Seidel / INTEGRATION, the VLSI journal ] (]]]]) ]]]–]]] 18 1 Sum4=2 ¼ mð4bðFAÞ=3 þ 4dÞ þ 3 m X 2kd k¼1 ¼ mð4bðFAÞ=3 þ 4dÞ þ dðm2 þ mÞ: 5 7 9 * Total vertical displacement of the wires connected to sum outputs in the wire channels; from the bottom of rightmost leaf (a box4 if m ¼ 53) to the very bottom of the layout not counting the boxes4/2 on the path: Channel ¼ hðTÞ hðbox4Þ logðM=4Þhðbox4=2Þ: 21 23 25 27 29 O O PR 19 TIMETree ðn; nÞ ¼ DAND þ 2DFA ð1 þ mÞ þ nðja½ij þ jcarry4j þ Carry4=2 þ Sum4=2 þ ChannelÞ: Exactly along the same lines one specifies and analyzes the layout of the addition tree T 0 with Booth recoding. The tree has M 0 =4 leaves and depth m0 ¼ logðM 0 =4Þ: One uses the boxes specified in Fig. 12(a)–(c). 
The selection logic becomes more complex and has to be placed in two rows in order not to exceed the full-adder width. For this organization, in Box4′ an additional horizontal input wire must be routed around the full adder. Also the channel width changes a bit and becomes m′δ, so that

    b(box′) = b(FA) + m′δ + 4δ.

This change of width is the only change for Box4/2′; its height stays the same. In the other two boxes there must be 5 input wires per selection logic. In Box3′ the top selection logic is rotated in order not to waste area. From the figures one obtains the following equations:

    h(box4/2′) = 2h(FA) + 7δ,
    h(box3′) = h(FA) + h(SL) + b(SL) + 17δ,
    h(box4′) = 2h(FA) + 2h(SL) + 26δ.

Most of the other equations only change slightly:

    h(T′) = (M′/4 − 1)h(box4/2′) + (M′/4)h(box3′) + a′(h(box4′) − h(box3′)),
    |a′[i]| = m′·b(box′) + h(T′) − m′(h(box4/2′) − δ) − h(FA) − h(SL) + 4m′δ,
    |carry4′| = b(box′) − 2b(FA)/3 + 3δ,
    Carry′_{4/2} = m′(b(box′) − 2b(FA)/3 + 3δ),
[Fig. 12. Basic boxes for the Booth multiplier construction ((a) Box4/2 booth, (b) Box3 booth, (c) Box4 booth), built from full adders, selection logic and wire channels.]

    Sum′_{4/2} = m′(4b(FA)/3 + 4δ) + Σ_{k=1}^{m′} 2kδ = m′(4b(FA)/3 + 4δ) + δ(m′² + m′),
    Channel′ = h(T′) − h(box4′) − log(M′/4)·h(box4/2′).

Additionally to these wires, we must consider the wire between one selection-logic output in the upper row and the full-adder input in Box4′:

    |sum4′| = h(SL) + b(SL)/2 + 12δ.

Also the nets b′[i] can become important. Their delay, in sequence after the Booth decoder delay, has to compete against the nets a′[i], and it depends on the constant ν which one is slower. Therefore, we have to consider the length of the longest |b′[i]|. It has to reach n′ bit positions, and for each connection in the worst case it crosses all the 9 other selection-logic wires:

    |b′[i]| = n′(b(box′) + 10δ).

With this we can evaluate the delay of the tree T′:

    TIME′_Tree(n, ν) = D_SL + 2D_FA(1 + m′) + ν(max(|a′[i]|, |b′[i]| + D_BD/ν) + |sum4′| + |carry4′| + Carry′_{4/2} + Sum′_{4/2} + Channel′).

In Fig. 13 the relative improvement (TIME_Tree − TIME′_Tree)/TIME_Tree is plotted for n = m = 53 as a function of ν. This figure is surprisingly complex and counterintuitive. In particular, we see the gains due to Booth recoding decrease with increasing ν, i.e. with slower wires. For small values of ν, up to ν ≈ 0.1, the delay of the Booth decoder D_BD = 3 plus the delay of the b-nets is larger than the delay of the a-nets. For small ν we have

    TIME′_Tree ≈ 55 + 537ν,

whereas for large ν we have

    TIME′_Tree ≈ 52 + 614ν.

We always have

    TIME_Tree = 62 + 710ν.

This explains why the graph has two branches.

[Fig. 13. Relative TIME improvement by Booth recoding for n = m = 53: 1 − TIME′_Tree/TIME_Tree [%] as a function of ν, with one branch showing the b-net and Booth-decoder influence and one branch showing the a-net influence.]

For ν = 0 only gate delays count, and we have

    TIME′_Tree/TIME_Tree ≈ 55/62 ≈ 89%.

This explains why the branch for the b-nets starts around 100% − 89% = 11%. It remains to explain why the branch for the a-nets is falling with ν. For ν = 0 this branch starts at 1 − 52/62 ≈ 100% − 84% = 16%. For large ν the savings are dominated by wire delays and approach 1 − (52 + 614)/(62 + 710) ≈ 14% < 16%. Thus, the branch falls because the wire delays have not fallen much due to Booth recoding. The reason for this is the geometry of our particular layouts, which are quite wide and not very high. Denote by wire(T) resp. wire(T′) the wire delays for T and T′. Then we have for ν = 1:

    wire(T) ≈ w(T)/2 + 2h(T),
    wire(T′) ≈ w(T′)/2 + 2h(T′),

because a-nets travel horizontally over half the layout and vertically over almost the full layout; the longest paths from the leaves to the root travel vertically over almost the full layout, whereas the horizontal displacement is logarithmic. We have

    w(T) ≈ w(T′) ≈ 2n·b(box) ≈ 2n·5.7.

If we would use as leaves only 4/2-adders, then the tree T would have n/4 leaves, each of height h(box4) ≈ 9.9, and around n/4 interior nodes, each of height h(box4/2) ≈ 6.7. Thus, h(T) ≈ (n/4)(9.9 + 6.7) = 4.15n. Similarly, we have h(T′) ≈ (n/8)(h(box4′) + h(box4/2′)) ≈ (n/8)(17.5 + 6.7) = 3.025n. For n = 53 we get wire(T′) ≈ 623 and wire(T) ≈ 742. Note that we have h(T)/w(T) ≈ 1/3 and h(T′)/w(T′) ≈ 1/4. If we make the layouts more square (this is common practice), say by doubling h and halving w, then the savings due to Booth recoding increase, but the layouts become slower! This incidentally shows that the common practice of making the layouts of addition trees roughly square is not always a good idea.

7. Evaluations

In this section, we evaluate the savings due to Booth recoding for a wide range of ν and also analyze the asymptotic behaviour. The evaluations are presented in two parts: in the first part the gate model from Section 4 is considered, in the second part we consider the layout model from Section 6. To be able to compare the four different designs for every particular n, on the one hand, layouts had to be analyzed also for the multiplication arrays. On the other hand, the delay, cost and area formulae of the multiplication tree designs had to be extended for the cases where M/2 ≤ m < (3/4)M. These extensions have been done following the presented designs in a straightforward way.
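The arithmetic of the two-branch discussion and of the geometry estimates for the running example n = m = 53 can be checked numerically. This is only a sketch: the linear delay fits 55 + 537ν, 52 + 614ν and 62 + 710ν and the constants 5.7, 4.15 and 3.025 are the ones quoted in the text, not re-derived from the layout model.

```python
# Numeric check of the two-branch TIME analysis and the layout
# geometry for n = m = 53 (all constants quoted from the text).

def time_tree(nu):                 # TIME_Tree = 62 + 710*nu
    return 62 + 710 * nu

def time_tree_booth_small(nu):     # b-net / Booth-decoder branch (small nu)
    return 55 + 537 * nu

def time_tree_booth_large(nu):     # a-net branch (large nu)
    return 52 + 614 * nu

def savings(t_booth, nu):          # relative TIME savings 1 - TIME'/TIME
    return 1 - t_booth(nu) / time_tree(nu)

n = 53
w = 2 * n * 5.7                    # w(T) ~ w(T') ~ 2n * b(box)
wire_t = w / 2 + 2 * (4.15 * n)    # wire(T)  ~ 742
wire_tp = w / 2 + 2 * (3.025 * n)  # wire(T') ~ 623
```

Evaluating the branches at ν = 0 reproduces the 11% and 16% starting points, and at ν = 1 the a-net branch has fallen to about 14%.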
We do not only focus on the savings in cost, area and delay, but we also consider more general quality metrics, the InverseGateQuality (IGQ) and the InverseLayoutQuality (ILQ), which are defined below.

Definition 2. Depending on a quality parameter 0 ≤ q ≤ 1, we define the InverseGateQuality (IGQ) of a design X that has the cost C_X(n, m) and the delay D_X(n, m) by

    IGQ_X(n, q) = C_X(n, n)^q · D_X(n, n)^(1−q).

Note that we get the delay by IGQ_X(n, 0) = D_X(n, n) and the cost by IGQ_X(n, 1) = C_X(n, n). Because the delay D_X(n, n) of a design X is the inverse of its performance P_X(n, n) = 1/D_X(n, n), for q = 1/2 the IGQ metric relates to the cost-performance ratio:

    IGQ_X(n, 0.5) = √(C_X(n, n)·D_X(n, n)) = √(C_X(n, n)/P_X(n, n)).

Definition 3. We define the IGQ of the best among our designs that do not use Booth recoding by

    bestIGQ(n, q) = min(IGQ_Array(n, q), IGQ_Tree(n, q)).

Analogously, the IGQ of the best among our designs using Booth recoding is defined by

    bestIGQ′(n, q) = min(IGQ′_Array(n, q), IGQ′_Tree(n, q)).

Moreover, we define the relative savings of the IGQ due to Booth recoding:

    IGQsave(n, q) = 1 − bestIGQ′(n, q)/bestIGQ(n, q).

Definition 4. Depending on a quality parameter 0 ≤ q ≤ 1, we define the InverseLayoutQuality (ILQ) of a design X that requires the area AREA_X(n, m) and has the delay TIME_X(n, m, ν) by

    ILQ_X(n, ν, q) = AREA_X(n, n)^q · TIME_X(n, n, ν)^(1−q).

Note that we get the TIME metric by ILQ_X(n, ν, 0) and the AREA metric by ILQ_X(n, ν, 1). Moreover, for q = 1/2 and q = 2/3 the ILQ metric relates to the metrics commonly known as AT and AT², respectively.
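Definitions 2 and 4 translate directly into code. The following is a minimal sketch; the cost, delay, area and TIME values used in any evaluation are arbitrary placeholders, not results from the paper.

```python
# Minimal sketch of the IGQ and ILQ quality metrics (Definitions 2, 4).
from math import sqrt, isclose

def igq(cost, delay, q):
    """IGQ_X(n, q) = C_X(n, n)**q * D_X(n, n)**(1 - q)."""
    return cost ** q * delay ** (1 - q)

def ilq(area, time, q):
    """ILQ_X(n, nu, q) = AREA_X(n, n)**q * TIME_X(n, n, nu)**(1 - q)."""
    return area ** q * time ** (1 - q)
```

For q = 0 and q = 1 the metrics degenerate to pure delay/TIME and pure cost/AREA, and igq(C, D, 0.5) equals √(C·D), the cost-performance combination mentioned above.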
Definition 5. We define the ILQ of the best among our designs that do not use Booth recoding by

    bestILQ(n, ν, q) = min(ILQ_Array(n, ν, q), ILQ_Tree(n, ν, q)).

Accordingly, the ILQ of the best among our designs using Booth recoding is defined by

    bestILQ′(n, ν, q) = min(ILQ′_Array(n, ν, q), ILQ′_Tree(n, ν, q)).

Moreover, we define the relative savings of the ILQ due to Booth recoding:

    ILQsave(n, ν, q) = 1 − bestILQ′(n, ν, q)/bestILQ(n, ν, q).

7.1. Gate analysis

7.1.1. Asymptotic considerations

The following lemma states that for large n the multiplication arrays have the best IGQ for q = 1, whereas the multiplication trees have the best IGQ for q < 1.

Lemma 6. For large n:

    bestIGQ(n, q) = IGQ_Array(n, q) if q = 1, and IGQ_Tree(n, q) if 0 ≤ q < 1;
    bestIGQ′(n, q) = IGQ′_Array(n, q) if q = 1, and IGQ′_Tree(n, q) if 0 ≤ q < 1.

Proof. We separate the proof for the two cases: (a) q = 1, and (b) 0 ≤ q < 1.

(a) For q = 1 the quality metric becomes the cost metric, IGQ_X(n, 1) = C_X(n, n), so that part (a) of the lemma already follows from the cost formulae C_Tree(n, n) = C_Array(n, n) + E·C_FA and C′_Tree(n, n) = C′_Array(n, n) + E′·C_FA.

(b) For 0 ≤ q < 1 and large n, we consider the fraction

    IGQ_Array(n, q)/IGQ_Tree(n, q) = (C_Array(n, n)^q · D_Array(n, n)^(1−q)) / (C_Tree(n, n)^q · D_Tree(n, n)^(1−q))
                                   = (n/(2 log(n)))^(1−q) + o((n/(2 log(n)))^(1−q)) ≥ 1.

The same argumentation can be used for the Booth designs, so that part (b) of the proof is also completed. □

Theorem 7. For large n, Booth recoding improves the best IGQ by

    IGQsave(n, q) = 1 − ((C_FA + C_SL)/(2(C_FA + C_AND)))^q ± o(1).
Proof. We separate the proof for the two cases: (a) q = 1, and (b) 0 ≤ q < 1.

(a) For q = 1, it follows from Lemma 6 that for large n

    IGQsave(n, 1) = 1 − IGQ′_Array(n, 1)/IGQ_Array(n, 1).

Because IGQ_Array(n, 1) = C_Array(n, n) = (C_FA + C_AND)n² + o(n²) and

    IGQ′_Array(n, 1) = C′_Array(n, n) = (C_FA + C_SL)·n·(n + 1)/2 + o(n²) = ((C_FA + C_SL)/2)·n² + o(n²),

we get IGQ′_Array(n, 1)/IGQ_Array(n, 1) = (C_FA + C_SL)/(2(C_FA + C_AND)) ± o(1), so that for large n

    IGQsave(n, 1) = 1 − (C_FA + C_SL)/(2(C_FA + C_AND)) ± o(1).

(b) For 0 ≤ q < 1, it follows from Lemma 6 that for large n

    IGQsave(n, q) = 1 − IGQ′_Tree(n, q)/IGQ_Tree(n, q).

Using IGQ′_Tree(n, q) = C′_Tree(n, n)^q · D′_Tree(n, n)^(1−q) and IGQ_Tree(n, q) = C_Tree(n, n)^q · D_Tree(n, n)^(1−q), we get

    IGQsave(n, q) = 1 − (C′_Tree(n, n)^q · D′_Tree(n, n)^(1−q)) / (C_Tree(n, n)^q · D_Tree(n, n)^(1−q)).

Because for large n, D′_Tree(n, n)^(1−q)/D_Tree(n, n)^(1−q) = (m′/m)^(1−q) ± o(1) = 1 ± o(1), the equation for IGQsave(n, q) can be simplified to

    IGQsave(n, q) = 1 − C′_Tree(n, n)^q/C_Tree(n, n)^q + o(C′_Tree(n, n)^q/C_Tree(n, n)^q)
                  = 1 − ((C_FA + C_SL)·n·⌈(n + 1)/2⌉ / ((C_FA + C_AND)·n²))^q ± o(1)
                  ≥ 1 − ((C_FA + C_SL)/(2(C_FA + C_AND)))^q ± o(1).

This completes the proof of the theorem. □

Corollary 8. Asymptotically, Booth recoding saves 25% of the cost of the partial product generation and reduction (Theorem 7 with q = 1 and the presented basic circuit designs). The asymptotic relative IGQ savings are depicted in Fig. 14 as a function of the quality parameter q.

[Fig. 14. Asymptotic relative savings (%) of the best IGQ (regarding the gate model) due to Booth recoding as a function of q: IGQsave(N, q).]

7.1.2. Considerations for practical n

The relative savings of the best IGQ due to the use of Booth recoding are depicted in Fig. 15 for 8 ≤ n ≤ 64 and 0 ≤ q ≤ 1. This figure shows that Booth recoding is useful in most practical situations, except for small n and q ≈ 1. For q = 1 the savings of the IGQ due to Booth recoding are analyzed in detail. The following lemma states that, regarding the gate count (IGQ_X(n, 1)), Booth recoding is useful iff n = 13, n = 15 or n ≥ 17.

[Fig. 15. Relative savings (%) of the best IGQ (regarding the gate model) due to Booth recoding: IGQsave(n, q).]

Lemma 9.

    bestIGQ′(n, 1) < bestIGQ(n, 1) ⇔ (n = 13) OR (n = 15) OR (n ≥ 17).

Proof. From Lemma 6 we get bestIGQ′(n, 1) = C′_Array(n, n) and bestIGQ(n, 1) = C_Array(n, n). We distinguish two cases: (a) n is even, and (b) n is odd.

(a) If n = m is even, then we can substitute m′ by m′ = n/2 + 1 in C′_Array(n, n). The condition C′_Array(n, n) ≤ C_Array(n, n) is then equivalent to n² − (139/8)n + 49 ≥ 0, so that for even n > 0 we get the solution n ≥ 16.63, which is equivalent to even n ≥ 18.

(b) If n = m is odd, then we can substitute m′ by m′ = n/2 + 1/2 in C′_Array(n, n). The condition C′_Array(n, n) ≤ C_Array(n, n) is then equivalent to n² − (115/8)n + 189/8 ≥ 0, so that for odd n > 0 we get the solution n ≥ 12.48, which is equivalent to odd n ≥ 13.

Thus, the solutions of parts (a) and (b) combine to the lemma. □
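Theorem 7 and the odd case of Lemma 9 can be checked numerically. The gate costs below are hypothetical placeholders, chosen only so that the cost ratio (C_FA + C_SL)/(2(C_FA + C_AND)) equals 0.75, which reproduces the 25% of Corollary 8; the quadratic is the one stated in part (b) of the proof.

```python
# Numeric sketch for Theorem 7 / Corollary 8 and the odd case of
# Lemma 9. C_FA, C_AND, C_SL are HYPOTHETICAL gate costs chosen so
# that (C_FA + C_SL) / (2 * (C_FA + C_AND)) = 0.75.
from math import sqrt, ceil

C_FA, C_AND, C_SL = 8.0, 2.0, 7.0

def igq_save_asymptotic(q):
    """Theorem 7: IGQsave(n, q) -> 1 - ratio**q for large n."""
    ratio = (C_FA + C_SL) / (2 * (C_FA + C_AND))
    return 1 - ratio ** q

# Lemma 9, odd n: C'_Array <= C_Array  <=>  n^2 - (115/8)*n + 189/8 >= 0.
b, c = 115 / 8, 189 / 8
larger_root = (b + sqrt(b * b - 4 * c)) / 2   # ~12.48
```

The smallest odd n above the larger root is 13, in agreement with the lemma.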
[Fig. 16. Relative savings (%) of the best IGQ due to Booth recoding for q = 1: IGQsave(n, 1).]

The situation of the IGQ savings due to Booth recoding, IGQsave(n, 1), is depicted in Fig. 16 for 8 ≤ n ≤ 20. Finally, Fig. 17 depicts which of the 4 different designs should be chosen according to the IGQ for 0 ≤ q ≤ 1, 8 ≤ n ≤ 64 and 10 ≤ n ≤ 10^10.

[Fig. 17. Value ranges of q and n for which the 4 designs (non-Booth and Booth multiplication trees and arrays) have the best IGQ (regarding the gate model).]

7.2. Layout analysis

7.2.1. Asymptotic considerations

Lemma 10. With the definition of the condition

    COAR ⇔ ((0 ≤ q < 1) AND (ν ≠ 0)) OR (q = 1),

the multiplication array designs have the asymptotically best ILQ if COAR holds. Otherwise the multiplication tree designs have the asymptotically best ILQ, so that for large n

    bestILQ(n, ν, q) = ILQ_Array(n, ν, q) if COAR, and ILQ_Tree(n, ν, q) otherwise;
    bestILQ′(n, ν, q) = ILQ′_Array(n, ν, q) if COAR, and ILQ′_Tree(n, ν, q) otherwise.

Proof. We separate the proof for the two cases: (a) ν = 0, and (b) 0 < ν ≤ 1.

(a) The proof for ν = 0 is again partitioned into two parts for the cases: (i) q = 1, and (ii) 0 ≤ q < 1.

(i) From q = 1 and

    ILQ_Tree(n, 0, 1) = AREA_Tree(n, n) = O(n² log(M)),
    ILQ′_Tree(n, 0, 1) = AREA′_Tree(n, n) = O(n² log(M)),
    ILQ_Array(n, 0, 1) = AREA_Array(n, n) = O(n²),
    ILQ′_Array(n, 0, 1) = AREA′_Array(n, n) = O(n²),

it follows that for large n

    bestILQ(n, 0, 1) = ILQ_Array(n, 0, 1),
and

    bestILQ′(n, 0, 1) = ILQ′_Array(n, 0, 1).

(ii) For ν = 0,

    ILQ_Array(n, 0, q) = AREA_Array(n, n)^q · TIME_Array(n, n, 0)^(1−q) = Θ(n^(q+1)),
    ILQ_Tree(n, 0, q) = AREA_Tree(n, n)^q · TIME_Tree(n, n, 0)^(1−q) = Θ(n^(2q) log(n)).

Because for 0 ≤ q < 1

    n^(q+1) / (n^(2q) log(n)) = n^(1−q)/log(n) ≥ 1 + o(1),

we get for large n

    bestILQ(n, 0, q) = ILQ_Tree(n, 0, q).

In the same way we can show for 0 ≤ q < 1 that for large n

    bestILQ′(n, 0, q) = ILQ′_Tree(n, 0, q).

This completes part (a) of the proof.

(b) In the following we assume for the proof of case (b) that 0 < ν ≤ 1. Because TIME_Tree(n, n, ν) = O(n log(n)), TIME′_Tree(n, n, ν) = O(n log(n)), TIME_Array(n, n, ν) = O(n) and TIME′_Array(n, n, ν) = O(n), we get for large n

    bestILQ(n, ν, 0) = TIME_Array(n, n, ν),      (1)
    bestILQ′(n, ν, 0) = TIME′_Array(n, n, ν).    (2)

Accordingly, from AREA_Tree(n, n) = O(n² log(n)), AREA′_Tree(n, n) = O(n² log(n)), AREA_Array(n, n) = O(n²) and AREA′_Array(n, n) = O(n²), it follows for large n that

    bestILQ(n, ν, 1) = AREA_Array(n, n),         (3)
    bestILQ′(n, ν, 1) = AREA′_Array(n, n).       (4)

Because for ν > 0 the multiplication arrays have both the asymptotically smallest AREA and the asymptotically fastest TIME, we get for large n

    bestILQ(n, ν, q) = AREA_Array(n, n)^q · TIME_Array(n, n, ν)^(1−q) = ILQ_Array,
    bestILQ′(n, ν, q) = AREA′_Array(n, n)^q · TIME′_Array(n, n, ν)^(1−q) = ILQ′_Array.

This completes case (b) of the proof. □

Theorem 11. We use the previous definition of the condition COAR ⇔ ((0 ≤ q < 1) AND (ν ≠ 0)) OR (q = 1). Then, for large n, the use of Booth recoding relatively saves the best ILQ by

    ILQsave(n, ν, q) = 1 − 0.641^q · ((11.66ν + 3)/(13.75ν + 6))^(1−q) ± o(1)   if COAR,
    ILQsave(n, ν, q) = 1 − 0.728^q ± o(1)                                       otherwise.

The asymptotic relative ILQ savings are depicted in Fig. 18 as a function of q and ν.

[Fig. 18. Asymptotic relative savings (%) of the best ILQ (regarding the layout model) due to Booth recoding for 0 < ν ≤ 1 and 0 ≤ q ≤ 1.]

Proof. First, we summarize the TIME and AREA formulae for the different designs:

    AREA_Array(n, n) = 23.92n² + o(n²),
    AREA′_Array(n, n) = 15.49n² + o(n²),
    AREA_Tree(n, n) = 1.73n² log(M) + o(n² log(M)),
    AREA′_Tree(n, n) = 1.26n² log(M) + o(n² log(M)),
    TIME_Array(n, n, ν) = 13.75νm + o(νm) + 6m + o(m),
    TIME′_Array(n, n, ν) = 11.66νm + o(νm) + 3m + o(m),
    TIME_Tree(n, n, ν) = 0.2νn log(M) + o(νn log(M)) + 12 log(M) + o(log(M)),
    TIME′_Tree(n, n, ν) = 0.2νn log(M) + o(νn log(M)) + 12 log(M) + o(log(M)).

From these equations we can easily derive

    ILQ′_Array(n, ν, q)/ILQ_Array(n, ν, q) = (AREA′_Array(n, n)^q · TIME′_Array(n, n, ν)^(1−q)) / (AREA_Array(n, n)^q · TIME_Array(n, n, ν)^(1−q))
        = (15.49/23.92)^q · ((11.66ν + 3)/(13.75ν + 6))^(1−q) ± o(1),

    ILQ′_Tree(n, ν, q)/ILQ_Tree(n, ν, q) = (AREA′_Tree(n, n)^q · TIME′_Tree(n, n, ν)^(1−q)) / (AREA_Tree(n, n)^q · TIME_Tree(n, n, ν)^(1−q))
        = (1.26/1.73)^q · 1^(1−q) ± o(1).

Finally, by the use of Lemma 10, we get that for large n

    ILQsave(n, ν, q) = 1 − ILQ′_Array(n, ν, q)/ILQ_Array(n, ν, q)   if COAR,
    ILQsave(n, ν, q) = 1 − ILQ′_Tree(n, ν, q)/ILQ_Tree(n, ν, q)     otherwise,

which evaluates to the claimed formula. □
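The closed forms of Theorem 11 can be evaluated directly. The following sketch uses only the constants 0.641, 0.728, 11.66 and 13.75 quoted in the theorem, not the underlying layout model.

```python
# Evaluation sketch of the asymptotic ILQ savings from Theorem 11.

def coar(q, nu):
    """COAR <=> ((0 <= q < 1) AND (nu != 0)) OR (q = 1)."""
    return (0 <= q < 1 and nu != 0) or q == 1

def ilq_save_asymptotic(q, nu):
    if coar(q, nu):  # multiplication arrays are the best designs
        return 1 - 0.641 ** q * ((11.66 * nu + 3) / (13.75 * nu + 6)) ** (1 - q)
    # otherwise trees are best, and their delay ratio is 1
    return 1 - 0.728 ** q
```

At q = 1 this reproduces the 35.9% AREA savings of Corollary 12, and at q = 0 it gives the TIME savings 1 − (11.66ν + 3)/(13.75ν + 6), which fall from 50% as ν approaches 0 from above towards about 15% for very slow wires.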
[Fig. 19. Asymptotic TIME savings due to Booth recoding: relative asymptotic TIME improvement [%] as a function of ν.]

Corollary 12. From Theorem 11 we can extract the AREA savings (q = 1) for large n:

    ILQsave(n, ν, 1) = 35.9% ± o(1).

Moreover, we can extract the TIME savings (q = 0) for large n:

    ILQsave(n, ν, 0) = 1 − (11.66ν + 3)/(13.75ν + 6) ± o(1)   if ν > 0,
    ILQsave(n, ν, 0) = ±o(1)                                  otherwise.

The asymptotic TIME savings are depicted in Fig. 19.

7.2.2. Considerations for practical n

The relative savings of the best ILQ due to the use of Booth recoding are depicted in Fig. 20(a) for 8 ≤ n ≤ 64, 0 ≤ q ≤ 1 and ν = 0.05, and in Fig. 20(b) for 8 ≤ n ≤ 64, q = 0 and 0 ≤ ν ≤ 1. These figures show that Booth recoding is useful for most practical parameters regarding the ILQ. Fig. 21 depicts which of the 4 different layouts should be chosen according to the ILQ for 0 ≤ q ≤ 1 and 8 ≤ n ≤ 64. Because the best practical ILQ layouts in this range do not seem to converge to the best asymptotic ILQ layouts from Lemma 10, the choice of the best ILQ layout is also depicted for a larger range of n with 10 ≤ n ≤ 10^10, and for an even larger range of n with 10^4 ≤ n ≤ 10^80, in Fig. 21.

8. Conclusions

We formally investigate the complexity of Booth recoding for fixed point multiplication. As the main contribution of this investigation we provide a formal version of the folklore theorem that Booth recoding saves the delay and the area of the multipliers' adder tree by constant factors between 26% and 50% minus low-order terms which depend on the length n of the operands and a parameter ν for the wire delay.
[Fig. 20. Relative savings (%) of the best ILQ (regarding the layout model) due to Booth recoding for (a) 0 < q ≤ 1, 8 ≤ m ≤ 64 and ν = 0.05, and (b) 0 ≤ ν ≤ 1, 8 ≤ m ≤ 64 and q = 0.]

For the cost analysis we provide a formal version of the folklore theorem that balanced addition trees in n-bit multipliers have n² + O(n log n) full adders. This required the combinatorial argument from Section 3. In our investigations we did not only analyze the costs and delays considering gates; we also considered layouts, including the effects of wire delays. For practical n in the range 8 ≤ n ≤ 64 we specify the delay and the area gain due to Booth recoding exactly for the proposed designs. Parts of our gate analysis are already used and referenced by [28]. Our analyses are parametric not only in the operand size n; also the relative delay of the wires, the costs and delays of the basic gates and the geometries of the basic circuits can be adjusted by parameters, so that the interested reader may choose his favorite technology/implementation in the analysis. Regarding the wire delay parameter ν, we have demonstrated the surprising effect that for particular multiplier designs the relative delay savings due to Booth recoding might decrease as wires become slower. Empirical studies on a limited number of designs so far have suggested that Booth recoding should be particularly helpful if wires become slow [16]. We have analyzed this effect in some more detail.
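The leading n² term of the folklore full-adder count can be illustrated by a small simulation: a greedy 3:2 (carry-save) reduction of the n × n partial-product bit matrix, counting full adders until every column holds at most two bits. This is only a sketch (half adders and Booth recoding are ignored), so it confirms the n² leading term up to O(n) corrections, not the exact low-order behaviour of the paper's balanced trees.

```python
# Greedy 3:2 reduction of an n x n partial-product matrix, counting
# full adders (sketch: no half adders, no Booth recoding).

def count_full_adders(n):
    # column heights of the n x n partial-product parallelogram
    heights = [0] * (2 * n)
    for i in range(n):
        for j in range(n):
            heights[i + j] += 1
    count = 0
    while max(heights) > 2:
        nxt = [0] * (len(heights) + 1)   # allow carries past the last column
        for col, h in enumerate(heights):
            fa = h // 3                  # full adders used in this column
            count += fa
            nxt[col] += h - 3 * fa + fa  # 3 bits consumed, 1 sum bit kept
            nxt[col + 1] += fa           # 1 carry per adder into next column
        heights = nxt
    return count
```

Each full adder removes exactly one bit (3 in, 2 out), so the count equals n² minus the number of surviving bits, which is O(n).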
[Fig. 21. Value ranges of q and n for which the 4 designs have the best ILQ (regarding the layout model).]

Technically, the main contribution of the paper is the VLSI model from Section 5 and the techniques of analysis of the same section. The detailed yet tractable nature of this model opens the way for many further investigations. We list just a few:

* One can systematically study where drivers should be placed in order to make nets smaller and hence speed up signal propagation.
* One can analyze hybrid layouts (arrays of small trees) and the layouts of many other multiplication designs [2,6,8–10,22,23,29–32].
* The layouts of various adder and shifter designs can be analyzed in a quite realistic way.
* It is common practice to 'fold' layouts of addition trees into more square layouts. But the formula wire(T) ≈ w(T)/2 + 2h(T) suggests that trivial folding of layouts produces slower designs. This clearly needs closer investigation.

Acknowledgements

For helpful and inspiring discussions the authors thank Michael Bosch and Guy Even.

References

[1] L. Dadda, Some schemes for parallel multipliers, Alta Frequenza 34 (1965) 349–356.
[2] B.C. Drerup, E.E. Swartzlander, Fast multiplier bit-product matrix reduction using bit-ordering and parity generation, J. VLSI Signal Process. 7 (1994) 249–257.
[3] J. Keller, W.J. Paul, Hardware Design, 2nd Edition, Teubner Verlagsgesellschaft, Stuttgart, Leipzig, 1997.
[4] I. Koren, Computer Arithmetic Algorithms, Prentice-Hall International, Englewood Cliffs, NJ, 1993.
[5] A.R. Omondi, Computer Arithmetic Systems: Algorithms, Architecture and Implementations, Series in Computer Science, Prentice-Hall International, Englewood Cliffs, NJ, 1994.
[6] N. Takagi, H. Yasuura, S. Yajima, High-speed VLSI multiplication algorithm with a redundant binary addition tree, IEEE Trans. Comput. C-34 (9) (1985) 217–220.
[7] C.S. Wallace, A suggestion for parallel multipliers, IEEE Trans. Electron. Comput. EC-13 (1964) 14–17.
[8] Z. Wang, A. Jullien, C. Miller, A new design technique for column compression multipliers, IEEE Trans. Comput. 44 (8) (1995) 962–970.
[9] H. Al-Twaijry, Area and performance optimized CMOS multipliers, Ph.D. Thesis, Stanford University, August 1997.
[10] G.W. Bewick, Fast multiplication: algorithms and implementation, Ph.D. Thesis, Stanford University, March 1994.
[11] A.D. Booth, A signed binary multiplication technique, Q. J. Mech. Appl. Math. 4 (2) (1951) 236–240.
[12] P.E. Madrid, B. Millar, E.E. Swartzlander, Modified Booth algorithm for high radix multiplication, IEEE Computer Design Conference, 1992, pp. 118–121.
[13] L.P. Rubinfeld, A proof of the modified Booth's algorithm for multiplication, IEEE Trans. Comput. (October 1975) 1014–1015.
[14] H. Al-Twaijry, M. Flynn, Multipliers and datapaths, Technical Report CSL-TR-94-654, Stanford University, December 1994.
[15] H. Al-Twaijry, M. Flynn, Performance/area tradeoffs in Booth multipliers, Technical Report CSL-TR-95-684, Stanford University, November 1995.
[16] H. Al-Twaijry, M. Flynn, Technology scaling effects on multipliers, Technical Report CSL-TR-96-698, Stanford University, July 1996.
[17] I. Wegener, The Complexity of Boolean Functions, Wiley, New York, 1987.
[18] C.D. Thompson, Area-time complexity for VLSI, Proceedings of the 11th Annual ACM Symposium on Theory of Computing, Vol. 11, 1979, pp. 81–88.
[19] J.D. Ullman, Computational Aspects of VLSI, Computer Science Press, Rockville, MD, 1984.
[20] L.A. Glasser, D.W. Dobberpuhl, The Design and Analysis of VLSI Circuits, Addison-Wesley, Reading, MA, 1985.
[21] C. Mead, L. Conway, Introduction to VLSI Systems, Addison-Wesley, Reading, MA, 1980.
[22] L. Kuehnel, H. Schmeck, A closer look at VLSI multiplication, INTEGRATION, the VLSI J. 6 (1988) 345–359.
[23] R.K. Yu, G.B. Zyner, 167 MHz Radix-4 floating point multiplier, Proceedings of the 12th Symposium on Computer Arithmetic, Vol. 12, 1995, pp. 149–154.
[24] P. Kornerup, private communication.
[25] S.M. Mueller, W.J. Paul, The Complexity of Simple Computer Architectures, Lecture Notes in Computer Science, Vol. 995, Springer, Berlin, 1995.
[26] J. Vuillemin, A very fast multiplication algorithm for VLSI implementation, INTEGRATION, the VLSI J. 1 (1983) 39–52.
[27] C. Nakata, J. Brock, H4C Series: Design Reference Guide, CAD, 0.7 Micron Leff, Motorola Ltd., 1993, Preliminary.
[28] S.M. Mueller, W.J. Paul, Computer Architecture: Complexity and Correctness, Springer, Berlin, 2000, ISBN 3-540-67481-0.
[29] Z.-J. Mou, F. Jutand, Overturned-stairs adder trees and multiplier design, IEEE Trans. Comput. 41 (8) (1992) 940–948.
[30] V.G. Oklobdzija, D. Villeger, S.S. Liu, A method for speed optimized partial product reduction and generation of fast parallel multipliers using an algorithmic approach, IEEE Trans. Comput. 45 (3) (1996) 294–306.
[31] R.M. Owens, R.S. Bajwa, M.J. Irwin, Reducing the number of counters needed for integer multiplication, Proceedings of the 12th Symposium on Computer Arithmetic, Vol. 12, 1995, pp. 38–41.
[32] K.F. Pang, R. Soong, H.-W. Sexton, P.H. Ang, Generation of high speed CMOS multiplier-accumulators, Proceedings of the IEEE International Conference on Computer Design: VLSI in Computers and Processors, 1988, pp. 217–220.