Optimization of Column Compression Multipliers

Copyright
by
K’Andrea Catherine Bickerstaff
2007
The Dissertation Committee for K’Andrea Catherine Bickerstaff
certifies that this is the approved version of the following dissertation:
Optimization of Column Compression Multipliers
Committee:
Earl E. Swartzlander, Jr., Supervisor
Jacob A. Abraham
Anthony P. Ambler
Harvey G. Cragon
Donald S. Fussell
Eric J. Swanson
Optimization of Column Compression Multipliers
by
K’Andrea Catherine Bickerstaff, B.S.; M.S.
Dissertation
Presented to the Faculty of the Graduate School of
the University of Texas at Austin
in Partial Fulfillment
of the Requirements
for the Degree of
Doctor of Philosophy
The University of Texas at Austin
August 2007
Dedication
Dedicated to my grandmothers,
Louise Andrews and Mattiel Bickerstaff,
for their appreciation of education and love for me.
Acknowledgements
I am very grateful to my graduate supervisor, Dr. Earl E. Swartzlander, Jr., for his
encouragement, support, and patience.
His guidance in matters both academic and
professional has been invaluable. Many thanks to my fellow research group members,
especially Dr. Michael J. Schulte, Dr. Edwin De Angel, Dr. William Lynn Gallagher, and
Dr. Jason Arbaugh. It is an honor and a pleasure to work with each of you; I cherish our
friendship.
I thank my management and colleagues at Crystal Semiconductor, Cirrus Logic,
and Luminary Micro. I appreciate the excellent training and job opportunities from my
managers, Greg North and Jeff Klaas, and mentors, Eric Swanson and Dr. Matt Perry.
I thank my mother, Doris, and my brother, Kenneth, for their enduring love and
support. At the lowest points, my brother’s “I’m proud of you!” helped me keep going.
I also extend very special thanks to my many friends for staying beside me during
this long journey. Liz Wright, Yolanda Torres, Judy Ko, Charles Robinson, Scott Haban,
Montine Heim and Annola Bailey—their phone calls, hugs, and laughter are precious
gifts to me.
Optimization of Column Compression Multipliers
Publication No. __________________
K’Andrea Catherine Bickerstaff, Ph.D.
The University of Texas at Austin, 2007
Supervisor: Earl E. Swartzlander, Jr.
With delay proportional to the logarithm of the multiplier word length, column
compression multipliers are the fastest multipliers. Unfortunately, since the design
community has assumed that fast multiplication can only be realized through custom
design and layout, column compression multipliers are often dismissed as too time
consuming and complex because of their irregular structure. This research demonstrates
that an automated multiplier generation and layout process makes the column
compression multiplier a viable option for application specific CMOS products.
Techniques for optimal multiplier designs are identified through analysis of area, delay,
and power characteristics of Wallace, Dadda, and Reduced Area multipliers.
Table of Contents
List of Figures
List of Tables
Chapter 1: Introduction
Chapter 2: Past Work
    2.1 Array Multipliers
    2.2 Column Compression Multipliers
        2.2.1 Counters and Compressors
        2.2.2 Reduction Schemes
            2.2.2.1 Using Higher-Order Counters and Compressors
        2.2.3 The Final Carry Propagate Adder
        2.2.4 Layout Approaches
Chapter 3: Automated Multiplier Netlist Generation
    3.1 Basic Multiplier Design
        3.1.1 Signal Buffering
        3.1.2 Partial Product Reduction
        3.1.3 Carry Lookahead Adder
    3.2 Process Technologies
    3.3 Cell Libraries
    3.4 M x N Multiplier Generator
Chapter 4: Automated Multiplier Implementation and Verification
    4.1 Design Flow
    4.2 Formal Verification
    4.3 Layout Floorplanning
    4.4 Timing Driven Placement and Route
    4.5 RC Extraction
    4.6 Static Timing Analysis
    4.7 Power Simulation
Chapter 5: Multiplier Area
    5.1 Area Estimation
    5.2 Area Measurements
    5.3 Area of Multipliers in the 250 nm Process Technology
    5.4 Area of Multipliers in the 180 nm Process Technology
    5.5 Area of Multipliers in the 130 nm Process Technology
    5.6 Area of Multipliers in the 90 nm Process Technology
    5.7 Area Comparisons
    5.8 Area Summary
Chapter 6: Multiplier Delay
    6.1 Delay Estimation
    6.2 Delay Analysis
    6.3 Delay for Multipliers in the 250 nm Process Technology
    6.4 Delay for Multipliers in the 180 nm Process Technology
    6.5 Delay for Multipliers in the 130 nm Process Technology
    6.6 Delay for Multipliers in the 90 nm Process Technology
    6.7 Delay Comparisons
    6.8 Delay Summary
Chapter 7: Multiplier Power Consumption
    7.1 Power Estimation
    7.2 Power Simulations
    7.3 Power for Multipliers in the 250 nm Process Technology
    7.4 Power for Multipliers in the 180 nm Process Technology
    7.5 Power for Multipliers in the 130 nm Process Technology
    7.6 Power Comparisons
    7.7 Power Summary
Chapter 8: Conclusions
Bibliography
Vita
List of Figures
2.1 A square version of a 4 by 4 array multiplier (after [23])
2.2 Two’s complement by modified Baugh-Wooley method
2.3 Steps for N by N unsigned parallel multiplication
2.4 Dot Diagram for a 12 by 12 Wallace Multiplier
2.5 Dot Diagram for a 12 by 12 Dadda Multiplier
2.6 Dot Diagram for a 12 by 12 Reduced Area Multiplier
2.7 (7,3) Counter design using (3,2) counters after [30]
2.8 (15,4) Counter design using (3,2) counters after [30]
2.9 4:2 Compressor using (3,2) counters
3.1 Block diagram for implemented column compression multipliers
3.2 Loading for each product bit
3.3 Modified Full Adder
3.4 Diagram of 16-bit Carry Lookahead Adder
3.5 Schematic of (3,2) counter standard cell
4.1 Design and Tool Flow
4.2 Conformal Process Flow
4.3 Power/Ground abutment in layout
4.4 Diagram of pin placement for N by N multipliers
4.5 Configuration of Common Timing Engine
4.6 Configuration of UltraSim
5.1 Block diagram of an N by N unsigned column compression multiplier
5.2 Dadda multiplier areas using different process technologies and standard cell libraries
5.3 Area pie charts of Wallace multipliers
5.4 Area pie charts of Dadda multipliers
5.5 Area pie charts of Reduced Area multipliers
6.1 Back-annotated delays for N by N Dadda multipliers
6.2 Delay pie charts for back-annotated Dadda multipliers
6.3 Back-annotated Dadda multiplier delays versus estimated delays
7.1 Average power consumption for Wallace, Dadda, and Reduced Area multipliers in the 250 nm process
7.2 Average power consumption for Wallace, Dadda, and Reduced Area multipliers in the 180 nm process
7.3 Average power consumption for Wallace, Dadda, and Reduced Area multipliers in 130g and 130p cell libraries
List of Tables
2.1 Radix-4 Modified Booth Algorithm
2.2 Truth table for special half adder
2.3 Number of Reduction Stages for a Dadda Multiplier
5.1 Number of D flip-flops, buffers, and AND gates used in the multipliers
5.2 Components for Wallace, Dadda, and Reduced Area multipliers
5.3 Hardware for Wallace, Dadda, and Reduced Area multipliers
5.4 Complexity of the multiplier components
5.5 Layout areas for Wallace, Dadda, and Reduced Area multipliers in the 250 nm process
5.6 Comparison of Wallace, Dadda, and Reduced Area multiplier areas in the 250 nm process
5.7 Layout areas for Wallace, Dadda, and Reduced Area multipliers in the 180 nm process
5.8 Comparison of Wallace, Dadda, and Reduced Area multiplier areas in the 180 nm process
5.9 Comparison of counter and CLA areas for 8 by 8 Wallace and Dadda multipliers
5.10 Layout areas for Wallace, Dadda, and Reduced Area multipliers in the generic 130 nm cell library
5.11 Comparison of Wallace, Dadda, and Reduced Area multiplier areas in the generic 130 nm cell library
5.12 Layout areas for Wallace, Dadda, and Reduced Area multipliers in the low power 130 nm cell library
5.13 Comparison of Wallace, Dadda, and Reduced Area multiplier areas in the low power 130 nm cell library
5.14 Percentage that multipliers in the 130p cell library are smaller than multipliers in the 130g cell library
5.15 Layout areas for Wallace, Dadda, and Reduced Area multipliers in the 90 nm cell library
5.16 Comparison of Wallace, Dadda, and Reduced Area multiplier areas in the 90 nm cell library
5.17 Wallace multiplier areas relative to each process’s 8 by 8 case
5.18 Dadda multiplier areas relative to each process’s 8 by 8 case
5.19 Reduced Area multiplier areas relative to each process’s 8 by 8 case
5.20 Wallace multiplier areas relative to 90 nm
5.21 Dadda multiplier areas relative to 90 nm
5.22 Reduced Area multiplier areas relative to 90 nm
5.23 Breakdown of multiplier areas by components
5.24 Comparison of estimated areas using general quadratic approximation versus measured areas of the multipliers
5.25 Comparison of general area approximations for geometries < 180 nm versus measured areas of the multipliers
5.26 Predicted areas for column compression multipliers in a 65 nm process technology
6.1 Delay values for Wallace multipliers in the 250 nm process
6.2 Delay values for Dadda multipliers in the 250 nm process
6.3 Delay values for Reduced Area multipliers in the 250 nm process
6.4 Critical section delays for 64 by 64 multipliers in the 250 nm process
6.5 Delay values for Wallace multipliers in the 180 nm process
6.6 Delay values for Dadda multipliers in the 180 nm process
6.7 Delay values for Reduced Area multipliers in the 180 nm process
6.8 Critical section delays for 64 by 64 multipliers in the 180 nm process
6.9 Delay values for Wallace multipliers in the 130g cell library
6.10 Delay values for Dadda multipliers in the 130g cell library
6.11 Delay values for Reduced Area multipliers in the 130g cell library
6.12 Critical section delays for 64 by 64 multipliers in the 130g cell library
6.13 Delay values for Wallace multipliers in the 130p cell library
6.14 Delay values for Dadda multipliers in the 130p cell library
6.15 Delay values for Reduced Area multipliers in the 130p cell library
6.16 Delay values for Wallace multipliers in the 90 nm process
6.17 Delay values for Dadda multipliers in the 90 nm process
6.18 Delay values for Reduced Area multipliers in the 90 nm process
6.19 Back-annotated delays for Wallace, Dadda, and Reduced Area multipliers developed in generic standard cell libraries
6.20 Back-annotated delays for Wallace, Dadda, and Reduced Area multipliers developed in 130g and 130p cell libraries
6.21 Wallace multipliers with back-annotated delays relative to each process’s 8 by 8 case
6.22 Dadda multipliers with back-annotated delays relative to each process’s 8 by 8 case
6.23 Reduced Area multipliers with back-annotated delays relative to each process’s 8 by 8 case
6.24 Back-annotated Wallace multiplier delays relative to 90 nm
6.25 Back-annotated Dadda multiplier delays relative to 90 nm
6.26 Back-annotated Reduced Area multiplier delays relative to 90 nm
6.27 Predicted delays for column compression multipliers in a 65 nm process technology
7.1 Average power for Wallace, Dadda, and Reduced Area multipliers in the 250 nm process
7.2 Comparison of average power for Wallace, Dadda, and Reduced Area multipliers in the 250 nm process
7.3 Average power for Wallace, Dadda, and Reduced Area multipliers in the 180 nm process
7.4 Comparison of average power for Wallace, Dadda, and Reduced Area multipliers in the 180 nm process
7.5 Average power for Wallace, Dadda, and Reduced Area multipliers in the 130g cell library
7.6 Comparison of average power for Wallace, Dadda, and Reduced Area multipliers in the 130g cell library
7.7 Average power for Wallace, Dadda, and Reduced Area multipliers in the 130p cell library
7.8 Comparison of average power for Wallace, Dadda, and Reduced Area multipliers in the 130p cell library
7.9 Comparison of average power of a multiplier in the 130g cell library to the respective multiplier in the 130p cell library
7.10 Comparison of average power consumption for Wallace, Dadda, and Reduced Area multipliers
7.11 Power/Area for Wallace multipliers
7.12 Power/Area for Dadda multipliers
7.13 Power/Area for Reduced Area multipliers
7.14 Estimated average power for column compression multipliers in a 90 nm process technology
7.15 Estimated average power for column compression multipliers in a 65 nm process technology
Chapter 1
Introduction
High-speed multiplication has always been a fundamental requirement of high-performance processors and systems. In digital signal processing (DSP) applications,
multiplication is one of the most utilized arithmetic operations, as part of filters,
convolvers, and transform processors. Improving multiplier design directly benefits the
high-performance embedded processors used in consumer and industrial electronic
products.
In the past five decades, engineering ingenuity has moved multiplication away
from the slow add-and-shift techniques [1] to faster, parallel multiplication schemes.
Two classes of parallel multipliers exist: array multipliers and column compression
multipliers. Array multipliers [2–5] are used most frequently in today’s designs due to
their lower design and layout complexities relative to column compression architectures.
Array multipliers are easily pipelined [6]. Most recently, research on array multipliers
has focused on power evaluation [9, 10] and power reduction [11–13].
Though not implemented as frequently as array multipliers, column compression
multipliers continue to be studied due to their high speed performance. With total delays
that are proportional to the logarithm of the operand word length, column compression
multipliers are faster than array multipliers, whose delay grows linearly with operand
word length. When first introduced by Wallace [14], and later refined by Dadda [15],
interconnect delays and pipelining were not critical design issues. With the advent of
VLSI, this type of multiplier often was difficult to design and exhibited high interconnect
overhead. However, advances in computer-aided design and VLSI technology have
helped alleviate these problems. In the literature, reports of fast CMOS implementations,
alternative design schemes, and strategies for the pipelining [16–18] of column
compression multipliers have begun to appear with increasing frequency.
Designs of column compression multipliers have mostly been recommended
based on improved speed performance. The issues of power consumption, interconnect,
and layout have not received as much attention, although Callaway’s papers [9] suggest
that they are much more power efficient than array multipliers. In particular, it is often
unclear whether proposed strategies for improving delay will result in more irregular
interconnect or difficult layout. For the column compression multiplier class to emerge
as a viable solution to the demand for high-speed multiplication, the primary characteristics
of delay, area, and power, as they relate to interconnect and layout, need to be better
understood.
This research identifies techniques for optimal computer-aided design of column
compression multipliers by analyzing delay, area, and power characteristics, with
particular emphasis on interconnect and layout.
Practical, realizable multiplier
architectures have been investigated, using industry standard design and layout tools.
Chapter 2 provides an overview of past work performed in the area of parallel
multipliers, with special attention given to research affecting column compression
multipliers.
Chapter 3 outlines the gate-level implementations of the column compression
multipliers. The design details of the Wallace, Dadda, and Reduced Area multipliers
created for analysis are reported.
To facilitate the creation of the three types of
multipliers, an M by N multiplier generator has been written in Perl. An overview of the
M by N multiplier generator is given.
Chapter 4 describes the implementation process used to design multipliers. The
tools and scripts for functional verification, layout, parasitic extraction, timing analysis,
and power estimation are detailed.
Chapter 5 discusses the layout areas of column compression multipliers
implemented in this research. For different CMOS process technologies, standard cell
libraries, and word sizes, sixty multipliers were placed and routed using industry standard
tools. The layouts confirm predicted trends in size and layout complexity.
Chapter 6 examines multiplier worst case delay values.
The delay values
showcase the fast performance of column compression multipliers, developed through an
automated tool flow instead of fully custom design.
Chapter 7 presents power analysis for the column compression multipliers. The
multipliers’ power characteristics are understood by examining how power consumption
changes as word sizes, standard cell libraries, and process technologies change.
Chapter 8 summarizes the findings of this research. Based on understanding
gained from analyzing area, delay, and power characteristics, recommendations are given
for producing optimal column compression multipliers.
Chapter 2
Past Work
In the first large-scale digital systems, multiplication was performed as a series of
additions and shifts [1]. The requisite hardware consisted only of a parallel adder and a
few registers. In the early 1950’s, multiplier performance was significantly improved
with the introduction of Booth’s method [7], the modified Booth multiplier [19], and the
development of faster adders [20–22] and memory components. Booth’s method and the
modified Booth method do not require a correction of the product when either (or both)
of the operands is negative for two’s complement numbers. During the 1950’s, adder
designs moved away from the slow sequential formation of carries executed by ripple
carry adders. Carry lookahead, carry select, and conditional sum adders yielded speedy
sums through the faster simultaneous or parallel generation of carries.
In the 1960’s two classes of parallel multipliers were defined. The first class [2–
4] of parallel multiplier uses a rectangular array of identical combinatorial cells to
generate and sum the partial product bits. Multipliers of this class are called iterative
array multipliers or, more simply, array multipliers. They have a delay that is generally
proportional to the word length of the multiplier input. Due to the regularity of their
structures, array multipliers are easy to lay out and have been implemented frequently.
The second class of parallel multiplier reduces a matrix of partial product bits to
two words through the strategic application of counters or compressors. These two words
are then summed using a fast carry-propagate adder to generate the product. This class of
parallel multiplier is sometimes termed a column compression multiplier. Since the delay
is proportional to the logarithm of the multiplier word length, these are also the fastest
multipliers.
2.1 Array Multipliers
In array multipliers, the two basic functions of partial product generation and
summation are combined. For unsigned N by N multiplication, N² + N – 1 cells are
connected to produce a multiplier: N² cells each contain an AND gate for partial product
generation and a full adder for summing, and N – 1 cells each contain only a full adder. The array
generates N lower product bits directly and uses a carry-propagate adder, in this case a
ripple carry adder, to form the upper N bits of the product.
Replacing full adders with half adders where possible reduces the complexity to
N² AND gates, N half adders, and N(N – 2) full adders, as shown in Figure 2.1. This 4 by 4
multiplier is shown as a square array with modifications to the first two rows. Since the
carry-in bits and the previous partial product bits are zero for the first row and the left
column, only the AND gates are needed. With only two switching inputs, the second row
employs half adders instead of full adders. The worst case delay is (2N – 2)Δc, where
Δc is the worst case adder delay.
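As a rough illustration of these counts, the following Python sketch tabulates the components and worst case delay of the N by N unsigned array multiplier that uses half adders where possible. The function name and the dictionary keys are illustrative assumptions rather than names from this research, and the delay is expressed as a multiple of the worst case adder delay.

```python
def array_multiplier_cost(n):
    """Component counts for an unsigned n-by-n array multiplier that
    replaces full adders with half adders where possible.
    The formulas are the ones quoted in the text; the names are hypothetical."""
    if n < 2:
        raise ValueError("the formulas assume n >= 2")
    return {
        "and_gates": n * n,                   # one AND gate per partial product bit
        "half_adders": n,                     # N half adders
        "full_adders": n * (n - 2),           # N(N - 2) full adders
        "delay_in_adder_delays": 2 * n - 2,   # worst case delay of (2N - 2) adder delays
    }


cost_4x4 = array_multiplier_cost(4)
```

For the 4 by 4 multiplier of Figure 2.1 this gives 16 AND gates, 4 half adders, 8 full adders, and a worst case delay of 6 adder delays.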
[Figure 2.1 diagram: inputs a3–a0 and b0–b3 feed an array of AND gates, half adders (HA), and full adders (FA) that produces product bits P0–P7.]
Figure 2.1: A square version of a 4 by 4 array multiplier (after [23])
In order to design an array multiplier for two’s complement operands, Booth’s
algorithm [7] can be employed. The implementation of a Booth’s algorithm array
multiplier computes the partial products by examining two multiplicand bits at a time.
Except for enabling usage of two’s complement operands, this Booth’s algorithm array
multiplier offers no performance or area advantage in comparison to the basic array
multiplier.
Better delays, though, can be achieved by implementing a higher radix
modified Booth algorithm.
The radix-4 modified Booth multiplier described by MacSorley [19] examines
three bits of the multiplicand to determine whether to add 0, 1x, -1x, 2x, or -2x of that
rank of the multiplicand. The rules for the radix-4 modified Booth algorithm are listed in
Table 2.1. Though decoding three bits into five possible operations (add 2A, add A, add
0, subtract A, or subtract 2A) increases the hardware complexity slightly, the radix-4
modified Booth multiplier has only about half the delay of the Booth multiplier. It is
possible to use higher radices, such as radix-8 or radix-16, but the additional complexity,
due to non-power of two multiples of the multiplicand, compromises delay and area
improvements.
Table 2.1: Radix-4 Modified Booth Algorithm
b_i    b_i-1    b_i-2    Operation
 0       0        0        +0
 0       0        1        +A
 0       1        0        +A
 0       1        1        +2A
 1       0        0        -2A
 1       0        1        -A
 1       1        0        -A
 1       1        1        +0
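The recoding rules of Table 2.1 can be sketched behaviorally in Python. This is an illustration under stated assumptions, not the hardware described by MacSorley: the function names are hypothetical, A is the operand that is added or subtracted, b is the two's complement operand whose bits are examined three at a time, the word length is assumed to be even, and the bit below the least significant bit is taken as 0.

```python
def booth_radix4_digits(b, nbits):
    """Recode an nbits-wide two's complement integer b into radix-4 digits
    in {-2, -1, 0, +1, +2}, least significant digit first (hypothetical helper)."""
    if nbits <= 0 or nbits % 2:
        raise ValueError("nbits must be a positive even number")
    if not -(1 << (nbits - 1)) <= b < (1 << (nbits - 1)):
        raise ValueError("b does not fit in nbits two's complement bits")
    word = b & ((1 << nbits) - 1)                  # two's complement bit pattern
    bits = [(word >> k) & 1 for k in range(nbits)]
    digits = []
    for k in range(0, nbits, 2):
        low = bits[k - 1] if k > 0 else 0          # b_(i-2); zero below the LSB
        mid = bits[k]                              # b_(i-1)
        high = bits[k + 1]                         # b_i
        digits.append(-2 * high + mid + low)       # matches the rows of Table 2.1
    return digits


def booth_multiply(a, b, nbits):
    """Form a * b by adding 0, +/-A, or +/-2A once per radix-4 digit of b."""
    return sum(d * a * 4 ** j for j, d in enumerate(booth_radix4_digits(b, nbits)))
```

For example, the 4-bit value 7 (0111) recodes to the digits -1 and +2, least significant first, since -1 + 2·4 = 7. Only half as many additions are needed as there are bits in b, and negative operands need no separate product correction.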
Another method for building an array multiplier that handles two’s complement
operands was presented by Baugh and Wooley [8, 24]. This method increases the
maximum column height by two. This may lead to an additional stage of partial product
reduction, thereby increasing overall delay. A modified form of the Baugh and Wooley
strategy is more commonly used because it does not increase the maximum column
height.
The modified Baugh-Wooley method [24] is shown in Figure 2.2.
This
organization of partial product bits produces an easy to remember strategy for two’s
complement multiplication, which is to 1) invert the bits along the left edge and the
bottom row, with the exception of the bottom left partial product bit, and 2) add a single
one to the n+1 and 2n columns. Note that the one in the 2n column is not actually part of
the final product and can be ignored. The negated partial product bits can be produced
using a NAND gate instead of an AND gate, which may reduce the area slightly in
CMOS. The one in the n+1 column is accommodated by using a special half adder on
two partial product bits in the n+1 column. The truth table for this special half adder is
given in Table 2.2. The sum is the complement of the sum of a normal half adder. The
carry is formed by a OR b.
Figure 2.2: Two’s complement by modified Baugh-Wooley method
Table 2.2: Truth table for special half adder
a    b    carry    sum
0    0      0       1
0    1      1       0
1    0      1       0
1    1      1       1
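A short Python check of Table 2.2 follows, offered as an illustrative sketch in which the function name is an assumption. With the carry formed by a OR b and the sum formed as the complement of the normal half adder sum, the two outputs together encode a + b + 1, which is how the special half adder absorbs the constant one added to its column.

```python
def special_half_adder(a, b):
    """Return (carry, sum) for the special half adder of Table 2.2."""
    carry = a | b              # carry is a OR b
    total = 1 - (a ^ b)        # complement of the normal half adder sum
    return carry, total


# Enumerate all four input combinations in the row order of the table.
truth_table = [(a, b) + special_half_adder(a, b) for a in (0, 1) for b in (0, 1)]
```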
Implementations of array multipliers were described by Pezaris [5] and McIver, et
al. [25].
Pezaris, at Lincoln Laboratories, designed a board level 17 by 17 array
multiplier for two’s complement numbers. This multiplier generated the full 34 bit
product in 40 nsec.
A single chip array multiplier, reported by McIver, et al.,
implemented a 16 by 16 array multiplier with a two’s complement algorithm. A revised
design, the TRW MPY-16, was first sold commercially in 1976. This multiplier output
its product in 160 nsec.
2.2 Column Compression Multipliers
In 1964, Wallace [14] introduced a scheme for fast multiplication based on using
parallel “pseudoadders.”
A pseudoadder is simply a (3,2) counter.
Rather than
generating a single sum output, a group of (3,2) counters adds together three numbers and
produces two numbers whose sum equals the sum of the original three numbers. The
primary advantage of the (3,2) counter is that it avoids carry propagation. Wallace
proposed that the addition of partial products be performed as follows:
1) Group partial products into groups of three and input each group into
individual sets of (3,2) counters.
2) Group the resulting bits from the 1st step into groups of three and input
each group into sets of (3,2) counters.
3) Repeat the combining into groups of three and adding with sets of
(3,2) counters until two numbers remain.
4) Add the final two numbers using a carry propagating adder to get the
final product.
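The four steps above can be sketched at the word level. This is a minimal illustration, assuming each partial product is held as a Python integer so that one bitwise expression models a whole row of (3,2) counters; the function names are hypothetical and the final carry-propagate addition is modeled with ordinary integer addition.

```python
def counter_row_3_2(a: int, b: int, c: int) -> tuple:
    """Row of (3,2) counters: three words in, two words out, same total."""
    sum_word = a ^ b ^ c
    carry_word = ((a & b) | (a & c) | (b & c)) << 1  # carries move one column left
    return sum_word, carry_word

def wallace_multiply(a: int, b: int, n: int) -> tuple:
    """Multiply two n-bit unsigned numbers; return (product, reduction levels)."""
    rows = [(a << i) if (b >> i) & 1 else 0 for i in range(n)]  # partial products
    levels = 0
    while len(rows) > 2:                      # steps 1 to 3
        full = len(rows) - len(rows) % 3      # rows that form complete groups of three
        reduced = []
        for i in range(0, full, 3):
            reduced.extend(counter_row_3_2(rows[i], rows[i + 1], rows[i + 2]))
        reduced.extend(rows[full:])           # leftover rows pass through unchanged
        rows = reduced
        levels += 1
    return sum(rows), levels                  # step 4: carry-propagate addition

product, levels = wallace_multiply(0xABC, 0xDEF, 12)
assert product == 0xABC * 0xDEF
```

No carry ripples within a reduction level, because each carry word is simply shifted into the next column and handled by the following level.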
Dadda [15] later refined Wallace’s method by defining a counter placement
strategy that required fewer counters in the partial product reduction stage at the cost of a
larger carry-propagate adder. For both methods, the total delay is proportional to the
logarithm of the operand word-length.
Other partial product reduction methods have been proposed since the work of
Wallace and Dadda. The Reduced Area [26] and the Windsor [27] methods are based on
strategic utilization of (3,2) and (2,2) counters to improve area and layout, while
maintaining the fast speed of the Wallace and Dadda designs. Oklobdzija, et al. [58]
define an algorithm for partial product reduction based on understanding the unequal
delay paths through counters and compressors.
Oklobdzija’s technique sorts and
connects fast inputs and outputs in the critical delay paths while assigning slow inputs
and outputs to signal paths that can tolerate an increase in delay. Other methods reduce
the initial matrix of partial products using either compressors or higher order counters.
2.2.1 Counters and Compressors
The fast speed of column compression multipliers results from the parallel
application of counters or compressors. It is important to note the differences between
counters and compressors [15, 29, 30, 31]. A (q,r) counter is a combinational logic block
where the number of inputs q and the number of outputs r are related by r = 1 + ⎣log2 q ⎦.
For counters, the outputs express the count of the number of inputs that are ones; in other
words, the counter determines how many inputs are active. The outputs for a counter
have differing weights. A (q,r) counter with inputs from the ith column generates one bit
in the ith column and one bit for each of the next r-1 columns.
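As a behavioral sketch of this definition (not a gate-level design), a (q,r) counter can be modeled as a function that counts the ones among its inputs and returns the count as r weighted bits. The function name and the list-based interface are assumptions made for illustration.

```python
def counter(inputs: list) -> list:
    """Behavioral (q,r) counter: return the count of ones as r bits, LSB first."""
    q = len(inputs)
    r = q.bit_length()        # equals 1 + floor(log2 q) for q >= 1
    count = sum(inputs)
    # Output bit w has weight 2**w, so it belongs to column i + w.
    return [(count >> w) & 1 for w in range(r)]

assert len(counter([0, 0, 0])) == 2                   # a (3,2) counter
assert len(counter([0] * 7)) == 3                     # a (7,3) counter
assert counter([1, 1, 0]) == [0, 1]                   # two ones -> binary 10
```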
On the other hand, a q:r compressor consolidates q input bits in the ith
column to r output bits, with one bit output in the ith column and one bit for each of the
next r-1 columns. Additionally, there are L carry-in bits entering the compressor at
different levels and also L carry-out bits leaving the compressor at different levels. These
L carry signals enter the compressor from the i-1 column and exit to the i+1 column. The
L carry-out signals are not dependent on the L carry-in signals to avoid the horizontal
ripple of carries.
Since counters or compressors are critical, high-quantity components in column
compression multipliers, any area or performance enhancements made to counters or
compressors directly affect the multipliers.
In [32] Kwon, et al., offer a fast 5:2
compressor that is used to implement a 16 by 16 multiplier-accumulator.
Several
researchers [33, 34, 35, 36, 37, 38, 39] have focused on developing optimal (3,2) counter
designs.
2.2.2 Reduction Schemes
As indicated in Figure 2.3, the multiplication of an N bit multiplicand by an N bit
multiplier generates an N word by N bit matrix of partial products. The reduction of this
partial product matrix to two words requires the parallel application of counters or
compressors. The final two words are then summed using a carry-propagate adder to
obtain the final product.
Figure 2.3: Steps for N by N unsigned parallel multiplication
The Wallace [14] and Dadda [15] reduction schemes are realized using (3,2) and
(2,2) counters. During the reduction process, each (3,2) counter takes three inputs from a
given column and outputs a sum bit which remains in that column and a carry bit which
enters the next more significant column. Each (2,2) counter accepts two inputs from a
column and produces a sum bit in the same column and a carry bit in the next column.
A useful tool for illustrating partial product reduction using (3,2) and (2,2)
counters is the dot diagram, developed by Dadda [15]. Each partial product bit is
represented by a dot.
The outputs of each (3,2) counter are depicted as two dots
connected by a plain diagonal line. The outputs of each (2,2) counter are shown as two
dots connected by a “crossed” diagonal line. For both types of counter, the dot representing
the “sum” remains in the same column of the partial product bits that are being added.
The dot representing the “carry out” is placed in the next column.
The dot diagram for a 12 by 12 Wallace multiplier is shown in Figure 2.4. In
each stage of the reduction, Wallace performs a preliminary grouping of partial product
rows into sets of three. (3,2) and (2,2) counters are then employed within each three row
set. In the 12 by 12 example, the counters shown in Stage 1 of the reduction are placed in
four sections as determined by the preliminary grouping of partial product bits out of the
AND array into sets of three. If due to the preliminary grouping there is only one partial
product bit, then that bit is directly moved down to the next stage. The reduction of the
partial product bits in Stage 1 by the counters shown in Stage 2 demonstrates that rows
which are not part of a three row set are moved down into the next stage without
modification.
The complete partial product reduction of a 12 by 12 Wallace multiplier requires
five stages (intermediate matrix heights of 8, 6, 4, 3, and 2) and uses 102 (3,2) counters
and 34 (2,2) counters. To complete the multiplication, an 18 bit carry-propagate adder
forms the final product by adding the final two rows of partial product bits shown in
Stage 5.
[Dot diagram, columns 24 to 1. Counters per stage: Stage 1: 40 (3,2), 8 (2,2); Stage 2: 20 (3,2), 6 (2,2); Stage 3: 20 (3,2), 8 (2,2); Stage 4: 11 (3,2), 5 (2,2); Stage 5: 11 (3,2), 7 (2,2).]
Figure 2.4: Dot Diagram for a 12 by 12 Wallace Multiplier
In the development of his reduction scheme using (3,2) and (2,2) counters, Dadda
noted that there exists a sequence of intermediate matrix heights that minimizes the
number of reduction stages. This sequence, determined by working back from the final
two row matrix, limits the height of each matrix to the largest integer that is no more than
1.5 times the height of its subsequent matrix. Table 2.3 indicates the number of reduction
stages based on the number of bits in the multiplier. For example, a 32 by 32 bit Dadda
multiplier requires eight reduction stages with intermediate heights of 28, 19, 13, 9, 6, 4,
3, and finally 2. Although the heights of the intermediate matrices are not always the
same for Wallace and Dadda multipliers, the two schemes utilize the same number of
reduction stages.
Table 2.3: Number of Reduction Stages for a Dadda Multiplier
Bits in Multiplier (N)    Number of Stages
3                         1
4                         2
5 ≤ N ≤ 6                 3
7 ≤ N ≤ 9                 4
10 ≤ N ≤ 13               5
14 ≤ N ≤ 19               6
20 ≤ N ≤ 28               7
29 ≤ N ≤ 42               8
43 ≤ N ≤ 63               9
64 ≤ N ≤ 94               10
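The height sequence and the stage counts of Table 2.3 can be reproduced with a short sketch. The function names are illustrative. The Wallace model assumes that each stage reduces every complete group of three rows to two and passes leftover rows through unchanged, as described for Figure 2.4.

```python
def dadda_heights(n: int) -> list:
    """Intermediate matrix heights for an n by n Dadda multiplier (n >= 3)."""
    heights = [2]
    while heights[-1] * 3 // 2 < n:           # next height is floor(1.5 * previous)
        heights.append(heights[-1] * 3 // 2)
    return heights[::-1]                      # first stage target first

def wallace_heights(n: int) -> list:
    """Intermediate matrix heights for an n by n Wallace multiplier (n >= 3)."""
    heights, h = [], n
    while h > 2:
        h = 2 * (h // 3) + h % 3              # groups of three become two rows
        heights.append(h)
    return heights

assert dadda_heights(32) == [28, 19, 13, 9, 6, 4, 3, 2]
assert wallace_heights(12) == [8, 6, 4, 3, 2]
# The heights differ, but the number of stages matches for every N in Table 2.3.
assert all(len(dadda_heights(n)) == len(wallace_heights(n)) for n in range(3, 95))
```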
The recursive algorithm used to determine the application of counters for a Dadda
multiplier is as follows:
1) Let d_1 = 2 and d_{j+1} = ⎣1.5 · d_j⎦, where d_j is the matrix height for the jth stage
from the end. Find the largest j such that at least one column of the
original partial product matrix has more than d_j bits.
2) In the jth stage from the end, apply (3,2) and (2,2) counters to obtain
a reduced matrix with no more than d_j bits in any column.
3) Let j = j – 1 and repeat step 2 until a matrix with a height of only two
is achieved.
In Figure 2.5, the dot diagram for a 12 by 12 Dadda multiplier is shown. The first
six matrix heights calculated using the recursive algorithm are 2, 3, 4, 6, 9, and 13. Since
this is a 12 by 12 multiplier, the matrix height of 13 is unnecessary. The next matrix
height to target is 9. Stage 1 of partial product reduction applies (3,2) and (2,2) counters
only to the columns whose total height is greater than 9. In Stage 2, (3,2) and (2,2)
counters are only used in columns whose total height is greater than 6. Note that when
evaluating a column’s height it is important to account for carries from the previous
column. The 12 by 12 Dadda multiplier requires five reduction stages (matrix heights of
9, 6, 4, 3, and 2) and uses 99 (3,2) counters, 11 (2,2) counters, and a 22 bit carry-propagate adder.
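The counter counts quoted above can be reproduced by tracking column heights only. The sketch below is one plausible reading of the algorithm, not the generator used in this research: it works from the least significant column upward, adds the carries arriving from the previous column in the same stage, and uses a (2,2) counter only when a column exceeds its target by exactly one bit. All names are illustrative.

```python
def dadda_reduce(n: int) -> tuple:
    """Return ([(num_3_2, num_2_2) per stage], final adder width) for n by n."""
    # Column heights of the unsigned partial product matrix, LSB column first.
    heights = [min(i + 1, 2 * n - 1 - i) for i in range(2 * n - 1)]
    targets = [2]
    while targets[-1] * 3 // 2 < n:
        targets.append(targets[-1] * 3 // 2)
    stages = []
    for d in reversed(targets):               # largest target height first
        total_fa = total_ha = carry_in = 0
        next_heights = []
        for h in heights:
            t = h + carry_in                  # include carries from the previous column
            fa = ha = 0
            while t > d:
                if t - d >= 2:
                    fa += 1                   # (3,2) counter lowers the height by two
                    t -= 2
                else:
                    ha += 1                   # (2,2) counter lowers the height by one
                    t -= 1
            next_heights.append(t)
            carry_in = fa + ha                # one carry per counter enters the next column
            total_fa += fa
            total_ha += ha
        heights = next_heights
        stages.append((total_fa, total_ha))
    adder_width = sum(1 for h in heights if h == 2)
    return stages, adder_width

stages, adder_width = dadda_reduce(12)
assert sum(fa for fa, _ in stages) == 99 and sum(ha for _, ha in stages) == 11
assert adder_width == 22
```

Under these assumptions the per-stage counts for the 12 by 12 case also agree with those listed for Figure 2.5.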
[Dot diagram, columns 24 to 1. Counters per stage: Stage 1: 8 (3,2), 4 (2,2); Stage 2: 27 (3,2), 3 (2,2); Stage 3: 28 (3,2), 2 (2,2); Stage 4: 17 (3,2), 1 (2,2); Stage 5: 19 (3,2), 1 (2,2).]
Figure 2.5: Dot Diagram for a 12 by 12 Dadda Multiplier
Another reduction scheme, which uses (3,2) and (2,2) counters, is used for the
Reduced Area (RA) multiplier [26, 40, 41]. The dot diagram for a 12 by 12 Reduced
Area multiplier is shown in Figure 2.6. This multiplier requires five stages (matrix
heights of 9, 6, 4, 3, and 2) and uses 104 (3,2) counters, 11 (2,2) counters, and a 17 bit
carry-propagate adder. The reduction method for the Reduced Area multiplier is:
1) For each reduction stage, the number of (3,2) counters used in each
column is ⎣ki / 3⎦, where ki is the number of bits in column i.
2) (2,2) counters are used only (a) when required to reduce the number of bits
to the height specified by the Dadda sequence, or (b) to reduce the rightmost column
containing exactly two bits.
Rule 1) for the Reduced Area multiplier results in the maximum reduction in the
number of bits entering the next stage. In Figure 2.6, Rule 2a) directs that in the third
reduction stage two (2,2) counters are used to reduce the number of bits in columns 12
and 13 to four. Rule 2b) adds one (2,2) counter to column i during reduction stage i.
This has the advantage of decreasing the word length of the carry-propagate adder by an
amount equal to the number of reduction stages.
[Dot diagram, columns 24 to 1. Counters per stage: Stage 1: 40 (3,2), 1 (2,2); Stage 2: 27 (3,2), 1 (2,2); Stage 3: 17 (3,2), 3 (2,2); Stage 4: 12 (3,2), 1 (2,2); Stage 5: 8 (3,2), 5 (2,2).]
Figure 2.6: Dot Diagram for a 12 by 12 Reduced Area Multiplier
A fourth type of partial product reduction scheme has been proposed by
Wang, et al. [27]. This technique, referred to in a subsequent paper as the Windsor
multiplier, aims to maximize the area efficiency while reducing cross-stage interconnect.
In Wang’s research, area efficiency of the column compression part of the multiplier is
defined as:
$$\text{area efficiency} = \frac{T}{K \cdot \max(T(k))} \times 100\%$$
where T is the total number of (3,2) and (2,2) counters used in reduction, K is the required
number of stages, and T(k) is the number of counters in stage k. High area efficiency
percentages indicate close to even distributions of counters within each stage. Even or
near even distributions of counters in each stage would provide more regular, similar
sized, layout blocks of each stage. Routing together each similar sized stage would create
a more regular, compact block of the overall partial product reduction section.
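As a worked illustration of this metric, the sketch below applies the formula to the per-stage counter totals listed for the 12 by 12 Wallace and Dadda dot diagrams (Figures 2.4 and 2.5). The function name is illustrative, and the resulting percentages are computed here rather than quoted from Wang, et al.

```python
def area_efficiency(counters_per_stage: list) -> float:
    """Area efficiency in percent: T / (K * max T(k)) * 100."""
    total = sum(counters_per_stage)           # T
    num_stages = len(counters_per_stage)      # K
    return 100.0 * total / (num_stages * max(counters_per_stage))

# (3,2) plus (2,2) counters in each stage, taken from the figure annotations.
wallace_12 = [40 + 8, 20 + 6, 20 + 8, 11 + 5, 11 + 7]
dadda_12 = [8 + 4, 27 + 3, 28 + 2, 17 + 1, 19 + 1]

print(round(area_efficiency(wallace_12), 1))  # 56.7
print(round(area_efficiency(dadda_12), 1))    # 73.3
```

A perfectly even allocation, such as the same number of counters in every stage, yields 100 percent, which is the target of the Windsor allocation.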
The first step in Wang’s method is to determine the minimum number of (3,2) and
(2,2) counters required for reduction. This total number of counters will be the same as
the number of counters used by Dadda. The allocation of these counters for the Windsor
multiplier attempts to distribute the same number of counters at each stage. The heuristic
procedure for allocating counters for efficient layout is as follows:
1. Calculate the average number of counters for each stage. W0= ⎡ T / K ⎤ .
2. If all of the stages can accommodate W0 counters, then place at most W0 counters in
each stage, with all T counters allocated to the K stages. The algorithm terminates
here.
3. If step 2 does not apply, and k lower stages cannot contain T counters, then fill
those stages with as many counters as they can contain, and calculate the average
number of counters for the remaining K – k stages, WK.
4. Check if each of the remaining K – k stages can accommodate the number of
counters determined in step 3. If true, distribute at most WK counters to each of
the K – k stages, under the condition that all counters have to be allocated to K
stages.
5. If at least one stage of the remaining K – k stages cannot contain WK counters, go
to step 3 and repeat steps 3-5 until an appropriate WK is determined.
For an 8 by 8 bit Windsor multiplier, an area efficiency of 95.5 percent was achieved in
comparison to the 75 percent attained by an 8 by 8 Dadda multiplier.
The performance of any of these four reduction methods—Wallace, Dadda,
Reduced Area, and Windsor—can be improved by the design of faster or more area
efficient (3,2) and (2,2) counters. Al-Twaijry and Flynn [42] used pass transistor (3,2)
counters as well as domino (3,2) counters in their investigation of the relationship
between the topology of partial product interconnections and possible circuit
implementations.
2.2.2.1 Using Higher-Order Counters and Compressors
Instead of using only (3,2) and (2,2) counters, multipliers can be designed using
higher-order (q,r) counters. The most common realizations of higher order counters are
(7,3) and (15,4) counters. Dadda [15], in preparation to use such counters for partial
product reduction defined new sets of intermediate matrix height sequences. He also
offered designs of (3,2) and (7,3) counters using inverting threshold gates and resistor-transistor threshold gates.
Swartzlander in [30, 43] discusses three ways to design higher order counters.
The most straightforward implementations for VLSI technology are built using (3,2)
counters as building blocks. Examples of (7,3) and (15,4) counters are shown in Figures
2.7 and 2.8, respectively. The delay to the most significant bit of a (2^n – 1, n) counter is
2n – 3 times the delay of a (3,2) counter.
Figure 2.7: (7,3) Counter design using (3,2) counters after [30]
Figure 2.8: (15, 4) Counter design using (3,2) counters after [30]
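A (7,3) counter built from four (3,2) counters can be checked exhaustively. The wiring below is one arrangement that is consistent with the stated delay of 2n – 3 = 3 counter delays to the most significant bit; it is an assumption and may differ in detail from the design in Figure 2.7.

```python
from itertools import product

def counter_3_2(a: int, b: int, c: int) -> tuple:
    """(3,2) counter (full adder): return (sum, carry)."""
    return a ^ b ^ c, (a & b) | (a & c) | (b & c)

def counter_7_3(x: tuple) -> tuple:
    """(7,3) counter from four (3,2) counters: return (r0, r1, r2), LSB first."""
    s1, c1 = counter_3_2(x[0], x[1], x[2])    # level 1
    s2, c2 = counter_3_2(x[3], x[4], x[5])    # level 1
    r0, c3 = counter_3_2(s1, s2, x[6])        # level 2: weight 1 bits
    r1, r2 = counter_3_2(c1, c2, c3)          # level 3: weight 2 bits
    return r0, r1, r2

# The three outputs encode the number of ones for all 128 input patterns.
for x in product((0, 1), repeat=7):
    r0, r1, r2 = counter_7_3(x)
    assert r0 + 2 * r1 + 4 * r2 == sum(x)
```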
Other designs of higher order counters have been reported. Configurations [44,
45] of (7,3) and (15,4) counters have been logic synthesized under various delay and area
constraints. These synthesized counters have smaller areas and faster delays than such
counters built from (3,2) counters. Another approach [46] is to use a folded transistor,
cross-coupled PMOS load implementation. This method though has three disadvantages:
increased input capacitance, more intermediate node capacitance, and a long pull-down
path.
The reduction of partial product matrices by higher order counters has been
examined in the literature.
Dadda [29] examines partial product reduction using
combinations of (7,3), (3,2), and (2,2) counters for board level multiplier designs. In
VLSI technology, using synthesized (7,3) counters, the resultant 16 by 16 multiplier in
[44] is two “simple” gate delays faster than Wallace or Dadda multipliers (i.e., (3,2) and
(2,2) implementations), but with an approximate 10% overall gate count increase.
Compressors, especially 4:2 versions, have also been used to reduce partial
product matrices. The 4:2 compressor can be easily realized from two (3,2) counters as
shown in Figure 2.9. More recently, new 4:2 compressor circuits have been devised that
speed up multiplication. In [47] Nagamatsu, et al. report a 15 nsec 32 by 32 bit CMOS
multiplier, based on the Wallace reduction scheme, by using a specially designed 4:2
compressor cell. Ohkubo, et al. [48] achieved a 4.4 nsec, 54 by 54 bit CMOS multiplier
using a 4:2 compressor designed with pass-transistor multiplexers.
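[Schematic: inputs q3, q2, q1, and q0 feed two cascaded (3,2) counters, with one carry sent to the adjacent column and one carry received from the adjacent column; the outputs are r1 and r0.]
Figure 2.9: 4:2 Compressor using (3,2) counters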
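The arithmetic of a 4:2 compressor made from two (3,2) counters can be sketched as follows. The assignment of q0, q1, and q2 to the first counter is an assumption, since the figure does not survive in this text; the property that matters is that the carry sent to the adjacent column does not depend on the carry received from it.

```python
from itertools import product

def counter_3_2(a: int, b: int, c: int) -> tuple:
    """(3,2) counter (full adder): return (sum, carry)."""
    return a ^ b ^ c, (a & b) | (a & c) | (b & c)

def compressor_4_2(q0: int, q1: int, q2: int, q3: int, cin: int) -> tuple:
    """Return (r0, r1, cout); r1 and cout both have twice the weight of r0."""
    s1, cout = counter_3_2(q0, q1, q2)   # cout goes to the adjacent column
    r0, r1 = counter_3_2(s1, q3, cin)    # cin comes from the adjacent column
    return r0, r1, cout

for q0, q1, q2, q3 in product((0, 1), repeat=4):
    out0 = compressor_4_2(q0, q1, q2, q3, 0)
    out1 = compressor_4_2(q0, q1, q2, q3, 1)
    assert out0[2] == out1[2]            # cout is independent of cin: no horizontal ripple
    for cin, (r0, r1, cout) in ((0, out0), (1, out1)):
        assert q0 + q1 + q2 + q3 + cin == r0 + 2 * (r1 + cout)
```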
Higher order compressors are possible, but are not frequently used. Song and De
Micheli [46] examined 9:2 and 27:5 compressors, which are reported to have highly
regular layout. In their analysis of 5:3, 6:3, and 7:3 compressors, Jones and Swartzlander
[49] note that since compressors require connections between adjacent compressors for
intermediate carries, layout may be more complex.
2.2.3 The Final Carry-Propagate Adder
The literature offers several different types of optimized carry-propagate adders,
including carry lookahead [50, 51], carry-select [6], carry-skip [52, 53, 54], and modified
Ling [55]. Such adder architectures have been evaluated and ranked on the basis of
speed, size, and number of logic transitions [56]. More specifically, work has been done
to aid in the estimation of power consumption of adders [57].
To better select and design adders for column compression multipliers Oklobdzija
[31, 58] studied the bit arrival times to the final adder. His analysis shows that for Dadda
multipliers, the middle bit-pairs are the latest to arrive to the adder. Both the least
significant and most significant bit-pairs are produced early. Therefore, it is possible to
tailor the final carry-propagate adder to take advantage of the bit-pairs that arrive early.
Oklobdzija suggests using either a ripple carry adder or a variable block adder to sum the
early least significant bit-pairs, a carry look-ahead adder to sum the middle region of bit-pairs, and either a carry select adder or conditional sum adder to sum the early most
significant bit pairs.
2.2.4 Layout Approaches
In the literature, three main physical design strategies have been suggested to
facilitate the custom layout of column compression multipliers. The first method tries to
associate each counter or compressor in a structured grid array. This brute force method
is subject to connectivity errors since there can be a very large number of identical
counters with no visual patterns or clues to aid routing.
The second way to lay out the reduction stages is the basis of the Windsor
multiplier. By evenly allocating counters in each stage and eliminating many cross-stage
interconnects, the Wang, et al. approach [27] attempts to maximize area efficiency.
The third method, outlined in [59], divides the N by N partial products into two
groups by an appropriate digit around the center of the initial parallelogram structure.
The right-triangle halves are then rearranged to form a rectangular structure with less
“dead” area. The partial products in each group are added in opposite directions using 4:2
compressors. In the case of a 54 by 54 bit multiplier, this layout method produced an
area that is 19.6% smaller than the conventional multiplier’s area.
Chapter 3
Automated Multiplier Netlist Generation
Sixty column compression multipliers were developed in order to examine their
area, delay, and power characteristics.
These sixty multipliers are realizations of
Wallace, Dadda, and Reduced Area multipliers for four operand sizes, four process
technologies, and five standard cell libraries. The daunting task of creating the design
netlists of so many multipliers mandated a flexible, automated process. Therefore a
multiplier generator was developed, in the scripting language Perl, to output the
multipliers’ netlists in gate-level Verilog or spice formats.
This chapter details the actual designs of the multipliers. An overview is given of
the M x N multiplier generator, genmult. The features of the process technologies and
standard cell libraries are also discussed.
3.1 Basic Multiplier Design
Column compression multipliers with three different compression strategies were
implemented: Wallace, Dadda, and Reduced Area. Figure 3.1 presents the basic top-level implementation for N by N unsigned multipliers. Mainly, 8 by 8, 16 by 16, 32 by
32, and 64 by 64 unsigned multipliers were developed in different process geometries. It
is possible to implement two’s complement multipliers by using both NAND and AND
gates in the partial product array. The NAND gates can be used, as specified by Parhami
[24] and outlined in Chapter 2, with little impact to layout complexity and delay.
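[Block diagram: the N bit Multiplier (B) and Multiplicand (A) inputs pass through D flip-flops and buffers to the AND gate array, the compression strategy block, and the carry lookahead adder. Labeled outputs: p0; pS,…,p1 (Wallace or RA); p2n-2,…,p1 (Dadda); p2n-2,…,pS+1 (Wallace or RA). The product bits drive D flip-flops and load capacitors.]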
Figure 3.1: Block diagram for implemented column compression multipliers
3.1.1 Signal Buffering
In order to approximate typical signal arrival times and drive strengths, D flip-flops
are used on the primary inputs. D flip-flops drive multiple buffers to distribute
input signals to N^2 AND gates. Delay simulations were performed for each cell library to
resolve 1) the maximum number of buffers that a single D flip-flop can drive, and 2) the
maximum number of AND gate inputs that a single buffer can drive.
3.1.2 Partial Product Reduction
The least significant bit of the final product is formed from a0 · b0; therefore p0 is
available immediately from the AND gate array.
The Wallace and Reduced Area
reduction stages generate equal numbers of early product bits, while the Dadda reduction
stages generate only the LSB, p0. For Wallace and Reduced Area multipliers, the number
of product bits that are produced early is equal to the number of column compression
stages, S.
The remaining product bits are available after the delay through the final
carry-propagate adder. Figure 3.2 details the connection of a final product bit to a D flip-flop and capacitive load which scales with process technology from 0.01 pF to 0.0025 pF.
Figure 3.2: Loading for each product bit
3.1.3 Carry Lookahead Adder
For each of the three types of multipliers implemented, a carry lookahead adder
[60] is used for the final carry-propagate adder. Carry lookahead adders perform fast
addition by generating the carries in parallel with the sum computations. Modified Full
Adders (MFAs), shown in Figure 3.3, are used to sum each bit pair and determine if a
carry has been “generated” or would be “propagated.” Carry generation means that both
input bits are ONE and therefore, regardless of the carry-in, a carry-out of ONE is
“generated.” Carry propagation means that at least one of the input bits is a ONE and that
the carry-in will “propagate” directly to the carry-out; that is, a carry-in of ONE in this
situation produces a carry-out of ONE. For a MFA, the generate and propagate signals
are described by $g_k = x_k y_k$ and $p_k = x_k + y_k$.
Figure 3.3: Modified Full Adder
Based on the generate and propagate signals, lookahead logic blocks can quickly
determine a series of next carries, as shown in Equations (3.1) – (3.4):
$$c_{k+1} = g_k + p_k c_k \tag{3.1}$$
$$c_{k+2} = g_{k+1} + p_{k+1} c_{k+1} = g_{k+1} + p_{k+1} g_k + p_{k+1} p_k c_k \tag{3.2}$$
$$c_{k+3} = g_{k+2} + p_{k+2} c_{k+2} = g_{k+2} + p_{k+2} g_{k+1} + p_{k+2} p_{k+1} g_k + p_{k+2} p_{k+1} p_k c_k \tag{3.3}$$
$$c_{k+4} = g_{k+3} + p_{k+3} c_{k+3} = g_{k+3} + p_{k+3} g_{k+2} + p_{k+3} p_{k+2} g_{k+1} + p_{k+3} p_{k+2} p_{k+1} g_k + p_{k+3} p_{k+2} p_{k+1} p_k c_k \tag{3.4}$$
Organizing the lookahead logic block in a 4-bit wide module, it is possible to express
Equation (3.4) in terms of block generate and block propagate signals, $g_{k:k+3}$ and $p_{k:k+3}$,
respectively:
$$c_{k+4} = g_{k:k+3} + p_{k:k+3} c_k \tag{3.5}$$
where
$$g_{k:k+3} = g_{k+3} + p_{k+3} g_{k+2} + p_{k+3} p_{k+2} g_{k+1} + p_{k+3} p_{k+2} p_{k+1} g_k \tag{3.6}$$
and
$$p_{k:k+3} = p_{k+3} p_{k+2} p_{k+1} p_k \tag{3.7}$$
Figure 3.4 shows the block diagram of a 16-bit carry lookahead adder. The carry
lookahead logic blocks are organized in 4-bit modules. The operation of the 16-bit carry
lookahead adder is as follows: 1) the inputs x, y, and c0 are applied, 2) each MFA
computes p and g, 3) the first level of lookahead logic blocks computes the carries and
block generates and propagates, 4) with the carry data, each MFA computes the sum
outputs. The final carry-out, c15, is made from the second level lookahead logic block.
This simple style of a carry lookahead adder was used for all of the multipliers
implemented for this research. Primarily, the lookahead logic blocks were organized in
4-bit modules. Where the number of input bits was not a multiple of four, 1-bit, 2-bit, or
3-bit lookahead logic blocks were applied as needed to the most significant bit pairs.
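The organization described above can be modeled behaviorally. The sketch below is an assumption-laden illustration rather than the netlist produced for this research: it evaluates the carries inside each block by repeated use of Equation (3.1), which is logically equal to the flattened forms (3.2) through (3.4) that hardware evaluates in parallel, and it forms the carry into the next block from the block generate and propagate signals of Equations (3.5) through (3.7). A width that is not a multiple of four leaves a shorter block on the most significant bit pairs.

```python
def cla_add(x: int, y: int, width: int = 16, c0: int = 0, block: int = 4) -> tuple:
    """Behavioral carry lookahead addition; return (sum, carry_out)."""
    g = [(x >> k) & (y >> k) & 1 for k in range(width)]    # generate, g_k = x_k y_k
    p = [((x >> k) | (y >> k)) & 1 for k in range(width)]  # propagate, p_k = x_k + y_k
    carries = [c0]                                         # carries[k] is c_k
    for lo in range(0, width, block):
        hi = min(lo + block, width)
        c_in = carries[lo]
        c = c_in
        for k in range(lo, hi - 1):                        # carries inside the block
            c = g[k] | (p[k] & c)                          # Equation (3.1)
            carries.append(c)
        block_g, block_p = 0, 1
        for k in range(lo, hi):
            block_g = g[k] | (p[k] & block_g)              # Equation (3.6)
            block_p &= p[k]                                # Equation (3.7)
        carries.append(block_g | (block_p & c_in))         # Equation (3.5)
    total = 0
    for k in range(width):
        total |= (((x >> k) ^ (y >> k) ^ carries[k]) & 1) << k
    return total, carries[width]

assert cla_add(0xFFFF, 0x0001) == (0x0000, 1)
assert cla_add(12345, 54321) == ((12345 + 54321) & 0xFFFF, 1)
```

Calling `cla_add(x, y, width=14)` exercises the case of a 2-bit lookahead block on the most significant bit pairs.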
Figure 3.4: Diagram of 16-bit Carry Lookahead Adder
3.2 Process Technologies
The four process technologies used in this research are from the same world-class
semiconductor foundry. The four process technologies are 1) a 250 nm CMOS logic,
single poly, five metal layer, salicide, 2.5 V process, 2) a 180 nm CMOS logic, single
poly, six metal layer, salicide, 1.8 V process, 3) a 130 nm CMOS logic, single poly, eight
metal layer, salicide, 1.2 V process, and 4) a 90 nm CMOS logic, single poly, nine metal
layer, salicide, 1.0 V process. The 250 nm and 180 nm processes represent today’s
mainstream logic technologies. They are product-proven technologies and offer the best
overall value for mixed signal designs in the consumer and industrial marketplaces. The
130 nm and 90 nm processes are among the foundry’s more advanced process
technologies, offering many low power, high performance options, such as different core
voltages and multiple threshold voltages.
3.3 Cell Libraries
The column compression multipliers were implemented using standard cells from
state-of-the-art libraries. These libraries were created for mainstream applications with
optimizations for speed and density. The library architecture for 180 nm, 130 nm, and 90
nm process technologies is an enhanced generation over the 250 nm library architecture.
For the 130 nm process technology, two cell libraries were used. One is the generic cell
library. The second cell library is specifically architected to be low power and high
density. This additional cell library has been characterized down to 0.6 V to enable
accurate timing simulations at low voltages.
The design kit for each standard cell library includes LEF files and timing files.
A LEF (Library Exchange Format) file contains the physical information for a process
technology as well as geometric abstracts of all of the cells. All of the timing files used
for this research are for the nominal temperature, voltage, and process corner, often
named “typical.lib.”
The most critical logic cell of the Wallace, Dadda, and Reduced Area multipliers
is the (3,2) counter. Figure 3.5 shows the schematic of the (3,2) counter that is a standard
cell within each of the libraries.
As expected, the slowest paths are from the a and b
inputs to the carry out, cout.
Figure 3.5: Schematic of (3,2) counter standard cell
3.4 M x N Multiplier Generator
The automation of the design process is essential in order to create several
multipliers in a short time frame. In the literature, various programming languages have
been used to create VHDL, Verilog, or netlist files. Several generators [61, 62, 63] for
Booth encoded multipliers with optimized Wallace trees have been written in Lisp, AWK,
or C.
Over time, such multiplier generators have improved in capability, offering
pipeline insertion and opportunities for incremental optimization. In 2000, Hsiao and
Jiang [64] produced a synthesizer which generates gate level Verilog code for a fast
column compression multiplier. Their synthesizer connects the full adders of partial
product reduction by choosing a connection pattern which minimizes the average input-to-output delay, offering global optimization for all of the available adders. In 2003,
Qian and Dong-Hui [65] presented a “Regularized Multiplier Generator,” written in C++
and producing VHDL. This generator uses 4:2 compressors for the partial product
reduction.
For this research, an M x N multiplier generator called “genmult” was created
using the scripting language, Perl. The user can specify several options for the generation
of all or parts of a column compression multiplier. genmult is invoked using
genmult -t <type> -M <size> -N <size> -a <adder> -<all, and, comp, add> -p <proc>
where
<type> = <dad | wal | ra> = Dadda, Wallace, or Reduced Area multiplier
<size> = <8 - 64> = number of bits in the multiplicand (M) or multiplier (N)
<adder> = <rc | cla | NA> = ripple carry adder, carry lookahead adder, or no adder
<all> = netlist complete multiplier with input and output wrapper of D flip-flops
<and> = netlist AND gate array only
<comp> = netlist column compression stage only
<add> = netlist final adder only
<proc> = process ID.
Based on the input options, genmult creates a spice netlist. For example, it is possible to
generate a 32-bit by 16-bit Wallace multiplier spice netlist, wal32x16.spi, or just a 14-bit wide carry
lookahead adder spice netlist, cla14.spi.
The genmult script uses an additional input file <proc>.list. <proc>.list contains a
mapping of genmult gate names to the appropriate names of logic gates in a particular
process. In one process technology, a full adder cell may be labeled “ADDFX1,” while
in another process, it is labeled “fadder.” The process map file is created manually but
once completed is reused for all designs in that process.
In order to have a gate-level Verilog netlist for layout, a spice to Verilog script,
“spi2ver,” was created using Perl. This script takes a spice netlist as input and uses it to
generate a Verilog netlist. For example, “spi2ver wal32x16.spi” is used to create the
Verilog file “wal32x16.g.”
Chapter 4
Automated Multiplier
Implementation and Verification
A design flow encompassing industry standard layout tools and verification
practices was used to develop sixty column compression multipliers. The multiplier
netlists, created using the home-grown tool genmult, were checked using formal
verification techniques and then placed and routed.
The parasitic resistances and
capacitances were then extracted from the layouts and used to back-annotate the netlists
for delay analysis and power simulations. Layout tools and simulation practices were
applied even-handedly to ensure fair representations of each type of multiplier. Scripting
languages, like Perl and C shell, were used to automate often repeated tasks and
streamline information extraction.
This chapter outlines the design flow. A detailed overview of the tools and the
process used for layout, parasitic extraction, delay simulation, and power estimation is
given. Floor planning decisions are also reported.
4.1 Design Flow
Premier tools from Cadence Design Systems, Inc., form the backbone of the
design environment. Figure 4.1 illustrates the design and tool flow for the development
and verification of the column compression multipliers. The first important step is to
verify that the generated gate-level Verilog netlist functions as the desired multiplier.
[Flow diagram, with the tool used at each step: Verilog Netlist Generation (genmult and spi2ver Perl scripts); Functional Verification (Conformal Equivalence Checking, Conformal Ultra); Timing Driven Placement (Encounter NanoPlace); Timing Driven Route (Encounter NanoRoute); RC Extraction (Encounter Native Extraction); Static Timing Analysis (Encounter Common Timing Engine); Power Analysis (Virtuoso UltraSim).]
Figure 4.1: Design and Tool Flow
This task is performed using Encounter™ Conformal® Equivalency Checking,
version 5.1, with the additional product of Conformal Ultra which targets datapath
structures. The verified multiplier netlist is then placed and routed using NanoPlace™
and NanoRoute™ of the Encounter platform. Parasitic extraction is performed using
Encounter’s native RC extraction program. The static timing analysis tool, Encounter’s
Common Timing Engine, uses this parasitic information to determine path delays. The
parasitic data is also used in power simulations by Cadence’s Virtuoso® UltraSim™,
version 4.2.
All of the tools and scripts were run on a desktop personal computer running the
Red Hat Enterprise Linux 4.2 operating system. The desktop personal computer was
built using 1 GB of memory and a 3.4 GHz Intel® Pentium® D, dual-core, 64-bit
processor.
4.2 Formal Verification
Formal verification is a type of static analysis that applies mathematical
techniques to rigorously prove that a design functions correctly. Equivalence checking
uses formal techniques to determine whether two versions of a design are functionally
equivalent. This powerful verification method can be performed quickly and without the
need for test vectors.
In order to ensure that each generated column compression multiplier operates
correctly, formal verification was performed using Cadence’s Encounter Conformal
Equivalence Checking.
The Conformal Ultra product was added to extend logic
equivalence checking capability to complex datapaths.
Conformal Ultra provides
targeted support for analyzing adders and multipliers with standard architectures.
The flowchart in Figure 4.2 shows the Conformal process flow. Each generated
gate-level Verilog netlist of a column compression multiplier is compared to a Verilog
RTL multiplier design.
The Verilog RTL multiplier is considered the “Golden,”
faultless design. The generated gate-level Verilog netlist is the “Revised,” to-be-verified
design.
After reading in the designs, the cell library information, and user-specified
constraints and parameters, Conformal maps key points and compares the logic
implemented to reach them. In the case of the multipliers, the key points are the primary
inputs, primary outputs, and D flip-flops. When the comparison is complete, Conformal
reports areas of equivalence and pinpoints differences.
Conformal also assists in
diagnosing mismatches with error patterns and candidates, gate reporting, source code
viewing, and schematic viewing with trace capability.
Figure 4.2: Conformal Process Flow
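The Golden versus Revised comparison can be illustrated, for very small word sizes only, by exhaustive enumeration. The sketch below is a toy stand-in and not the Conformal flow: Conformal proves equivalence formally without enumerating vectors. The function names are hypothetical, the multiplier is unsigned, and the reduction is a simple greedy column compression rather than the Wallace, Dadda, or Reduced Area schedules studied in this work.

```python
def full_adder(a, b, c):
    """(3,2) counter: three bits in, (sum, carry) out."""
    return a ^ b ^ c, (a & b) | (a & c) | (b & c)


def golden_multiply(x, y):
    """Reference ("Golden") behavior of the multiplier."""
    return x * y


def revised_multiply(x, y, n):
    """Bit-level n by n unsigned multiplier built from AND gates and (3,2) counters."""
    # AND gate array: partial product bit i*j lands in column i + j.
    cols = [[] for _ in range(2 * n - 1)]
    for i in range(n):
        for j in range(n):
            cols[i + j].append(((x >> i) & 1) & ((y >> j) & 1))

    # Greedy reduction: apply (3,2) counters until every column holds at most two bits.
    while any(len(col) > 2 for col in cols):
        nxt = [[] for _ in cols]
        for k, col in enumerate(cols):
            bits = list(col)
            while len(bits) >= 3:
                s, c = full_adder(bits.pop(), bits.pop(), bits.pop())
                if k + 1 == len(nxt):
                    nxt.append([])      # grow the matrix for a carry out of the top column
                nxt[k].append(s)
                nxt[k + 1].append(c)
            nxt[k].extend(bits)         # leftover bits pass through unchanged
        cols = nxt

    # Final carry propagate addition of the two remaining rows.
    result, carry = 0, 0
    for k, col in enumerate(cols):
        total = sum(col) + carry
        result |= (total & 1) << k
        carry = total >> 1
    return result | (carry << len(cols))


def exhaustively_equivalent(n):
    """Compare Revised against Golden for every pair of n-bit operands."""
    limit = 1 << n
    return all(revised_multiply(x, y, n) == golden_multiply(x, y)
               for x in range(limit) for y in range(limit))
```

Enumeration needs 2^(2N) operand pairs, so it is practical only for toy sizes; that cost is the reason a formal method is used for the 32 by 32 and 64 by 64 designs.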
4.3 Layout Floorplanning
Initial floorplanning for the column compression multipliers was performed using
Cadence’s Encounter platform.
The fundamental goal of the floorplanning was to
prepare a physical structure such that the placement and route tools could operate on each
design in a consistent, balanced manner.
To build a floorplan, a minimal set of
constraints and parameters was designated in a configuration file for each multiplier (e.g.
dad8x8j250.conf). The main items specified in each configuration file include
1. the type of design netlist,
2. the filename of the timing data for the standard cell library,
3. the filename(s) of the LEF for the standard cell library,
4. a target aspect ratio for layout height and width,
5. the designation to flip cells to facilitate power abutment,
6. a target row utilization, and
7. the filename for the pin I/O assignments.
In this research, the type of netlist used for placement and route is gate-level
Verilog.
The timing file used for each standard cell library represents the typical
performance at nominal voltage, temperature, and process corner. The LEF provides the
physical geometries of the process technology and the standard cells.
For custom datapaths, a bit-slice of an arithmetic unit tends to fit in long, pitch-matched, rectangular channels. For this research, a key goal is to evaluate column
compression multipliers in a sea-of-gates design flow. There is no requirement to pitch-match. Instead, the footprint should support high-density placement. To this end, a
target aspect ratio of 0.95 for the layout’s height versus width was given for each
multiplier. This means that each layout would take on an almost square appearance.
Power and ground strips were configured to abut power with power and ground
with ground, as shown in Figure 4.3. No additional space was allotted for routing
channels in metal 1.
Figure 4.3: Power/Ground abutment in layout
The pin I/O assignment defines the labels and placement order of input and output
pins for the multipliers. As shown in Figure 4.4, each pin assignment file places the
multiplicand, A, across the top of the layout, the multiplier, B, along the west side, the
least significant half of the product along the east side, and the most significant half of
the product on the bottom. Note that pin placement order was specified but the exact
locations for each pin were not fixed. This allowed the tool to select the optimal pin
placement and the overall layout to grow or shrink as needed.
Figure 4.4: Diagram of pin placement for N by N multipliers
One of the important input parameters for the cell placement tool is row
utilization. Row utilization is the ratio of the total area occupied by the design’s cells to
the total area of the layout region. The tool user specifies a row utilization target before
placement occurs. High row utilization numbers indicate a very dense cell placement.
Depending on the size of the design, a cell placement with a high row utilization may not
be routable. The initial row utilization target for each multiplier layout was set at 95%.
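The two floorplan targets described above, row utilization and aspect ratio, together fix the approximate core dimensions once the total cell area is known. The following is a minimal sketch of that arithmetic; the function name and the 9,500 µm² cell area are hypothetical, and a real floorplanner would additionally snap the height to a whole number of cell rows.

```python
import math


def core_dimensions(cell_area_um2, row_utilization=0.95, aspect_ratio=0.95):
    """Return (width, height) in µm of a core sized for the given targets.

    row_utilization = total cell area / total core area
    aspect_ratio    = core height / core width
    """
    if not 0.0 < row_utilization <= 1.0:
        raise ValueError("row utilization must be in (0, 1]")
    if aspect_ratio <= 0.0:
        raise ValueError("aspect ratio must be positive")
    core_area = cell_area_um2 / row_utilization
    width = math.sqrt(core_area / aspect_ratio)
    height = aspect_ratio * width
    return width, height


# Hypothetical example: 9,500 µm² of standard cells at the 95% targets.
width, height = core_dimensions(9500.0)
print(f"core is about {width:.1f} µm wide and {height:.1f} µm tall")
```

With both targets at 0.95 the core is nearly square and only about 5% larger than the cells it contains, which is why routability becomes the limiting concern at high utilization.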
4.4 Timing Driven Placement and Route
Timing driven placement and route offers the opportunity to realize optimized
layout of the cells in a multiplier’s critical path. Based on 1) the floorplan configuration,
2) the input clock period defined in the timing constraints file, 3) cell delays from timing
libraries, and 4) net delays calculated using RC extracted from trial routes, the NanoPlace
tool performs several iterations to determine an optimal placement.
Following cell
placement, filler cells are added as needed to extend power and ground lines.
NanoRoute uses the same input information as NanoPlace to produce a routed
design. Typically, digital designers allow NanoRoute to perform gate upsizing, signal
buffering, and even logic optimization during the routing process. In order to maintain
control over the cells used in a multiplier’s implementation, additional buffering, gate
resizing, and logic optimization were disabled during routing.
4.5 RC Extraction
The native RC extraction tool within Encounter offers two modes of operation:
default and detailed. Per the Encounter User Guide [66], the total capacitance for each
net is calculated based on the net’s geometry and the local wire density in the default
mode. Note that coupling capacitance is not calculated in default mode. In the detailed
mode, the coupling capacitance is also evaluated by considering the actual geometries of
neighboring nets on the same metal layer and the adjacent metal layer when a complete
capacitance table is provided. Detailed mode offers RC values that contribute to more
accurate timing results for a particular process technology.
The key to performing a detailed RC extraction is the provision of a capacitance
table created from an IceCaps Technology (ICT) file. For this research, an ICT file was
only available for the 250 nm process. Therefore it was possible to perform both default
and detailed RC extractions when using the 250 nm process. For the 180 nm, 130 nm,
and 90 nm processes, only the default RC extraction is conducted. For the multipliers
developed in the 250 nm process, the timing delays using the detailed RC extractions
were 1.5% to 3.5% slower than the timing delays that included the default RC
extractions. Since the delay differences between the two extraction modes were very
small, the default RC extractions were deemed sufficient for the simulations in this
research.
4.6 Static Timing Analysis
Given a specified set of operating constraints and timing libraries, timing analysis
is used to fine tune and debug speed-limiting critical paths. All timing analysis in this
research was performed using Encounter’s Common Timing Engine (CTE). Figure 4.5
shows the basic configuration of CTE used for this research. For each multiplier, three
main sets of timing data were reported: 1) the top 50 slowest paths, 2) the worst case
delay to each final product bit, and 3) the worst case delay of the bit pairs into the Carry
Look-Ahead Adder.
Figure 4.5: Configuration of Common Timing Engine
4.7 Power Simulation
Virtuoso UltraSim is designed to verify analog, mixed signal, and digital circuits
using a multi-purpose, single engine, hierarchical simulator. It is promoted as “ten to
more than 10,000 times faster than SPICE” [67].
In its most accurate mode, this
simulator offers plus or minus one percent accuracy with respect to SPICE.
For this research, Virtuoso UltraSim is used to perform dynamic power analysis,
with monitoring of the average, maximum (i.e., peak), and RMS currents and power
consumed by each multiplier.
It is not possible to evaluate leakage current using
UltraSim and the given standard cell libraries.
Figure 4.6 shows an example
configuration for an UltraSim simulation of a 32 by 32 Dadda multiplier. The primary
input file, dad32x32.sp, provides pointers to required files, such as the circuit’s netlist,
the parasitic resistors and capacitors for back-annotation, and parameters of the process
technology. The input test vector file includes stimulus values for the multiplier and the
multiplicand as well as output values for the final product, making the simulations self-checking.
Figure 4.6: Configuration of UltraSim
The dad32x32.sp file also sets up power, ground, circuit clocking, simulation
modes, and measurement commands. The simulation mode used for all runs was “digital
accurate mode.” The digital accurate mode is used for timing verification of digital
circuits with a simulation error target of less than 5%. In the trade-off of speed and
accuracy, the simulations were customized to simulate with slightly faster speed
(speed=6) than the default speed (speed=5) of digital accurate mode, thereby setting the
relative convergence criterion, tol, for the current and voltage calculations to 0.02 and the
absolute current tolerance, iabstol, to 1×10⁻¹⁰.
Chapter 5
Multiplier Area
“Complex” and “unwieldy” are two terms often used to describe the physical
realization of column compression multipliers. Intuitively, current computer aided design
techniques offer the opportunity to make the multiplier areas the smallest and most
compact layouts realizable. Today’s modern process technologies offer five or more
layers of metal for signal routing. Thus it is possible to place all of the multiplier’s cells
without having to leave room for routing channels. This means that multiplier area is
solely dependent on the area of the cells used in the design. There is no need for a route
component when estimating area. Since, for N by N multipliers, the number of the largest
cells, the (3,2) counters, grows as N², multiplier area is expected to be roughly
proportional to N².
In 1974, Dennard et al. [68] indicated that, to first order, each new generation
of process technology should be expected to scale all MOS physical dimensions in proportion
to the minimum feature size, λ, of the process technology. Since the height and width of
every cell are proportional to λ, the area of each standard cell is proportional to λ², and the
total area of an N by N column compression multiplier is expected to be approximately
equal to k λ² N², where k is a constant scaling factor.
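Two consequences of this first-order model are worth writing out, since the measurements later in the chapter are compared against them. The symbols A, λ₁, and λ₂ are introduced here only for illustration; k is assumed to be the same for the designs being compared.

```latex
A(N,\lambda) \approx k\,\lambda^{2} N^{2}

% Doubling the word size at a fixed process:
\frac{A(2N,\lambda)}{A(N,\lambda)} \approx \frac{k\,\lambda^{2}(2N)^{2}}{k\,\lambda^{2}N^{2}} = 4

% Moving a fixed word size from feature size \lambda_1 to \lambda_2:
\frac{A(N,\lambda_1)}{A(N,\lambda_2)} \approx \left(\frac{\lambda_1}{\lambda_2}\right)^{2},
\qquad \text{e.g.}\quad \left(\frac{0.25}{0.09}\right)^{2} \approx 7.7
```

Under this model, each doubling of N would quadruple the area, and a move from 250 nm to 90 nm would shrink the area by a factor of about 7.7.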
This chapter presents the results of using the place and route tools, NanoPlace and
NanoRoute within Cadence’s Encounter platform, to lay out Wallace, Dadda, and
Reduced Area multipliers. The multipliers have been developed in the standard cell
libraries of four CMOS process technologies: 1) 250 nm, 2.5 V, 2) 180 nm, 1.8 V, 3) 130
nm, 1.2 V, and 4) 90 nm, 1.0 V. For the 130 nm process technology, multipliers were
created using both a generic standard cell library and a low power standard cell library.
Before the actual layout values are reported, a simple analysis is offered to predict the
trends in area as multiplier sizes, standard cell libraries, and process technologies are
changed. This chapter concludes with an analysis of the significance of the layout
results.
5.1 Area Estimation
As noted above, a first order area estimate is based on the number of (3,2)
counters, which is proportional to N². A more exact area estimate is based on the gate
counts. As shown in Figure 5.1, each type of multiplier is comprised of 1) D flip-flops on
the inputs and outputs, 2) buffers distributing multiplicand and multiplier bits to the AND
gate array, 3) the AND gate array, 4) the reduction stages, and 5) the carry lookahead
adder. In order to perform two’s complement multiplication, a NAND/AND gate array is
implemented to generate the partial products. The NAND/AND gate array has the same
layout complexity as the AND gate array.
Figure 5.1: Block diagram of an N by N unsigned column compression multiplier
Table 5.1 indicates the number of D flip-flops, buffers, and AND gates used for
each multiplier. The number of D flip-flops is equivalent to the total number of primary
inputs and outputs, 4N. In examining the drive capabilities of D flip-flops and buffers,
delay simulations showed that one D flip-flop could drive up to eight buffers and a
buffer, depending on size, could drive up to eight AND gate inputs.
Table 5.1: Number of D flip-flops, buffers, and AND gates used in the multipliers
Multiplier Word Size    D flip-flops    Buffers    AND gates
8 by 8                  32              0          64
16 by 16                64              64         256
32 by 32                128             256        1024
64 by 64                256             1024       4096
Table 5.2 summarizes expressions, reported in [26, 60], for calculating the
number of (3,2) and (2,2) counters used in partial product reduction as well as the word
length of the final carry propagate adder as a function of the input multiplier size, N. The
expressions for counters and adder word length assume N > 6 and are also based on the
number of reduction stages, S, which is approximately equal to log₁.₄ N. The number of
(3,2) counters in Dadda multipliers comes simply from recognizing that there are N² bits
in the original partial product matrix and 4N – 3 bits in the final two-row matrix.
Compared to Dadda multipliers, the Reduced Area multipliers have S fewer bits in the
final two-row matrix that go to the carry propagate adder. This occurs because the
Reduced Area multipliers use (2,2) counters to reduce the rightmost column that has
exactly two bits in each reduction stage.
Table 5.2: Components for Wallace, Dadda, and Reduced Area multipliers
Method          (3,2) Counters      (2,2) Counters    CPA Length
Wallace         N² – 4N + 2 + S     > N               2N – 1 – S
Dadda           N² – 4N + 3         N – 1             2N – 2
Reduced Area    N² – 4N + 3 + S     N – 1             2N – 2 – S
For Wallace multipliers, the number of (3,2) counters is approximately N² – 4N +
2 + S. If N is greater than five, then there will be either one or two bits in the most
significant column of the final two-row matrix. In the former case one fewer (3,2) counter
is required. The number of (2,2) counters is at least N, and is often much greater than N.
Table 5.3 gives the (3,2) and (2,2) counter quantities and the word length of the
carry propagate adder for 8 by 8, 16 by 16, 32 by 32, and 64 by 64 Wallace, Dadda, and
Reduced Area multipliers. Compared to the Dadda multiplier, the N by N Reduced Area
multiplier requires 2(log₂(N) – 1) more (3,2) counters but has a final adder that is
2(log₂(N) – 1) bits smaller. An N by N Wallace multiplier requires one fewer (3,2) counter
than a Reduced Area multiplier and has a carry propagate adder that is one bit wider. The
Wallace multiplier also requires roughly N(2 log₂(N) – 5) (2,2) counters.
Table 5.3: Hardware for Wallace, Dadda, and Reduced Area multipliers
Reduction Strategy    (3,2) Counters    (2,2) Counters    Adder Length
Wallace (8 by 8)      38                15                11
Dadda (8 by 8)        35                7                 14
RA (8 by 8)           39                7                 10
Wallace (16 by 16)    200               54                25
Dadda (16 by 16)      195               15                30
RA (16 by 16)         201               15                24
Wallace (32 by 32)    906               164               55
Dadda (32 by 32)      899               31                62
RA (32 by 32)         907               31                54
Wallace (64 by 64)    3852              459               117
Dadda (64 by 64)      3843              63                126
RA (64 by 64)         3853              63                116
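The expressions in Table 5.2 can be evaluated directly and checked against the entries of Table 5.3. The sketch below is illustrative only. The function names are hypothetical, and the stage count S is obtained by counting the standard Dadda height targets (2, 3, 4, 6, 9, 13, ...) that lie below N, which is an assumption that is not stated in this section but reproduces S = 4, 6, 8, and 10 for the four word sizes. The Wallace (2,2) counter count has no closed form in Table 5.2, so it is returned as None.

```python
def reduction_stages(n):
    """Count the Dadda height targets (2, 3, 4, 6, 9, ...) that are below n."""
    stages, height = 0, 2
    while height < n:
        stages += 1
        height = (3 * height) // 2
    return stages


def table_5_2(n):
    """Evaluate the Table 5.2 expressions for an n by n multiplier (n > 6)."""
    s = reduction_stages(n)
    base = n * n - 4 * n
    return {
        "Wallace": {"(3,2)": base + 2 + s, "(2,2)": None, "CPA": 2 * n - 1 - s},
        "Dadda": {"(3,2)": base + 3, "(2,2)": n - 1, "CPA": 2 * n - 2},
        "Reduced Area": {"(3,2)": base + 3 + s, "(2,2)": n - 1, "CPA": 2 * n - 2 - s},
    }


if __name__ == "__main__":
    for n in (8, 16, 32, 64):
        print(n, reduction_stages(n), table_5_2(n))
```

For the four word sizes considered here, the evaluated (3,2) counter counts and adder lengths agree with Table 5.3, as do the (2,2) counter counts for the Dadda and Reduced Area multipliers.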
The carry lookahead adder designed for the multipliers used a maximum
lookahead logic block width of four. When needed, 1-bit, 2-bit, and 3-bit lookahead
logic blocks were available, so that no additional, unused hardware was included. Per
[57], the complexity of an N-bit carry lookahead adder implemented with 4-bit lookahead
logic blocks is approximately 1.4 times the complexity of N (3,2) counters.
Based on the accumulated data regarding the component counts of the multipliers,
Table 5.4 indicates, to a first order, the complexity of the D flip-flops, buffers, AND
gates, (3,2) counters, (2,2) counters, and the carry lookahead adder for N by N
multipliers. The D flip-flops and the adder length of the final two row matrix grow
linearly with N, the number of bits of the multiplier. For Dadda and Reduced Area
multipliers, the number of (2,2) counters increases proportionally with N. For Wallace
multipliers, the number of (2,2) counters grows faster than linearly with N, but slower
than N². Examination of several (2,2) counter quantities for different values of N shows
that the number of (2,2) counters is roughly proportional to N log N. The complexity of
the carry lookahead adder is proportional to its word length, which is
approximately proportional to N. Mainly, the area growth of the multipliers is dominated
by the N² term for (3,2) counters, the largest standard cells in the design. The buffers and
AND gates also contribute quadratic growth with N.
Table 5.4: Complexity of the multiplier components
Component                Wallace       Dadda     Reduced Area
D flip-flops             4N            4N        4N
Buffers                  O(N²)         O(N²)     O(N²)
AND gates                N²            N²        N²
(3,2) Counters           O(N²)         O(N²)     O(N²)
(2,2) Counters           O(N log N)    O(N)      O(N)
Carry Lookahead Adder    O(N)          O(N)      O(N)
Therefore, based on gate counts and the complexity of the multiplier components,
the following area estimates are offered:
For a given process technology and word size, the Wallace multipliers have the
largest area, the Dadda multipliers are smaller, and the Reduced Area multipliers are the
smallest. This is based on first examining the (3,2) counters, then the (2,2) counters, and
finally the carry lookahead adder. The Wallace and Reduced Area multipliers have
roughly equal numbers of (3,2) counters, but Wallace utilizes substantially more (2,2)
counters. If the (2,2) counters are approximately half the size of (3,2) counters, then the
number of (3,2) counters could be increased by half of the number of (2,2) counters when
considering area. For the 16 by 16 multipliers, the area of the Wallace multipliers is
formed by 227 (3,2) counter area equivalents versus 202.5 and 208.5 (3,2) counter area
equivalents for Dadda and Reduced Area multipliers, respectively. A modified full adder
in the carry lookahead adder is approximately the same size as a (3,2) counter. Based on
the adder length, the number of modified full adders could be rolled into the (3,2) counter
evaluation. Now, for a 16 by 16 multiplier, the relative area equivalents in (3,2) counters
would be 252, 232.5, and 232.5 for Wallace, Dadda, and Reduced Area multipliers,
respectively.
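The 16 by 16 figures quoted above follow from the Table 5.3 counts with a (2,2) counter weighted as half of a (3,2) counter and each adder bit weighted as one (3,2) counter. A minimal sketch of that arithmetic is given below; the names and the weight parameters are illustrative rather than taken from the original analysis.

```python
# (3,2) counters, (2,2) counters, adder length for the 16 by 16 multipliers (Table 5.3).
HARDWARE_16 = {
    "Wallace": (200, 54, 25),
    "Dadda": (195, 15, 30),
    "Reduced Area": (201, 15, 24),
}


def area_equivalents(fa, ha, adder_bits, ha_weight=0.5, adder_weight=1.0):
    """Return area in (3,2) counter equivalents: (counters only, counters plus adder)."""
    counters_only = fa + ha_weight * ha
    with_adder = counters_only + adder_weight * adder_bits
    return counters_only, with_adder


if __name__ == "__main__":
    for name, (fa, ha, bits) in HARDWARE_16.items():
        counters_only, with_adder = area_equivalents(fa, ha, bits)
        print(f"{name}: {counters_only} without adder, {with_adder} with adder")
```

Changing ha_weight or adder_weight shows how sensitive the ranking is to the assumed relative cell sizes; with the weights used in the text, the Dadda and Reduced Area totals coincide once the adder is included.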
For a given process technology, a low power standard cell library will yield
multipliers with smaller areas than those built using a generic standard cell library. The
typical approach to developing a low power cell library is to scale down CMOS gate
sizes, providing lower area, lower drive currents, and lower power consumption. Inspection of
the two 130 nm standard cell libraries shows that the (3,2) counters in the low power
library are 14% smaller in area than the (3,2) counters in the generic library. Therefore,
with area dominated by the (3,2) counters, it is expected that the multipliers in the low
power cell library will be approximately 14% smaller than those developed in the generic
cell library.
5.2 Area Measurements
All of the multipliers were placed and routed using NanoPlace and NanoRoute of
Cadence’s Encounter platform. Power and ground were configured to abut in order to
keep the area as compact as possible. No additional routing channels were provided.
Filler cells were added as needed to extend power and ground lines. Row utilization for
all of the multipliers was 95%. Visual inspection of each of the layouts showed dense,
compact, very regular layouts. There were no empty, gate-free sections of any significant
size. Though five or more layers of metal were available for each process, the small 8 by
8 multipliers were routed using only three layers of metal and the large 64 by 64
multipliers were routed using four layers of metal.
5.3 Area of Multipliers in the 250 nm Process Technology
Layouts of Wallace, Dadda, and Reduced Area multipliers were completed in the
250 nm process technology. The layout areas for the twelve multipliers are listed in
Table 5.5. Each doubling of the operand size from 8 by 8 through 64 by 64 increases the
total area by slightly less than a factor of four for all three types of multipliers. As shown
in Table 5.6, the Reduced Area multipliers are the smallest in size, with the Wallace
multipliers being 4% to 6% larger and the Dadda multipliers 0.3% to 6% larger.
Table 5.5: Layout areas for Wallace, Dadda, and Reduced Area
multipliers in the 250 nm process
Word Size    Wallace (µm²)    Dadda (µm²)    Reduced Area (µm²)
8 by 8       14,576           14,570         13,807
16 by 16     53,321           51,288         50,185
32 by 32     195,713          186,909        185,407
64 by 64     738,385          709,509        707,699
Table 5.6: Comparison of Wallace, Dadda, and Reduced Area
multiplier areas in the 250 nm process
Word Size    Wallace    Dadda      RA Multiplier Area (µm²)
8 by 8       + 5.6 %    + 5.5 %    13,807
16 by 16     + 6.2 %    + 2.2 %    50,185
32 by 32     + 5.6 %    + 0.8 %    185,407
64 by 64     + 4.3 %    + 0.3 %    707,699
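The percentages in Table 5.6 are the Table 5.5 areas expressed relative to the Reduced Area multiplier of the same word size. The following sketch shows the calculation; the function and variable names are illustrative, and rounding to one decimal place is assumed to match the table.

```python
# Layout areas in µm² for the 250 nm process (Table 5.5).
AREAS_250NM = {
    8: {"Wallace": 14576, "Dadda": 14570, "Reduced Area": 13807},
    16: {"Wallace": 53321, "Dadda": 51288, "Reduced Area": 50185},
    32: {"Wallace": 195713, "Dadda": 186909, "Reduced Area": 185407},
    64: {"Wallace": 738385, "Dadda": 709509, "Reduced Area": 707699},
}


def percent_larger(area, reference):
    """Percentage by which area exceeds reference, rounded to one decimal place."""
    return round(100.0 * (area - reference) / reference, 1)


def comparison_table(areas):
    """For each word size, compare Wallace and Dadda against Reduced Area."""
    table = {}
    for n, row in areas.items():
        ref = row["Reduced Area"]
        table[n] = {name: percent_larger(a, ref)
                    for name, a in row.items() if name != "Reduced Area"}
    return table


if __name__ == "__main__":
    for n, row in comparison_table(AREAS_250NM).items():
        print(f"{n} by {n}: {row}")
```

The same two functions apply unchanged to the area tables for the other process technologies.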
5.4 Area of Multipliers in the 180 nm Process Technology
Four Wallace, four Dadda, and four Reduced Area multipliers were completed in
the 180 nm process technology. The layout areas for each multiplier are listed in Table
5.7. Generally, the area differences were small as indicated in Table 5.8. Doubling the
operand size from 8 by 8 through 64 by 64 increases the total area by slightly less than a
factor of four for all three types of multipliers.
Table 5.7: Layout areas for Wallace, Dadda, and Reduced Area
multipliers in the 180 nm process
Word Size    Wallace (µm²)    Dadda (µm²)    Reduced Area (µm²)
8 by 8       8,400            8,421          7,990
16 by 16     30,221           29,174         28,551
32 by 32     109,880          105,230        104,386
64 by 64     412,456          397,137        396,111
Table 5.8: Comparison of Wallace, Dadda, and Reduced Area
multiplier areas in the 180 nm process
Word Size    Wallace    Dadda      RA Multiplier Area (µm²)
8 by 8       + 5.1 %    + 5.4 %    7,990
16 by 16     + 5.8 %    + 2.2 %    28,551
32 by 32     + 5.3 %    + 0.8 %    104,386
64 by 64     + 4.1 %    + 0.3 %    396,111
In all cases, the Reduced Area multipliers are the smallest. For the 16 by 16 and
larger word sizes, Wallace multipliers are larger than the Dadda multipliers as expected,
but for the 8 by 8 multipliers, the Dadda multiplier is very slightly larger. This is due to
the larger area of the carry lookahead adder swamping the area savings from fewer (3,2)
counters and (2,2) counters in the Dadda multiplier. Table 5.9 shows the sections where
there are area differences between the Wallace and Dadda multipliers; this includes the
areas for (3,2) counters and (2,2) counters used in partial product reduction and the carry
lookahead adder.
Table 5.9: Comparison of counter and CLA areas for 8 by 8
Wallace and Dadda multipliers
Multiplier        Area of (3,2) counters (µm²)    Area of (2,2) counters (µm²)    Area of CLA (µm²)
Wallace 8 by 8    2,655                           599                             1,746
Dadda 8 by 8      2,445                           279                             2,295
Difference        – 210                           – 320                           + 549
5.5 Area of Multipliers in the 130 nm Process Technology
For the 130 nm process technology, two standard cell libraries were used. First,
twelve multipliers were designed using the generic 130 nm cell library, herein designated
as “130g”. The areas for these twelve multipliers are given in Table 5.10. For a given
word size, the three types of multipliers are very close in size. For all three types of
multipliers, doubling the operand size increases the total area by slightly less than a factor
of four.
Table 5.11 compares Wallace and Dadda multipliers to the Reduced Area
multipliers.
The Wallace and Dadda multipliers are larger than the Reduced Area
multipliers.
For the 8 by 8 multipliers, the Dadda multiplier is slightly larger than the
Wallace multiplier due to using the larger carry lookahead adder. For the larger word
sizes, the higher number of reduction stage counters in the Wallace multipliers dominates
the total area resulting in the Wallace multipliers being the largest implementations.
Table 5.10: Layout areas for Wallace, Dadda, and Reduced Area
multipliers in the generic 130 nm cell library
Word Size    Wallace (µm²)    Dadda (µm²)    Reduced Area (µm²)
8 by 8       4,388            4,428          4,181
16 by 16     15,661           15,161         14,811
32 by 32     56,584           54,257         53,783
64 by 64     211,551          203,784        203,207
Table 5.11: Comparison of Wallace, Dadda, and Reduced Area
multiplier areas in the generic 130 nm cell library
Word Size    Wallace    Dadda      RA Multiplier Area (µm²)
8 by 8       + 5.0 %    + 5.9 %    4,181
16 by 16     + 5.7 %    + 2.4 %    14,811
32 by 32     + 5.2 %    + 0.9 %    53,783
64 by 64     + 4.1 %    + 0.3 %    203,207
Using the same 130 nm process technology, twelve column compression
multipliers were implemented using a standard cell library designed specifically for low
power performance. Herein, this low power cell library is designated as “130p.” Each
low power cell contains the same logic as the generic cell, but with CMOS gate
geometries down-sized for reduced power consumption. Table 5.12 lists the areas of the
twelve multipliers.
Table 5.13 reports the percentage by which Wallace or Dadda
multipliers are larger than Reduced Area multipliers.
Table 5.12: Layout areas for Wallace, Dadda, and Reduced Area
multipliers in the low power 130 nm cell library
Word Size    Wallace (µm²)    Dadda (µm²)    Reduced Area (µm²)
8 by 8       3,493            3,515          3,339
16 by 16     12,739           12,353         12,103
32 by 32     46,867           45,104         44,766
64 by 64     177,318          171,474        171,060
Table 5.13: Comparison of Wallace, Dadda, and Reduced Area
multiplier areas in the low power 130 nm cell library
Word Size    Wallace    Dadda      RA Multiplier Area (µm²)
8 by 8       + 4.6 %    + 5.3 %    3,339
16 by 16     + 5.2 %    + 2.1 %    12,103
32 by 32     + 4.7 %    + 0.8 %    44,766
64 by 64     + 3.7 %    + 0.2 %    171,060
Table 5.14 reports the area differences between the multipliers in the 130g cell
library and the multipliers in the 130p cell library. The multipliers developed in the
130p cell library are 16% to 21% smaller than the multipliers created in the 130g cell
library.
Table 5.14: Percentage by which multipliers in the 130p cell library
are smaller than multipliers in the 130g cell library
Word Size    Wallace    Dadda    Reduced Area
8 by 8       20%        21%      20%
16 by 16     19%        19%      18%
32 by 32     17%        17%      17%
64 by 64     16%        16%      16%
5.6 Area of Multipliers in the 90 nm Process Technology
Layouts of four Wallace, four Dadda, and four Reduced Area multipliers were
completed in the 90 nm process technology. The layout areas for these twelve multipliers
are listed in Table 5.15. Doubling the operand size from 8 by 8 through 64 by 64
increases the total area by slightly less than a factor of four for all three types of
multipliers. As shown in Table 5.16, the Reduced Area multipliers are the smallest in
size, with the Wallace multipliers being 4% to 6% larger and the Dadda multipliers 0.3%
to 6% larger.
Table 5.15: Layout areas for Wallace, Dadda, and Reduced Area
multipliers in the 90 nm process
Word Size    Wallace (µm²)    Dadda (µm²)    Reduced Area (µm²)
8 by 8       2,157            2,170          2,051
16 by 16     7,717            7,444          7,277
32 by 32     27,953           26,752         26,524
64 by 64     104,780          100,720        100,445
Table 5.16: Comparison of Wallace, Dadda, and Reduced Area
multiplier areas in the 90 nm process
Word Size    Wallace    Dadda      RA Multiplier Area (µm²)
8 by 8       + 5.2 %    + 5.8 %    2,051
16 by 16     + 6.0 %    + 2.3 %    7,277
32 by 32     + 5.4 %    + 0.9 %    26,524
64 by 64     + 4.3 %    + 0.3 %    100,445
5.7 Area Comparisons
For a given process technology, standard cell library, and word size, the area
differences among the Wallace, Dadda, and Reduced Area multipliers are small; the
largest are at most 6% bigger than the smallest. For multipliers larger than 8 by 8,
Wallace multipliers will be the largest and Reduced Area multipliers the smallest. For 8
by 8 multipliers, Dadda multipliers will be the largest, due to the area of the carry
lookahead adder.
Figure 5.2 shows layout areas for all of the Dadda multipliers
developed in this research. For each process, all of the multiplier areas show slightly less
than quadratic growth with increases in N.
[Plot: Areas of Dadda Multipliers. Area (µm²), 0 to 800,000, versus Word Size, N, 0 to 70, with one curve each for Dadda 250nm, Dadda 180nm, Dadda 130g, Dadda 130p, and Dadda 90nm.]
Figure 5.2: Dadda multiplier areas using different process technologies
and standard cell libraries
Tables 5.17, 5.18, and 5.19 give the normalized area calculations for Wallace,
Dadda, and Reduced Area multipliers in the generic libraries of the 250 nm, 180 nm, 130
nm, and 90 nm CMOS process technologies. The area for each multiplier is normalized
to the area of the 8 by 8 multiplier in that particular process technology.
These
normalized areas show that doubling the operand size increases the total area by slightly
less than a factor of four.
This occurs because most, but not all, elements of the
multiplier area are increasing quadratically with the operand size.
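The "slightly less than a factor of four" behavior can be quantified from the measured areas. The sketch below uses the 250 nm Dadda areas from Table 5.5 to compute the ratio for each doubling of N and an effective growth exponent over the whole range; the function names and the notion of an effective exponent are introduced here only for illustration.

```python
import math

# Dadda layout areas in µm² for the 250 nm process (Table 5.5).
DADDA_250NM = {8: 14570, 16: 51288, 32: 186909, 64: 709509}


def doubling_ratios(areas):
    """Area ratio for each successive doubling of the word size."""
    sizes = sorted(areas)
    return [areas[b] / areas[a] for a, b in zip(sizes, sizes[1:])]


def effective_exponent(areas):
    """Exponent p such that area grows as N**p between the smallest and largest N."""
    sizes = sorted(areas)
    lo, hi = sizes[0], sizes[-1]
    return math.log(areas[hi] / areas[lo]) / math.log(hi / lo)


if __name__ == "__main__":
    print([round(r, 2) for r in doubling_ratios(DADDA_250NM)])
    print(round(effective_exponent(DADDA_250NM), 2))
```

A purely quadratic design would give ratios of exactly 4 and an exponent of exactly 2. The measured ratios fall below 4 and move toward it as N grows, which is consistent with linear-in-N components making up a shrinking share of the total.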
Table 5.17: Wallace multiplier areas relative to each process’s 8 by 8 case
              Normalized    Normalized    Normalized    Normalized
              Area          Area          Area          Area
Multiplier    250 nm        180 nm        130 nm        90 nm
8 by 8        1.0           1.0           1.0           1.0
16 by 16      3.7           3.6           3.6           3.6
32 by 32      13.4          13.1          12.9          13.0
64 by 64      50.7          49.1          48.2          48.6
Table 5.18: Dadda multiplier areas relative to each process’s 8 by 8 case
              Normalized    Normalized    Normalized    Normalized
              Area          Area          Area          Area
Multiplier    250 nm        180 nm        130 nm        90 nm
8 by 8        1.0           1.0           1.0           1.0
16 by 16      3.5           3.5           3.4           3.4
32 by 32      12.8          12.5          12.3          12.3
64 by 64      48.7          47.2          46.0          46.4
Table 5.19: Reduced Area multiplier areas relative to each process’s 8 by 8 case
              Normalized    Normalized    Normalized    Normalized
              Area          Area          Area          Area
Multiplier    250 nm        180 nm        130 nm        90 nm
8 by 8        1.0           1.0           1.0           1.0
16 by 16      3.6           3.6           3.5           3.5
32 by 32      13.4          13.1          12.9          12.9
64 by 64      51.3          49.6          48.6          49.0
Tables 5.20, 5.21, and 5.22 report the area ratios of the multipliers in the 250 nm,
180 nm, and 130 nm processes to the 90 nm multiplier of the same word size. These area
ratios show that transitioning from 250 nm to 180 nm to 130 nm decreases multiplier area
by slightly less than a factor of 2 with each process transition, with the average area
reduction with each respective step being slightly less than (0.25/0.18)² and slightly more
than (0.18/0.13)².
Transitioning from 130 nm to 90 nm decreases multiplier area by
approximately a factor of two, which is slightly less than (0.13/0.09)².
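The comparison between measured and ideal scaling can be reproduced from the area tables. The sketch below uses the 64 by 64 Dadda areas (Tables 5.5, 5.7, 5.10, and 5.15, with the generic library for 130 nm) as one example; the names are illustrative, and other word sizes or multiplier types give somewhat different measured ratios.

```python
# Minimum feature size in µm for each process.
FEATURE_UM = {"250 nm": 0.25, "180 nm": 0.18, "130 nm": 0.13, "90 nm": 0.09}

# 64 by 64 Dadda layout areas in µm² (130 nm is the generic library).
DADDA_64 = {"250 nm": 709509, "180 nm": 397137, "130 nm": 203784, "90 nm": 100720}


def ideal_ratio(old, new):
    """Area ratio predicted by scaling every dimension with feature size."""
    return (FEATURE_UM[old] / FEATURE_UM[new]) ** 2


def measured_ratio(old, new, areas=DADDA_64):
    """Area ratio observed between two layouts of the same multiplier."""
    return areas[old] / areas[new]


if __name__ == "__main__":
    steps = [("250 nm", "180 nm"), ("180 nm", "130 nm"), ("130 nm", "90 nm")]
    for old, new in steps:
        print(f"{old} to {new}: ideal {ideal_ratio(old, new):.2f}, "
              f"measured {measured_ratio(old, new):.2f}")
```

The measured cumulative ratio from 250 nm to 90 nm is the product of the three step ratios and corresponds to the 250 nm to 90 nm column of the tables that follow.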
Table 5.20: Wallace multiplier areas relative to 90 nm
Multiplier    250 nm to 90 nm    180 nm to 90 nm    130 nm to 90 nm
8 by 8        6.8                3.9                2.0
16 by 16      6.9                3.9                2.0
32 by 32      7.0                3.9                2.0
64 by 64      7.0                3.9                2.0
Table 5.21: Dadda multiplier areas relative to 90 nm
Multiplier    250 nm to 90 nm    180 nm to 90 nm    130 nm to 90 nm
8 by 8        6.7                3.9                2.0
16 by 16      6.9                3.9                2.0
32 by 32      7.0                3.9                2.0
64 by 64      7.0                3.9                2.0
Table 5.22: Reduced Area multiplier areas relative to 90 nm
Multiplier    250 nm to 90 nm    180 nm to 90 nm    130 nm to 90 nm
8 by 8        6.7                3.9                2.0
16 by 16      6.9                3.9                2.0
32 by 32      7.0                3.9                2.0
64 by 64      7.0                3.9                2.0
Figures 5.3, 5.4, and 5.5 show area pie charts of each type of multiplier. Clearly,
as expected, the (3,2) counters form the biggest portion of each multiplier’s area, ranging
from 30% of the area of an 8 by 8 Dadda multiplier to 69% of the area of a 64 by 64
Reduced Area multiplier. This significant area contribution indicates that any efforts to
appreciably reduce the area of column compression multipliers should be targeted at
minimizing the size of the (3,2) counters.
[Four pie charts: 8 by 8, 16 by 16, 32 by 32, and 64 by 64 Wallace Multiplier Area, each divided among the (3,2) counters, (2,2) counters, AND2 gates, DFFs, buffers (BUFF), CLA, and filler cells.]
Figure 5.3: Area pie charts of Wallace multipliers
[Four pie charts: 8 by 8, 16 by 16, 32 by 32, and 64 by 64 Dadda Multiplier Area, each divided among the (3,2) counters, (2,2) counters, AND2 gates, DFFs, buffers (BUFF), CLA, and filler cells.]
Figure 5.4: Area pie charts of Dadda multipliers
Figure 5.5: Area pie charts of Reduced Area multipliers (panels: 8 by 8, 16 by 16, 32 by 32, and 64 by 64 RA Multiplier Area; slices: (3,2), (2,2), AND2, CLA, DFF, BUFF, Filler)
At the beginning of this chapter, it was predicted that the area of the column
compression multipliers could be estimated by kλ²N², where λ is the process’s minimum
feature size, N is the word size, and k is a constant scaling factor. This prediction was
based on observing the large gate count of the (3,2) counters. The pie charts show that
for the 16 by 16 and smaller multipliers, the area of the (3,2) counters is significant but
not dominant, only accounting for 30% to 50% of the total area. Even adding in the areas
of the other gates whose count grows as O(N²), such as AND gates and buffers, only
accounts for 40% to 65% of multiplier area. For the 16 by 16 and smaller multipliers,
the remaining components, such as the final carry-propagate adder, contribute as much as
35% to 60% of the overall area. This indicates that predicting area to grow as N² does not
completely describe the smaller multipliers. An equation estimating area for column
compression multipliers needs to include both an N² term and an N term. Table 5.23
details the percentage that the O(N²) cells, the carry lookahead adder, and the remaining
cells contribute to overall multiplier area.
Table 5.23: Breakdown of multiplier areas by components
Multiplier   Word Size   % of Area formed   % of Area formed   % of Area formed
                         by O(N²) cells     by CLA             by remaining cells
Wallace      8 by 8      42%                22%                36%
Dadda        8 by 8      40%                28%                32%
RA           8 by 8      46%                21%                33%
Wallace      16 by 16    62%                14%                24%
Dadda        16 by 16    62%                18%                20%
RA           16 by 16    65%                13%                22%
Wallace      32 by 32    74%                8%                 18%
Dadda        32 by 32    77%                10%                13%
RA           32 by 32    78%                9%                 13%
Wallace      64 by 64    81%                5%                 14%
Dadda        64 by 64    85%                5%                 10%
RA           64 by 64    85%                5%                 10%
A least squares method was applied to the measured multiplier areas to calculate
quadratic approximations in the form of
Area ≅ k1 λ² N² + k2 λ² N + b λ²    (5.1)
where λ is the process’s minimum feature size, N is the word size, k1 and k2 are
coefficients, and b is a constant. Equations 5.2, 5.3, and 5.4 provide area approximations
for each type of column compression multiplier, with λ in units of nanometers.
Comparing these estimated areas from the equations to the measured areas, the error
ranges for the Wallace, Dadda, and Reduced Area data sets are 8.9% to -4.6%, 10.2% to
-4.1%, and 10.1% to -3.9%, respectively. The approximation equations provide the best
estimates to the areas measured for the 180 nm, 130 nm, and 90 nm processes and
libraries, with the error for the Wallace, Dadda, and Reduced Area data sets ranging from
-0.1% to -4.6%, -0.2% to -4.1%, and -0.1% to -3.9%, respectively.
Area_Wallace ≅ 0.00283 λ² N² + 0.015 λ² N - 0.0472 λ²    (5.2)
Area_Dadda ≅ 0.00275 λ² N² + 0.0122 λ² N - 0.0166 λ²    (5.3)
Area_RA ≅ 0.00276 λ² N² + 0.00114 λ² N - 0.0246 λ²    (5.4)
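The fitting step behind the form of Equation 5.1 can be sketched as follows. This is only an illustration under stated assumptions: it performs an ordinary least squares fit on area normalized by λ² (the exact weighting used in this research is not stated), the function names are hypothetical, and the sample points are synthetic values generated from Equation 5.5 rather than the measured layout areas.

```python
# Sketch: ordinary least squares for Area / λ² ≈ k1·N² + k2·N + b.
# ASSUMPTION: the fit is done on area normalized by λ²; names are illustrative.
# The sample points are SYNTHETIC (generated from Equation 5.5), not measured data.

def det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def fit_quadratic_area(samples):
    """samples: iterable of (lambda_nm, N, area_um2). Returns (k1, k2, b)."""
    # Accumulate the normal equations (XᵀX) c = Xᵀy, with rows X = [N², N, 1]
    # and targets y = area / λ².
    xtx = [[0.0] * 3 for _ in range(3)]
    xty = [0.0] * 3
    for lam, n, area in samples:
        row = (float(n * n), float(n), 1.0)
        y = area / (lam * lam)
        for i in range(3):
            xty[i] += row[i] * y
            for j in range(3):
                xtx[i][j] += row[i] * row[j]
    # Solve the 3x3 system with Cramer's rule.
    d = det3(xtx)
    coeffs = []
    for col in range(3):
        m = [r[:] for r in xtx]
        for i in range(3):
            m[i][col] = xty[i]
        coeffs.append(det3(m) / d)
    return tuple(coeffs)


def area_eq_5_5(lam, n):
    """Equation 5.5, used here only to generate synthetic sample points."""
    return lam * lam * (0.00277 * n * n + 0.0129 * n - 0.0295)


synthetic = [(lam, n, area_eq_5_5(lam, n))
             for lam in (250, 180, 130, 90) for n in (8, 16, 32, 64)]

if __name__ == "__main__":
    print(fit_quadratic_area(synthetic))
```

Because the synthetic points lie exactly on the quadratic, the fit should return the coefficients of Equation 5.5 to within floating-point error; with measured areas the residuals would be nonzero and would correspond to the error ranges quoted above.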
Equation 5.5 is the general, combined form of a quadratic area approximation for
any of the three types of column compression multiplier.
Area_CCmultiplier ≅ 0.00277 λ² N² + 0.0129 λ² N - 0.0295 λ²    (5.5)
Table 5.24 reports the differences between the measured areas and the estimated
areas calculated using Equation 5.5 for the Wallace, Dadda, and Reduced Area
multipliers, respectively. Note that Equation 5.5 overestimates areas by 2.8% to 13.6%
for all of the multipliers in the 250 nm process and underestimates areas by 0.9% to 7.0%
for all of the multipliers in the 90 nm process. The general area approximations differ the
most from the measured areas for the Reduced Area multipliers in 250 nm,
overestimating from 7.2% for the 64 by 64 multipliers to 13.6% for the 8 by 8
multipliers. For the other three process technologies, Equation 5.5 provides a very good
area estimate, with error percentages ranging between 1.8% and -7.0%.
Table 5.24: Comparison of estimated areas using general quadratic
approximation versus measured areas of the multipliers
Multiplier   Word Size   250 nm    180 nm   130 nm   90 nm
Wallace      8 by 8      +7.6%     -3.2%    -3.3%    -5.8%
Dadda        8 by 8      +7.7%     -3.4%    -4.2%    -6.3%
RA           8 by 8      +13.6%    +1.8%    +1.4%    -0.9%
Wallace      16 by 16    +3.9%     -5.0%    -4.4%    -7.0%
Dadda        16 by 16    +8.0%     -1.6%    -1.2%    -3.6%
RA           16 by 16    +10.3%    +0.5%    +1.1%    -1.4%
Wallace      32 by 32    +2.8%     -5.1%    -3.8%    -6.7%
Dadda        32 by 32    +7.7%     -0.9%    +0.3%    -2.5%
RA           32 by 32    +8.5%     -0.1%    +1.2%    -1.7%
Wallace      64 by 64    +2.8%     -4.6%    -3.0%    -6.1%
Dadda        64 by 64    +7.0%     -0.9%    +0.7%    -2.4%
RA           64 by 64    +7.2%     -0.7%    +1.0%    -2.1%
The area estimates for the process geometries that are smaller than 250 nm can be
improved by removing the 250 nm area data from the calculations. This is a valid option
because the 250 nm cell library belongs to an architecturally different family of cell
libraries. The 180 nm, 130 nm, and 90 nm cell libraries are all part of the same design
family. Equations 5.6, 5.7, 5.8, and 5.9 can be used to approximate area for designs in
180 nm or smaller process geometries.
Area_Wallace, ≤180 nm ≅ 0.00288 λ² N² + 0.0156 λ² N - 0.0479 λ²    (5.6)
Area_Dadda, ≤180 nm ≅ 0.00280 λ² N² + 0.0128 λ² N - 0.0169 λ²    (5.7)
Area_RA, ≤180 nm ≅ 0.00280 λ² N² + 0.0120 λ² N - 0.0252 λ²    (5.8)
Area_CCmultiplier, ≤180 nm ≅ 0.00283 λ² N² + 0.0134 λ² N - 0.0300 λ²    (5.9)
Comparing the area approximations from Equations 5.6, 5.7, and 5.8 to the
measured areas, the error ranges for the Wallace, Dadda, and Reduced Area data sets are
1.8% to -1.9%, 1.8% to -1.6%, and 1.6% to -1.6%, respectively. When the 250 nm data
is included in the development of the equations, the magnitude of the error is as high as
4.6% for multipliers in the 180 nm and smaller geometries. Excluding the 250 nm data
allows the error for area approximations to be within ± 2%.
Table 5.25 reports the differences between the measured areas and the estimated
areas calculated using Equation 5.9 for the Wallace, Dadda, and Reduced Area
multipliers, respectively. For the 180 nm and smaller process technologies, Equation 5.9
provides an excellent area estimate, with error percentages ranging from 4.8% to
-4.6%.
Table 5.25: Comparison of general area approximations for geometries
≤ 180 nm versus measured areas of the multipliers
Multiplier   Word Size   180 nm   130 nm   90 nm
Wallace      8 by 8      -0.4%    -0.5%    -3.0%
Dadda        8 by 8      -0.6%    -1.4%    -3.6%
RA           8 by 8      +4.8%    +4.4%    +2.0%
Wallace      16 by 16    -2.6%    -1.9%    -4.6%
Dadda        16 by 16    +0.9%    +1.3%    -1.1%
RA           16 by 16    +3.1%    +3.7%    +1.2%
Wallace      32 by 32    -2.8%    -1.5%    -4.5%
Dadda        32 by 32    +1.5%    +2.7%    -0.2%
RA           32 by 32    +2.3%    +3.6%    +0.7%
Wallace      64 by 64    -2.4%    -0.8%    -4.0%
Dadda        64 by 64    +1.3%    +3.0%    -0.1%
RA           64 by 64    +1.6%    +3.3%    +0.2%
Using Equations 5.6, 5.7, 5.8, and 5.9, it is possible to predict the areas of column
compression multipliers in smaller process geometries. Table 5.26 lists the predicted
areas for Wallace, Dadda, and Reduced Area multipliers in a 65 nm process technology.
The approximate area of the generalized column compression multiplier is also given.
Table 5.26: Predicted areas for column compression multipliers
in a 65 nm process technology
Word Size   Wallace (µm²)   Dadda (µm²)   Reduced Area (µm²)   General CC Multiplier (µm²)
8 by 8      1,104           1,118         1,056                1,091
16 by 16    3,967           3,822         3,733                3,840
32 by 32    14,367          13,773        13,630               13,929
64 by 64    53,856          51,845        51,594               52,471
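As an illustration of how the predictions in Table 5.26 follow from Equations 5.6 to 5.9, the sketch below evaluates the four quadratic approximations at λ = 65 nm. The coefficients are copied from the equations; the dictionary and function names are hypothetical and chosen only for this example.

```python
# Coefficients (k1, k2, b) copied from Equations 5.6 to 5.9.
# The names AREA_COEFFS and predicted_area_um2 are illustrative only.
AREA_COEFFS = {
    "Wallace": (0.00288, 0.0156, -0.0479),       # Equation 5.6
    "Dadda": (0.00280, 0.0128, -0.0169),         # Equation 5.7
    "Reduced Area": (0.00280, 0.0120, -0.0252),  # Equation 5.8
    "General": (0.00283, 0.0134, -0.0300),       # Equation 5.9
}


def predicted_area_um2(kind, word_size, feature_nm):
    """Evaluate Area ≅ λ²(k1·N² + k2·N + b) with λ in nanometers."""
    k1, k2, b = AREA_COEFFS[kind]
    n = word_size
    return feature_nm ** 2 * (k1 * n * n + k2 * n + b)


if __name__ == "__main__":
    for n in (8, 16, 32, 64):
        row = [round(predicted_area_um2(kind, n, 65)) for kind in AREA_COEFFS]
        print(f"{n} by {n}:", row)
```

Rounded to the nearest square micron, these values agree with the entries of Table 5.26, which also confirms that λ expressed in nanometers yields area in µm².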
5.8 Area Summary
In this research, the actual layouts for the Wallace, Dadda, and Reduced Area
multipliers successfully establish the area differences among the three multiplier types.
Where only gate-level area estimates existed in previous research, it has been shown with
timing-driven placed and routed designs that Wallace multipliers are generally the largest
of the three multipliers and Reduced Area multipliers the smallest.
This order of
multiplier size exists across process technologies and for various word sizes greater than
8 by 8. Careful examination of the data suggests that this will hold for multipliers even
larger than 64 by 64.
Area for column compression multipliers is reduced by a factor of approximately 0.5
for each generational transition that scales the process minimum feature size, λ, by
approximately 1/√2. This 50% area reduction is supported by the MOSFET scaling rules
outlined by Dennard, et al. [68].
Timing-driven placement of column compression multipliers offers the
opportunity to achieve extremely high row utilization, creating very compact designs.
All placements of the multipliers achieved a 95% row utilization. In previous research
[41], placement algorithms had only the connectivity from the Verilog netlist to indicate
possible nearest neighbors in order to direct the cell placement.
This type of
connectivity-guided placement often yielded 5%-10% lower row utilizations with extra
filler spacing required for additional routing tracks.
For N by N multipliers, the actual layouts also confirm the dominance of the
multiplier components whose complexity is roughly proportional to N². As the largest
component in the designs, the (3,2) counters, with first order complexity of O(N²), are a
major portion of the overall area. The AND gates and buffers also contribute to quadratic
growth with N. Across different process technologies, doubling the operand size will
increase the total area by a factor of somewhat less than four for each type of multiplier.
The area for column compression multipliers has been estimated in terms of the
word size, N, and the process’s minimum feature size, λ. In order to minimize the error
in an area approximation, the expression must include both N² and N terms.
Chapter 6
Multiplier Delay
Column compression multipliers are often cited for their high speed. In the
literature, most of the high-speed multipliers are implemented as fully custom designs. In
such cases, design engineers expend significant time and effort manually constructing
layouts to minimize routing loads and optimize timing. A relatively minor change in the
design, such as increasing the word size or moving to a different process technology, can
require a time-consuming, major redesign of the multiplier.
The automation of column compression multiplier development yields not only
compact layouts as discussed in Chapter 5, but also fast and consistent delay times.
Though they differ slightly by the number and significantly by the method of application
of (3,2) and (2,2) counters, intuitively, Wallace, Dadda, and Reduced Area multipliers
should have approximately equal delay times for a given word size and process
technology. This is due to the three multipliers using the same number of reduction
stages. It is possible for some word sizes of Wallace and Reduced Area multipliers to be
slightly faster than Dadda multipliers if their final carry-propagate adder is faster. If a
carry lookahead adder is used to implement the carry-propagate adder, this will be a
minor effect.
As Dadda demonstrates in the generation of his sequence of intermediate column
heights, the number of reduction stages is proportional to the logarithm of the word size,
N. With the partial product reduction dominating multiplier delay, the total delay for
the column compression multipliers is expected to be proportional to the logarithm of N
for an N-bit multiplier.
As process technologies scale down, overall multiplier delays are expected to
decrease in proportion to λ, where λ is the minimum feature size. At the smaller
technology features, the question becomes whether route parasitics will begin to have a
greater, more negative effect on timing.
The post-layout delay simulations of this
research will show the impact of parasitics for 250 nm to 90 nm process technologies.
This chapter presents the results of delay analysis using the Common Timing
Engine within Cadence’s Encounter platform.
The designs of each multiplier were
placed and routed in the standard cell libraries of four CMOS process technologies: 1)
250 nm, 2.5 V, 2) 180 nm, 1.8 V, 3) 130 nm, 1.2 V, and 4) 90 nm, 1.0 V. The worst case
delays of Wallace, Dadda, and Reduced Area multipliers are examined both with and
without the back-annotation of parasitic resistances and capacitances extracted from the
layouts. Before the actual delay values are reported, a simple analysis of delays is
provided to predict the trends in multiplier speed as a function of the input word sizes and
process technologies.
6.1 Delay Estimation
The total delay of a column compression multiplier is the sum of delays through
1) input signal buffering, 2) partial product array formation, 3) the reduction, and 4) the
final carry propagate adder (assumed to be a carry lookahead adder for this research).
Since the (3,2) and (2,2) counters are applied in parallel in each stage, the delay of each
stage is one (3,2) counter delay. Equations (6.1), (6.2), and (6.3) give simple expressions
for the total delay of Wallace, Dadda, and Reduced Area multipliers:
Delay_Wallace,NxN = t_buffer + t_AND + S⋅t_(3,2) + t_CPA(2N-1-S)    (6.1)
Delay_Dadda,NxN = t_buffer + t_AND + S⋅t_(3,2) + t_CPA(2N-2)    (6.2)
Delay_RA,NxN = t_buffer + t_AND + S⋅t_(3,2) + t_CPA(2N-2-S)    (6.3)
where:
S is the number of reduction stages, S ≅ log_1.4 N, and
t_CPA(n) is the delay of a carry propagate adder of length n.
For example, a 20 by 20 Dadda multiplier uses seven reduction stages. The total delay
for the 20 by 20 Dadda multiplier implemented with a carry lookahead adder is:
Delay_Dadda,20x20 = t_buffer + t_AND + 7⋅t_(3,2) + t_CLA(38)    (6.4)
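The stage count in the 20 by 20 example can be checked with a short sketch. It assumes the standard Dadda height sequence, in which each intermediate column height is floor(1.5 × previous height) starting from 2, and counts how many of those heights lie below the word size. The function names are hypothetical.

```python
def dadda_heights(word_size):
    """Intermediate column heights 2, 3, 4, 6, 9, 13, ... that are below word_size.

    ASSUMPTION: the standard Dadda sequence, next = floor(1.5 * current).
    """
    heights = []
    h = 2
    while h < word_size:
        heights.append(h)
        h = h * 3 // 2
    return heights


def reduction_stages(word_size):
    """Number of reduction stages S for an N by N column compression multiplier."""
    return len(dadda_heights(word_size))


if __name__ == "__main__":
    n = 20
    print(dadda_heights(n))      # heights targeted by successive stages
    print(reduction_stages(n))   # S, the multiplier on t_(3,2) in Equation 6.4
    print(2 * n - 2)             # Dadda final adder length from Equation 6.2
```

For N = 20 the heights below the word size are 2, 3, 4, 6, 9, 13, and 19, giving the seven stages and the 38-bit adder used in Equation 6.4.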
For the word sizes and process technologies used in this research, the Wallace,
Dadda, and Reduced Area multipliers will have approximately the same total delays.
Examining Equations (6.1), (6.2), and (6.3), the delays through the three types of
multipliers are equal until the data flow reaches the final carry propagate adder. Wallace
and Reduced Area multipliers require smaller final carry propagate adders than Dadda
multipliers. Depending on the implementation of the final adder, the adder length may
make a small difference in the total delays amongst the three multipliers.
For carry lookahead adders, there are predictable points at which delay will be
increased due to the addition of a new lookahead logic level. If 4-bit lookahead blocks
are used, as in this research, these occur when the adder length exceeds 4^k, for integer
values of k. A few of the adder lengths where the increase from one length to the next
adds the delay of one lookahead logic block are from 4 to 5, from 16 to 17, from 64
to 65, and from 256 to 257. To see an impact from differing numbers of lookahead
levels, one would need to look at 34 by 34 Wallace, Dadda, and Reduced Area
multipliers. The adder lengths would be 59, 66, and 58, respectively. The carry
lookahead adder used for the 34 by 34 Dadda multiplier would have four levels of
lookahead logic, whereas the Wallace and Reduced Area multipliers would have three
levels of lookahead logic.
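The 34 by 34 case can be worked through with a small sketch that combines the adder lengths from Equations (6.1) to (6.3) with a count of lookahead levels for 4-bit blocks. The function names are hypothetical, and the stage count again assumes the standard Dadda height sequence (floor(1.5 × previous height), starting from 2).

```python
def reduction_stages(word_size):
    """Stages S, counted from the Dadda height sequence 2, 3, 4, 6, 9, ...

    ASSUMPTION: next height = floor(1.5 * current height).
    """
    stages, height = 0, 2
    while height < word_size:
        stages += 1
        height = height * 3 // 2
    return stages


def final_adder_lengths(word_size):
    """Final carry propagate adder lengths taken from Equations (6.1) to (6.3)."""
    n, s = word_size, reduction_stages(word_size)
    return {
        "Wallace": 2 * n - 1 - s,
        "Dadda": 2 * n - 2,
        "Reduced Area": 2 * n - 2 - s,
    }


def lookahead_levels(adder_length, block_width=4):
    """Smallest k such that block_width**k >= adder_length."""
    levels, span = 0, 1
    while span < adder_length:
        span *= block_width
        levels += 1
    return levels


if __name__ == "__main__":
    for kind, length in final_adder_lengths(34).items():
        print(kind, length, lookahead_levels(length))
```

With N = 34 the sketch gives adder lengths of 59, 66, and 58, and only the 66-bit Dadda adder crosses the 64-bit boundary into a fourth lookahead level.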
For a given word size and process technology, the delays of the Wallace, Dadda,
and Reduced Area multipliers are proportional to log(N). Though different numbers of
(3,2) and (2,2) counters are applied within each reduction stage, the number of reduction
stages for each type of multiplier is the same.
The number of reduction stages is
proportional to the logarithm of the word size. Especially for large values of N, the delay
through the reduction stages dominates the overall multiplier delay.
Therefore, the
overall multiplier delay is approximately proportional to the logarithm of the word size.
Using simple analytic models [23], it is possible to predict a gate’s delay as an RC
delay expressed in terms of channel width, W, channel length, L, supply voltage, V,
current, I, and gate-oxide thickness, t_ox:
t_gate ~ R C ~ (V / I) C    (6.5)
Using the gate capacitance and current expressions
C ~ W L / t_ox    (6.6)
and
I ~ (W / L) (1 / t_ox) V²    (6.7)
the approximation for gate delay becomes
t_gate ~ (W L / t_ox) (L t_ox / W) (V / V²)    (6.8)
cancelling the channel width, the supply voltage, and the gate-oxide thickness terms
yields
t_gate ~ L² / V    (6.9)
To a first order, with both channel length and supply voltage proportional to λ, the
process’s minimum feature size, a gate’s propagation delay is proportional to λ. Since
each multiplier’s delay is the sum of gate delays, multiplier delay is also
proportional to λ.
Based on these predictions of the effects of the word size, N, and process
geometry, λ, on delay, a rough estimate of column compression multiplier delay is
Delay_CCMultiplier ≈ k λ (log N)    (6.10)
where k is a constant scaling factor.
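Because the constant k is unknown at this point, Equation (6.10) is most useful for comparing two design points, where k cancels. The sketch below computes such a ratio; the function name is hypothetical and the model is only the rough first-order estimate described above, not a substitute for timing analysis.

```python
import math


def relative_delay(feature_a_nm, word_a, feature_b_nm, word_b):
    """Ratio Delay_a / Delay_b under Delay ≈ k·λ·log N (the constant k cancels)."""
    return (feature_a_nm * math.log(word_a)) / (feature_b_nm * math.log(word_b))


if __name__ == "__main__":
    # Same word size, 250 nm versus 90 nm: the ratio reduces to 250/90.
    print(relative_delay(250, 8, 90, 8))
    # Same process, 64 by 64 versus 8 by 8: the ratio reduces to log 64 / log 8.
    print(relative_delay(90, 64, 90, 8))
```

Under this estimate, porting a fixed word size from 250 nm to 90 nm speeds the multiplier up by 250/90, a little under three, and growing the word size from 8 to 64 bits in one process doubles the delay.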
6.2 Delay Analysis
In this research, delay analysis is performed using two methods. The first method
estimates delays based on the intrinsic delays listed in the datasheets for each library cell.
The intrinsic delay is the delay through the cell when there is no load on the output. No
routing delays are included. The intrinsic delay values are taken at 25°C, nominal
voltage, and typical process.
The second method uses the Common Timing Engine (CTE) of Cadence’s
Encounter platform. The Common Timing Engine takes as inputs a design’s netlist
(Verilog), cell library process information, parasitic resistance and capacitance data, and
simulation environment parameters such as temperature and voltage. All of the timing
analysis is performed at the nominal voltage level, 2.5 V, 1.8 V, 1.2 V, or 1.0 V, for the
particular process technology. Temperature is set at 25°C. Typical process models are
used.
6.3 Delay for Multipliers in the 250 nm Process Technology
Following cell placement and route in the 250 nm cell library, parasitic
resistances and capacitances were extracted for four Wallace multipliers, four Dadda
multipliers, and four Reduced Area Multipliers. Tables 6.1, 6.2, and 6.3 report the
comparisons of the delay values for the Wallace, Dadda, and Reduced Area multipliers.
At each word size, the estimated delays without routing are approximately the
same for the three types of multipliers. The variations in the estimated delays at the
smaller word sizes are due to delay increases in lookahead logic blocks as they increase
widths from 2-wide to 3-wide or 3-wide to 4-wide.
Comparing the estimated delays without route to the CTE delays that include
route, the delay increase is 18% or less for word sizes equal to or smaller than 32 by 32.
Generally, a 20% or less increase in delay due to routing is very reasonable. The impact
of larger areas and longer routing is seen more clearly in the 64 by 64 multipliers. For
these large multipliers, the delay increases from estimated delays without route to CTE
delays with route range from 27% to 46%.
Table 6.1: Delay values for Wallace multipliers in the 250 nm process
Word Size   Estimated Delay w/o route (nsec)   CTE Delay w/ parasitics (nsec)   Change due to parasitics
8 by 8      4.1                                4.8                              17%
16 by 16    6.2                                6.9                              11%
32 by 32    7.5                                8.6                              12%
64 by 64    8.6                                12.6                             46%
Table 6.2: Delay values for Dadda multipliers in the 250 nm process
Word Size   Estimated Delay w/o route (nsec)   CTE Delay w/ parasitics (nsec)   Change due to parasitics
8 by 8      4.1                                4.7                              15%
16 by 16    6.4                                6.7                              5%
32 by 32    7.5                                8.5                              13%
64 by 64    8.6                                10.9                             27%
Table 6.3: Delay values for Reduced Area multipliers in the 250 nm process
Word Size   Estimated Delay w/o route (nsec)   CTE Delay w/ parasitics (nsec)   Change due to parasitics
8 by 8      4.0                                4.7                              18%
16 by 16    6.1                                6.7                              10%
32 by 32    7.5                                8.5                              13%
64 by 64    8.6                                11.4                             32%
There is a 3% or less difference among the CTE delays for the three types of
multipliers at the 8 by 8, 16 by 16, and 32 by 32 word sizes. For the 64 by 64 word size,
the Wallace multiplier was the slowest and the Dadda multiplier was the fastest. The
Wallace multiplier is 15% slower and the Reduced Area multiplier is 6% slower than the
Dadda multiplier. Closer inspection of simulation report files for all three multipliers
reveals that the timing differences are due to routing load variations. Table 6.4 shows the
delays through input buffering (D flip-flops and buffers), partial product generation and
reduction, and the final carry propagate adder. The Wallace multiplier had significantly
higher loading and slower slew rates for the D flip-flops, the buffers, and four of the ten
(3,2) counters in the partial product reduction path. Replacing the more heavily loaded
1X-strength (3,2) counters with 2X-strength (3,2) counters would improve timing in the
partial product reduction without impacting area, since the 1X and 2X (3,2) counters
share the same footprints. Also, during timing-driven placement, the tools could be
allowed to upsize cells when slew rate and timing budgets are not met.
Table 6.4: Critical section delays for 64 by 64 multipliers in 250 nm process
Multiplier section                         Wallace Multiplier (nsec)   Dadda Multiplier (nsec)   Reduced Area Multiplier (nsec)
Input buffering                            1.7                         1.4                       1.4
Partial product generation and reduction   6.6                         5.0                       5.7
Final carry propagate adder                4.3                         4.5                       4.3
6.4 Delay for Multipliers in the 180 nm Process Technology
Tables 6.5, 6.6, and 6.7 list the delay estimates and CTE timing analysis results
for the Wallace, Dadda, and Reduced Area multipliers, respectively, developed in 180 nm
process technology. For each word size, the estimated delays without routing are
approximately the same for the three types of multipliers. Comparing the estimated
delays without route to the CTE delays that include route, the delay increase is 16% or
less for word sizes equal to or smaller than 32 by 32. The impact of larger areas and
longer routing is seen more clearly in the 64 by 64 multipliers.
For these large
multipliers, the percentage of increased delay ranges from 24% to 36%.
There is a 4% or less difference amongst the CTE delays for the three types of
multipliers at the 8 by 8, 16 by 16, and 32 by 32 word sizes. For the 64 by 64 word size,
the Wallace multiplier was the slowest and the Dadda Multiplier was the fastest. The
Wallace and Reduced Area multipliers are 8% and 4%, respectively, slower than the
Dadda multiplier. Table 6.8 shows the delays through input buffering (D flip-flops and
buffers), partial product generation and reduction, and the final carry propagate adder.
Closer inspection of simulation report files reveals that the 0.6 nsec timing difference
between the Dadda multiplier and the Wallace multiplier is due to slightly higher routing
loads along the path of (3,2) counters in the partial product reduction stage.
Table 6.5: Delay values for Wallace multipliers in the 180 nm process
Word Size   Estimated Delay w/o route (nsec)   CTE Delay w/ parasitics (nsec)   Change due to parasitics
8 by 8      2.9                                3.2                              10%
16 by 16    4.3                                4.7                              9%
32 by 32    5.1                                5.9                              16%
64 by 64    5.9                                8.0                              36%
Table 6.6: Delay values for Dadda multipliers in the 180 nm process
Word Size   Estimated Delay w/o route (nsec)   CTE Delay w/ parasitics (nsec)   Change due to parasitics
8 by 8      2.8                                3.2                              14%
16 by 16    4.4                                4.5                              2%
32 by 32    5.1                                5.8                              14%
64 by 64    5.9                                7.4                              24%
Table 6.7: Delay values for Reduced Area multipliers in the 180 nm process
Word Size   Estimated Delay w/o route (nsec)   CTE Delay w/ parasitics (nsec)   Change due to parasitics
8 by 8      2.8                                3.1                              11%
16 by 16    4.2                                4.6                              10%
32 by 32    5.1                                5.8                              14%
64 by 64    5.9                                7.7                              30%
Table 6.8: Critical section delays for 64 by 64 multipliers in 180 nm process
Multiplier section                         Wallace Multiplier (nsec)   Dadda Multiplier (nsec)   RA Multiplier (nsec)
Input buffering                            1.0                         0.9                       1.1
Partial product generation and reduction   4.0                         3.4                       3.6
Final carry propagate adder                3.0                         3.2                       3.0
6.5 Delay for Multipliers in the 130 nm Process Technology
Two standard cell libraries were used to design column compression multipliers
in the 130 nm process technology. The generic standard cell library is referred to as
“130g” and the low power library as “130p.” Tables 6.9, 6.10, and 6.11 report the delay
values for the Wallace multipliers, Dadda multipliers, and Reduced Area multipliers,
respectively, developed using the 130g cell library.
At each word size, the estimated delays without routing are approximately the
same for the three types of multipliers. Comparing the estimated delays without route to
the CTE delays that include route, the delay increase was 20% or less for word sizes
equal to or smaller than 32 by 32. The impact of larger areas and longer routing is seen
more clearly in the 64 by 64 multipliers. For these large multipliers, the percentage of
increased delay ranges from 28% to 43%.
Table 6.9: Delay values for Wallace multipliers in the 130g cell library
Word Size   Estimated Delay w/o route (nsec)   CTE Delay w/ parasitics (nsec)   Change due to parasitics
8 by 8      2.2                                2.6                              18%
16 by 16    3.3                                3.8                              15%
32 by 32    4.0                                4.8                              20%
64 by 64    4.6                                6.6                              43%
Table 6.10: Delay values for Dadda multipliers in the 130g cell library
Word Size   Estimated Delay w/o route (nsec)   CTE Delay w/ parasitics (nsec)   Change due to parasitics
8 by 8      2.2                                2.6                              18%
16 by 16    3.4                                3.8                              12%
32 by 32    4.0                                4.6                              15%
64 by 64    4.6                                5.9                              28%
Table 6.11: Delay values for Reduced Area multipliers in the 130g cell library
Word Size   Estimated Delay w/o route (nsec)   CTE Delay w/ parasitics (nsec)   Change due to parasitics
8 by 8      2.2                                2.6                              18%
16 by 16    3.3                                3.8                              15%
32 by 32    4.0                                4.7                              18%
64 by 64    4.6                                6.3                              37%
There is a 4% or less difference among the CTE delays for the three types of
multipliers at the 8 by 8, 16 by 16, and 32 by 32 word sizes. For the 64 by 64 word size,
the Wallace multiplier was the slowest and the Dadda Multiplier was the fastest. The
Wallace multiplier is 13% slower and the Reduced Area multiplier is 7% slower than the
Dadda multiplier. Table 6.12 shows the delays through input buffering (D flip-flops and
buffers), partial product generation and reduction, and the final carry propagate adder.
Closer inspection of simulation report files reveals that the 0.7 nsec timing difference
between the Dadda multiplier and the Wallace multiplier is due to slightly higher routing
loads along the path of (3,2) counters in the partial product reduction stage.
Table 6.12: Critical section delays for 64 by 64 multipliers in 130g cell library
Multiplier section                         Wallace Multiplier (nsec)   Dadda Multiplier (nsec)   RA Multiplier (nsec)
Input buffering                            0.7                         0.7                       0.8
Partial product generation and reduction   3.4                         2.7                       3.0
Final carry propagate adder                2.5                         2.5                       2.5
Tables 6.13, 6.14, and 6.15 report the delay values for the Wallace multipliers,
Dadda multipliers, and Reduced Area multipliers, respectively, developed using the low
power standard cell library designated as “130p.”
Table 6.13: Delay values for Wallace multipliers in the 130p cell library
Word Size   Estimated Delay w/o route (nsec)   CTE Delay w/ parasitics (nsec)   Change due to parasitics
8 by 8      2.6                                3.0                              15%
16 by 16    3.8                                4.3                              13%
32 by 32    4.7                                5.5                              17%
64 by 64    5.4                                7.6                              41%
Table 6.14: Delay values for Dadda multipliers in the 130p cell library
Word Size   Estimated Delay w/o route (nsec)   CTE Delay w/ parasitics (nsec)   Change due to parasitics
8 by 8      2.5                                2.9                              16%
16 by 16    3.9                                4.2                              8%
32 by 32    4.7                                5.4                              15%
64 by 64    5.4                                7.2                              33%
Table 6.15: Delay values for Reduced Area multipliers in the 130p cell library
Word Size   Estimated Delay w/o route (nsec)   CTE Delay w/ parasitics (nsec)   Change due to parasitics
8 by 8      2.5                                2.9                              14%
16 by 16    3.7                                4.3                              16%
32 by 32    4.7                                5.4                              15%
64 by 64    5.4                                7.2                              33%
At each word size, the estimated delays without routing are approximately the
same for the three types of multipliers developed with the 130p library. Comparing the
estimated delays without route to the CTE delays that include route, the delay increase
was 17% or less for word sizes equal to or smaller than 32 by 32. The impact of larger
areas and longer routing is seen more clearly in the 64 by 64 multipliers. For these large
multipliers, the percentage of increased delay ranges from 33% to 41%.
At each word size, the CTE delays for the three multiplier types in the 130p
library are approximately the same. The largest delay difference is only 0.4 nsec (5%)
between the 64 by 64 Wallace and Dadda multipliers.
6.6 Delay for Multipliers in the 90 nm Process Technology
Following cell placement and route in the 90 nm cell library, parasitic resistances
and capacitances were extracted for Wallace, Dadda, and Reduced Area Multipliers.
Tables 6.16, 6.17, and 6.18 report the comparisons of the delay values for each multiplier.
Generally, a 20% or less increase in delay due to including the routing characteristics is
very reasonable.
With the exception of the 16 by 16 Dadda multiplier, the delay
increases from estimated delays without route to CTE delays with route range from 22%
to 50%; the delay increase for the 16 by 16 Dadda multiplier is only 16%.
At each word size, the CTE delays for the three multiplier types are
approximately the same. The delay differences among the three types of multipliers are
less than 4% for each word size.
Table 6.16: Delay values for Wallace multipliers in the 90 nm process
Word Size   Estimated Delay w/o route (nsec)   CTE Delay w/ parasitics (nsec)   Change due to parasitics
8 by 8      1.2                                1.5                              25%
16 by 16    1.8                                2.2                              22%
32 by 32    2.2                                2.9                              32%
64 by 64    2.6                                3.9                              50%
Table 6.17: Delay values for Dadda multipliers in the 90 nm process
Word Size   Estimated Delay w/o route (nsec)   CTE Delay w/ parasitics (nsec)   Change due to parasitics
8 by 8      1.2                                1.5                              25%
16 by 16    1.9                                2.2                              16%
32 by 32    2.2                                2.8                              27%
64 by 64    2.6                                3.9                              50%
Table 6.18: Delay values for Reduced Area multipliers in the 90 nm process
Word Size   Estimated Delay w/o route (nsec)   CTE Delay w/ parasitics (nsec)   Change due to parasitics
8 by 8      1.2                                1.5                              25%
16 by 16    1.8                                2.2                              22%
32 by 32    2.2                                2.8                              27%
64 by 64    2.6                                3.8                              46%
6.7 Delay Comparisons
Table 6.19 lists all of the back-annotated CTE delay data for the Wallace, Dadda,
and Reduced Area multipliers developed in generic standard cell libraries. Overall, for
the same word size and process technology, the three multipliers show approximately
equal delays. For the 64 by 64 multipliers, the Wallace multipliers are slightly slower
than the Dadda and Reduced Area multipliers. Figure 6.1 plots the back-annotated CTE
delays for the Dadda multipliers. Inspection of the multipliers’ delay data provides early
support of the contention that delay is proportional to the logarithm of the word size N.
Table 6.19: Back-annotated delays for Wallace, Dadda, and Reduced Area
multipliers developed in generic standard cell libraries
Word Size   Process   Wallace (nsec)   Dadda (nsec)   RA Mult (nsec)
8 by 8      250 nm    4.8              4.7            4.7
8 by 8      180 nm    3.2              3.2            3.1
8 by 8      130 nm    2.6              2.6            2.6
8 by 8      90 nm     1.5              1.5            1.5
16 by 16    250 nm    6.9              6.7            6.7
16 by 16    180 nm    4.7              4.5            4.6
16 by 16    130 nm    3.8              3.8            3.8
16 by 16    90 nm     2.2              2.2            2.2
32 by 32    250 nm    8.6              8.5            8.5
32 by 32    180 nm    5.9              5.8            5.8
32 by 32    130 nm    4.8              4.6            4.7
32 by 32    90 nm     2.9              2.8            2.8
64 by 64    250 nm    12.6             10.9           11.4
64 by 64    180 nm    8.0              7.4            7.7
64 by 64    130 nm    6.6              5.9            6.3
64 by 64    90 nm     3.9              3.9            3.8
Figure 6.1: Back-annotated delay for N by N Dadda multipliers (plot of Delay (nsec) versus Word Size, N; series: Dadda 250nm, Dadda 180nm, Dadda 130g, Dadda 90nm)
Table 6.20 provides comparison of the delays for all of the multipliers developed
in the 130 nm process. The multipliers implemented with the 130p library are 10% to
22% slower than the ones built in the 130g library.
Table 6.20: Back-annotated delays for Wallace, Dadda, and Reduced Area
multipliers developed in 130g and 130p cell libraries
Word Size   Cell Library   Wallace (nsec)   Dadda (nsec)   RA Mult (nsec)
8 by 8      130g           2.6              2.6            2.6
8 by 8      130p           3.0              2.9            2.9
16 by 16    130g           3.8              3.8            3.8
16 by 16    130p           4.3              4.2            4.3
32 by 32    130g           4.8              4.6            4.7
32 by 32    130p           5.5              5.4            5.4
64 by 64    130g           6.6              5.9            6.3
64 by 64    130p           7.6              7.2            7.2
Tables 6.21, 6.22, and 6.23 give the normalized back-annotated delay values of
the Wallace, Dadda, and Reduced Area multipliers in the 250 nm, 180 nm, 130 nm
(generic library), and 90 nm CMOS technologies. The delay for each multiplier is
normalized to the delay of the 8 by 8 multiplier in that particular process technology.
These normalized delays show that as the operand size doubles the total delay increases
by slightly less than 50%. Moreover, the consistency of the normalized delays shows that
the multiplier delays are not adversely impacted by routing parasitics in the smaller
process geometries as the word size increases. If disproportionate scaling in critical
physical parameters or concentration densities had occurred in the 90 nm process
technology, one effect would have been a significant imbalance between drive
capabilities and routing parasitics.
Table 6.21: Wallace multipliers with back-annotated delays
relative to each process’s 8 by 8 case
Multiplier    Normalized Delay
              250 nm    180 nm    130 nm    90 nm
8 by 8        1.0       1.0       1.0       1.0
16 by 16      1.4       1.5       1.5       1.5
32 by 32      1.8       1.8       1.8       1.9
64 by 64      2.6       2.5       2.5       2.6
Table 6.22: Dadda multipliers with back-annotated delays
relative to the process’s 8 by 8 case
Multiplier    Normalized Delay
              250 nm    180 nm    130 nm    90 nm
8 by 8        1.0       1.0       1.0       1.0
16 by 16      1.4       1.4       1.5       1.5
32 by 32      1.8       1.8       1.8       1.9
64 by 64      2.3       2.3       2.3       2.6
Table 6.23: Reduced Area multipliers with back-annotated delays
relative to the process’s 8 by 8 case
Multiplier    Normalized Delay
              250 nm    180 nm    130 nm    90 nm
8 by 8        1.0       1.0       1.0       1.0
16 by 16      1.4       1.5       1.5       1.5
32 by 32      1.8       1.9       1.8       1.9
64 by 64      2.4       2.5       2.4       2.5
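The normalization behind Tables 6.21 through 6.23 can be reproduced from the delays in Table 6.19. The following Python sketch is illustrative only: the dictionary layout and the function name `normalize` are assumptions made for this example, and only the Wallace delays are included.

```python
# Back-annotated Wallace delays (nsec) from Table 6.19, keyed by process,
# listed for word sizes N = 8, 16, 32, 64.
WALLACE_DELAY_NS = {
    "250 nm": [4.8, 6.9, 8.6, 12.6],
    "180 nm": [3.2, 4.7, 5.9, 8.0],
    "130 nm": [2.6, 3.8, 4.8, 6.6],
    "90 nm":  [1.5, 2.2, 2.9, 3.9],
}

def normalize(delays):
    """Divide each delay by the 8 by 8 delay (first entry) of the same process."""
    base = delays[0]
    return [round(d / base, 1) for d in delays]

normalized = {proc: normalize(d) for proc, d in WALLACE_DELAY_NS.items()}
for proc, values in normalized.items():
    print(proc, values)
```

Replacing the delay lists with the Dadda or Reduced Area columns of Table 6.19 gives the corresponding normalized values in the same way.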
Tables 6.24, 6.25, and 6.26 report the delay ratios of the multipliers in the 250
nm, 180 nm, and 130 nm processes to the 90 nm multiplier in the same word size. A
column compression multiplier design that is ported from a 250 nm process to a 90 nm
process will be approximately three times faster. Porting from 180 nm to 90 nm yields a
multiplier that is approximately twice as fast. Porting from 130 nm to 90 nm improves
delay by approximately 40%.
Table 6.24: Back-annotated Wallace multiplier delays relative to 90 nm
Multiplier    250 nm to 90 nm    180 nm to 90 nm    130 nm to 90 nm
8 by 8        3.2                2.1                1.7
16 by 16      3.1                2.1                1.7
32 by 32      3.0                2.0                1.7
64 by 64      3.2                2.0                1.7
Table 6.25: Back-annotated Dadda multiplier delays relative to 90 nm
Multiplier    250 nm to 90 nm    180 nm to 90 nm    130 nm to 90 nm
8 by 8        3.1                2.1                1.7
16 by 16      3.0                2.0                1.7
32 by 32      3.0                2.1                1.6
64 by 64      2.8                1.9                1.5
Table 6.26: Back-annotated Reduced Area multiplier delays relative to 90 nm
Multiplier    250 nm to 90 nm    180 nm to 90 nm    130 nm to 90 nm
8 by 8        3.1                2.1                1.7
16 by 16      3.0                2.1                1.7
32 by 32      3.0                2.1                1.7
64 by 64      3.0                2.0                1.7
At the beginning of this chapter, a rough estimate of column compression delay
was given as kλ log(N), where λ is the process’s minimum feature size, N is the word
size, and k is a constant scaling factor. The delay approximations in Equations 6.11, 6.12,
and 6.13 are realized by directly calculating average k values for each type of multiplier,
with λ in units of nanometers. The generalized delay approximation in Equation 6.14 is
produced by averaging all of the calculated k values.
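\[ \mathrm{Delay}_{\mathrm{Wallace}} \cong 0.0069\,\lambda \log_2(N) \qquad (6.11) \]
\[ \mathrm{Delay}_{\mathrm{Dadda}} \cong 0.0066\,\lambda \log_2(N) \qquad (6.12) \]
\[ \mathrm{Delay}_{\mathrm{RA}} \cong 0.0067\,\lambda \log_2(N) \qquad (6.13) \]
\[ \mathrm{Delay}_{\mathrm{CCMultiplier}} \cong 0.0068\,\lambda \log_2(N) \qquad (6.14) \]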
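As a worked illustration of the single-term estimate, the following Python sketch evaluates k λ log2(N) and compares it with one measured delay from Table 6.19. The function names `rough_delay_ns` and `percent_error` are hypothetical, and the error is taken relative to the measured value, which is an assumption about how the percentages in this chapter are defined.

```python
import math

# Average scaling factors k from Equations 6.11-6.14 (lambda in nanometers,
# delay in nanoseconds).
K = {"Wallace": 0.0069, "Dadda": 0.0066, "RA": 0.0067, "General": 0.0068}

def rough_delay_ns(kind, feature_nm, word_size):
    """First-order delay estimate k * lambda * log2(N)."""
    return K[kind] * feature_nm * math.log2(word_size)

def percent_error(estimate, measured):
    """Signed error of the estimate relative to the measured value."""
    return 100.0 * (estimate - measured) / measured

# Example: 64 by 64 Wallace multiplier in the 250 nm process (measured 12.6 nsec).
est = rough_delay_ns("Wallace", 250, 64)
err = percent_error(est, 12.6)
print(f"estimate = {est:.2f} nsec, error = {err:.1f}%")
```

For this case the estimate falls well below the measured delay, which is the kind of gap that motivates a second term in the approximation.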
In a few cases, the delay approximations generated using Equations 6.11 – 6.14
are slightly too “rough.” The differences between the estimated delay values from these
equations and the measured delay values can be as large as 24%. This poor approximation
to some delay values is due to the increasing delay of the carry-propagate adder as the
multiplier’s word size increases. Figure 6.2 shows the contributions of the three main
design sections to the delays of the 250 nm Dadda multipliers: the partial product
generation (PP Gen), which includes input buffering; the partial product reduction; and
the carry lookahead adder. There needs to be an additional term that accounts for the delay
caused by the input buffering and partial product generation and the carry-lookahead
adder.
Figure 6.2: Delay pie charts for back-annotated Dadda multipliers
(Four pie charts, one per word size, with the following delay contributions.)
Word Size    PP Gen          Reduction       CLA
8 by 8       0.7 ns (14%)    2.1 ns (45%)    1.9 ns (41%)
16 by 16     0.8 ns (12%)    3.1 ns (46%)    2.8 ns (42%)
32 by 32     0.8 ns (10%)    4.4 ns (52%)    3.3 ns (38%)
64 by 64     1.4 ns (13%)    5 ns (46%)      4.5 ns (41%)
Better approximations can be found through the application of a least squares
method, solving for delay in the form of
\[ \mathrm{Delay} \cong d_1 \lambda \log_2(N) + d_2 \lambda \qquad (6.15) \]
where λ is the process’s minimum feature size, N is the word size, and d1 and d2 are
constant scaling factors. Equations 6.16, 6.17, and 6.18 provide delay approximations for each
type of column compression multiplier, with λ in units of nanometers.
\[ \mathrm{Delay}_{\mathrm{Wallace}} \cong 0.0094\,\lambda \log_2(N) - 0.011\,\lambda \qquad (6.16) \]
\[ \mathrm{Delay}_{\mathrm{Dadda}} \cong 0.0082\,\lambda \log_2(N) - 0.0066\,\lambda \qquad (6.17) \]
\[ \mathrm{Delay}_{\mathrm{RA}} \cong 0.0087\,\lambda \log_2(N) - 0.0083\,\lambda \qquad (6.18) \]
Equation 6.19 is the general, combined form of the delay approximation for any
of the three types of column compression multiplier.
\[ \mathrm{Delay}_{\mathrm{CCMultiplier}} \cong 0.0088\,\lambda \log_2(N) - 0.0085\,\lambda \qquad (6.19) \]
The largest difference between the estimated delay values from these equations and
the measured delay values is 14%, when using the generalized delay equation. The best
fits are provided by the delay approximations for the specific types of multiplier. The
magnitude of the error ranges from 1.9% to 14% in the comparison of the Wallace
delay approximations from Equation 6.16 to the measured Wallace multiplier
delays. The magnitude of the error ranges from 1.2% to 11% and from 1.1% to 13% in the
comparison of approximated delays versus measured delays for Dadda and Reduced Area
multipliers, respectively. Figure 6.3 plots the measured back-annotated Dadda delays and
the approximated delays using Equation 6.17.
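The error range quoted for the Dadda approximation can be checked against the measured delays in Table 6.19. The Python sketch below is a minimal illustration, assuming the error is the absolute difference divided by the measured delay; the names `DADDA_MEASURED_NS` and `dadda_delay_ns` are introduced only for this example.

```python
import math

# Measured back-annotated Dadda delays (nsec) from Table 6.19.
# Keys are the feature size in nm; values are for N = 8, 16, 32, 64.
DADDA_MEASURED_NS = {
    250: [4.7, 6.7, 8.5, 10.9],
    180: [3.2, 4.5, 5.8, 7.4],
    130: [2.6, 3.8, 4.6, 5.9],
    90:  [1.5, 2.2, 2.8, 3.9],
}
WORD_SIZES = [8, 16, 32, 64]

def dadda_delay_ns(feature_nm, word_size):
    """Dadda delay estimate: 0.0082 * lambda * log2(N) - 0.0066 * lambda."""
    return 0.0082 * feature_nm * math.log2(word_size) - 0.0066 * feature_nm

errors = []
for feature_nm, delays in DADDA_MEASURED_NS.items():
    for n, measured in zip(WORD_SIZES, delays):
        estimate = dadda_delay_ns(feature_nm, n)
        errors.append(abs(estimate - measured) / measured * 100.0)

print(f"smallest error: {min(errors):.1f}%  largest error: {max(errors):.1f}%")
```

Under these assumptions the smallest and largest error magnitudes come out close to the 1.2% and 11% bounds stated for the Dadda multipliers.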
Figure 6.3: Back-annotated Dadda multiplier delays versus estimated delays
(Plot of delay in nsec versus word size N, measured versus estimated; series: Dadda and Estimate for each of the 250 nm, 180 nm, 130 nm, and 90 nm processes.)
Using Equations 6.16, 6.17, 6.18, and 6.19, the delays of column compression
multipliers in smaller process geometries can be predicted. Table 6.27 lists the delay
predictions for column compression multipliers in a 65 nm process technology. An 8 by
8 multiply is expected to complete in approximately 1.2 nsec, a 16 by 16 in 1.7 nsec, a 32
by 32 in 2.3 nsec, and a 64 by 64 in 2.9 nsec.
Table 6.27: Predicted delays for column compression multipliers
in a 65 nm process technology
Word Size    Wallace    Dadda     Reduced Area    General CC Multiplier
             (nsec)     (nsec)    (nsec)          (nsec)
8 by 8       1.1        1.2       1.2             1.2
16 by 16     1.7        1.7       1.7             1.7
32 by 32     2.3        2.2       2.3             2.3
64 by 64     3.0        2.8       2.8             2.9
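The 65 nm predictions follow from evaluating the two-term delay approximations at λ = 65. The following Python sketch is a hedged illustration using the rounded coefficients printed in Equations 6.16 through 6.19; the names `COEFFS` and `predict_delay_ns` are assumptions for this example. Because the printed coefficients are rounded, a result that lies near a rounding boundary can differ from the tabulated value in the last digit.

```python
import math

# (d1, d2) coefficient pairs from Equations 6.16-6.19; lambda is in nanometers.
COEFFS = {
    "Wallace": (0.0094, -0.011),
    "Dadda":   (0.0082, -0.0066),
    "RA":      (0.0087, -0.0083),
    "General": (0.0088, -0.0085),
}

def predict_delay_ns(kind, feature_nm, word_size):
    """Delay estimate d1 * lambda * log2(N) + d2 * lambda."""
    d1, d2 = COEFFS[kind]
    return d1 * feature_nm * math.log2(word_size) + d2 * feature_nm

predictions_65nm = {
    kind: [round(predict_delay_ns(kind, 65, n), 1) for n in (8, 16, 32, 64)]
    for kind in COEFFS
}
for kind, values in predictions_65nm.items():
    print(kind, values)
```

The same function can be called with any other feature size, for example 45, although extrapolating that far beyond the fitted data is not supported by the measurements in this chapter.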
6.8 Delay Summary
Timing analysis using Cadence’s Common Timing Engine has provided
significant insight into the delay characteristics of column compression multipliers. For a
given process technology, the delays of Wallace, Dadda, and Reduced Area multipliers
are approximately equal for 32 by 32 word sizes and smaller; the delay differences
amongst the multipliers are at most 5%. For the 64 by 64 multipliers, the Wallace
multipliers were slower than the Dadda and Reduced Area multipliers; the delay
differences are 8% and higher. For 32 by 32 and smaller multipliers, these delay-related
findings mean that an IC architect or designer can use other information, such as area or
power consumption, to select amongst the three types of column compression multipliers.
These findings also support the usage of automated design and layout tools since no
multiplier showed parasitic capacitances that were significantly detrimental to delay.
The delay data does confirm that column compression delay is proportional to the
logarithm of the word size. Delay can be very roughly estimated by k λ log(N), where k
is a constant scaling factor, λ is the minimum feature size, and N is the word size. This
research has shown that a better approximation to delay includes an additional linear term
in λ. This additional term is needed because while log(N) correctly represents the delay
growth in the reduction stages it is not sufficient to completely approximate the
increasing delay of the input buffering, the partial product generation, and the final
carry-propagate adder.
Chapter 7
Multiplier Power Consumption
For today’s consumer and industrial product markets, the power consumption of
IC components is a critical concern.
Portable, battery operated devices require
conscientious power reduction techniques for all sub-components. Even products that
utilize a supply cord are manufactured in compact form factors, requiring that the heat
generation be minimized.
With multiplication among the most common arithmetic
operations performed for signal processing, it is important to examine the power
characteristics of column compression multipliers across various operand word sizes and
process technologies.
This chapter presents the results of multiplier simulations using Cadence’s multipurpose, hierarchical simulator, Virtuoso® UltraSim™. The designs of each multiplier
were placed and routed in the standard cell libraries of three CMOS process technologies:
1) 250 nm, 2.5 V, 2) 180 nm, 1.8 V, and 3) 130 nm, 1.2 V. For the 130 nm process
technology, a generic standard cell library and a low power standard cell library are used
to build multipliers. The average power consumption by Wallace, Dadda, and Reduced
Area multipliers is examined with the back-annotation of parasitic resistances and
capacitances extracted from the layouts. Before the actual power values are reported, a
simple analysis is provided to predict the trends in multiplier power consumption as input
word sizes and process technologies are changed.
7.1 Power Estimation
Unfortunately, in the literature there are no equations or even decent heuristics for
calculating the average power consumption of column compression multipliers.
Researchers have tried to offer relative measures for power characteristics, examining
nodal toggle counts, attempting to reduce spurious transitions, and offering probabilistic
analysis of switching activity. These power estimation attempts frequently fall short of
giving realistic results.
Dynamic power consumption in CMOS is described by
\[ \mathrm{Power} = C \times V^2 \times f \qquad (7.1) \]
where C is capacitance, V is supply voltage, and f is operating frequency. Power for
column compression multipliers can be expressed as
\[ \mathrm{Power}_{\mathrm{CCMultiplier}} \cong \left( \frac{\mathrm{Area}_{\mathrm{CCMultiplier}}}{t_{ox}} \right) \times V^2 \times \left( \frac{1}{\max(\mathrm{Delay}_{\mathrm{CCMultiplier}})} \right) \qquad (7.2) \]
where Area_CCMultiplier is the total layout area, t_ox is the gate oxide thickness, and
max(Delay_CCMultiplier) is the maximum delay for the multiplier’s word size. As discussed
in Chapter 5, the total area of an N by N column compression multiplier is expected to be
approximately equal to k λ² N², where k is a constant scaling factor and λ is the minimum
process geometry. To a first order, gate oxide thickness, supply voltage, and multiplier
delay are proportional to λ. Therefore, the expression for power becomes
\[ \mathrm{Power}_{\mathrm{CCMultiplier}} \sim \left( \frac{k \lambda^2 N^2}{\lambda} \right) \times \lambda^2 \times \left( \frac{1}{\lambda} \right) \qquad (7.3) \]
Simplifying Equation 7.3, column compression multiplier power can be
approximated by
\[ \mathrm{Power}_{\mathrm{CCMultiplier}} \sim k \lambda^2 N^2 \qquad (7.4) \]
This approximation for power consumption in terms of λ and N indicates that power
should scale in the same manner as area for column compression multipliers. Based on
the similarities of the area and power approximations, for a given process technology,
doubling the operand size is predicted to increase the average power consumption by
approximately a factor of four. Also, average power consumption is estimated to fall to
approximately 0.5 of its previous value for each generational transition in which the
process minimum feature size scales by approximately 1/√2.
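The two scaling predictions can be checked with a few lines of arithmetic. The Python sketch below assumes only the first-order relation Power ~ k λ² N² and takes a generational shrink of λ by about 1/√2 (for example, 180 nm to 130 nm) as the assumed step; the function name `relative_power` is hypothetical.

```python
import math

def relative_power(feature_scale, word_scale):
    """Power ratio predicted by the first-order relation Power ~ k * lambda^2 * N^2.

    feature_scale: new lambda divided by old lambda
    word_scale:    new N divided by old N
    """
    return (feature_scale ** 2) * (word_scale ** 2)

# Doubling the word size in a fixed process.
doubling = relative_power(1.0, 2.0)
# One process generation: lambda shrinks by about 1/sqrt(2), word size fixed.
shrink = relative_power(1.0 / math.sqrt(2.0), 1.0)
print(doubling, round(shrink, 3))
```

The constant k cancels in both ratios, so the predictions do not depend on its value.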
For the word sizes and process technologies used in this research, it is expected
that the Wallace, Dadda, and Reduced Area multipliers will consume basically equivalent
amounts of power since the complexities and delays are similar for all three. With the
largest area and highest gate count, Wallace multipliers are expected to use the most
power, but since the Wallace area and gate counts are only slightly bigger than that of the
other two multipliers, the total average power consumptions should be very close.
7.2 Power Simulations
In this research, all power simulations were performed using Virtuoso®
UltraSim™. UltraSim takes as inputs a design’s netlist, RC parasitics file in SPEF
format, process technology information, simulation environment parameters such as
temperature and voltage, and a vector stimulus file. All simulations were performed at
the nominal voltage levels, 2.5 V, 1.8 V, or 1.2 V for the particular process technology.
The simulation temperature was set at 25°C. Typical process models were used.
For each word size, the same vector stimulus files were applied for power
analysis. Vector stimulus files containing randomly selected multiplier input values and
the expected multiplication products were used to collect average current and power
values. The average power values were determined by multiplying the nominal voltage by
the average current. It was not possible to evaluate leakage current using UltraSim and
the given standard cell libraries.
One vector was applied each clock cycle. The period for each clock cycle was
determined by the longest timing delay for a given word size and process technology.
For example, in the 250 nm process technology, the worst case delays are 12.6 nsec, 10.9
nsec, and 11.4 nsec for the 64 by 64 Wallace, Dadda, and Reduced Area multipliers,
respectively. The clock period for simulating the three 64 by 64 bit multipliers is set at
13 nsec.
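The bookkeeping described above can be sketched as follows. This Python example is illustrative and rests on labeled assumptions: rounding the slowest delay up to a whole nanosecond is inferred from the single 13 nsec example, dividing by the clock frequency to obtain a per-MHz figure is an assumed normalization, and the 100 mA average current is a hypothetical value rather than a simulation result.

```python
import math

def clock_period_ns(worst_case_delays_ns):
    """Round the slowest multiplier delay up to a whole nanosecond (assumed rule)."""
    return math.ceil(max(worst_case_delays_ns))

def average_power_uw(nominal_voltage_v, average_current_ma):
    """Average power = nominal voltage x average current, returned in microwatts."""
    return nominal_voltage_v * average_current_ma * 1000.0

def power_per_mhz(power_uw, period_ns):
    """Normalize by clock frequency; a period in ns gives f = 1000 / period MHz."""
    frequency_mhz = 1000.0 / period_ns
    return power_uw / frequency_mhz

period = clock_period_ns([12.6, 10.9, 11.4])   # 250 nm, 64 by 64 delays
power = average_power_uw(2.5, 100.0)           # 100 mA is a hypothetical value
print(period, power, power_per_mhz(power, period))
```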
For this research, the average power consumption of column compression
multipliers is the main focus. Average power consumption is used to determine the
duration of battery life for portable consumer and industrial products. Knowing the peak
power is important for ensuring that the battery can provide the maximum instantaneous
power needed. According to [69, 70, 71], RMS power is directly related to the Joule
heating of the circuit, where high RMS current exacerbates electromigration effects and
creates thermal gradients across a chip. Therefore it is important to limit the RMS
current density in a design. In order to properly evaluate RMS power for the multipliers,
equivalent and extremely long time intervals would have been needed for each of the
multipliers during the power simulations. RMS power was not closely examined due to
time limitations on the availability of tools, libraries, and process technologies for this
research.
7.3 Power for Multipliers in the 250 nm Process Technology
Table 7.1 reports the average power values for four Wallace multipliers, four
Dadda multipliers, and four Reduced Area multipliers developed in the 250 nm process
technology. All of the power simulations were performed with the back-annotation of
parasitic resistances and capacitances. In all examined cases, the Reduced Area
multipliers utilized the least power. The Wallace multipliers utilized the most power in
every case except the 32 by 32 word size, where the Dadda multiplier consumed more. As
shown in Table 7.2, the Wallace and Dadda multipliers consumed significantly more power
than the Reduced Area multipliers, ranging from 5% to 48% more. For each of these column
compression multipliers, as the word size doubles, average power consumption increases
by approximately a factor of five. Figure 7.1 shows plots of the average power for each
of the multipliers.
Table 7.1: Average power for Wallace, Dadda, and Reduced Area
multipliers in the 250 nm process
Word Size    Wallace      Dadda        RA Mult
             (µW/MHz)     (µW/MHz)     (µW/MHz)
8 by 8       22           21           20
16 by 16     114          110          102
32 by 32     539          658          442
64 by 64     3255         3034         2776
Table 7.2: Comparison of average power for Wallace, Dadda, and
Reduced Area multipliers in the 250 nm process
Word Size    Wallace          Dadda            RA Mult
             (vs. RA Mult)    (vs. RA Mult)    (µW/MHz)
8 by 8       + 10%            + 5%             20
16 by 16     + 12%            + 8%             102
32 by 32     + 22%            + 48%            442
64 by 64     + 17%            + 9%             2776
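The percentages in the comparison table express each Wallace and Dadda value relative to the Reduced Area value for the same word size. A minimal Python sketch of that calculation, using the 250 nm values from Table 7.1, is given below; the helper name `percent_more` is an assumption for this example, and the printed values agree with the table to within rounding.

```python
# Average power (uW/MHz) from Table 7.1, 250 nm process, for N = 8, 16, 32, 64.
WALLACE = [22, 114, 539, 3255]
DADDA = [21, 110, 658, 3034]
REDUCED_AREA = [20, 102, 442, 2776]

def percent_more(value, reference):
    """Percentage by which value exceeds reference (negative if it is lower)."""
    return 100.0 * (value - reference) / reference

wallace_vs_ra = [percent_more(w, r) for w, r in zip(WALLACE, REDUCED_AREA)]
dadda_vs_ra = [percent_more(d, r) for d, r in zip(DADDA, REDUCED_AREA)]

for n, w, d in zip((8, 16, 32, 64), wallace_vs_ra, dadda_vs_ra):
    print(f"{n} by {n}: Wallace {w:+.0f}%, Dadda {d:+.0f}%")
```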
Figure 7.1: Average power consumption for Wallace, Dadda, and
Reduced Area multipliers in the 250 nm process
(Plot of average power in µW/MHz versus word size N; series: Wallace, Dadda, RA Mult.)
7.4 Power for Multipliers in the 180 nm Process Technology
Table 7.3 gives the average power values for four Wallace multipliers, four Dadda
multipliers, and four Reduced Area multipliers developed in the 180 nm process
technology. All of the power simulations were performed with the back-annotation of
parasitic resistances and capacitances. As shown in Table 7.4, the power differences
amongst the three multipliers range from 3% to 23%. For all of the simulated
multipliers, as the word size doubles, power consumption increases by approximately a
factor of five. Figure 7.2 displays plots of average power values for each multiplier.
Table 7.3: Average power for Wallace, Dadda, and Reduced Area
multipliers in the 180 nm process
Multiplier   Wallace      Dadda        RA Mult
             (µW/MHz)     (µW/MHz)     (µW/MHz)
8 by 8       6.5          6.3          5.9
16 by 16     36           33           32
32 by 32     200          211          172
64 by 64     1022         926          1058
Table 7.4: Comparison of average power for Wallace, Dadda, and
Reduced Area multipliers in the 180 nm process
Word Size    Wallace          Dadda            RA Mult
             (vs. RA Mult)    (vs. RA Mult)    (µW/MHz)
8 by 8       + 10%            + 7%             5.9
16 by 16     + 13%            + 3%             32
32 by 32     + 16%            + 23%            172
64 by 64     - 3%             - 12%            1058
Figure 7.2: Average power consumption for Wallace, Dadda, and
Reduced Area multipliers in the 180 nm process
(Plot of average power in µW/MHz versus word size N; series: Wallace, Dadda, RA Mult.)
7.5 Power for Multipliers in the 130 nm Process Technology
Two standard cell libraries were used to design column compression multipliers
in the 130 nm process technology. The generic standard cell library is referred to as
“130g.” The low power standard cell library is referred to as “130p.”
Table 7.5 gives the average power values for four Wallace multipliers, four Dadda
multipliers, and four Reduced Area multipliers developed in the 130g cell library. All of
the power simulations were performed with the back-annotation of parasitic capacitances.
In all examined cases, the Reduced Area multipliers utilized the least power and the
Wallace multipliers the most. As shown in Table 7.6, the Wallace and Dadda multipliers
consumed significantly more power than the Reduced Area multipliers, ranging from 9%
to 23% more. For all of the simulated multipliers, as the word size doubles, power
consumption increases by approximately a factor of five.
Table 7.5: Average power for Wallace, Dadda, and Reduced Area
multipliers in the 130g cell library
Multiplier   Wallace      Dadda        RA Mult
             (µW/MHz)     (µW/MHz)     (µW/MHz)
8 by 8       1.91         1.86         1.71
16 by 16     10.2         9.9          9.0
32 by 32     46           43           39
64 by 64     281          270          229
Table 7.6: Comparison of average power for Wallace, Dadda, and
Reduced Area multipliers in the 130g cell library
Word Size    Wallace          Dadda            RA Mult
             (vs. RA Mult)    (vs. RA Mult)    (µW/MHz)
8 by 8       + 12%            + 9%             1.71
16 by 16     + 13%            + 10%            9.0
32 by 32     + 18%            + 10%            39
64 by 64     + 23%            + 18%            229
Table 7.7 gives the average power values for four Wallace multipliers, four Dadda
multipliers, and four Reduced Area multipliers developed in the low power 130p cell
library. As shown in Table 7.8, the power differences amongst the three multipliers
range from 4% to 14%. For all of the simulated multipliers, as the word size doubles,
power consumption increases by approximately a factor of five.
Table 7.7: Average power for Wallace, Dadda, and Reduced Area
multipliers in the 130p cell library
Multiplier   Wallace      Dadda        RA Mult
             (µW/MHz)     (µW/MHz)     (µW/MHz)
8 by 8       1.47         1.49         1.37
16 by 16     8.3          8.2          7.8
32 by 32     39           37           34
64 by 64     203          199          212
Table 7.8: Comparison of average power for Wallace, Dadda, and
Reduced Area multipliers in the 130p cell library
Word Size    Wallace          Dadda            RA Mult
             (vs. RA Mult)    (vs. RA Mult)    (µW/MHz)
8 by 8       + 7%             + 9%             1.37
16 by 16     + 6%             + 5%             7.8
32 by 32     + 14%            + 9%             34
64 by 64     - 4%             - 6%             212
Using the 130p cell library does reduce power in comparison to the 130g cell
library. Table 7.9 shows that the average power values decrease by 7% to 28%. Figure
7.3 shows plots of the average power for the multipliers in the 130g and 130p cell
libraries.
Table 7.9: Comparison of average power of a multiplier in the 130g cell
library to the respective multiplier in the 130p cell library
Word Size    Wallace        Dadda          RA Mult
             % reduction    % reduction    % reduction
8 by 8       23%            20%            25%
16 by 16     19%            17%            13%
32 by 32     15%            14%            13%
64 by 64     28%            26%            7%
Figure 7.3: Average power consumption for Wallace, Dadda, and
Reduced Area multipliers in 130g and 130p cell libraries
(Plot of average power in µW/MHz versus word size N; series: Wallace 130g, Dadda 130g, RA 130g, Wallace 130p, Dadda 130p, RA 130p.)
7.6 Power Comparisons
Table 7.10 lists all of the average power data for the Wallace, Dadda, and
Reduced Area multipliers developed in this research. Contrary to initial estimations, the
average power consumed by the three multipliers is not approximately equal for a given
word size and process technology. In this research, a 5% or less difference in average
power would be considered “approximately equal.” For a given word size, average
power values differed amongst the three multipliers by as little as 3% and as much as
48%, with a 10% to 20% variation being typical. These differences are more noticeable
when operating frequencies greater than 1 MHz are considered. For example, at 200
MHz in the 130g cell library, the average power consumed for 64 by 64 Wallace, Dadda,
and Reduced Area multipliers would be 56.2 mW, 54.0 mW, and 45.8 mW respectively.
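The 200 MHz figures follow from multiplying the per-MHz power by the operating frequency. The short Python sketch below reproduces them, assuming power scales linearly with frequency as the µW/MHz unit implies; the function name `power_mw` is introduced only for this example.

```python
def power_mw(power_uw_per_mhz, frequency_mhz):
    """Scale a per-MHz power figure to an operating frequency; result in mW."""
    return power_uw_per_mhz * frequency_mhz / 1000.0

# 64 by 64 multipliers in the 130g cell library, at 200 MHz.
for name, p in (("Wallace", 281), ("Dadda", 270), ("Reduced Area", 229)):
    print(f"{name}: {power_mw(p, 200):.1f} mW")
```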
Table 7.10: Comparison of average power consumption for
Wallace, Dadda, and Reduced Area multipliers
Word Size    Process    Wallace      Dadda        RA Mult
                        (µW/MHz)     (µW/MHz)     (µW/MHz)
8 by 8       250 nm     22           21           20
8 by 8       180 nm     6.5          6.3          5.9
8 by 8       130g       1.91         1.86         1.71
8 by 8       130p       1.47         1.49         1.37
16 by 16     250 nm     114          110          102
16 by 16     180 nm     36           33           32
16 by 16     130g       10.2         9.9          9.0
16 by 16     130p       8.3          8.2          7.8
32 by 32     250 nm     539          658          442
32 by 32     180 nm     200          211          172
32 by 32     130g       46           43           39
32 by 32     130p       39           37           34
64 by 64     250 nm     3255         3034         2776
64 by 64     180 nm     1022         926          1058
64 by 64     130g       281          270          229
64 by 64     130p       203          199          212
The Reduced Area multiplier is the lowest average power choice for word sizes
equal to or smaller than 32 by 32. For the larger multipliers, there is no clear lowest
power performer. For the 64 by 64 word size, the Dadda multipliers exhibited the lowest
average power in the 180 nm process and when using the low power cell library with the
130 nm process. These inconsistent power profiles among the larger 64 by 64 multipliers
are caused by 1) differences in routing parasitics due to the non-uniform nature of placed
and routed multiplier designs, and 2) spurious signal transitions in lower reduction stages
and the final carry propagate adder as partial product bits are summed.
Across word sizes and standard cell libraries, each timing-driven cell placement
and route provides a unique layout solution. The place and route of a 64 by 64 Dadda
multiplier is not the place and route of a 32 by 32 Dadda multiplier extended by
additional cells. The place and route of a 64 by 64 Dadda multiplier in the 130 nm
standard cell library is not identical to the layout of a 64 by 64 Dadda multiplier in the
180 nm standard cell library with physical characteristics scaled down. Timing-driven
placement and route constructs an optimized layout of the design’s critical path according
to user-specified timing constraints, but it does not attempt to optimize non-critical paths
that meet basic timing requirements. This means that connected cells can be placed at
various distances from each other, resulting in disparate routing parasitics for non-critical
paths. The signal toggling along these non-uniform, non-optimized routes helps to create
a unique power consumption signature for each automated multiplier layout.
Spurious signal transitions also contribute to higher than anticipated power
consumption. For cells with multiple primary inputs, the propagation delays of each
primary input to each primary output are not equal. This delay imbalance allows a cell’s
primary outputs to possibly transition unnecessarily before all of the input stimuli are
resolved. These spurious transitions cause additional power to be consumed in the
reduction stages as well as in the carry-propagate adder.
As the word size doubles, the average power increases by approximately a factor
of five. In Section 7.1, it is predicted that the average power consumed in column
compression multipliers would increase by approximately a factor of four.
This
prediction is based on examining power as a function of capacitance, voltage and
frequency. It does not take into account the multiplier’s long combinatorial logic paths
which allow for significant numbers of spurious transitions during the partial product
reduction and the addition by the final carry-propagate adder.
Also in Section 7.1, it is predicted that average power consumption should reduce
by approximately 50% for each generational transition of approximately 1/√2 in the
process’s minimum feature size, λ. This research finds that average power consumption
decreases by approximately a factor of 3.5, or by roughly 70%. In [23], Weste and
Eshraghian point out that more rigorous analysis would modify the first order
approximations for the scaling of certain MOS device characteristics. They note that
when all MOS dimensions, device voltages, and concentration densities are scaled by
1/√2, dynamic power consumption will decrease by somewhat more than the expected
factor of 2.
Examining power/area ratios in Tables 7.11, 7.12, and 7.13 provides insight into
possible high power consumption within a given area. Where low power is of utmost
concern, detection of these possible hot spots allows the designer to adjust circuits
accordingly before silicon manufacture. From the tables, it is clear that increasing the
operand word size increases the power to area ratio. It is also important to note that
porting from a process with larger cell features to one with smaller cell features reduces the
power to area ratio for a given word size.
Table 7.11: Power/Area for Wallace multipliers
Word Size    Process    Power        Area        Power/Area
                        (µW/MHz)     (µm²)       [µW/(MHz mm²)]
8 by 8       250 nm     22           14,576      1509
8 by 8       180 nm     6.5          8,400       774
8 by 8       130g       1.91         4,388       444
8 by 8       130p       1.47         3,493       421
16 by 16     250 nm     114          53,321      2138
16 by 16     180 nm     36           30,221      1191
16 by 16     130g       10.2         15,661      651
16 by 16     130p       8.3          12,739      652
32 by 32     250 nm     539          195,713     2754
32 by 32     180 nm     200          109,880     1821
32 by 32     130g       46           56,584      813
32 by 32     130p       39           46,867      832
64 by 64     250 nm     3255         738,385     4408
64 by 64     180 nm     1022         412,456     2478
64 by 64     130g       281          211,551     1328
64 by 64     130p       203          177,318     1144
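The Power/Area column is the per-MHz power divided by the layout area after converting µm² to mm². A minimal Python sketch of that conversion is shown below for two Wallace multipliers in the 250 nm process; the function name `power_density` is an assumption made for this example.

```python
def power_density(power_uw_per_mhz, area_um2):
    """Power/area in uW/(MHz mm^2); 1 mm^2 = 1e6 um^2."""
    area_mm2 = area_um2 / 1.0e6
    return power_uw_per_mhz / area_mm2

# 8 by 8 and 64 by 64 Wallace multipliers in the 250 nm process.
small = power_density(22, 14576)
large = power_density(3255, 738385)
print(round(small), round(large))
```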
Table 7.12: Power/Area for Dadda multipliers
Word Size    Process    Power        Area        Power/Area
                        (µW/MHz)     (µm²)       [µW/(MHz mm²)]
8 by 8       250 nm     21           14,570      1441
8 by 8       180 nm     6.3          8,421       748
8 by 8       130g       1.86         4,428       420
8 by 8       130p       1.49         3,515       424
16 by 16     250 nm     110          51,288      2145
16 by 16     180 nm     33           29,174      1131
16 by 16     130g       9.9          15,161      653
16 by 16     130p       8.2          12,353      664
32 by 32     250 nm     658          186,909     3520
32 by 32     180 nm     211          105,230     2005
32 by 32     130g       43           54,257      793
32 by 32     130p       37           45,104      820
64 by 64     250 nm     3034         709,509     4276
64 by 64     180 nm     926          397,137     2332
64 by 64     130g       270          203,784     1325
64 by 64     130p       199          171,474     1161
Table 7.13: Power/Area for Reduced Area multipliers
Word Size    Process    Power        Area        Power/Area
                        (µW/MHz)     (µm²)       [µW/(MHz mm²)]
8 by 8       250 nm     20           13,807      1449
8 by 8       180 nm     5.9          7,990       738
8 by 8       130g       1.71         4,181       409
8 by 8       130p       1.37         3,339       410
16 by 16     250 nm     102          50,185      2032
16 by 16     180 nm     32           28,551      1121
16 by 16     130g       9.0          14,811      608
16 by 16     130p       7.8          12,103      661
32 by 32     250 nm     442          185,407     2384
32 by 32     180 nm     172          104,386     1648
32 by 32     130g       39           53,783      725
32 by 32     130p       34           44,766      760
64 by 64     250 nm     2776         707,699     3923
64 by 64     180 nm     1058         396,111     2671
64 by 64     130g       229          203,207     1127
64 by 64     130p       212          171,060     1239
In order to find a better approximation of average power consumption in column
compression multipliers than the expression k λ² N² discussed in Section 7.1, this research
initially attempted to find a constant scaling factor, α, where average power consumption
in µW/MHz for column compression multipliers could be estimated by the expression:
\[ \mathrm{Power}_{\mathrm{CCMultiplier}} \cong \alpha \times \frac{\mathrm{Area}_{\mathrm{CCMultiplier}}}{t_{ox}} \times V^2 \qquad (7.5) \]
With supply voltage and gate oxide thickness proportional to λ, the expression for
average power consumption in µW/MHz becomes
\[ \mathrm{Power}_{\mathrm{CCMultiplier}} \cong \alpha \times \frac{\mathrm{Area}_{\mathrm{CCMultiplier}}}{\lambda} \times \lambda^2 \qquad (7.6) \]
Substituting the form of the quadratic approximation for area given in Equation 5.1,
\[ \mathrm{Power}_{\mathrm{CCMultiplier}} \cong \alpha \times \left( k_1 \lambda^3 N^2 + k_2 \lambda^3 N + \beta \lambda^3 \right) \qquad (7.7) \]
where N is the word size and k1, k2, and β are constant scaling factors determined using a
least squares method for each type of multiplier in Chapter 5. Examination of plots of the
collected power data showed that average power consumption was better approximated in
terms of λ⁴N² rather than λ³N².
Attempts to determine a constant value for α were unsuccessful. Values for α
ranged too widely. For example, for Wallace multipliers in the 250 nm process
geometry, values of α would be 2.17×10⁻⁸, 3.18×10⁻⁸, 4.15×10⁻⁸, and 6.66×10⁻⁸ for 8 by
8, 16 by 16, 32 by 32, and 64 by 64 word sizes, respectively. Closer examination of the
calculated α values revealed that α was growing by factors of roughly √2 with each
doubling of the word size N. Instead of being a constant, α is specified as
\[ \alpha = q \times \left( \sqrt{2} \right)^{(\log_2 N) - 3} \qquad (7.8) \]
where q is a constant scaling factor. Values of q range between 2.0×10⁻⁸ and 3.0×10⁻⁸.
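Using area approximation equations 5.2, 5.3, and 5.4, the average power for
column compression multipliers is estimated by
\[ \mathrm{Power}_{\mathrm{Wallace}} \cong 2.38\times10^{-8} \left( \sqrt{2} \right)^{(\log_2 N) - 3} \lambda^4 \left( 0.00283\,N^2 + 0.015\,N - 0.0472 \right) \qquad (7.9) \]
\[ \mathrm{Power}_{\mathrm{Dadda}} \cong 2.55\times10^{-8} \left( \sqrt{2} \right)^{(\log_2 N) - 3} \lambda^4 \left( 0.00275\,N^2 + 0.0122\,N - 0.0166 \right) \qquad (7.10) \]
\[ \mathrm{Power}_{\mathrm{RA}} \cong 2.14\times10^{-8} \left( \sqrt{2} \right)^{(\log_2 N) - 3} \lambda^4 \left( 0.00276\,N^2 + 0.0014\,N - 0.0246 \right) \qquad (7.11) \]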
Comparing the average power from these equations to the measured power, the
error ranges for the Wallace, Dadda, and Reduced Area data sets are 15% to -17%, 19%
to -19%, and 20% to -28%, respectively.
For the Reduced Area multipliers, the
magnitude of the error only exceeds 20% in the case of the 64 by 64 multiplier developed
in the 180 nm process technology.
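For multipliers developed in process technologies that are smaller than 250 nm,
the power approximations can be improved by using the area approximations given by
Equations 5.6, 5.7, and 5.8. Estimates for the average power of multipliers in 180 nm and
smaller process geometries can be calculated using
\[ \mathrm{Power}_{\mathrm{Wallace},\,\le 180\,\mathrm{nm}} \cong 2.60\times10^{-8} \left( \sqrt{2} \right)^{(\log_2 N) - 3} \lambda^4 \left( 0.00288\,N^2 + 0.0156\,N - 0.0479 \right) \qquad (7.12) \]
\[ \mathrm{Power}_{\mathrm{Dadda},\,\le 180\,\mathrm{nm}} \cong 2.57\times10^{-8} \left( \sqrt{2} \right)^{(\log_2 N) - 3} \lambda^4 \left( 0.0028\,N^2 + 0.0128\,N - 0.0169 \right) \qquad (7.13) \]
\[ \mathrm{Power}_{\mathrm{RA},\,\le 180\,\mathrm{nm}} \cong 2.53\times10^{-8} \left( \sqrt{2} \right)^{(\log_2 N) - 3} \lambda^4 \left( 0.0028\,N^2 + 0.012\,N - 0.0252 \right) \qquad (7.14) \]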
Comparing the average power from Equations 7.12, 7.13, and 7.14 to the
measured power, the error ranges for the Wallace, Dadda, and Reduced Area data sets for
180 nm and smaller process geometries are 10% to -7%, 13% to -17%, and 19% to -13%,
respectively.
When the 250 nm data is included in the development of the power
estimation equations, the magnitude of the error is as high as 28% for multipliers in the
180 nm and smaller geometries. Excluding the 250 nm data allows the error for the
power consumption approximations to be within ± 20%.
Using Equations 7.12, 7.13, and 7.14, it is possible to predict the average power
of each type of column compression multiplier in smaller process geometries. Tables
7.14 and 7.15 list the predicted average power for Wallace, Dadda, and Reduced Area
multipliers in 90 nm and 65 nm process technologies.
Table 7.14: Predicted average power for column compression
multipliers in a 90 nm process technology

Word Size    Wallace     Dadda       Reduced Area
             (µW/MHz)    (µW/MHz)    (µW/MHz)
8 by 8       0.44        0.45        0.41
16 by 16     2.24        2.16        2.07
32 by 32     11.5        11.0        10.7
64 by 64     61.3        58.5        57.3
Table 7.15: Predicted average power for column compression
multipliers in a 65 nm process technology

Word Size    Wallace     Dadda       Reduced Area
             (µW/MHz)    (µW/MHz)    (µW/MHz)
8 by 8       0.12        0.12        0.11
16 by 16     0.61        0.59        0.56
32 by 32     3.14        2.99        2.91
64 by 64     16.7        15.9        15.6
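The entries in Tables 7.14 and 7.15 can be regenerated by evaluating Equations 7.12, 7.13, and 7.14 directly. The following Python sketch is illustrative only and is not the tool used in this research. It assumes that each equation has the form q (√2)^((log2 N) − 3) λ⁴ (aN² + bN − c), with λ expressed in nanometers and the result in µW/MHz; under that reading the computed values agree with the tables to within a few percent, presumably because the published coefficients are rounded. The names POWER_COEFFS and predicted_power are hypothetical.

```python
import math

# (q, a, b, c) read from Equations 7.12-7.14 (180 nm and smaller geometries).
# Assumed form: q * sqrt(2)**(log2(N) - 3) * lambda**4 * (a*N**2 + b*N - c)
POWER_COEFFS = {
    "Wallace": (2.60e-8, 0.00288, 0.0156, 0.0479),
    "Dadda": (2.57e-8, 0.0028, 0.0128, 0.0169),
    "Reduced Area": (2.53e-8, 0.0028, 0.012, 0.0252),
}


def predicted_power(kind, n, feature_nm):
    """Estimate average power (assumed uW/MHz) of an n-by-n multiplier.

    feature_nm is the process minimum feature size, assumed to be in nm.
    """
    q, a, b, c = POWER_COEFFS[kind]
    stage_term = math.sqrt(2.0) ** (math.log2(n) - 3)
    area_poly = a * n * n + b * n - c
    return q * stage_term * feature_nm ** 4 * area_poly


# Print one block per process technology, one row per word size.
for feature_nm in (90, 65):
    print(f"{feature_nm} nm")
    for n in (8, 16, 32, 64):
        row = "  ".join(
            f"{kind}: {predicted_power(kind, n, feature_nm):7.2f}"
            for kind in POWER_COEFFS
        )
        print(f"  {n:2d} by {n:<2d}  {row}")
```

Because the word size enters only through log2 N and the polynomial, the same function can be used to explore word sizes or feature sizes outside the tabulated set, subject to the error bounds quoted for the equations.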
A generalized power approximation equation for all types of column compression
multipliers is not given. The differences amongst the average power data for the three
types of multipliers for a given word size and process technology were as little as 3% and
as high as 48%. Attempts at developing one generalized power approximation equation
would produce an error that significantly exceeds ± 20%. For this research, and generally
in design engineering practice, equations for approximating area, delay, or power are
deemed unacceptable and worthless if the error exceeds ± 20%.
7.7 Power Summary
The UltraSim simulations performed in this research provide insight into the
power characteristics of Wallace, Dadda, and Reduced Area multipliers. One of the key
conclusions from examining forty-eight column compression multipliers is that Wallace,
Dadda, and Reduced Area multipliers do not consume equal power. Typically, average
power varies by 10% to 20% amongst the multipliers for a given word size. For
word sizes 32 by 32 and smaller, Reduced Area multipliers consistently consume the
least power. Wallace multipliers usually, but not always, consume the most power.
As the word size doubles, the average power increases by approximately a factor
of five.
Initial analysis estimated a factor of four increase, not five, but power
examinations that do not take into account the flow of signals, including spurious
transitions, will underestimate power in these large, multi-staged multipliers.
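The factor of approximately five can be traced through the power approximation itself. As a worked sketch, assume the form q (√2)^((log2 N) − 3) λ⁴ (aN² + bN − c) with the Wallace coefficients of Equation 7.12 (a = 0.00288, b = 0.0156, c = 0.0479). Doubling the word size multiplies the first term by √2 and the area polynomial by slightly less than four:

```latex
\[
\frac{P(2N)}{P(N)} = \sqrt{2}\;\frac{a(2N)^{2} + b(2N) - c}{aN^{2} + bN - c},
\qquad
\left.\frac{P(16)}{P(8)}\right|_{\mathrm{Wallace}}
  = \sqrt{2}\,\frac{0.93898}{0.26122} \approx 5.08 .
\]
```

For large N the ratio approaches 4√2 ≈ 5.66, so an estimate based on area growth alone (a factor of four) omits the √2 term.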
Average power consumption decreases by approximately a factor of 3.5, or roughly
70%, for each generational transition in which the process minimum feature size scales
by approximately 1/√2. This reduction in average power consumption is larger than the
factor of 2, or 50%, initially given as a first order approximation. More rigorous
examinations of power consumption support larger decreases than the expected 50% as
MOS device characteristics are scaled by 1/√2.
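The λ⁴ dependence in the power approximations gives a quick consistency check on this scaling. The following worked ratio is a sketch that assumes the form of Equations 7.12 to 7.14 (power proportional to λ⁴ for a fixed word size) and takes the generational scale factor to be 1/√2, which matches the 250, 180, 130, 90, and 65 nm sequence:

```latex
\[
\frac{P(\lambda)}{P(\lambda/\sqrt{2})} = \left(\sqrt{2}\right)^{4} = 4,
\qquad
\frac{P(90\ \mathrm{nm})}{P(65\ \mathrm{nm})} = \left(\frac{90}{65}\right)^{4} \approx 3.68 .
\]
```

The second ratio uses the actual 90 nm and 65 nm feature sizes, whose ratio is slightly below √2, and is in line with the factor of approximately 3.5 quoted above.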
The Wallace, Dadda, and Reduced Area multipliers show very similar power to
area ratios for a given word size and process technology. This indicates that these
column compression multipliers would exhibit similar “hot spot” characteristics.
The average power for each of the three types of column compression multipliers
has been estimated in terms of the word size, N, and the process’s minimum feature size,
λ. It is not feasible to create one generalized power approximation equation since the
average power varies significantly amongst the types of column compression multipliers
for a given word size and process technology.
If the design goal is to quickly develop a low power column compression
multiplier, then a Reduced Area multiplier should be implemented using an automated
place and route methodology.
For a given word size and process technology, the
Reduced Area multipliers consumed the least power 14 out of 16 times when compared
to Wallace and Dadda multipliers.
Chapter 8
Conclusions
During the past decade, many chip architects and IC designers have insisted that
fast multipliers are only realized by custom design and layout, which often takes an
engineer three months or more to design, lay out, and verify. Column compression
multipliers are dismissed as too time consuming and complex to lay out because of their
irregular structure. Unlike the two to three year development periods allotted for large
microprocessors, the time to market for application specific ICs is typically three to six
months. This research demonstrates that an automated multiplier generation and layout
process makes the column compression multiplier a viable option for application specific
CMOS products.
In this research, sixty column compression multipliers were designed in order to
better understand size and performance characteristics. These multipliers were developed
and analyzed using industry standard design practices and tools. The resultant area,
delay, and power data provide key insight into how the multipliers perform as the
operand word size and the process technology are changed.
The place and route of the multipliers yields extremely compact and regular
layouts. All of the multipliers show very high row utilizations at 95%. Across different
process technologies, doubling the operand size will increase the total area by
approximately a factor of four for each type of multiplier. The area of column compression
multipliers shrinks to approximately half (a factor of 0.5) for each generational
transition in which the process minimum feature size scales by approximately 1/√2.
This research has shown that the area of an N by N column compression
multiplier is best estimated by an approximation that includes both an N² term and an N
term. For 180 nm and smaller process geometries, the area of column compression
multipliers can be estimated to within ± 2% for a given process geometry, λ, using the
following type-specific equations or the generalized equation:
\[ \mathrm{Area}_{\mathrm{Wallace},\,<180\,\mathrm{nm}} \cong 0.00288\,\lambda^{2}N^{2} + 0.0156\,\lambda^{2}N - 0.0479\,\lambda^{2} \tag{8.1} \]
\[ \mathrm{Area}_{\mathrm{Dadda},\,<180\,\mathrm{nm}} \cong 0.0028\,\lambda^{2}N^{2} + 0.0128\,\lambda^{2}N - 0.0169\,\lambda^{2} \tag{8.2} \]
\[ \mathrm{Area}_{\mathrm{RA},\,<180\,\mathrm{nm}} \cong 0.0028\,\lambda^{2}N^{2} + 0.012\,\lambda^{2}N - 0.0252\,\lambda^{2} \tag{8.3} \]
\[ \mathrm{Area}_{\mathrm{CCmultiplier},\,<180\,\mathrm{nm}} \cong 0.00283\,\lambda^{2}N^{2} + 0.0134\,\lambda^{2}N - 0.03\,\lambda^{2} \tag{8.4} \]
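The two scaling observations for area (roughly a factor of four per doubling of N, and a factor of 0.5 per process generation) follow directly from Equations 8.1 to 8.4. The Python sketch below is an illustration under stated assumptions: the equations are read as λ²(aN² + bN − c), the feature size is passed in nanometers (the ratios printed do not depend on that choice), and the names AREA_COEFFS and estimated_area are hypothetical.

```python
import math

# (a, b, c) for Area ~ lambda**2 * (a*N**2 + b*N - c), Equations 8.1-8.4.
AREA_COEFFS = {
    "Wallace": (0.00288, 0.0156, 0.0479),
    "Dadda": (0.0028, 0.0128, 0.0169),
    "Reduced Area": (0.0028, 0.012, 0.0252),
    "Generalized": (0.00283, 0.0134, 0.03),
}


def estimated_area(kind, n, feature_size):
    """Estimate the area of an n-by-n multiplier for a given feature size."""
    a, b, c = AREA_COEFFS[kind]
    return feature_size ** 2 * (a * n * n + b * n - c)


for kind in AREA_COEFFS:
    # Growth in area each time the word size doubles (8->16, 16->32, 32->64).
    growth = [
        estimated_area(kind, 2 * n, 180.0) / estimated_area(kind, n, 180.0)
        for n in (8, 16, 32)
    ]
    # Change in area when the feature size scales by 1/sqrt(2).
    shrink = (
        estimated_area(kind, 32, 180.0 / math.sqrt(2.0))
        / estimated_area(kind, 32, 180.0)
    )
    print(kind, [round(g, 2) for g in growth], round(shrink, 3))
```

With these coefficients the growth ratios fall between roughly 3.4 and 3.8 and approach four as N increases, because the linear and constant terms matter less at large word sizes.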
The exclusion of the 250 nm data to develop the improved area approximations for
designs in 180 nm and smaller process geometries is a valid step. The 250 nm cell library
belongs to an architecturally different family of cell libraries, whereas the 180 nm, 130
nm, and 90 nm cell libraries are all part of the same design family. The semiconductor
foundry for all of the process technologies and the supplier of all the standard cell
libraries are the premier vendors for their respective markets. Other tier 1 and tier 2
merchant semiconductor foundries attempt to clone the process technologies of this
semiconductor foundry. Utilizing this latest architectural family of standard cell libraries
in conjunction with the process technologies from the premier merchant
semiconductor foundry forms the best basis for predicting area for column compression
multipliers for 180 nm and smaller geometries.
The delay data of this research challenges the prediction that multiplier delay
increases in proportion to the logarithm of the word size.
Using this logarithmic relationship alone provides only a very rough estimate, with the
error often exceeding ± 20%. A better approximation for delay includes an additional
λ term. The delay of column compression multipliers can be estimated to within ± 14%
for a given process geometry, λ, using the following type-specific equations or the
generalized equation:
\[ \mathrm{Delay}_{\mathrm{Wallace}} \cong 0.0094\,\lambda\,\log_2(N) - 0.011\,\lambda \tag{8.5} \]
\[ \mathrm{Delay}_{\mathrm{Dadda}} \cong 0.0082\,\lambda\,\log_2(N) - 0.0066\,\lambda \tag{8.6} \]
\[ \mathrm{Delay}_{\mathrm{RA}} \cong 0.0087\,\lambda\,\log_2(N) - 0.0083\,\lambda \tag{8.7} \]
\[ \mathrm{Delay}_{\mathrm{CCmultiplier}} \cong 0.0088\,\lambda\,\log_2(N) - 0.0085\,\lambda \tag{8.8} \]
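Equations 8.5 to 8.8 are simple enough to evaluate by hand, but a short sketch makes the comparison across multiplier types easier. The Python below is illustrative and rests on assumptions: each equation is read as λ(k log2 N − c), the feature size is supplied in nanometers by analogy with the power equations, and the result is in whatever delay unit the fitted coefficients imply. The names DELAY_COEFFS and estimated_delay are hypothetical.

```python
import math

# (k, c) for Delay ~ lambda * (k*log2(N) - c), Equations 8.5-8.8.
DELAY_COEFFS = {
    "Wallace": (0.0094, 0.011),
    "Dadda": (0.0082, 0.0066),
    "Reduced Area": (0.0087, 0.0083),
    "Generalized": (0.0088, 0.0085),
}


def estimated_delay(kind, n, feature_size):
    """Estimate the delay of an n-by-n multiplier for a given feature size."""
    k, c = DELAY_COEFFS[kind]
    return feature_size * (k * math.log2(n) - c)


# Compare the estimates for each word size in an assumed 90 nm process.
for n in (8, 16, 32, 64):
    row = "  ".join(
        f"{kind}: {estimated_delay(kind, n, 90.0):.3f}" for kind in DELAY_COEFFS
    )
    print(f"{n:2d} by {n:<2d}  {row}")
```

In this form each doubling of the word size adds a fixed increment kλ to the delay, and at N = 32 the three type-specific estimates differ by less than 5%.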
The average power values are very close, but it cannot be said that the three
multipliers show approximately equal power consumption for a given word size and
process technology.
Power consumption in column compression multipliers does
increase with an increase in word size. For a given word size and process technology, the
Reduced Area multipliers consume the least average power 14 out of 16 times in
comparison to Wallace and Dadda multipliers. As the word size doubles, the average
power increases by approximately a factor of five.
Average power consumption decreases by approximately a factor of 3.5, or roughly 70%,
for each generational transition in which the process minimum feature size scales by
approximately 1/√2.
Using Equations 8.9 and 8.10, the average power of Wallace and Dadda
multipliers can be estimated to within ± 17% and ± 19%, respectively, for a given process
geometry. Using Equation 8.11, the average power for Reduced Area multipliers can be
estimated to within ± 28%. This significantly larger error for the Reduced Area
multiplier is due to deriving the power equation using two power data points which seem
unusually high. These higher than anticipated measured power values may be due to
non-optimized routing loads and spurious signal transitions.
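\[ \mathrm{Power}_{\mathrm{Wallace}} \cong 2.38\times10^{-8}\,\left(\sqrt{2}\right)^{(\log_2 N)-3}\,\lambda^{4}\left(0.00283\,N^{2} + 0.015\,N - 0.0472\right) \tag{8.9} \]
\[ \mathrm{Power}_{\mathrm{Dadda}} \cong 2.55\times10^{-8}\,\left(\sqrt{2}\right)^{(\log_2 N)-3}\,\lambda^{4}\left(0.00275\,N^{2} + 0.0122\,N - 0.0166\right) \tag{8.10} \]
\[ \mathrm{Power}_{\mathrm{RA}} \cong 2.14\times10^{-8}\,\left(\sqrt{2}\right)^{(\log_2 N)-3}\,\lambda^{4}\left(0.00276\,N^{2} + 0.0014\,N - 0.0246\right) \tag{8.11} \]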
Excluding 250 nm data from the development of the area equations yields
power approximation equations with lower error magnitudes. The average power of
Wallace, Dadda, and Reduced Area multipliers can be estimated to within ± 10%, ± 17%,
and ± 19%, respectively, for 180 nm and smaller geometries using the following
equations:
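\[ \mathrm{Power}_{\mathrm{Wallace},\,<180\,\mathrm{nm}} \cong 2.60\times10^{-8}\,\left(\sqrt{2}\right)^{(\log_2 N)-3}\,\lambda^{4}\left(0.00288\,N^{2} + 0.0156\,N - 0.0479\right) \tag{8.12} \]
\[ \mathrm{Power}_{\mathrm{Dadda},\,<180\,\mathrm{nm}} \cong 2.57\times10^{-8}\,\left(\sqrt{2}\right)^{(\log_2 N)-3}\,\lambda^{4}\left(0.0028\,N^{2} + 0.0128\,N - 0.0169\right) \tag{8.13} \]
\[ \mathrm{Power}_{\mathrm{RA},\,<180\,\mathrm{nm}} \cong 2.53\times10^{-8}\,\left(\sqrt{2}\right)^{(\log_2 N)-3}\,\lambda^{4}\left(0.0028\,N^{2} + 0.012\,N - 0.0252\right) \tag{8.14} \]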
Column compression multipliers should be used when fast multiplication is
needed in CMOS products that are largely developed using automated design, layout, and
verification practices. When the IC development schedule is short, these multipliers can
be generated, placed and routed, and simulated in a matter of days instead of the two or
three months required for a custom implementation.
With the automation of the
development of the column compression multipliers, chip architects and designers can
quickly accommodate changes in the word size or the selection of a different
semiconductor foundry. Also, once developed, the multiplier’s fully verified netlist can
be easily reused for future products.
Based on the analysis of area, delay, and power data summarized herein, select
the Reduced Area multiplier for implementation. The Reduced Area multiplier lives up
to its name by having the smallest area of the three types of multipliers examined. In
most cases, the Reduced Area multiplier consumes the least average power. While
achieving the smallest area and the lowest power, the Reduced Area multiplier maintains
the same fast delay as the Wallace and Dadda multipliers.
The choice of column compression multiplier type may be affected
by architectural requirements on the final carry propagate adder. For example, if the final
carry propagate adder is going to be used to sum operands from other arithmetic functions,
the word length of the final carry propagate adder may need to be longer than the word
length for a Reduced Area multiplier. In this case, the IC designer should consider the
Wallace multiplier, whose adder word length is one bit pair longer, or the Dadda multiplier,
whose adder word length is S bit pairs longer, where S is the number of reduction stages.
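The relative adder lengths described above can be tabulated once S is known. The sketch below is an illustration, not part of the multiplier generator used in this research. It assumes N rows of partial products (no Booth recoding) and takes S from the standard Dadda height sequence 2, 3, 4, 6, 9, 13, ..., in which each height is the floor of 1.5 times the previous one. It reports only the extra adder length relative to the Reduced Area multiplier, as stated in the text. The function names are hypothetical.

```python
def reduction_stages(n):
    """Count reduction stages S for n partial-product rows.

    Assumes the Dadda height sequence 2, 3, 4, 6, 9, ...; S is the number
    of heights in that sequence that are smaller than n.
    """
    height, stages = 2, 0
    while height < n:
        height = (3 * height) // 2  # next height: floor(1.5 * height)
        stages += 1
    return stages


def extra_adder_bit_pairs(kind, n):
    """Extra final-adder length, in bit pairs, relative to Reduced Area."""
    if kind == "Reduced Area":
        return 0
    if kind == "Wallace":
        return 1
    if kind == "Dadda":
        return reduction_stages(n)
    raise ValueError(f"unknown multiplier type: {kind}")


for n in (8, 16, 32, 64):
    extras = {k: extra_adder_bit_pairs(k, n) for k in ("Reduced Area", "Wallace", "Dadda")}
    print(f"{n:2d} by {n:<2d}  S = {reduction_stages(n):2d}  {extras}")
```

Under these assumptions S is 4, 6, 8, and 10 for word sizes of 8, 16, 32, and 64 bits, so the Dadda adder margin grows with word size while the Wallace margin stays fixed.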
Finally, this research shows the critical importance of the (3,2) counters.
Regardless of which type of column compression multiplier is selected, an IC designer
can significantly improve the multiplier’s performance by adjusting the (3,2) counter cell.
For the 130 nm and smaller process geometries, many standard cell library vendors offer
special (3,2) counter cells that have been tailored to be low power or high speed. If time
is available for any custom design and layout, then develop a new (3,2) counter that fits
the performance goals. Note that for a new, custom (3,2) counter cell to be inserted into a
project’s set of standard cells, the new cell would require significant simulation time to
fully characterize it and create its timing file as well as the creation of multiple layout
views.
136
Bibliography
[1]
Robert F. Shaw, “Arithmetic Operations in a Binary Computer,” Review of
Scientific Instruments, vol. 21, pp. 687-693, 1950.
[2]
J. C. Majithia and R. Kitai, “An Iterative Array for Multiplication of Signed
Binary Numbers,” IEEE Transactions on Electronic Computers, vol. EC-13, pp.
14-17, 1964.
[3]
R. De Mori, “Suggestions for an I.C. Fast Parallel Multiplier,” Electronics
Letters, vol. 5, pp. 50 -51, 1965.
[4]
H. H. Guild, “Fully Iterative Fast Array for Binary Multiplication,” Electronics
Letters, vol. 38, pp. 843-852, 1968.
[5]
A. D. Pezaris, “A 40ns 17-bit by 17-bit Array Multiplier,” IEEE Transactions on
Computers, vol. C-20, pp. 442-447, 1971.
[6]
Israel Koren, Computer Arithmetic Algorithms, Englewood Cliffs, NJ: Prentice
Hall, Inc., 1993.
[7]
Andrew D. Booth, “A Signed Binary Multiplication Technique,” Quarterly
Journal of Mechanics and Applied Mathematics, vol. 4, pp. 236-240, 1951.
[8]
Charles R. Baugh and Bruce. A. Wooley, “A Two’s Complement Parallel Array
Multiplication Algorithm,” IEEE Transactions on Computers, vol. C-22, pp.
1045-1047, 1973.
[9]
Thomas K. Callaway and Earl E. Swartzlander, Jr., “Optimizing Multipliers for
WSI,” Proceedings of the 1993 International Conference on Wafer Scale
Integration, pp. 85-94, 1993.
[10]
N. Vansantha, M. Satyam, and K. Subba Rao, “Technique for Minimizing Power
Consumption in Array Multipliers through Input Vector Ordering,” Proceedings
of the International Conference on Signal Processing, Communications, and
Networking, pp. 162-167, February, 2007.
[11]
Edwin de Angel and Earl E. Swartzlander, Jr., “An Ultra Low Power Multiplier,”
International Conference on Signal Processing Applications and Technology, pp.
2118-2122, 1995.
137
[12]
Shivaling S. Mahant-Shetti, Poras T. Balsara, and Carl Lemonds, “High
Performance Low Power Array Multiplier Using Temporal Tiling,” IEEE
Transactions on Very Large Scale Integration Systems, vol. 7, pp. 121-124, 1999.
[13]
Chang-Young Han, Hyoung-Joon Park, and Lee-Sup Kim, “A Low-Power Array
Multiplier Using Separated Multiplication Technique,” IEEE Transactions on
Circuits and Systems II: Analog and Digital Signal Processing, vol. 48, pp. 866871, 2001.
[14]
C. S. Wallace, “A Suggestion for a Fast Multiplier,” IEEE Transactions on
Electronic Computers, vol. EC-13, pp. 14-17, 1964.
[15]
Luigi Dadda, “Some Schemes for Parallel Multipliers,” Alta Frequenza, vol. 34,
pp. 349-356, August 1965.
[16]
P. R. Cappello and K. Steiglitz, “A VLSI Layout for a Pipelined Dadda
Multiplier,” ACM Transactions on Computer Systems, vol. 1, pp. 157-174, 1983.
[17]
L. Breveglieri, L. Dadda, and V. Piuri, “Column Compression Pipelined
Multipliers,” Proceedings 1995 International Conference on Application Specific
Array Processors, pp. 93-103, 1995.
[18]
Jieh-Hwang Yen, Lan-Rong Dung, and Chi-Yuan Shen, “Design of Power-Aware
Multiplier with Graceful Quality-Power Trade-Offs,” IEEE International
Symposium on Circuits and Systems, vol. 2, pp. 1642-1645, May 2005.
[19]
O. L. MacSorley, “High-Speed Arithmetic in Binary Computers,” Proceedings of
the IRE, vol. 49, pp. 67-91, 1961.
[20]
Bruce Gilchrist, J. H. Pomerene, and S. Y. Wong, “Fast Carry Logic for Digital
Computers,” IRE Transactions on Electronic Computers, vol. EC-4, pp. 133-136,
1955.
[21]
A. Weinberger and J. L. Smith, “A Logic for High-Speed Addition,” Nat. Bur.
Stand. Circ. 591, pp. 3-12, 1958.
[22]
J. Sklansky, “An Evaluation of Several Two-summand Binary Adders,” IRE
Transactions on Electronic Computers, vol. EC-9, pp. 213-226, 1960.
[23]
Neil H. Weste and Kamran Eshraghian, Principles of CMOS VLSI Design: A
Systems Perspective, 2nd Edition, Reading, MA: Addison-Wesley Publishing Co.,
1993.
138
[24]
Behrooz Parhami, Computer Arithmetic: Algorithms and Hardware Designs,
New York: Oxford University Press, 2000.
[25]
G. W. McIver, R. W. Miller, and T. G. O’Shaughnessy, “A Monolithic 16x16
Digital Multiplier,” IEEE International Solid-State Circuits Conference Digest of
Technical Papers, pp. 231-233, 1974.
[26]
K’Andrea C. Bickerstaff, Michael J. Schulte, and Earl E. Swartzlander, Jr.,
“Reduced Area Multipliers,” Proceedings of the 1993 International Conference
on Application Specific Array Processors, pp. 478-489, 1993.
[27]
Z. Wang, G. A. Jullien, and W. C. Miller, “ A New Design Technique for Column
Compression Multipliers,” IEEE Transactions on Computers, vol. 44, pp. 962970, 1995.
[28]
V. G. Oklobdzija, D. Villeger, and S. S. Liu, “A Method for Speed Optimized
Partial Product Reduction and Generation of Fast Parallel Multipliers Using an
Algorithmic Approach,” IEEE Transactions on Computers, vol. 45, pp. 294-305,
1996.
[29]
Luigi Dadda, “On Parallel Digital Multipliers,” Alta Frequenza, vol. 45, pp. 574580, 1976.
[30]
Earl E. Swartzlander, Jr., “Parallel Counters,” IEEE Transactions on Computers,
vol. C-22, pp. 1021-1024, 1973.
[31]
V. G. Oklobdzija, “Improving Multiplier Design by Using Improved Column
Compression Tree and Optimized Final Adder in CMOS Technology,” IEEE
Transactions on VLSI Systems, vol. 3, pp. 292-301, 1995.
[32]
Ohsang Kwon, K. Nowka, and Earl E. Swartzlander, Jr., “A 16-bit x 16-bit MAC
design using fast 5:2 compressor,” Proceedings of the IEEE International
Conference on Application Specific Systems, Architectures, and Processors, pp.
235-243, July, 2000.
[33]
M. Zhuang and H. Hu, “A new design of the CMOS full adder,” IEEE Journal of
Solid-State Circuits,” vol. 27, pp. 840-844, 1992.
[34]
M. Alioto and G. Palumbo, “Analysis and comparison on full adder block in
submicron technology,” IEEE Transactions on VLSI Systems, vol. 10, pp. 806823, 2002.
139
[35]
D. Radhakrishnan, “Low-voltage low-power CMOS full adder,” Proceedings IEE
Circuits, Devices, and Systems, vol. 148, pp. 19-24, 2001.
[36]
Hung Tien Bui, Yuke Wang, and Yingtao Jiang, “Design and analysis of lowpower 10-transistor full adders using novel XOR-XNOR gates,” IEEE
Transactions on Circuits and Systems II: Analog and Digital Signal Processing,
vol. 49, pp. 25-30, 2002.
[37]
Chip-Hong Chang, Jiangmin Gu, and Mingyan Zhang, “A Review of 0.18-µm
Full Adder Performances for Tree Structured Arithmetic Circuits,” IEEE
Transactions on VLSI Systems, vol. 13, pp. 686-695, 2005.
[38]
Ahmed M. Shams, Tarek K. Darwish, and Magdy A. Bayoumi, “Performance
Analysis of Low-Power 1-Bit CMOS Full Adder,” IEEE Transactions on VLSI
Systems, vol. 10, pp. 20-29, 2002.
[39]
Sumeer Goel, Ashok Kumar, and Magdy A. Bayoumi, “Design of Robust,
Energy-Efficient Full Adders for Deep-Submicrometer Design Using HybridCMOS Logic Style,” IEEE Transactions on VLSI Systems, vol. 14, pp. 13091321, 2006.
[40]
K’Andrea C. Bickerstaff, Michael J. Schulte, and Earl E. Swartzlander, Jr.,
“Parallel Reduced Area Multipliers,” Journal of VLSI Signal Processing, vol. 9,
pp. 181-191, 1995.
[41]
K’Andrea C. Bickerstaff, Earl E. Swartzlander, Jr, and Michael J. Schulte,
“Analysis of Column Compression Multipliers,” Proceedings of the 15th IEEE
Symposium on Computer Arithmetic, pp. 33-39, 2001.
[42]
H. Al-Twaijry and M. Flynn, Multipliers and Datapaths, Stanford University,
Technical Report: CSL-TR-94-654, December, 1994.
[43]
Earl E. Swartzlander, Jr., “A Review of Large Parallel Counter Designs,”
Proceedings of the IEEE Computer Society Annual Symposium on VLSI Emerging
Trends in VLSI Systems Design, pp. 88-98, February, 2004.
[44]
M. Mehta, V. Parmar, and Earl E. Swartzlander, Jr., “High-Speed Multiplier
Design Using Multi-Input Counter and Compressor Circuits,” Proceedings of the
10th Symposium on Computer Arithmetic, pp. 43-50, 1991.
[45]
Robert F. Jones and Earl E. Swartzlander, Jr., “Parallel Counter
Implementations,” Journal of VLSI Signal Processing, vol. 7, pp. 223-232, 1994.
140
[46]
P. J. Song and G. De Micheli, “Circuit and Architecture Trade-offs for HighSpeed Multiplication,” IEEE Journal of Solid-State Circuits, vol. 26, pp. 1184 –
1198, 1991.
[47]
M. Nagamatsu, S. Tanaka, J. Mori, K. Hirano, T. Noguchi, and K. Hatanaka, “A
15-ns 32x32-b CMOS Multiplier with an Improved Parallel Structure,” IEEE
Journal of Solid-State Circuits, vol. 25, pp. 494-497, 1990.
[48]
N. Ohkubo, M. Suzuki, T. Shinbo, T. Yamanaka, A. Shimizu, K. Sasaki, and Y.
Nakagome, “A 4.4 ns CMOS 54x54-b Multiplier Using Pass-Transistor
Multiplexer,” IEEE Journal of Solid-State Circuits, vol. 30, pp. 251-257, 1995.
[49]
Robert F. Jones and Earl E. Swartzlander, Jr., “ Parallel Counter Implementation,”
Twenty-Sixth Asilomar Conference on Signals, Systems and Computers, vol. 1,
pp. 381-385, October, 1992.
[50]
P. K. Chan, M. D. F. Schlag, C. D. Thomborson, and V. G. Oklobdzija, “Delay
Optimization of Carry-Skip Adders and Block Carry-Lookahead Adders,”
Proceedings of the 10th Symposium on Computer Arithmetic, pp. 154-164, 1991.
[51]
B. D. Lee and V. G. Oklobdzija, “Improved CLA Scheme with Optimized
Delay,” Journal of VLSI Signal Processing, vol. 3, pp. 265-274, 1991.
[52]
S. Turrini, “Optimal Group Distribution in Carry-Skip Adders,” Proceedings of
the 9th IEEE Symposium on Computer Arithmetic, pp. 96-103, 1991.
[53]
P. K. Chan, M. D. F. Schlag, “A Note on Design Two-Level Carry-Skip Adders,”
Journal of VLSI Signal Processing, vol. 3, pp. 275-281, 1991.
[54]
V. Kantaburtra, “Designing Optimum Carry-Skip Adders,” Proceedings of the
10th Symposium on Computer Arithmetic, pp. 146-153, 1991.
[55]
N. T. Quach and Michael J. Flynn, “High-Speed Addition in CMOS,” IEEE
Transactions on Computers, vol. 41, 1992.
[56]
Thomas K. Callaway and Earl E. Swartzlander, Jr., “Optimizing Adders for
WSI,” Proceedings 1992 International Conference on Wafer-Scale Integration,
pp. 251-260, 1992.
[57]
Thomas K. Callaway and Earl E. Swartzlander, Jr., “Estimating the Power
Consumption of CMOS Adders,” Proceedings of the 11th Symposium on
Computer Arithmetic, pp. 210-216, 1993.
141
[58]
V. G. Oklobdzija, “Design and Analysis of Fast Carry-Propagate Adder Under
Non-Equal Input Signal Arrival Profile,” 28th Asilomar Conference Signals,
Systems, and Computers, pp. 1398-1401, 1995.
[59]
Niichi Itoh, Yuka Naemura, Hiroshi Makino, Yasunobu Nakase, Tsutomo
Yoshihara, and Yasutaka Horiba, “A 600-MHz 54x54-bit Multiplier with
Rectangular-Styled Wallace Tree,” IEEE Journal of Solid-State Circuits, vol. 36,
pp. 249-257, 2001.
[60]
Earl E. Swartzlander, Jr., “High-Speed Computer Arithmetic,” in Allen B.
Tucker, ed., The Computer Science and Engineering Handbook, Boca Raton:
CRC Press, pp. 462-481, 1997.
[61]
Jalil Fadavi-Ardekani, “MxN Booth Encoded Multiplier Generator Using
Optimized Wallace Trees,” Proceedings of the IEEE 1992 International
Conference on Computer Design, pp, 114-117, October, 1992.
[62]
Pascal Delamotte, Jean-Michel Servant, and Yann Boyer-Chammard, “M_GM: A
Module Generator for Multipliers,” Proceedings of the 32nd Midwest Symposium
on Circuits and Systems, pp. 813-816, August, 1989.
[63]
Johnny Pihl and Einar J. Aas, “A Multiplier and Squarer Generator for High
Performance DSP Applications,” IEEE 39th Symposium on Circuits and Systems,
pp. 109-112, August, 1996.
[64]
S. –F. Hsiao and M. –R. Jiang, “Efficient Synthesiser for Generation of Fast
Parallel Multipliers,” IEE Proceedings of Computers and Digital Techniques, vol.
147, pp. 49-52, 2000.
[65]
Yu Qian and Wang Dong-Hui, “A Design of Regularized Multiplier Generator,”
Proceedings of the 5th International Conference on ASIC, pp. 1269-1272,
October, 2003.
[66]
EncounterTM User Guide, pp. 582-592, February, 2006.
[67]
Virtuoso® UltraSim User Guide, p. 17, June, 2004.
[68]
Robert H. Dennard, Fritz H. Gaensslen, Hwa-Nien Yu, V. Leo Rideout, Ernest
Bassous, and Andre R LeBlanc, “Design of Ion-Implanted MOSFETs with Very
Small Physical Dimensions,” IEEE Journal of Solid-State Circuits, vol. SC-9, pp.
256-268, 1974.
142
[69]
“Virtuoso UltraSim Full-Chip Simulator Netlist-Based Electromigration Voltage
Drop (EMIR) Flow,” Cadence Design Systems, Inc., 2007
[70]
“Electromigration for Designers,” Cadence Design Systems, Inc., 2002
[71]
William R. Hunter, “The Implications of Self-Consistent Current Density Design
Guidelines Comprehending Electromigration and Joule Heating for Interconnect
Technology Evolution,” International Electron Devices Meeting, pp. 483-486,
1995.
143
Vita
K’Andrea Catherine Bickerstaff was born in Montgomery, Alabama, on May 28,
1967, the daughter of Pressley and Doris Bickerstaff. In 1985, as class valedictorian, she
completed the high school curriculum of Saint Jude Educational Institute in Montgomery,
Alabama, and entered the Massachusetts Institute of Technology in Cambridge,
Massachusetts. During her undergraduate studies, she was employed as a summer intern
at Hewlett Packard Company in Santa Rosa, California, NCR Corporation in Liberty,
South Carolina, and Polaroid Corporation in Cambridge, Massachusetts. In 1989, she
received the degree of Bachelor of Science in Electrical Engineering from the
Massachusetts Institute of Technology. Awarded a GEM Fellowship with sponsorship
from Polaroid Corporation, she received the Master of Science degree from the
University of Texas at Austin in December 1992. From 1993 to 1995, she worked as a
Development Engineer in the PCI Components Division at Intel Corporation in Folsom,
California. Returning to Austin, Texas, in 1995, she has worked as a Senior Design
Engineer at Crystal Semiconductor, a Technical Consultant at Brobeck, Phleger, and
Harrison LLP, and a Design Manager at Cirrus Logic. In 2005, she founded KenQuest
LLC, offering research, design, and management services. Most recently, as Acting
Director of Engineering at Luminary Micro, Inc., she led the engineering team through
the development of the first ARM Cortex-M3 based SoC.
Permanent Address: 5216 Crystal Water Drive, Austin, Texas, 78735
This dissertation was typed by the author.