
Reinventing The Wheel:
Developing a New Standard-Cell
Synthesis Flow
Alan Mishchenko
University of California, Berkeley
Outline
Motivation
The flow
    Technology-independent synthesis
    Technology mapping
    Buffering
    Sizing
Experimental results
Conclusion
Motivation

Synthesis tools are out there, but they are




slow
suboptimal
complicated
expensive
ABC





It is a public-domain tool developed by our research
group since 2005
It addresses both synthesis and verification of
synchronous hardware
It is based on years of experience in developing efficient
data-structures and algorithms
It is used in industry and academia
For more information, visit
https://bitbucket.org/alanmi/abc
The Flow

Technology-independent synthesis
Technology mapping
Buffering
Sizing

These steps are not disconnected; they overlap






Synthesis talks to mapping through structural choices
Mapping talks to buffering through fanout estimations
Buffering and sizing can be interleaved
Synthesis: Old and New
Old: “AIG rewriting”
    Delay/area costs
        AND2 levels/nodes
    Restructuring
        for all 4-input cuts, try all AIG subgraphs, choose the one
        with the min nodes under delay constraint
    Results
        Acceptable quality
        Acceptable runtime
    Problems
        “Over-re-structuring”
        Slow for large, deep logic
New: “AIG reshaping”
    Delay/area cost
        user-specified cost for n-input AND/XOR/MUX/MAJ
    Restructuring
        iterate “mapping” and “unmapping” several times
    Results
        Comparable quality
        3-10x faster
    Problems
        None so far
Mapping: Old and New
Old: “Traditional” cut-based mapping
    iterate over the subject graph
    re-compute priority cuts
    use structural or functional matching (ICCAD’97)
    For standard-cell mapping
        use a gain-based library
        map both (pos and neg) phase of each node into gates
        select best cuts (gates)
    Results
        Acceptable quality
        Tolerable runtime
New: “Improved” cut-based mapping
    pre-compute priority cuts
    iterate over the subject graph
    evaluate cuts using different costs
    use structural or functional matching
    For standard-cell mapping
        use a gain-based library
        map into NPN classes of functions from the library
        select best cuts (NPN classes)
        perform phase-assignment and determine gates during buffering
    Results
        Quality not known yet
        Runtime is expected 3-10x faster
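To make "mapping into NPN classes" concrete, here is a minimal sketch of how two functions can be tested for NPN equivalence (equality up to input permutation, input negation, and output negation). It is a brute-force illustration for small input counts, not the method used in ABC; the function name `npn_canonical` and the truth-table encoding (bit m of the integer is the function value on minterm m) are assumptions made for this example.

```python
from itertools import permutations

def npn_canonical(tt: int, n: int) -> int:
    """Return the smallest truth table reachable from `tt` (an n-input
    function) by permuting inputs, complementing inputs, and
    complementing the output. Brute force: only practical for small n."""
    size = 1 << n                 # number of minterms
    mask = (1 << size) - 1        # all-ones truth table
    best = mask
    for perm in permutations(range(n)):
        for neg in range(1 << n):         # bit i set: complement input i
            t = 0
            for m in range(size):
                x = m ^ neg               # apply input complementation
                src = 0
                for i in range(n):        # apply input permutation
                    if (x >> i) & 1:
                        src |= 1 << perm[i]
                if (tt >> src) & 1:
                    t |= 1 << m
            # also consider the output-complemented version
            best = min(best, t, t ^ mask)
    return best

# AND2, OR2, NAND2, and NOR2 fall into one class; XOR2 and XNOR2 into another.
and2, or2, nand2, nor2 = 0b1000, 0b1110, 0b0111, 0b0001
xor2, xnor2 = 0b0110, 0b1001
same_class = {npn_canonical(f, 2) for f in (and2, or2, nand2, nor2)}
```

Under this view a library with AND, OR, NAND, and NOR gates contributes a single 2-input class to the mapper, and the choice of the concrete gate and its phases can be postponed to a later step.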
Buffering: Old and New
Old
    Enumerating buffer tree topologies
    Buffering for near-continuous libraries
    Other incremental local fanout optimization methods
    Several ideas tried, none is a clear winner
New
    “Technology-independent” buffering after the gain-based library
    Buffer-tree construction given required times and loads of the fanouts
    Incremental buffering interleaved with incremental sizing
    Results are mixed
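As an illustration of building a buffer tree from the required times and loads of the fanouts, the following sketch considers only the simplest case: a single buffer that isolates the least critical fanouts from the driver. The linear delay model (delay = resistance x load), the one-level tree, and all names and parameters (`Sink`, `driver_required`, `best_split`, `r_drv`, `buf_cap`, `buf_res`, `buf_delay`) are assumptions made for this example and are not taken from ABC.

```python
from dataclasses import dataclass

@dataclass
class Sink:
    required: float  # required time at this fanout
    load: float      # input capacitance of this fanout

def driver_required(sinks, k, r_drv, buf_cap, buf_res, buf_delay):
    """Required time at the driver when the k most critical sinks are
    driven directly and the remaining sinks sit behind one buffer."""
    ordered = sorted(sinks, key=lambda s: s.required)
    direct, buffered = ordered[:k], ordered[k:]
    load = sum(s.load for s in direct)
    req = min((s.required for s in direct), default=float("inf"))
    if buffered:
        buf_load = sum(s.load for s in buffered)
        # required time at the buffer input
        buf_req = (min(s.required for s in buffered)
                   - buf_delay - buf_res * buf_load)
        load += buf_cap           # the driver now also sees the buffer
        req = min(req, buf_req)
    return req - r_drv * load

def best_split(sinks, **model):
    """Number of directly driven sinks that maximizes the required time
    at the driver (k == len(sinks) means no buffer is inserted)."""
    return max(range(len(sinks) + 1),
               key=lambda k: driver_required(sinks, k, **model))

# One critical sink and four heavily loaded non-critical sinks.
sinks = [Sink(10.0, 1.0)] + [Sink(20.0, 4.0) for _ in range(4)]
model = dict(r_drv=1.0, buf_cap=1.0, buf_res=0.5, buf_delay=1.0)
k = best_split(sinks, **model)
```

In this toy setting the best choice is to drive only the critical sink directly (k = 1), which raises the required time at the driver from -7.0 without a buffer to 8.0. A real buffering step would consider multi-level trees and a library delay model.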
Incremental Buffering Illustrated

Growing

Bypassing
Sizing: Old and New
Old
    Non-linear programming
    Linear programming
    Lagrangian multipliers
New
    Incremental sizing
        find critical region
        find best gates to resize
        perform the resizing
        incrementally update timing
    Iterate until no improvement
    Can be combined with incremental buffering
    Results
        Reasonable
        Surprisingly fast
        If an optimum solution is known, seems to converge to it
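The incremental sizing loop (find candidates, resize, update timing, iterate until no improvement) can be sketched on a toy example. The sketch below assumes a single chain of identical gates, so every gate is in the critical region, and a simple delay model in which a gate's delay is an intrinsic delay plus its load divided by its size. The size list, the delay model, and the function names are assumptions for illustration only; timing is recomputed from scratch rather than updated incrementally.

```python
SIZES = [1, 2, 4, 8]  # hypothetical discrete drive strengths

def path_delay(sizes, out_load, r_src=1.0, cin=1.0, intrinsic=1.0):
    """Delay of a gate chain: the source drives the first gate, each gate
    drives the input capacitance of the next, the last drives out_load."""
    delay = r_src * cin * sizes[0]
    for i, s in enumerate(sizes):
        load = cin * sizes[i + 1] if i + 1 < len(sizes) else out_load
        delay += intrinsic + load / s
    return delay

def incremental_sizing(n_gates, out_load):
    sizes = [SIZES[0]] * n_gates          # start from the min-size solution
    best = path_delay(sizes, out_load)
    while True:
        # try upsizing each gate by one step and keep the best move
        candidates = []
        for i, s in enumerate(sizes):
            k = SIZES.index(s)
            if k + 1 < len(SIZES):
                trial = sizes[:i] + [SIZES[k + 1]] + sizes[i + 1:]
                candidates.append((path_delay(trial, out_load), trial))
        if not candidates:
            break
        delay, trial = min(candidates)
        if delay >= best:                 # no improvement: stop iterating
            break
        sizes, best = trial, delay
    return sizes, best

sizes, delay = incremental_sizing(n_gates=3, out_load=64.0)
```

For three gates and an output load of 64, this loop reduces the delay from 70.0 (all gates at minimum size) to 18.0 with sizes [1, 2, 8]. A purely greedy loop like this one can stop at a local minimum, so it should be read as an outline of the iteration, not as a statement about the quality of the sizing in the flow.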
Commands of The Flow









read_lib
write_lib
print_lib
read_scl
write_scl
dump_genlib
print_gs
stime
buffer









unbuffer
minsize
maxsize
upsize
dnsize
print_buf
read_constr
print_constr
reset_constr
Experimental Setting





19 OpenCore designs were synthesized and mapped by
an industrial tool using public library vsclib013.lib from
http://www.vlsitechnology.org/
Delay, area, and runtime were collected and used as a
reference
Sizing was tested by applying min-sizing, followed by resizing
Buffering was tested by un-buffering and min-sizing,
followed by re-buffering and re-sizing
The flow was tested by restructuring the design, followed
by mapping, buffering, and sizing
Comments on The Table




Column “Gate” under the industrial tool shows the number of gates
produced by the industrial tool
Other columns “Gate” show the number of gates as a percentage of
the result produced by the industrial tool
Similarly, columns “Area” and “Delay” show area and delay as a
percentage of the industrial tool's result
Runtimes are in seconds on a Linux workstation
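The percentage columns and the geomean rows of the tables can be computed as in the sketch below. The function names are illustrative. Reading the geomean rows as the geometric mean of the per-design ratios is an assumption, but it is consistent with the reported numbers: the area percentages 88.5 and 86.6 given for the two larger designs yield the geomean 0.875 shown in the same table.

```python
import math

def percent_of_reference(value, reference):
    """Express a result as a percentage of the reference result."""
    return 100.0 * value / reference

def geomean_ratio(percentages):
    """Geometric mean of per-design percentages, returned as a ratio
    (1.0 means equal to the reference on average)."""
    logs = [math.log(p / 100.0) for p in percentages]
    return math.exp(sum(logs) / len(logs))

# Area percentages of the two larger designs after min-sizing.
area_geomean = round(geomean_ratio([88.5, 86.6]), 3)
```

The geometric mean is used instead of the arithmetic mean so that a design that is 2x worse and a design that is 2x better cancel out.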
Original Statistics
              Statistics        Industrial tool
Design           PI      PO    Gate     Area  Delay
ac97_ctrl      4482    2251    6010    35801    970
aes_core       1319     668   16801   109380   1575
des_area        496      72    3708    27167   2126
des_perf      17850    9038   64932   445444   1760
DMA            5070    2559   14152    78356   1690
DSP            7835    3954   24283   149707   3277
ethernet      21216   10698   26275   174272   1806
i2c             275     144     784     4112    782
RISC          15678    8111   36810   203719   2737
sasc            250     132     442     2620    628
spi             505     277    2178    12303   1924
ss_pcm          193      98     234     1361    582
systemcaes     1600     819    5401    34750   2353
systemcdes      512     258    2356    16730   1804
tv80            732     404    4694    27471   2575
usb_funct      3620    1858    8927    49886   1630
usb_phy         211     111     364     2018    560
vga_lcd       34247   21412   58053   331985   1709
wb_conmax      2670    2189   20482   121079   1690
leon3mp      217858  142925  332749  2137735  10358
leon3        370159  252691  618738  3959576  12656
Comparing Two Sizing Options
                Area, %           Delay, %          Runtime, s
Design      MinSize  MaxSize  MinSize  MaxSize  MinSize  MaxSize
ac97_ctrl     102.5     99.8     92.7     93.2     1.87     2.45
aes_core       97.5     94.4    100.1    100.6    11.30    13.04
des_area       91.1     91.2     97.6     97.8     4.05     3.55
des_perf       95.6     95.9    100.2     98.7    35.68    45.21
DMA           100.2    100.5     99.6     98.2     5.33     7.51
DSP            94.6     95.6    100.3     97.3     7.70    12.57
ethernet       97.8     97.5     96.6     97.5     6.56    10.67
i2c           102.7    101.9     99.7     99.9     0.28     0.34
RISC           96.1     95.6    100.7    100.5     6.32    13.52
sasc          101.8    104.1     93.8     93.9     0.22     0.23
spi            95.4     96.2    103.7    103.0     1.43     1.25
ss_pcm         99.0    102.6     99.3     98.3     0.18     0.14
systemcaes     93.3     93.3    102.5    101.2     1.51     2.29
systemcdes     94.5     94.4     99.2     98.8     2.02     2.13
tv80           96.1     95.3    101.4    100.9     3.09     3.98
usb_funct      98.6     98.0     96.3     97.9     2.31     3.53
usb_phy        99.3    102.7     97.5     95.0     0.13     0.19
vga_lcd        96.6     96.6    101.9     99.4    21.04    29.16
wb_conmax      96.7     95.6    101.4    101.3     5.60    11.43
Geomean       0.973    0.974    0.991    0.986    1.000    1.302

leon3mp        88.5     88.6     94.9     89.7   135.65   194.13
leon3          86.6     86.7     85.9     83.7   171.99   438.57
Geomean       0.875    0.876    0.903    0.866    1.000    1.910
Comparing Full Flow
             Statistics     Industrial tool         ABC
Design          PI     PO   Gate    Area  Delay  Gate,%  Area,%  Delay,%  Time, s
ac97_ctrl     4482   2251   6010   35801    970   139.9   125.5    102.7     2.55
aes_core      1319    668  16801  109380   1575   109.4   100.4    121.6    12.51
des_area       496     72   3708   27167   2126   115.5    91.1    114.8     3.52
des_perf     17850   9038  64932  445444   1760   124.3    93.3    120.7    45.00
DMA           5070   2559  14152   78356   1690   118.0   106.9    118.9     7.91
DSP           7835   3954  24283  149707   3277   130.2   111.4    110.8    20.69
ethernet     21216  10698  26275  174272   1806   157.5   118.1    118.3    21.98
i2c            275    144    784    4112    782   100.6   104.4    113.9     0.40
RISC         15678   8111  36810  203719   2737   141.7   121.4    110.4    21.48
sasc           250    132    442    2620    628   110.4   103.5    121.3     0.73
spi            505    277   2178   12303   1924   114.7   106.4    114.3     1.62
ss_pcm         193     98    234    1361    582   135.5   128.1    113.6     0.16
systemcaes    1600    819   5401   34750   2353   138.7   116.6    116.1     4.60
systemcdes     512    258   2356   16730   1804   106.4    92.2    109.6     2.49
tv80           732    404   4694   27471   2575   125.3   108.8    125.7     4.55
usb_funct     3620   1858   8927   49886   1630   123.9   108.5    119.0     4.37
usb_phy        211    111    364    2018    560   103.3    97.4    121.8     0.17
vga_lcd      34247  21412  58053  331985   1709   161.3   139.3    127.6    56.54
wb_conmax     2670   2189  20482  121079   1690   156.2   129.2    113.1    13.66
Geomean                                           1.257   1.099    1.164
Full Flow with Improvements
            ABC w/ delay opt                 ABC w/ delay opt + sizing opt
Design        Gates    Area   Delay  Time,s    Gates    Area   Delay  Time,s
ac97_ctrl     148.7   129.8   108.9    4.70    148.7   129.4   108.4    5.57
aes_core      111.5    99.4   120.3   13.89    111.5   101.4   117.8   19.33
des_area      113.6    96.2   104.9    5.03    113.6    97.4   104.3    6.54
des_perf      140.4   109.5   109.6   75.16    140.4   108.5   109.5   90.04
DMA           129.3   118.3   118.0   12.78    129.3   118.2   117.7   15.38
DSP           135.3   112.7   111.4   27.94    135.3   112.7   110.8   31.47
ethernet      159.3   120.1   101.8   35.17    159.3   120.0   100.9   37.96
i2c           101.8   104.7   111.0    0.58    101.8   104.9   111.0    0.75
RISC          138.2   122.1   103.6   33.62    138.2   121.5   104.9   37.28
sasc          121.3   109.3   122.1    0.43    121.3   109.2   121.8    0.53
spi           134.3   126.5   103.0    2.47    134.3   126.8   102.1    3.23
ss_pcm        150.4   146.4   118.6    0.25    150.4   155.9   115.8    0.46
systemcaes    142.8   123.6   112.1    6.35    142.8   123.1   111.4    7.89
systemcdes    113.1    93.0   111.6    2.85    113.1    95.8   110.1    4.06
tv80          129.7   117.3   112.0    6.81    129.7   117.5   111.9    8.27
usb_funct     128.9   111.7   118.0    5.49    128.9   111.8   116.1    6.09
usb_phy        97.8    93.8   110.7    0.18     97.8    95.0   107.5    0.25
vga_lcd       150.0   130.2   120.6   83.88    150.0   129.8   118.8   92.07
wb_conmax     154.6   125.8   115.9   19.80    154.6   125.6   115.0   21.91
Geomean       1.304   1.145   1.122   1.000    1.304   1.152   1.112   1.245
Two Larger Designs
Design       Gates     Area    Delay   T, syn   T, map  T, size
leon3mp     633638  3289861  4634.37   686.18   115.96   143.29
leon3      1048239  5613805  4734.49  1156.04   219.88   297.02

leon3mp     604586  3465547  4626.71    10.34    39.77   185.02
leon3      1040428  5385768  5006.44    18.35    74.97   274.25
Experimental Results

The following notation is used below:
ToolD = industrial tool run in delay mode
ToolA = industrial tool run in area mode
AbcD = ABC run in delay mode
AbcDF = ABC run in delay mode with novel fast synthesis feature
AbcA = ABC run in area mode
Gate counts include buffers and inverters.

(1.1) AbcD has -19% gates, -13% area, and +3% delay, compared to ToolD.
(1.2) AbcDF has -23% gates, -17% area, and +10% delay, compared to ToolD.
(1.3) AbcA has -16% gates, +2% area, and -2x delay, compared to ToolA.
The runtime of AbcDF (1.2) is about 2x faster than AbcD (1.1).
The runtime of AbcA (1.3) is about 5x faster than AbcD (1.1).

The same flow produces the following results on the public 130nm library:
(2.1) AbcD has +31% gates, +16% area, and -15% delay, compared to ToolD.
(2.3) AbcA has +18% gates, +11% area, and -65% delay, compared to ToolA.
Potential Issues

Not specifying input driving cells and output loads


Over-tuning for one particular library




Not sure heuristics will hold for submicron libraries
Not looking at power


This was addressed and experiments show it is fine
Not taking high and low Vt cells into account
Not mapping into multi-output cells
Not mapping sequential elements
Not considering multiple clock domains
Conclusion


A new synthesis flow is being developed and
implemented in ABC
An opportunity




to rethink some of the classical problems
improve on some of the known solutions
come up with a new public implementation
Results are encouraging



delay (in delay-oriented synthesis) is within 5-15%
area (in area-oriented synthesis) is within 1-3%
runtime is about 20-50x better
Abstract
This presentation focuses on adding new capabilities for
synthesizing standard-cell designs to the public-domain
synthesis/verification tool ABC. An optimization flow has been
developed, which includes gain-based technology mapping,
fanout optimization by buffering and gate duplication, and gate
sizing. Novel heuristic algorithms have been proposed for several
well-known optimization steps. For example, buffer tree
construction can be performed not as a separate step, but
concurrently with gate sizing by reshaping initial well-balanced
buffer trees. Each tree reshaping and each gate resizing transform
is evaluated for delay/area improvement using a common cost
function, and the most promising one is selected. The delay is
measured by a lookup-table-based delay model, which computes
the delay of a gate from its input slew and output capacitance.
Experiments show that the flow produces results that are within
10% of those of industrial tools while running 20x faster.
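A lookup-table-based delay model of the kind mentioned in the abstract can be sketched as a two-dimensional table indexed by input slew and output capacitance. The sketch below assumes bilinear interpolation between the four surrounding table entries, which is a common choice for such tables but is not stated in the presentation; the function name `lut_delay` and the table values are made up for illustration.

```python
from bisect import bisect_right

def lut_delay(slews, caps, table, slew, cap):
    """Gate delay from a 2-D lookup table.
    slews, caps: increasing axis values; table[i][j] is the delay at
    (slews[i], caps[j]). Values between grid points are interpolated
    bilinearly; values outside the grid are extrapolated linearly."""
    def bracket(axis, x):
        # index of the left grid point of the interval used for x
        i = bisect_right(axis, x) - 1
        return max(0, min(i, len(axis) - 2))

    i = bracket(slews, slew)
    j = bracket(caps, cap)
    ts = (slew - slews[i]) / (slews[i + 1] - slews[i])
    tc = (cap - caps[j]) / (caps[j + 1] - caps[j])
    d00, d01 = table[i][j], table[i][j + 1]
    d10, d11 = table[i + 1][j], table[i + 1][j + 1]
    return (d00 * (1 - ts) * (1 - tc) + d01 * (1 - ts) * tc
            + d10 * ts * (1 - tc) + d11 * ts * tc)

# Made-up 2x2 table: delay grows with both slew and load.
slews, caps = [0.0, 1.0], [0.0, 2.0]
table = [[1.0, 3.0],
         [2.0, 4.0]]
mid = lut_delay(slews, caps, table, slew=0.5, cap=1.0)
```

At the center of this made-up table the interpolated delay is the average of the four corner entries, 2.5. During sizing, changing a gate changes the capacitance seen by its drivers, so the same table lookup is repeated for the affected gates when timing is updated.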