Reinventing The Wheel: Developing a New Standard-Cell Synthesis Flow
Alan Mishchenko
University of California, Berkeley

Outline
- Motivation
- The flow
  - Technology-independent synthesis
  - Technology mapping
  - Buffering
  - Sizing
- Experimental results
- Conclusion

Motivation
- Synthesis tools are out there, but they are
  - slow
  - suboptimal
  - complicated
  - expensive

ABC
- It is a public-domain tool developed by our research group since 2005
- It addresses both synthesis and verification of synchronous hardware
- It is based on years of experience in developing efficient data structures and algorithms
- It is used in industry and academia
- For more information, visit https://bitbucket.org/alanmi/abc

The Flow
- Technology-independent synthesis
- Technology mapping
- Buffering
- Sizing
- These steps are not disconnected; they overlap
  - Synthesis talks to mapping through structural choices
  - Mapping talks to buffering through fanout estimations
  - Buffering and sizing can be interleaved

Synthesis: Old and New
(two-column slide; items are listed in the order they were extracted)
- “AIG rewriting”
- Delay/area costs
- “Over-re-structuring”
- Slow for large, deep logic
- iterate “mapping” and “unmapping” several times
- Results
- user-specified cost for n-input AND/XOR/MUX/MAJ
- Restructuring
- Acceptable quality
- Acceptable runtime
- Problems
- “AIG reshaping”
- Delay/area cost
- for all 4-input cuts, try all AIG subgraphs, choose the one with the min nodes under delay constraint
- Results
- AND2 levels/nodes
- Restructuring
- Comparable quality
- 3-10x faster
- Problems
- None so far

Mapping: Old and New
(two-column slide; items are listed in the order they were extracted)
- “Traditional” cut-based mapping
  - iterate over the subject graph
  - re-compute priority cuts
  - use structural or functional matching (ICCAD’97)
- “Improved” cut-based mapping
- For standard-cell mapping
- Results
  - Acceptable quality
  - Tolerable runtime
- For standard-cell mapping
  - use a gain-based library
  - map both (pos and neg) phases of each node into gates
  - select best cuts (gates)
- pre-compute priority cuts
- iterate over the subject graph
- evaluate cuts using different costs
- use structural or functional matching
- use a gain-based library
- map into NPN classes of functions from the library
- select best cuts (NPN classes)
- perform phase-assignment and determine gates during buffering
- Results
  - Quality not known yet
  - Runtime is expected 3-10x faster

Buffering: Old and New
- Enumerating buffer tree topologies
- Buffering for near-continuous libraries
- Other incremental local fanout optimization methods
- Several ideas tried, none is a clear winner
  - “Technology-independent” buffering after the gain-based library
  - Buffer-tree construction given required times and loads of the fanouts
  - Incremental buffering interleaved with incremental sizing
- Results are mixed

Incremental Buffering Illustrated
[Figure with panels labeled “Growing” and “Bypassing”]

Sizing: Old and New
- Non-linear programming
- Linear programming
- Lagrangian multipliers
- Incremental sizing
  - find critical region
  - find best gates to resize
  - perform the resizing
  - incrementally update timing
- Iterate until no improvement
- Can be combined with incremental buffering
- Results
  - Reasonable
  - Surprisingly fast
  - If an optimum solution is known, seems to converge to it

Commands of The Flow
  read_lib  write_lib  print_lib
  read_scl  write_scl  dump_genlib  print_gs
  stime
  buffer  unbuffer
  minsize  maxsize  upsize  dnsize
  print_buf
  read_constr  print_constr  reset_constr

Experimental Setting
- 19 OpenCore designs were synthesized and mapped by an industrial tool using the public library vsclib013.lib from http://www.vlsitechnology.org/
- Delay, area, and runtime were collected and used as a reference
- Sizing was tested by applying min-sizing, followed by resizing
- Buffering was tested by un-buffering and min-sizing, followed by re-buffering and re-sizing
- The flow was tested by restructuring the design, followed by mapping, buffering, and sizing

Comments on The Table
- Column “Gate” shows the number of gates produced by the industrial tool
- The other “Gate” columns show the number of gates as a percentage of the result produced by the industrial tool.
- Similarly, the “Area” and “Delay” columns show area and delay as a percentage of the industrial tool's result.
- Runtimes are in seconds on a Linux workstation

Original Statistics

              Statistics        Industrial tool
Design        PI      PO        Gate    Area     Delay
ac97_ctrl     4482    2251      6010    35801    970
aes_core      1319    668       16801   109380   1575
des_area      496     72        3708    27167    2126
des_perf      17850   9038      64932   445444   1760
DMA           5070    2559      14152   78356    1690
DSP           7835    3954      24283   149707   3277
ethernet      21216   10698     26275   174272   1806
i2c           275     144       784     4112     782
RISC          15678   8111      36810   203719   2737
sasc          250     132       442     2620     628
spi           505     277       2178    12303    1924
ss_pcm        193     98        234     1361     582
systemcaes    1600    819       5401    34750    2353
systemcdes    512     258       2356    16730    1804
tv80          732     404       4694    27471    2575
usb_funct     3620    1858      8927    49886    1630
usb_phy       211     111       364     2018     560
vga_lcd       34247   21412     58053   331985   1709
wb_conmax     2670    2189      20482   121079   1690
leon3mp       217858  142925    332749  2137735  10358
leon3         370159  252691    618738  3959576  12656

Comparing Two Sizing Options

              Area               Delay              Runtime
Design        MinSize  MaxSize   MinSize  MaxSize   MinSize  MaxSize
ac97_ctrl     102.5    99.8      92.7     93.2      1.87     2.45
aes_core      97.5     94.4      100.1    100.6     11.30    13.04
des_area      91.1     91.2      97.6     97.8      4.05     3.55
des_perf      95.6     95.9      100.2    98.7      35.68    45.21
DMA           100.2    100.5     99.6     98.2      5.33     7.51
DSP           94.6     95.6      100.3    97.3      7.70     12.57
ethernet      97.8     97.5      96.6     97.5      6.56     10.67
i2c           102.7    101.9     99.7     99.9      0.28     0.34
RISC          96.1     95.6      100.7    100.5     6.32     13.52
sasc          101.8    104.1     93.8     93.9      0.22     0.23
spi           95.4     96.2      103.7    103.0     1.43     1.25
ss_pcm        99.0     102.6     99.3     98.3      0.18     0.14
systemcaes    93.3     93.3      102.5    101.2     1.51     2.29
systemcdes    94.5     94.4      99.2     98.8      2.02     2.13
tv80          96.1     95.3      101.4    100.9     3.09     3.98
usb_funct     98.6     98.0      96.3     97.9      2.31     3.53
usb_phy       99.3     102.7     97.5     95.0      0.13     0.19
vga_lcd       96.6     96.6      101.9    99.4      21.04    29.16
wb_conmax     96.7     95.6      101.4    101.3     5.60     11.43
Geomean       0.973    0.974     0.991    0.986     1.000    1.302

leon3mp       88.5     88.6      94.9     89.7      135.65   194.13
leon3         86.6     86.7      85.9     83.7      171.99   438.57
Geomean       0.875    0.876     0.903    0.866     1.000    1.910

Comparing Full Flow

              Statistics       Industrial tool          ABC
Design        PI      PO       Gate    Area    Delay    Gate,%  Area,%  Delay,%  Time, s
ac97_ctrl     4482    2251     6010    35801   970      139.9   125.5   102.7    2.55
aes_core      1319    668      16801   109380  1575     109.4   100.4   121.6    12.51
des_area      496     72       3708    27167   2126     115.5   91.1    114.8    3.52
des_perf      17850   9038     64932   445444  1760     124.3   93.3    120.7    45.00
DMA           5070    2559     14152   78356   1690     118.0   106.9   118.9    7.91
DSP           7835    3954     24283   149707  3277     130.2   111.4   110.8    20.69
ethernet      21216   10698    26275   174272  1806     157.5   118.1   118.3    21.98
i2c           275     144      784     4112    782      100.6   104.4   113.9    0.40
RISC          15678   8111     36810   203719  2737     141.7   121.4   110.4    21.48
sasc          250     132      442     2620    628      110.4   103.5   121.3    0.73
spi           505     277      2178    12303   1924     114.7   106.4   114.3    1.62
ss_pcm        193     98       234     1361    582      135.5   128.1   113.6    0.16
systemcaes    1600    819      5401    34750   2353     138.7   116.6   116.1    4.60
systemcdes    512     258      2356    16730   1804     106.4   92.2    109.6    2.49
tv80          732     404      4694    27471   2575     125.3   108.8   125.7    4.55
usb_funct     3620    1858     8927    49886   1630     123.9   108.5   119.0    4.37
usb_phy       211     111      364     2018    560      103.3   97.4    121.8    0.17
vga_lcd       34247   21412    58053   331985  1709     161.3   139.3   127.6    56.54
wb_conmax     2670    2189     20482   121079  1690     156.2   129.2   113.1    13.66
Geomean                                                 1.257   1.099   1.164

Full Flow with Improvements

              ABC w/ delay opt                 ABC w/ delay opt + sizing opt
Design        Gates   Area    Delay   Time,s   Gates   Area    Delay   Time,s
ac97_ctrl     148.7   129.8   108.9   4.70     148.7   129.4   108.4   5.57
aes_core      111.5   99.4    120.3   13.89    111.5   101.4   117.8   19.33
des_area      113.6   96.2    104.9   5.03     113.6   97.4    104.3   6.54
des_perf      140.4   109.5   109.6   75.16    140.4   108.5   109.5   90.04
DMA           129.3   118.3   118.0   12.78    129.3   118.2   117.7   15.38
DSP           135.3   112.7   111.4   27.94    135.3   112.7   110.8   31.47
ethernet      159.3   120.1   101.8   35.17    159.3   120.0   100.9   37.96
i2c           101.8   104.7   111.0   0.58     101.8   104.9   111.0   0.75
RISC          138.2   122.1   103.6   33.62    138.2   121.5   104.9   37.28
sasc          121.3   109.3   122.1   0.43     121.3   109.2   121.8   0.53
spi           134.3   126.5   103.0   2.47     134.3   126.8   102.1   3.23
ss_pcm        150.4   146.4   118.6   0.25     150.4   155.9   115.8   0.46
systemcaes    142.8   123.6   112.1   6.35     142.8   123.1   111.4   7.89
systemcdes    113.1   93.0    111.6   2.85     113.1   95.8    110.1   4.06
tv80          129.7   117.3   112.0   6.81     129.7   117.5   111.9   8.27
usb_funct     128.9   111.7   118.0   5.49     128.9   111.8   116.1   6.09
usb_phy       97.8    93.8    110.7   0.18     97.8    95.0    107.5   0.25
vga_lcd       150.0   130.2   120.6   83.88    150.0   129.8   118.8   92.07
wb_conmax     154.6   125.8   115.9   19.80    154.6   125.6   115.0   21.91
Geomean       1.304   1.145   1.122   1.000    1.304   1.152   1.112   1.245

Two Larger Designs
(the slide shows two groups of rows; the labels distinguishing them did not survive extraction)

Design        Gates     Area      Delay     T, syn    T, map   T, size
leon3mp       633638    3289861   4634.37   686.18    115.96   143.29
leon3         1048239   5613805   4734.49   1156.04   219.88   297.02

leon3mp       604586    3465547   4626.71   10.34     39.77    185.02
leon3         1040428   5385768   5006.44   18.35     74.97    274.25

Experimental Results
The following notation is used below:
- ToolD = industrial tool run in delay mode
- ToolA = industrial tool run in area mode
- AbcD = ABC run in delay mode
- AbcDF = ABC run in delay mode with novel fast synthesis feature
- AbcA = ABC run in area mode
Gate counts include buffers and inverters.
- (1.1) AbcD has -19% gates, -13% area, and +3% delay, compared to ToolD.
- (1.2) AbcDF has -23% gates, -17% area, and +10% delay, compared to ToolD.
- (1.3) AbcA has -16% gates, +2% area, and -2x delay, compared to ToolA.
- The runtime of AbcDF (1.2) is about 2x faster than AbcD (1.1).
- The runtime of AbcA (1.3) is about 5x faster than AbcD (1.1).
The same flow produces the following results on the public 130nm library:
- (2.1) AbcD has +31% gates, +16% area, and -15% delay, compared to ToolD.
- (2.3) AbcA has +18% gates, +11% area, and -65% delay, compared to ToolA.
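
The Geomean rows in the tables above summarize per-design ratios with a geometric mean. The following is a minimal sketch of that arithmetic, using the leon3mp and leon3 figures from the sizing comparison. It assumes (this is not stated on the slides) that percentage columns are divided by 100 before averaging and that the runtime Geomean is the geometric mean of per-design MaxSize/MinSize runtime ratios; the function and variable names are illustrative only.

```python
import math


def geomean(values):
    """Geometric mean of positive numbers, computed through logarithms."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Area with MinSize for leon3mp and leon3, as a percentage of the reference.
area_minsize_pct = [88.5, 86.6]
area_geomean = geomean([p / 100.0 for p in area_minsize_pct])

# Sizing runtimes in seconds for leon3mp and leon3.
minsize_runtime = [135.65, 171.99]
maxsize_runtime = [194.13, 438.57]

# Assumption: the runtime Geomean is normalized to the MinSize column.
runtime_geomean = geomean(
    [mx / mn for mn, mx in zip(minsize_runtime, maxsize_runtime)]
)

print(f"{area_geomean:.3f} {runtime_geomean:.3f}")
```

Under these assumptions the two values round to 0.875 and 1.910, matching the Geomean entries reported for the two larger designs.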

Potential Issues
- Not specifying input driving cells and output loads
- Over-tuning for one particular library
- Not sure heuristics will hold for submicron libraries
- Not looking at power
  - This was addressed and experiments show it is fine
- Not taking high and low Vt cells into account
- Not mapping into multi-output cells
- Not mapping sequential elements
- Not considering multiple clock domains

Conclusion
- A new synthesis flow is being developed and implemented in ABC
- An opportunity to
  - rethink some of the classical problems
  - improve on some of the known solutions
  - come up with a new public implementation
- Results are encouraging
  - delay (in delay-oriented synthesis) is within 5-15%
  - area (in area-oriented synthesis) is within 1-3%
  - runtime is about 20-50x better

Abstract
This presentation focuses on adding new capabilities to synthesize standard-cell designs in the public-domain synthesis/verification tool ABC. An optimization flow has been developed, which includes gain-based technology mapping, fanout optimization by buffering and gate duplication, and gate sizing. Novel heuristic algorithms have been proposed for several well-known optimization steps. For example, buffer tree construction can be performed not as a separate step, but concurrently with gate sizing by reshaping initial well-balanced buffer trees. Each tree-reshaping and each gate-resizing transform is evaluated for delay/area improvement using a common cost function, and the most promising one is selected. The delay is measured by a lookup-table-based delay model, which computes the delay of a gate from its input slew and output capacitance. Experiments show that the flow produces results that are within 10% of those of industrial tools while running 20x faster.
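
The lookup-table delay model mentioned in the abstract can be illustrated with a small sketch. The code below is not ABC's implementation; it assumes a two-dimensional table indexed by input slew and output capacitance and uses bilinear interpolation between the four surrounding entries, which is a common way to evaluate such tables. The axis values, the table entries, their units, and all function names are hypothetical.

```python
from bisect import bisect_right


def _bracket(axis, x):
    """Index i such that axis[i] and axis[i + 1] are the segment used for x.

    Values outside the axis range are clamped to the first or last segment,
    so the interpolation formula then extrapolates linearly.
    """
    i = bisect_right(axis, x) - 1
    return max(0, min(i, len(axis) - 2))


def lut_delay(slews, loads, table, slew, load):
    """Bilinear interpolation of table[slew index][load index]."""
    i = _bracket(slews, slew)
    j = _bracket(loads, load)
    ts = (slew - slews[i]) / (slews[i + 1] - slews[i])
    tl = (load - loads[j]) / (loads[j + 1] - loads[j])
    d00, d01 = table[i][j], table[i][j + 1]
    d10, d11 = table[i + 1][j], table[i + 1][j + 1]
    return (d00 * (1 - ts) * (1 - tl) + d01 * (1 - ts) * tl
            + d10 * ts * (1 - tl) + d11 * ts * tl)


# Hypothetical characterization data for a single gate (arbitrary units).
SLEWS = [0.1, 0.5, 1.0]    # input slew axis
LOADS = [1.0, 4.0, 16.0]   # output capacitance axis
TABLE = [
    [10.0, 16.0, 40.0],
    [14.0, 20.0, 44.0],
    [19.0, 25.0, 49.0],
]

# Delay for a slew/load point that falls between table entries.
delay = lut_delay(SLEWS, LOADS, TABLE, 0.3, 2.5)
```

A sizing step could call a function like `lut_delay` with the tables of each candidate gate size and compare the resulting delays, since changing the size changes both the table and the load seen by the driving gate.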