Partition-Driven Placement with Simultaneous Level Processing and Global Net Views

K. Zhong and S. Dutt
Department of Electrical Engineering and Computer Science, University of Illinois at Chicago
Zhong & Dutt, UIC, Nov. 2000

Overview
• Problem
• Previous Work
• New Partition-Driven Placement Algorithm (SPADE)
• Experimental Evaluation
• Conclusions and Future Work

Problem
• Placement for Deep Sub-Micron (DSM)
  – Very large input size (up to tens of millions)
  – More optimization objectives (area, delay, power)
  – Various heterogeneous constraints (congestion, crosstalk, heat distribution, etc.)

Major Approaches to Placement
• Three mainstream placement approaches
  – Partition-Driven Placement (PDP) (e.g. [Breuer, DAC ‘77], [Huang et al, ISPD ‘97])
  – Simulated Annealing (SA) (e.g. [Sun et al, TCAD ‘95])
  – Mathematical programming (e.g. [Eisenmann et al, DAC ‘98])
• Global and detailed placement
  – NRG [Wang et al, ICCAD ‘97], Snap-On [Yang et al, ISPD ‘00], etc.

Advantages of PDP
• Time-efficient
  – divide-and-conquer approach
• Balanced decision with a global view
  – top-down placement flow
• Can tackle almost any objective function accurately (up to interconnect length model)
  – delay, WL, power (in iterative improvement, update cost per move)
• Flexibility in tackling multiple constraints
  – iterative improvement: check per move

Previous PDP Work
• Sequential level partitioning [Breuer, DAC ‘77]
  – regions at the same level are cut sequentially
  – may result in sub-optimal wire-length or cutsize
• Terminal propagation [Dunlop et al, TCAD ‘85]
  – addresses external connections during partitioning
• Quadrisection [Suaris et al, TCAS ‘88; Huang et al, ISPD ‘97]
  – 4-way partitioning better controls wire length in both directions, but run time goes up

New PDP Techniques: Rectify Drawbacks of Prior PDP
• Placer SPADE (Simultaneous level PArtitioning with Distributed nEt views)
• Simultaneous Level Partitioning (SLP): rectifies prior drawback of sequentially-ordered optimization
• Global net views: rectifies prior drawback of localized subcircuit views and the cost + inaccuracy of terminal propagation
• Wire-length based gain computation: rectifies prior drawback of mincut-based gain (not strictly WL)
• Modified CLIP-FM partitioner [Dutt et al, ICCAD ‘96]
• Maximum row length control
• Post-processing (cell swaps)

Simultaneous Level Partitioning
• Simultaneous partitioning of all regions within the same level
• Cell moves are naturally interleaved across all regions based on gains (as shown in the figure)
• Achieves simultaneous optimization across multiple regions
[Figure: cell moves interleaved across regions]

SLP vs. Sequential Level Partitioning
• Sequential level partitioning may not be able to escape local optima
[Figure: three panels showing cells u and v, pads, and nets labeled with weights.
  Initial partitioning: original cost = 8.
  Sequential: sub-optimal move sequence if the upper region is processed first; new cost = 3.
  SLP: only the cell in the lower region is moved; new cost = 1.]

Global Net View vs. Terminal Propagation
• Terminal propagation may be inaccurate for wire length reduction
• With a global net view we can do better (e.g., in the figure, moving left is better as it can shrink the bounding box (BB), while the right move expands the BB)
[Figure: possible moves; the dummy position does not help]

De-coupled Regions: a Caveat
• Suitable for row-based designs
• Property: For a horizontal
  cut, the WL change due to cell moves in regions on one side of the previous-level cutline does not affect the WL of the subcircuits in regions on the other side
• Sequential partitioning of regions separated by previous-level horizontal cutlines is therefore justified
• Reduced run time at NO cost in wire length
[Figure: cutlines c, d, c’. The two segments can be shrunk separately; regions spanning cutline c are de-coupled from those spanning c’ by the previous cutline d]

Wire-length Based Gain
• Pin coordinates (x or y) of each net along the direction orthogonal to the current cutline are stored in a binary search tree
• SPADE-FM: a cell move can have non-zero gain only when it changes the global bounding boxes of connected nets

Illustration of Gain Computation
[Figure: cells u, v, w, x; g(v) = 5L; labeled lengths 3L and 8L; distances d, d', d'']
SPADE-FM: gain(u) = gain(w) = 0, since neither move can change the bounding box by itself; only gain(v) = 5L is positive, and all other cells have zero gain as “internal” nodes.
SPADE-PROP: gain(u) = (d' - d)·p(u)·p(w)/p(u) + (d'' - d')·p(x), where p(y) is the probability of y. The gain has two parts: the single-step PROP gain of moving u and w, and the multi-step gain of moving cells not on the boundary of the BB (e.g., x) from the same side as u.

Global Gain Update
• Every move may entail out-of-region updates of cell gains
• Total time taken for such updates per pass is bounded by O(p log p), where p is the number of pins
[Figure: a cell move and the cells whose gains need updating]

Maximum Row Length Control
• A decisive factor in die-area utilization
• Gradually increase row-balance deviations with partitioning tree levels up to the maximum allowable
  – cannot use the prescribed maximum row-length deviation from the start, as it can freeze moves for future cuts (see figure)
[Figure: initial deviation set to the maximum allowed value; once the maximum deviation is reached, further partitioning is badly hampered]
• Row deviation is assigned inversely proportional to the logarithm of the number of rows of the target regions
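
The row-deviation rule on the Maximum Row Length Control slide can be sketched as follows. The slides only say the deviation is inversely proportional to the logarithm of the number of rows; the exact form below, `max_dev / (1 + log2(num_rows))`, the function name `row_deviation`, and the 5% maximum are assumptions made for illustration. The `+ 1` is added so that a single-row region, where log2(1) = 0, receives the full prescribed deviation.

```python
import math


def row_deviation(num_rows: int, max_dev: float) -> float:
    """Allowed row-length deviation for a region spanning `num_rows` rows.

    Assumed form (not taken from the slides): max_dev / (1 + log2(num_rows)).
    Regions near the top of the partitioning tree span many rows and get a
    small budget; single-row regions get the full max_dev.
    """
    if num_rows < 1:
        raise ValueError("a region must span at least one row")
    return max_dev / (1.0 + math.log2(num_rows))


# Deviation budget per level for a 16-row design (hypothetical 5% maximum).
schedule = {rows: row_deviation(rows, 0.05) for rows in (16, 8, 4, 2, 1)}
for rows, dev in schedule.items():
    print(f"{rows:2d} rows: allowed deviation {dev:.4f}")
```

Under this assumed schedule the budget grows as regions shrink, so early cuts cannot consume the whole allowance and freeze moves for later cuts.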

Local Region Balance Control
• Relaxed local balance but strict row-balance control
• Local Deviation (from the closest possible balance to 50-50) = Row Deviation overconstrains the problem
• Allow Local Deviation to be a multiple (> 1) of Row Deviation, but maintain the overall row deviation

Circuit Partitioning Engine
• CLIP-FM variation (SHRINK-FM) or SHRINK-PROP algorithm at the core
  – shrinking initial gains helps cluster removal
  – iterative mode: the shrink factor is gradually enlarged to get independent gains after most clusters are removed through earlier passes
• Two-level gain tree structure
  – local binary search tree for each region
  – top-gain cells of local trees sorted into a global tree
• Efficient global cell selection strategy
  – row-balance violation: search the opposite global tree
  – local violation: switch to the opposite local tree
  – tie-breaking: follow the latest move

Post-processing
• Intra-row horizontal neighbor swap
• Intra-row clustering based on internal/external nets ratio
• Inter-row vertical swap
  – some cells have to be shifted due to cell overlap
• Results in about 1-2% improvement
[Figures: horizontal neighbor swap; vertical cell swap]

Experimental Evaluation
• MCNC standard cell benchmarks: up to 100k cells
• Compared with prior methods
  – TimberWolf 7.0 [Sun et al, TCAD ‘95]
  – FD-98 [Eisenmann et al, DAC ‘98]
  – QUAD [Huang et al, ISPD ‘97]
  – Snap-On [Yang et al, ISPD ‘00]
• Same number of rows as TimberWolf 7.0
• Part of the IBM-PLACE circuits also tested (ibm11 to ibm15) and compared to iTools [internetCAD]
• Experiments conducted on 550 MHz Pentium-III Linux workstations

Comparison with Previous Methods

SLP vs. Sequential
                      SPADE-FM   Sequential   WL imprv.
Total WL (6 ckts)     52.86      65.57        19.38%
Total time (6 ckts)   7052       1719

Circuit             SPADE-FM      TW 7.0    FD-98     QUAD      Snap-On   SPADE-PROP
primary1            0.74          0.83      0.87      0.9       0.95      0.74
struct              0.291         -         0.338     0.378     -         0.285
primary2            3.13          3.53      3.72      3.68      3.66      3.07
biomed              1.43          1.61      1.78      -         1.84      1.38
industry2           11.9          13.3      14.6      -         14.48     12.07
industry3           35.37         41.53     45.1      -         44.7      35.09
avqsmall            5.59          5.08      4.91      6.29      5.15      5.31
avqlarge            6.16          5.65      5.38      6.59      5.21      5.61
golem3              19.84         22.6      -         -         -         19.64
Total (8/8 ckts)    84.16/64.61   94.13/-   -/76.70   -         -         82.91/63.56
Total (5/7 ckts)    15.94/64.32   -         -         17.84/-   -/75.99   15.02/63.27
SPADE-FM imprv.     -             10.60%    15.80%    10.70%    15.30%    -
SPADE-PROP imprv.   -             11.92%    17.13%    15.81%    16.74%    -
run time (8 ckts)   15001         19034     7173      -         -         18108
run time (6 ckts)   14710         -         -         -         57920     18071
scaled time ratio   1             0.69      0.26      -         1.16      1.21

In the Total (8/8 ckts) row, the first value sums the 8 circuits without struct and the second the 8 circuits without golem3. In the Total (5/7 ckts) row, the first value sums the 5 circuits reported for QUAD and the second the 7 circuits reported for Snap-On.

Other Experimental Results
• Trade-off between run time and solution quality of SPADE-FM with 8 and 16 runs for the MCNC suite

Trade-off     SPADE-FM/8   SPADE-FM/16   Best WL   16 vs 8
Total WL      89.65        84.45         82.87     5.81%
Total time    29117        37738         -         1.3x

• Results for IBM-PLACE Benchmarks

Circuit                SPADE-FM   SPADE-PROP   iTools
ibm11                  37.27      36.28        39.76
ibm12                  66.52      64.92        69.56
ibm13                  42.94      42.4         49.11
ibm14                  121.38     121.17       118.8
ibm15                  134.68     130.45       130.6
Total                  402.79     395.22       407.83
WL imprv. vs. iTools   1.24%      3.10%        -

Conclusions and Future Work
• Introduced novel concepts of:
  – SLP
  – global net view
  – bounding-box based gain computation
• PDP alone can be competitive (in fact better)
  – up to 15.8% better in aggregate result than the state of the art
  – among large circuits:
    • best-known result for the largest MCNC circuit, golem3
    • best-known results for ibm11-ibm13
• Run time is reasonable, but can be reduced
  – early stop per pass
  – multilevel clustering
• On-going work
  – timing-driven PDP
  – multi-constraint PDP (congestion, thermal distribution, multiple objectives)
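
The bounding-box based gain named in the conclusions (the SPADE-FM rule that a move has non-zero gain only when it changes a net's bounding box) can be illustrated with a minimal one-dimensional sketch. Everything here is hypothetical: the function names `span`, `net_gain`, and `cell_gain`, the cell and pin names, and the coordinates are made up and do not reproduce the figure on the slides. The sketch also recomputes each net's extent with a linear scan, whereas the slides store pin coordinates in a binary search tree.

```python
from typing import Dict, List

Net = Dict[str, float]  # pin owner -> coordinate orthogonal to the cutline


def span(coords: List[float]) -> float:
    """Length of the 1-D bounding box covering the given coordinates."""
    return max(coords) - min(coords)


def net_gain(net: Net, cell: str, new_coord: float) -> float:
    """Reduction in one net's bounding-box span if `cell` moves to `new_coord`."""
    before = span(list(net.values()))
    moved = dict(net)
    moved[cell] = new_coord
    return before - span(list(moved.values()))


def cell_gain(nets: List[Net], cell: str, new_coord: float) -> float:
    """Total wire-length gain of a move, summed over the nets touching `cell`."""
    return sum(net_gain(net, cell, new_coord) for net in nets if cell in net)


# Hypothetical example: u and w share the left edge of net_a's bounding box,
# so moving either one alone leaves the box unchanged (zero gain). v sits
# alone on the edge of net_b, so moving it shrinks that box.
net_a = {"u": 0.0, "w": 0.0, "pad": 8.0}
net_b = {"v": 0.0, "t": 5.0}
nets = [net_a, net_b]

gain_u = cell_gain(nets, "u", 5.0)
gain_w = cell_gain(nets, "w", 5.0)
gain_v = cell_gain(nets, "v", 5.0)
print(gain_u, gain_w, gain_v)
```

A cutsize-based gain would score these moves by whether a net stops crossing the cutline; the span difference instead measures the wire length actually saved, which is the distinction the slides draw between mincut-based and wire-length based gain.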