Partition-Driven Placement with
Simultaneous Level Processing and
Global Net Views
K. Zhong and S. Dutt
Department of Electrical Engineering and
Computer Science,
University of Illinois at Chicago
Zhong & Dutt, UIC, Nov. 2000
Overview
• Problem
• Previous Work
• New Partition-Driven Placement Algorithm
(SPADE)
• Experimental Evaluation
• Conclusions and Future Work
Problem
• Placement for Deep Sub-Micron (DSM)
– Very large input size (up to tens of millions)
– More optimization objectives (area, delay, power)
– Various heterogeneous constraints (congestion,
crosstalk, heat distribution, etc.)
Major Approaches to Placement
• Three mainstream placement approaches
• Partition-Driven Placement (PDP) (e.g. [Breuer,
DAC ‘77], [Huang et al, ISPD ‘97])
• Simulated Annealing (SA) (e.g. [Sun et al, TCAD
‘95])
• Mathematical programming (e.g. [Eisenmann et
al, DAC ‘98])
• Global and detailed placement
• NRG [Wang et al, ICCAD ‘97], Snap-On [Yang et al,
ISPD ‘00], etc.
Advantages of PDP
• Time-efficient
• divide-and-conquer approach
• Balanced decision with a global view
• top-down placement flow
• Can tackle almost any objective function
accurately (up to interconnect length model)
• delay, WL, power (in iterative improvement,
update cost per move)
• Flexibility in tackling multiple constraints
• iterative improvement: constraints are checked per move
Previous PDP Work
• Sequential level partitioning [Breuer, DAC ‘77]
– regions at the same level are cut sequentially
– may result in sub-optimal wire-length or cutsize
• Terminal propagation [Dunlop et al, TCAD ‘85]
– addresses external connections during partitioning
• Quadrisection [Suaris et al, TCAS ‘88; Huang et al,
ISPD ‘97]
– 4-way partitioning better controls wire length in
both directions, but run time goes up
New PDP Techniques: Rectifying
Drawbacks of Prior PDP
• Placer SPADE (Simultaneous level PArtitioning with
Distributed nEt views)
• Simultaneous Level Partitioning (SLP): rectifies the prior
drawback of sequentially ordered optimization
• Global net views: rectify the prior drawbacks of localized
subcircuit views and the cost and inaccuracy of terminal propagation
• Wire-length based gain computation: rectifies the prior
drawback of mincut-based gain (not strictly WL)
• Modified CLIP-FM partitioner [Dutt et al, ICCAD ‘96]
• Maximum row length control
• Post-processing (cell swaps)
Simultaneous Level Partitioning
• Simultaneous partitioning
of all regions within the
same level
• Cell moves are naturally
interleaved across all
regions based on gains
(as shown in the figure)
• Achieves simultaneous
optimization across
multiple regions
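The difference in move ordering can be sketched as follows. This is only an illustration, not SPADE itself: the region names, cell names and gain values are made up, and gains are held fixed, whereas a real partitioner updates gains after every move.

```python
import heapq

def sequential_order(regions):
    """Process regions one at a time; within a region, take moves by descending gain."""
    order = []
    for name, moves in regions.items():
        for cell, gain in sorted(moves.items(), key=lambda kv: -kv[1]):
            order.append((name, cell, gain))
    return order

def simultaneous_order(regions):
    """At every step, take the best remaining move over all regions of the level."""
    heap = [(-gain, name, cell)
            for name, moves in regions.items()
            for cell, gain in moves.items()]
    heapq.heapify(heap)
    order = []
    while heap:
        neg_gain, name, cell = heapq.heappop(heap)
        order.append((name, cell, -neg_gain))
    return order

# Hypothetical level with two regions and fixed (illustrative) gains.
regions = {"upper": {"a": 3, "b": 1}, "lower": {"c": 4, "d": 2}}
print(sequential_order(regions))    # all of "upper" first, then "lower"
print(simultaneous_order(regions))  # moves interleaved across regions by gain
```

In the simultaneous order the moves of the two regions alternate, which is the interleaving the slide describes.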
[Figure: cell moves interleaved across the regions of one level]
SLP vs. Sequential Level Partitioning
• Sequential level partitioning may not be able
to escape local optima
[Figure, three panels: cells u and v, pads, nets labeled with weights]
(a) Initial partitioning: nets labeled with weights. Original cost = 8.
(b) Sequential: sub-optimal move sequence if the upper region is processed first. New cost = 3.
(c) SLP: only the cell in the lower region is moved. New cost = 1.
Global Net View vs. Terminal
Propagation
• Terminal propagation may
be inaccurate for wire
length reduction
• With a global net view we
can do better (e.g., in the
figure, moving left is better
because it shrinks the
bounding box (BB), while
moving right expands it)
[Figure: possible moves with a dummy terminal; the dummy position does not help]
De-coupled Regions: a Caveat
• Suitable for row-based designs
• Property: for a horizontal cut, the WL
change due to cell moves in
regions on one side of the
previous-level cutline does not
affect the WL of the subcircuits in
regions on the other side
• Sequential partitioning of
regions separated by previous-level
horizontal cutlines is therefore justified
• Reduced run time at no cost in
wire length
[Figure: cutlines c and c’ separated by previous cutline d. The two segments can be shrunk separately; regions spanning cutline c are de-coupled from those spanning c’ by the previous cutline d.]
Wire-length Based Gain
• Pin coordinates (x or y) of each net along the
direction orthogonal to current cutline are
stored in a binary search tree
• SPADE-FM: a cell move can have non-zero
gain only when it changes the global bounding
boxes of connected nets
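A minimal sketch of bounding-box based gain along one axis follows. It is an illustration under assumptions, not the authors' implementation: a sorted Python list stands in for the binary search tree, and the class name `NetSpan` and its methods are hypothetical.

```python
import bisect

class NetSpan:
    """Pin coordinates of one net along the axis orthogonal to the cutline."""

    def __init__(self, coords):
        self.coords = sorted(coords)  # sorted list as a stand-in for the BST

    def span(self):
        return self.coords[-1] - self.coords[0] if self.coords else 0

    def gain_of_move(self, old, new):
        """Span reduction if one pin moves from `old` to `new` (positive = shorter)."""
        i = bisect.bisect_left(self.coords, old)
        if i == len(self.coords) or self.coords[i] != old:
            raise ValueError("no pin at that coordinate")
        n = len(self.coords)
        if n == 1:
            return 0
        # Extremes of the remaining pins, without copying the list.
        lo_rest = self.coords[1] if i == 0 else self.coords[0]
        hi_rest = self.coords[-2] if i == n - 1 else self.coords[-1]
        new_span = max(hi_rest, new) - min(lo_rest, new)
        return self.span() - new_span

    def apply_move(self, old, new):
        i = bisect.bisect_left(self.coords, old)
        if i == len(self.coords) or self.coords[i] != old:
            raise ValueError("no pin at that coordinate")
        del self.coords[i]
        bisect.insort(self.coords, new)

net = NetSpan([0, 3, 8])          # illustrative coordinates
print(net.gain_of_move(3, 5))     # interior pin: bounding box unchanged
print(net.gain_of_move(8, 3))     # boundary pin: bounding box shrinks
```

Only the pin on the boundary of the span yields a non-zero gain; moving the interior pin leaves the span, and hence the gain, at zero.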
Illustration of Gain Computation
[Figure: a net with cells u, v, x and w; labels d, d', d'', 3L and 8L; g(v) = 5L]
SPADE-FM: gain(u) = gain(w) = 0, since neither move can
change the bounding box by itself; only gain(v) = 5L is positive, and
all other cells have zero gain as “internal” nodes.
SPADE-PROP: gain(u) = (d'-d)•p(u)•p(w)/p(u) + (d'' - d')•p(x), where
p(y) is the probability of y. The gain has two parts: the single-step
PROP gain of moving u and w, and the multi-step gain of moving cells
not on the boundary of the BB (e.g., x) from the same side as u.
Global Gain Update
• Every move may
entail out-of-region
updates of cell gains
• Total time taken for
such updates per
pass is bounded by
O(p*log(p)), where p
is the number of pins
[Figure: a cell move and the cells whose gains need updating]
Maximum Row Length Control
• A decisive factor in die-area utilization
• Gradually increase row-balance deviations with
partitioning tree levels, up to the max allowable
– cannot use the prescribed max. row-length deviation from the start, as it
can freeze moves for future cuts (see figure below)
[Figure: when the initial deviation is set to the max allowed value, the available deviation is used up; once the max deviation is reached, further partitioning is badly hampered]
• Row deviation is assigned inversely proportional to the logarithm
of the number of rows of the target regions
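One possible schedule of this kind is sketched below. The exact formula is not given on the slide, so the `1 +` offset (which avoids dividing by log(1) = 0 for single-row regions), the base-2 logarithm and the function name `row_deviation` are all assumptions made for illustration.

```python
import math

def row_deviation(max_dev, rows_in_region):
    """Allowed row-length deviation for a target region spanning `rows_in_region` rows.

    Assumed form: max_dev / (1 + log2(rows)). It grows as regions get
    shorter and reaches max_dev for single-row regions.
    """
    if rows_in_region < 1:
        raise ValueError("a region must span at least one row")
    return max_dev / (1.0 + math.log2(rows_in_region))

# Hypothetical 16-row design whose regions halve in height at each horizontal cut.
for rows in (16, 8, 4, 2, 1):
    print(rows, round(row_deviation(0.10, rows), 4))
```

Early levels get a tight deviation and later levels a looser one, so slack remains available for future cuts.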
Local Region Balance Control
• Relaxed local balance but strict row-balance control
• Local Deviation (from closest possible balance to 50-50) = Row Deviation overconstrains the problem
• Allow Local Deviation = α × (Row Deviation), α > 1, but
maintain overall row deviation
Circuit Partitioning Engine
• CLIP-FM variation (SHRINK-FM) or SHRINK-PROP algorithm at the core
– shrinking initial gain helps cluster removal
– iterative mode: shrink factor gradually enlarged to
get independent gains after most clusters are
removed through earlier passes
• Two-level gain tree structure
– local binary search tree for each region
– top-gain cells of local trees sorted into global tree
• Efficient global cell selection strategy
– row-balance violation: search opposite global tree
– local violation: switch to opposite local tree
– tie-breaking: following latest move
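The two-level selection idea can be sketched as follows. This is a simplified stand-in rather than the SPADE data structure: heaps with lazy deletion replace the binary search trees, the global level is a scan over the local tops rather than a global tree, balance checks are reduced to a caller-supplied `region_ok` predicate, and ties are broken by cell name. All names are hypothetical.

```python
import heapq

class TwoLevelGains:
    """Per-region gain heaps plus a global pick over the local best cells."""

    def __init__(self):
        self.local = {}  # region -> heap of (-gain, cell)
        self.gain = {}   # cell -> current gain, used to detect stale heap entries

    def set_gain(self, region, cell, gain):
        """Insert a cell or update its gain; old heap entries become stale."""
        self.gain[cell] = gain
        heapq.heappush(self.local.setdefault(region, []), (-gain, cell))

    def _top(self, region):
        heap = self.local[region]
        while heap and self.gain.get(heap[0][1]) != -heap[0][0]:
            heapq.heappop(heap)  # drop stale entry
        return heap[0] if heap else None

    def pop_best(self, region_ok=lambda region: True):
        """Remove and return (region, cell, gain) with the highest gain, or None."""
        best = None
        for region in self.local:
            if not region_ok(region):
                continue  # e.g. moving from this region would violate balance
            top = self._top(region)
            if top is not None and (best is None or top < best[0]):
                best = (top, region)
        if best is None:
            return None
        (neg_gain, cell), region = best
        heapq.heappop(self.local[region])
        del self.gain[cell]
        return region, cell, -neg_gain

gains = TwoLevelGains()
gains.set_gain("R1", "a", 2)
gains.set_gain("R2", "b", 5)
gains.set_gain("R1", "a", 7)   # gain update after some other move
print(gains.pop_best())        # best cell over all regions
```

Passing a predicate such as `lambda r: r != "R1"` mimics skipping a region whose best move would violate a balance constraint.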
Post-processing
• Intra-row horizontal neighbor swap
• Intra-row clustering based on int/ext nets ratio
• Inter-row vertical swap
– some cells have to be shifted due to cell overlap
• Results in about 1-2% improvement
[Figures: horizontal neighbor swap; vertical cell swap]
Experimental Evaluation
• MCNC standard cell benchmarks: up to 100k cells
• Compared with prior methods
– TimberWolf 7.0 [Sun et al, TCAD ‘95]
– FD-98 [Eisenmann et al, DAC ‘98]
– QUAD [Huang et al, ISPD ‘97]
– Snap-On [Yang et al, ISPD ‘00]
• Same number of rows as TimberWolf 7.0
• Some IBM-PLACE circuits also tested (ibm11-ibm15) and compared to iTools [internetCAD]
• Experiments conducted on 550 MHz Pentium-III
Linux workstations
Comparison with Previous Methods
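SLP vs. sequential level partitioning

                      SPADE-FM   Sequential   WL imprv.
Total WL (6 ckts)     52.86      65.57        19.38%
Total time (6 ckts)   7052       1719

Wire length by circuit ("-" = not reported)

Circuit             SPADE-FM       TW 7.0    FD-98     QUAD      Snap-On   SPADE-PROP
primary1            0.74           0.83      0.87      0.9       0.95      0.74
struct              0.291          -         0.338     0.378     -         0.285
primary2            3.13           3.53      3.72      3.68      3.66      3.07
biomed              1.43           1.61      1.78      -         1.84      1.38
industry2           11.9           13.3      14.6      -         14.48     12.07
industry3           35.37          41.53     45.1      -         44.7      35.09
avqsmall            5.59           5.08      4.91      6.29      5.15      5.31
avqlarge            6.16           5.65      5.38      6.59      5.21      5.61
golem3              19.84          22.6      -         -         -         19.64
Total (8/8 ckts)    84.16 / 64.61  94.13 /   / 76.70   -         -         82.91 / 63.56
Total (5/7 ckts)    15.94 / 64.32  -         -         17.84 /   / 75.99   15.02 / 63.27
SPADE-FM imprv.     -              10.60%    15.80%    10.70%    15.30%    -
SPADE-PROP imprv.   -              11.92%    17.13%    15.81%    16.74%    -

In the Total rows, the value before "/" is summed over the circuits reported for TW 7.0 (8 ckts) or QUAD (5 ckts); the value after "/" is summed over the circuits reported for FD-98 (8 ckts) or Snap-On (7 ckts).

Run time            SPADE-FM   SPADE-PROP
run time (8 ckts)   15001      18108
run time (6 ckts)   14710      18071
scaled time ratio   1          1.21

Run times reported for prior methods, with scaled time ratios in parentheses: TW 7.0 19034 (0.69), FD-98 7173 (0.26), QUAD 57920 (1.16).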
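The improvement percentages appear to be relative wire-length reductions with respect to the prior method's total, taken over the circuits that method reports. This reading is an inference from the numbers rather than a stated definition; two of the reported totals check out under it:

```latex
\[
\text{imprv} = \frac{WL_{\text{prior}} - WL_{\text{SPADE}}}{WL_{\text{prior}}},
\qquad
\frac{94.13 - 84.16}{94.13} \approx 10.6\% \ \text{(SPADE-FM vs. TW 7.0)},
\qquad
\frac{76.70 - 63.56}{76.70} \approx 17.1\% \ \text{(SPADE-PROP vs. FD-98)}
\]
```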
Other Experimental Results
• Trade-off between run time and solution quality of
SPADE-FM with 8 and 16 runs for the MCNC suite
Trade-off     SPADE-FM/8   SPADE-FM/16   Best WL   16 vs 8
Total WL      89.65        84.45         82.87     5.81%
Total time    29117        37738         -         1.3x

• Results for IBM-PLACE Benchmarks

Circuit             SPADE-FM   SPADE-PROP   iTools
ibm11               37.27      36.28        39.76
ibm12               66.52      64.92        69.56
ibm13               42.94      42.4         49.11
ibm14               121.38     121.17       118.8
ibm15               134.68     130.45       130.6
Total WL            402.79     395.22       407.83
imprv. vs. iTools   1.24%      3.10%        -
Conclusions and Future Work
• Introduced novel concepts of:
– SLP
– global net view
– bounding-box based gain computation
• PDP alone can be competitive (in fact better)
– up to 15.8% better in aggregate result than the state of the art
– among large circuits:
• best-known result for the largest MCNC circuit, golem3
• best-known results for ibm11-ibm13
• Run time reasonable, but can be reduced
– early-stop per pass
– multilevel clustering
• On-going work
– timing-driven PDP
– multi-constraint PDP (congestion, thermal distribution, multiple objectives)