Combinational and Sequential Mapping with Priority Cuts
Alan Mishchenko
Sungmin Cho
Satrajit Chatterjee
Robert Brayton
UC Berkeley
Outline
1. Traditional cut-based LUT mapping
2. Improved technology mapping with priority cuts
3. Sequential mapping
4. Other applications of priority cuts
5. Experimental results
Technology Mapping
Input: A Boolean network (And-Inverter Graph)
Output: A netlist of K-LUTs implementing the Boolean network and optimizing some cost function
[Figure: technology mapping turns the subject graph (output f, inputs a, b, c, d, e) into the mapped netlist (output f, inputs a, b, c, d, e)]
k-feasible Cuts
A cut of a node n is a set of nodes in its transitive fan-in such that every path from the node to the PIs is blocked by nodes in the cut.
A k-feasible cut is a cut of size k or less.
[Figure: example graph with nodes r, p, q, a, b, c]
The set {p, b, c} is a 3-feasible cut of node r. (It is also a 4-feasible cut.)
k-feasible cuts are important in FPGA mapping since the logic between a node and the nodes in its cut can be replaced by a k-LUT.
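The cut definition can be checked mechanically. The following is a minimal sketch, assuming the example graph is r = f(p, q), p = f(a, b), q = f(b, c) with a, b, c as PIs (a structure inferred from the cut sets shown for this example, not stated on the slide). The names FANINS, is_cut and is_k_feasible_cut are hypothetical.

```python
# Assumed fan-in map for the example graph; a, b, c are primary inputs.
FANINS = {"r": ["p", "q"], "p": ["a", "b"], "q": ["b", "c"]}


def is_cut(node, cut, fanins=FANINS):
    """Return True if every path from `node` to a PI is blocked by `cut`."""
    if node in cut:
        return True          # this path is blocked here
    if node not in fanins:
        return False         # reached a PI without meeting a cut node
    return all(is_cut(f, cut, fanins) for f in fanins[node])


def is_k_feasible_cut(node, cut, k, fanins=FANINS):
    """A k-feasible cut is a cut with at most k nodes."""
    return len(cut) <= k and is_cut(node, cut, fanins)


print(is_k_feasible_cut("r", {"p", "b", "c"}, 3))  # the slide's example cut
print(is_k_feasible_cut("r", {"p", "b"}, 3))       # path r-q-c is not blocked
```

The sketch only tests the blocking property and the size bound; it does not check that a cut is irredundant.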
k-feasible Cut Computation
The set of cuts of a node is a ‘cross product’ of the sets of cuts of its children.
Computation is done bottom-up:
a: { {a} }
b: { {b} }
c: { {c} }
p: { {p}, {a, b} }
q: { {q}, {b, c} }
r: { {r}, {p, q}, {p, b, c}, {a, b, q}, {a, b, c} }
Any cut of size greater than k is discarded.
(P. Pan et al, FPGA ’98; J. Cong et al, FPGA ’99)
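The bottom-up cross product can be sketched as follows. This is an illustrative sketch, not the authors' implementation: the function name enumerate_cuts and the dictionary-based graph are assumptions, and dominance filtering is left out.

```python
from itertools import product

# Assumed two-input example graph; a, b, c are primary inputs.
FANINS = {"r": ("p", "q"), "p": ("a", "b"), "q": ("b", "c")}
TOPO_ORDER = ["a", "b", "c", "p", "q", "r"]


def enumerate_cuts(fanins, order, k):
    """Return {node: set of k-feasible cuts}, computed from PIs to POs."""
    cuts = {}
    for n in order:
        result = {frozenset([n])}               # trivial cut {n}
        if n in fanins:
            left, right = fanins[n]
            for c0, c1 in product(cuts[left], cuts[right]):
                merged = c0 | c1                # union of one cut per child
                if len(merged) <= k:            # discard cuts larger than k
                    result.add(merged)
        cuts[n] = result
    return cuts


cuts = enumerate_cuts(FANINS, TOPO_ORDER, k=3)
for cut in sorted(cuts["r"], key=lambda c: (len(c), sorted(c))):
    print(sorted(cut))
```

With k = 3 the five cuts of r listed above are all kept; with k = 2 only {r} and {p, q} survive.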
Basic Mapping Algorithm
Depth-optimal LUT mapping of a DAG using all cuts at each node
Input: And-Inverter Graph
1. Compute K-feasible cuts for each node
2. Compute best arrival time at each node
• In topological order (from PI to PO)
• Compute the depth of all cuts and choose the best one
3. Perform area recovery
• Using area flow
• Using exact local area
4. Choose the best cover
• In reverse topological order (from PO to PI)
Output: Mapped Netlist
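Steps 1, 2 and 4 can be sketched in a few lines under a unit-delay LUT model. This is a simplified sketch with hypothetical names (map_for_depth and the dictionary-based graph); area recovery (step 3) is omitted, and ties are broken by cut size and then lexicographically only to make the result deterministic.

```python
from itertools import product


def map_for_depth(fanins, order, pos, k):
    """Return (arrival times, {LUT root: chosen cut}) for a two-input DAG."""
    cuts, arrival, best = {}, {}, {}
    for n in order:                                  # topological order, PI to PO
        if n not in fanins:                          # primary input
            cuts[n], arrival[n] = {frozenset([n])}, 0
            continue
        a, b = fanins[n]
        merged = {c0 | c1 for c0, c1 in product(cuts[a], cuts[b])}
        candidates = {c for c in merged if len(c) <= k}
        # Depth of a cut is the latest arrival time among its leaves.
        best[n] = min(candidates,
                      key=lambda c: (max(arrival[x] for x in c), len(c), sorted(c)))
        arrival[n] = 1 + max(arrival[x] for x in best[n])
        cuts[n] = candidates | {frozenset([n])}
    luts, stack = {}, list(pos)                      # cover from POs back to PIs
    while stack:
        n = stack.pop()
        if n in luts or n not in fanins:
            continue
        luts[n] = best[n]
        stack.extend(best[n])
    return arrival, luts


fanins = {"r": ("p", "q"), "p": ("a", "b"), "q": ("b", "c")}
order = ["a", "b", "c", "p", "q", "r"]
arrival, luts = map_for_depth(fanins, order, pos=["r"], k=3)
print(arrival["r"], {n: sorted(c) for n, c in luts.items()})
```

On this assumed example a single 3-LUT covers r; with k = 2 the cover needs three LUTs and depth 2.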
Area Recovery Summary
• Area recovery heuristics
– Area-flow (global view)
• Chooses cuts with better logic sharing
– Exact local area (local view)
• Minimizes the number of LUTs needed to map each node
• The results of area recovery depend on
– The order of processing nodes
– The order of applying the two passes
– The number of iterations
• This scheme works for the constant-delay model
– Any change off the critical path does not affect the critical path
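The slides do not define area flow. The sketch below assumes the commonly used definition, in which the area flow of a node is the LUT area plus the area flow of the cut leaves, divided by the node's fanout count, with primary inputs contributing zero. The function name and arguments are hypothetical.

```python
def area_flow(node, cut, flow, fanout_count, lut_area=1.0):
    """Area flow of `node` when implemented with `cut` (assumed definition).

    flow[leaf] holds the area flow already computed for each leaf;
    fanout_count[node] is the estimated number of fanouts of the node.
    """
    total = lut_area + sum(flow[leaf] for leaf in cut)
    return total / max(1, fanout_count[node])


# Primary inputs have zero area flow; every node here has a single fanout.
flow = {"a": 0.0, "b": 0.0, "c": 0.0}
fanouts = {"p": 1, "q": 1, "r": 1}
flow["p"] = area_flow("p", {"a", "b"}, flow, fanouts)
flow["q"] = area_flow("q", {"b", "c"}, flow, fanouts)
print(area_flow("r", {"a", "b", "c"}, flow, fanouts))  # one LUT for the cone
print(area_flow("r", {"p", "q"}, flow, fanouts))       # three LUTs for the cone
```

Dividing by the fanout count is what rewards logic sharing: a leaf with two fanouts charges only half of its area to each of them.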
Drawbacks of Traditional Mapping Based on Exhaustive Cut Enumeration
• For large designs, there may be many k-feasible cuts
– Order of millions
• Previous ways of dealing with the problem
– Detect and remove cut dominance
– Perform cut pruning
– Store only cuts on the frontier of mapping

k    Average number of cuts per node
4    6
5    20
6    80
7    150
8    240
New Mapping Algorithm
Near-depth-optimal LUT mapping of a DAG using several cuts at each node
Input: And-Inverter Graph
1. Compute K-feasible cuts for each node
2. Compute arrival time at each node
• In topological order (from PI to PO)
• Compute the depth of all cuts and choose the best one
• Compute at most C good cuts and choose the best one
3. Perform area recovery
• Using area flow
• Using exact local area
• Re-compute at most C good cuts and choose the best one in each iteration
4. Choose the best cover
• In reverse topological order (from PO to PI)
Output: Mapped Netlist
Computing Priority Cuts
• Consider nodes in a topological order
– At each node, merge two sets of fanin cuts (each containing C cuts), getting (C+1) * (C+1) + 1 cuts
– Sort these cuts using a given cost function, select C best cuts, and use them for computing priority cuts of the fanouts
– Select one best cut, and use it to map the node
• Sorting criteria

Mapping pass    Primary metric    Tie-breaker 1    Tie-breaker 2
Depth           depth             cut size         area flow
Area flow       area flow         fanin refs       depth
Exact area      exact area        fanin refs       depth
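The depth-oriented pass can be sketched as follows. This is a simplified sketch with hypothetical names (priority_cuts, c_max): it sorts by depth and then cut size as in the table, but replaces the area-flow tie-breaker with a lexicographic one to stay short and deterministic, and it does not carry over the best cut from a previous pass.

```python
from itertools import product


def priority_cuts(fanins, order, k, c_max):
    """Keep at most c_max cuts per node, sorted for a depth-oriented pass."""
    cuts, depth, best = {}, {}, {}
    for n in order:                                   # topological order
        if n not in fanins:                           # primary input
            cuts[n], depth[n] = [frozenset([n])], 0
            continue
        a, b = fanins[n]
        # Each fanin contributes its stored cuts plus its trivial cut.
        merged = {c0 | c1 for c0, c1 in product(cuts[a], cuts[b])}
        feasible = [c for c in merged if len(c) <= k]
        feasible.sort(key=lambda c: (max(depth[x] for x in c), len(c), sorted(c)))
        kept = feasible[:c_max]                       # the C best cuts
        best[n] = kept[0]                             # cut used to map the node
        depth[n] = 1 + max(depth[x] for x in best[n])
        cuts[n] = kept + [frozenset([n])]             # trivial cut for fanouts
    return cuts, best, depth


fanins = {"r": ("p", "q"), "p": ("a", "b"), "q": ("b", "c")}
order = ["a", "b", "c", "p", "q", "r"]
cuts, best, depth = priority_cuts(fanins, order, k=3, c_max=1)
print(sorted(best["r"]), depth["r"])
```

Even with c_max = 1 the depth-optimal cut {a, b, c} of r is found in this assumed example, because the best cuts of p and q are exactly the ones needed to build it.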
Discussion
• Complexity analysis
– Notation
• K - max cut size
• C - max number of cuts
• n - number of nodes
• m - number of edges
– Traditional mapping algorithm
• FlowMap O(Kmn) (J. Cong et al, TCAD ’94)
• CutMap O(2Kmn^K) (J. Cong et al, FPGA ’95)
– Proposed mapping algorithm
• O(KC^2n)
Priority Cuts: A Bag of Tricks
• Compute and use priority cuts (a subset of all cuts)
• Dynamically update the cuts in each mapping pass
• Use different sorting criteria in each mapping pass
• Include the best cut from the previous pass into the set of candidate cuts of the current pass
• Consider several depth-oriented mappings to get a good starting point for area recovery
• Use complementary heuristics for area recovery
• Perform cut expansion as part of area recovery
• Use efficient memory management
Sequential Mapping
• That is, combinational mapping and retiming combined
• Minimizes clock period in the combined solution space
• Previous work:
– Pan et al, FPGA’98
– Cong et al, TCAD’98
• Our contribution: divide sequential mapping into steps
1. Find the best clock period via sequential arrival time computation (Pan et al, FPGA’98)
2. Run combinational mapping with the resulting arrival/required times of the register outputs/inputs
3. Perform final retiming to bring the circuit to the best clock period computed in Step 1
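Step 1 can be illustrated with a strongly simplified sketch: every node has unit delay and no cut selection is performed, so this shows only the register-transparent arrival-time iteration, not the mapper itself. The update rule l(v) = max over fanins (l(u) - w * phi) + delay, the feasibility test, and all names are assumptions made for illustration.

```python
def period_is_feasible(phi, fanins, pis, pos, delay=1):
    """Check a candidate clock period using sequential arrival times.

    fanins[v] is a list of (u, w): an edge u -> v carrying w registers.
    """
    l = {v: float("-inf") for v in fanins}
    l.update({pi: 0 for pi in pis})
    for _ in range(len(fanins) + 1):
        changed = False
        for v, ins in fanins.items():
            # Crossing w registers moves the signal w clock periods earlier.
            new = max(l[u] - w * phi for u, w in ins) + delay
            if new > l[v]:
                l[v], changed = new, True
        if not changed:                      # converged: check the outputs
            return all(l[po] <= phi for po in pos)
    return False                             # still growing: a loop is too slow


def best_period(fanins, pis, pos, upper):
    """Smallest feasible integer period up to `upper`, or None."""
    return next((phi for phi in range(1, upper + 1)
                 if period_is_feasible(phi, fanins, pis, pos)), None)


# A loop of three unit-delay nodes with one register on the back edge.
fanins = {"g1": [("x", 0), ("g3", 1)], "g2": [("g1", 0)], "g3": [("g2", 0)]}
print(best_period(fanins, pis=["x"], pos=["g3"], upper=10))
```

A linear scan over periods is used here for brevity; a binary search over the period is the more usual choice.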
Sequential Mapping (continued)
• Advantages
– Uses priority cuts (L=1) for computing sequential arrival times
• very fast
– Reuses efficient area recovery available in combinational mapping
• almost no degradation in LUT count and register count
– Greatly simplifies implementation
• due to not computing sequential cuts (cuts crossing register boundary)
• Quality of results
– Leads to quality that is better (by ~15%) than combinational mapping
followed by retiming
• due to searching the combined search space
– Achieves almost the same (-1%) clock period as the general sequential
mapping with sequential cuts
• due to using transparent register boundary without computing sequential
cuts
Speeding Up SAT Solving
• Perform technology mapping into K-LUTs for area
– Define area as the number of CNF clauses needed to represent
the Boolean function of the cut
– Run several iterations of area recovery
• Reduces the number of CNF clauses by ~50%
– Compared to a smart circuit-to-CNF translation (M. Velev)
• Improves SAT solver runtime by 3-10x
– Experimental results will be given later
Minimizing the Total Number of BDD Nodes
Needed to Represent a Boolean Network
• Perform technology mapping into K-LUTs for minimizing
area under delay constraints
– Define area of a cut as the number of BDD nodes needed to
represent the Boolean function of the cut
– Run delay-oriented mapping, followed by several iterations of
area recovery
Cut Sweeping
• Reduce the circuit by detecting and merging shallow equivalences (proposed by Niklas Een)
– By “shallow” equivalences, we mean equivalent points, A and B, for which there exists a K-cut C (K < 16) such that F_A(C) = F_B(C)
– A subset of “good” K-input priority cuts can be computed
– The quality of a cut is determined by the number of fanouts of the cut leaves
• The more fanouts, the more likely the cut is a common cut for two nodes
• Cut sweeping quickly reduces the circuit
– Typically achieves ~50% of the gain of SAT sweeping (FRAIGing)
• Cut sweeping is much faster than SAT sweeping
– Typically 10-100x, for large designs
• Can be used as fast preprocessing for (or a low-cost substitute for) SAT sweeping
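The equivalence test F_A(C) = F_B(C) can be sketched by hashing truth tables computed over a shared cut. This is an illustrative sketch only: the small AIG, the names truth_table and sweep, and the choice of one cut per node are assumptions, and equivalence up to complementation is not detected.

```python
from itertools import product

# Hypothetical AIG: each AND node maps to two (fanin, complemented) pairs.
ANDS = {
    "x": (("a", False), ("b", False)),
    "z": (("b", False), ("c", False)),
    "y": (("x", False), ("c", False)),   # y = (a & b) & c
    "w": (("a", False), ("z", False)),   # w = a & (b & c)
}


def truth_table(node, leaves, ands):
    """Truth table of `node` over the ordered cut `leaves`."""
    def evaluate(n, assignment):
        if n in assignment:
            return assignment[n]
        (f0, c0), (f1, c1) = ands[n]
        return (evaluate(f0, assignment) ^ c0) and (evaluate(f1, assignment) ^ c1)
    return tuple(evaluate(node, dict(zip(leaves, bits)))
                 for bits in product([False, True], repeat=len(leaves)))


def sweep(node_cuts, ands):
    """Map each node to an earlier node with the same function on the same cut."""
    seen, merged = {}, {}
    for node, cut in node_cuts:
        leaves = tuple(sorted(cut))
        key = (leaves, truth_table(node, leaves, ands))
        if key in seen:
            merged[node] = seen[key]
        else:
            seen[key] = node
    return merged


print(sweep([("y", {"a", "b", "c"}), ("w", {"a", "b", "c"})], ANDS))
```

Here y and w are structurally different but share the cut {a, b, c} and the same function, so w is merged into y without calling a SAT solver.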
Sequential Resynthesis for Delay
• Restructure logic along the tightest
sequential loops to reduce delay after
retiming (Soviani/Edwards, TCAD’07)
– Similar to sequential mapping
– Computes seq arrival times for the circuit
– Uses the current logic structure, as well as
the logic structure transformed using Shannon
expansion w.r.t. the latest-arriving variables
– Accepts transforms leading to delay reduction
– In the end, retimes to the best clock period
• The improvement is 7-60% in delay with
1-12% area degradation (ISCAS circuits)
• This algorithm could benefit from the use
of priority cuts
Experimental Comparison
• Compare the new mapping against the traditional mapping in terms of
– Delay
– Area
– Runtime
– Memory
• Compare on large industrial benchmarks with choices
• Analyze the performance of the new mapping for
– Large designs
– Large LUTs
• Explore the potential of sequential mapping
• Computer used for experiments
– IBM ThinkPad laptop with a 1.6 GHz CPU and 2 GB RAM
Priority Cuts vs. Cut Enumeration (C=8)

Ratio      K=4           K=6           K=8           K=10
           old    new    old    new    old    new    old    new
Depth      1.00   1.00   1.00   1.00   1.00   0.93   1.00   0.82
Area       1.00   0.99   1.00   1.00   1.00   0.96   1.00   0.84
Memory     1.00   0.12   1.00   0.06   1.00   0.05   1.00   0.05
Runtime    1.00   0.78   1.00   0.15   1.00   0.02   1.00   0.03

Used a set of large public benchmarks
Priority Cuts vs. Cut Enumeration (K=6, C=16)
[Figure: plots comparing priority cuts and cut enumeration; panels: mapping with choices, mapping w/o choices]
Used a set of large industrial benchmarks
Performance on Large Designs (C=1)

Number of   AIG statistics      FPGA mapping statistics   Computer resources
frames      Levels   Nodes      Depth   Number of LUTs    Memory, Mb   Runtime, sec
1           18       40381      4       11069             2.21         0.02
20          284      808135     61      205143            42.68        0.42
40          564      1616285    121     409149            85.28        0.84
60          844      2424435    181     613155            127.88       1.35
80          1124     3232585    241     817161            170.48       1.77
100         1404     4040735    301     1021167           213.09       2.25

Using design wb_conmax.v (part of IWLS 2005 benchmarks)
This is a WISHBONE Interconnect Matrix IP core. It can interconnect up to 8 Masters and 16 Slaves.
Source: http://www.opencores.org
Performance for Large LUTs (C=1)

LUT     FPGA mapping statistics   Computer resources
size    Depth   Number of LUTs    Memory, Mb   Runtime, sec
4       602     2279062           114.74       1.89
6       451     1704400           147.52       2.00
8       352     1205319           180.30       2.19
10      301     1021167           213.09       2.24
12      276     1044370           245.87       2.50
14      227     799618            278.65       2.55
16      202     694954            311.43       2.62

Using 100 timeframes of design wb_conmax.v
Sequential Mapping (K=6, C=8)

            Statistics        Depth (LUTs)        Area (LUTs)         Area (registers)    Time, sec
Name        PI   PO   AIG     M     M+R   MR      M     M+R   MR      M     M+R   MR      M     MR
s13207      31   121  2136    6     5     4       1047  1047  1056    648   666   733     0.06  0.23
s1423       17   5    441     10    10    9       131   131   146     74    74    80      0.01  0.04
s15850.1    77   150  2755    9     7     6       1012  1012  1042    516   552   533     0.09  0.38
s15850      14   87   2760    9     7     5       1002  1002  1015    563   640   640     0.09  0.43
s35932      35   320  8129    3     3     2       2320  2320  2320    1728  1728  1872    0.19  0.45
s382        3    6    100     3     3     2       36    36    34      21    21    22      0.00  0.04
s38417      28   106  8171    6     6     5       2623  2623  2901    1564  1564  1636    0.28  3.02
s38584.1    38   304  9967    6     6     5       2491  2491  2558    1276  1276  1299    0.31  0.81
s38584      12   278  9989    6     6     5       2504  2504  2517    1301  1301  1327    0.31  0.92
s9234.1     36   39   1349    5     5     3       319   319   332     145   145   171     0.03  0.10
s9234       19   22   1349    5     4     3       321   321   330     160   181   182     0.02  0.14
Ratio                         1.00  0.93  0.71    1.00  1.00  1.03    1.00  1.03  1.08    1.00  4.54

Used a subset of ISCAS benchmarks, for which retiming reduced delay
Summary
• Reviewed traditional technology mapping
– Cut computation
– Optimum-depth mapping
– Area recovery
• Presented an improved approach to mapping
– Computes a small number of cuts at each node
– Uses new ideas to dramatically reduce memory and runtime
• Reported experimental results
– Compared priority cuts with exhaustive cut enumeration
• Delay and area are comparable or better by 1-3%
• Memory and runtime are greatly reduced (5x for 6-LUTs)
– Showed performance on very large designs (about 2 sec to map a design into 1M LUTs)
– Compared combinational and sequential mapping
• Implemented in ABC
– Google: “abc berkeley” (package “if”)