
Design, Synthesis and Evaluation
of Heterogeneous FPGA
with Mixed LUTs and Macro-Gates
Yu Hu (1), Satyaki Das (2), Steve Trimberger (2), and Lei He (1)
(1) Electrical Engineering Dept., UCLA
(2) Research Labs, Xilinx Inc.
Presented by Yu Hu
Address comments to [email protected]
Outline
 Introduction
 Design of the Macro-gates
 Synthesis for the Proposed FPGA Architecture
 Comparison of Heterogeneous FPGA Architectures
 Conclusions and Future Work
Heterogeneity in FPGA Architectures
 Heterogeneity among SLICEs
    Programmable logic and routing
    Tiles are not identical
        soft logic fabric [Kaviani, FPGA’96]
        hard structures [Jamieson, FPL’05]
    Dedicated hard structures
        e.g., DSP
        e.g., memory block
 Heterogeneity within a SLICE
    Programmable logic and routing
    Tiles (SLICEs) are identical
    Different logic elements exist within a SLICE
        e.g., LUTs of different sizes [Cong, FPGA’99]
        e.g., mixed PLAs and LUTs [Cong, TODAES’05]
        e.g., mixed macro-gates and LUTs
(source: Jamieson@FPL’05)
Heterogeneous FPGA with Macro-Gates
 There is a trade-off between programmability and cost when choosing between LUTs and macro-gates
    Xilinx V4 benefits from small gates (MUX2, XOR2) built into SLICEs.
 The benefit of wider macro-gates
    The effectiveness of incorporating wider logic functions (macro-gates) is not clear.
 Our contributions
    Design a new FPGA architecture with mixed LUTs and macro-gates
    Propose a new automatic synthesis flow for mapping a circuit to the proposed FPGA architecture
    Evaluate the architecture and show that it reduces delay and area by 16.5% and 30%, respectively, compared to the LUT-only architecture.
Overview of Macro-Gate Design
 Key problem
    Select the logic functions for the macro-gate
 Problem formulation
    Input: a set of training circuits that have been mapped to K-input LUTs
    Output: N K-input Boolean functions f1, …, fN
    Objective: maximize the number of logic functions in the training circuit set that can be implemented by f1, …, fN
 The proposed solution
    Rank the logic functions for a set of training circuits
NPN-Class Diagram: Organization of Logics
 Canonical and efficient representation of all NPN classes
    NPN-equivalent: functionally equivalent under input negation, input permutation, or output negation
    E.g., f(a,b,c) = a + bc and g(a,b,c) = b’a + b’c
 The NPN-cofactor relationship is indicated
 DAG: easy to manipulate
 It becomes impractical to compute for functions with more than 6 inputs!
    Solution: Utilization NPN-Class Diagram
[Figure: NPN-class diagram with levels for constants (Level 0), 1-input (Level 1), 2-input (Level 2), 3-input (Level 3), and wider-input functions]
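The NPN-equivalence of the example pair f and g can be checked by brute force when the input count is small. The sketch below is only an illustration and not the diagram construction used in this work: `truth_table` and `npn_canonical` are hypothetical helper names, and using the numerically smallest truth table in the orbit as the canonical form is an assumption.

```python
from itertools import permutations

def truth_table(func, n):
    """Pack func's outputs into an int; bit i holds func on the inputs encoded by i."""
    tt = 0
    for i in range(1 << n):
        bits = [(i >> k) & 1 for k in range(n)]
        if func(*bits):
            tt |= 1 << i
    return tt

def npn_canonical(tt, n):
    """Smallest truth table reachable by input permutation, input negation, output negation."""
    size = 1 << n
    full = (1 << size) - 1
    best = None
    for perm in permutations(range(n)):
        for neg in range(size):
            t = 0
            for i in range(size):
                # New input k drives old input perm[k], optionally negated.
                j = 0
                for k in range(n):
                    bit = ((i >> k) & 1) ^ ((neg >> k) & 1)
                    j |= bit << perm[k]
                if (tt >> j) & 1:
                    t |= 1 << i
            for cand in (t, t ^ full):  # with and without output negation
                if best is None or cand < best:
                    best = cand
    return best

f = lambda a, b, c: a or (b and c)
g = lambda a, b, c: ((not b) and a) or ((not b) and c)
same_class = npn_canonical(truth_table(f, 3), 3) == npn_canonical(truth_table(g, 3), 3)
print(same_class)  # True: g is the output negation of f(b, a', c')
```

Each function needs n!·2^(n+1) transforms under this exhaustive approach, so it is only meant for very small n.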
UND: Utilization NPN-Class Diagram
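 A UND is a DAG and a sub-graph of the NPN-class diagram (NCD)
 It helps with scoring and ranking functions
[Figure: example UND over the functions ab’c’+a’bc’, abc, ab’+a’b, ab, a, and the constant 0. Node labels: "ab’c’+a’bc’ / 1 / xx%", "abc / 1 / xx%", "ab’+a’b / 0 / xx%", "ab / 0 / xx%", "a / 0 / xx%", "-0- / 0 / xx%". The legend names the label fields: functionality, implementation capability, appearance frequency]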
UND: Utilization NPN-Class Diagram
[Figure: the same UND with updated labels: ab’+a’b and a change from "/ 0" to "/ 1 / xx%"; the remaining labels are "ab’c’+a’bc’ / 1 / xx%", "abc / 1 / xx%", "ab / 0 / xx%", and "-0- / 0 / xx%"]
UND: Utilization NPN-Class Diagram
 Calculate implementation capability
[Figure: UND with node labels "ab’c’+a’bc’ / 1 / 75%", "abc / 1 / 50%", "ab’+a’b / 1 / 50%", "ab / 0 / 25%", "a / 1 / 25%", and "-0- / 0 / xx%"; a callout marks the fanout cone of ab’c’+a’bc’]
The topological property of the UND (it is a DAG) lets us efficiently explore different metrics for functionality ranking, e.g., utilization rate.
Recap: Overall Flow for Macro-Gate Design
[Figure: flow illustrated on an example LUT network, with annotations such as and2(3), nand2(2), and inv(1), extracted truth tables (0000001000000000, 0000010000000000, 0000100000000000, 0001000000000000, 0010000000000000, 0100000000000000, ……), and the UND from the previous slides]
 Map with LUT-N
 Extract logic functions
 Generate the Utilization NPN Diagram
 Calculate the score for the logic functions (scores shown in the figure: 1+1*2/3+1*1/3=2, 1+1*1/3=1.33, 1+1*1/2=1.5, 1*1/2=0.5, and 1)
 Rank the logic functions
Best function: ab’c’+a’bc’
Proposed Macro-Gates and FPGA Architecture
 For the IWLS’05 benchmarks, the following four 6-input functions have the highest ranks
    GI1 = a b c d e f (AND-6)
    GI2 = a’ b’ c’ + b c f’ + b c’ d’ + b’ c e (MUX-4)
    GI3 = a b’ c d’ e + b c e f + d e f
    GI4 = a b’ + a’ c d’ + b’ c’ + e’ + f’
 It can implement over 50% of the logic functions in the IWLS’05 benchmarks.
 The architecture of the proposed macro-gate and FPGA SLICE is shown in the figure.
[Figure: architecture of the proposed macro-gate and FPGA SLICE]
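The AND-6 and MUX-4 labels attached to GI1 and GI2 can be checked exhaustively over all 64 input combinations. The sketch below is a sanity check under stated assumptions: the helper names `gi1`, `gi2`, and `mux4` are hypothetical, and the mapping of b and c to the select lines, with a’, e, d’, and f’ as data inputs, is read off the product terms of GI2 rather than taken from the slides.

```python
from itertools import product

def gi1(a, b, c, d, e, f):
    """GI1 as listed: the AND of all six inputs."""
    return bool(a and b and c and d and e and f)

def gi2(a, b, c, d, e, f):
    """GI2 as listed, in sum-of-products form."""
    return bool((not a and not b and not c) or (b and c and not f)
                or (b and not c and not d) or (not b and c and e))

def mux4(s1, s0, d0, d1, d2, d3):
    """Plain 4:1 multiplexer: the select pair (s1, s0) picks one data input."""
    return bool((d0, d1, d2, d3)[2 * s1 + s0])

# Assumed pin mapping: selects (b, c); data inputs a', e, d', f'.
mismatches = [v for v in product((0, 1), repeat=6)
              if gi2(*v) != mux4(v[1], v[2], not v[0], v[4], not v[3], not v[5])]
print(len(mismatches))  # 0: GI2 is a MUX-4 with three inverted data inputs
```

Because input negation is allowed within an NPN class, the inverted data inputs do not change the class, which is consistent with the MUX-4 label.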
Outline
 Design of the Embedded Macro-gates
 Synthesis for the Proposed FPGA Architecture
    Technology Mapping for Heterogeneous FPGAs
    SAT-based Packing
    Placement and Routing
 Comparison of Heterogeneous FPGA Architectures
 Conclusions and Future Work
Functional & Structural Cut Enumeration
[Figure: example network with inputs w, x, y, z and nodes a = (x+y)’, b = y+wz, and d = ab = (x+y)’(y+wz) = x’y’wz. The cut function is checked against a 4-input macro-gate library of truth tables (0000001000000000, 0000010000000000, 0000100000000000, 0001000000000000, 0010000000000000, 0100000000000000, ……): "Is x’y’wz in library?" Yes]
 Phase 1: Enumerate and label cuts from PIs to POs
    Check the feasibility of a cut w.r.t. the macro-gate
 Phase 2: Select the best choice from POs to PIs
 A general yet efficient solution is SAT-based Boolean matching
    "Exploiting Symmetry in SAT-Based Boolean Matching for Heterogeneous FPGA Technology Mapping", Session 5C.1, ICCAD 07
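The feasibility check in the example (collapse the cut into one function, then look it up in the macro-gate library) can be sketched with plain truth-table strings. This is a simplified stand-in for the SAT-based Boolean matching mentioned above, not that method itself. The library contents are an assumption: a hypothetical set of all 16 one-hot 4-input truth tables, mirroring the patterns listed in the figure, and the bit order (w, x, y, z counted from 0000 to 1111) is also assumed.

```python
from itertools import product

def truth_table_bits(func, n):
    """Truth table as a string, one character per input combination in counting order."""
    return "".join("1" if func(*v) else "0" for v in product((0, 1), repeat=n))

# Nodes of the example cut.
a = lambda x, y: not (x or y)          # a = (x+y)'
b = lambda w, y, z: y or (w and z)     # b = y + wz
d = lambda w, x, y, z: a(x, y) and b(w, y, z)

# Collapsed form claimed in the example: d = x'y'wz.
collapsed = lambda w, x, y, z: (not x) and (not y) and w and z

# Hypothetical library: every one-hot 16-bit truth table (single-minterm functions).
library = {"0" * i + "1" + "0" * (15 - i) for i in range(16)}

tt = truth_table_bits(d, 4)
print(tt == truth_table_bits(collapsed, 4), tt in library)
```

A lookup like this only works when the library stores every permuted and negated variant explicitly; matching up to NPN equivalence is what the SAT-based formulation addresses.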
Key in Technology Mapping:
Balance Resource Utilization
 An asymmetric architecture makes balanced resource utilization difficult
 Exclusive use of one logic resource leaves much of the fabric unused
 Simple yet effective solution:
    Change the LUT-MG ratio by adjusting their area weights.
    Precise calibration is hard to reach with this approach.
[Figure: bar chart of LUT6# and MG# (y-axis 0 to 6000) for the settings 1:1, 1:0.95, 1:0.9, 1:0.8, 1:0.5, and 1:0.1, where LUT-MG ratio = LUT#/MG#. Objective architecture: LUT6:MacroGate6 = 1:1. Callouts: "Total# too large!", "Hard to obtain precise calibration", "Best LUT-MG ratio = 1:1"]
Post-Mapping Area Recovery (motivation example)
 Given:
    Target architecture = LUT6 + MG6
    LUT-MG ratio in target architecture = 1:1
    LUT# < MG# in the mapped design
    Intrinsic delay (LUT6 : MG6) = 5:4
 Objective: balance the numbers of LUTs and MGs without increasing delay
[Figure: mapped network from PI to PO with one LUT6 and six MG6 nodes; timing labels 4/5, 5/5, 8/9, 9/9, 9/13, 13/13, and 17/17]
Post-Mapping Area Recovery (motivation example)
[Figure: same setting and objective as before, now with two LUT6 and five MG6 nodes; timing labels 4/5, 5/5, 8/9, 9/9, 10/13, 13/13, and 17/17]
Post-Mapping Area Recovery (motivation example)
[Figure: same setting and objective as before, now with four LUT6 and three MG6 nodes; timing labels 5/5, 5/5, 9/9, 10/9, 10/13, 14/13, and 18/17. Callout: "Timing target violation!"]
 Timing slack budgeting is necessary!
Post Mapping Area Recovery by Timing Budgeting
 Formulated as an Integer Linear Programming (ILP) problem
 Objective (minimize the gap between the target and actual LUT-MG ratios): min |m_2 + … + m_7 − 7/2|
 Arrival time constraints: a_i + d_j + b_j ≤ a_j
 Clock period target: a_i ≤ 17
 LUT assignment with given timing slack: (5 − 4)·m_j ≤ b_j, m_j ∈ {0,1}
 Easily generalized to handle architectures
    with multiple macro-gates
    with different input pin numbers
[Figure: example network from PI to PO with one LUT6 and six MG6 nodes, annotated with arrival times a_1 to a_7]
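The effect of the formulation can be illustrated on a toy network by exhaustive search in place of an ILP solver. The sketch below rests on several assumptions: the five-node network `fanins` is hypothetical (the network in the figure is not recoverable from the text), the function names are illustrative, and spending slack b_j ≥ (5 − 4)·m_j is modeled simply as giving a LUT-mapped node delay 5 and an MG-mapped node delay 4.

```python
from itertools import product

LUT_DELAY, MG_DELAY = 5, 4  # intrinsic delays from the motivation example

def arrival_times(fanins, is_lut):
    """Longest-path arrival time per node; fanins must list nodes in topological order."""
    arr = {}
    for node, preds in fanins.items():
        start = max((arr[p] for p in preds), default=0)
        arr[node] = start + (LUT_DELAY if is_lut[node] else MG_DELAY)
    return arr

def balance(fanins, fixed_luts, clock):
    """Pick LUT/MG per node to minimize |#LUT - n/2| without exceeding the clock target."""
    nodes = list(fanins)
    free = [n for n in nodes if n not in fixed_luts]
    target = len(nodes) / 2
    best = None
    for bits in product((0, 1), repeat=len(free)):
        is_lut = {n: True for n in fixed_luts}
        is_lut.update(zip(free, bits))
        if max(arrival_times(fanins, is_lut).values()) > clock:
            continue  # this assignment violates the timing target
        gap = abs(sum(is_lut.values()) - target)
        if best is None or gap < best[0]:
            best = (gap, is_lut)
    return best

# Hypothetical network: two inputs-side nodes, two middle nodes, one output node.
fanins = {"n1": [], "n2": [], "n3": ["n1"], "n4": ["n1", "n2"], "n5": ["n3", "n4"]}
gap, is_lut = balance(fanins, fixed_luts=set(), clock=13)
print(gap, sorted(n for n, v in is_lut.items() if v))
```

With all nodes mapped to macro-gates the toy network has delay 12, so a clock target of 13 leaves room for at most one LUT on any path; the search then places as many LUTs as that slack allows.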
SAT-Based Packing
 Motivation
    Traditional packing tools, e.g., T-VPack, hard-code the architecture specification of a SLICE
        They must be re-implemented from scratch when the architecture changes
    We propose a unified implementation of the packer for different architectures: easy to perform architecture exploration!
 The architecture-dependent sub-problem in packing
    Structural feasibility checking of a sub-circuit against the SLICE
 Solution
    Treat the validation of SLICE packing as a local place-and-route problem
    Use a SAT solver to carry out the validation check
Example of SAT-Based SLICE Packing
 Examples of constraints (for each class of constraint…):
 Placement and routing choice variables: X@A, X@B, U5@N10
 Exclusivity constraint: (¬X@A) ∨ (¬X@B)
 Presence constraint: (X@A) ∨ (X@B)
 Input/output constraint: X@A → U5@N10
 Routing constraint: (G0→out ∧ U5@N10) → U5@N12
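A tiny instance of these constraint classes can be solved by enumeration to see which packings remain legal. The sketch below is illustrative only: it assumes the presence constraint means that X occupies at least one of the sites A and B, it leaves out the routing constraint because the meaning of its variables is not given here, and the clause encoding and the function name `satisfying_assignments` are hypothetical. A real flow would hand the clauses to a SAT solver.

```python
from itertools import product

VARS = ["X@A", "X@B", "U5@N10"]

# Each clause is a list of (variable, required value) literals; a clause holds
# when at least one of its literals matches the assignment.
CLAUSES = [
    [("X@A", False), ("X@B", False)],    # exclusivity: X is not at both A and B
    [("X@A", True), ("X@B", True)],      # presence (assumed): X is at A or at B
    [("X@A", False), ("U5@N10", True)],  # input/output: X@A implies U5@N10
]

def satisfying_assignments(variables, clauses):
    """Enumerate every assignment that satisfies all clauses (fine for a handful of variables)."""
    models = []
    for values in product((False, True), repeat=len(variables)):
        model = dict(zip(variables, values))
        if all(any(model[v] == want for v, want in clause) for clause in clauses):
            models.append(model)
    return models

models = satisfying_assignments(VARS, CLAUSES)
for m in models:
    print(m)
```

Every surviving assignment places X at exactly one site, and whenever X sits at A the net U5 is forced onto node N10.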
Recap: Overall Synthesis Flow
[Figure: flow chart with the stages "Area weight setting", "Cut-based mapping", "Area-balance trade-off?" (Y/N), "Post-mapping area recovery", and "Packing", illustrated on an example network that starts as LUTs and ends as a mix of LUT6 and MG6 nodes]
Outline
 Motivation and Objectives
 Methodology for Logic Function Exploration
 Technology Mapping for Heterogeneous FPGAs
 Evaluation of Heterogeneous FPGA Architectures
 Conclusions and Future Work
Experimental Setting
 Design library parameters [Cong, TODAES’05]
 Benchmark set: IWLS 2005
 Four architectures are compared:
    LUT4, LUT4 + macro-gate, LUT6, and LUT6 + macro-gate
    The proposed macro-gate is synthesized with SIS 1.2
 Delay and area model
    Interconnect delay is ignored
Delay Comparisons
 Compared to LUT4, LUT4+MG reduces both logic depth and delay by 9.2%.
 Compared to LUT6, LUT6+MG reduces delay by 30% while increasing logic depth by 36.5%.
    A LUT6 can implement more logic functions than a macro-gate
[Figure: two bar charts, "delay" and "Logic depth", each comparing LUT4, LUT4+MG, LUT6, and LUT6+MG; bar values appearing in the extracted text: 7.14, 7.86, 9.14, 5.48, 7.48, 7.14, 7.86, 10.95]
Logic Area Comparisons
 Compared to LUT4, LUT4+MG reduces logic area by 12.5%.
 Compared to LUT6, LUT6+MG reduces logic area by 16.9%.
[Figure: two bar charts, "Area" and "PLB#", each comparing LUT4, LUT4+MG, LUT6, and LUT6+MG. One chart (axis 0 to 7000) shows the bar values 6406, 3711, 2985, and 2142; the other (axis 0 to 18000) shows 16816, 11849, 7346, and 6408]
Conclusions
 Conclusions
    A novel FPGA architecture with mixed LUTs and macro-gates is proposed
    A synthesis flow for the proposed architecture is implemented
    Preliminary experimental results show that the proposed architecture is effective in reducing area and delay
 Future Work
    Perform physical design for the synthesized circuits, compare routing costs, and evaluate the architecture with interconnect delay taken into account
    Study the effectiveness of the proposed architecture for power reduction
    Examine macro-gates with wider inputs