Design, Synthesis and Evaluation of Heterogeneous FPGA with Mixed LUTs and Macro-Gates

Yu Hu (1), Satyaki Das (2), Steve Trimberger (2), and Lei He (1)
1. Electrical Engineering Dept., UCLA
2. Research Labs, Xilinx Inc.
Presented by Yu Hu
Address comments to [email protected]

Outline
  - Introduction
  - Design of the Macro-gates
  - Synthesis for the Proposed FPGA Architecture
  - Comparison of Heterogeneous FPGA Architectures
  - Conclusions and Future Work

Heterogeneity in FPGA Architectures
  Heterogeneity among SLICEs
    - Programmable logic and routing
    - Tiles are not identical
    - Soft logic fabric [Kaviani, FPGA'96], hard structures [Jamieson, FPL'05]
    - Dedicated hard structures, e.g., DSP and memory blocks
  Heterogeneity within a SLICE
    - Programmable logic and routing
    - Tiles (SLICEs) are identical
    - Different logic elements exist within a SLICE
        e.g., LUTs of different sizes [Cong, FPGA'99]
        e.g., mixed PLAs and LUTs [Cong, TODAES'05]
        e.g., mixed macro-gates and LUTs
  (Figure source: Jamieson@FPL'05)

Heterogeneous FPGA with Macro-Gates
  - There is a trade-off between programmability and cost when choosing between LUTs and macro-gates
  - Xilinx V4 benefits from small gates (MUX2, XOR2) built into SLICEs
  The benefit of wider macro-gates
    - The effectiveness of incorporating wider logic functions (macro-gates) is not clear
  Our contributions
    - Design a new FPGA architecture with mixed LUTs and macro-gates
    - Propose a new automatic synthesis flow for mapping a circuit to the proposed FPGA architecture
    - Evaluate the architecture and show that it reduces delay and area by 16.5% and 30%, respectively, compared to the LUT-only architecture
Outline
  - Introduction
  - Design of the Macro-gates
  - Synthesis for the Proposed FPGA Architecture
  - Comparison of Heterogeneous FPGA Architectures
  - Conclusions and Future Work

Overview of Macro-Gate Design
  Key problem
    - Select the logic functions for the macro-gate
  Problem formulation
    - Input: a set of training circuits that have been mapped to K-input LUTs
    - Output: N K-input Boolean functions f1, ..., fN
    - Objective: maximize the number of logic functions (in the training circuit set) that can be implemented by f1, ..., fN
  The proposed solution
    - Rank the logic functions for a set of training circuits

NPN-Class Diagram: Organization of Logics
  - Canonical and efficient representation of all NPN classes
  - NPN-equivalent: functional equivalence under input negation, input permutation, or output negation
      e.g., f(a,b,c) = a + bc and g(a,b,c) = b'a + b'c
  - The NPN-cofactor relationship is indicated
  - DAG: easy to manipulate
  - It becomes impractical to compute for functions with more than 6 inputs!
  - Solution: Utilization NPN-Class Diagram
  [Figure: NPN-class diagram levels. Level 0: constant, Level 1: 1-input, Level 2: 2-input, Level 3: 3-input, then wider inputs]

UND: Utilization NPN-Class Diagram
  - The UND is a DAG and a sub-graph of the NCD
  - It helps in scoring and ranking functions
  - Each node is labeled: functionality / implementation capability / appearance frequency
  [Figure: UND example with nodes ab'c'+a'bc' / 1 / xx%, abc / 1 / xx%, ab'+a'b / 0 / xx%, ab / 0 / xx%, a / 0 / xx%, -0- / 0 / xx%]

UND: Utilization NPN-Class Diagram
  Calculate implementation capability
  [Figure: UND with updated labels ab'c'+a'bc' / 1 / 75%, abc / 1 / 50%, ab'+a'b / 1 / 50%, ab / 0 / 25%]
  - The topological (DAG) property of the UND makes it efficient to explore different metrics for functionality ranking, e.g., utilization rate.
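To make the NPN-equivalence definition from the "NPN-Class Diagram" slide concrete, the following is a minimal sketch that canonicalizes a truth table by exhaustive search. It is not the authors' NPN-class diagram; the function names `truth_table` and `npn_canonical` are hypothetical, and the brute-force search is only practical for very small input counts, which is consistent with the slide's remark that the computation becomes impractical for wide functions.

```python
from itertools import permutations, product

def truth_table(func, n):
    """Truth table of an n-input Boolean function as a tuple of 0/1.
    Entry i holds func applied to the bits of i (bit 0 is the first input)."""
    return tuple(
        int(bool(func(*[(i >> k) & 1 for k in range(n)])))
        for i in range(2 ** n)
    )

def npn_canonical(tt, n):
    """Smallest truth table reachable by input permutation, input negation
    and output negation (exhaustive search over all transformations)."""
    best = None
    for perm in permutations(range(n)):
        for neg in product((0, 1), repeat=n):
            for out_neg in (0, 1):
                cand = []
                for i in range(2 ** n):
                    bits = [(i >> k) & 1 for k in range(n)]
                    # Input k of the original function reads variable perm[k],
                    # optionally negated.
                    src = 0
                    for k in range(n):
                        src |= (bits[perm[k]] ^ neg[k]) << k
                    cand.append(tt[src] ^ out_neg)
                cand = tuple(cand)
                if best is None or cand < best:
                    best = cand
    return best

# The two example functions from the slide.
f = truth_table(lambda a, b, c: a or (b and c), 3)        # a + bc
g = truth_table(lambda a, b, c: (not b) and (a or c), 3)  # b'a + b'c
and3 = truth_table(lambda a, b, c: a and b and c, 3)      # abc

same_class = npn_canonical(f, 3) == npn_canonical(g, 3)
different_class = npn_canonical(f, 3) != npn_canonical(and3, 3)
```

Two functions fall in the same NPN class exactly when their canonical tables match, so `same_class` checks the slide's example pair and `different_class` contrasts it with the 3-input AND.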
  [Figure: fanout cone of ab'c'+a'bc' in the UND; remaining node labels: a / 1 / 25%, -0- / 0 / xx%]

Recap: Overall Flow for Macro-Gate Design
  1. Map with LUT-N
  2. Extract logic functions (truth tables such as 0000001000000000, 0000010000000000, 0000100000000000, ...)
  3. Generate the Utilization NPN Diagram
  4. Calculate scores for the logic functions
  5. Rank the logic functions
  Best function: ab'c'+a'bc'
  [Figure: example netlist with gates and2(3), nand2(2), inv(1) mapped to LUTs, and the resulting UND; score calculations shown in the figure: 1 + 1*2/3 + 1*1/3 = 2; 1 + 1*1/3 = 1.33; 1*1/2 = 0.5; 1 + 1*1/2 = 1.5; 1]

Proposed Macro-Gates and FPGA Architecture
  - For the IWLS'05 benchmarks, the following four 6-input functions have the highest ranks:
      GI1 = a b c d e f   (AND-6)
      GI2 = a' b' c' + b c f' + b c' d' + b' c e   (MUX-4)
      GI3 = a b' c d' e + b c e f + d e f
      GI4 = a b' + a' c d' + b' c' + e' + f'
  - The macro-gate can implement over 50% of the logic functions in the IWLS'05 benchmarks
  [Figure: architecture of the proposed macro-gate and FPGA SLICE]

Outline
  - Design of the Embedded Macro-gates
  - Synthesis for the Proposed FPGA Architecture
      Technology Mapping for Heterogeneous FPGAs
      SAT-based Packing
      Place and Routing
  - Comparison of Heterogeneous FPGA Architectures
  - Conclusions and Future Work

Functional & Structural Cut Enumeration
  - Phase 1: Enumerate and label cuts from PIs to POs
      Check the feasibility of each cut w.r.t. the macro-gate
  - Phase 2: Select the best choice from POs to PIs
  - A general yet efficient solution is SAT-based Boolean matching:
      "Exploiting Symmetry in SAT-Based Boolean Matching for Heterogeneous FPGA Technology Mapping", Session 5C.1, ICCAD 07
  [Figure: example with inputs w, x, y, z; a = (x+y)', b = y + wz, d = ab = (x+y)'(y+wz) = x'y'wz; the 4-input macro-gate library is a list of truth tables (0000001000000000, 0000010000000000, ...); the query "Is x'y'wz in the library?" is answered Yes]

Key in Technology Mapping: Balance Resource Utilization
  - An asymmetric architecture causes problems for resource utilization
  - Exclusive use of one logic resource leaves much of the fabric unused
  - Simple yet effective solution: change the LUT-MG ratio by adjusting the area weights
  - Precise calibration is hard to reach with this approach
  [Figure: LUT6# and MG# (y-axis 0 to 6000) versus LUT-MG ratio = LUT#/MG# (x-axis 1:1, 1:0.95, 1:0.9, 1:0.8, 1:0.5, 1:0.1); objective architecture LUT6:MacroGate6 = 1:1; annotations: "Total# too large!", "Hard to obtain precise calibration", "Best LUT-MG ratio = 1:1"]

Post-Mapping Area Recovery (motivation example)
  Given
    - Target architecture = LUT6 + MG6
    - LUT-MG ratio in target architecture = 1:1
    - LUT# < MG# in the mapped design
    - Intrinsic delay (LUT6 : MG6) = 5:4
  Objective
    - Balance the LUT and MG counts without increasing delay
  [Figure, step 1: mapped netlist from PI to PO with one LUT6 and six MG6; node labels 5/5, 9/13, 9/9, 13/13, 4/5, 8/9, 17/17]
  [Figure, step 2: one MG6 is replaced by a LUT6 and its label changes from 9/13 to 10/13; the output label stays at 17/17]
  [Figure, step 3: two more MG6 are replaced by LUT6; labels change to 5/5, 10/9, 14/13 and 18/17. Timing target violation!]
  - Timing slack budgeting is necessary!
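The motivation example above can be mimicked with a small arrival-time check. This is only a sketch: the five-gate netlist below is hypothetical (it is not the netlist in the slide figure), and the delays of 5 and 4 units are an assumption taken from the stated 5:4 intrinsic delay ratio. The helper names `arrival_times`, `meets_target`, and `swap_to_lut` are likewise illustrative.

```python
# Assumed intrinsic delays, taken from the 5:4 ratio on the slide.
DELAY = {"LUT6": 5, "MG6": 4}

# Hypothetical netlist: node -> fanin nodes, listed in topological order.
# Nodes with no fanins are driven directly by primary inputs.
NETLIST = {
    "a": [],
    "b": [],
    "c": ["a"],
    "d": ["b"],
    "e": ["c", "d"],   # drives the primary output
}
KINDS = {"a": "LUT6", "b": "MG6", "c": "MG6", "d": "MG6", "e": "MG6"}

def arrival_times(netlist, kinds):
    """Latest arrival time at the output of every node."""
    arrival = {}
    for node, fanins in netlist.items():
        start = max((arrival[f] for f in fanins), default=0)
        arrival[node] = start + DELAY[kinds[node]]
    return arrival

def meets_target(netlist, kinds, target):
    """True if no node arrives later than the clock period target."""
    return max(arrival_times(netlist, kinds).values()) <= target

def swap_to_lut(kinds, nodes):
    """Copy of the gate assignment with the given nodes turned into LUT6."""
    swapped = dict(kinds)
    for node in nodes:
        swapped[node] = "LUT6"
    return swapped

TARGET = arrival_times(NETLIST, KINDS)["e"]                               # 13
one_swap_ok = meets_target(NETLIST, swap_to_lut(KINDS, ["b"]), TARGET)
two_swaps_ok = meets_target(NETLIST, swap_to_lut(KINDS, ["b", "d"]), TARGET)
```

In this toy netlist a single swap on the non-critical path is absorbed by the available slack, while two swaps on the same path use it up and violate the target. That is why swaps cannot be accepted one node at a time and the slack has to be budgeted across the whole path.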
Post-Mapping Area Recovery by Timing Budgeting
  - Formulated as an Integer Linear Programming (ILP) problem
  - Objective (minimize the gap between target and actual LUT-MG ratios): min |m2 + ... + m7 - 7/2|
  - Arrival time constraints: ai + dj + bj <= aj
  - Clock period target: ai <= 17
  - LUT assignment with given timing slack: (5-4)*mj <= bj, mj ∈ {0,1}
  - Easy to generalize to architectures with multiple macro-gates that have different numbers of input pins
  [Figure: netlist from PI to PO with one LUT6 and six MG6, annotated with arrival times a1 to a7]

Outline
  - Design of the Embedded Macro-gates
  - Synthesis for the Proposed FPGA Architecture
      Technology Mapping for Heterogeneous FPGAs
      SAT-based Packing
  - Comparison of Heterogeneous FPGA Architectures
  - Conclusions and Future Work

SAT-Based Packing
  Motivation
    - Traditional packing tools, e.g., T-VPack, hard-code the architecture specification of a SLICE
    - They must be re-implemented from scratch when the architecture changes
    - Propose a unified implementation of packers for different architectures: easy to perform architecture exploration!
  The architecture-dependent sub-problem in packing
    - Structural feasibility checking of a sub-circuit against the SLICE
  Solution
    - Treat the validation of SLICE packing as a local place-and-route problem
    - A SAT solver carries out the validation check

Example of SAT-Based SLICE Packing
  Examples of constraints (one for each class of constraint):
    - Placement and routing choice variables: X@A, X@B, U5@N10
    - Exclusivity constraint: (¬X@A) ∨ (¬X@B)
    - Presence constraint: (X@A) ∨ (X@B)
    - Input/output constraint: X@A → U5@N10
    - Routing constraint: (G0→out ∧ U5@N10) → U5@N12

Recap: Overall Synthesis Flow
  [Figure: flow chart with the stages area weight setting, cut-based mapping, an "Area-balance trade-off?" decision (Y/N), post-mapping area recovery, and packing, illustrated on an example netlist mapped to LUT6 and MG6]

Outline
  - Motivation and Objectives
  - Methodology for Logic Function Exploration
  - Technology Mapping for Heterogeneous FPGAs
  - Evaluation of Heterogeneous FPGA Architectures
  - Conclusions and Future Work

Experimental Setting
  - Design library parameters [Cong, TODAES'05]
  - Benchmark set: IWLS 2005
  - Four architectures are compared: LUT4, LUT4 + macro-gate, LUT6, and LUT6 + macro-gate
  - The proposed macro-gate is synthesized with SIS 1.2
  - Delay and area model: interconnect delay is ignored

Delay Comparisons
  - Compared to LUT4, LUT4+MG reduces both logic depth and delay by 9.2%.
  - Compared to LUT6, LUT6+MG reduces delay by 30% while increasing logic depth by 36.5%.
  - A LUT6 can implement more logic functions than a macro-gate
  [Figure: bar charts of delay and logic depth for LUT4, LUT4+MG, LUT6, and LUT6+MG; bar values appearing in the charts: 7.14, 7.86, 9.14, 5.48, 7.48, 7.14, 7.86, 10.95]

Logic Area Comparisons
  - Compared to LUT4, LUT4+MG reduces logic area by 12.5%.
  - Compared to LUT6, LUT6+MG reduces logic area by 16.9%.
  [Figure: bar charts of Area and PLB# for LUT4, LUT4+MG, LUT6, and LUT6+MG; bar values in the 0 to 7000 panel: 6406, 3711, 2985, 2142; bar values in the 0 to 18000 panel: 16816, 11849, 7346, 6408]

Outline
  - Motivation and Objectives
  - Methodology for Logic Function Exploration
  - Technology Mapping for Heterogeneous FPGAs
  - Comparison of Heterogeneous FPGA Architectures
  - Conclusions and Future Work

Conclusions
  - A novel FPGA architecture with mixed LUTs and macro-gates is proposed
  - A synthesis flow for the proposed architecture is implemented
  - Preliminary experimental results show that the proposed architecture is effective in reducing area and delay
  Future Work
    - Perform physical design for the synthesized circuits, compare the routing costs, and evaluate the architecture with interconnect delay taken into account
    - Study how effective the proposed architecture is for power reduction
    - Examine macro-gates with wider inputs
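As a rough illustration of the "Example of SAT-Based SLICE Packing" slide, the constraint classes can be written as CNF clauses and checked by enumeration. This is a sketch under assumptions, not the authors' packer: a real flow would hand the clauses to a SAT solver, whereas the code below simply tries every assignment. The presence constraint is encoded here as "X is placed at A or at B", which is an assumed reading of that constraint class, and the variable `G0->out` and all helper names are illustrative.

```python
from itertools import product

# A literal is (variable_name, polarity); a clause is a list of literals.
VARS = ["X@A", "X@B", "U5@N10", "U5@N12", "G0->out"]

CLAUSES = [
    # Exclusivity: X cannot be placed at both A and B.
    [("X@A", False), ("X@B", False)],
    # Presence (assumed reading): X must be placed at A or at B.
    [("X@A", True), ("X@B", True)],
    # Input/output: X@A -> U5@N10, written as (not X@A) or U5@N10.
    [("X@A", False), ("U5@N10", True)],
    # Routing: (G0->out and U5@N10) -> U5@N12.
    [("G0->out", False), ("U5@N10", False), ("U5@N12", True)],
]

def satisfied(clauses, assignment):
    """True if every clause has at least one literal that holds."""
    return all(
        any(assignment[var] == polarity for var, polarity in clause)
        for clause in clauses
    )

def all_models(variables, clauses):
    """Enumerate every satisfying assignment (exponential, toy sizes only)."""
    models = []
    for values in product((False, True), repeat=len(variables)):
        assignment = dict(zip(variables, values))
        if satisfied(clauses, assignment):
            models.append(assignment)
    return models

def is_packable(variables, clauses):
    """A packing is valid if the constraint set has at least one model."""
    return bool(all_models(variables, clauses))

feasible = is_packable(VARS, CLAUSES)
# Forcing X into both sites contradicts the exclusivity clause.
conflict = is_packable(VARS, CLAUSES + [[("X@A", True)], [("X@B", True)]])
```

Because the architecture enters only through the clause list, changing the SLICE would mean changing `CLAUSES` rather than the checking code, which is the point the motivation slide makes about architecture exploration.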