NanoMap: An Integrated Design Optimization Flow for a Hybrid Nanotube/CMOS Dynamically Reconfigurable Architecture
Wei Zhang†, Li Shang‡ and Niraj K. Jha†
†Dept. of Electrical Engineering, Princeton University
‡Dept. of Electrical and Computer Engineering, Queen's University

Outline
  Temporal Logic Folding
  Background on NRAMs
  Overview of the hybrid NAnoTUbe/CMOS REconfigurable architecture (NATURE) (DAC 2006)
  NanoMap: Design Optimization Flow
  Experimental Results
  Conclusions
  [Figure: input design, mapped by NanoMap onto NATURE]

Temporal Logic Folding
  Basic idea: use run-time reconfiguration to realize different functions in the same resource every few cycles
  [Figure: a circuit of three LUTs (LUT1 with inputs a, b, c; LUT2 with inputs i, e, f, h; LUT3 with inputs d, g, l; output OUT) folded onto a single LUT with a memory (MEM) that configures it as LUT1, LUT2 and LUT3 in turn]
  i = abc'
  l = (i' + e' + f')h'
  OUT = d'g' + l

Overview of NATURE
  CMOS fabrication compatible
  NRAM-based
  Run-time reconfiguration
  Temporal logic folding
  Distributed non-volatile nanotube RAMs (NRAMs): main storage for reconfiguration bits
  Fine-grain reconfiguration (even cycle-by-cycle) and logic folding
    Design flexibility
    Logic density
    Area-delay trade-off flexibility
  More than an order of magnitude increase in logic density
  More than an order of magnitude reduction in area-time product
  Comparisons assume NRAMs/CMOS logic implemented in the same technology
  Non-volatility: useful in low power & secure processing

Overview of NATURE (Contd.)
  Challenges in nano-circuits/architectures
    Many programmable nanofabrics proposed: Nanowire PLA (DeHon, 2004), CMOL (Strukov, 2005), etc.
    Lack of a mature fabrication process
    Fabrication defects and run-time failures (between 1% and 10%)
  Regular, reconfigurable architectures, such as an FPGA, favored
    Facilitates fabrication
    Fault tolerance through reconfiguration
  NATURE: fabricatable using a CMOS-compatible fabrication process

NRAMs
  [Figure: NRAM™ by Nantero. Source: http://www.nantero.com/nram.html]
  Non-volatile nanotube random-access memory (NRAM)
    Mechanically bent or not: determines bistable on/off states
    Same/opposite voltage applied to change the state
    CMOS-compatible fabrication process
    10 Gbit NRAMs already fabricated: ready to be commercialized in the near future

Properties of NRAMs
  Non-volatile
  Similar speed to SRAM
  Similar density to DRAM
  Chemically and mechanically stable
  NATURE not tied to NRAMs
    Phase change RAM
    Magnetoresistive RAM
    Ferroelectric RAM

Architecture of NATURE
  Island-style logic blocks (LBs) connected by various levels of interconnects
  An LB contains a super macroblock (SMB) and a local switch matrix
  [Figure: array of LBs with connection blocks, switch boxes and switch blocks. Interconnect types: direct link, length-1 wire, length-4 wire, long wire. S1: switch box between length-1 wires. S2: switch box between length-4 wires. Switch matrix: local routing network.]

Architecture of a Super Macroblock (SMB)
  n1 macroblocks (MBs) comprise an SMB: here n1 = 4
  [Figure: four MBs, each paired with an NRAM supplying reconfiguration bits to SRAM bits; MUXes select MB inputs from the switch matrix; MB outputs go to the interconnect; CLK and global signals are distributed to all MBs]

Architecture of a Macroblock (MB)
  n2 logic elements (LEs) comprise an MB: here n2 = 4
  [Figure: four LEs, each paired with an NRAM supplying reconfiguration bits; each LE is fed by a 13-to-5 crossbar configured by 65 SRAM bits; inputs to MB, outputs of MB, CLK and global signals]
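
The block hierarchy on the SMB and MB slides can be summarized with a little arithmetic. The sketch below is only an illustration: the class name, field names and helper functions are hypothetical, the flip-flop count per LE is taken from the experimental setup later in the deck, and the one-bit-per-crosspoint crossbar model is an assumption that happens to reproduce the 65 SRAM bits shown for the 13-to-5 crossbar.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class NatureInstance:
    """Hypothetical container for the block parameters named on the slides."""
    mbs_per_smb: int = 4   # n1 on the SMB slide
    les_per_mb: int = 4    # n2 on the MB slide
    ffs_per_le: int = 2    # assumption: value from the experimental setup

    @property
    def les_per_smb(self) -> int:
        # An SMB holds n1 MBs, each holding n2 LEs.
        return self.mbs_per_smb * self.les_per_mb

    @property
    def ffs_per_smb(self) -> int:
        return self.les_per_smb * self.ffs_per_le

    def smbs_needed(self, num_les: int) -> int:
        # Lower bound only: assumes every LE of every SMB can be used.
        return math.ceil(num_les / self.les_per_smb)


def crossbar_config_bits(num_inputs: int, num_outputs: int) -> int:
    # Assumption: one SRAM bit per crosspoint of a full crossbar.
    return num_inputs * num_outputs


inst = NatureInstance()
print(inst.les_per_smb)             # LEs in one SMB
print(inst.ffs_per_smb)             # flip-flops in one SMB
print(inst.smbs_needed(32))         # SMBs for a 32-LE budget (lower bound)
print(crossbar_config_bits(13, 5))  # bits for one 13-to-5 crossbar
```

With the slide values, one SMB provides 16 LEs, so a 32-LE area budget corresponds to at least two SMBs under the perfect-packing assumption.
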

Logic Element (Basic Configuration)
  An LE implements a computation and contains:
    An m-input look-up table (LUT)
    l flip-flops
  Input to each flip-flop selected between the LUT output and a primary input
  [Figure: LE with an m-input LUT, an SRAM cell, two DFFs and CLK]

Folding Levels
  Logic folding at different levels of granularity, providing flexibility to perform area-delay trade-offs
  Level-p folding: LE reconfiguration after the execution of p LUT computations
  Reconfiguration time: 160 ps
  Larger folding level: typically delay decreases and area increases
  [Figure: the same network of LUT nodes mapped with (a) level-1 folding and (b) level-2 folding, with a reconfiguration between consecutive folding stages]

Design Optimization Flow: NanoMap
  Optimize and implement design on NATURE
  Integrate temporal logic folding
    Choose a proper folding level
    Use force-directed scheduling (FDS) technique to balance resource usage across folding cycles
  Input design specified in register-transfer level (RTL) and/or gate-level VHDL

Motivational Example
  [Figure: RTL circuit with two 4-bit inputs, registers reg1, reg2 and reg3, an adder (+), a multiplier (×), LUT 1 to LUT 4, and s0 and s1. Labels: level 1 register, level 2 register, plane, plane cycle, folding stage, folding cycle, logic in plane.]
  Different planes should have the same number of folding stages to guarantee global synchronization
  Key issue: how to achieve the optimization objective
    Appropriate folding level
    Assign the logic to folding stages

Motivational Example (Contd.)
  [Figure: the example circuit annotated with resource counts. Adder (+): 8 LUTs, logic depth 4. Multiplier (×): 38 LUTs, logic depth 7. Whole plane: 50 LUTs, 14 flip-flops, plane depth 9.]
  Example optimization objective
    Minimize circuit delay under an area constraint of 32 LEs
    Assume each LE contains one LUT and two flip-flops: 32 LEs provide 32 LUTs and 64 flip-flops

Iterative Design Flow
  Start with initial guess for folding level and iteratively refine it
    Large folding level -> better circuit delay, but large area cost
  Initial #folding stages: ⌈50 / 32⌉ = 2
  Initial folding level: ⌈9 / 2⌉ = 5
  Partition RTL modules into a series of connected LUT clusters
    Logic depth at most equal to the folding level
    Significantly speeds up the mapping procedure

Iterative Design Flow (Contd.)
  Cluster size should be smaller than the area constraint
  [Figure: 4-bit array multiplier built from full adders (FA), with inputs a0 to a3 and b0 to b3 and product bits P0 to P7, partitioned into Cluster 1 and Cluster 2. Under level-5 folding a cluster needs 34 LUTs > 32 LUTs; the partition under level-4 folding is shown alongside.]

Solution for the Example
  [Flow chart: choose folding level; module partition; constraint satisfied? (No: decrease folding level); FDS to balance resource usage; constraint satisfied? (No: decrease folding level; Yes: solution)]
  [Figure: contents of folding cycles 1 to 3, covering the adder, LUT1-4, the multiplier clusters c1 and c2, and storage for reg1-3, s0 and s1. LE counts shown in the figure: 8 LEs, 4 LEs, 32 LEs, 6 LEs, 6 LEs.]
  Three folding stages using level-4 folding
  32 LEs required for mapping the RTL circuit; area constraint satisfied
  Circuit delay = 3 * folding cycle delay

NanoMap: Flow Diagram
  [Flow diagram, first part. Inputs: input network, optimization objective, user constraint, module library. Output: reconfiguration bits. Boxes: circuit parameter search; folding level computation; RTL module partition (logic mapping); decision "Perform logic folding?"; final placement using modified VPR placer; final routing using VPR router.]
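
The folding-level computation that starts the flow can be checked against the numbers of the motivational example (50 LUTs, plane depth 9, 32 LEs). The sketch below is a reading of the Iterative Design Flow slide, not the authors' code: the function names are hypothetical, and the use of ceilings is an assumption chosen because it reproduces the slide's values of 2 folding stages and folding level 5.

```python
import math


def initial_folding(num_luts: int, plane_depth: int, available_luts: int):
    """Initial guess: fewest folding stages that fit the area, then the level."""
    # Assumption: one LUT per LE, so the area constraint is a LUT budget.
    stages = math.ceil(num_luts / available_luts)
    level = math.ceil(plane_depth / stages)
    return stages, level


def folding_stages(plane_depth: int, level: int) -> int:
    """Number of folding stages needed to cover a plane at a given level."""
    return math.ceil(plane_depth / level)


def circuit_delay(stages: int, folding_cycle_delay: float) -> float:
    """Circuit delay as stated on the solution slide: stages * cycle delay."""
    return stages * folding_cycle_delay


stages, level = initial_folding(num_luts=50, plane_depth=9, available_luts=32)
print(stages, level)          # initial guess for the example
print(folding_stages(9, 4))   # stages after decreasing the level to 4
```

For the example this gives an initial guess of 2 stages at level 5; after the level is decreased to 4 the plane needs 3 folding stages, which matches the "Circuit delay = 3 * folding cycle delay" line of the solution.
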
  [Flow diagram, continued. Boxes: schedule each LUT/LUT cluster using FDS; temporal clustering; map each LUT/LUT cluster to SMBs; decision "Satisfy area constraints?"; temporal placement; fast placement using modified VPR placer; decision "Refine placement?"; decision "Placement routable?"; routing; delay estimation; decision "Satisfy delay constraints?".]

Force-Directed Scheduling
  Perform FDS on RTL modules partitioned into LUTs/LUT clusters
  Iteratively schedule each LUT/LUT cluster to minimize overall resource usage
  Model resource usage as a force: F = Kx
    K: distribution graphs (DGs) that describe the probability of resource usage
  Aim of FDS: minimize force, indicating minimum increase in resource usage
  LE usage depends on LUT computations and register storage operations: two DGs needed

Temporal Clustering
  For each folding stage, a constructive algorithm is used to assign LUTs to LEs and pack LEs into MBs and SMBs
  Unpacked LUT with a maximal number of inputs selected as initial seed
  New LUTs with high attractions to the seed selected and assigned to the SMB
  Attractions depend on timing criticality and input pin sharing
  Considers attractions across all the folding cycles
  [Figure: LUTs A to F assigned across folding cycle 1 and folding cycle 2]

Placement and Routing
  VPR (U. Toronto) modified to perform placement and support temporal logic folding
  Simulated annealing approach
  Cost function computed across the folding stages
  Routing using VPR router performed hierarchically, considering direct link, length-1, length-4 and global interconnects
  [Figure: LUTs C and D placed in SMB 1 and SMB 4 across folding cycle 1 and folding cycle 2]

Experimental Setup
  Instance of architecture:
    4 MBs in an SMB
    4 LEs in an MB
    LEs contain a 4-input LUT and 2 flip-flops
  Impact of fixing k at 16 vs. allowing a high enough k to show design trade-offs
  Results based on 100 nm technology parameters to implement CMOS logic and NRAMs

Experimental Results (Contd.)
  #LE × delay advantage for AT optimization
  Delay (ns) for AT optimization
  [Figure: bar charts, normalized to no-folding, for the benchmarks ASPP4, Paulin, Biquad, c5315, ex2, FIR and ex1. Series: no folding, k = 16, k enough.]

Experimental Results (Contd.)
  Improvement under AT optimization for RTL benchmarks

                             k enough    k = 16
  Reduction in #LEs          14.8X       9.2X
  Maximum AT improvement     16.2X       9.3X
  Average AT improvement     11.0X       7.8X
  Circuit delay increase     31.8%       19.4%

  LE utilization around 100%
  50% reduced need for a deep interconnect hierarchy for level-1 vs. no-folding: indicates trading interconnect area for NRAM area is advantageous

Experimental Results (Contd.)
  Flexibility in choosing the best folding level and performing area-delay trade-offs
  Mapping results for typical optimizations using Paulin benchmark as an example

  Typical optimizations
  Case      Opt. obj.    Area const. (#LEs)    Delay const. (ns)    Folding level
  Case 1    AT           No                    No                   1
  Case 2    Delay        No                    No                   No
  Case 3    Area         No                    27                   4
  Case 4    Delay        210                   No                   3

  [Figure: mapping results for cases 1 to 4, plotting delay (ns) and area (#LEs) on a logarithmic axis from 1 to 10000]

Conclusions
  NATURE: a new high-performance run-time reconfigurable architecture
  NanoMap: an integrated optimization design flow for NATURE
  Introduction of NRAMs into the architecture enables cycle-by-cycle reconfiguration and logic folding, leading to significant logic density and area-time product advantages
  Can be very useful for cost-conscious embedded systems and improvement of future FPGAs
  Non-volatility: helpful in secure and low power processing
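
As a rough sanity check on the area-time advantage, the reported LE reduction and delay increase can be combined and compared with the reported average AT improvement. This is only a sketch under stated assumptions: the function name is hypothetical, the area-time product is taken to be #LEs times delay (as the "#LE × delay" chart title suggests), and the reported reduction and delay figures are treated as averages. A product of averages need not equal the average of per-benchmark products, so only approximate agreement is expected.

```python
def implied_at_improvement(le_reduction: float, delay_increase_pct: float) -> float:
    """AT improvement implied by an LE reduction and a relative delay increase.

    Assumes AT product = #LEs * delay, so the improvement is the LE reduction
    divided by the delay growth factor.
    """
    return le_reduction / (1.0 + delay_increase_pct / 100.0)


# (label, reduction in #LEs, circuit delay increase in %, reported average AT improvement)
reported_rows = [
    ("k enough", 14.8, 31.8, 11.0),
    ("k = 16", 9.2, 19.4, 7.8),
]

for label, le_reduction, delay_pct, reported_avg in reported_rows:
    implied = implied_at_improvement(le_reduction, delay_pct)
    print(f"{label}: implied {implied:.1f}X, reported average {reported_avg}X")
```

Under these assumptions the implied values come out near 11.2X and 7.7X, close to the reported averages of 11.0X and 7.8X.
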