NanoMap: An Integrated Design Optimization Flow
for a Hybrid Nanotube/CMOS Dynamically
Reconfigurable Architecture
Wei Zhang†, Li Shang‡ and Niraj K. Jha†
Dept. of Electrical Engineering
Princeton University†
Dept. of Electrical and Computer Engineering
Queen’s University ‡
Outline
Temporal Logic Folding
Background on NRAMs
Overview of the hybrid NAnoTUbe/CMOS REconfigurable architecture (NATURE) (DAC 2006)
NanoMap: Design Optimization Flow
Experimental Results
Conclusions
[Figure: an input design is mapped by NanoMap onto NATURE]
Temporal Logic Folding

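Basic idea: Use run-time reconfiguration to realize different functions in the same resource every few cycles
[Figure: a three-LUT circuit (LUT1 with inputs a, b, c producing i; LUT2 with inputs i, e, f, h producing l; LUT3 with inputs d, g, l producing OUT) folded onto a single LUT plus memory (MEM), which is reconfigured as LUT1, LUT2 and LUT3 in successive cycles]
i = abc'
l = (i' + e' + f')h'
OUT = d'g' + l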
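The folding idea above can be illustrated with a small simulation. This is only a sketch under stated assumptions: the names `CONFIGS`, `run_folded` and `run_flat` are hypothetical, the dictionary `mem` stands in for the MEM block in the figure, and timing is ignored. It evaluates the three equations on one shared LUT that is reconfigured each cycle and checks the result against a flat three-LUT evaluation.

```python
from itertools import product

# One entry per folding cycle: (output name, input names, LUT function).
# Signals are 0/1 integers; the complement x' is written as (1 - x).
CONFIGS = [
    ("i",   ("a", "b", "c"),      lambda a, b, c: a & b & (1 - c)),
    ("l",   ("i", "e", "f", "h"), lambda i, e, f, h: (1 - (i & e & f)) & (1 - h)),
    ("OUT", ("d", "g", "l"),      lambda d, g, l: ((1 - d) & (1 - g)) | l),
]

def run_folded(inputs):
    """Evaluate the circuit on one shared LUT, reconfigured every cycle."""
    mem = dict(inputs)                   # primary inputs plus stored results
    for out, names, lut in CONFIGS:      # load the next configuration
        mem[out] = lut(*(mem[n] for n in names))
    return mem["OUT"]

def run_flat(a, b, c, d, e, f, g, h):
    """Reference: the same logic evaluated with three separate LUTs."""
    i = a and b and not c
    l = (not i or not e or not f) and not h
    return int((not d and not g) or l)

if __name__ == "__main__":
    names = "abcdefgh"
    ok = all(run_folded(dict(zip(names, bits))) == run_flat(*bits)
             for bits in product((0, 1), repeat=8))
    print("folded circuit matches flat circuit:", ok)
```

The folded version uses one LUT for three cycles instead of three LUTs for one cycle, which is the area-for-time exchange that the rest of the deck quantifies.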
Overview of NATURE
[Figure: NATURE at the center, linked to: CMOS fabrication compatible, NRAM-based, run-time reconfiguration, temporal logic folding, design flexibility, logic density]
Distributed non-volatile nanotube RAMs (NRAMs): main storage for reconfiguration bits
Fine-grain reconfiguration (even cycle-by-cycle) and logic folding
  Area-delay trade-off flexibility
  More than an order of magnitude increase in logic density
  More than an order of magnitude reduction in area-time product
  Comparisons assume NRAMs/CMOS logic implemented in the same technology
Non-volatility: useful in low power & secure processing
Overview of NATURE (Contd.)

Challenges in nano-circuits/architectures




Many programmable nanofabrics proposed: Nanowire PLA (DeHon, 2004), CMOL (Strukov, 2005), etc.
Lack of a mature fabrication process
Fabrication defects and run-time failures
(between 1% and 10%)
Regular, reconfigurable architectures,
such as an FPGA, favored



Facilitates fabrication
Fault tolerance through reconfiguration
NATURE: fabricatable using CMOS-compatible
fabrication process
NRAM™ by Nantero
Source: http://www.nantero.com/nram.html

Non-volatile nanotube random-access memory
(NRAM)




Mechanically bent or not: determines bistable
on/off states
Same/opposite voltage added to change the state
CMOS-compatible fabrication process
10 Gbit NRAMs already fabricated: ready to be
commercialized in the near future
NRAMs

Properties of NRAMs





Non-volatile
Similar speed to SRAM
Similar density to DRAM
Chemically and mechanically stable
NATURE not tied to NRAMs



Phase change RAM
Magnetoresistive RAM
Ferroelectric RAM
Architecture of NATURE
Island-style logic blocks (LBs) connected by various levels of interconnects
An LB contains a super macroblock (SMB) and a local switch matrix
[Figure: island-style array of LBs with connection blocks and switch blocks; interconnect types shown are direct links, length-1 wires, length-4 wires and long wires]
S1: Switch box between length-1 wires
S2: Switch box between length-4 wires
Switch matrix: Local routing network
Architecture of a Super Macroblock
(SMB)
n1 macroblocks (MBs) comprise an SMB: here n1 = 4
[Figure: SMB with four MBs, each paired with an NRAM that supplies reconfiguration bits; inputs from the switch matrix pass through blocks labeled "20 44X1 MUX" controlled by SRAM bits; CLK and global signals feed the MBs; MB outputs go to the interconnect]
Architecture of a Macroblock (MB)
n2 logic elements (LEs) comprise an MB: here n2 = 4
[Figure: MB with four LEs, each paired with an NRAM that supplies reconfiguration bits; inputs to the MB reach each LE through a 13-to-5 crossbar controlled by 65 SRAM bits; the MB has 8 outputs; CLK and global signals feed the LEs]
Logic Element (Basic Configuration)

An LE implements a computation and
contains:



An m-input look-up table (LUT)
l flip-flops
Input to flip-flop selected between LUT output
and a primary input
[Figure: LE with an m-input LUT, SRAM cells and two D flip-flops (DFFs) driven by CLK]
Folding Levels


Logic folding at different levels of granularity, providing
flexibility to perform area-delay trade-offs
Level-p folding: LE reconfiguration after the execution of p
LUT computations
Reconfiguration time: 160ps


A larger folding level typically decreases delay but increases area
[Figure: a network of LUT nodes (a0 through i0, with inputs x0-x3, y0-y3 and z0-z2) mapped with (a) level-1 folding and (b) level-2 folding; reconfiguration occurs between folding stages]
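The area-delay trade-off between folding levels can be made concrete with a rough estimate. The sketch below is a simplified model, not the deck's delay model: it assumes a uniform per-LUT delay (`lut_delay_ps`, a hypothetical parameter), ignores interconnect delay, and assumes LUTs are spread perfectly evenly over the folding stages. Only the 160 ps reconfiguration time comes from the slide.

```python
import math

RECONFIG_PS = 160  # reconfiguration time stated on the slide

def folding_estimate(logic_depth, num_luts, level, lut_delay_ps):
    """Return (folding stages, total delay in ps, LEs needed) for level-p folding.

    Assumes one LUT per LE and perfectly balanced stages (both assumptions).
    """
    stages = math.ceil(logic_depth / level)          # reconfigure every `level` LUT levels
    cycle_ps = level * lut_delay_ps + RECONFIG_PS    # one folding cycle
    delay_ps = stages * cycle_ps
    les = math.ceil(num_luts / stages)               # LEs are reused in every stage
    return stages, delay_ps, les

if __name__ == "__main__":
    # Hypothetical circuit: logic depth 8, 64 LUTs, 500 ps per LUT level.
    for level in (1, 2, 4, 8):
        print(level, folding_estimate(8, 64, level, 500))
```

With these made-up numbers the estimated delay falls from 5280 ps at level 1 to 4160 ps at level 8, while the LE estimate grows from 8 to 64.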
Design Optimization Flow: NanoMap


Optimize and implement design on
NATURE
Integrate temporal logic folding



Choose a proper folding level
Use force-directed scheduling (FDS) technique
to balance resource usage across folding cycles
Input design specified in register-transfer
level (RTL) and/or gate-level VHDL
Motivational Example
[Figure: example RTL circuit with two 4-bit inputs, registers reg1-reg3, an adder (+), a multiplier (×), LUT1-LUT4 and signals s0 and s1; register levels (L1, L2, L3) separate the planes. Annotated terms: plane, logic in plane, plane cycle, folding stage, folding cycle]
Different planes should have the same number of folding stages to guarantee global synchronization
Key issue: how to achieve the optimization objective
  Appropriate folding level
  Assign the logic to folding stages
Motivational Example (Contd.)
[Figure: the example circuit annotated with resource counts]
Adder (+): 8 LUTs, logic depth 4
Multiplier (×): 38 LUTs, logic depth 7
Plane: 50 LUTs, 14 flip-flops, plane depth 9
Example optimization objective
  Minimize circuit delay under an area constraint of 32 LEs
  Assume each LE contains one LUT and two flip-flops: 32 LEs provide 32 LUTs and 64 flip-flops
Iterative Design Flow

Start with initial guess for folding level and iteratively refine it
  Large folding level -> better circuit delay, but large area cost
  Initial #folding stages: ⌈50/32⌉ = 2
  Initial folding level: ⌈9/2⌉ = 5
Partition RTL modules into a series of connected LUT clusters
  Logic depth at most equal to the folding level
  Significantly speeds up the mapping procedure
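The initial guess can be reproduced with two ceiling divisions. This is a minimal sketch using the example's numbers (50 LUTs, 32 LEs, plane depth 9); the function names and the `luts_per_le` parameter are hypothetical, and the refinement loop itself is not modeled.

```python
import math

def initial_folding(num_luts, available_les, plane_depth, luts_per_le=1):
    """Return (initial #folding stages, initial folding level)."""
    # Fewest stages such that each stage's share of LUTs can fit the area budget.
    stages = math.ceil(num_luts / (available_les * luts_per_le))
    # Largest number of LUT levels per stage that still covers the plane depth.
    level = math.ceil(plane_depth / stages)
    return stages, level

def stages_for_level(plane_depth, level):
    """Number of folding stages once a folding level has been fixed."""
    return math.ceil(plane_depth / level)

if __name__ == "__main__":
    print(initial_folding(50, 32, 9))   # example from the slides
    print(stages_for_level(9, 4))       # after lowering the level to 4
```

For the example this gives two stages at level 5 as the starting point; lowering the level to 4, as the deck's worked solution does, yields three folding stages.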
Iterative Design Flow (Contd.)
Cluster size should be smaller than the area constraint
[Figure: 4-bit multiplier built from full adders (FAs), with inputs a0-a3 and b0-b3 and outputs P0-P7, partitioned into Cluster 1 and Cluster 2. Under level-5 folding a cluster needs 34 LUTs, which exceeds the 32-LUT limit; the partition is repeated under level-4 folding]
Solution for the Example
[Figure: flowchart. Choose folding level -> Module partition -> Constraint satisfied? (No: decrease folding level) -> FDS to balance resource usage -> Constraint satisfied? (No: decrease folding level; Yes: done)]
[Figure: resulting schedule over folding cycles 1-3, showing the adder, LUT1-4 and the two multiplier clusters (mul: c1, mul: c2), together with storage for reg1-3, s0, s1 and intermediate results, annotated with LE counts]
Solution

Three folding stages using level-4 folding
32 LEs required for mapping the RTL
circuit; area constraint satisfied
Circuit delay = 3 * folding cycle delay
NanoMap: Flow Diagram
[Figure: NanoMap flow diagram with 16 numbered steps]
Inputs: input network, optimization objective, user constraint, module library
Phases: logic mapping, temporal clustering, temporal placement, routing
Steps shown: circuit parameter search; folding level computation; RTL module partition; schedule each LUT/LUT cluster using FDS; map each LUT/LUT cluster to SMBs; fast placement using modified VPR placer; delay estimation; final placement using modified VPR placer; final routing using VPR router
Decisions shown: Perform logic folding? Satisfy area constraints? Placement routable? Refine placement? Satisfy delay constraints?
Output: reconfiguration bits
Force-Directed Scheduling



Perform FDS on RTL modules partitioned
into LUTs/LUT clusters
Iteratively schedule LUT/(LUT cluster) to
minimize overall resource usage
Model resource usage as a force: F = Kx



K: distribution graphs (DGs) that describe the
probability of resource usage
Aim of FDS: minimize force, indicating
minimum increase in resource usage
LE usage depends on LUT computations
and register storage operations:
two DGs needed
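The distribution-graph and force computation can be sketched as follows. This is a deliberately reduced version of classical force-directed scheduling, not NanoMap's scheduler: it treats the LUT clusters as independent (no precedence edges, so no predecessor or successor forces), and it builds a single DG for LUT computation rather than the two DGs mentioned above. All names are hypothetical.

```python
from collections import defaultdict

def distribution_graph(frames):
    """frames: {op: (earliest, latest)} inclusive folding-cycle time frames.

    Each op contributes probability 1/width to every cycle in its frame.
    """
    dg = defaultdict(float)
    for first, last in frames.values():
        width = last - first + 1
        for t in range(first, last + 1):
            dg[t] += 1.0 / width
    return dg

def self_force(op, t, frames, dg):
    """Force F = sum(K * x) of fixing `op` in cycle t.

    K is the DG value in a cycle; x is the change in the op's probability there.
    """
    first, last = frames[op]
    width = last - first + 1
    return sum(dg[s] * ((1.0 if s == t else 0.0) - 1.0 / width)
               for s in range(first, last + 1))

def schedule(frames):
    """Repeatedly fix the (op, cycle) pair with the smallest force."""
    frames = dict(frames)
    while True:
        dg = distribution_graph(frames)
        moves = [(self_force(op, t, frames, dg), op, t)
                 for op, (a, b) in frames.items() if a < b
                 for t in range(a, b + 1)]
        if not moves:
            break
        _, op, t = min(moves)
        frames[op] = (t, t)
    return {op: a for op, (a, _) in frames.items()}

if __name__ == "__main__":
    # A and B are fixed in cycle 1; C may run in cycle 1 or 2.
    print(schedule({"A": (1, 1), "B": (1, 1), "C": (1, 2)}))
```

In the example, placing C in the already crowded cycle 1 has a positive force and placing it in cycle 2 a negative one, so C is scheduled in cycle 2 and peak LE usage stays at two.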
Temporal Clustering
For each folding stage, a constructive algorithm is used to assign LUTs to LEs and pack LEs into MBs and SMBs
Unpacked LUT with a maximal number of inputs selected as initial seed
New LUTs with high attractions to the seed selected and assigned to the SMB
Attractions depend on timing criticality and input pin sharing
Considers attractions across all the folding cycles
[Figure: LUTs A-F packed into SMBs across folding cycle 1 and folding cycle 2]
Placement and Routing
VPR (U. Toronto) modified to perform placement and support temporal logic folding
  Simulated annealing approach
  Cost function computed across the folding stages
Routing using VPR router performed hierarchically, considering direct link, length-1, length-4 and global interconnects
[Figure: LUTs C and D placed in SMBs (SMB 1, SMB 4) across folding cycle 1 and folding cycle 2]
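The seed-and-attraction packing described under Temporal Clustering can be sketched as a greedy loop. This is an illustration under simplifying assumptions, not the deck's algorithm: attraction is reduced to the number of shared input signals (timing criticality is ignored), only one folding stage is considered, and the MB level of the hierarchy is skipped. The function name and the `capacity` parameter are hypothetical.

```python
def pack_stage(luts, capacity):
    """Greedily pack the LUTs of one folding stage into SMBs.

    luts: {lut name: set of input signal names}
    capacity: maximum number of LUTs per SMB
    """
    unpacked = dict(luts)
    smbs = []
    while unpacked:
        # Seed: the unpacked LUT with the most inputs (ties broken by name).
        seed = max(sorted(unpacked), key=lambda n: len(unpacked[n]))
        cluster = [seed]
        signals = set(unpacked.pop(seed))
        while unpacked and len(cluster) < capacity:
            # Attraction: inputs shared with the signals already in the SMB.
            best = max(sorted(unpacked), key=lambda n: len(unpacked[n] & signals))
            signals |= unpacked.pop(best)
            cluster.append(best)
        smbs.append(cluster)
    return smbs

if __name__ == "__main__":
    stage = {
        "A": {"x0", "x1", "x2", "x3"},
        "B": {"x0", "x1"},
        "C": {"y0", "y1"},
        "D": {"y0", "x3"},
    }
    print(pack_stage(stage, capacity=2))
```

In the example, A is the seed because it has the most inputs, B joins it because it shares two of A's inputs, and C and D end up together in a second SMB.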
Experimental Setup

Instance of architecture:





4 MBs in an SMB
4 LEs in an MB
LEs contain a 4-input LUT and 2 flip-flops
Impact of fixing k (the number of reconfiguration sets stored in each NRAM) at 16 vs. allowing a high enough k to show design trade-offs
Results based on 100nm technology
parameters to implement CMOS logic
and NRAMs
Experimental Results (Contd.)
[Figure: two bar charts under AT optimization over the benchmarks ex1, FIR, ex2, c5315, Biquad, Paulin and ASPP4, comparing no folding, k = 16 and k enough. Panel titles: "#LE * Delay adv. for AT opt. (normalized to no-folding)" and "Delay (ns) for AT optimization (normalized to no-folding)"]
Experimental Results (Contd.)
Improvement under AT optimization for RTL benchmarks
            Reduction   Maximum AT    Average AT    Circuit delay
            in #LEs     improvement   improvement   increase
k enough    14.8X       16.2X         11.0X         31.8%
k = 16      9.2X        9.3X          7.8X          19.4%
LE utilization around 100%
About 50% less need for a deep interconnect hierarchy with level-1 folding than with no folding, indicating that trading interconnect area for NRAM area is advantageous
Experimental Results (Contd.)


Flexibility in choosing the best folding level and performing area-delay trade-offs
Mapping results for typical optimizations using Paulin benchmark as an example
Typical optimizations
Case     Opt. obj.   Area const. (#LEs)   Delay const. (ns)   Folding level
Case 1   AT          No                   No                  1
Case 2   Delay       No                   No                  No
Case 3   Area        No                   27                  4
Case 4   Delay       210                  No                  3
[Figure: "Mapping results for typical optimizations": delay (ns) and area (#LEs) for cases 1-4 on a logarithmic axis from 1 to 10000]
Conclusions





NATURE: A new high-performance run-time
reconfigurable architecture
NanoMap: an integrated optimization design flow
for NATURE
Introduction of NRAMs into the architecture
enables cycle-by-cycle reconfiguration and logic
folding: leading to significant logic density and
area-time product advantages
Can be very useful for cost-conscious embedded
systems and improvement of future FPGAs
Non-volatility: helpful in secure and low power
processing