NanoMap: An Integrated Design Optimization Flow for a Hybrid Nanotube/CMOS Dynamically Reconfigurable Architecture
Wei Zhang†, Li Shang‡ and Niraj K. Jha†
Dept. of Electrical Engineering, Princeton University†
Dept. of Electrical and Computer Engineering, Queen’s University‡
Outline
Temporal Logic Folding
Background on NRAMs
Overview for hybrid
Basic idea: Use run-time reconfiguration to realize different functions in the same resource every few cycles
Temporal Logic Folding
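[Figure: example circuit with inputs a, b, c, d, e, f, g, h and output OUT, built from three LUTs: LUT1 computes i = abc', LUT2 computes l = (i' + e' + f')h', LUT3 computes OUT = d'g' + l. The folded version executes LUT1, LUT2 and LUT3 in successive cycles on the same LUT, with intermediate results held in memory (MEM).]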
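The folding idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three functions are the ones legible in the figure (i = abc', l = (i' + e' + f')h', OUT = d'g' + l), while the `configs` list and the `mem` dictionary are hypothetical stand-ins for the reconfiguration bits and the MEM block, not the actual hardware interface.

```python
from itertools import product

def lut1(a, b, c):
    return a & b & (1 - c)                          # i = abc'

def lut2(i, e, f, h):
    return ((1 - i) | (1 - e) | (1 - f)) & (1 - h)  # l = (i' + e' + f')h'

def lut3(d, g, l):
    return ((1 - d) & (1 - g)) | l                  # OUT = d'g' + l

def unfolded(a, b, c, d, e, f, g, h):
    """Spatial mapping: three physical LUTs, evaluated combinationally."""
    i = lut1(a, b, c)
    l = lut2(i, e, f, h)
    return lut3(d, g, l)

def folded(a, b, c, d, e, f, g, h):
    """Temporal mapping: one physical LUT, reconfigured once per cycle."""
    inputs = dict(a=a, b=b, c=c, d=d, e=e, f=f, g=g, h=h)
    # Hypothetical configuration store: (function, operand names, result name).
    configs = [
        (lut1, ("a", "b", "c"), "i"),
        (lut2, ("i", "e", "f", "h"), "l"),
        (lut3, ("d", "g", "l"), "OUT"),
    ]
    mem = {}  # stands in for the MEM block holding intermediate results
    for func, operands, result in configs:  # one folding cycle per entry
        values = [mem[n] if n in mem else inputs[n] for n in operands]
        mem[result] = func(*values)
    return mem["OUT"]

# Both mappings agree on every input combination.
assert all(unfolded(*v) == folded(*v) for v in product((0, 1), repeat=8))
```

The folded version takes three cycles but needs only one LUT, which is the area-for-time trade that the rest of the talk exploits.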
NATURE
CMOS-fabrication compatible
NRAM-based
Run-time reconfiguration
Temporal logic folding
Design flexibility
Logic density
Overview of NATURE
Distributed non-volatile nanotube RAMs (NRAMs): main storage for reconfiguration bits
Fine-grain reconfiguration (even cycle-by-cycle) and logic folding
Area-delay trade-off flexibility
More than an order of magnitude increase in logic density
More than an order of magnitude reduction in area-time product
Comparisons assume NRAMs and CMOS logic implemented in the same technology
Non-volatility: useful in low-power and secure processing
Overview of NATURE (Contd.)
Challenges in nano-circuits/architectures
Many programmable nanofabrics proposed: nanowire PLA (DeHon, 2004), CMOL (Strukov, 2005), etc.
Lack of a mature fabrication process
Fabrication defects and run-time failures (between 1% and 10%)
Regular, reconfigurable architectures, such as an FPGA, favored
Facilitates fabrication
Fault tolerance through reconfiguration
NATURE: fabricatable using a CMOS-compatible fabrication process
Source: http://www.nantero.com/nram.html
Non-volatile nanotube random-access memory (NRAM)
Mechanically bent or not: determines bistable on/off states
Same/opposite voltage added to change the state
CMOS-compatible fabrication process
10 Gbit NRAMs already fabricated: ready to be commercialized in the near future
NRAM™ by Nantero
NRAMs
Properties of NRAMs
Non-volatile
Similar speed to SRAM
Similar density to DRAM
Chemically and mechanically stable
NATURE not tied to NRAMs: phase change RAM, magnetoresistive RAM or ferroelectric RAM could be used instead
[Figure: NATURE island-style layout. Labels: LB, SMB, switch matrix, connection block, switch block, direct link, length-1 wire, length-4 wire, long wire.]
S1: Switch box between length-1 wires
S2: Switch box between length-4 wires
Switch matrix: Local routing network
Island-style logic blocks (LBs) connected by various levels of interconnects
An LB contains a super macroblock (SMB) and a local switch matrix
Architecture of NATURE
Architecture of a Super Macroblock (SMB)
n1 macroblocks (MBs) comprise an SMB: here n1 = 4
[Figure: SMB containing four MBs, each with an NRAM and SRAM bits that supply reconfiguration bits. Other labels: inputs from switch matrix, CLK and global signals, MUXes, output to interconnect. Bus widths are not recoverable from the extraction.]
Architecture of a Macroblock (MB)
n2 logic elements (LEs) comprise an MB: here n2 = 4
[Figure: MB containing four LEs, each with a 13-to-5 crossbar, an NRAM and 65 SRAM bits. Other labels: inputs to MB, CLK and global signals, reconfiguration bits, 8 outputs of MB. Remaining bus widths are not recoverable from the extraction.]
Logic Element (Basic Configuration)
An LE implements a computation and contains:
An m-input look-up table (LUT)
l flip-flops
Input to each flip-flop selected between the LUT output and a primary input
[Figure: LE with an m-input LUT, DFFs, SRAM cells and CLK]
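A behavioral sketch of such an LE, assuming nothing beyond what the slide lists (an m-input LUT, l flip-flops, and a per-flip-flop choice between the LUT output and a primary input). The class name, the `"lut"`/`"primary"` selector strings and the most-significant-bit-first input ordering are illustrative assumptions, not part of NATURE.

```python
class LogicElement:
    """Behavioral model of one LE: an m-input LUT plus a set of flip-flops."""

    def __init__(self, m, num_ffs):
        self.m = m
        self.truth_table = [0] * (2 ** m)   # LUT configuration bits
        self.ff = [0] * num_ffs             # flip-flop state
        self.ff_select = ["lut"] * num_ffs  # input mux setting per flip-flop

    def configure(self, truth_table, ff_select):
        """Load a new configuration, as a reconfiguration step would."""
        if len(truth_table) != 2 ** self.m or len(ff_select) != len(self.ff):
            raise ValueError("configuration does not match LE dimensions")
        self.truth_table = list(truth_table)
        self.ff_select = list(ff_select)

    def lut_output(self, lut_inputs):
        if len(lut_inputs) != self.m:
            raise ValueError("expected %d LUT inputs" % self.m)
        index = 0
        for bit in lut_inputs:  # assumed ordering: first input is the MSB
            index = (index << 1) | (bit & 1)
        return self.truth_table[index]

    def clock(self, lut_inputs, primary_input):
        """One clock edge: each flip-flop latches its selected source."""
        out = self.lut_output(lut_inputs)
        for k, sel in enumerate(self.ff_select):
            self.ff[k] = out if sel == "lut" else primary_input
        return out


# Configure a 3-input LE as i = abc' (only input pattern 110 yields 1).
le = LogicElement(m=3, num_ffs=2)
le.configure([0, 0, 0, 0, 0, 0, 1, 0], ["lut", "primary"])
print(le.clock([1, 1, 0], primary_input=0), le.ff)
```

Calling `configure` again between clock edges is the software analogue of reconfiguring the LE for the next folding cycle.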
Folding Levels
Logic folding at different levels of granularity, providing flexibility to perform area-delay trade-offs
Level-p folding: LE reconfiguration after the execution of p LUT computations
Reconfiguration time: 160 ps
A larger folding level typically decreases delay and increases area
[Figure: (a) level-1 folding and (b) level-2 folding of the same network of LUT nodes, with a reconfiguration between successive folding stages]
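A rough first-order model of the level-p trade-off, given only as a sketch. The 160 ps reconfiguration time is from the slide; the per-LUT delay of 400 ps is a made-up placeholder, and the model ignores interconnect delay, flip-flop demand and imperfect balancing across stages, so its numbers are illustrative only.

```python
import math

RECONFIG_PS = 160  # reconfiguration time quoted on the slide

def folding_estimate(num_luts, depth, p, lut_delay_ps):
    """Estimate stages, delay and LE count for level-p folding."""
    stages = math.ceil(depth / p)              # folding stages needed
    cycle_ps = p * lut_delay_ps + RECONFIG_PS  # p LUT levels, then reconfigure
    return {
        "level": p,
        "stages": stages,
        "delay_ps": stages * cycle_ps,
        "les": math.ceil(num_luts / stages),   # assumes perfectly balanced stages
    }

# 50 LUTs with depth 9; lut_delay_ps=400 is a hypothetical value.
for p in (1, 2, 3, 5, 9):
    print(folding_estimate(50, 9, p, lut_delay_ps=400))
```

In this model a larger p pays the reconfiguration overhead fewer times but needs more LEs per stage. Because of the ceiling, the delay trend is not strictly monotonic in p, which is consistent with the slide's wording "typically".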
Design Optimization Flow: NanoMap
Optimize and implement design on NATURE
Integrate temporal logic folding
Choose a proper folding level
Use force-directed scheduling (FDS) technique to balance resource usage across folding cycles
Input design specified in register-transfer level (RTL) and/or gate-level VHDL
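To make the FDS step concrete, here is a deliberately simplified sketch. It is not NanoMap's scheduler: it uses self-forces only (no predecessor or successor forces) and balances LUT counts only. The function names, the `preds` dependency map and the `weight` map (LUTs per cluster) are all hypothetical.

```python
def time_frames(preds, num_cycles, fixed):
    """ASAP/ALAP folding cycle (0-based) per node; scheduled nodes are pinned."""
    succs = {n: [] for n in preds}
    for n, ps in preds.items():
        for p in ps:
            succs[p].append(n)
    order, seen = [], set()

    def visit(n):  # depth-first topological sort
        if n not in seen:
            seen.add(n)
            for p in preds[n]:
                visit(p)
            order.append(n)

    for n in preds:
        visit(n)
    asap, alap = {}, {}
    for n in order:
        asap[n] = fixed.get(n, max((asap[p] + 1 for p in preds[n]), default=0))
    for n in reversed(order):
        latest = min((alap[s] - 1 for s in succs[n]), default=num_cycles - 1)
        alap[n] = fixed.get(n, latest)
    return asap, alap

def distribution_graph(weight, asap, alap, num_cycles):
    """Expected LUT usage per cycle, spreading each node over its time frame."""
    dg = [0.0] * num_cycles
    for n, w in weight.items():
        span = alap[n] - asap[n] + 1
        for c in range(asap[n], alap[n] + 1):
            dg[c] += w / span
    return dg

def force_directed_schedule(preds, weight, num_cycles):
    fixed = {}
    while len(fixed) < len(preds):
        asap, alap = time_frames(preds, num_cycles, fixed)
        if any(asap[n] > alap[n] for n in preds):
            raise ValueError("not enough folding cycles for the dependencies")
        dg = distribution_graph(weight, asap, alap, num_cycles)
        best = None
        for n in preds:
            if n in fixed:
                continue
            frame = range(asap[n], alap[n] + 1)
            mean = sum(dg[c] for c in frame) / len(frame)
            for c in frame:
                force = weight[n] * (dg[c] - mean)  # self-force only
                if best is None or force < best[0]:
                    best = (force, n, c)
        fixed[best[1]] = best[2]  # commit the lowest-force assignment
    return fixed

# Four 4-LUT clusters; B depends on A; two folding cycles.
preds = {"A": [], "B": ["A"], "C": [], "D": []}
weight = {"A": 4, "B": 4, "C": 4, "D": 4}
print(force_directed_schedule(preds, weight, num_cycles=2))
```

The distribution graph estimates how crowded each folding cycle is, and each step commits the assignment that moves load toward the least crowded cycle, which is how the scheduler evens out LE usage.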
Motivational Example
Different planes should have same number of folding stages to guarantee global synchronization
Key issue: how to achieve the optimization objective
Choose an appropriate folding level
Assign the logic to folding stages
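[Figure: RTL example with registers reg1, reg2 and reg3, an adder (+), a multiplier (×), 4-bit datapaths, inputs 1 and 2, and control logic (LUT1 to LUT4, signals s0 and s1). Annotated terms: plane, logic in plane, plane cycle, folding stage, folding cycle, level-1 register, level-2 register.]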
Motivational Example (Contd.)
Example optimization objective
Minimize circuit delay under an area constraint of 32 LEs
Assume each LE contains one LUT and two flip-flops: 32 LEs provide 32 LUTs and 64 flip-flops
[Figure: the same RTL example. Annotations: 50 LUTs; 14 flip-flops; 8 LUTs, logic depth 4; 38 LUTs, logic depth 7; plane depth 9.]
Iterative Design Flow
Start with an initial guess for the folding level and iteratively refine it
A larger folding level gives better circuit delay, but at a larger area cost
Initial #folding stages: $\lceil 50/32 \rceil = 2$
Initial folding level: $\lceil 9/2 \rceil = 5$
Partition RTL modules into a series of connected LUT clusters
Logic depth of each cluster at most equal to the folding level
Significantly speeds up the mapping procedure
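The initial guess can be reproduced with a short calculation. The sketch assumes the numbers left over from the slide's formulas read as ceil(50/32) = 2 folding stages and ceil(9/2) = 5 for the folding level, using the example's 50 LUTs, 32-LE constraint and plane depth of 9. The function name is hypothetical.

```python
import math

def initial_folding(num_luts, available_les, plane_depth):
    """Initial number of folding stages and folding level (assumed formulas)."""
    stages = math.ceil(num_luts / available_les)  # stages needed to fit the LUTs
    level = math.ceil(plane_depth / stages)       # LUT levels per folding stage
    return stages, level

print(initial_folding(num_luts=50, available_les=32, plane_depth=9))  # (2, 5)
```

This is only a starting point. The flow then checks whether the partitioned clusters actually fit and lowers the folding level if they do not.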
Iterative Design Flow (Contd.)
Cluster size should be smaller than the area constraint
[Figure: 4-bit array multiplier (inputs a0 to a3 and b0 to b3, product bits P0 to P7, built from full adders, FA) partitioned into Cluster 1 and Cluster 2, shown for level-5 folding and for level-4 folding. One partition is annotated "34 LUTs > 32 LUTs", marking a cluster that exceeds the area constraint.]
Solution for the Example
Three folding stages using level-4 folding
32 LEs required for mapping the RTL
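How the refinement arrives at this solution can be sketched as a loop that lowers the folding level until the largest cluster fits. This is a minimal sketch, not NanoMap's procedure: the real flow also accounts for flip-flop usage and scheduling. The function names are hypothetical, and the 34-LUT cluster size is assumed to describe the level-5 partition.

```python
import math

def refine_folding_level(plane_depth, area_limit, start_level, largest_cluster_luts):
    """Lower the folding level until the largest LUT cluster fits the area limit."""
    for level in range(start_level, 0, -1):
        if largest_cluster_luts(level) <= area_limit:
            return level, math.ceil(plane_depth / level)
    raise ValueError("no folding level satisfies the area constraint")

def demo_cluster_luts(level):
    # 34 LUTs is the slide's annotation, assumed here to describe level-5 folding.
    # 32 is a stand-in meaning "fits in 32 LEs"; it is not a figure from the slides.
    return 34 if level >= 5 else 32

level, stages = refine_folding_level(9, 32, 5, demo_cluster_luts)
print(level, stages)  # 4 3
```

With a plane depth of 9, level-4 folding gives ceil(9/4) = 3 folding stages, matching the solution stated on the slide.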
Improvement under AT optimization for RTL benchmarks
LE utilization around 100%
50% reduced need for a deep interconnect hierarchy for level-1 folding vs. no-folding: indicates that trading interconnect area for NRAM area is advantageous
Experimental Results (Contd.)
Flexibility in choosing the best folding level and performing area-delay trade-offs
Mapping results for typical optimizations using Paulin benchmark as an example
        Opt. obj.   Area const. (#LEs)   Delay const. (ns)   Folding level
Case 1  AT          No                   No                  1
Case 2  Delay       No                   No                  No
Case 3  Area        No                   27                  4
Case 4  Delay       210                  No                  31
[Figure: Mapping results for typical optimizations. Delay (ns) and area (#LEs) on a logarithmic axis (10 to 10000) for cases 1 to 4.]
Conclusions
NATURE: A new high-performance run-time reconfigurable architecture
NanoMap: an integrated optimization design flow for NATURE
Introduction of NRAMs into the architecture enables cycle-by-cycle reconfiguration and logic folding, leading to significant logic density and area-time product advantages
Can be very useful for cost-conscious embedded systems and improvement of future FPGAs
Non-volatility: helpful in secure and low power processing