-
Architecture and Synthesis for Power-Efficient FPGAs
Jason Cong, University of California, Los Angeles
[email protected]
Partially supported by NSF Grants CCR-0096383 and CCR-0306682,
and by Altera under the California MICRO program
Joint work with Deming Chen, Lei He, Fei Li, Yan Lin
-
Outline
Introduction
Understanding Power Consumption in FPGAs
Architecture Evaluation and Power Optimization
Low Power Synthesis
Conclusions
-
Why? FPGA is Known to be Power Inefficient!
FPGA consumes 50-100X more power than ASIC
Why do we care about power optimization for FPGAs?!
Source: [Zuchowski, et al, ICCAD02]
-
FPGA Advantages
Short TAT (total turnaround time)
No or very low NRE
-
ASICs Become Increasingly Expensive
Traditional ASIC designs are facing rapid increase of NRE and
mask-set costs at 90nm and below
Source: EETimes
[Figure: total cost for a mask set ($M) and cost per mask ($K) versus process node (250nm, 180nm, 130nm, 100nm)]
Process (um)           2.0   …   0.8   0.6   0.35   0.25   0.18   0.13    0.10
Single Mask cost ($K)  1.5       1.5   2.5   4.5    7.5    12     40      60
# of Masks             12        12    12    16     20     26     30      34
Mask Set cost ($K)     18        18    30    72     150    312    1,000   2,000
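The mask-set row of the table appears to be the product of the two rows above it. A minimal Python sketch of that arithmetic follows; the variable and function names are illustrative and not from the talk. For the 0.13 um and 0.10 um nodes the products come out at 1,200 and 2,040 $K, so the 1,000 and 2,000 figures on the slide look like rounded values.

```python
# Per-node mask data copied from the table above (costs in $K).
process_um = [2.0, 0.8, 0.6, 0.35, 0.25, 0.18, 0.13, 0.10]
single_mask_cost_k = [1.5, 1.5, 2.5, 4.5, 7.5, 12, 40, 60]
num_masks = [12, 12, 12, 16, 20, 26, 30, 34]


def mask_set_cost_k(cost_per_mask_k, masks):
    """Mask-set cost in $K, taken as (cost per mask) x (number of masks)."""
    return cost_per_mask_k * masks


# One product per process node, in the same order as the table columns.
products = [mask_set_cost_k(c, n) for c, n in zip(single_mask_cost_k, num_masks)]

for node, total in zip(process_um, products):
    print(f"{node:>5} um: {total:8.1f} $K")
```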
-
Our Research
Power-Efficient FPGAs
Circuit Design
Fabric Design
System Design
Synthesis Tools
-
Outline
Introduction
Understanding Power Consumption in FPGAs
Architecture Evaluation and Power Optimization
Low Power Synthesis
Conclusions
-
FPGA Architecture
[Figure: FPGA with programmable IO, programmable logic, and programmable routing. Each logic block has I inputs, N outputs, a clock, and N BLEs (BLE #1 to BLE #N); each BLE contains a K-input LUT and a D flip-flop.]
-
Evaluation Framework – fpgaEva-LP
fpgaEva-LP [Li, et al, FPGA’03]
[Figure: flow chart. Design flow: BLIF, Logic Optimization (SIS), Tech-Mapping (RASP), Timing-Driven Packing (T-VPack), Placement & Routing (VPR), SLIF, with Arch Spec as input and Delay and Area as outputs. Power flow: BC-Netlist Generator, BC-Netlist, Power Simulator, Power.]
-
BC-Netlist Generator
Mapped Netlist Layout
Buffer Extraction
Netlist Generation for Logic Clusters
Capacitance Extraction
Delay Calculation
BC-Netlist
Back-annotation
-
Mixed-level Power Model – Overview
Dynamic power
  Switching power, short-circuit power
  Related to signal transitions: functional switch, glitch
Static power
  Sub-threshold leakage, gate leakage, reverse-biased leakage
  Depending on the input vector
Models by power source and component
  Power source   Logic block   Interconnect & clock
  Dynamic        Macro-model   Switch-level model
  Static         Macro-model   Macro-model
-
Cycle-Accurate Power Simulator
Inputs: BC-Netlist, mixed-level power model, post-layout extracted delay & capacitance
Flow: Random Vector Generation -> Cycle-Accurate Power Simulation with Glitch Analysis -> All cycles finished? (No: simulate the next cycle; Yes: report power values)
Energy per cycle:
$$E_{cycle} = \sum_{i \in active} E_a(n_i) + \sum_{j \in idle} E_s(n_j)$$
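The per-cycle accumulation can be sketched in Python as below. This is only an illustration under stated assumptions: E_a and E_s are read as the energy of a node when it is active and when it is idle in a cycle, the per-node energies are hypothetical lookup tables, and glitch analysis is not modeled. None of the names come from fpgaEva-LP.

```python
def cycle_energy(nodes, active, energy_active, energy_static):
    """Energy of one cycle: E_a for nodes that switch, E_s for nodes that stay idle."""
    total = 0.0
    for n in nodes:
        total += energy_active[n] if n in active else energy_static[n]
    return total


def average_power(active_sets, nodes, energy_active, energy_static, clock_period):
    """Average power over all simulated cycles (total energy / total time)."""
    energies = [cycle_energy(nodes, a, energy_active, energy_static)
                for a in active_sets]
    return sum(energies) / (len(energies) * clock_period)


# Hypothetical three-node circuit and per-node energies (arbitrary units).
nodes = ["a", "b", "c"]
energy_active = {"a": 4.0, "b": 2.0, "c": 1.0}
energy_static = {"a": 0.5, "b": 0.25, "c": 0.25}

# Two simulated cycles: a and b switch in the first, nothing switches in the second.
active_sets = [{"a", "b"}, set()]

first = cycle_energy(nodes, active_sets[0], energy_active, energy_static)
second = cycle_energy(nodes, active_sets[1], energy_active, energy_static)
power = average_power(active_sets, nodes, energy_active, energy_static,
                      clock_period=2.0)
print(first, second, power)
```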
-
Power Breakdown
Interconnect power is dominant
Cluster Size = 12, LUT Size = 4: Logic Block Power 19%, Interconnect Power 59%, Clock Power 22%
Cluster Size = 12, LUT Size = 6: Logic Block Power 40%, Interconnect Power 45%, Clock Power 15%
-
Power Breakdown (cont’d)
Leakage power becomes increasingly important (100nm)
Cluster Size = 12, LUT Size = 4: Leakage Power 42%, Dynamic Power 58%
Cluster Size = 12, LUT Size = 6: Dynamic Power 48%, Leakage Power 52%
-
Outline
Introduction
Understanding Power Consumption in FPGAs
Architecture Evaluation and Power Optimization
  Architecture Parameter Selection
  Dual-Vdd/Dual-Vt FPGA Architecture
Low Power Synthesis with Dual-Vdd
Conclusion
-
Total Power along LUT and Cluster Size Changes
[Figure: total FPGA power (normalized geometric mean) versus LUT size (3 to 7), one curve per cluster size (4, 6, 8, 10, 12)]
Routing architecture: segmented wire with length of 4, and 50% tri-state buffers in routing switches
-
Routing Architecture Evaluation
-
Architecture of Low-power and High-performance
Applications             Best FPGA architecture                                 Energy (E)   Delay (t)   E3t      Et3
High-performance (Et3)   Cluster size 12, LUT size 4, wire segment length 4,    1.0502       0.8865      1.0268   0.7865
                         100% buffered routing switches
Low-power (E3t)          Cluster size 10, LUT size 4, wire segment length 4,    0.9653       0.9904      0.8909   1.0080
                         25% buffered routing switches
Arch. parameter selection leads to 10% power/delay trade-off
Uniform FPGA fabrics provide limited power-performance tradeoff
Need to explore heterogeneous FPGA fabrics, e.g. dual-Vt and dual-Vdd fabrics
-
Outline
Introduction
Understanding Power Consumption in FPGAs
Architecture Evaluation and Power Optimization
  Architecture Parameter Selection
  Dual-Vdd/Dual-Vt FPGA Architecture [Li, et al, FPGA’04]
Low Power Synthesis with Dual-Vdd
Conclusion
-
Dual-Vdd LUT Design
Dual-Vdd technique makes use of the timing slack to reduce power
  VddH devices on the critical path -> performance
  VddL devices on non-critical paths -> power
  Assume uniform Vdd for one LUT
Threshold voltage Vt should be adjusted carefully for different Vdd levels
  To compensate for the delay increase
  To avoid an excessive leakage power increase
-
Vdd/Vt-Scaling for LUTs
Three scaling schemes
  Constant-Vt scaling
  Fixed-Vdd/Vt-ratio scaling
  Constant-leakage scaling
[Figure: delay (ns) versus Vdd (1.3v, 1.0v, 0.9v, 0.8v) for constant Vt, fixed Vdd/Vt ratio, and constant leakage]
[Figure: leakage power (uW) versus Vdd (1.3v, 1.0v, 0.9v, 0.8v) for constant Vt, fixed Vdd/Vt ratio, and constant leakage]
Constant-leakage scaling obtains a good tradeoff
  Useful for both single-Vdd scaling and dual-Vdd design
-
Dual-Vt LUT Design
LUT is divided into two parts
  Part I: configuration cells -> high Vt
  Part II: MUX tree and input buffers -> normal Vt (decided by constant-leakage Vdd-scaling)
Configuration SRAM cells
  Content remains unchanged after configuration
  Read/write delay is not related to FPGA performance
  Use high Vt, ~40% of Vdd
    Maintain signal integrity
    Reduce SRAM leakage by 15X and LUT leakage by 2.4X
    Increase configuration time by 13%
-
Pre-Defined Dual-Vt Fabric
Power saving
  11.6% for combinational circuits
  14.6% for sequential circuits

Table 1: Combinational circuits
  Circuit   arch-SVST (Single Vt)   arch-SVDT (Dual Vt)
            power (watt)            power saving
  alu4      0.0798                  8.5%
  apex2     0.108                   9.3%
  apex4     0.0536                  12.3%
  des       0.234                   10.7%
  ex1010    0.179                   17.3%
  ex5p      0.059                   11.6%
  misex3    0.0753                  9.4%
  pdc       0.256                   14.7%
  seq       0.0927                  9.4%
  spla      0.180                   12.4%
  Avg.                              11.6%

Table 2: Sequential circuits
  Circuit   arch-SVST (Single Vt)   arch-SVDT (Dual Vt)
            power (watt)            power saving
  bigkey    0.148                   12.3%
  clma      0.632                   14.8%
  diffeq    0.0391                  19.7%
  dsip      0.134                   14.5%
  elliptic  0.140                   16.3%
  frisc     0.190                   19.2%
  s298      0.0736                  13.4%
  s38417    0.307                   11.7%
  s38484    0.261                   10.2%
  tseng     0.0351                  14.0%
  Avg.                              14.6%
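The two "Avg." rows can be cross-checked against the per-circuit savings. The short Python sketch below assumes the averages are plain arithmetic means of the percentages listed in the two tables; the dictionary and function names are illustrative.

```python
# Power savings (%) of arch-SVDT over arch-SVST, copied from the two tables.
combinational = {
    "alu4": 8.5, "apex2": 9.3, "apex4": 12.3, "des": 10.7, "ex1010": 17.3,
    "ex5p": 11.6, "misex3": 9.4, "pdc": 14.7, "seq": 9.4, "spla": 12.4,
}
sequential = {
    "bigkey": 12.3, "clma": 14.8, "diffeq": 19.7, "dsip": 14.5, "elliptic": 16.3,
    "frisc": 19.2, "s298": 13.4, "s38417": 11.7, "s38484": 10.2, "tseng": 14.0,
}


def mean_saving(savings):
    """Arithmetic mean of the per-circuit power savings (in percent)."""
    return sum(savings.values()) / len(savings)


comb_avg = mean_saving(combinational)
seq_avg = mean_saving(sequential)
print(f"combinational: {comb_avg:.1f}%  sequential: {seq_avg:.1f}%")
```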
-
Dual-Vdd FPGA Fabric
Granularity: logic block (i.e., cluster of LUTs)
  Smaller granularity => intuitively more power saving
  But a larger implementation overhead
Layout pattern: pre-defined dual-Vdd pattern
  Row-based or interleaved pattern
  Ratio of VddL/VddH blocks is 2:1 (benchmark profiling)
Interconnect uses uniform VddH
[Figure legend: L-block: VddL; H-block: VddH]
-
Simple Design Flow for Dual-Vdd Fabric
Based on traditional design flow, but with new steps
  Step I: LUT mapping (FlowMap) + P & R assuming uniform VddH (using VPR)
  Step II: Dual-Vdd assignment based on sensitivity
  Step III: Timing-driven P & R considering pre-defined dual-Vdd pattern (modified VPR)
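Step II can be pictured with a small greedy sketch in Python. The slide does not define the sensitivity measure, so this sketch assumes it is the power saved per unit of added delay when a block moves from VddH to VddL; it also ignores level converters and the pre-defined layout pattern. All names and numbers are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def critical_delay(preds, delay):
    """Longest-path delay of a DAG; preds maps each block to its predecessors."""
    arrival = {}
    for v in TopologicalSorter(preds).static_order():
        latest_input = max((arrival[u] for u in preds.get(v, ())), default=0.0)
        arrival[v] = latest_input + delay[v]
    return max(arrival.values())


def assign_dual_vdd(preds, d_high, d_low, p_high, p_low, target_delay):
    """Greedily move blocks to VddL, most sensitive first, while timing holds."""
    level = {v: "H" for v in d_high}
    delay = dict(d_high)

    def sensitivity(v):
        # Assumed measure: power saved per unit of added delay (d_low > d_high).
        return (p_high[v] - p_low[v]) / (d_low[v] - d_high[v])

    for v in sorted(d_high, key=sensitivity, reverse=True):
        delay[v] = d_low[v]                 # tentatively lower the supply
        if critical_delay(preds, delay) <= target_delay:
            level[v] = "L"                  # timing still met: keep VddL
        else:
            delay[v] = d_high[v]            # timing violated: restore VddH
    return level


# Hypothetical example: blocks a and b both feed block c; a -> c is critical.
preds = {"a": [], "b": [], "c": ["a", "b"]}
d_high = {"a": 2.0, "b": 1.0, "c": 1.0}
d_low = {"a": 3.0, "b": 1.5, "c": 1.5}
p_high = {"a": 4.0, "b": 4.0, "c": 4.0}
p_low = {"a": 2.0, "b": 2.0, "c": 2.0}

target = critical_delay(preds, d_high)      # delay with uniform VddH
assignment = assign_dual_vdd(preds, d_high, d_low, p_high, p_low, target)
print(target, assignment)
```

In this example only block b has enough slack to move to VddL without lengthening the critical path.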
-
Comparison Between Vdd-Scaling and Dual-Vdd
For high clock frequency, dual Vdd achieves ~6% total power saving (~18% logic power saving)
For low clock frequency, single-Vdd scaling is better
Still a large gap between ideal dual-Vdd and real case
  Ideal dual-Vdd is the result without layout pattern constraint
[Figure: power (watt) versus max. clock frequency (MHz) for circuit alu4, comparing arch-SVDT (Vdd scaling), arch-DVDT (ideal case), and arch-DVDT (pre-defined Vdd); data points are labeled with their Vdd or VddH/VddL settings (e.g., 1.3v, 1.3v/0.8v)]
-
Vdd-Programmable Logic Block
Power switches for Vdd selection and power gating
One-bit control is needed for Vdd selection, but two-bit control is needed for power gating
-
Experimental Results with Vdd-Programmable Blocks
Power vs. performance
Circuit: alu4
[Figure: total power (watt) versus clock frequency (MHz) for arch-SV (Vdd scaling), arch-DV (configurable Vdd), arch-DV (ideal case), and arch-DV (pre-defined Vdd); data points are labeled with their Vdd or VddH/VddL settings (e.g., 1.3v, 1.3v/0.8v)]
-
Outline
Introduction
Understanding Power Consumption in FPGAs
Architecture Evaluation and Power Optimization
Low Power Synthesis
Conclusions
-
Low Power Synthesis for Dual Vdd FPGAs
FPGA architecture with dual-Vdds adds new layout constraints for synthesis tools
Novel synthesis tools are required to support the architecture
  Technology mapping [Chen, et al, FPGA’04]
  Circuit clustering [Chen, et al, ISLPED’04]
-
Technology Mapping for Low-Power FPGAs with Dual Vdds
Cut Enumeration:
  Topological order from PIs to POs.
[Figure: example network with nodes a, b, c, d, e, f, g, w, x, y, z. Cuts are annotated with delay and power (Delay 1, Power 1; Delay 2, Power 2; Delay 2, Power 2.5; Delay 2, Power 3.2; Delay 2, Power 3.5) and nodes with their best values (Optimal Delay = 1, Power = 1; Optimal Delay = 1, Power = 1.5; Optimal Delay = 2, Power = 2.5).]
Represents one case: the single high-Vdd case
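Cut enumeration in topological order can be sketched in Python as follows. This is a generic K-feasible cut enumeration, not the authors' implementation: the cuts of a node are the node itself plus every union of one cut from each fanin that has at most K leaves. The small network used here is hypothetical (the edges of the slide's example are not recoverable), and the delay and power annotations of each cut are left out.

```python
from itertools import product


def enumerate_cuts(fanins, order, k):
    """Return {node: set of cuts}; each cut is a frozenset of leaf nodes.

    fanins maps each internal node to its fanin list (primary inputs have
    none) and order lists the nodes in topological order from PIs to POs.
    """
    cuts = {}
    for v in order:
        result = {frozenset([v])}                     # trivial cut: the node itself
        ins = fanins.get(v, [])
        if ins:
            # Pick one cut from every fanin and merge the chosen cuts.
            for combo in product(*(cuts[u] for u in ins)):
                merged = frozenset().union(*combo)
                if len(merged) <= k:                  # keep K-feasible cuts only
                    result.add(merged)
        cuts[v] = result
    return cuts


# Hypothetical network: x = f(a, b), y = f(b, c), z = f(x, y).
fanins = {"x": ["a", "b"], "y": ["b", "c"], "z": ["x", "y"]}
order = ["a", "b", "c", "x", "y", "z"]

cuts_k3 = enumerate_cuts(fanins, order, k=3)
cuts_k2 = enumerate_cuts(fanins, order, k=2)
for cut in sorted(cuts_k3["z"], key=sorted):
    print(sorted(cut))
```

Because fanin cuts are computed before the node that uses them, each node's cut set is built once, which is why the PI-to-PO order matters.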
-
Dual-Vdd Cases
Consider:
  Converter delay & power
  VddL LUT delay & power
  VddH LUT delay & power
[Figure: example network (nodes a, b, c, d, e, w, x, y, z) with an input LUT and a target LUT marked]
Four extra cases for dual-Vdd consideration
  Case   Input LUT   Target LUT   Converter
  1      VddL        VddL         No
  2      VddL        VddH         Yes
  3      VddH        VddL         No
  4      VddH        VddH         No
Produce these four cases for each cut and node
More tradeoff solution points
  Smaller power requires larger delay
  Smaller delay requires larger power
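The four cases in the table can be generated mechanically. The Python sketch below assumes that the only extra cost is a converter when a VddL input LUT drives a VddH target LUT, as the table states, and that converter delay and power simply add to those of the target LUT. The delay and power numbers are hypothetical placeholders.

```python
from itertools import product


def dual_vdd_cases(lut_delay, lut_power, conv_delay, conv_power):
    """List (input Vdd, target Vdd, converter?, delay, power) for the four cases."""
    cases = []
    for input_vdd, target_vdd in product(("VddL", "VddH"), repeat=2):
        # A converter is needed only when a VddL LUT drives a VddH LUT.
        needs_converter = input_vdd == "VddL" and target_vdd == "VddH"
        delay = lut_delay[target_vdd] + (conv_delay if needs_converter else 0.0)
        power = lut_power[target_vdd] + (conv_power if needs_converter else 0.0)
        cases.append((input_vdd, target_vdd, needs_converter, delay, power))
    return cases


# Hypothetical numbers: the VddL LUT is slower but consumes less power.
cases = dual_vdd_cases(
    lut_delay={"VddL": 1.5, "VddH": 1.0},
    lut_power={"VddL": 0.5, "VddH": 1.0},
    conv_delay=0.25,
    conv_power=0.125,
)
for number, case in enumerate(cases, start=1):
    print(number, case)
```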
-
Mapping Solution Generation
From POs to PIs
Critical path driven by VddH LUT
Non-critical paths can be driven by VddL LUT, guided by low power
[Figure: mapped example network (nodes a, b, c, d, e, f, g, w, x, y, z) with a legend distinguishing low-Vdd LUTs from high-Vdd LUTs]
-
Two Types of Required Times
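[Figure: node R (to be mapped) drives two mapped LUTs, x and y, one at VddL and one at VddH. Required times 1.8 and 2.0 are propagated back from the mapped LUTs; where a converter is needed, 1.7 = 2.0 - 0.3. The critical path is marked for the case where R uses VddH and for the case where R uses VddL. The figure also shows the values 3 and 3.2.]
The two resulting required times of R are 1.7 and 1.8, one for each choice of Vdd at R
Each node maintains two req’d times:
  Propagated separately
  Interact with each other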
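One way to read the numbers on this slide is sketched below in Python. The pairing is an assumption, not something the extracted slide states: x is taken to be a VddL LUT with required time 1.8, y a VddH LUT with required time 2.0, the converter delay is 0.3, and a converter is inserted only when a VddL driver feeds a VddH LUT (as in the case table earlier). Under those assumptions the two required times of R come out as 1.7 and 1.8.

```python
def required_time(driver_vdd, fanouts, conv_delay):
    """Required time at a node's output for one choice of its own Vdd.

    fanouts is a list of (fanout Vdd, required time at that fanout's input).
    """
    times = []
    for fanout_vdd, req in fanouts:
        # Assumed rule: a converter sits on a VddL -> VddH connection only.
        needs_converter = driver_vdd == "VddL" and fanout_vdd == "VddH"
        times.append(req - conv_delay if needs_converter else req)
    return min(times)


# Assumed reading of the figure: x is VddL (1.8), y is VddH (2.0).
fanouts = [("VddL", 1.8), ("VddH", 2.0)]

req_if_low = required_time("VddL", fanouts, conv_delay=0.3)   # min(1.8, 2.0 - 0.3)
req_if_high = required_time("VddH", fanouts, conv_delay=0.3)  # min(1.8, 2.0)
print(round(req_if_low, 3), round(req_if_high, 3))
```

Keeping both values per node lets the mapper defer the Vdd choice for R until the power of each option is known.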
-
Experimental Results
SVmap (single high Vdd) compared to Emap [Lamoureux, ICCAD03]
  Mapping area   Total edges   Est'ed power   Real power
  -4.04%         0.56%         -1.29%         -2.10%
  Mapping area considerably better
  Estimated power very close to the real power reported after P&R
DVmap (dual Vdd) compared to SVmap
  SVmap      DVmap
  v1.3       v1.3 - v0.8   v1.3 - v0.9   v1.3 - v1.0
  baseline   -11.63%       -10.72%       -9.44%
  v1.3 as VddH and v0.8 as VddL is the best combination
-
Circuit Clustering with Dual Vdds
Given:
  A mapped FPGA design
  An FPGA architecture with Dual-Vdd configurable logic blocks
Goal:
  Cluster the LUTs into logic blocks
  Assign voltages to the logic blocks
  such that the design has
    Optimal delay
    Minimum power
Constraints:
  Logic Block Inputs ≤ K
  Logic Block Size ≤ M
  Logic Block Outputs ≤ M
  LUT delay = dL or dH
  Inter-block edge delay = D
[Figure: example logic block containing LUTs, with Input = 5, Size = 3, Output = 2]
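The three structural constraints can be checked for a candidate cluster with a few set operations. The Python sketch below is illustrative only: the netlist is hypothetical and was chosen so that the cluster has 5 inputs, 3 LUTs, and 2 outputs, matching the numbers on the slide's figure. The function and parameter names are not from the talk.

```python
def cluster_counts(cluster, fanins, fanouts, primary_outputs):
    """Return (inputs, size, outputs) of a candidate logic block."""
    cluster = set(cluster)
    # Inputs: distinct signals that enter the cluster from outside.
    inputs = {u for v in cluster for u in fanins.get(v, ()) if u not in cluster}
    # Outputs: LUTs that drive a primary output or something outside the cluster.
    outputs = {
        v for v in cluster
        if v in primary_outputs
        or any(w not in cluster for w in fanouts.get(v, ()))
    }
    return len(inputs), len(cluster), len(outputs)


def cluster_is_feasible(counts, k_inputs, m_size):
    """Inputs <= K, size <= M, outputs <= M, as listed in the constraints."""
    n_inputs, size, n_outputs = counts
    return n_inputs <= k_inputs and size <= m_size and n_outputs <= m_size


# Hypothetical netlist: LUTs u and v feed LUT w; u also drives an outside LUT e1.
fanins = {"u": ["i1", "i2"], "v": ["i3", "i4"], "w": ["u", "v", "i5"]}
fanouts = {"u": ["w", "e1"], "v": ["w"], "w": []}
primary_outputs = {"w"}

counts = cluster_counts({"u", "v", "w"}, fanins, fanouts, primary_outputs)
print(counts, cluster_is_feasible(counts, k_inputs=5, m_size=3))
```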
-
Cluster Enumeration – An Example
[Figure: nodes m, n, o, p, q above LUTs r and s, which feed LUT t; some nodes are marked as common nodes]
To get a cluster of size 6 on LUT t
  Get 1 node on r, 4 on s, then merge with t …, and
  Get 2 nodes on r, 3 on s …
  Get 3 on r …
Dynamic programming, from PIs to POs
-
Solution Generation
[Figure: the same example (m, n, o, p, q; r, s; t) with Cluster s1 and Cluster s2 marked]
Try to get solutions for Cluster t1
  Get solutions for s
  Get solutions for r
Solution propagation similar to [Vaishnav, ICCAD’99]
Delay, power and voltage (form solution points) propagate through the clusters and nodes iteratively
-
Solution Curve on Node r
Good solutions: for any two delay-power-vdd points (D1, P1, V1) and (D2, P2, V2)
  if D1 > D2, then P1 < P2
  if D1 < D2, then P1 > P2
Good delay-power-vdd points
  Delay   Power   Vdd
  5       10      H
  5.4     9.8     L
  6.1     8.24    H
  6.2     8       L
[Figure: the corresponding solution curve (power versus delay, points labeled H or L)]
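The "good solution" condition is a non-dominance rule on (delay, power) pairs, and pruning against it can be sketched in a few lines of Python. The four good points come from the table on this slide; the fifth point is a hypothetical inferior solution added to show the pruning. The function name is illustrative.

```python
def prune_inferior(points):
    """Keep only points where a larger delay always buys a smaller power.

    points is an iterable of (delay, power, vdd) tuples; the result is
    sorted by increasing delay and therefore by decreasing power.
    """
    kept = []
    # Sort by delay, then power, so the cheapest point at each delay comes first.
    for delay, power, vdd in sorted(points, key=lambda p: (p[0], p[1])):
        if not kept or power < kept[-1][1]:
            kept.append((delay, power, vdd))
    return kept


good_points = [(5, 10, "H"), (5.4, 9.8, "L"), (6.1, 8.24, "H"), (6.2, 8, "L")]
# Hypothetical inferior point: slower than (5.4, 9.8) and also higher power.
candidates = good_points + [(5.8, 9.9, "H")]

curve = prune_inferior(candidates)
for point in curve:
    print(point)
```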
-
Solution Propagation
[Figure: delay-power-vdd curve for r and delay-power-vdd curve for s (power versus delay, points labeled H or L)]
Consider:
  Converter
  VddL LUT
  VddH LUT
  Edge delay
[Figure: delay-power-vdd curve for cluster t1 (power versus delay, points labeled H or L)]
All the good solutions are generated
All the inferior solutions are pruned away
-
Two Theorems
The algorithm gets the minimum number of solution points, W, optimally for each node
  W is upper bounded by $2(L+1)^2$, where L = level(v) for node v
The algorithm is delay and power optimal for trees and delay optimal for directed acyclic graphs (DAGs) with dual-Vdd FPGAs
-
Experimental Results Summary
Dual-Vdd clustering results compared to the single high-Vdd clustering results
  Single Vdd   Dual Vdd
  v1.3         v1.3 - v0.8   v1.3 - v0.9   v1.3 - v1.0
  baseline     -20.3%        -19.5%        -18.4%
v1.3 as high Vdd and v0.8 as low Vdd is the best combination among the three
-
Outline
Introduction
Understanding Power Consumption in FPGAs
Architecture Evaluation and Power Optimization
Low Power Synthesis
Conclusions
-
Conclusions
FPGA power consumption
  Majority on programmable interconnects
  Leakage is significant
FPGA architecture optimization for power
  Architecture parameter tuning has a limited impact
  Using high Vt for configuration SRAM cells is helpful
  Using programmable dual Vdd for logic blocks is helpful
Power-efficient FPGA architectures introduce interesting CAD problems
  Dual-Vdd mapping
  Dual-Vdd clustering
  Up to 20% power saving reported using these algorithms