Reducing Power Consumption with Relaxed Quasi Delay-Insensitive Circuits
Christopher LaFrieda and Rajit Manohar
Computer Systems Laboratory, Cornell University
Outline
• Motivation / Background
• Contributions
  • Relaxed Quasi Delay-Insensitive (RQDI)
  • RQDI Voltage Scaling
  • RQDI Two Phase Circuits
• Results
• Summary
Motivation: How Does Dynamic Power Scale?
$P_D = \alpha N C_L V_{dd}^2 f$
• α – activity factor (1x)
• N – total number of transistors (2x)
• C_L – average load capacitance per transistor (0.7x)
• V_dd – doesn’t scale well anymore
  • Scaled by 17-20% from 130nm to 65nm.
  • Scaled by 10% at 45nm and 5.5% at 32nm.
$\frac{P_{D1}}{P_{D0}} = 1.4 \, \frac{V_{dd1}^2}{V_{dd0}^2} \, \frac{f_1}{f_0}$
Motivation: Power Scaling With Fixed Frequency
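As a rough illustration of the fixed-frequency case, the sketch below evaluates the dynamic power relation with the per-generation scale factors quoted on the previous slide (1x activity, 2x transistors, 0.7x capacitance) and the 45nm and 32nm supply reductions. The function name `dynamic_power_ratio` and its parameters are hypothetical, and the result is only a back-of-the-envelope estimate, not a reproduction of the plotted data.

```python
def dynamic_power_ratio(vdd_scale, freq_scale=1.0,
                        activity_scale=1.0, transistor_scale=2.0, cap_scale=0.7):
    """Ratio of dynamic power between two generations, from P_D = a*N*C_L*Vdd^2*f."""
    return activity_scale * transistor_scale * cap_scale * vdd_scale ** 2 * freq_scale


# Supply reductions quoted on the slide: 10% at 45nm, 5.5% at 32nm.
# Frequency is held fixed, so freq_scale keeps its default of 1.0.
for node, vdd_drop in (("45nm", 0.10), ("32nm", 0.055)):
    ratio = dynamic_power_ratio(vdd_scale=1.0 - vdd_drop)
    print(f"{node}: dynamic power changes by {ratio:.3f}x at fixed frequency")
```

Under these assumptions the ratio stays above 1 for both nodes, so the small supply reductions do not offset the 1.4x growth from transistor count and capacitance.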
Motivation: Process Variations Getting Worse
• Process variation in 65nm, FO4 delays across corners:
    FF Corner   TT Corner   SS Corner
    13.6 ps     18.2 ps     22.6 ps
• FF is 70% faster than SS.
• Circuits need to be robust w.r.t. process variations.
• QDI is a logical place to start.
Background: QDI – WCHB Buffer
• Simple buffer.
• Neutrality is checked in the pull-up stack of the C-element.
• Timing assumption?
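To make the role of the C-element concrete, here is a minimal behavioral sketch of a dual-rail WCHB buffer. It is an abstraction under stated assumptions: rails are modeled active-high, the inverters, staticizers, and internal nodes such as _R0 are not modeled, and the class names `CElement` and `WCHB` are hypothetical. Each output rail is a C-element of the matching input rail and the right enable, so the output returns to neutral only once the input is neutral and the enable is low.

```python
class CElement:
    """Muller C-element: the output follows the inputs when they agree, else holds."""

    def __init__(self, initial=0):
        self.out = initial

    def update(self, a, b):
        if a == b:
            self.out = a
        return self.out


class WCHB:
    """Behavioral dual-rail WCHB buffer (active-high abstraction, no inverters)."""

    def __init__(self):
        self.c0 = CElement()
        self.c1 = CElement()

    def step(self, l0, l1, re):
        r0 = self.c0.update(l0, re)
        r1 = self.c1.update(l1, re)
        le = int(not (r0 or r1))  # left enable is high only while the output is neutral
        return r0, r1, le


buf = WCHB()
handshake = [
    (0, 0, 1),  # input neutral, right enable high
    (0, 1, 1),  # input becomes valid on rail 1
    (0, 1, 0),  # receiver lowers the enable; the input is still valid
    (0, 0, 0),  # input returns to neutral
    (0, 0, 1),  # receiver raises the enable again
]
trace = [buf.step(*inputs) for inputs in handshake]
for inputs, outputs in zip(handshake, trace):
    print(inputs, "->", outputs)
```

In the third step the output stays valid even though the enable is already low, which is the neutrality check at work: the rail resets only after the input itself goes neutral.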
RQDI: Staticizer Timing Assumption I
• Data is neutral and enable is high.
RQDI: Staticizer Timing Assumption II
• Data is neutral and enable is high.
• Data becomes valid, which sets _R0 low. If the R0 inverter is slow, R0 will remain low.
RQDI: Staticizer Timing Assumption III
• Data is neutral and enable is high.
• Data becomes valid, which sets _R0 low. If the R0 inverter is slow, R0 will remain low.
• Nothing is fighting the weak feedback, so _R0 can go high.
RQDI: Half Cycle Timing Assumption
• The half cycle timing assumption (HCTA): a small amount of combinational logic (1-2 transitions) will always switch within one half cycle of a process.
• There is a 4.5x timing margin (at 18 transitions per cycle).
• With worst case corners, there is a 2.7x margin in 65nm.
• Wire delays make the assumption even more conservative.
• QDI has an HCTA in staticizers. RQDI allows them everywhere.
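One plausible way to arrive at margins close to the quoted ones is sketched below. It assumes that the margin is the number of transitions in a half cycle divided by the transitions of combinational logic (taking the upper end, 2), and that the worst case pairs the fastest FO4 delay from the corner table for the handshake with the slowest one for the logic. Both assumptions, and the function name `hcta_margin`, are illustrative and not taken from the slides.

```python
def hcta_margin(transitions_per_cycle, logic_transitions):
    """Half-cycle length divided by the logic depth, both counted in transitions."""
    half_cycle = transitions_per_cycle / 2
    return half_cycle / logic_transitions


# Nominal case: 18 transitions per cycle, 2 transitions of combinational logic.
nominal = hcta_margin(transitions_per_cycle=18, logic_transitions=2)

# Worst case (assumed): the half cycle runs at the fastest FO4 delay in the
# corner table while the combinational logic runs at the slowest one.
fastest_fo4_ps, slowest_fo4_ps = 13.6, 22.6
worst_case = nominal * fastest_fo4_ps / slowest_fo4_ps

print(f"nominal margin:    {nominal:.1f}x")
print(f"worst-case margin: {worst_case:.1f}x")
```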
RQDI: HCHB Template
• N tracks neutrality.
• Check N+, but assume N- happens in the first half cycle.
• Two transition latency.
• 14 transition cycle time.
• Validity must be checked by pull-down.
RQDI Voltage Scaling: Scaling Scenarios
• Two possible scenarios for voltage scaling.
• Top: mismatched slack. Lower pipeline can run slower.
• Bottom: token limited loop. Latency through loop should be minimal, but cycle time can scale.
• In some applications these can’t be avoided.
Figure: mismatched slack (top) and token limited loop (bottom).
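The token limited case can be illustrated with a simplified ring model. The sketch assumes that the time between tokens at any stage is the larger of the stage's local cycle time and the loop's forward latency divided by the number of tokens, and it ignores the hole-limited regime. The 10-stage, one-token loop is a hypothetical example; the two-transition latency and 14-transition cycle time are the HCHB figures from the template slide.

```python
def loop_period(stages, tokens, forward_latency, local_cycle_time):
    """Time between tokens at a stage (in transitions) for a simplified ring."""
    token_limited = stages * forward_latency / tokens
    return max(token_limited, local_cycle_time)


def cycle_time_slack(stages, tokens, forward_latency, local_cycle_time):
    """How far the local cycle time can grow before it limits the loop."""
    return max(0.0, stages * forward_latency / tokens - local_cycle_time)


# Hypothetical loop: 10 stages, 1 token, 2-transition latency per stage.
for cycle_time in (14, 20, 24):
    period = loop_period(stages=10, tokens=1, forward_latency=2,
                         local_cycle_time=cycle_time)
    print(f"local cycle time {cycle_time} -> loop period {period:g} transitions")

print("slack at 14 transitions:", cycle_time_slack(10, 1, 2, 14))
```

In this model the cycle time can stretch from 14 to 20 transitions with no change in loop period, which is the room a slower cycle has as long as latency stays constant.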
RQDI Voltage Scaling: Slack Mismatch In An FPGA
• Logic blocks (LB) for logic.
• Switch boxes (SB) for routing.
• Limited routing resources.
• Imperfect slack matching.
• Can scale voltage on blue path.
RQDI Voltage Scaling: DVHB: Dual Voltage Template
• Data rails are full swing.
• Acknowledges are low swing.
• Latency remains constant through voltage scaling.
• Cycle time can be adjusted through voltage scaling.
RQDI Two Phase Circuits: Two Phase Buffer (HCFB2P)
• An HCTA exists on the right pair of XORs.
• Two transition latency.
• Seven transition cycle time.
• Twice the area of a WCHB. However, it can replace two stages.
RQDI Two Phase Circuits: Two Phase In An FPGA
• Replace routing (SB) with two phase logic.
• Logic (LB) remains four phase.
• Phase converters are placed around logic blocks.
• Routing makes up over half the area in an asynchronous FPGA, so power savings can be large.
Figure: width N switch.
RQDI Two Phase Circuits: Converters
• Need to convert between two phase (for routing) and four phase (for logic).
• The 4:2 converter is 3x larger than a WCHB.
• The 2:4 converter is 3.25x larger than a WCHB.
Experimental Setup
• Simulated in HSpice with a 65nm bulk technology.
• Circuits are sized to the drive strength of a 20/10 lambda inverter.
    Name   Description     Inputs   Outputs   Implies Validity?
    and2   And             2        1         No
    or2    Or              2        1         No
    xor2   Exclusive Or    2        1         Yes
    fa     Full Adder      3        2         Yes
    benc   Booth Encoder   3        2         No
Results: HCHB – Energy Per Cycle
• HCHB consumes 32% less energy than PCHB.
• HCHB consumes 36% less energy than PCEHB.
• Slight frequency improvement.
• Negligible latency penalty.
Results: HCHB – Total Transistor Area
• Despite the additional transistors to check validity, HCHB is smaller.
• HCHB is about 20% smaller than PCHB.
• HCHB is about 15% smaller than PCEHB.
Results: DVHB – Low Voltage vs. Dual Voltage
Results: HCFB2P Switch – Energy Reduction vs. WCHB
• Wider switches mean larger MUXes and larger PCs.
• The associated caps switch half as much.
• Over 50% reduction in power, due to replacing two stages.
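The claim that the capacitances switch half as much follows from counting wire events per token. The sketch below uses idealized protocols in which one data rail and one acknowledge wire are active per token; the event names and the helper functions are illustrative assumptions. It covers only the switching-activity argument, not the further saving from replacing two stages.

```python
# Wire events needed to move one token across one rail/acknowledge pair.
# Assumption: idealized protocols with a single active data rail per token.
EVENTS_PER_TOKEN = {
    "four-phase": ["rail up", "ack up", "rail down", "ack down"],
    "two-phase": ["rail toggle", "ack toggle"],
}


def transitions(protocol, tokens):
    """Total wire transitions needed to transfer the given number of tokens."""
    return len(EVENTS_PER_TOKEN[protocol]) * tokens


def relative_switching(tokens=1000):
    """Two-phase switching activity as a fraction of four-phase activity."""
    return transitions("two-phase", tokens) / transitions("four-phase", tokens)


print("four-phase transitions per token:", transitions("four-phase", 1))
print("two-phase transitions per token: ", transitions("two-phase", 1))
print("relative switching activity:     ", relative_switching())
```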
RQDI Two Phase Circuits: Results – Area Overhead
• Typically, there are about 8 stages of 4-wide switches between logic blocks.
• Area overhead is 15%.
• With direct connections, there are about 10 stages with an overhead of 10%.
Summary
• RQDI allows half cycle timing assumptions outside of staticizers.
• With RQDI, we can simplify the PCHB logic template. The resulting template, HCHB, consumes 32% less energy.
• The dual voltage logic template can be used to adjust the dynamic slack of a stage. This allows us to save energy with a minimal throughput penalty in token limited loops.
• Replacing the routing in an FPGA with two phase logic can reduce energy consumption by 50%. Using the RQDI two phase buffer and converters will achieve this with a 10-15% area overhead.
Questions?