Review: Designing Inverters for Performance
Reduce CL
- internal diffusion capacitance of the gate itself
- interconnect capacitance
- fanout
Increase W/L ratio of the transistor
- the most powerful and effective performance optimization tool in the hands of the designer
- watch out for self-loading!
Increase VDD
- only minimal improvement in performance at the cost of increased energy dissipation
Slope engineering - keeping signal rise and fall times smaller than or equal to the gate propagation delays and of approximately equal values
- good for performance
- good for power consumption
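Switch Delay Model
[Figure: switch-level RC models of the INVERTER, NAND, and NOR gates. Each transistor is replaced by a switch with on-resistance Rn or Rp (Req in general); the output drives CL, and the internal node of a series stack carries Cint.]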
Input Pattern Effects on Delay
Delay is dependent on the pattern of inputs (shown here for a 2-input NAND)
Low to high transition
- both inputs go low: delay is 0.69 (Rp/2) CL, since two p-resistors are on in parallel
- one input goes low: delay is 0.69 Rp CL
High to low transition
- both inputs go high: delay is 0.69 (2 Rn) CL
Adding transistors in series (without sizing) slows down the circuit
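The three input-pattern cases can be checked numerically with a minimal sketch. It assumes the first-order switch model used on this slide; the function name and the component values are hypothetical and chosen only for illustration.

```python
def nand2_delays(r_p, r_n, c_l):
    """First-order 2-input NAND delays in seconds, keyed by input pattern."""
    return {
        # Two PMOS devices conduct in parallel: effective resistance Rp/2.
        "low-to-high, both inputs fall": 0.69 * (r_p / 2) * c_l,
        # Only one PMOS device conducts: effective resistance Rp.
        "low-to-high, one input falls": 0.69 * r_p * c_l,
        # Two NMOS devices conduct in series: effective resistance 2 Rn.
        "high-to-low, both inputs rise": 0.69 * (2 * r_n) * c_l,
    }

# Hypothetical component values (10 kOhm devices, 10 fF load).
delays = nand2_delays(r_p=10e3, r_n=10e3, c_l=10e-15)
for pattern, t in delays.items():
    print(f"{pattern}: {t * 1e12:.1f} ps")
```

With equal Rp and Rn, the series pull-down is four times slower than the parallel pull-up, which is the motivation for sizing the series devices wider.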
Propagation delay deteriorates rapidly as a function of fan-in – quadratically in the worst case.
tp as a Function of Fan-In
[Figure: tp (psec, 0 to 1250) versus fan-in (2 to 16). tpHL is a quadratic function of fan-in; tpLH is a linear function of fan-in.]
Gates with a fan-in greater than 4 should be avoided.
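One way to see where the quadratic term in tpHL comes from is an Elmore-delay estimate of the series pull-down stack. This is a sketch under assumptions the slides do not spell out: every nfet has the same on-resistance, every internal node has the same capacitance, and the component values are hypothetical.

```python
def chain_tphl(fan_in, r_n, c_int, c_l):
    """Elmore estimate of tpHL for `fan_in` identical series nfets (assumed model)."""
    # Internal node k (counted from ground) discharges through k transistors,
    # so the internal nodes contribute R*C*(1 + 2 + ... + (fan_in - 1)).
    tau_internal = sum(k * r_n * c_int for k in range(1, fan_in))
    # The output node discharges through the whole stack.
    tau_output = fan_in * r_n * c_l
    return 0.69 * (tau_internal + tau_output)

# Hypothetical values: 10 kOhm per device, 2 fF per internal node, 10 fF load.
for n in (2, 4, 8, 16):
    t = chain_tphl(n, r_n=10e3, c_int=2e-15, c_l=10e-15)
    print(f"fan-in {n:2d}: tpHL ~ {t * 1e12:.0f} ps")
```

The internal-node sum grows as fan_in * (fan_in - 1) / 2, which is the quadratic contribution; the load term grows only linearly.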
Fast Complex Gates: Design Technique 1
Transistor sizing
- as long as fan-out capacitance dominates
Progressive sizing
[Figure: series chain of transistors M1 ... MN driven by In1 ... InN, with internal node capacitances C1, C2, C3 and load CL at the output; the chain behaves as a distributed RC line.]
- M1 > M2 > M3 > … > MN (the fet closest to the output should be the smallest)
- Can reduce delay by more than 20%; decreasing gains as technology shrinks
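Fast Complex Gates: Design Technique 2
Input re-ordering
- when not all inputs arrive at the same time
[Figure: two three-transistor pull-down stacks (M1 next to C1, M3 next to CL) with internal node capacitances C1 and C2; the late-arriving 0→1 input is the critical path.]
- Critical input In1 on M1 (farthest from the output): C1, C2 and CL are all still charged when In1 arrives, so delay is determined by the time to discharge CL, C1 and C2
- Critical input In1 on M3 (closest to the output): C1 and C2 are already discharged, so delay is determined by the time to discharge CL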
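Sizing and Ordering Effects
[Figure: 4-input NAND with inputs A, B, C, D, internal node capacitances C1, C2, C3, and CL = 100 fF. Transistor sizes shown: 3 for each of the four parallel devices; 4, 4, 4, 4 for the uniformly sized series chain and 4, 5, 6, 7 for the progressively sized chain.]
Progressive sizing in pull-down chain gives up to a 23% improvement.
Input ordering saves 5%
- critical path A – 23%
- critical path D – 17%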
Fast Networks: Design Technique 5 - Logical Effort
The optimum fan-out for a chain of N inverters driving a load CL is
$f = \sqrt[N]{C_L/C_{in}}$
so, if we can, keep the fan-out per stage around 4.
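As a rough illustration of the rule of thumb, the sketch below computes the per-stage fan-out for a given number of stages and picks a stage count that lands near a target fan-out of 4. The function names and the rounding rule are assumptions made for this example, not part of the slides.

```python
import math

def stage_fanout(c_load, c_in, n_stages):
    """Per-stage fan-out f = (CL / Cin) ** (1 / N) for an N-inverter chain."""
    return (c_load / c_in) ** (1.0 / n_stages)

def stages_near_target(c_load, c_in, target_f=4.0):
    """Pick the stage count whose per-stage fan-out is closest to target_f (heuristic)."""
    n = max(1, round(math.log(c_load / c_in) / math.log(target_f)))
    return n, stage_fanout(c_load, c_in, n)

# Hypothetical load: 64 times the input capacitance of the first inverter.
n, f = stages_near_target(c_load=64.0, c_in=1.0)
print(f"{n} stages, fan-out per stage ~ {f:.2f}")
```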
Can the same approach (logical effort) be used for any combinational circuit?
For a complex gate, we expand the inverter equation
$t_p = t_{p0}(1 + C_{ext}/(\gamma C_g)) = t_{p0}(1 + f/\gamma)$
to
$t_p = t_{p0}(p + g f/\gamma)$
- tp0 is the intrinsic delay of an inverter
- f is the effective fan-out (Cext/Cg), also called the electrical effort
- p is the ratio of the intrinsic (unloaded) delay of the complex gate and a simple inverter (a function of the gate topology and layout style)
- g is the logical effort
Intrinsic Delay Term, p
The more involved the structure of the complex gate, the higher the intrinsic delay compared to an inverter

Gate Type       p
Inverter        1
n-input NAND    n
n-input NOR     n
n-way mux       2n
XOR, XNOR       n 2^(n-1)

Ignoring second order effects such as internal node capacitances
Logical Effort Term, g
g represents the fact that, for a given load, complex gates have to work harder than an inverter to produce a similar (speed) response
the logical effort of a gate tells how much worse it is at producing an output current than an inverter (how much more input capacitance a gate presents to deliver the same output current)

Gate Type   g, by number of inputs
            1      2      3      n
Inverter    1
NAND               4/3    5/3    (n+2)/3
NOR                5/3    7/3    (2n+1)/3
mux                2      2      2
XOR                4      12
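The two tables can be combined into a small delay calculator. This is a sketch, not a library API: the function names are hypothetical, the XOR entries are limited to the two values listed in the table, and the delay is normalized to tp0 with γ assumed to be 1 unless overridden.

```python
from fractions import Fraction

def intrinsic_p(gate, n=1):
    """Intrinsic delay term p, taken from the table of p values."""
    table = {"inv": 1, "nand": n, "nor": n, "mux": 2 * n, "xor": n * 2 ** (n - 1)}
    return Fraction(table[gate])

def logical_g(gate, n=1):
    """Logical effort g, taken from the table of g values."""
    if gate == "inv":
        return Fraction(1)
    if gate == "nand":
        return Fraction(n + 2, 3)
    if gate == "nor":
        return Fraction(2 * n + 1, 3)
    if gate == "mux":
        return Fraction(2)
    if gate == "xor":
        return Fraction({2: 4, 3: 12}[n])  # only the tabulated cases
    raise ValueError(f"unknown gate type: {gate}")

def normalized_delay(gate, n, f, gamma=1):
    """tp / tp0 = p + g * f / gamma (gamma = 1 is an assumption)."""
    return intrinsic_p(gate, n) + logical_g(gate, n) * Fraction(f) / Fraction(gamma)

for f in range(0, 6):
    print(f, normalized_delay("inv", 1, f), normalized_delay("nand", 2, f))
```

Printing both columns side by side shows the NAND2 line starting higher (p = 2) and rising faster (g = 4/3) than the inverter line.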
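Example of Logical Effort
Assuming a pmos/nmos ratio of 2, the input capacitance of a minimum-sized inverter is three times the gate capacitance of a minimum-sized nmos (Cunit)
[Figure: inverter, 2-input NAND (A • B), and 2-input NOR (A + B) with transistor sizes. Inverter: pmos 2, nmos 1, Cunit = 3. NAND: pmos 2 and 2, nmos 2 and 2, Cunit = 4. NOR: pmos 4 and 4, nmos 1 and 1, Cunit = 5.]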
Delay as a Function of Fan-Out
[Figure: normalized delay (0 to 7) versus fan-out f (0 to 5) for NAND2 (g = 4/3, p = 2) and INV (g = 1, p = 1); each line is split into intrinsic delay and effort delay.]
The slope of the line is the logical effort of the gate
The y-axis intercept is the intrinsic delay
Can adjust the delay by adjusting the effective fan-out (by sizing) or by choosing a gate with a different logical effort
Gate effort: h = fg
Path Delay of Complex Logic Gate Network
Total path delay through a combinational logic block
$t_p = \sum_j t_{p,j} = t_{p0} \sum_j (p_j + (f_j g_j)/\gamma)$
So, the minimum delay through the path determines that each stage should bear the same gate effort
$f_1 g_1 = f_2 g_2 = \dots = f_N g_N$
Consider optimizing the delay through the logic network
how do we determine a, b, and c sizes?
[Figure: four-stage logic path with gate sizes 1, a, b, c driving a load CL = 5.]
Path Delay Equation Derivation
The path logical effort, $G = \prod g_i$
And the path effective fan-out (path electrical effort) is $F = C_L/C_{g1}$
The branching effort accounts for fan-out to other gates in the network
$b = (C_{\text{on-path}} + C_{\text{off-path}})/C_{\text{on-path}}$
The path branching effort is then $B = \prod b_i$
And the total path effort is then $H = GFB$
So, the minimum delay through the path is
$D = t_{p0} (\sum_j p_j + (N \sqrt[N]{H})/\gamma)$
Path Delay of Complex Logic Gates, con’t
For gate i in the chain, its size is determined by
$s_i = \frac{g_1 s_1}{g_i} \prod_{j=1}^{i-1} \frac{f_j}{b_j}$
[Figure: four-stage logic path with gate sizes 1, a, b, c driving a load CL = 5.]
For this network
- F = CL/Cg1 = 5
- G = 1 x 5/3 x 5/3 x 1 = 25/9
- B = 1 (no branching)
- H = GFB = 125/9, so the optimal stage effort is $h = \sqrt[4]{H} = 1.93$
- Fan-out factors are f1 = 1.93, f2 = 1.93 x 3/5 = 1.16, f3 = 1.16, f4 = 1.93
So the gate sizes are a = f1g1/g2 = 1.16, b = f1f2g1/g3 = 1.34 and c = f1f2f3g1/g4 = 2.60
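The worked example above can be reproduced with a short script. This is a sketch: the function name is hypothetical, the stage effort is written h (matching the gate effort h = fg), and each stage is assumed to bear the same effort, so f_i = h / g_i.

```python
import math

def size_path(g, b, c_load, c_g1, s1=1.0):
    """Return (H, h, fan-outs, sizes) for a path with logical efforts g and branching b."""
    n = len(g)
    path_effort = math.prod(g) * (c_load / c_g1) * math.prod(b)  # H = G F B
    stage_effort = path_effort ** (1.0 / n)                      # h = H ** (1/N)
    fanouts = [stage_effort / gi for gi in g]                    # f_i = h / g_i
    sizes = [s1]
    scale = 1.0
    for i in range(1, n):
        # s_i = (g_1 s_1 / g_i) * product over j < i of (f_j / b_j)
        scale *= fanouts[i - 1] / b[i - 1]
        sizes.append(g[0] * s1 / g[i] * scale)
    return path_effort, stage_effort, fanouts, sizes

# Values from the example: g = 1, 5/3, 5/3, 1; no branching; CL = 5, Cg1 = 1.
H, h, f, s = size_path(g=[1, 5 / 3, 5 / 3, 1], b=[1, 1, 1, 1], c_load=5.0, c_g1=1.0)
print(f"H = {H:.2f}, h = {h:.2f}")
print("fan-outs:", [round(x, 2) for x in f])
print("sizes 1, a, b, c:", [round(x, 2) for x in s])
```

Carrying full precision gives c slightly below the 2.60 quoted above, which appears to come from multiplying the already-rounded fan-out factors.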
Fast Complex Gates: Design Technique 6
Reducing the voltage swing
- linear reduction in delay
- also reduces power consumption
- requires use of “sense amplifiers” on the receiving end to restore the signal level (will look at their design when covering memory design)
tpHL = 0.69 (3/4 (CL VDD)/IDSATn)
     = 0.69 (3/4 (CL Vswing)/IDSATn)
TG Logic Performance
Effective resistance of the TG is modeled as a parallel connection of Rp (= (VDD – Vout)/(-IDp)) and Rn (= (VDD – Vout)/IDn)
[Figure: resistance (kΩ, 0 to 30) versus Vout (0 to 2 V) for Rn, Rp, and Req = Rn || Rp of a transmission gate biased at 2.5 V and 0 V and driving Vout; (W/L)n = 0.50/0.25, (W/L)p = 0.50/0.25.]
So, the assumption that the TG switch has a constant resistive value, Req, is acceptable
Power-delay product (PDP) = Pav * tp = (CL VDD^2)/2
- PDP is the average energy consumed per switching event (Watts * sec = Joule)
- lower power design could simply be a slower design
Energy-delay product (EDP) = PDP * tp = Pav * tp^2
- EDP is the average energy consumed multiplied by the computation time required
- takes into account that one can trade increased delay for lower energy/operation (e.g., via supply voltage scaling that increases delay, but decreases energy consumption)
- allows one to understand tradeoffs better
[Figure: normalized energy, delay, and energy-delay versus Vdd (0.5 to 2.5 V).]
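A minimal sketch of the two metrics, taking the PDP as CL VDD^2 / 2 (the form the truncated expression above appears to have) and the EDP as PDP times tp. The function names and the numeric values are hypothetical.

```python
def pdp(c_l, v_dd):
    """Power-delay product in joules, assuming PDP = CL * VDD^2 / 2."""
    return c_l * v_dd ** 2 / 2

def edp(c_l, v_dd, t_p):
    """Energy-delay product in joule-seconds: EDP = PDP * tp."""
    return pdp(c_l, v_dd) * t_p

# Hypothetical gate: 100 fF load, 2.5 V supply, 50 ps propagation delay.
print(f"PDP = {pdp(100e-15, 2.5) * 1e15:.1f} fJ")
print(f"EDP = {edp(100e-15, 2.5, 50e-12):.3e} J*s")
```

Lowering VDD always lowers the PDP, while the EDP also penalizes the delay that the lower supply costs, which is why it is the more useful figure of merit when comparing designs.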
Understanding Tradeoffs
[Figure: Energy versus 1/Delay, with design points a, b, c, d and an arrow toward lower EDP.]
Which design is the “best” (fastest, coolest, both)?
Pdyn = Energy/transition * f = CL * VDD^2 * P01 * f
Pdyn = CEFF * VDD^2 * f where CEFF = P01 CL
Not a function of transistor sizes!
Data dependent - a function of switching activity!
[Figure: CMOS inverter with input Vin, output Vout, load CL, and supply Vdd, switching at rate f01.]
Lowering Dynamic Power
Pdyn = CL VDD^2 P01 f
- Capacitance: Function of fan-out, wire length, transistor sizes
- Supply Voltage: Has been dropping with successive generations
- Clock frequency: Increasing…
- Activity factor: How often, on average, do wires switch?
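The four knobs can be explored with a one-line model. The sketch below assumes the Pdyn expression above; the function name and the operating point are hypothetical.

```python
def dynamic_power(c_l, v_dd, p01, f):
    """Pdyn = CL * VDD^2 * P01 * f, in watts."""
    return c_l * v_dd ** 2 * p01 * f

# Hypothetical node: 100 fF, activity 3/16, 500 MHz clock.
base = dynamic_power(c_l=100e-15, v_dd=2.5, p01=3 / 16, f=500e6)
half_supply = dynamic_power(c_l=100e-15, v_dd=1.25, p01=3 / 16, f=500e6)
print(f"at 2.5 V:  {base * 1e6:.1f} uW")
print(f"at 1.25 V: {half_supply * 1e6:.1f} uW")
```

Halving the supply cuts dynamic power by a factor of four, whereas halving the capacitance, activity, or frequency only halves it.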
Short Circuit Power Consumption
Finite slope of the input signal causes a direct current path between VDD and GND for a short period of time during switching when both the NMOS and PMOS transistors are conducting.
[Figure: CMOS inverter (Vin, Vout, CL) with short-circuit current Isc flowing from VDD to GND.]
Short Circuit Currents Determinants
Duration and slope of the input signal, tsc
Ipeak determined by
- the saturation current of the P and N transistors, which depends on their sizes, process technology, temperature, etc.
- strong function of the ratio between input and output slopes
  - a function of CL
Esc = tsc VDD Ipeak P01
Psc = tsc VDD Ipeak f01
Impact of CL on Psc
Large capacitive load (Isc ≈ 0)
- Output fall time significantly larger than input rise time.
Small capacitive load (Isc ≈ Imax)
- Output fall time substantially smaller than the input rise time.
Ipeak as a Function of CL
[Figure: Ipeak (A, x 10^-4, -0.5 to 2.5) versus time (sec, x 10^-10, 0 to 6) for CL = 20 fF, 100 fF, and 500 fF, with a 500 psec input slope.]
When load capacitance is small, Ipeak is large.
Short circuit dissipation is minimized by matching the rise/fall times of the input and output signals - slope engineering.
Psc as a Function of Rise/Fall Times
[Figure: normalized power (0 to 8) versus tsin/tsout (0 to 4) for VDD = 3.3 V, 2.5 V, and 1.5 V; normalized wrt zero input rise-time dissipation.]
When load capacitance is small (tsin/tsout > 2 for VDD > 2V) the power is dominated by Psc
If VDD < VTn + |VTp| then Psc is eliminated since both devices are never on at the same time.
Continued scaling of supply voltage and the subsequent scaling of threshold voltage will make subthreshold conduction a dominant component of power dissipation.
A 90mV/decade VT roll-off - so each 255mV increase in VT gives roughly 3 orders of magnitude reduction in leakage (but adversely affects performance)
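The roll-off figure translates into a leakage ratio with a single exponent. This is a sketch assuming a purely exponential dependence of leakage on VT at the quoted 90 mV/decade; the function name is hypothetical.

```python
def leakage_reduction(delta_vt_mv, slope_mv_per_decade=90.0):
    """Factor by which subthreshold leakage drops when VT rises by delta_vt_mv."""
    return 10.0 ** (delta_vt_mv / slope_mv_per_decade)

for dvt in (90, 180, 255, 270):
    print(f"VT + {dvt} mV -> leakage / {leakage_reduction(dvt):.0f}")
```

At exactly 90 mV/decade a 255 mV increase gives a little under three decades (a factor of several hundred), so the "3 orders of magnitude" figure is best read as approximate.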
Device sizing affects dynamic energy consumption
- gain is largest for networks with large overall effective fan-outs (F = CL/Cg,1)
The optimal gate sizing factor (f) for dynamic energy is smaller than the one for performance, especially for large F’s
- e.g., for F=20, fopt(energy) = 3.53 while fopt(performance) = 4.47
If energy is a concern avoid oversizing beyond the optimal
[Figure: normalized energy (0 to 1.5) versus f (1 to 7) for F = 1, 2, 5, 10, 20. From Nikolic, UCB]
Dynamic Power Consumption is Data Dependent
Switching activity, P01, has two components
- A static component - function of the logic topology
- A dynamic component - function of the timing behavior (glitching)

2-input NOR Gate
A  B  Out
0  0  1
0  1  0
1  0  0
1  1  0

With input signal probabilities PA=1 = 1/2, PB=1 = 1/2
Static transition probability P01 = Pout=0 x Pout=1 = P0 x (1-P0)
NOR static transition probability = 3/4 x 1/4 = 3/16
NOR Gate Transition Probabilities
P01 = P0 x P1 = (1-(1-PA)(1-PB)) (1-PA)(1-PB)
[Figure: 2-input NOR gate with inputs A and B driving CL, and a plot of P01 over PA and PB, each from 0 to 1.]
Switching activity is a strong function of the input signal statistics
- PA and PB are the probabilities that inputs A and B are one
Transition Probabilities for Some Basic Gates

Gate    P01 = Pout=0 x Pout=1
NOR     (1 - (1 - PA)(1 - PB)) x (1 - PA)(1 - PB)
OR      (1 - PA)(1 - PB) x (1 - (1 - PA)(1 - PB))
NAND    PAPB x (1 - PAPB)
AND     (1 - PAPB) x PAPB
XOR     (1 - (PA + PB - 2PAPB)) x (PA + PB - 2PAPB)
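The table can be evaluated directly for any input statistics. The sketch below assumes independent inputs, as the table does; the function names are hypothetical, and exact fractions are used so the results match the hand calculations.

```python
from fractions import Fraction

def p_one(gate, pa, pb):
    """Probability that the gate output is 1, for independent inputs A and B."""
    if gate == "NOR":
        return (1 - pa) * (1 - pb)
    if gate == "OR":
        return 1 - (1 - pa) * (1 - pb)
    if gate == "NAND":
        return 1 - pa * pb
    if gate == "AND":
        return pa * pb
    if gate == "XOR":
        return pa + pb - 2 * pa * pb
    raise ValueError(f"unknown gate type: {gate}")

def p01(gate, pa, pb):
    """Static 0-to-1 transition probability: P01 = Pout=0 x Pout=1."""
    p1 = p_one(gate, pa, pb)
    return (1 - p1) * p1

half = Fraction(1, 2)
for gate in ("NOR", "OR", "NAND", "AND", "XOR"):
    print(gate, p01(gate, half, half))
```

For uniformly random inputs the NOR, OR, NAND, and AND gates all give 3/16, while the XOR gives 1/4, the maximum possible value of P0 x P1.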
[Figure: two-gate network. A (probability 0.5) produces X; X and B (probability 0.5) produce Z.]
For X: P01 = P0 x P1 = (1-PA) PA = 0.5 x 0.5 = 0.25
For Z: P01 = P0 x P1 = (1-PXPB) PXPB = (1 – (0.5 x 0.5)) x (0.5 x 0.5) = 3/16
Inter-signal Correlations
Determining switching activity is complicated by the fact that signals exhibit correlation in space and time
- reconvergent fan-out
Have to use conditional probabilities
[Figure: reconvergent network. Inputs A and B (each with probability 0.5) produce X; X and B then produce Z.]
For X: (1-0.5)(1-0.5) x (1-(1-0.5)(1-0.5)) = 3/16
For Z, treating X and B as independent: (1 - 3/16 x 0.5) x (3/16 x 0.5) = 0.085
With reconvergent fan-out: P(Z=1) = P(B=1) & P(A=1 | B=1)
Logic Restructuring
Logic restructuring: changing the topology of a logic network to reduce transitions
[Figure: chain and tree implementations of a 4-input AND with inputs A, B, C, D, each with probability 0.5. Chain: P01 = 3/16, 7/64, 15/256 at successive gate outputs. Tree: P01 = 3/16 and 3/16 at the first-level outputs, 15/256 at the final output. Example: (1-0.25)*0.25 = 3/16.]
Chain implementation has a lower overall switching activity than the tree implementation for random inputs
Ignores glitching effects
AND: P01 = P0 x P1 = (1 - PAPB) x PAPB
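The chain and tree activities can be tallied with the AND formula above. This is a sketch that assumes independent inputs with probability 0.5 and ignores glitching, as the slide does; the helper name is hypothetical.

```python
from fractions import Fraction

def and_gate(pa, pb):
    """Return (P(out = 1), P01) for a 2-input AND with independent inputs."""
    p1 = pa * pb
    return p1, (1 - p1) * p1

half = Fraction(1, 2)

# Chain: ((A AND B) AND C) AND D
p_ab, a1 = and_gate(half, half)
p_abc, a2 = and_gate(p_ab, half)
_, a3 = and_gate(p_abc, half)
chain = [a1, a2, a3]

# Tree: (A AND B) AND (C AND D)
p_left, t1 = and_gate(half, half)
p_right, t2 = and_gate(half, half)
_, t3 = and_gate(p_left, p_right)
tree = [t1, t2, t3]

print("chain:", chain, "total", sum(chain))
print("tree: ", tree, "total", sum(tree))
```

The per-node values match the figure annotations (3/16, 7/64, 15/256 for the chain; 3/16, 3/16, 15/256 for the tree), and the totals confirm that the chain switches less for random inputs.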
Input Ordering
Beneficial to postpone the introduction of signals with a high transition rate (signals with signal probability close to 0.5)
Gates have a nonzero propagation delay resulting in spurious transitions or glitches (dynamic hazards)
- glitch: node exhibits multiple transitions in a single cycle before settling to the correct logic value
Glitching in an RCA
[Figure: ripple-carry adder with carry input Cin and sum outputs S0, S1, S2, ..., S14, S15, and a plot of sum output voltage (V, 0 to 3) versus time (ps, 0 to 12) for Cin, S0 to S5, S10, and S15.]
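Balanced Delay Paths to Reduce Glitching
Glitching is due to a mismatch in the path lengths in the logic network; if all input signals of a gate change simultaneously, no glitching occurs
So equalize the lengths of timing paths through logic
[Figure: two networks of gates F1, F2, F3 annotated with signal arrival times: one with arrival times 0, 0, 0, 0, 1, 2 (unbalanced paths) and one with 0, 0, 0, 0, 1, 1 (balanced paths).]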
Power and Energy Design Space

           Constant Throughput/Latency                      Variable Throughput/Latency
Energy     Design Time            Non-active Modules        Run Time
Active     Logic Design,          Clock Gating              DFS, DVS (Dynamic Freq,
           Reduced Vdd, Sizing,                             Voltage Scaling)
           Multi-Vdd
Leakage    + Multi-VT             Sleep Transistors,        + Variable VT
                                  Multi-Vdd, Variable VT
Dynamic Power as a Function of VDD
Decreasing the VDD decreases dynamic energy consumption (quadratically)
But, increases gate delay (decreases performance)
[Figure: normalized tp (1 to 5.5) versus VDD (0.8 to 2.4 V).]
Determine the critical path(s) at design time and use high VDD for the transistors on those paths for speed. Use a lower VDD on the other gates, especially those that drive large capacitances (as this yields the largest energy benefits).
Multiple VDD Considerations
How many VDD? - Two is becoming common
- Many chips already have two supplies (one for core and one for I/O)
When combining multiple supplies, level converters are required whenever a module at the lower supply drives a gate at the higher supply (step-up)
- If a gate supplied with VDDL drives a gate at VDDH, the PMOS never turns off
  - The cross-coupled PMOS transistors do the level conversion
  - The NMOS transistors operate on a reduced supply
- Level converters are not needed for a step-down change in voltage
- Overhead of level converters can be mitigated by doing conversions at register boundaries and embedding the level conversion inside the flipflop (see Figure 11.47)
[Figure: level converter with supplies VDDH and VDDL, input Vin, and output Vout.]
Dual-Supply Inside a Logic Block
Minimum energy consumption is achieved if all logic paths are critical (have the same delay)
Clustered voltage-scaling
- Each path starts with VDDH and switches to VDDL (gray logic gates) when delay slack is available
- Level conversion is done in the flipflops at the end of the paths
Stack Effect
Leakage is a function of the circuit topology and the value of the inputs
$V_T = V_{T0} + \gamma(\sqrt{|-2\phi_F + V_{SB}|} - \sqrt{|-2\phi_F|})$
where VT0 is the threshold voltage at VSB = 0; VSB is the source-bulk (substrate) voltage; γ is the body-effect coefficient
[Figure: 2-input NAND with inputs A and B, output Out, and internal node VX between the two series nfets.]

A  B  VX             ISUB
0  0  VT ln(1+n)     VGS = VBS = -VX
0  1  0              VGS = VBS = 0
1  0  VDD-VT         VGS = VBS = 0
1  1  0              VSG = VSB = 0

Leakage is least when A = B = 0
Leakage reduction due to stacked transistors is called the stack effect
Short Channel Factors and Stack Effect
In short-channel devices, the subthreshold leakage current depends on VGS, VBS and VDS. The VT of a short-channel device decreases with increasing VDS due to DIBL (drain-induced barrier lowering).
- Typical values for DIBL are 20 to 150mV change in VT per volt change in VDS so the stack effect is even more significant for short-channel devices.
- VX reduces the drain-source voltage of the top nfet, increasing its VT and lowering its leakage
For our 0.25 micron technology, VX settles to ~100mV in steady state so VBS = -100mV and VDS = VDD - 100mV, which gives a leakage 20 times smaller than that of a device with VBS = 0mV and VDS = VDD
Leakage as a Function of Design Time VT
Reducing the VT increases the sub-threshold leakage current (exponentially)
90mV reduction in VT increases leakage by an order of magnitude
Determine the critical path(s) at design time and use low VT devices on the transistors on those paths for speed. Use a high VT on the other logic for leakage control.
A careful assignment of VT’s can reduce the leakage by as much as 80%
Dual-Thresholds Inside a Logic Block
Minimum energy consumption is achieved if all logic paths are critical (have the same delay)
Use lower threshold on timing-critical paths
- Assignment can be done on a per gate or transistor basis; no clustering of the logic is needed
- No level converters are needed
Variable VT (ABB) at Run Time
$V_T = V_{T0} + \gamma(\sqrt{|-2\phi_F + V_{SB}|} - \sqrt{|-2\phi_F|})$
[Figure: VT (V, 0.4 to 0.9) versus VSB (V, -2.5 to 0).]
For an n-channel device, the substrate is normally tied to ground (VSB = 0)
A negative bias on VSB causes VT to increase
Adjusting the substrate bias at run time is called adaptive body-biasing (ABB)
- Requires a dual well fab process