Scaling, Power and the Future of CMOS
Mark Horowitz, Elad Alon, Dinesh Patil, Stanford
Samuel Naffziger, Rajesh Kumar, Intel
Kerry Bernstein, IBM
Slide 2
A Long Time Ago
In a building far away
A man made a prediction
On surprisingly little data
That has defined an industry
Slide 3
Moore’s Law
Slide 4
CMOS Computer Performance
[Figure: CMOS computer performance, 1985-2007 (log scale, 1-10,000): Intel 386, 486, Pentium, Pentium 2, Pentium 3, Pentium 4, Itanium; Alpha 21064, 21164, 21264; Sparc, SuperSparc, Sparc64; MIPS; HP PA; PowerPC; AMD K6, K7, x86-64]
Slide 5
Moore’s Original Issues
• Design cost
• Power dissipation
• What to do with all the functionality possible
ftp://download.intel.com/research/silicon/moorespaper.pdf
Slide 6
Outline
• How designers will deal with poor power scaling
• Origins of the power problem
• An optimization perspective
• Low power circuits and architectures
• Cost of variability
• Future scenarios
• What device characteristics matter (to me)
Slide 7
The 80’s Power Problem
• Until the mid-1980s, technology was mixed
• nMOS, bipolar, some CMOS
• Supply voltage was not scaling / power was rising
• nMOS, bipolar gates dissipate static power
From Roger Schmidt, IBM Corp
Slide 8
Solution: Move to CMOS
• And then scale Vdd
[Figure: feature size (µm) and Vdd vs. time, Jan 1985 to Jan 2003 (log scale)]
Slide 9
Scaling MOS Devices
• In this ideal scaling, V scales to αV, L scales to αL
• So C scales to αC, i scales to αi (i/µm is stable)
• Delay = CV/I scales as α
• Energy = CV² scales as α³
JSSC Oct 74, pg 256
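The scaling arithmetic above can be checked in a few lines; a minimal sketch, assuming the classic per-generation factor a = 0.7 (the specific value is an assumption for illustration, not from the slide):

```python
# Ideal (Dennard) constant-field scaling: V -> aV, L -> aL,
# hence C -> aC and i -> ai. a = 0.7 is an assumed per-generation factor.
a = 0.7

C, V, I = 1.0, 1.0, 1.0           # normalized pre-scaling values
delay = C * V / I                 # gate delay ~ CV/I
energy = C * V**2                 # switching energy ~ CV^2

Cs, Vs, Is = a * C, a * V, a * I  # scaled device
delay_s = Cs * Vs / Is            # scales as a
energy_s = Cs * Vs**2             # scales as a^3

print(delay_s / delay)            # ~a   (delay scales by a)
print(energy_s / energy)          # ~a^3 (energy scales by a cubed)
```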
Slide 10
Processor Power
• Continued to grow, even when Vdd was scaled
[Figure: processor power (W), 1985-2003 (log scale, 1-100)]
Slide 11
Why Power Increased
• Growing die size, fast frequency scaling
[Figure: clock frequency (MHz), 1985-2005 (log scale, 10-10,000)]
Slide 12
Good News
• Die growth & super frequency scaling have stopped
[Figure: cycle time in FO4 delays, 1985-2005 (log scale)]
Slide 13
Processor Power
• They were high power too
[Figure: processor power (W), 1985-2003 (log scale, 1-100)]
Slide 14
Bad News
• Voltage scaling has stopped as well
• kT/q does not scale
• Vth scaling has power consequences
• If Vdd does not scale
• Energy scales slowly
Ed Nowak, IBM
Slide 15
Energy – Performance Space
• Every design is a point on a 2-D plane
[Figure: each design plotted as a point in the energy vs. performance plane]
Slide 18
Trade-offs for an Adder
[Figure: energy-delay trade-off points for adder circuits (log scale): static carry chain, static carry select, static/domino Kogge-Stone (KS), static/domino Brent-Kung (BK), static/domino Ladner-Fischer (LF), static 84421]
Slide 19
Key Observation:
• Define the Energy/Delay sensitivity of a parameter
• For example Vdd:

Sens(Vdd) = (∂E/∂Vdd) / (−∂D/∂Vdd), evaluated at Vdd = Vdd*

• At the optimal point, all sensitivities should be the same
• Must equal the slope of the Pareto optimal curve
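The sensitivity definition can be evaluated numerically under a simple alpha-power delay model; everything here (the model form, α = 1.3, Vth = 0.3 V, normalized constants) is an illustrative assumption:

```python
# Numeric sketch of the energy/delay sensitivity of Vdd, using an
# alpha-power delay model with normalized, assumed constants.
ALPHA = 1.3   # velocity-saturation exponent (assumed)
VTH   = 0.3   # threshold voltage in volts (assumed)

def energy(vdd):
    return vdd**2                      # E ~ C * Vdd^2, C normalized to 1

def delay(vdd):
    return vdd / (vdd - VTH)**ALPHA    # alpha-power law delay

def sensitivity(vdd, h=1e-6):
    # Sens(Vdd) = (dE/dVdd) / (-dD/dVdd): energy saved per unit delay added
    dE = (energy(vdd + h) - energy(vdd - h)) / (2 * h)
    dD = (delay(vdd + h) - delay(vdd - h)) / (2 * h)
    return dE / -dD

# Sensitivity is large at high Vdd (lots of energy saved for little delay)
# and falls toward zero as Vdd approaches Vth.
print(sensitivity(1.2), sensitivity(0.6))
```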
Slide 20
What This Means
• Vdd and Vth are not directly set by scaling
• Instead set by slope of Pareto optimal curve
• Leakage rose to lower total system power!
[Figure: energy-delay optimal curve (log scale) with optimized design points: (Vdd=0.68, VtN=0.31), (Vdd=1.0, VtN=0.30), (Vdd=1.2, VtN=0.26), (Vdd=1.3, VtN=0.22)]
Slide 21
Low Power Design Techniques
Three main classes of methods to reduce energy:
• Cheating
• Reducing the performance of the design
• Reducing waste
• Stop using energy for stuff that does not produce results
• Stop waiting for stuff that you don’t need (parallelism)
• Problem reformulation
• Reduce work (less energy and less delay)
Slide 22
Cheating
• Many low-power papers talk only about energy
• Don’t consider performance
• Reducing performance can always reduce energy
• But there are many ways to reduce performance
• Good technique must lower the optimal curve
• “Sensitivity” of technique
• Must be better than current curve
• This depends on location on the curve
Slide 23
Reducing Energy Waste
• Clock gating
• If a section is idle, remove clock
• Removes clock power
• Prevents any internal node from transitioning
• Create system power states
• Turn on subsystems only when they are needed
• Can have different “off” states
• Power vs. wakeup time
• Disk (do you stop it from spinning?)
Slide 24
Embedded Power Gating
• Can reduce leakage
• 250x reported
• Since transistors still leak when power is off
• But costs:
• Performance
• Drop in Vdd, Gnd
[Figure: rows of standard cells with embedded power switches and power switch control signals]
Royannez, et al., 90nm Low Leakage SoC Design Techniques for Wireless Applications, ISSCC 2005
Slide 25
Range of Applicability
• Power supply gating
• Done to remove leakage power
• But slows down the circuit
• Adds series resistance to the supply
[Figure: energy/op vs. performance; power gating makes the circuit worse when energy sensitivity is high]
Slide 26
Parallelism
• If the application has data parallelism
• Parallelism is a way to improve performance
• With low additional energy cost
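A sketch of why parallelism saves energy, under the same assumed alpha-power model used above (two units, overheads ignored, all constants assumed):

```python
# Trading parallelism for supply voltage: two units at a lower Vdd can
# match the throughput of one fast unit at much lower energy per op.
# Normalized alpha-power model; all constants are assumptions.
ALPHA, VTH = 1.3, 0.3

def freq(vdd):
    return (vdd - VTH)**ALPHA / vdd    # ~1/delay

def energy_per_op(vdd):
    return vdd**2                      # CV^2 with C normalized to 1

v_nom = 1.2
target = freq(v_nom)                   # throughput of one fast unit

# Bisect for the Vdd where TWO units match the same total throughput
# (each unit only needs half the frequency).
lo, hi = VTH + 1e-3, v_nom
for _ in range(60):
    mid = (lo + hi) / 2
    if 2 * freq(mid) < target:
        lo = mid
    else:
        hi = mid
v_par = (lo + hi) / 2

print(v_par)                                         # reduced supply
print(energy_per_op(v_par) / energy_per_op(v_nom))   # energy/op ratio < 1
```

Energy per operation depends only on the supply each unit runs at, so (ignoring duplication overhead) the ratio printed is the energy win.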
Slide 27
Existing Processors
[Figure: Watts/(Spec·L·Vdd²) vs. Spec2000·L for existing processors (log-log), with a point for 10 processors working in parallel]
Slide 28
Parallel Server Chip
• POWER5 from IBM
Slide 29
Problem Reformulation
• Best way to save energy is to do less work
• Energy directly reduced by the reduction in work
• But required time for the function decreases as well
• Convert this into extra power gains
• Shifts the optimal curve down and to the right
[Figure: energy/op vs. user performance]
Slide 30
Cost of Variation
• Variability changes position of the optimal curves
• Need to margin Vth, Vdd to ensure circuit always works
[Figure: energy vs. performance optimal curves for ∆Vth = 0 mV and ∆Vth = 120 mV (log-log)]
Slide 31
Partial Compensation
• Adjust Vdd after you get the part back
• Compensates very well for small deviations in Vth
[Figure: energy vs. performance for ∆Vth = 0 mV and ∆Vth = 120 mV with Vdd adjusted per part (log-log)]
Slide 32
Reducing Voltage Margins
• At test time determine Vdd for that part
• Have private DC-DC converter already
[Figure: ~20% reduction in voltage margin]
Slide 33
Variable Application Demands
• Try to provide a couple of operating points
• Application can control speed and energy
• The hard question is which (Vdd, F) pairs are valid
• Usually determined during test
• Dynamic voltage scaling
• Intel SpeedStep in laptop processors
• 2 performance/power points
• Transmeta LongRun technology
• Many operating points: test data + formula
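A sketch of how a table of (Vdd, F) operating points might be generated from a fitted frequency model plus a safety margin; the model form, fitted constants, and 5% derating are assumptions standing in for real test-time characterization data:

```python
# Derive valid (Vdd, Fmax) pairs for dynamic voltage scaling from a
# fitted alpha-power frequency model plus a derating margin (assumed).
ALPHA, VTH, K = 1.3, 0.3, 3.0e9   # fitted constants (assumed)
MARGIN = 0.95                      # derate Fmax by 5% for variation

def fmax(vdd):
    return MARGIN * K * (vdd - VTH)**ALPHA / vdd   # alpha-power fit, Hz

# Discrete operating points the power manager may request
table = [(v, fmax(v)) for v in (0.8, 0.9, 1.0, 1.1, 1.2)]
for vdd, f in table:
    print(f"Vdd={vdd:.1f} V  ->  Fmax={f/1e9:.2f} GHz")
```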
Slide 34
Constant Power Scaling
• Foxton controller on next-gen. Itanium II
• Raises Vdd/boosts F when most units idle
• Lowers Vdd for parallel code to stay in budget
Slide 35
Self Checking Hardware
• Razor (Austin/Blaauw, U of Mich)
• Use the actual hardware to check for errors
• Latch the input data twice
• Once on the clock edge, and then a little later
• If the data is not the same, you are going too fast
[Figure: Razor flip-flop: main flip-flop samples Din on clk; a shadow latch samples on a delayed clock (clk_del); a comparator flags an error (Error_L) when the two disagree]
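The double-sampling idea can be sketched behaviorally; this is a toy model with made-up timing numbers, not the actual Razor implementation:

```python
# Behavioral sketch of Razor-style double sampling: the main flip-flop
# samples on the clock edge, a shadow latch samples a little later; a
# mismatch between the two flags a timing error.
def razor_sample(data_arrival, clk_edge, shadow_delay, data_value, old_value):
    """Return (latched_value, error). If the data arrives after the clock
    edge but before the shadow sample, the main FF holds the stale value
    and the shadow latch catches the new one."""
    main = data_value if data_arrival <= clk_edge else old_value
    shadow = data_value if data_arrival <= clk_edge + shadow_delay else old_value
    return main, main != shadow

# Data meets setup: both samples agree, no error
print(razor_sample(0.9, 1.0, 0.2, 1, 0))   # (1, False)
# Data arrives late, inside the shadow window: error flagged
print(razor_sample(1.1, 1.0, 0.2, 1, 0))   # (0, True)
```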
Slide 36
With Error Recovery
• Can run the chip so it makes some errors
• Chip gets right answer 99.9% of the time
• 0.1% of the time, the chip must rerun operation
[Figure: voltage at 0.1% error rate vs. voltage at first failure for measured chips, with linear fit y = 0.78685x + 0.22117]
Slide 37
Adjusting Vth
• In theory want to adjust Vth too
• Very hard to do with modern transistors
[Figure: leakage (µA) vs. Vdd (0.9-1.35 V) for body bias Vbb = 0 to 0.8 V]
Slide 38
Future Systems
• Some simple math
• Assume scaling continues
• Dies don’t shrink in size
• Average power/gate must decrease by 2x / generation
• Since gates are shrinking in size
• Get 1.4x from capacitive reduction
• Where is the other factor of 1.4x ?
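The slide's arithmetic as a sketch: constant die area and power budget, 2x gates per generation, roughly sqrt(2) of the needed reduction coming from smaller capacitance:

```python
# Constant die area and power budget with 2x more gates per generation
# means average power per gate must halve. Smaller gates give ~1.4x of
# that from reduced capacitance; this computes the remaining factor.
import math

gate_growth   = 2.0               # gates per die, per generation
cap_reduction = math.sqrt(2)      # ~1.4x energy drop from smaller C

remaining = gate_growth / cap_reduction
print(remaining)                  # ~1.414: the "other factor of 1.4x"
```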
Slide 39
Exploit Parallelism / Scale Vdd
• If you have parallelism
• Add more function units
• Fill up new die (2x)
• Lower energy/op
• ∆E/∆P will decrease
• Vdd, sizes, etc will reduce
• Build simpler architectures
• Works well when ∆E/∆P is large
• Per unit performance decrease is small
[Figure: energy/op vs. performance]
Slide 40
Exploit Specialization
• Optimize execution units for different applications
• Reformulate the hardware to reduce needed work
• Can improve energy efficiency for a class of applications
• Stream / Vector processing is a current example
• Exploit locality, reuse
• High compute density
[Figure: Imagine stream processor: µ-controller, clusters, SRF, memory system, HISC, NI]
Bill Dally et al., Stanford
Slide 41
Exploit Integration
• If both those techniques don’t work
• Still can increase integration by at least 1.4x
• Moving units onto one chip
• Reduces the number of I/Os on system
• I/O can take significant power today
• Allows even larger integration
Slide 42
TI - OMAP2420
• Specialization
• And power domains• Most units are off
• OMAP 2420 • 5 Power Domains
• #1: MCU Core
• #2: DSP Core
• #3: Graphic Accelerator
• #4: Core + Periph.
• #5: Always On logicRoyannez, et al, 90nm Low Leakage SoC Design Techniques
for Wireless Applications, ISSCC 2005
Slide 43
Low-Power PowerPC
Nowka et al., Low-power PowerPC, ISSCC
Slide 44
What All This Means
• As long as $/function and capacitance continue to scale
• Moving to the new technology will be profitable
• And will allow designs to be better systems
• In the worst case, active die area will decrease
• Scale gates by the decrease in gate capacitance
• In most cases, we will do much better
• But how to optimize devices in this new domain?
Slide 45
Radical Idea:
• Scaling channel length may no longer be critical
• I still want small (i.e. dense) devices
• But I also want lower variations & external control of Vth
• Longer Leff may actually improve energy efficiency
• Less variability → lower energy penalty
• Especially as we move to lower performance (parallelism)
[Figure: energy vs. performance optimal curves for Lnom (Vth spread 110-120 mV) and Lnom + 10% (Vth spread 55-60 mV)]
Slide 46
Conclusions
• Unfortunately power is an old problem
• Magic bullets have mostly been spent
• Power will be addressed by application-level optimization, parallelism/specialized functional units, and more adaptive control
• Need to rethink scaling
• Still makes things cheaper
• But what do we want from scaled transistors?
Slide 47
Technology Scaling
Seems simple:
• Every 1.5-2 years
• Number of transistors doubles
• Transistors get faster
• Gates become lower power (CMOS)
• Life just gets better and better
Slide 48
Reality is a Little Different
• While scaling has been smooth
• Almost nothing else has been
• Device and circuit technology has changed
• DTL, ECL, TTL, pMOS, nMOS, CMOS
• Power periodically becomes a critical issue
• It is critical again
Slide 49
nMOS, TTL, ECL Were King
• 1978 – Started in VLSI
• First design was bipolar/ECL
• 3µm nMOS was hot
[Chip photos: Intel 8086, Intel 286, DEC µVAX, BIT Sparc, HP Focus]
Slide 50
MOS Scaling Was Understood
• MOS devices operate on electric fields
• If E fields are the same
• Relation between E and J is the same
• So if all voltages and lengths scale
• iV curve retains the same shape, scaled in V
Bob Dennard worked out all the math in 1974 (JSSC, Oct. 1974, p. 256)
Slide 51
Dilemma
• Processors today are power limited
• As are many other chips
• Technology scaling will not save us
• With Vdd fixed, energy scaling will be modest
• How does one build more powerful processors?
• Or other types of chips
When constrained, optimize!
Slide 52
Optimizing the Right Thing
• Given systems are power limited
• Highest performance system is not interesting
• Will dissipate too much power
• Lowest energy solution is also not interesting
• Will not have enough performance
• Want constrained optimization
• Highest performance for 20 Watts
• Lowest power for 100 SPEC
Slide 53
Leakage Trends
[Figure: active power density and subthreshold (leakage) power density vs. gate length (µm)]
Slide 54
Design Parameters To Adjust
• Circuit (sizing, supply, threshold)
• Circuit topology (adder: CLA, CSA, …)
• Logic style (domino, pass-gate, …)
• Micro-architecture (pipelining, cache design, branch architecture, etc.)
[Figure: energy/op vs. performance]
Slide 55
Energy Efficient Designs
• Are on the Pareto optimal curve
• On this curve design parameters are constrained
[Figure: energy/op vs. performance; the Pareto optimal curve separates the infeasible region from designs wasting energy]
Slide 56
Leakage Energy
• Matching marginal costs for Vdd and Vth
[Figure: optimal leakage-to-total-power ratio vs. activity factor, and optimal Vdd and Vth (V) vs. leakage ratio (log scales)]
Slide 57
Measured Leakage Data
[Figure: measured leakage ratio, 0-0.5]
Slide 58
IBM Cell Processor
Slide 59
Vth Variation
• Since leakage is exponential in Vth
• The average leakage is not the leakage at the expected Vth
[Figure: cumulative probability vs. relative leakage (left); relative leakage contribution vs. Vth (right)]
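A Monte Carlo sketch of this effect; the ~100 mV/decade subthreshold slope and the 30 mV sigma on Vth are assumed values for illustration:

```python
# Because leakage is exponential in Vth, the mean leakage over a spread
# of devices exceeds the leakage of a device at the mean Vth.
import math
import random

random.seed(1)
VTH_MEAN, VTH_SIGMA = 0.30, 0.03   # volts (assumed spread)
S = 0.1 / math.log(10)             # ~100 mV/decade subthreshold slope

def leak(vth):
    return math.exp(-vth / S)      # normalized subthreshold current

samples = [random.gauss(VTH_MEAN, VTH_SIGMA) for _ in range(100_000)]
mean_leak = sum(leak(v) for v in samples) / len(samples)

print(mean_leak / leak(VTH_MEAN))  # > 1: variation inflates total leakage
```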
Slide 60
How Else to Save Energy?
• Running faster than needed wastes energy
• Forces you to run higher on the performance curve
• Why do you run faster than needed?
• Need margins to account for variability
• From application, environment, or technology
Variations cause waste
Slide 61
Dynamic Voltage Scaling
Burd et al ISSCC 2000
Slide 62
Dynamic Voltage Scaling
• Dynamic voltage scaling
• Adjusts Vdd to the “right” value for desired performance
• Big problem is how to find the “right” Vdd
• Need to know the relationship between Vdd and F
• Need to have a circuit that matches the critical path
• How do you do this with variations?