Is There Anything More to Learn about High Performance Processors?
J. E. Smith
June 2003, copyright J. E. Smith, 2003
The State of the Art
• Multiple instructions per cycle
• Out-of-order issue
• Register renaming
• Deep pipelining
• Branch prediction
• Speculative execution
• Cache memories
• Multi-threading
History Quiz
Superscalar processing was invented by
a) Intel in 1993
b) RISC designers in the late ’80s, early ’90s
c) IBM ACS in late ’60s; Tjaden and Flynn 1970
History Quiz
Out-of-order issue was invented by
a) Intel in 1993
b) RISC designers in the late ’80s, early ’90s
c) Thornton/Cray in the 6600, 1963
History Quiz
Register renaming was invented by
a) Intel in 1995
b) RISC designers in the late ’80s, early ’90s
c) Tomasulo in late ’60s; also Tjaden and Flynn 1970
What Keller said in 1975:
History Quiz
Deep pipelining was invented by
a) Intel in 2001
b) RISC designers in the late ’80s, early ’90s
c) Seymour Cray in 1976

1969: 7600, 12 gates/stage (?)
1976: Cray-1, 8 gates/stage
1985: Cray-2, 4 gates/stage
1991: Cray-3, 6 gates/stage (?)
History Quiz
Branch prediction was invented by
a) Intel in 1995
b) RISC designers in the late ’80s, early ’90s
c) Stretch 1959 (static); Livermore S-1 (?) 1979, or earlier at IBM (?)
History Quiz
Speculative execution was invented by
a) Intel in 1995
b) RISC designers in the late ’80s, early ’90s
c) CDC 180/990 (?) in 1983
History Quiz
Cache memories were invented by
a) Intel in 1985
b) RISC designers in the late ’80s, early ’90s
c) Maurice Wilkes in 1965
History Quiz
Multi-threading was invented by
a) Intel in 2001
b) RISC designers in the ’80s
c) Seymour Cray in 1964
Summary
• Multiple instructions per cycle -- 1969
• Out-of-order issue -- 1964
• Register renaming -- 1967
• Deep pipelining -- 1975
• Branch prediction -- 1979
• Speculative execution -- 1983
• Cache memories -- 1965
• Multi-threading -- 1964
All were done as part of a development project and immediately put into practice.
After introduction, only a few remained in common use
The 1970s & 80s – Less Complexity
• Level of integration wouldn’t support it
  • Not because of transistor counts, but because of small replaceable units
• Cray went toward simple issue, deep pipelining
• Microprocessor development first used high complexity, then drove pipelines deeper
• Limits to wide issue
• Limits to deep pipelining
Typical Superscalar Performance
Your basic superscalar processor:
• 4-way issue, 32-entry window
• 16K I-cache and D-cache
• 8K gshare branch predictor
Wide performance range; performance typically much less than peak (4)
[Chart: IPC (0 to 4) per SPEC benchmark: bzip, crafty, eon, gap, gcc, gzip, mcf, parser, perl, twolf, vortex, vpr]
Superscalar Processor Performance
Compare:
• 4-way issue, 32-entry window
• Ideal I-cache, D-cache, branch predictor
• Non-ideal I-cache, D-cache, branch predictor
Peak performance would be achievable IF it weren’t for “bad” events:
• I-cache misses
• D-cache misses
• Branch mispredictions
[Charts: IPC (0 to 4) per SPEC benchmark for the ideal and non-ideal configurations]
Performance Model
Consider profile of dynamic instructions issued per cycle:
• Background “issue width”: near-peak IPC, with a never-ending series of transient events
• Determine performance with ideal caches & predictors, then account for “bad” transient events
[Figure: IPC vs. time, with transient dips for branch mispredicts, i-cache misses, and long d-cache misses]
Backend: Ideal Conditions
Key result (Michaud, Seznec, Jourdan):
• Square-root relationship between issue rate and window size: sustainable issue rate grows as √W
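A minimal sketch of this scaling law; the proportionality constant k is purely illustrative and not from the talk:

```python
import math

def issue_rate(window_size, k=0.7):
    """Sustainable issue rate grows as the square root of window size
    (Michaud/Seznec/Jourdan); k is an illustrative constant."""
    return k * math.sqrt(window_size)

# Quadrupling the window only doubles the sustainable issue rate:
print(issue_rate(128) / issue_rate(32))  # → 2.0
```

The diminishing return is the point: each doubling of issue rate demands a quadrupling of the window.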
Branch Misprediction Penalty
1) Lost opportunity
• Performance lost by issuing soon-to-be-flushed instructions
2) Pipeline re-fill penalty
• The obvious penalty; most people equate this with the whole penalty
3) Window-fill penalty
• Performance lost due to window startup
[Figure: issue timeline divided into lost-opportunity, pipeline re-fill, and window-fill phases]
Calculate Mispredict Penalty
[Chart: instructions issued (0 to 4.5) per clock cycle (1 to 31) around a mispredicted branch]
8.5 insts / 4 = 2.1 cycles
9 insts / 4 = 2.2 cycles
19.75 insts / 4 = 4.9 cycles
Total penalty = 9.2 cycles
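A quick sketch of the arithmetic above: each penalty component is a count of lost issue slots divided by the peak issue width (4-wide here), giving a penalty in cycles.

```python
# Misprediction penalty: lost issue slots / peak issue width.
# The three components follow the slide's example numbers.
ISSUE_WIDTH = 4
lost_slots = [8.5, 9.0, 19.75]  # slots lost in each penalty phase
penalty_cycles = [slots / ISSUE_WIDTH for slots in lost_slots]
total = sum(penalty_cycles)
print(total)  # ≈ 9.3 cycles (the slide rounds each term first, giving 9.2)
```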
Importance of Branch Prediction
[Chart: instructions between mispredictions (0 to 1800) needed to spend 10-50 percent of time at 3.5, 7, or 14 issues per cycle, for issue widths 4, 8, and 16]
Importance of Branch Prediction
• Doubling issue width means the predictor has to be four times better for a similar performance profile
• Assumes everything else is ideal
  • I-caches & D-caches
• Research state of the art: about 5 percent mispredicts on average (perceptron predictor)
  => one misprediction per 100 instructions
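The 5%-to-one-per-100 step can be checked with back-of-envelope arithmetic; the branch density of one branch per five instructions is an assumed typical figure for integer code, not stated on the slide:

```python
# Assumption: roughly one in five instructions is a branch
# (a common rule of thumb for integer code; not from the slide).
insts = 100
branches = insts // 5          # assumed branch density
mispredicts = branches * 0.05  # 5% of branches mispredicted
print(mispredicts)  # → 1.0 misprediction per 100 instructions
```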
Next Generation Branch Prediction
• Classic memory/computation tradeoff
• Conventional branch predictors: heavy on memory, light on computation
• Perceptron predictor: adds heavier computation; also adds latency to prediction
• Future predictors should balance memory, computation, and prediction latency
[Diagrams: PC and global history feeding memory and computation blocks that produce the prediction, for conventional and perceptron predictors]
Implication of Deeper Pipelines
• Assume 1 misprediction per 96 instructions
• Vary fetch/decode/rename section of pipe
• Advantage of wide issue diminishes as pipe deepens
• This ignores implementation complexity
• Graph also ignores longer execution latencies
[Chart: IPC vs. fetch/decode pipe length (0 to 16) for issue widths 2, 4, and 8]
Deep Pipelining: the Optimality of Eight
• Hrishikesh et al.: 8 FO4s
• Kunkel et me: 8 gates
• Cray-1: 8 4/5-input NANDs
We’re getting there!
Deep Pipelining
• Consider time per instruction (TPI) versus pipeline depth (Hartstein and Puzak)
• The curve is very flat near the optimum
  • Good engineering; good sales
Transistor Radios and High MHz
A lesson from transistor radios…
Wonderful new technology in the late ’50s
Clearly, the more transistors, the better the radio!
=> Easy way to improve sales:
6 transistors, 8 transistors, 14 transistors…
Use transistors as diodes…
Lesson: Eventually, people caught on
The Optimality of Eight
8 Transistors!
So, Processors are Dead for Research?
Of course not
BUT IPC-oriented research may be on life support
Consider Car Engine Development
Conclusion: We should be driving cars with 48 cylinders!
[Chart: number of cylinders in car engines vs. year, 1890-1935]
• Don’t focus (obsess) on one aspect of performance
• And don’t focus only on performance:
  • Power efficiency
  • Reliability
  • Security
  • Design complexity
Co-Designed VMs
• Move hardware/software boundary
• Give “hardware” designer some software in concealed memory
• Hardware does what it does best: speed
• Software does what it does best: manage complexity
[Diagram: operating system and application programs in visible memory; VMM and data tables in concealed memory; profiling and configuration hardware below]
Co-Designed VMs: Micro-OS
• Manage processor with micro-OS VMM software
  • Manage processor resources in an integrated way
  • Identify program phase changes
  • Save/restore implementation contexts
  • A microprocessor-controlled microprocessor
[Diagram: pipeline surrounded by resources the micro-OS can tune: configurable I-cache size, simultaneous multithreading, variable branch-predictor global history, configurable instruction window, configurable D-cache size, variable D-cache prefetch algorithm, configurable reorder buffer]
Co-Designed VMs
• Other applications
  • Binary translation (e.g. Transmeta): enables new ISAs
  • Security (Dynamo/RIO)
[Diagram: conventional-ISA program and VMM, with translation and dynamic profiling feeding a fetch/rename-and-steer pipeline with N integer units and N D-cache units]
Speculative Multi-threading
• Reasons for skepticism
  • Complex
    • Incompatible w/ deep pipelining
    • The devil will be in the details: the researcher sees 4 instruction types; the designer sees 100(s) of instruction types
  • High power consumption
  • Performance advantages tend to be focused on specific programs (benchmarks)
• Better to push ahead with the real thread
The Memory Wall: D-Cache Misses
Divide into:
• Short misses: handle like a long-latency functional unit
• Long misses: need special treatment
Things that can reduce performance:
1) Structural hazards
  • ROB fills up behind load and dispatch stalls
  • Window fills with instructions dependent on load and issue stops
2) Control dependences
  • Mispredicted branch dependent on load data
  • Instructions beyond branch wasted
Structural and Data Blockages
Experiment:
• Window size 32, issue width 4
• Ideal branch prediction
• Cache miss delay 1000 cycles
• Separate window and ROB, 4K entries each
• Simulate a single cache miss and see what happens
Results
• Issue continues at full speed
• Typical dependent instructions: about 30
• Usually dependent instructions follow load closely

Benchmark   Avg. # insts issued after miss   Avg. # insts in window dep. on load
Bzip2       3950                             17.8
Crafty      3747                             20.1
Eon         3923                             22.4
Gap         3293                             31.6
Gcc         3678                             17.2
Mcf         3502                             96.2
Gzip        3853                             11.5
Parser      3648                             32.6
Perl        3519                             30.3
Twolf       3673                             44.7
Vortex      3606                             7.8
Vpr         2371                             34.0
Control Dependences
• Non-ideal branch prediction
  • How many cache misses lead to a branch mispredict, and when?
  • Use 8K gshare
Results
• Bimodal behavior; for some programs, branch mispredictions are crucial
• In many cases 30-40% of cache-miss data leads to a mispredicted branch
• Inhibits ability to overlap data cache misses
• One more reason to worry about branch prediction

Benchmark   Fract. loads driving mispredict   # insts before mispredict
Bzip2       .01                               33.5
Crafty      .30                               20.3
Eon         .18                               30.6
Gap         .33                               27.0
Gcc         .35                               32.4
Mcf         .01                               27.7
Gzip        .44                               32.4
Parser      .08                               35.9
Perl        .40                               30.2
Twolf       .37                               65.6
Vortex      .16                               41.2
Vpr         .47                               31.3
Dealing with the Memory Wall
• Don’t speculate about it: run through it
• ROB grows as nD
  • issue width n; miss delay D cycles
  • Example: a miss delay of 200 cycles on a four-issue machine => ROB of about 800 entries
• Window grows as dm
  • m outstanding misses; d dependent instructions each
  • Example: 6 outstanding misses with 30 dependent instructions each => window enlarged by 180 slots
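The sizing rules above can be sketched directly; the numbers reproduce the slide's two examples:

```python
# "Run through it" sizing: the ROB must buffer everything issued
# during the miss (issue width n times miss delay D), and the window
# must additionally hold the load-dependent instructions of every
# outstanding miss (m misses times d dependents each).
def rob_entries(issue_width, miss_delay):
    return issue_width * miss_delay

def extra_window_slots(outstanding_misses, dependents_per_miss):
    return outstanding_misses * dependents_per_miss

print(rob_entries(4, 200))        # → 800 entries
print(extra_window_slots(6, 30))  # → 180 slots
```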
Future High Performance Processors
• Fast clock cycle: 8 gates per stage
• Less speculation
  • Deciding what to take out is more important than finding new things to put in
• Return to simplicity
  • Leave the baroque era behind
• ILP less important
Research in the Deep Pipeline Domain
• When there are 40 gate levels, we can be sloppy about adding gadgets
• When there are 8 gate levels, a gadget requiring even one more level slows the clock by 12.5%
[Diagram: latch-to-latch stage of logic levels, with a “Neat Gadget” inserted into the critical path]
• To really evaluate the performance impact of adding a gadget, we need a detailed logic design
• Future research should be focused on jettisoning gadgets, not adding them
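The 12.5% figure is simply the marginal cost of one extra gate level in an 8-level stage:

```python
# One added gate level stretches an 8-level critical path by 1/8.
gate_levels = 8
slowdown = 1 / gate_levels
print(f"{slowdown:.1%}")  # → 12.5%
```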
Conclusion: Important Research Areas
• Processor simplicity
• Power efficiency
• Security
• Reliability
• Reduced design times
• Systems (on a chip): balancing threads and on-chip RAM
• Many very simple processors on a chip
  • Look at architecture of Denelcor HEP…
Attack of Killer Game Chips
OR: The most important thing I learned at Cray Research
OR: What happened to SSTs?
• It isn’t enough that we can build them
• It isn’t enough that there are interested customers
• Volume rules!

Researchers have made a supercomputer, powerful enough to rival the top systems in the world, out of PlayStation 2 components. A US research centre has clustered 70 Sony PlayStation 2 game consoles into a Linux supercomputer that ranks among the 500 most powerful in the world. According to the New York Times, the National Centre for Supercomputing Applications (NCSA) at the University of Illinois assembled the $50,000 (£30,000) machine out of components bought in retail shops. In all, 100 PlayStation 2 consoles were bought, but only 70 have been used for this project.
Acknowledgements
• Performance model: Tejas Karkhanis
• Funding: NSF, SRC, IBM, Intel
• Japanese transistor radio: Radiophile.com