Is There Anything More to Learn about High Performance Processors?
J. E. Smith
June 2003, copyright J. E. Smith, 2003
The State of the Art
• Multiple instructions per cycle
• Out-of-order issue
• Register renaming
• Deep pipelining
• Branch prediction
• Speculative execution
• Cache memories
• Multi-threading
History Quiz
Superscalar processing was invented by
a) Intel in 1993
b) RISC designers in the late ’80s, early ’90s
c) IBM ACS in late ’60s; Tjaden and Flynn 1970
History Quiz
Out-of-order issue was invented by
a) Intel in 1993
b) RISC designers in the late ’80s, early ’90s
c) Thornton/Cray in the 6600, 1963
History Quiz
Register renaming was invented by
a) Intel in 1995
b) RISC designers in the late ’80s, early ’90s
c) Tomasulo in late ’60s; also Tjaden and Flynn 1970
What Keller said in 1975:
History Quiz
Deep pipelining was invented by
a) Intel in 2001
b) RISC designers in the late ’80s, early ’90s
c) Seymour Cray in 1976

1969: 7600, 12 gates/stage (?)
1976: Cray-1, 8 gates/stage
1985: Cray-2, 4 gates/stage
1991: Cray-3, 6 gates/stage (?)
History Quiz
Branch prediction was invented by
a) Intel in 1995
b) RISC designers in the late ’80s, early ’90s
c) Stretch 1959 (static); Livermore S-1 (?) 1979, or earlier at IBM (?)
History Quiz
Speculative execution was invented by
a) Intel in 1995
b) RISC designers in the late ’80s, early ’90s
c) CDC 180/990 (?) in 1983
History Quiz
Cache memories were invented by
a) Intel in 1985
b) RISC designers in the late ’80s, early ’90s
c) Maurice Wilkes in 1965
History Quiz
Multi-threading was invented by
a) Intel in 2001
b) RISC designers in the ’80s
c) Seymour Cray in 1964
Summary
• Multiple instructions per cycle -- 1969
• Out-of-order issue -- 1964
• Register renaming -- 1967
• Deep pipelining -- 1975
• Branch prediction -- 1979
• Speculative execution -- 1983
• Cache memories -- 1965
• Multi-threading -- 1964
All were done as part of a development project and immediately put into practice.
After introduction, only a few remained in common use
The 1970s & 80s – Less Complexity
• Level of integration wouldn’t support it
  • Not because of transistor counts, but because of small replaceable units
• Cray went toward simple issue, deep pipelining
• Microprocessor development first used high complexity, then drove pipelines deeper
• Limits to wide issue
• Limits to deep pipelining
Typical Superscalar Performance
Your basic superscalar processor:
• 4-way issue, 32-entry window
• 16K I-cache and D-cache
• 8K gshare branch predictor
Wide performance range; performance typically much less than peak (4)
[Chart: IPC (0 to 4) per SPEC benchmark: bzip, crafty, eon, gap, gcc, gzip, mcf, parser, perl, twolf, vortex, vpr]
Superscalar Processor Performance
Compare:
• 4-way issue, 32-entry window
• Ideal I-cache, D-cache, branch predictor
• Non-ideal I-cache, D-cache, branch predictor
Peak performance would be achievable IF it weren’t for “bad” events:
• I-cache misses
• D-cache misses
• Branch mispredictions
[Charts: IPC (0 to 4) per SPEC benchmark for the ideal and non-ideal configurations]
Performance Model
Consider profile of dynamic instructions issued per cycle:
• Background “issue width”: near-peak IPC, with a never-ending series of transient events
• Determine performance with ideal caches & predictors, then account for “bad” transient events
[Figure: IPC vs. time, with transient dips for branch mispredicts, i-cache misses, and long d-cache misses]
Backend: Ideal Conditions
Key result (Michaud, Seznec, Jourdan):
• Square-root relationship between issue rate and window size: sustainable issue rate grows as √W
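A minimal sketch of this scaling law; the proportionality constant k is purely illustrative and not from the talk:

```python
import math

def issue_rate(window_size, k=0.7):
    """Sustainable issue rate grows as the square root of window size
    (Michaud/Seznec/Jourdan); k is an illustrative constant."""
    return k * math.sqrt(window_size)

# Quadrupling the window only doubles the sustainable issue rate:
print(issue_rate(128) / issue_rate(32))  # → 2.0
```

The diminishing return is the point: each doubling of issue rate demands a quadrupling of the window.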
Branch Misprediction Penalty
1) Lost opportunity
• Performance lost by issuing soon-to-be-flushed instructions
2) Pipeline re-fill penalty
• The obvious penalty; most people equate this with the whole penalty
3) Window-fill penalty
• Performance lost due to window startup
[Figure: issue timeline divided into lost-opportunity, pipeline re-fill, and window-fill phases]
Calculate Mispredict Penalty
[Chart: instructions issued (0 to 4.5) per clock cycle (1 to 31) around a mispredicted branch]
8.5 insts / 4 = 2.1 cycles
9 insts / 4 = 2.2 cycles
19.75 insts / 4 = 4.9 cycles
Total penalty = 9.2 cycles
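A quick sketch of the arithmetic above: each penalty component is a count of lost issue slots divided by the peak issue width (4-wide here), giving a penalty in cycles.

```python
# Misprediction penalty: lost issue slots / peak issue width.
# The three components follow the slide's example numbers.
ISSUE_WIDTH = 4
lost_slots = [8.5, 9.0, 19.75]  # slots lost in each penalty phase
penalty_cycles = [slots / ISSUE_WIDTH for slots in lost_slots]
total = sum(penalty_cycles)
print(total)  # ≈ 9.3 cycles (the slide rounds each term first, giving 9.2)
```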
Importance of Branch Prediction
[Chart: instructions between mispredictions (0 to 1800) needed to spend 10-50 percent of time at 3.5, 7, or 14 issues per cycle, for issue widths 4, 8, and 16]
Importance of Branch Prediction
• Doubling issue width means the predictor has to be four times better for a similar performance profile
• Assumes everything else is ideal
  • I-caches & D-caches
• Research state of the art: about 5 percent mispredicts on average (perceptron predictor)
  => one misprediction per 100 instructions
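The 5%-to-one-per-100 step can be checked with back-of-envelope arithmetic; the branch density of one branch per five instructions is an assumed typical figure for integer code, not stated on the slide:

```python
# Assumption: roughly one in five instructions is a branch
# (a common rule of thumb for integer code; not from the slide).
insts = 100
branches = insts // 5          # assumed branch density
mispredicts = branches * 0.05  # 5% of branches mispredicted
print(mispredicts)  # → 1.0 misprediction per 100 instructions
```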
Next Generation Branch Prediction
• Classic memory/computation tradeoff
• Conventional branch predictors: heavy on memory, light on computation
• Perceptron predictor: adds heavier computation; also adds latency to prediction
• Future predictors should balance memory, computation, and prediction latency
[Diagrams: PC and global history feeding memory and computation blocks that produce the prediction, for conventional and perceptron predictors]
Implication of Deeper Pipelines
• Assume 1 misprediction per 96 instructions
• Vary fetch/decode/rename section of pipe
• Advantage of wide issue diminishes as pipe deepens
• This ignores implementation complexity
• Graph also ignores longer execution latencies
[Chart: IPC vs. fetch/decode pipe length (0 to 16) for issue widths 2, 4, and 8]
Deep Pipelining: the Optimality of Eight
• Hrishikesh et al.: 8 FO4s
• Kunkel et me: 8 gates
• Cray-1: 8 4/5-input NANDs
We’re getting there!
Deep Pipelining
• Consider time per instruction (TPI) versus pipeline depth (Hartstein and Puzak)
• The curve is very flat near the optimum
  • Good engineering; good sales
Transistor Radios and High MHz
A lesson from transistor radios…
Wonderful new technology in the late ’50s
Clearly, the more transistors, the better the radio!
=> Easy way to improve sales:
6 transistors, 8 transistors, 14 transistors…
Use transistors as diodes…
Lesson: Eventually, people caught on
The Optimality of Eight
8 Transistors!
So, Processors are Dead for Research?
Of course not
BUT IPC-oriented research may be on life support
Consider Car Engine Development
Conclusion: We should be driving cars with 48 cylinders!
[Chart: number of cylinders in car engines vs. year, 1890-1935]
• Don’t focus (obsess) on one aspect of performance
• And don’t focus only on performance:
  • Power efficiency
  • Reliability
  • Security
  • Design complexity
Co-Designed VMs
• Move hardware/software boundary
• Give “hardware” designer some software in concealed memory
• Hardware does what it does best: speed
• Software does what it does best: manage complexity
[Diagram: operating system and application programs in visible memory; VMM and data tables in concealed memory; profiling and configuration hardware below]
Co-Designed VMs: Micro-OS
• Manage processor with micro-OS VMM software
  • Manage processor resources in an integrated way
  • Identify program phase changes
  • Save/restore implementation contexts
  • A microprocessor-controlled microprocessor
[Diagram: pipeline surrounded by resources the micro-OS can tune: configurable I-cache size, simultaneous multithreading, variable branch-predictor global history, configurable instruction window, configurable D-cache size, variable D-cache prefetch algorithm, configurable reorder buffer]
Co-Designed VMs
• Other applications
  • Binary translation (e.g. Transmeta): enables new ISAs
  • Security (Dynamo/RIO)
[Diagram: conventional-ISA program and VMM, with translation and dynamic profiling feeding a fetch/rename-and-steer pipeline with N integer units and N D-cache units]
Speculative Multi-threading
• Reasons for skepticism
  • Complex
    • Incompatible w/ deep pipelining
    • The devil will be in the details: the researcher sees 4 instruction types; the designer sees 100(s) of instruction types
  • High power consumption
  • Performance advantages tend to be focused on specific programs (benchmarks)
• Better to push ahead with the real thread
The Memory Wall: D-Cache Misses
Divide into:
• Short misses: handle like a long-latency functional unit
• Long misses: need special treatment
Things that can reduce performance:
1) Structural hazards
  • ROB fills up behind load and dispatch stalls
  • Window fills with instructions dependent on load and issue stops
2) Control dependences
  • Mispredicted branch dependent on load data
  • Instructions beyond branch wasted
Structural and Data Blockages
Experiment:
• Window size 32, issue width 4
• Ideal branch prediction
• Cache miss delay 1000 cycles
• Separate window and ROB, 4K entries each
• Simulate a single cache miss and see what happens
Results
• Issue continues at full speed
• Typical dependent instructions: about 30
• Usually dependent instructions follow load closely

Benchmark   Avg. # insts issued after miss   Avg. # insts in window dep. on load
Bzip2       3950                             17.8
Crafty      3747                             20.1
Eon         3923                             22.4
Gap         3293                             31.6
Gcc         3678                             17.2
Mcf         3502                             96.2
Gzip        3853                             11.5
Parser      3648                             32.6
Perl        3519                             30.3
Twolf       3673                             44.7
Vortex      3606                             7.8
Vpr         2371                             34.0
Control Dependences
• Non-ideal branch prediction
  • How many cache misses lead to a branch mispredict, and when?
  • Use 8K gshare
Results
• Bimodal behavior; for some programs, branch mispredictions are crucial
• In many cases 30-40% of cache-miss data leads to a mispredicted branch
• Inhibits ability to overlap data cache misses
• One more reason to worry about branch prediction

Benchmark   Fract. loads driving mispredict   # insts before mispredict
Bzip2       .01                               33.5
Crafty      .30                               20.3
Eon         .18                               30.6
Gap         .33                               27.0
Gcc         .35                               32.4
Mcf         .01                               27.7
Gzip        .44                               32.4
Parser      .08                               35.9
Perl        .40                               30.2
Twolf       .37                               65.6
Vortex      .16                               41.2
Vpr         .47                               31.3
Dealing with the Memory Wall
• Don’t speculate about it: run through it
• ROB grows as nD
  • issue width n; miss delay D cycles
  • Example: a miss delay of 200 cycles on a four-issue machine => ROB of about 800 entries
• Window grows as dm
  • m outstanding misses; d dependent instructions each
  • Example: 6 outstanding misses with 30 dependent instructions each => window enlarged by 180 slots
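The sizing rules above can be sketched directly; the numbers reproduce the slide's two examples:

```python
# "Run through it" sizing: the ROB must buffer everything issued
# during the miss (issue width n times miss delay D), and the window
# must additionally hold the load-dependent instructions of every
# outstanding miss (m misses times d dependents each).
def rob_entries(issue_width, miss_delay):
    return issue_width * miss_delay

def extra_window_slots(outstanding_misses, dependents_per_miss):
    return outstanding_misses * dependents_per_miss

print(rob_entries(4, 200))        # → 800 entries
print(extra_window_slots(6, 30))  # → 180 slots
```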
Future High Performance Processors
• Fast clock cycle: 8 gates per stage
• Less speculation
  • Deciding what to take out is more important than finding new things to put in
• Return to simplicity
  • Leave the baroque era behind
• ILP less important
Research in the Deep Pipeline Domain
• When there are 40 gate levels, we can be sloppy about adding gadgets
• When there are 8 gate levels, a gadget requiring even one more level slows the clock by 12.5%
[Diagram: latch-to-latch stage of logic levels, with a “Neat Gadget” inserted into the critical path]
• To really evaluate the performance impact of adding a gadget, we need a detailed logic design
• Future research should be focused on jettisoning gadgets, not adding them
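The 12.5% figure is simply the marginal cost of one extra gate level in an 8-level stage:

```python
# One added gate level stretches an 8-level critical path by 1/8.
gate_levels = 8
slowdown = 1 / gate_levels
print(f"{slowdown:.1%}")  # → 12.5%
```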
Conclusion: Important Research Areas
• Processor simplicity
• Power efficiency
• Security
• Reliability
• Reduced design times
• Systems (on a chip): balancing threads and on-chip RAM
• Many very simple processors on a chip
  • Look at architecture of Denelcor HEP…
Attack of Killer Game Chips
OR: The most important thing I learned at Cray Research
OR: What happened to SSTs?
• It isn’t enough that we can build them
• It isn’t enough that there are interested customers
• Volume rules!

Researchers have made a supercomputer, powerful enough to rival the top systems in the world, out of PlayStation 2 components. A US research centre has clustered 70 Sony PlayStation 2 game consoles into a Linux supercomputer that ranks among the 500 most powerful in the world. According to the New York Times, the National Centre for Supercomputing Applications (NCSA) at the University of Illinois assembled the $50,000 (£30,000) machine out of components bought in retail shops. In all, 100 PlayStation 2 consoles were bought, but only 70 have been used for this project.
Acknowledgements
• Performance model: Tejas Karkhanis
• Funding: NSF, SRC, IBM, Intel
• Japanese transistor radio: Radiophile.com