High-Performance Microarchitecture Techniquesrichter/cs550/MRF... · 60MHz P5 Micro-Architecture 5 Hyper Pipelined Technology 1GHz 166MHz P6 Micro-Architecture 10 @ intro 1.5 GHz

High-Performance Microarchitecture Techniques

John Paul Shen
Director of Microarchitecture Research
Intel Labs

October 29, 2002
Microprocessor Research Forum

Intel's Microarchitecture Research Labs

• USA: California, Oregon, Texas (John Shen)
  – High Frequency Superscalar Processors
  – Helper Threads for SMT and CMP Machines
  – Future Enterprise Server Processors
• Israel: Haifa (Ronny Ronen)
  – Low Power Microarchitecture Techniques
  – Future Mobile High-performance Processors
• Spain: Barcelona (Antonio Gonzalez)
  – Speculative Multithreading for SMT and CMP
  – Clustered Microarchitecture Techniques

Microprocessor Performance Growth in Perspective

• Doubling every 18 months (1982-2000):
  – Total of 3,200X
  – Cars travel at 176,000 MPH; get 64,000 miles/gal.
  – Air travel: L.A. to N.Y. in 5.5 seconds (MACH 3200)
  – Wheat yield: 320,000 bushels per acre
• Doubling every 24 months (1971-2001):
  – Total of 36,000X
  – Cars travel at 2,400,000 MPH; get 600,000 miles/gal.
  – Air travel: L.A. to N.Y. in 0.5 seconds (MACH 36,000)
  – Wheat yield: 3,600,000 bushels per acre

Unmatched by any other industry!!

“Iron Law” of Microprocessor Performance

$$
\frac{1}{\text{Processor Performance}}
= \frac{\text{Time}}{\text{Program}}
= \underbrace{\frac{\text{Instructions}}{\text{Program}}}_{\text{inst. count}}
\times \underbrace{\frac{\text{Cycles}}{\text{Instruction}}}_{\text{CPI}}
\times \underbrace{\frac{\text{Time}}{\text{Cycle}}}_{\text{cycle time}}
$$

$$
\text{Processor Performance} = \frac{\text{IPC} \times \text{GHz}}{\text{inst. count}}
$$
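As a minimal sketch of how the Iron Law terms combine, the following Python snippet evaluates both forms of the equation and shows that they are reciprocals. The instruction count, CPI, and clock frequency are hypothetical values chosen for illustration; they do not come from the slides.

```python
# Iron Law sketch. All numeric inputs below are hypothetical, not from the slides.

def execution_time_s(inst_count: float, cpi: float, cycle_time_s: float) -> float:
    """Time/Program = (Instructions/Program) x (Cycles/Instruction) x (Time/Cycle)."""
    return inst_count * cpi * cycle_time_s


def performance(inst_count: float, ipc: float, freq_hz: float) -> float:
    """Performance = (IPC x frequency) / instruction count, in programs per second."""
    return ipc * freq_hz / inst_count


inst_count = 1_000_000_000   # hypothetical program: one billion instructions
cpi = 1.25                   # hypothetical cycles per instruction (IPC = 0.8)
freq_hz = 1.5e9              # hypothetical 1.5 GHz clock; cycle time = 1 / freq_hz

time_s = execution_time_s(inst_count, cpi, 1.0 / freq_hz)
perf = performance(inst_count, 1.0 / cpi, freq_hz)

print(f"execution time: {time_s:.4f} s")
print(f"performance:    {perf:.4f} programs/s")
print(f"time x perf:    {time_s * perf:.4f}")  # reciprocal forms multiply to 1
```

Lowering any one of the three factors improves performance only if the other two do not grow to compensate, which is the tension the rest of the talk explores.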

Performance Improvement Techniques

• Increase GHz
  – Process Technology
  – Circuit Techniques
  – Pipelining and Caches
• Increase IPC (Reduce CPI)
  – Superscalar Pipelines
  – Out-of-order Execution
  – Cache Miss Reduction
• Decrease Instruction Count
  – Compiler Optimization
  – Architecture Extensions

[Slide callout: Microarchitecture Techniques]

SPECint92 Landscape

P6 vs. Pentium 4 Pipelines

Basic P6 Pipeline (intro at 733 MHz, .18µ), 10 stages:
1-2 Fetch, 3-5 Decode, 6 Rename, 7 ROB Rd, 8 Rdy/Sch, 9 Dispatch, 10 Exec

Basic Pentium® 4 Processor Pipeline (intro at 1.5 GHz, .18µ), 20 stages:
1-2 TC Nxt IP, 3-4 TC Fetch, 5 Drive, 6 Alloc, 7-8 Rename, 9 Que, 10-12 Sch, 13-14 Disp, 15-16 RF, 17 Ex, 18 Flgs, 19 Br Ck, 20 Drive

Hyper Pipelined Technology

[Figure: frequency versus time of introduction for three microarchitectures.]
• P5 Micro-Architecture: 5 pipeline stages, 60 MHz to 233 MHz
• P6 Micro-Architecture: 10 pipeline stages, 166 MHz to 1 GHz
• NetBurst Micro-Architecture: 20 pipeline stages, 1.5 GHz at intro

Deeper and Wider Pipelines

[Figure: a shallow pipeline (Fetch, Dec., Disp., Exec., Mem., Retire) next to a deeper and wider pipeline (Fetch, Decode, Dispatch, Execute, Memory, Retire). Both are annotated with Branch Penalty, Load Penalty, and ALU Penalty loops.]

Pipelining Penalty Loops

• Branch Penalty
  – Branch predictor
  – CPI overhead: Branch% x Misprediction% x PipeDepth
  – Performance lost: CPI overhead x PipeWidth
• Load Penalty
  – Cache hierarchy
  – CPI overhead: Load% x AvgLoadLatency
  – Average Load Latency: Σ Cache(i)Hit% x Cache(i)Latency
• ALU Penalty
  – Forwarding paths and super-pipelining
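The penalty formulas on this slide can be written out directly. The sketch below is one hedged reading of them in Python: the function names are illustrative, and every numeric input (branch and load fractions, misprediction rate, pipeline depth and width, the two-level cache) is a hypothetical value rather than data from the talk.

```python
# Penalty-loop formulas from the slide. All inputs below are hypothetical.

def branch_cpi_overhead(branch_frac: float, mispredict_frac: float, pipe_depth: int) -> float:
    """CPI overhead = Branch% x Misprediction% x PipeDepth."""
    return branch_frac * mispredict_frac * pipe_depth


def performance_lost(cpi_overhead: float, pipe_width: int) -> float:
    """Slide's 'performance lost' term = CPI overhead x PipeWidth."""
    return cpi_overhead * pipe_width


def avg_load_latency(levels: list[tuple[float, float]]) -> float:
    """Average load latency = sum over levels of Hit% x Latency (cycles)."""
    return sum(hit_frac * latency for hit_frac, latency in levels)


def load_cpi_overhead(load_frac: float, avg_latency: float) -> float:
    """CPI overhead = Load% x AvgLoadLatency."""
    return load_frac * avg_latency


# Hypothetical machine: 20% branches, 5% mispredicted, 20-stage, 3-wide pipeline.
branch_overhead = branch_cpi_overhead(0.20, 0.05, 20)
lost = performance_lost(branch_overhead, 3)

# Hypothetical two-level hierarchy: (fraction of loads served, latency in cycles).
latency = avg_load_latency([(0.90, 2), (0.10, 20)])
load_overhead = load_cpi_overhead(0.25, latency)

print(f"branch CPI overhead: {branch_overhead:.2f}")
print(f"performance lost:    {lost:.2f}")
print(f"avg load latency:    {latency:.2f} cycles")
print(f"load CPI overhead:   {load_overhead:.2f}")
```

Both overheads scale linearly with pipeline depth or memory latency, which is why deeper pipelines need better predictors and caches just to hold CPI steady.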

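Branch Prediction

[Figure: superscalar pipeline (Fetch, Decode Buffer, Decode, Dispatch Buffer, Dispatch, Reservation Stations, Issue, Execute, Finish, Completion Buffer) with a Branch Predictor at the fetch stage. The predictor takes the PC and supplies a speculative target and speculative condition to the FA-mux, which also receives the sequential nPC = PC+4 and sends the selected nPC to the I-cache. The Branch unit feeds BTB updates (target address and history) back to the predictor.]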

Branch Prediction Technology

• Basic 2-bit Local History Predictor
  – ~80% prediction accuracy
  – ~25 instructions/mispredict
  – ~5 cycles/25 instructions (0.2 CPI)
• Two-Level Correlated Predictor (P6)
  – ~90% prediction accuracy
  – ~50 instructions/mispredict
  – ~10 cycles/50 instructions (0.2 CPI)
• Current State of the Art (Pentium 4)
  – ~95% prediction accuracy
  – ~100 instructions/mispredict
  – ~20 cycles/100 instructions (0.2 CPI)
• Current Research Challenge (2008)
  – ~98% prediction accuracy
  – ~250 instructions/mispredict
  – ~25 cycles/250 instructions (0.1 CPI)
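The figures on this slide are mutually consistent if roughly one instruction in five is a branch. That branch fraction is an assumption inferred from the numbers, not something the slide states. Under it, a short Python sketch reproduces the instructions-per-mispredict and CPI columns from the accuracy and penalty values:

```python
# Reproduces the slide's arithmetic. BRANCH_FRACTION is an inferred assumption:
# the slide does not state it, but 20% makes all four rows line up.
BRANCH_FRACTION = 0.20

# (predictor, prediction accuracy, misprediction penalty in cycles) from the slide.
generations = [
    ("Basic 2-bit local history", 0.80, 5),
    ("Two-level correlated (P6)", 0.90, 10),
    ("State of the art (Pentium 4)", 0.95, 20),
    ("Research challenge (2008)", 0.98, 25),
]


def instructions_per_mispredict(accuracy: float,
                                branch_fraction: float = BRANCH_FRACTION) -> float:
    """Average number of instructions between two mispredicted branches."""
    return 1.0 / (branch_fraction * (1.0 - accuracy))


def cpi_overhead(accuracy: float, penalty_cycles: float,
                 branch_fraction: float = BRANCH_FRACTION) -> float:
    """Misprediction cycles amortized over the instructions between mispredicts."""
    return penalty_cycles / instructions_per_mispredict(accuracy, branch_fraction)


for name, accuracy, penalty in generations:
    ipm = instructions_per_mispredict(accuracy)
    print(f"{name:30s} {ipm:6.0f} instr/mispredict  {cpi_overhead(accuracy, penalty):.2f} CPI")
```

The first three rows all land at about 0.2 CPI: each gain in accuracy was consumed by a deeper pipeline, and only the 98% research target lowers the overhead.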

Data Cache and Prefetching

[Figure: superscalar pipeline with I-cache and Branch Predictor feeding the Decode Buffer, Decode, Dispatch Buffer, Dispatch, and Reservation Stations; functional units for branch, integer, integer, floating point, store, and load; Completion Buffer, Complete, and Store Buffer; a Data Cache backed by Main Memory; and a Memory Reference Prediction unit with a Prefetch Queue.]

Cache Hierarchy Technology

• Current Commercial Workload (6 cycles/load)
  – L1 Hits: 80% x 2 cycles = 1.6
  – L2 Hits: 15% x 10 cycles = 1.5
  – L3 Hits: 4% x 30 cycles = 1.2
  – Memory: 1% x 150 cycles = 1.5
• Future Commercial Workload (17 cycles/load)
  – L1 Hits: 80% x 4 cycles = 3.2
  – L2 Hits: 15% x 20 cycles = 3.0
  – L3 Hits: 4% x 60 cycles = 2.4
  – Memory: 1% x 800 cycles = 8.0
• Current Research Challenge (5 cycles/load)
  – Efficient and judicious caches
  – Load partitioning and specialized caching
  – Aggressive memory prefetching
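The per-level contributions above sum to the headline cycles-per-load figures once rounded (5.8 to 6, and 16.6 to 17). The following Python sketch redoes that weighted sum using the slide's own hit fractions and latencies; the function and variable names are illustrative.

```python
# Weighted average load latency, using the hit fractions and latencies from the slide.

# (level, fraction of loads served at this level, latency in cycles)
current = [("L1", 0.80, 2), ("L2", 0.15, 10), ("L3", 0.04, 30), ("Memory", 0.01, 150)]
future = [("L1", 0.80, 4), ("L2", 0.15, 20), ("L3", 0.04, 60), ("Memory", 0.01, 800)]


def average_load_latency(levels):
    """Sum of (fraction served) x (latency) over all levels, in cycles per load."""
    return sum(fraction * cycles for _, fraction, cycles in levels)


def memory_share(levels):
    """Fraction of the average load latency contributed by main memory."""
    _, fraction, cycles = levels[-1]
    return fraction * cycles / average_load_latency(levels)


for label, levels in (("current", current), ("future", future)):
    print(f"{label:8s} {average_load_latency(levels):5.1f} cycles/load, "
          f"{memory_share(levels):.0%} from main memory")
```

In the future workload the 1% of loads that reach main memory account for nearly half of the average latency, which motivates the caching and prefetching items in the research challenge.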

Memory Latency Bottleneck

[Figure: two charts. The first plots Cache Latency (Clocks) on a scale of 1 to 1000 for L1, L2, L3, and External Memory. The second plots the Instruction Cost of External Memory Latency on a scale of 0 to 800 for the Pentium® proc, Pentium Pro proc, Pentium III proc, and Future Processors.]

Cache Prefetching:
• Hardware: Limited by predictable patterns
• Software: Limited by single control flow
• Research Challenge: Pointer-intensive code

Frequency vs. Parallelism

• Increase Frequency (GHz)
  – Deeper Pipelines
  – Increases Branch/Load penalties
  – Lowers IPC
• Increase Instruction Parallelism (IPC)
  – Wider Pipelines
  – Increases Complexity
  – Lowers GHz

Front-End Pipe-Depth Penalty

[Figure: two copies of the pipeline (Fetch, Decode, Dispatch, Execute, Memory, Retire). The second adds an Optimize stage and is annotated with Front-End Contraction and Back-End Optimization.]

Alleviate Pipe-Depth Penalty

• Front-End Contraction
  – Code Re-mapping and Caching
  – Trace Construction, Caching, Optimization
  – Leverage Back-End Optimizations
• Back-End Optimization
  – Multiple-Branch, Trace, Stream, Prediction
  – Code Reordering, Alignment, Optimization
  – Pre-decode, Pre-rename, Pre-scheduling
  – Memory Pre-fetch Prediction and Control

Execution Core Improvement

[Figure: pipeline (Fetch, Decode, Dispatch, Execute, Memory, Retire) with an Optimize stage.]

• Super-pipelined ALU design
• Very high-speed arithmetic units
• Speculative OoO execution
• Criticality-based data caching
• Aggressive data pre-fetching

How Deep Can You Go?

[Figure: Frequency, CPI, Performance, and Power plotted against Pipeline Depth (1 to 99 stages), with a point annotated "57?". Credit: Ed Grochowski, 7/6/01. Source: Intel Corporation]

How Much ILP Is There?

Study                          ILP
Weiss and Smith [1984]         1.58
Sohi and Vajapeyam [1987]      1.81
Tjaden and Flynn [1970]        1.86
Tjaden and Flynn [1973]        1.96
Uht [1986]                     2.00
Smith et al. [1989]            2.00
Jouppi and Wall [1988]         2.40
Johnson [1991]                 2.50
Acosta et al. [1986]           2.79
Wedig [1982]                   3.00
Butler et al. [1991]           5.8
Melvin and Patt [1991]         6
Wall [1991]                    7
Kuck et al. [1972]             8
Riseman and Foster [1972]      51
Nicolau and Fisher [1984]      90

SPECint95 Landscape

[Figure: Landscape of Microprocessor Families. SPECint95/MHz (0 to 0.08) versus Frequency (MHz, 80 to 980), with constant-performance curves from 5 to 60 SPECint95. Families: Alpha, AMD-x86, Intel-x86. Labeled points: 064, 164, 264, Athlon, P, PPro, PII, PIII. Data source: www.spec.org. Credit: Bryan Black]

SPECint2000 Landscape

[Figure: Landscape of Microprocessor Families. SPECint2000/MHz (0 to 1) versus Frequency (MHz, 0 to 2500), with constant-performance curves from 25 to 800 SPECint2000. Families: Intel-x86, AMD-x86, Alpha, PowerPC, Sparc, IPF. Labeled points: PIII-Xeon, P4, Athlon, 264A, 264B, 264C, Sparc-III, 604e, Itanium. Data source: www.spec.org. Credit: Bryan Black]

Parallelism in Transition

[Figure: MIPS (log scale, 1 to 1,000,000) versus year (1980 to 2010). Labeled points: Pentium® Architecture (Super Scalar); Pentium® Pro Architecture (Speculative Out of Order); Pentium® 4 Architecture (Trace Cache); Future Xeon™ Architecture (Multi-Threaded); Multi-Threaded, Multi-Core. The curve is divided into an Era of Instruction Parallelism and an Era of Thread Parallelism.]

Summary

Performance Demand Continues
• 5-10 billion transistors by 2010
• 10-20 GHz by 2010

Challenge Is Power and Efficiency
• Power dissipation, delivery, density
• New clever/efficient implementations

New Frontiers to Explore
• Synergism of ILP, TLP, and MLP
• “Semi-Custom” Microarchitectures
