High-Performance Microarchitecture Techniques
John Paul Shen
Director of Microarchitecture Research, Intel Labs
October 29, 2002
Microprocessor Research Forum
– High Frequency Superscalar Processors
– Helper Threads for SMT and CMP Machines
– Future Enterprise Server Processors
• Israel: Haifa (Ronny Ronen)
– Low Power Microarchitecture Techniques
– Future Mobile High-performance Processors
• Spain: Barcelona (Antonio Gonzalez)
– Speculative Multithreading for SMT and CMP
– Clustered Microarchitecture Techniques
Microprocessor Performance Growth in Perspective
• Doubling every 18 months (1982-2000):
– Total of 3,200X
– Cars travel at 176,000 MPH; get 64,000 miles/gal.
– Air travel: L.A. to N.Y. in 5.5 seconds (MACH 3200)
– Wheat yield: 320,000 bushels per acre
• Doubling every 24 months (1971-2001):
– Total of 36,000X
– Cars travel at 2,400,000 MPH; get 600,000 miles/gal.
– Air travel: L.A. to N.Y. in 0.5 seconds (MACH 36,000)
– Wheat yield: 3,600,000 bushels per acre
Unmatched by any other industry!
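The slide quotes its totals without derivation. As a rough cross-check, the sketch below computes idealized compounding, assuming performance doubles at exactly the stated interval over the stated span. The function name `growth_factor` is hypothetical. The results are of the same order as the slide's 3,200X and 36,000X but do not match them exactly, and the slide does not say how its figures were obtained.

```python
def growth_factor(start_year, end_year, months_per_doubling):
    """Idealized total improvement if performance doubles at a fixed interval."""
    months = (end_year - start_year) * 12
    doublings = months / months_per_doubling
    return 2.0 ** doublings


# Spans and doubling periods are taken from the slide.
fast = growth_factor(1982, 2000, 18)  # 18 years -> 12 doublings
slow = growth_factor(1971, 2001, 24)  # 30 years -> 15 doublings

print(f"18-month doubling, 1982-2000: {fast:,.0f}X")
print(f"24-month doubling, 1971-2001: {slow:,.0f}X")
```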
“Iron Law” of Microprocessor Performance

$$
\frac{1}{\text{Processor Performance}}
= \frac{\text{Time}}{\text{Program}}
= \underbrace{\frac{\text{Instructions}}{\text{Program}}}_{\text{inst. count}}
\times \underbrace{\frac{\text{Cycles}}{\text{Instruction}}}_{\text{CPI}}
\times \underbrace{\frac{\text{Time}}{\text{Cycle}}}_{\text{cycle time}}
$$

$$
\text{Processor Performance} = \frac{\text{IPC} \times \text{GHz}}{\text{inst. count}}
$$
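The two forms of the Iron Law are reciprocals of each other, which a short numeric check makes concrete. This is a minimal sketch: the workload numbers (instruction count, CPI, clock rate) are made up for illustration and the function names are hypothetical.

```python
import math


def execution_time_s(inst_count, cpi, cycle_time_ns):
    """Time/Program = inst. count x CPI x cycle time."""
    return inst_count * cpi * cycle_time_ns * 1e-9


def performance(inst_count, ipc, ghz):
    """Programs per second = (IPC x GHz) / inst. count, with GHz as 1e9 cycles/s."""
    return ipc * ghz * 1e9 / inst_count


# Hypothetical workload, not from the slides.
inst_count = 2_000_000_000  # instructions per program
cpi = 0.5                   # cycles per instruction
ghz = 2.0                   # clock frequency

t = execution_time_s(inst_count, cpi, cycle_time_ns=1.0 / ghz)
p = performance(inst_count, ipc=1.0 / cpi, ghz=ghz)

print(f"time per program: {t:.3f} s")
print(f"performance:      {p:.3f} programs/s")
print("reciprocal check:", math.isclose(t * p, 1.0))
```

Halving CPI or doubling the clock rate has the same effect on run time in this model, which is the tension the later slides on frequency versus parallelism explore.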
• Current State of the Art (Pentium 4)
– ~95% prediction accuracy
– ~100 instructions/mispredict
– ~20 cycles/100 instructions (0.2 CPI)
• Current Research Challenge (2008)
– ~98% prediction accuracy
– ~250 instructions/mispredict
– ~25 cycles/250 instructions (0.1 CPI)
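The figures above can be tied together with two small formulas. The sketch below assumes roughly one branch every five instructions; that density is not stated on the slide, but it is the value at which 95% and 98% accuracy yield about 100 and 250 instructions per mispredict. Function names are hypothetical.

```python
def instructions_per_mispredict(accuracy, instructions_per_branch):
    """Average number of instructions executed between two mispredicted branches."""
    return instructions_per_branch / (1.0 - accuracy)


def mispredict_cpi(penalty_cycles, insts_per_mispredict):
    """CPI contribution of mispredictions: penalty cycles amortized over instructions."""
    return penalty_cycles / insts_per_mispredict


BRANCH_EVERY = 5  # assumption: one branch per five instructions (not on the slide)

# (label, accuracy, penalty in cycles) with accuracy and penalty from the slide.
for label, accuracy, penalty in [("Pentium 4", 0.95, 20), ("2008 goal", 0.98, 25)]:
    ipm = instructions_per_mispredict(accuracy, BRANCH_EVERY)
    cpi = mispredict_cpi(penalty, ipm)
    print(f"{label}: ~{ipm:.0f} instructions/mispredict, ~{cpi:.2f} CPI")
```

Under these assumptions the research target tolerates a longer penalty (25 cycles instead of 20) yet halves the CPI cost, because mispredictions become 2.5 times rarer.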
Data Cache and Prefetching
[Figure: superscalar pipeline diagram. Labeled blocks: I-cache, Branch Predictor, Decode Buffer, Decode, Dispatch Buffer, Dispatch, Reservation Stations, functional units (branch, integer, integer, floating point, store, load), Completion Buffer, Complete, Store Buffer, Data Cache, Main Memory, Memory Reference Prediction, Prefetch Queue.]
Cache Hierarchy Technology
• Current Commercial Workload (6 cycles/load)
– L1 Hits: 80% x 2 cycles = 1.6
– L2 Hits: 15% x 10 cycles = 1.5
– L3 Hits: 4% x 30 cycles = 1.2
– Memory: 1% x 150 cycles = 1.5
• Future Commercial Workload (17 cycles/load)
– L1 Hits: 80% x 4 cycles = 3.2
– L2 Hits: 15% x 20 cycles = 3.0
– L3 Hits: 4% x 60 cycles = 2.4
– Memory: 1% x 800 cycles = 8.0
• Current Research Challenge (5 cycles/load)
– Efficient and judicious caches
– Load partitioning and specialized caching
– Aggressive memory prefetching
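The per-load averages above are weighted sums of hit fraction times latency. A minimal sketch of that arithmetic, using the slide's fractions and latencies (the function name `cycles_per_load` is hypothetical):

```python
def cycles_per_load(levels):
    """Average load latency: sum over levels of (fraction of loads x latency in cycles)."""
    return sum(fraction * latency for fraction, latency in levels)


# (fraction of loads served, latency in cycles) for L1, L2, L3, memory; from the slide.
current = [(0.80, 2), (0.15, 10), (0.04, 30), (0.01, 150)]
future = [(0.80, 4), (0.15, 20), (0.04, 60), (0.01, 800)]

print(f"current: {cycles_per_load(current):.1f} cycles/load")
print(f"future:  {cycles_per_load(future):.1f} cycles/load")
```

The sums come to about 5.8 and 16.6 cycles, which the slide rounds to 6 and 17. In the future case the 1% of loads that reach memory account for nearly half of the total.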
Cache Prefetching:
• Hardware: Limited by predictable patterns
• Software: Limited by single control flow
• Research Challenge: Pointer-intensive code
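To illustrate why pointer-intensive code is hard for pattern-based hardware prefetching, here is a toy last-stride predictor. It is only a sketch: it is not claimed to be the prediction mechanism in the talk's figure, and both address traces are made up for illustration.

```python
def stride_predictor_hits(addresses):
    """Count accesses whose address equals the previous address plus the previous stride."""
    hits = 0
    for i in range(2, len(addresses)):
        stride = addresses[i - 1] - addresses[i - 2]
        if addresses[i - 1] + stride == addresses[i]:
            hits += 1
    return hits


# Hypothetical trace: walking an array of 8-byte elements (constant stride).
array_walk = [0x1000 + 8 * i for i in range(10)]

# Hypothetical trace: following linked-list nodes scattered across the heap.
list_walk = [0x1000, 0x5F40, 0x2210, 0x9A08, 0x1338,
             0x7C50, 0x30E0, 0x88B8, 0x4470, 0x6D28]

print("array walk hits:", stride_predictor_hits(array_walk), "of", len(array_walk) - 2)
print("list walk hits: ", stride_predictor_hits(list_walk), "of", len(list_walk) - 2)
```

The array walk is fully predictable once two accesses establish the stride, while the list walk never repeats a stride, so this kind of predictor has nothing to learn from it.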
Frequency vs. Parallelism
• Increase Frequency (GHz)
– Deeper Pipelines
– Increases Branch/Load penalties
– Lowers IPC
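The trade-off in this list can be sketched with a deliberately crude model. Every parameter here is a hypothetical assumption, not data from the talk: clock frequency scales linearly with pipeline depth, the misprediction penalty equals the pipeline depth in cycles, the base CPI is 0.5, and there is one mispredict per 100 instructions (the one figure borrowed from an earlier slide).

```python
def pipeline_tradeoff(depth, base_cpi=0.5, insts_per_mispredict=100, ghz_per_stage=0.1):
    """Return (IPC, GHz, IPC x GHz) for a pipeline of the given depth under the toy model."""
    ghz = depth * ghz_per_stage                # assumption: frequency grows with depth
    cpi = base_cpi + depth / insts_per_mispredict  # assumption: penalty == depth cycles
    ipc = 1.0 / cpi
    return ipc, ghz, ipc * ghz


for depth in (10, 20):
    ipc, ghz, perf = pipeline_tradeoff(depth)
    print(f"depth {depth}: IPC {ipc:.2f}, {ghz:.1f} GHz, IPC x GHz = {perf:.2f}")
```

In this model, doubling the depth doubles the frequency but lowers IPC, so IPC x GHz improves by less than 2X. Real designs also have load penalties and latch overhead, which the sketch ignores.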
How Much ILP Is There?
Weiss and Smith [1984]        1.58
Sohi and Vajapeyam [1987]     1.81
Tjaden and Flynn [1970]       1.86
Tjaden and Flynn [1973]       1.96
Uht [1986]                    2.00
Smith et al. [1989]           2.00
Jouppi and Wall [1988]        2.40
Johnson [1991]                2.50
Acosta et al. [1986]          2.79
Wedig [1982]                  3.00
Butler et al. [1991]          5.8
Melvin and Patt [1991]        6
Wall [1991]                   7
Kuck et al. [1972]            8
Riseman and Foster [1972]     51
Nicolau and Fisher [1984]     90
SPECint95 Landscape
Landscape of Microprocessor Families
[Figure: SPECint95/MHz (0 to 0.08) plotted against Frequency (MHz, 80 to 980), with contour labels for SPECint95 from 5 to 60. Families: Alpha, AMD-x86, Intel-x86. Labeled points: 064, 164, 264, Athlon, PPro, P, PII, PIII. Data source: www.spec.org. Credit: Bryan Black.]
SPECint2000 Landscape
Landscape of Microprocessor Families
[Figure: SPECint2000/MHz (0 to 1) plotted against Frequency (MHz, 0 to 2500), with contour labels for SPECint2000 from 25 to 800. Families: Intel-x86, AMD-x86, Alpha, PowerPC, Sparc, IPF. Labeled points: PIII-Xeon, P4, Athlon, 264A, 264B, 264C, Sparc-III, 604e, Itanium. Data source: www.spec.org. Credit: Bryan Black.]
Parallelism in Transition
[Figure: MIPS (log scale, 1 to 1,000,000) plotted against year (1980 to 2010). Labeled points: Pentium® Architecture (Super Scalar), Pentium® Pro Architecture (Speculative Out of Order), Pentium® 4 Architecture (Trace Cache), Future Xeon™ Architecture (Multi-Threaded), and Multi-Threaded, Multi-Core. Regions: Era of Instruction Parallelism, Era of Thread Parallelism.]
Summary
Performance Demand Continues
• 5-10 billion transistors by 2010
• 10-20 GHz by 2010
Challenge Is Power and Efficiency
• Power dissipation, delivery, density
• New clever/efficient implementations
New Frontiers to Explore
• Synergism of ILP, TLP, and MLP
• “Semi-Custom” Microarchitectures