Continuous Runahead: Transparent Hardware Acceleration for Memory Intensive Workloads
Milad Hashemi, Onur Mutlu, Yale N. Patt
UT Austin/Google, ETH Zürich, UT Austin
October 19th, 2016
Outline
• Overview of Runahead
• Runahead Limitations
• Continuous Runahead Dependence Chains
• Continuous Runahead Engine
• Continuous Runahead Evaluation
• Conclusions
Runahead Execution Overview
• Runahead dynamically expands the instruction window when the pipeline is stalled [Mutlu et al., 2003]
• The core checkpoints architectural state
• The result of the memory operation that caused the stall is marked as poisoned in the physical register file
• The core continues to fetch and execute instructions
• Operations are discarded instead of retired
• The goal is to generate new independent cache misses
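The bullets above can be sketched in software. The following is a minimal, illustrative Python model of one runahead episode (the data structures and toy ALU semantics are assumptions for illustration, not the hardware design from the talk):

```python
# Minimal sketch of a runahead interval, assuming a toy register file
# (a dict) and a list of (op, source_regs, dest_reg) instructions.
POISON = object()  # sentinel marking an invalid (poisoned) value

def runahead(instrs, regs, missed_reg):
    """Pretend-execute past a stalling load: the stalling load's destination
    is poisoned, poison propagates to dependents, and independent loads are
    issued as prefetches. Results are discarded, not retired."""
    checkpoint = dict(regs)      # checkpoint architectural state
    regs[missed_reg] = POISON    # result of the stalling load is unknown
    prefetches = []
    for op, srcs, dst in instrs:
        if any(regs.get(s) is POISON for s in srcs):
            regs[dst] = POISON   # poison propagates to dependent operations
        elif op == "LD":
            prefetches.append(regs[srcs[0]])  # independent miss -> prefetch
            regs[dst] = POISON   # value unknown until memory returns
        else:
            regs[dst] = sum(regs[s] for s in srcs)  # toy ALU operation
    regs.clear()
    regs.update(checkpoint)      # discard all runahead state on exit
    return prefetches
```

In this sketch, an `ADD` that reads the poisoned register produces a poisoned result, while a load whose address does not depend on the miss generates a new independent prefetch, which is the stated goal of runahead.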
Traditional Runahead Accuracy
[Chart: prefetch request accuracy (0–100%) for Runahead, GHB, Stream, and Markov+Stream]
Runahead is 95% accurate.
Traditional Runahead Prefetch Coverage
[Chart: % of independent cache misses prefetched (0–100%)]
Runahead has only 13% prefetch coverage.
Traditional Runahead Performance Gain
[Chart: % IPC improvement over a no-prefetching baseline; Runahead vs. Oracle]
Runahead has a 12% performance gain; a Runahead Oracle has an 85% performance gain.
Traditional Runahead Interval Length
[Chart: cycles per runahead interval for 128-, 256-, 512-, and 1024-entry ROBs]
Runahead intervals are short, leading to low performance gain.
Continuous Runahead Challenges
• Which instructions to use during Continuous Runahead?
  • Dynamically target the dependence chains that lead to critical cache misses
• What hardware to use for Continuous Runahead?
• How long should chains pre-execute for?
Dependence Chains
The chain of operations that leads to a cache miss:
LD [R3] -> R5
ADD R4, R5 -> R9
ADD R9, R1 -> R6
LD [R6] -> R8   (Cache Miss)
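A chain like the one above can be recovered by walking backward through the instruction stream from the miss-causing load, collecting the producers of each needed source register. The following is a simplified Python sketch (the list-of-tuples representation is an assumption; the hardware performs this walk over buffered micro-ops):

```python
# Sketch of extracting the dependence chain for a miss-causing load by
# walking backward through instructions in program order.
def extract_chain(history, miss_idx):
    """history: list of (op, src_regs, dst_reg) in program order.
    Returns the instructions the load at miss_idx transitively depends on,
    in program order, including the load itself."""
    needed = set(history[miss_idx][1])   # source registers of the load
    chain = [history[miss_idx]]
    for instr in reversed(history[:miss_idx]):
        op, srcs, dst = instr
        if dst in needed:                # this op produces a needed value
            chain.append(instr)
            needed.discard(dst)
            needed.update(srcs)          # its sources are now needed too
    chain.reverse()                      # back to program order
    return chain
```

Applied to the example above, every listed instruction feeds the final load, so the extracted chain is the full sequence ending in `LD [R6] -> R8`.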
Dependence Chain Selection Policies
Experiment with 3 policies to determine the best policy to use for Continuous Runahead:
• PC-Based Policy: use the dependence chain that has caused the most misses for the PC that is blocking retirement
• Maximum Misses Policy: use a dependence chain from the PC that has generated the most misses for the application
• Stall Policy: use a dependence chain from the PC that has caused the most full-window stalls for the application
Dependence Chain Selection Policies
[Chart: % IPC improvement for the Runahead Buffer, PC Policy, Maximum-Misses Policy, and Stall Policy]
Why does the Stall Policy Work?
[Chart: number of PCs (0–1000) accounting for 90% of stalls, all stalls, and all misses]
19 PCs cover 90% of all stalls.
Constrained Dependence Chain Storage
[Chart: normalized performance when storing 1, 2, 4, 8, 16, or 32 dependence chains]
Storing 1 chain provides 95% of the performance.
Continuous Runahead Chain Generation
Maintain two structures:
• A 32-entry cache of PCs to track the operations that cause the pipeline to frequently stall
• The last dependence chain for the PC that has caused the most full-window stalls
At every full-window stall:
• Increment the counter of the PC that caused the stall
• Generate a dependence chain for the PC that has caused the most stalls
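The two structures above can be sketched as follows. This is an illustrative Python model only; in particular, the LRU eviction policy for the 32-entry PC cache and the class/method names are assumptions, not details given in the talk:

```python
from collections import OrderedDict

# Sketch of the Stall Policy tracking structures: a small cache of
# stall counters keyed by PC, plus the last dependence chain for the
# hottest (most-stalling) PC. Eviction policy is assumed to be LRU.
class StallTracker:
    def __init__(self, capacity=32):
        self.counts = OrderedDict()  # PC -> number of full-window stalls
        self.capacity = capacity
        self.hot_chain = None        # last chain for the most-stalling PC

    def on_full_window_stall(self, pc, chain_for_pc):
        # increment the counter of the PC that caused this stall
        self.counts[pc] = self.counts.get(pc, 0) + 1
        self.counts.move_to_end(pc)
        if len(self.counts) > self.capacity:
            self.counts.popitem(last=False)  # evict least-recently-used PC
        # (re)generate the stored chain if this PC now has the most stalls
        top_pc = max(self.counts, key=self.counts.get)
        if top_pc == pc:
            self.hot_chain = chain_for_pc
```

The stored `hot_chain` is what gets migrated to the CRE and executed continuously.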
Runahead for Longer Intervals
[Diagram: quad-core chip (Core 0–Core 3) with a shared Continuous Runahead Engine (CRE), four LLC slices, and two DRAM channels]
CRE Microarchitecture
• No front-end
• No register renaming hardware
• 32 physical registers
• 2-wide
• No floating point or vector pipeline
• 4 kB data cache
Dependence Chain Generation
[Animation: the chain is walked backward from the miss; each core physical register (P1–P9) is remapped to a CRE physical register (E0–E5) through a register remapping table, using a search list of registers still awaiting a producer]
Core chain (core physical registers):
ADD P7 + 1 -> P1
SHIFT P1 -> P9
ADD P9 + P1 -> P3
SHIFT P3 -> P2
LD [P2] -> P8
Dependence Chain Generation
Remapped chain in CRE physical registers:
ADD E5 + 1 -> E3
SHIFT E3 -> E4
ADD E4 + E3 -> E2
SHIFT E2 -> E1
MEM_LD [E1] -> E0
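The remapping step can be sketched in Python. This is a simplified model of the idea only: CRE register names are allocated on first encounter during the backward walk, so the exact E-register numbering may differ from the slide's figure depending on the order in which operands are visited; immediates are assumed to pass through unchanged:

```python
# Sketch of remapping a backward-extracted chain from core physical
# registers (P*) onto the CRE's small register file (E0, E1, ...).
def remap_chain(chain_backward):
    """chain_backward: instructions from the miss back to the chain root,
    each (op, src_operands, dst_reg) using core physical register names.
    Returns the chain in forward order using CRE register names."""
    table, next_e = {}, 0
    remapped = []

    def cre_reg(p):
        nonlocal next_e
        if isinstance(p, int):
            return p                 # immediates pass through unchanged
        if p not in table:           # allocate a new CRE register
            table[p] = f"E{next_e}"
            next_e += 1
        return table[p]

    for op, srcs, dst in chain_backward:
        new_dst = cre_reg(dst)       # destination is seen first going backward
        new_srcs = [cre_reg(s) for s in srcs]
        remapped.append((op, new_srcs, new_dst))
    remapped.reverse()               # emit the chain in forward dataflow order
    return remapped
```

Because only registers that actually appear in the chain are allocated, the chain fits in the CRE's 32-entry physical register file without full renaming hardware.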
Interval Length
[Chart: Continuous Runahead request accuracy and geometric-mean performance gain vs. chain update interval, from 1k to 2M instructions retired]
System Configuration
• Cores: single-core/quad-core, 4-wide issue, 256-entry reorder buffer, 92-entry reservation station
• Caches: 32 KB 8-way set-associative L1 I/D-caches; 1 MB 8-way set-associative shared last-level cache per core
• Memory: non-uniform-access-latency DDR3 system, 256-entry memory queue, batch scheduling
• Prefetchers: Stream, Global History Buffer (GHB); feedback-directed prefetching with dynamic degree 1–32
• CRE compute: 2-wide issue; 1 Continuous Runahead issue context with a 32-entry buffer and 32-entry physical register file; 4 kB data cache
Single-Core Performance
[Chart: % IPC improvement over the no-prefetching baseline for the Runahead Buffer and Continuous Runahead]
21% single-core performance increase over the prior state of the art.
Single-Core Performance + Prefetching
[Chart: % IPC improvement over the no-prefetching baseline for the Runahead Buffer, Continuous Runahead, Stream PF, GHB PF, Continuous Runahead + Stream, and Continuous Runahead + GHB]
Continuous Runahead increases performance over, and in conjunction with, prefetching.
Independent Miss Coverage
[Chart: % of independent cache misses prefetched (0–100%)]
70% prefetch coverage.
Bandwidth Overhead
[Chart: bandwidth normalized to the baseline for Continuous Runahead, Stream PF, and GHB PF]
Continuous Runahead has low bandwidth overhead.
Multi-Core Performance
[Chart: % weighted speedup improvement on workloads H1–H10 and GMean for Continuous Runahead, Stream PF, and GHB PF]
43% weighted speedup increase.
Multi-Core Performance + Prefetching
[Chart: % weighted speedup improvement on H1–H10 and GMean for Continuous Runahead, Stream PF, GHB PF, Continuous Runahead + Stream, and Continuous Runahead + GHB]
13% weighted speedup gain over GHB prefetching.
Multi-Core Energy Evaluation
[Chart: energy normalized to the no-prefetching baseline on H1–H10 for Continuous Runahead, Stream PF, GHB PF, and the combinations with Continuous Runahead]
22% energy reduction.
Conclusions
• Runahead prefetch coverage is limited by the duration of each runahead interval
• To remove this constraint, we introduce the notion of Continuous Runahead
• We can dynamically identify the most critical LLC misses to target with Continuous Runahead by tracking the operations that cause the pipeline to frequently stall
• We migrate these dependence chains to the CRE, where they are executed continuously in a loop
Conclusions
• Continuous Runahead greatly increases prefetch coverage
• Increases single-core performance by 34.4%
• Increases multi-core performance by 43.3%
• Synergistic with various types of prefetching