Effect of Instruction Fetch and Memory Scheduling on GPU Performance
Nagesh B Lakshminarayana, Hyesoon Kim
Jan 08, 2016
Outline
• Background and Motivation
• Policies
• Experimental Setup
• Results
• Conclusion
GPU Architecture (based on Tesla Architecture)
SM – Streaming Multiprocessor
SP – Scalar Processor
SIMT – Single Instruction Multiple Thread
SM Architecture (based on Tesla Architecture)
• Fetch Mechanism
  – Fetch 1 instruction for the selected warp
  – Stall fetch for a warp when it executes a Load/Store or when it encounters a Branch
• Scheduler Policy
  – Oldest first and in-order (within a warp)
• Caches
  – I-Cache, Shared Memory, Constant Cache and Texture Cache
Handling Multiple Memory Requests
• MSHR/Memory Request Queue
  – Allows merging of memory requests (intra-core)
• DRAM Controller
  – Allows merging of memory requests (inter-core)
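The intra-core merging step above can be sketched as a small, hypothetical MSHR model in Python (the 32B line size and the table structure are assumptions for illustration, not the simulator's actual design):

```python
# Minimal sketch of intra-core request merging via an MSHR table.
# LINE_SIZE and the table layout are assumptions.

LINE_SIZE = 32  # bytes per memory transaction (assumed)

class MSHR:
    def __init__(self):
        self.pending = {}  # line address -> list of requesting warp ids

    def issue(self, warp_id, addr):
        """Return True if a new DRAM request is needed, False if merged."""
        line = addr // LINE_SIZE
        if line in self.pending:
            self.pending[line].append(warp_id)  # merged with in-flight request
            return False
        self.pending[line] = [warp_id]          # new outstanding request
        return True

mshr = MSHR()
print(mshr.issue(0, 0x100))  # True  -> new request to DRAM
print(mshr.issue(1, 0x110))  # False -> same 32B line, merged
```

Requests from different warps that fall into the same line while it is outstanding never reach DRAM, which is exactly the intra-core merging the slide describes.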
Intra-core Merging
Code Example - Intra-Core Merging
• From MonteCarlo in CUDA SDK

for(iSum = threadIdx.x; iSum < SUM_N; iSum += blockDim.x)
{
    …
    for(int i = iSum; i < pathN; i += SUM_N){
        real r = d_Samples[i];
        real callValue = endCallValue(S, X, r, MuByT, VBySqrtT);
        sumCall.Expected += callValue;
        sumCall.Confidence += callValue * callValue;
    }
    …
}
Notation: A(X, Y), where X – Block Id, Y – Thread Id

iSum(0, 2) = iSum(1, 2) = iSum(2, 2) = 2
i(0, 2) = i(1, 2) = i(2, 2) = 2
r(0, 2) = r(1, 2) = r(2, 2) = d_Samples[2]

Multiple blocks are assigned to the same SM, and threads with corresponding Ids in different blocks access the same memory locations, so their requests can be merged within the core.
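Why the addresses coincide can be checked with a small sketch: the loop bounds in the MonteCarlo kernel use only threadIdx.x, never blockIdx.x, so the set of d_Samples indices a thread touches depends only on its thread id (the SUM_N, blockDim.x, and pathN values below are illustrative, not the SDK's):

```python
# Sketch: the MonteCarlo index computation depends only on threadIdx.x,
# so same-id threads in different blocks generate identical indices.
SUM_N = 128       # illustrative value
BLOCK_DIM_X = 64  # illustrative value

def indices(thread_idx, path_n=256):
    """Indices into d_Samples touched by one thread; blockIdx never appears."""
    out = []
    for i_sum in range(thread_idx, SUM_N, BLOCK_DIM_X):
        for i in range(i_sum, path_n, SUM_N):
            out.append(i)
    return out

# Thread 2 of block 0 and thread 2 of block 1 call the same function with
# the same arguments, so their loads hit the same d_Samples locations.
print(indices(2))  # -> [2, 130, 66, 194]
```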
Inter-core Merging
Why look at Fetch?
• Allows implicit control over resources allocated to a warp
• Can control the progress of a warp
• Can boost performance by fetching more for critical warps
• Implicit resource control within a core
Why look at DRAM Scheduling?
• Memory System is a performance bottleneck for several applications
• DRAM scheduling decides the order in which memory requests are granted
• Can prioritize warps based on criticality
• Implicit performance control across cores
By controlling Fetch and DRAM Scheduling we can control performance
How is This Useful?
• Understand applications and their behavior better
• Detect patterns or behavioral groups across applications
• Design new policies for GPGPU applications to improve performance
Outline
• Background and Motivation
• Policies
• Experimental Setup
• Results
• Conclusion
Fetch Policies
• Round Robin (RR) [default in Tesla architecture]
• FAIR
  – Ensures uniform progress of all warps
• ICOUNT [Tullsen’96]
  – Same as ICOUNT in SMT
  – Tries to increase throughput by giving priority to fast moving threads
• Least Recently Fetched (LRF)
  – Prevents starvation of warps
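A minimal sketch of how these selection rules might look inside a simulator; the warp state fields (`ready`, `inflight`, `last_fetch`) are assumptions, not the paper's actual data structures:

```python
# Sketch of warp selection under three of the surveyed fetch policies.
# Warp state fields are hypothetical.

def select_rr(warps, last):
    """Round Robin: next ready warp after the last-selected one."""
    n = len(warps)
    for k in range(1, n + 1):
        w = warps[(last + k) % n]
        if w["ready"]:
            return w["id"]
    return None

def select_icount(warps):
    """ICOUNT: prefer the ready warp with fewest in-flight instructions."""
    ready = [w for w in warps if w["ready"]]
    return min(ready, key=lambda w: w["inflight"])["id"] if ready else None

def select_lrf(warps):
    """LRF: prefer the ready warp fetched least recently."""
    ready = [w for w in warps if w["ready"]]
    return min(ready, key=lambda w: w["last_fetch"])["id"] if ready else None

warps = [
    {"id": 0, "ready": True,  "inflight": 3, "last_fetch": 10},
    {"id": 1, "ready": True,  "inflight": 1, "last_fetch": 12},
    {"id": 2, "ready": False, "inflight": 0, "last_fetch": 5},
]
print(select_rr(warps, last=0), select_icount(warps), select_lrf(warps))
# -> 1 1 0
```

The same warp pool yields different choices under each policy, which is why fetch policy alone can shift which warps make progress.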
New Oracle Based Fetch Policies
• ALL
  – Gives priority to longer warps (total length until termination)
  – Ensures all warps finish at the same time, which results in higher occupancy
Priorities: warp 0 > warp 1 > warp 2 > warp 3
New Oracle Based Fetch Policies
• BAR
  – Gives priority to warps with a greater number of instructions to the next barrier
  – Idea is to reduce wait time at barriers
Priorities: warp 0 > warp 1 > warp 2 > warp 3
Priorities: warp 2 > warp 1 > warp 0 > warp 3
New Oracle Based Fetch Policies
• MEM_BAR
  – Similar to BAR but gives higher priority to warps with more memory instructions
Priorities: warp 0 > warp 2 > warp 1 = warp 3
Priorities: warp 1 > warp 0 = warp 2 > warp 3
Priority(Wa) > Priority(Wb)
  if MemInst(Wa) > MemInst(Wb), or
  if MemInst(Wa) = MemInst(Wb) AND Inst(Wa) > Inst(Wb)
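The MEM_BAR rule above amounts to sorting warps by the tuple (memory instructions, total instructions to the next barrier), both descending. A sketch, assuming the oracle counts are available per warp:

```python
# Sketch of the MEM_BAR priority rule: lexicographic order on
# (memory instructions, instructions to next barrier), highest first.
# The per-warp counts are assumed to come from the oracle.

def mem_bar_key(warp):
    # Python tuple comparison gives exactly the two-level rule:
    # more mem_insts wins; ties broken by more total insts.
    return (warp["mem_insts"], warp["insts"])

warps = [
    {"id": 0, "mem_insts": 4, "insts": 20},
    {"id": 1, "mem_insts": 6, "insts": 10},
    {"id": 2, "mem_insts": 4, "insts": 30},
]
order = sorted(warps, key=mem_bar_key, reverse=True)
print([w["id"] for w in order])  # -> [1, 2, 0]
```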
DRAM Scheduling Policies
• FCFS
• FRFCFS [Rixner’00]
• FR_FAIR (new policy)
  – Row hit with fairness
  – Ensures uniform progress of warps
• REM_INST (new Oracle based policy)
  – Row hit with priority for warps with a greater number of instructions remaining until termination
  – Prioritizes longer warps
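A minimal sketch of FRFCFS selection, the common core of the row-hit-first policies above: row hits are served first, and age breaks ties within a group. The queue-entry format (arrival, bank, row) is an assumption for illustration:

```python
# Sketch of FR-FCFS request selection: row hits first, then oldest.
# Queue entries are hypothetical (arrival_time, bank, row) tuples.

def frfcfs_pick(queue, open_rows):
    """Pick the next request: oldest row hit if any, else oldest overall."""
    hits = [r for r in queue if open_rows.get(r[1]) == r[2]]
    pool = hits if hits else queue
    return min(pool, key=lambda r: r[0])  # oldest within the chosen pool

queue = [(0, 0, 7), (1, 0, 3), (2, 0, 3)]
open_rows = {0: 3}  # bank 0 currently has row 3 open
print(frfcfs_pick(queue, open_rows))  # -> (1, 0, 3): oldest row hit
```

FR_FAIR and REM_INST can be read as replacing the age-based tiebreak with a fairness metric or the oracle's remaining-instruction count, respectively.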
Outline
• Background and Motivation
• Policies
• Experimental Setup
• Results
• Conclusion
Experimental Setup
• Simulated GPU Architecture
  – 8 SMs
  – Frontend: 1 wide, 1KB I-Cache, branch stall
  – Execution: 8 wide SIMD execution unit, in-order scheduling, 4 cycle latency for most instructions
  – Caches: 64KB software managed cache, 8 load accesses/cycle
  – Memory: 32B wide bus, 8 DRAM banks
  – RR fetch, FRFCFS DRAM scheduling (baseline)
• Trace driven, cycle accurate simulator
• Per warp traces generated using GPU Ocelot [Kerr’09]
Benchmarks
• Taken from
  – CUDA SDK 2.2 – MonteCarlo, Nbody, ScalarProd
  – PARBOIL [UIUC’09] – MRI-Q, MRI-FHD, CP, PNS
  – RODINIA [Che’09] – Leukocyte, Cell, Needle
• Classification based on lengths of warps
  – Symmetric, if <= 2% divergence
  – Asymmetric, otherwise (results included in paper)
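A sketch of the symmetric/asymmetric split; the 2% threshold is from the slide, but the exact divergence metric (here, relative spread of warp lengths) is an assumption:

```python
# Sketch of the warp-length classification. The 2% threshold comes from
# the slide; measuring divergence as (max - min) / mean is an assumption.

def classify(warp_lengths, threshold=0.02):
    mean = sum(warp_lengths) / len(warp_lengths)
    spread = (max(warp_lengths) - min(warp_lengths)) / mean
    return "symmetric" if spread <= threshold else "asymmetric"

print(classify([1000, 1005, 998]))  # -> symmetric  (spread ~0.7%)
print(classify([1000, 1500, 400]))  # -> asymmetric (spread > 100%)
```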
Outline
• Background and Motivation
• Policies
• Experimental Setup
• Results
• Conclusion
Results - Symmetric Applications

[Chart: Normalized Execution Duration under FRFCFS DRAM scheduling for fetch policies ICOUNT, LRF, FAIR, ALL, BAR, and MEM_BAR]
• Compute intensive – no variation with different fetch policies
• Memory bound – improvement with fairness oriented fetch policies i.e., FAIR, ALL, BAR, MEM_BAR
Baseline : RR + FRFCFS
Results – Symmetric Applications

[Chart: Normalized Execution Duration under FRFAIR DRAM scheduling for fetch policies RR, ICOUNT, LRF, FAIR, ALL, BAR, and MEM_BAR]
• On average, better than FRFCFS
• MersenneTwister shows huge improvement
• REM_INST DRAM policy performs similar to FRFAIR
Baseline : RR + FRFCFS
Analysis: MonteCarlo
• Fairness oriented fetch policies improve performance by increasing intra-core merging

[Chart: Normalized Execution Time and Ratio of Merged Memory Requests for MonteCarlo under FRFCFS DRAM scheduling]
Analysis: MersenneTwister
• FAIR DRAM scheduling (FRFAIR, REM_INST) improves performance by increasing the DRAM Row Buffer Hit ratio

[Chart: Normalized Execution Time and DRAM Row Buffer Hit Ratio for MersenneTwister, for each fetch policy under FRFCFS, FRFAIR, and REM_INST]
Baseline : RR + FRFCFS
Analysis: BlackScholes
• Fairness oriented fetch policies increase MLP
• The combined increase in MLP and Row Buffer Hit ratio improves performance

[Chart: Normalized Execution Time and MLP/100 for BlackScholes under FRFCFS DRAM scheduling]
Outline
• Background and Motivation
• Policies
• Experimental Setup
• Results
• Conclusion
Conclusion
• Compute intensive applications
  – Fetch and DRAM Scheduling do not matter
• Symmetric memory intensive applications
  – Fairness oriented Fetch (FAIR, ALL, BAR, MEM_BAR) and DRAM policies (FR_FAIR, REM_INST) provide performance improvement
  – MonteCarlo (40%), MersenneTwister (50%), BlackScholes (18%)
• Asymmetric memory intensive applications
  – No correlation between performance and Fetch and DRAM Scheduling policies
THANK YOU!