Improving GPU Performance via Large Warps and Two-Level Warp Scheduling
Veynu Narasiman, The University of Texas at Austin
Michael Shebanow, NVIDIA
Chang Joo Lee, Intel
Rustam Miftakhutdinov, The University of Texas at Austin
Onur Mutlu, Carnegie Mellon University
Yale N. Patt, The University of Texas at Austin
MICRO-44, December 6th, 2011, Porto Alegre, Brazil
Rise of GPU Computing
GPUs have become a popular platform for general purpose applications
New programming models: CUDA, ATI Stream Technology, OpenCL
Order of magnitude speedup over single-threaded CPU
How GPUs Exploit Parallelism
Multiple GPU cores (i.e., Streaming Multiprocessors)
Focus on a single GPU core
Exploit parallelism in 2 major ways:
Threads grouped into warps: single PC per warp; warps executed in SIMD fashion
Multiple warps concurrently executed: round-robin scheduling helps hide long latencies
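The two points above can be sketched in a few lines of Python: one PC per warp, all threads of a warp executing the same instruction, and a round-robin pass over the warps. This is a minimal illustrative model (the names `Warp` and `round_robin_issue` are mine, not from the talk).

```python
# Minimal sketch of SIMD warp execution with round-robin warp scheduling.
# Illustrative model only; class and function names are assumptions.

class Warp:
    def __init__(self, warp_id, num_threads, program):
        self.warp_id = warp_id
        self.pc = 0                      # single PC shared by all threads in the warp
        self.num_threads = num_threads
        self.program = program           # list of instructions (opaque strings here)

    def done(self):
        return self.pc >= len(self.program)

    def issue(self):
        # All threads of the warp execute the same instruction in SIMD fashion.
        instr = self.program[self.pc]
        self.pc += 1
        return (self.warp_id, instr, self.num_threads)

def round_robin_issue(warps):
    """Yield issued instructions, rotating fairly among unfinished warps."""
    idx = 0
    while any(not w.done() for w in warps):
        w = warps[idx % len(warps)]
        idx += 1
        if not w.done():
            yield w.issue()

program = ["add", "mul", "load"]
warps = [Warp(i, 32, program) for i in range(4)]
trace = list(round_robin_issue(warps))
# Warp 0 issues first, then warps 1, 2, 3, then back to warp 0's next instruction.
```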
The Problem
Despite these techniques, computational resources are often underutilized, due to branch divergence and long latency operations
Large Warp Microarchitecture (LWM)
Divergence stack still used; handled at the large warp level
How large should we make the warps? More threads per warp means more potential for sub-warp creation, but too large a warp size can degrade performance
Re-fetch policy for conditional branches: must wait till the last sub-warp finishes
Optimization for unconditional branch instructions: don't create multiple sub-warps; sub-warping always completes in a single cycle
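The sub-warp creation mentioned above can be sketched as follows: a large warp's active mask is viewed as rows of SIMD-width bits, and each cycle one active thread is picked per lane (column) to form a densely packed sub-warp. This is an illustrative model under that assumption, not the exact hardware algorithm from the talk.

```python
# Sketch of dynamic sub-warp creation in a Large Warp Microarchitecture.
# Assumption: each thread maps to lane (tid % simd_width); one active thread
# per lane is selected per cycle to build a densely packed sub-warp.

def create_subwarps(active_mask, simd_width):
    """active_mask: list of 0/1 bits, one per thread of the large warp.
    Returns sub-warps as lists of thread ids (one per lane, None if empty)."""
    # Group active threads by the lane (column) they map to.
    lanes = [[] for _ in range(simd_width)]
    for tid, active in enumerate(active_mask):
        if active:
            lanes[tid % simd_width].append(tid)
    subwarps = []
    # Keep forming sub-warps until every active thread has been issued once.
    while any(lanes):
        subwarp = [lane.pop(0) if lane else None for lane in lanes]
        subwarps.append(subwarp)
    return subwarps

# 8-thread large warp on a 4-wide SIMD backend; threads 0, 1, 4, 5, 6 are active.
mask = [1, 1, 0, 0, 1, 1, 1, 0]
subs = create_subwarps(mask, 4)
# Two densely packed sub-warps instead of two half-empty rows:
# [0, 1, 6, None] and [4, 5, None, None]
```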
Two Level Round Robin Scheduling
Split warps into equal sized fetch groups
Create initial priority among the fetch groups
Round-robin scheduling among warps in same fetch group
When all warps in the highest priority fetch group are stalled: rotate fetch group priorities; the highest priority fetch group becomes the least
Warps arrive at a stalling point at slightly different points in time, enabling better overlap of computation and memory latency
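The selection logic above can be sketched as: round-robin among warps of the highest-priority fetch group, and when the whole group stalls, rotate group priorities so the current top group becomes least. A minimal illustrative model (names are assumptions, not from the talk):

```python
# Sketch of two-level round-robin warp scheduling.
from collections import deque

class TwoLevelScheduler:
    def __init__(self, num_warps, fetch_group_size):
        # Split warps into equal-sized fetch groups; leftmost group is highest priority.
        self.groups = deque(
            deque(range(g, g + fetch_group_size))
            for g in range(0, num_warps, fetch_group_size)
        )

    def pick(self, stalled):
        """stalled: set of warp ids waiting on long-latency operations.
        Returns the warp id to fetch from, or None if every warp is stalled."""
        for _ in range(len(self.groups)):
            group = self.groups[0]
            # Round-robin inside the highest-priority fetch group.
            for _ in range(len(group)):
                w = group[0]
                group.rotate(-1)
                if w not in stalled:
                    return w
            # Entire group stalled: it becomes the lowest-priority group.
            self.groups.rotate(-1)
        return None

sched = TwoLevelScheduler(num_warps=16, fetch_group_size=8)
first = sched.pick(stalled=set())                 # served from group 0 (warps 0-7)
after_stall = sched.pick(stalled=set(range(8)))   # group 0 fully stalled -> group 1
```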
Round Robin vs Two Level Round Robin
[Timeline figure: with Round Robin Scheduling (16 total warps), all warps compute together, then memory requests for warps 0 through 15 arrive at the memory system back-to-back and the core sits idle until they return. With Two Level Round Robin Scheduling (2 fetch groups, 8 warps each), group 0 computes and issues requests for warps 0-7 while group 1 computes; group 1's requests (warps 8-15) then overlap group 0's next compute phase, saving cycles.]
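The saved cycles can be reproduced with a tiny cycle-level model of the timeline: each warp computes, issues one long-latency memory request, then computes again. Pure round-robin lets every warp reach its request at nearly the same time; two-level scheduling staggers the groups so memory latency overlaps the other group's compute. This is a toy model with assumed parameters, not the talk's simulator.

```python
# Toy model: saved cycles from two-level scheduling vs pure round-robin.
def simulate(num_warps, group_size, compute_cycles, mem_latency):
    # Per-warp remaining work: [compute, memory request, compute].
    phases = [[compute_cycles, mem_latency, compute_cycles] for _ in range(num_warps)]
    groups = [list(range(g, g + group_size)) for g in range(0, num_warps, group_size)]
    ptrs = [0] * len(groups)  # per-group round-robin pointer
    cycle = 0

    def ready(w):
        p = phases[w]
        return p[0] > 0 or (p[1] == 0 and p[2] > 0)

    while any(any(p) for p in phases):
        # Outstanding memory requests are all serviced in parallel.
        for p in phases:
            if p[0] == 0 and p[1] > 0:
                p[1] -= 1
        # Issue one compute cycle: round-robin within the highest-priority
        # fetch group; a fully stalled group drops to lowest priority.
        for _ in range(len(groups)):
            group, issued = groups[0], False
            for _ in range(len(group)):
                w = group[ptrs[0] % len(group)]
                ptrs[0] += 1
                if ready(w):
                    p = phases[w]
                    p[0 if p[0] else 2] -= 1
                    issued = True
                    break
            if issued:
                break
            groups.append(groups.pop(0))  # rotate fetch group priorities
            ptrs.append(ptrs.pop(0))
        cycle += 1
    return cycle

# group_size == num_warps degenerates to plain round-robin over all warps.
rr_cycles = simulate(num_warps=16, group_size=16, compute_cycles=4, mem_latency=100)
tl_cycles = simulate(num_warps=16, group_size=8, compute_cycles=4, mem_latency=100)
# tl_cycles < rr_cycles: group 1's compute hides part of group 0's memory latency.
```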
More on Two Level Scheduling
What should the fetch group size be? Enough warps to keep the pipeline busy in the absence of long latency stalls
Too small: uneven progression of warps in the same fetch group; destroys data locality among warps
Too large: reduces benefits of two-level scheduling; more warps stall at the same time
Not just for hiding memory latency: complex instructions (e.g., sine, cosine, sqrt, etc.); two-level scheduling allows warps to arrive at such instructions at slightly different points in time
Combining LWM and Two Level Scheduling
4 large warps, 256 threads each; fetch group size = 1 large warp
Problematic for applications with few long latency stalls: no stalls means no fetch group priority changes, so a single large warp is starved; the branch re-fetch policy for large warps creates bubbles in the pipeline
Timeout-invoked fetch group priority change: 32K instruction timeout period; alleviates starvation
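The timeout fallback above can be sketched as a counter on issued instructions: if the highest-priority fetch group keeps issuing without ever stalling, rotate priorities anyway once the budget (32K instructions in the talk) is exhausted, so no large warp is starved. An illustrative sketch; the class and method names are assumptions.

```python
# Sketch of timeout-invoked fetch group priority rotation.
TIMEOUT_INSTRUCTIONS = 32 * 1024  # 32K instruction timeout period from the talk

class TimeoutRotator:
    def __init__(self, num_groups, timeout=TIMEOUT_INSTRUCTIONS):
        self.priority = list(range(num_groups))  # priority[0] = highest
        self.timeout = timeout
        self.issued = 0  # instructions issued since the last rotation

    def on_issue(self):
        # Called once per issued instruction; forces a rotation on timeout
        # even if the highest-priority group never stalls.
        self.issued += 1
        if self.issued >= self.timeout:
            self.rotate()

    def rotate(self):
        # Highest-priority group becomes least; the counter restarts.
        self.priority.append(self.priority.pop(0))
        self.issued = 0

rot = TimeoutRotator(num_groups=4)
for _ in range(TIMEOUT_INSTRUCTIONS):
    rot.on_issue()
# After 32K issued instructions with no stall, group 0 has dropped to last place.
```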
Methodology
Simulate a single GPU core with 1024 thread contexts, divided into 32 warps of 32 threads each
Scalar Front End: 1-wide fetch and decode; 4KB single-ported I-Cache; round-robin scheduling
SIMD Back End: in order, 5 stages, 32 parallel SIMD lanes
Register File and On-Chip Memories: 64KB register file; 128KB, 4-way D-Cache with 128B line size; 128KB, 32-banked private memory
Memory System: open-row, first-come first-serve scheduling; 8 banks, 4KB row buffer per bank; 100-cycle row hit latency, 300-cycle row conflict latency; 32 GB/s memory bandwidth
Overall IPC Results
[Bar chart: IPC of Baseline, TBC, LWM, 2Lev, and LWM+2Lev for blackjack, sort, viterbi, kmeans, decrypt, blackscholes, needleman, hotspot, matrix_mult, reduction, histogram, bfs (plotted on its own 0-0.6 scale), and gmean.]
LWM+2Lev improves performance by 19.1% over baseline and by 11.5% over TBC
IPC and Computational Resource Utilization
[Charts: IPC and computational resource utilization (fraction of cycles in which 0, 1 to 7, 8 to 15, 16 to 23, 24 to 31, or all 32 SIMD lanes are active) for blackjack and for histogram, comparing baseline, LWM, 2LEV, and LWM+2LEV.]
Conclusion
For maximum performance, the computational resources on GPUs must be effectively utilized
Branch divergence and long latency operations cause them to be underutilized or unused
We proposed two mechanisms to alleviate this: the Large Warp Microarchitecture for branch divergence, and two-level scheduling for long latency operations
Improves performance by 19.1% over traditional GPU cores and increases the scope of applications that can run efficiently on a GPU