Improving GPU Performance via Large Warps and Two-Level Warp Scheduling
Veynu Narasiman, The University of Texas at Austin
Michael Shebanow, NVIDIA
Chang Joo Lee, Intel
Rustam Miftakhutdinov, The University of Texas at Austin
Onur Mutlu, Carnegie Mellon University
Yale N. Patt, The University of Texas at Austin
MICRO-44, December 6th, 2011, Porto Alegre, Brazil
Rise of GPU Computing
GPUs have become a popular platform for general purpose applications
New programming models: CUDA, ATI Stream Technology, OpenCL
Order of magnitude speedup over single-threaded CPU
How GPUs Exploit Parallelism
Multiple GPU cores (i.e., Streaming Multiprocessors)
Focus on a single GPU core
Exploit parallelism in 2 major ways:
Threads grouped into warps: single PC per warp; warps executed in SIMD fashion
Multiple warps concurrently executed: round-robin scheduling helps hide long latencies
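The two points above can be sketched in a few lines of Python: one PC per warp, all threads of a warp executing the same instruction, and a round-robin pass over the warps. This is a minimal illustrative model (the names `Warp` and `round_robin_issue` are mine, not from the talk).

```python
# Minimal sketch of SIMD warp execution with round-robin warp scheduling.
# Illustrative model only; class and function names are assumptions.

class Warp:
    def __init__(self, warp_id, num_threads, program):
        self.warp_id = warp_id
        self.pc = 0                      # single PC shared by all threads in the warp
        self.num_threads = num_threads
        self.program = program           # list of instructions (opaque strings here)

    def done(self):
        return self.pc >= len(self.program)

    def issue(self):
        # All threads of the warp execute the same instruction in SIMD fashion.
        instr = self.program[self.pc]
        self.pc += 1
        return (self.warp_id, instr, self.num_threads)

def round_robin_issue(warps):
    """Yield issued instructions, rotating fairly among unfinished warps."""
    idx = 0
    while any(not w.done() for w in warps):
        w = warps[idx % len(warps)]
        idx += 1
        if not w.done():
            yield w.issue()

program = ["add", "mul", "load"]
warps = [Warp(i, 32, program) for i in range(4)]
trace = list(round_robin_issue(warps))
# Warp 0 issues first, then warps 1, 2, 3, then back to warp 0's next instruction.
```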
The Problem
Despite these techniques, computational resources are often underutilized, due to branch divergence and long latency operations
Large Warp Microarchitecture (LWM)
Divergence stack still used; handled at the large warp level
How large should we make the warps? More threads per warp means more potential for sub-warp creation, but too large a warp size can degrade performance
Re-fetch policy for conditional branches: must wait till the last sub-warp finishes
Optimization for unconditional branch instructions: don't create multiple sub-warps; sub-warping always completes in a single cycle
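The sub-warp creation mentioned above can be sketched as follows: a large warp's active mask is viewed as rows of SIMD-width bits, and each cycle one active thread is picked per lane (column) to form a densely packed sub-warp. This is an illustrative model under that assumption, not the exact hardware algorithm from the talk.

```python
# Sketch of dynamic sub-warp creation in a Large Warp Microarchitecture.
# Assumption: each thread maps to lane (tid % simd_width); one active thread
# per lane is selected per cycle to build a densely packed sub-warp.

def create_subwarps(active_mask, simd_width):
    """active_mask: list of 0/1 bits, one per thread of the large warp.
    Returns sub-warps as lists of thread ids (one per lane, None if empty)."""
    # Group active threads by the lane (column) they map to.
    lanes = [[] for _ in range(simd_width)]
    for tid, active in enumerate(active_mask):
        if active:
            lanes[tid % simd_width].append(tid)
    subwarps = []
    # Keep forming sub-warps until every active thread has been issued once.
    while any(lanes):
        subwarp = [lane.pop(0) if lane else None for lane in lanes]
        subwarps.append(subwarp)
    return subwarps

# 8-thread large warp on a 4-wide SIMD backend; threads 0, 1, 4, 5, 6 are active.
mask = [1, 1, 0, 0, 1, 1, 1, 0]
subs = create_subwarps(mask, 4)
# Two densely packed sub-warps instead of two half-empty rows:
# [0, 1, 6, None] and [4, 5, None, None]
```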
Two Level Round Robin Scheduling
Split warps into equal sized fetch groups
Create initial priority among the fetch groups
Round-robin scheduling among warps in same fetch group
When all warps in the highest priority fetch group are stalled: rotate fetch group priorities; the highest priority fetch group becomes the least
Warps arrive at a stalling point at slightly different points in time, enabling better overlap of computation and memory latency
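The selection logic above can be sketched as: round-robin among warps of the highest-priority fetch group, and when the whole group stalls, rotate group priorities so the current top group becomes least. A minimal illustrative model (names are assumptions, not from the talk):

```python
# Sketch of two-level round-robin warp scheduling.
from collections import deque

class TwoLevelScheduler:
    def __init__(self, num_warps, fetch_group_size):
        # Split warps into equal-sized fetch groups; leftmost group is highest priority.
        self.groups = deque(
            deque(range(g, g + fetch_group_size))
            for g in range(0, num_warps, fetch_group_size)
        )

    def pick(self, stalled):
        """stalled: set of warp ids waiting on long-latency operations.
        Returns the warp id to fetch from, or None if every warp is stalled."""
        for _ in range(len(self.groups)):
            group = self.groups[0]
            # Round-robin inside the highest-priority fetch group.
            for _ in range(len(group)):
                w = group[0]
                group.rotate(-1)
                if w not in stalled:
                    return w
            # Entire group stalled: it becomes the lowest-priority group.
            self.groups.rotate(-1)
        return None

sched = TwoLevelScheduler(num_warps=16, fetch_group_size=8)
first = sched.pick(stalled=set())                 # served from group 0 (warps 0-7)
after_stall = sched.pick(stalled=set(range(8)))   # group 0 fully stalled -> group 1
```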
Round Robin vs Two Level Round Robin
[Timeline figure: with Round Robin Scheduling (16 total warps), all warps compute together, then memory requests for warps 0 through 15 arrive at the memory system back-to-back and the core sits idle until they return. With Two Level Round Robin Scheduling (2 fetch groups, 8 warps each), group 0 computes and issues requests for warps 0-7 while group 1 computes; group 1's requests (warps 8-15) then overlap group 0's next compute phase, saving cycles.]
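The saved cycles can be reproduced with a tiny cycle-level model of the timeline: each warp computes, issues one long-latency memory request, then computes again. Pure round-robin lets every warp reach its request at nearly the same time; two-level scheduling staggers the groups so memory latency overlaps the other group's compute. This is a toy model with assumed parameters, not the talk's simulator.

```python
# Toy model: saved cycles from two-level scheduling vs pure round-robin.
def simulate(num_warps, group_size, compute_cycles, mem_latency):
    # Per-warp remaining work: [compute, memory request, compute].
    phases = [[compute_cycles, mem_latency, compute_cycles] for _ in range(num_warps)]
    groups = [list(range(g, g + group_size)) for g in range(0, num_warps, group_size)]
    ptrs = [0] * len(groups)  # per-group round-robin pointer
    cycle = 0

    def ready(w):
        p = phases[w]
        return p[0] > 0 or (p[1] == 0 and p[2] > 0)

    while any(any(p) for p in phases):
        # Outstanding memory requests are all serviced in parallel.
        for p in phases:
            if p[0] == 0 and p[1] > 0:
                p[1] -= 1
        # Issue one compute cycle: round-robin within the highest-priority
        # fetch group; a fully stalled group drops to lowest priority.
        for _ in range(len(groups)):
            group, issued = groups[0], False
            for _ in range(len(group)):
                w = group[ptrs[0] % len(group)]
                ptrs[0] += 1
                if ready(w):
                    p = phases[w]
                    p[0 if p[0] else 2] -= 1
                    issued = True
                    break
            if issued:
                break
            groups.append(groups.pop(0))  # rotate fetch group priorities
            ptrs.append(ptrs.pop(0))
        cycle += 1
    return cycle

# group_size == num_warps degenerates to plain round-robin over all warps.
rr_cycles = simulate(num_warps=16, group_size=16, compute_cycles=4, mem_latency=100)
tl_cycles = simulate(num_warps=16, group_size=8, compute_cycles=4, mem_latency=100)
# tl_cycles < rr_cycles: group 1's compute hides part of group 0's memory latency.
```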
More on Two Level Scheduling
What should the fetch group size be? Enough warps to keep the pipeline busy in the absence of long latency stalls
Too small: uneven progression of warps in the same fetch group; destroys data locality among warps
Too large: reduces benefits of two-level scheduling; more warps stall at the same time
Not just for hiding memory latency: complex instructions (e.g., sine, cosine, sqrt, etc.); two-level scheduling allows warps to arrive at such instructions at slightly different points in time
Combining LWM and Two Level Scheduling
4 large warps, 256 threads each; fetch group size = 1 large warp
Problematic for applications with few long latency stalls: no stalls means no fetch group priority changes, so a single large warp is starved; the branch re-fetch policy for large warps creates bubbles in the pipeline
Timeout-invoked fetch group priority change: 32K instruction timeout period; alleviates starvation
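The timeout fallback above can be sketched as a counter on issued instructions: if the highest-priority fetch group keeps issuing without ever stalling, rotate priorities anyway once the budget (32K instructions in the talk) is exhausted, so no large warp is starved. An illustrative sketch; the class and method names are assumptions.

```python
# Sketch of timeout-invoked fetch group priority rotation.
TIMEOUT_INSTRUCTIONS = 32 * 1024  # 32K instruction timeout period from the talk

class TimeoutRotator:
    def __init__(self, num_groups, timeout=TIMEOUT_INSTRUCTIONS):
        self.priority = list(range(num_groups))  # priority[0] = highest
        self.timeout = timeout
        self.issued = 0  # instructions issued since the last rotation

    def on_issue(self):
        # Called once per issued instruction; forces a rotation on timeout
        # even if the highest-priority group never stalls.
        self.issued += 1
        if self.issued >= self.timeout:
            self.rotate()

    def rotate(self):
        # Highest-priority group becomes least; the counter restarts.
        self.priority.append(self.priority.pop(0))
        self.issued = 0

rot = TimeoutRotator(num_groups=4)
for _ in range(TIMEOUT_INSTRUCTIONS):
    rot.on_issue()
# After 32K issued instructions with no stall, group 0 has dropped to last place.
```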
Methodology
Simulate a single GPU core with 1024 thread contexts, divided into 32 warps of 32 threads each
Scalar Front End: 1-wide fetch and decode; 4KB single-ported I-Cache; round-robin scheduling
SIMD Back End: in order, 5 stages, 32 parallel SIMD lanes
Register File and On-Chip Memories: 64KB register file; 128KB, 4-way D-Cache with 128B line size; 128KB, 32-banked private memory
Memory System: open-row, first-come first-serve scheduling; 8 banks, 4KB row buffer per bank; 100-cycle row hit latency, 300-cycle row conflict latency; 32 GB/s memory bandwidth
Overall IPC Results
[Bar chart: IPC of Baseline, TBC, LWM, 2Lev, and LWM+2Lev for blackjack, sort, viterbi, kmeans, decrypt, blackscholes, needleman, hotspot, matrix_mult, reduction, histogram, bfs (plotted on its own 0-0.6 scale), and gmean.]
LWM+2Lev improves performance by 19.1% over baseline and by 11.5% over TBC
IPC and Computational Resource Utilization
[Charts: IPC and computational resource utilization (fraction of cycles in which 0, 1 to 7, 8 to 15, 16 to 23, 24 to 31, or all 32 SIMD lanes are active) for blackjack and for histogram, comparing baseline, LWM, 2LEV, and LWM+2LEV.]
Conclusion
For maximum performance, the computational resources on GPUs must be effectively utilized
Branch divergence and long latency operations cause them to be underutilized or unused
We proposed two mechanisms to alleviate this: the Large Warp Microarchitecture for branch divergence, and two-level scheduling for long latency operations
Improves performance by 19.1% over traditional GPU cores and increases the scope of applications that can run efficiently on a GPU