Page 1: Improving GPU Performance via Large Warps and Two-Level Warp Scheduling

Veynu Narasiman, The University of Texas at Austin
Michael Shebanow, NVIDIA
Chang Joo Lee, Intel
Rustam Miftakhutdinov, The University of Texas at Austin
Onur Mutlu, Carnegie Mellon University
Yale N. Patt, The University of Texas at Austin

MICRO-44, December 6th, 2011, Porto Alegre, Brazil

Page 2: Rise of GPU Computing

GPUs have become a popular platform for general purpose applications

New programming models: CUDA, ATI Stream Technology, OpenCL

Order of magnitude speedup over single-threaded CPU

Page 3: How GPUs Exploit Parallelism

Multiple GPU cores (i.e., Streaming Multiprocessors)

Focus on a single GPU core

Exploit parallelism in 2 major ways:

Threads grouped into warps: single PC per warp, warps executed in SIMD fashion

Multiple warps executed concurrently: round-robin scheduling helps hide long latencies (see the sketch below)
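As a rough Python sketch of these two mechanisms (my own illustration, not from the talk; the warp and thread counts mirror the baseline configuration later in the deck, and is_ready is an assumed stand-in for the scoreboard/stall checks):

WARP_SIZE = 32
NUM_WARPS = 32

# Threads 0..1023 are grouped into warps of 32; each warp has a single PC.
warps = [{"id": w,
          "threads": list(range(w * WARP_SIZE, (w + 1) * WARP_SIZE)),
          "pc": 0}
         for w in range(NUM_WARPS)]

def next_warp_round_robin(warps, last_issued, is_ready):
    # Scan warps starting after the one that issued last; the first ready
    # warp (e.g., not waiting on memory) issues its next SIMD instruction
    # for all of its active threads.
    for i in range(1, len(warps) + 1):
        w = warps[(last_issued + i) % len(warps)]
        if is_ready(w):
            return w
    return None  # every warp is stalled this cycle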

Page 4: The Problem

Despite these techniques, computational resources can still be underutilized

Two reasons for this:

Branch divergence

Long latency operations

Page 5: Branch Divergence

[Figure: control-flow diagram of a divergent branch. Block A executes with active mask 1111; the branch splits the warp into the taken path B (mask 1001) and the not-taken path C (mask 0110), and the two paths reconverge at block D with mask 1111. A divergence stack entry records the ReconvergePC, ActiveMask, and ExecutePC of the pending path; the current PC/active mask shown is B with mask 1001.]
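A minimal Python sketch of the reconvergence-stack bookkeeping in this figure (my own illustration of the standard mechanism, not code from the talk; block-label PCs and the diverge helper are simplifications):

# Active masks are 4-bit lists (one bit per thread); PCs are block labels.
stack = []  # entries: (ReconvergePC, ActiveMask, ExecutePC)
current_pc, current_mask = "A", [1, 1, 1, 1]

def diverge(taken_mask, taken_pc, not_taken_pc, reconverge_pc, mask):
    # Push the not-taken path so it executes after the taken path reaches
    # the reconvergence point, then continue down the taken path.
    not_taken = [m & (1 - t) for m, t in zip(mask, taken_mask)]
    stack.append((reconverge_pc, not_taken, not_taken_pc))
    return taken_pc, [m & t for m, t in zip(mask, taken_mask)]

# Branch at the end of A: threads 0 and 3 take it (mask 1001).
current_pc, current_mask = diverge([1, 0, 0, 1], "B", "C", "D", current_mask)
# When B reaches D, pop the stack and run C with mask 0110; when C reaches D,
# reconverge and continue at D with the full mask 1111.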

Page 6: Long Latency Operations

[Figure: execution timeline for round-robin scheduling with 16 total warps. All warps compute in lockstep, then warps 0 through 15 issue their memory requests (Req Warp 0 ... Req Warp 15) at roughly the same time, so the core sits idle until the requests return and all warps compute again.]

Page 7: Computational Resource Utilization

[Figure: stacked bars (0% to 100%) showing the fraction of cycles in which 32, 24 to 31, 16 to 23, 8 to 15, 1 to 7, or 0 SIMD lanes are active. Configuration: 32 warps, 32 threads per warp, SIMD width = 32, round-robin scheduling. All 32 lanes active is good; 0 lanes active is bad.]

Page 8: Large Warp Microarchitecture (LWM)

Alleviates branch divergence

Fewer but larger warps: warp size much greater than SIMD width

Total thread count and SIMD width stay the same

Dynamically breaks down a large warp into sub-warps that can be executed on the existing SIMD pipeline

Rearranges the active mask as a 2D structure (number of columns = SIMD width); each column is searched for an active thread to create a new sub-warp (see the sketch below)
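A minimal Python sketch of this column-packing step (my own illustration, not the authors' hardware; the SIMD width of 4 and the 8-row mask are taken from the example on the next slide):

SIMD_WIDTH = 4

def make_subwarps(active_mask):
    # active_mask: rows of the large warp's 2D active mask, each row a list
    # of SIMD_WIDTH 0/1 bits (one bit per SIMD lane / column).
    columns = [[row for row in range(len(active_mask)) if active_mask[row][col]]
               for col in range(SIMD_WIDTH)]
    subwarps = []
    while any(columns):
        # Take at most one active thread from every column, so each lane of
        # the resulting sub-warp executes a different thread (or stays idle).
        subwarps.append([col.pop(0) if col else None for col in columns])
    return subwarps

# The 8x4 mask from the example slide packs into three sub-warps.
mask = [[1, 0, 0, 1], [0, 1, 0, 0], [0, 0, 1, 1], [1, 0, 0, 0],
        [0, 0, 1, 0], [0, 1, 0, 0], [1, 0, 0, 1], [0, 1, 0, 0]]
print(make_subwarps(mask))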

Page 9: Large Warp Microarchitecture Example

[Figure: worked example at the decode stage. A large warp's 2D active mask (8 rows of 4 bits: 1001, 0100, 0011, 1000, 0010, 0100, 1001, 0100) is packed column by column into three sub-warp masks, each with at most one active thread per SIMD lane.]

Page 10: More Large Warp Microarchitecture

Divergence stack still used, handled at the large-warp level

How large should we make the warps? More threads per warp means more potential for sub-warp creation, but too large a warp size can degrade performance

Re-fetch policy for conditional branches: must wait until the last sub-warp finishes

Optimization for unconditional branch instructions: don't create multiple sub-warps, so sub-warping always completes in a single cycle

Page 11: Two Level Round Robin Scheduling

Split warps into equal sized fetch groups

Create initial priority among the fetch groups

Round-robin scheduling among warps in same fetch group

When all warps in the highest-priority fetch group are stalled, rotate the fetch group priorities; the highest-priority fetch group becomes the lowest

Warps arrive at a stalling point at slightly different points in time, giving better overlap of computation and memory latency (see the sketch below)
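A minimal Python sketch of the selection policy (my own model, not the authors' implementation; fetch_groups, last_issued, and is_ready are assumed bookkeeping): round-robin within the highest-priority fetch group, with group priorities rotated only when every warp in that group is stalled:

def pick_warp(fetch_groups, last_issued, is_ready):
    # fetch_groups: list of lists of warp ids, ordered by current priority
    # (fetch_groups[0] is the highest-priority fetch group).
    group = fetch_groups[0]
    # First level: round-robin among the warps of the highest-priority group.
    for i in range(1, len(group) + 1):
        w = group[(last_issued + i) % len(group)]
        if is_ready(w):
            return w
    # Second level: every warp in the group is stalled (e.g., on memory),
    # so rotate priorities; the highest-priority group becomes the lowest.
    fetch_groups.append(fetch_groups.pop(0))
    return None  # a lower-priority group leads on the next cycle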

Page 12: Round Robin vs Two Level Round Robin

[Figure: two execution timelines. Top: round-robin scheduling with 16 total warps; all warps compute together, then all 16 memory requests (Req Warp 0 ... Req Warp 15) are outstanding at once and the core idles. Bottom: two-level round-robin scheduling with 2 fetch groups of 8 warps each; group 0 computes and issues Req Warp 0 ... Req Warp 7 while group 1 computes, and group 1's requests (Req Warp 8 ... Req Warp 15) then overlap with group 0's next compute phase, saving cycles.]

Page 13: More on Two Level Scheduling

What should the fetch group size be?

Enough warps to keep the pipeline busy in the absence of long-latency stalls

Too small: uneven progression of warps in the same fetch group; destroys data locality among warps

Too large: reduces the benefits of two-level scheduling; more warps stall at the same time

Not just for hiding memory latency: for complex instructions (e.g., sine, cosine, sqrt), two-level scheduling also allows warps to arrive at such instructions at slightly different points in time

Page 14: Combining LWM and Two Level Scheduling

4 large warps, 256 threads each; fetch group size = 1 large warp

Problematic for applications with few long-latency stalls: no stalls means no fetch group priority changes, so a single large warp is starved, and the branch re-fetch policy for large warps creates bubbles in the pipeline

Timeout-invoked fetch group priority change: a 32K-instruction timeout period alleviates starvation (see the sketch below)
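A small extension of the earlier scheduling sketch (again my own illustration; the per-instruction counter is assumed bookkeeping) showing the timeout-invoked rotation:

TIMEOUT_INSTRUCTIONS = 32 * 1024  # 32K-instruction timeout from the slide

instructions_since_rotation = 0

def maybe_rotate_on_timeout(fetch_groups):
    # Called once per fetched instruction; forces a fetch group priority
    # change even when no warp stalls, so one large warp cannot starve
    # the others.
    global instructions_since_rotation
    instructions_since_rotation += 1
    if instructions_since_rotation >= TIMEOUT_INSTRUCTIONS:
        fetch_groups.append(fetch_groups.pop(0))
        instructions_since_rotation = 0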

Page 15: Methodology

Scalar front end: 1-wide fetch and decode, 4KB single-ported I-cache, round-robin scheduling

SIMD back end: in order, 5 stages, 32 parallel SIMD lanes

Register file and on-chip memories: 64KB register file; 128KB, 4-way D-cache with 128B line size; 128KB, 32-banked private memory

Memory system: open-row, first-come first-serve scheduling; 8 banks, 4KB row buffer per bank; 100-cycle row hit latency, 300-cycle row conflict latency; 32 GB/s memory bandwidth

Simulate a single GPU core with 1024 thread contexts, divided into 32 warps of 32 threads each

Page 16: Overall IPC Results

[Figure: IPC of Baseline, TBC, LWM, 2Lev, and LWM+2Lev on blackjack, sort, viterbi, kmeans, decrypt, blackscholes, needleman, hotspot, matrix_mult, reduction, and histogram (IPC axis 0 to 35), with separate panels for bfs (IPC axis 0 to 0.6) and the geometric mean, gmean.]

LWM+2Lev improves performance by 19.1% over baseline and by 11.5% over TBC

Page 17: IPC and Computational Resource Utilization

[Figure: four panels comparing baseline, LWM, 2LEV, and LWM+2LEV. Left pair: IPC and computational resource utilization (fraction of cycles with 32, 24 to 31, 16 to 23, 8 to 15, 1 to 7, or 0 active lanes) for blackjack. Right pair: the same two plots for histogram.]

Page 18: Conclusion

For maximum performance, the computational resources on GPUs must be effectively utilized

Branch divergence and long latency operations cause them to be underutilized or unused

We proposed two mechanisms to alleviate this: the Large Warp Microarchitecture for branch divergence, and two-level scheduling for long-latency operations

Improves performance by 19.1% over traditional GPU cores and increases the scope of applications that can run efficiently on a GPU

Questions