ACACES 2018 Summer School GPU Architectures: Basic to Advanced Concepts Adwait Jog, Assistant Professor College of William & Mary (http://adwaitjog.github.io/)
Transcript
Page 1

ACACES 2018 Summer School

GPU Architectures: Basic to Advanced Concepts

Adwait Jog, Assistant Professor

College of William & Mary (http://adwaitjog.github.io/)

Page 2

Course Outline

q Lectures 1 and 2: Basic Concepts
● Basics of GPU Programming
● Basics of GPU Architecture

q Lecture 3: GPU Performance Bottlenecks
● Memory Bottlenecks
● Compute Bottlenecks
● Possible Software and Hardware Solutions

q Lecture 4: GPU Security Concerns
● Timing channels
● Possible Software and Hardware Solutions

Page 3

Streaming Multi-Processor (SM)

[Figure: an SM containing control logic, a scratchpad, a register file, and an array of Processing Elements (PEs); GPU-1 (Device 1) contains multiple such SMs.]

– Threads are assigned to SMs at block granularity

– The SM maintains thread/block indices
– The SM manages and schedules thread execution
– Multiple blocks can be allocated to an SM

– Based on the amount of resources (shared memory, register file, etc.)
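The resource-based limit on blocks per SM can be sketched as a simple calculation. The per-SM limits below are hypothetical placeholders for illustration only; real values vary by GPU generation:

```cpp
#include <algorithm>
#include <climits>

// Hypothetical per-SM resource limits (illustrative; they vary by GPU).
const int kRegsPerSM       = 65536;  // 32-bit registers
const int kSmemPerSM       = 49152;  // bytes of shared memory
const int kMaxThreadsPerSM = 2048;

// How many blocks fit on one SM: whichever resource runs out first decides.
int blocksPerSM(int threadsPerBlock, int regsPerThread, int smemPerBlock) {
    int byRegs    = kRegsPerSM / (regsPerThread * threadsPerBlock);
    int bySmem    = smemPerBlock ? kSmemPerSM / smemPerBlock : INT_MAX;
    int byThreads = kMaxThreadsPerSM / threadsPerBlock;
    return std::min({byRegs, bySmem, byThreads});
}
```

With these placeholder limits, a 256-thread block using 64 registers per thread and 16 KB of shared memory would be capped at 3 blocks per SM by shared memory, not by registers or thread count.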

Page 4

GPU Execution Model

q Blocks assigned to each SM are scheduled on the associated SIMD hardware (i.e., on the Processing Elements (PEs), or CUDA Cores).

q The SM bundles threads (from various blocks) into warps (wavefronts) and runs them in lockstep across PEs.

q An NVIDIA warp groups 32 consecutive threads together (AMD wavefronts group 64 threads).

q Warps are:
● the scheduling units in an SM
● scheduled in a multiplexed, pipelined manner on the SM
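The grouping of consecutive threads into warps is just integer division; a minimal sketch (the function names `warpOf`/`laneOf` are illustrative, not CUDA API):

```cpp
const int kWarpSize = 32;  // NVIDIA warps; AMD wavefronts would use 64

// A thread's warp, and its lane within that warp, from its linearized index.
int warpOf(int tid) { return tid / kWarpSize; }
int laneOf(int tid) { return tid % kWarpSize; }
```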

Page 5

Tolerating Long Latencies

q Execution in an SM: [Figure: timeline of warps W1, W2, W3 alternating between computation and waiting for data from GPU memory.]

q The GPU attempts to hide long memory latency with computation from other warps.

q How are SMs able to context switch between warps so quickly?

Page 6

Key Points so Far

q Programmers organize threads into “blocks” (up to 1024 threads per block)

q Motivation: write parallel software once and run it on future hardware

q Hardware spawns more threads/warps than the GPU can run at once (some may wait)

q Warps associated with different blocks can help in tolerating long latencies.

q GPUs support large register files (for fast context switching) and high-bandwidth memories (for feeding data to a large number of concurrent threads)

Page 7

GPU Architecture Overview

Single-Instruction, Multiple-Threads (SIMT)

[Figure: a GPU consists of several SM clusters (each holding multiple SMs) connected through an interconnection network to multiple memory partitions, each backed by off-chip GDDR5/HBM DRAM.]

Page 8

GPU Microarchitecture

qNot many details are publicly available about GPU microarchitecture.

q The model described next, embodied in GPGPU-Sim, was developed from white papers, programming manuals, IEEE Micro articles, and patents.

Page 9

GPGPU-Sim from UBC – A Cycle-level Simulator

[Figure: HW vs. GPGPU-Sim comparison: scatter plot of GPGPU-Sim IPC versus Quadro FX5800 IPC (both axes 0 to 250), showing a correlation of ~0.976.]

Page 10

GPU Instruction Set Architecture (ISA)

q NVIDIA defines a virtual ISA, called “PTX” (Parallel Thread eXecution)

q More recently, Heterogeneous System Architecture (HSA) Foundation (AMD, ARM, Imagination, Mediatek, Samsung, Qualcomm, TI) defined the HSAIL virtual ISA.

q PTX is a reduced instruction set (e.g., a load/store architecture)

q Virtual: an infinite set of registers (much like a compiler intermediate representation)

q PTX is translated to the hardware ISA by a backend compiler (“ptxas”), either at compile time (nvcc) or at runtime (GPU driver).

Page 11

Some Example PTX Syntax

q Registers declared with a type:

.reg .pred p, q, r;

.reg .u16 r1, r2;

.reg .f64 f1, f2;

q ALU operations

add.u32 x, y, z; // x = y + z

mad.lo.s32 d, a, b, c; // d = a*b + c

q Memory operations:

ld.global.f32 f, [a];

ld.shared.u32 g, [b];

st.local.f64 [c], h;

q Compare and branch operations:

setp.eq.f32 p, y, 0; // is y equal to zero?

@p bra L1 // branch to L1 if y equal to zero

Page 12

Inside an SM (1)

q Fine-grained multithreading
● Interleave warp execution to hide latency
● Register values of all threads stay in the core

[Figure: SM organization: a SIMT front end (fetch, decode, schedule, branch) feeding a SIMD datapath with a register file; completed warps signal done (warp ID); the memory subsystem contains shared memory, an L1 data cache, a texture cache, a constant cache, and the interconnection network port.]

Page 13

Inside an SM (2)

Pipeline stages: Schedule + Fetch, Decode, Register Read, Execute, Memory, Writeback

[Figure: detailed SM pipeline: a SIMT front end (fetch with SIMT stack, I-cache, I-buffer, decode, scoreboard, issue, operand collector) feeding a SIMD datapath of ALU and MEM units; branch outcomes (branch target PC, predicate/active mask, valid bits) update the SIMT stack, and finished warps signal done (WID).]

q Three decoupled warp schedulers

q Scoreboard

q Large register file

q Multiple SIMD functional units


Page 14

Fetch + Decode

q Arbitrate the I-cache among warps
● A cache miss is handled by fetching again later

q The fetched instruction is decoded and then stored in the I-Buffer
● 1 or more entries per warp
● Only warps with vacant entries are considered for fetch

[Figure: fetch/decode hardware: per-warp PCs (PC1, PC2, PC3) feed a selection arbiter (ARB) that picks a warp to access the I-cache; decoded instructions fill per-warp I-buffer entries (Inst. W1/W2/W3, each with valid and ready bits), which the scoreboard and issue arbiter consume.]

Page 15

Instruction Issue

q Select a warp and issue an instruction from its I-Buffer for execution
● Scheduling: Greedy-Then-Oldest (GTO)
● GT200/later Fermi/Kepler: allow dual issue (superscalar)
● Fermi: odd/even scheduler
● To avoid stalling the pipeline, the scheduler might keep an instruction in the I-buffer until it knows it can complete (replay)

[Figure: per-warp I-buffer entries (valid/ready bits) feeding the issue arbiter.]
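The Greedy-Then-Oldest policy named above can be sketched as follows. The `Warp` struct and its `age` field are assumptions for illustration; the real scheduler tracks warp age from block assignment time:

```cpp
#include <climits>
#include <vector>

struct Warp {
    int  id;
    int  age;    // smaller = older (illustrative arrival order)
    bool ready;  // has an issuable instruction in its I-buffer
};

// Greedy-Then-Oldest: keep issuing from the last-issued warp while it stays
// ready; otherwise switch to the oldest ready warp. Returns -1 if none ready.
int pickWarpGTO(const std::vector<Warp>& warps, int lastIssued) {
    for (const Warp& w : warps)
        if (w.id == lastIssued && w.ready) return w.id;      // greedy part
    int best = -1, bestAge = INT_MAX;
    for (const Warp& w : warps)
        if (w.ready && w.age < bestAge) { best = w.id; bestAge = w.age; }
    return best;                                             // oldest part
}
```

Greediness keeps one warp's data hot in the caches; falling back to the oldest warp keeps long-running warps from starving.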

Page 16

Scoreboard

q Checks for RAW and WAW dependency hazards
● Instructions with hazards are flagged as not ready in the I-Buffer (masking them out from the scheduler)

q Instructions reserve their destination registers at issue

q The registers are released at writeback
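The hazard check can be sketched as a set of pending destination registers. This is a simplification for illustration; the real scoreboard is per-warp and sized to the register file:

```cpp
#include <set>
#include <vector>

// Simplified scoreboard: destination registers are reserved at issue and
// released at writeback; an instruction is ready only if no source (RAW)
// or destination (WAW) register is still pending.
struct Scoreboard {
    std::set<int> pending;

    bool ready(const std::vector<int>& srcs, int dst) const {
        for (int r : srcs)
            if (pending.count(r)) return false;  // RAW hazard
        return pending.count(dst) == 0;          // WAW hazard
    }
    void issue(int dst)     { pending.insert(dst); }
    void writeback(int dst) { pending.erase(dst); }
};
```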

Page 17

Operand Collector

[Figure: operand collector; instructions arrive from the instruction issue stage and are dispatched once their operands have been collected.]

Page 18

ALU Pipelines

qSIMD Execution Unit

qFully Pipelined

qEach pipe may execute a subset of instructions

qConfigurable bandwidth and latency (depending on the instruction)

qDefault: SP + SFU pipes

Page 19

Memory Unit

q Models timing for memory instructions

q Supports half-warp (16-thread) access
● Double-clock the unit
● Each cycle services half the warp

q Has a private writeback path

[Figure: memory unit: an address generation unit (AGU) and access coalescer feed shared memory (with bank-conflict handling), the constant cache, the texture cache, and the data cache (with MSHRs), all connected to the memory port.]

Page 20

Writeback

q Each pipeline has a result bus for writeback

q Exception:
● The SP and SFU pipes share a result bus
● Time slots on the shared bus are pre-allocated

Page 21

SM Cluster

q Collection of SIMT cores

Page 22

GPU Architecture Overview

Single-Instruction, Multiple-Threads (SIMT)

[Figure: repeated GPU overview: SM clusters connected through the interconnection network to memory partitions backed by GDDR5/HBM off-chip DRAM.]

Page 23

Interconnection Network Model

q Intersim (Booksim), a flit-level simulator
● Topologies (mesh, torus, butterfly, …)
● Routing (dimension order, adaptive, etc.)
● Flow control (virtual channels, credits)

q Two separate networks:
● From SIMT cores to memory partitions: read requests, write requests
● From memory partitions to SIMT cores: read replies, write acks

Page 24

Topology Examples

[Figure: example interconnect topologies.]

Page 25

GPU Architecture Overview

Single-Instruction, Multiple-Threads (SIMT)

[Figure: repeated GPU overview, highlighting the memory partitions and their GDDR3/GDDR5 off-chip DRAM.]

Page 26

Memory Partition

[Figure: a memory partition: requests arrive from the interconnection network into an ICNT→L2 queue (alongside a ROP queue and atomic operation execution), access an L2 cache bank, and flow through an L2→DRAM queue and a DRAM latency queue to the DRAM access scheduler and DRAM timing model driving the off-chip DRAM channel; replies return through DRAM→L2 and L2→ICNT queues.]

Page 27

DRAM

[Figure: a DRAM chip: a memory controller drives multiple banks, each consisting of a memory array with a row decoder, a row buffer, and a column decoder.]

DRAM Access

• Row access
– Activate a row (page) of a DRAM bank
– Load it into the row buffer

• Column access
– Select and return a block of data from the row buffer

• Precharge
– Write the opened row back into the DRAM array
– Otherwise its contents will be lost!
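The three steps above translate into a simple per-bank latency model. The cycle counts below are illustrative placeholders, not datasheet numbers:

```cpp
// Toy timing model for one DRAM bank: a row-buffer hit needs only a column
// access; a miss pays precharge (if a row is open) plus activate first.
struct Bank {
    static const int tCAS = 4;   // column access (illustrative cycles)
    static const int tRCD = 12;  // activate: row -> row buffer
    static const int tRP  = 12;  // precharge: write row back
    int openRow = -1;            // -1 means no row is open

    int access(int row) {
        if (row == openRow) return tCAS;                // row-buffer hit
        int lat = (openRow == -1) ? tRCD + tCAS         // bank was idle
                                  : tRP + tRCD + tCAS;  // close, then open
        openRow = row;
        return lat;
    }
};
```

Back-to-back accesses to the same row pay only `tCAS`, which is why row access locality (next page) matters so much.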

Page 28

DRAM Row Access Locality

[Figure: a DRAM bank with many rows and a single row buffer.]

tRC = row cycle time
tRP = row precharge time
tRCD = row activate time

[Figure: timeline of a bank alternating between precharging and activating rows A and B; each precharge takes tRP, each activate takes tRCD, and successive activations of different rows are separated by tRC.]

Page 29

DRAM Bank-level Parallelism

• To increase DRAM performance and utilization:
• multiple banks per DRAM chip

• To increase bus width:
• multiple chips per memory controller

Page 30

Scheduling DRAM Requests

• Scheduling policies supported:
• First-In First-Out (FIFO): in-order scheduling
• First-Ready First-Come First-Serve (FR-FCFS): out-of-order scheduling; requires an associative search of the request queue
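A sketch of the FR-FCFS selection step. The request queue layout is an assumption for illustration; real schedulers also separate reads from writes and respect DRAM timing constraints:

```cpp
#include <cstddef>
#include <vector>

struct Req { int row; };  // requests stored in arrival (FCFS) order

// FR-FCFS: pick the oldest request that hits the open row ("first ready");
// if none hits, fall back to the oldest request overall. Returns an index
// into the queue, or -1 if the queue is empty.
int pickFRFCFS(const std::vector<Req>& queue, int openRow) {
    for (std::size_t i = 0; i < queue.size(); ++i)
        if (queue[i].row == openRow) return (int)i;   // row-buffer hit
    return queue.empty() ? -1 : 0;                    // plain FCFS
}
```

Scanning every queue entry for a row match is the associative search the slide mentions; it is what makes FR-FCFS hardware more expensive than FIFO.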

Page 31

Key GPU Performance Concerns

Page 32

Key GPU Performance Concerns

– I) Data transfers between the CPU and GPU are one of the major performance bottlenecks.

– II) Data transfers between SMs and global memory are costly. Can on-chip memory help?

[Figure: a CPU (control, ALUs, cache, CPU memory) connected to a GPU (device) over a bottlenecked link; inside the GPU, SMs with scratchpads, registers, and local memory sit above GPU global memory — a second potential bottleneck.]

Page 33

CUDA Streams

q CUDA (and OpenCL) provide the capability to overlap computation on GPU with memory transfers using “Streams” (Command Queues)

q A Stream orders a sequence of kernels and memory copy “operations”.

q Operations in one stream can overlap with operations in a different stream.

Page 34

How Can Streams Help?

[Figure: timeline comparison. Serial: cudaMemcpy(H2D), then kernel<<<>>>, then cudaMemcpy(D2H); the GPU is idle during both copies. Streams: the work is split into chunks, and each stream's H2D copy, kernel (K0, K1, K2), and D2H copy (DH0, DH1, DH2) overlap with the other streams', yielding a net time savings.]

Page 35

CUDA Streams

cudaStream_t stream[3];

for (i = 0; i < 3; i++)
    cudaStreamCreate(&stream[i]);               // initialize streams

for (i = 0; i < 3; i++) {

    cudaMemcpyAsync(pD + i*size, pH + i*size, size,
                    cudaMemcpyHostToDevice, stream[i]);      // H2D

    MyKernel<<<grid, block, 0, stream[i]>>>(pD + i*size, size); // compute

    cudaMemcpyAsync(pH + i*size, pD + i*size, size,
                    cudaMemcpyDeviceToHost, stream[i]);      // D2H

}

Page 36

Manual CPU ⇔ GPU Data Movement

q Problem #1: The programmer needs to identify the data a kernel needs and insert calls to move it to the GPU

q Problem #2: A pointer on the CPU does not work on the GPU, since they are different address spaces

q Problem #3: The bandwidth connecting CPU and GPU is an order of magnitude smaller than GPU off-chip bandwidth

q Problem #4: The latency to transfer data from CPU to GPU is an order of magnitude higher than GPU off-chip latency

q Problem #5: GPU DRAM is much smaller than CPU main memory

Page 37

Additional Features in CUDA

q Dynamic Parallelism (CUDA 5 onwards): Launch kernels from within a kernel. Reduces work for, e.g., adaptive mesh refinement.

q Unified Memory (CUDA 6 onwards): Avoid need for explicit memory copies between CPU and GPU

http://devblogs.nvidia.com/parallelforall/unified-memory-in-cuda-6/

See also Gelado et al., ASPLOS 2010.

Page 38

Key GPU Performance Concerns

– I) Data transfers between CPU and GPU are one of the major performance bottlenecks.

– II) Data transfers between SMs and global memory are costly. Can on-chip memory help?

[Figure: repeated CPU–GPU diagram; both the CPU–GPU link and SM-to-global-memory traffic are marked as bottlenecks.]

Page 39

Let’s consider some software approaches first, before moving on to hardware approaches.

Page 40

Background: GPU Memory Address Spaces

q The GPU has three address spaces, with increasing visibility of data between threads: local, shared, and global

q In addition, there are two more (read-only) address spaces: constant and texture.

Page 41

Partial Overview of CUDA Memories

– Device code can:
– R/W per-thread registers
– R/W all-shared global memory

– Host code can:
– transfer data to/from per-grid global memory

[Figure: the host connected to a GPU; SMs run blocks (0,0) and (0,1), whose threads (0,0) and (0,1) each have private registers; all blocks share GPU global memory.]

Page 42

CUDA Device Memory Management API functions

– cudaMalloc()
– Allocates an object in device global memory
– Two parameters:
– address of a pointer to the allocated object
– size of the allocated object in bytes

– cudaFree()
– Frees an object from device global memory
– One parameter:
– pointer to the freed object


Page 43

Host-Device Data Transfer API functions

– cudaMemcpy()
– memory data transfer
– Requires four parameters:
– pointer to destination
– pointer to source
– number of bytes copied
– type/direction of transfer


Relatively new features:

q Transfers to the device can be asynchronous

q Explicit memcpy calls can be avoided with the new CUDA Unified Memory

https://devblogs.nvidia.com/unified-memory-cuda-beginners/

Page 44

Local address Space

Each thread has its own “local memory”.

[Figure: per-thread local memories; address 0x42 in each thread's local space names a different location.]

Example: the location at address 100 for thread 0 is different from the location at address 100 for thread 1.

It contains local variables private to a thread.

Page 45

Global Address Spaces

[Figure: thread blocks X and Y both accessing a single global memory; address 0x42 names the same location for every thread.]

Each thread in different thread blocks (even from different kernels) can access a region called “global memory”.

Commonly in GPGPU workloads, each thread writes its own portion of global memory. This avoids the need for synchronization, which is slow, and sidesteps the unpredictability of thread block scheduling.

Page 46

Blocks are partitioned after linearization

– Linearized thread blocks are partitioned into warps
– Thread indices within a warp are consecutive and increasing
– Warp 0 starts with thread 0

– The partitioning scheme is consistent across devices
– Thus you can use this knowledge in control flow
– However, the exact size of warps may change from generation to generation

– DO NOT rely on any ordering within or between warps
– If there are any dependencies between threads, you must use __syncthreads() to get correct results.

Page 47

Warps in Multi-dimensional Thread Blocks

– Thread blocks are first linearized into 1D, in row-major order

– x-dimension first, y-dimension next, and z-dimension last
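Row-major linearization is a small index calculation (the function name is illustrative):

```cpp
// Linearized index of a thread at (x, y, z) in a block of size dimX x dimY x
// dimZ: x varies fastest, then y, then z. Warps are then carved off from this
// linear order in chunks of 32.
int linearTid(int x, int y, int z, int dimX, int dimY) {
    return z * dimY * dimX + y * dimX + x;
}
```

For an 8x4 block, thread (3, 2, 0) gets linear index 19 and therefore lands in warp 0.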

Page 48

Reminder: Kernel, Blocks, Threads

Page 49

“Coalescing” global accesses

q An aligned, consecutive warp access requests a single 128B cache block:

ld.global r1, 0(r2)

[Figure: all of the warp's addresses fall in the 128B block spanning bytes 128–255 — one memory transaction.]

q Memory divergence:

ld.global r1, 0(r2)

[Figure: the warp's addresses scatter across blocks at 128–256 and 1024–1152 — multiple memory transactions.]
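Whether a warp access coalesces can be estimated by counting the distinct 128B segments it touches. This is a simplification of the real coalescing rules, which also depend on compute capability:

```cpp
#include <set>
#include <vector>

// Number of 128-byte segments a warp's byte addresses fall into; one segment
// means the whole access coalesces into a single memory transaction.
int segmentsTouched(const std::vector<unsigned>& addrs) {
    std::set<unsigned> segs;
    for (unsigned a : addrs) segs.insert(a / 128);
    return (int)segs.size();
}
```

Thirty-two threads reading consecutive 4-byte words from an aligned base touch one segment; the same threads reading with a 128-byte stride touch thirty-two.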

Page 50

Example: Transpose (CUDA SDK)

__global__ void transposeNaive(float *odata, float* idata, int width)

{

int xIndex = blockIdx.x * TILE_DIM + threadIdx.x; // TILE_DIM = 16

int yIndex = blockIdx.y * TILE_DIM + threadIdx.y;

int index_in = xIndex + width * yIndex;

int index_out = yIndex + width * xIndex;

for (int i=0; i<TILE_DIM; i+=BLOCK_ROWS) { // BLOCK_ROWS = 16

odata[index_out+i] = idata[index_in+i*width];

}

}

NOTE: “xIndex”, “yIndex”, “index_in”, “index_out”, and “i” are in local memory (local variables are register allocated but stack lives in local memory)

“odata” and “idata” are pointers to global memory(both allocated using calls to cudaMalloc -- not shown above)

[Figure: a 2x2 example: idata {{1, 2}, {3, 4}} becomes odata {{1, 3}, {2, 4}}.]

The write to global memory (odata) in the code above is not “coalesced”.

Page 51

Scratchpad Memory

Each thread in the same thread block (work group) can access a memory region called scratchpad (or shared memory)

Shared memory address space is limited in size (16 to 48 KB).

Used as a software managed “cache” to avoid off-chip memory accesses.

Synchronize threads in a thread block using __syncthreads();

[Figure: the threads of one block sharing a scratchpad; address 0x42 names the same shared location for all threads in the block.]

Page 52

Optimizing Transpose for Coalescing

Step 1: Read a block of data into shared memory.

Step 2: Copy from shared memory into global memory using a coalesced write.

[Figure: a 2x2 tile of idata passes through shared memory into odata, emerging transposed.]

Page 53

Use of Scratchpad

__global__ void transposescratchpad (float *odata, float *idata, int width)

{

__shared__ float tile[TILE_DIM][TILE_DIM];

int xIndex = blockIdx.x * TILE_DIM + threadIdx.x;

int yIndex = blockIdx.y * TILE_DIM + threadIdx.y;

int index_in = xIndex + (yIndex)*width;

xIndex = blockIdx.y * TILE_DIM + threadIdx.x;

yIndex = blockIdx.x * TILE_DIM + threadIdx.y;

int index_out = xIndex + (yIndex)*width;

for (int i=0; i<TILE_DIM; i+=BLOCK_ROWS) {

tile[threadIdx.y+i][threadIdx.x] = idata[index_in+i*width];

}

__syncthreads();

for (int i=0; i<TILE_DIM; i+=BLOCK_ROWS) {

odata[index_out+i*width] = tile[threadIdx.x][threadIdx.y+i];

}

}

https://devblogs.nvidia.com/efficient-matrix-transpose-cuda-cc/

GOOD: Coalesced write. BAD: Shared memory bank conflicts.

Page 54

Bank Conflicts

q To increase bandwidth, it is common to organize memory into multiple banks.

q Independent accesses to different banks can proceed in parallel.

[Figure: two banks; Bank 0 holds even words (0, 2, 4, 6) and Bank 1 holds odd words (1, 3, 5, 7).]

Example 1: Read 0, Read 1 (different banks — can proceed in parallel)

Example 2: Read 0, Read 3 (different banks — can proceed in parallel)

Example 3: Read 0, Read 2 (same bank — bank conflict)
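The conflict degree of one warp access can be counted by mapping each word index to its bank (word i lives in bank i % numBanks, as in the examples above). Real shared memory can also broadcast identical words to several threads, which this sketch ignores:

```cpp
#include <algorithm>
#include <map>
#include <vector>

// Worst-case number of accesses landing in the same bank: 1 = conflict-free,
// 2 = the access is serialized over two cycles (a 2-way conflict), and so on.
int conflictDegree(const std::vector<int>& wordIdx, int numBanks) {
    std::map<int, int> perBank;
    int worst = 0;
    for (int w : wordIdx) worst = std::max(worst, ++perBank[w % numBanks]);
    return worst;
}
```

With the two-bank layout above, reading words 0 and 1 (or 0 and 3) is conflict-free, while reading words 0 and 2 collides in Bank 0.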

Page 55

Shared Memory Bank Conflicts

__shared__ int A[BSIZE];

A[threadIdx.x] = … // no conflicts

[Figure: 32 banks; bank b holds words b, b+32, b+64, b+96 — with stride-1 indexing, each thread of a warp hits a different bank.]

Page 56

Shared Memory Bank Conflicts

__shared__ int A[BSIZE];

A[2*threadIdx.x] = … // 2-way conflict

[Figure: with stride-2 indexing, threads 0 and 16 both hit bank 0, threads 1 and 17 both hit bank 2, and so on — a 2-way conflict.]

Page 57

Optimizing Transpose for Coalescing

Step 1: Read a block of data into shared memory.

Step 2: Copy from shared memory into global memory using a coalesced write.

[Figure: the same 2x2 transpose-through-shared-memory picture as before.]

Problem: the copy in Step 2 accesses two locations in the same shared memory bank.

Page 58

Eliminate Bank Conflicts

__global__ void transposeNoBankConflicts (float *odata, float *idata, int width)

{

__shared__ float tile[TILE_DIM][TILE_DIM+1]; // +1 column of padding shifts each row to a different bank

int xIndex = blockIdx.x * TILE_DIM + threadIdx.x;

int yIndex = blockIdx.y * TILE_DIM + threadIdx.y;

int index_in = xIndex + (yIndex)*width;

xIndex = blockIdx.y * TILE_DIM + threadIdx.x;

yIndex = blockIdx.x * TILE_DIM + threadIdx.y;

int index_out = xIndex + (yIndex)* width;

for (int i=0; i<TILE_DIM; i+=BLOCK_ROWS) {

tile[threadIdx.y+i][threadIdx.x] = idata[index_in+i*width];

}

__syncthreads();

for (int i=0; i<TILE_DIM; i+=BLOCK_ROWS) {

odata[index_out+i*width] = tile[threadIdx.x][threadIdx.y+i];

}

}

https://devblogs.nvidia.com/efficient-matrix-transpose-cuda-cc/

Page 59

Optimizing Transpose for Coalescing

Step 1: Read a block of data into shared memory.

Step 2: Copy from shared memory into global memory using a coalesced write.

[Figure: with the padded tile, the column read in Step 2 alternates between Bank 0 and Bank 1, so the bank conflict disappears.]

Page 60

Reading Material

q NVIDIA Blogs:
● https://devblogs.nvidia.com/how-access-global-memory-efficiently-cuda-c-kernels/
● https://devblogs.nvidia.com/efficient-matrix-transpose-cuda-cc/

q GPGPU-Sim Manual and Tutorial Slides
● http://www.gpgpu-sim.org/manual
● http://www.gpgpu-sim.org/micro2012-tutorial/

q More background material: Jog et al., “OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance,” ASPLOS 2013.