
GPU Programming

Lecture 2: Data Parallelism and CUDA C

Miaoqing Huang, University of Arkansas

Fall 2013


Outline

Evolution of NVIDIA GPUs

CUDA Basics

Detailed Steps
- Device Memories and Data Transfer
- Kernel Functions and Threading


Architecture of G80 GPU

[Figure: G80 block diagram; arrays of streaming processors with special function units (SFUs)]

- 128 streaming processors in 16 streaming multiprocessors


Architecture of GT200 GPU (e.g., GeForce GTX 295)

[Figure: GT200 block diagram; arrays of streaming processors with special function units (SFUs)]

- 240 streaming processors in 30 streaming multiprocessors


Architecture of Fermi GPU (e.g., GeForce GTX 480, Tesla C2075)

[Figure: Fermi streaming multiprocessor (SM) and 16-SM Fermi GPU. Each SM has 32 CUDA cores (each pairing an FP unit and an INT unit behind a dispatch port, operand collector, and result queue), 16 load/store units, 4 special function units, a 32,768 × 32-bit register file, two warp schedulers each with a dispatch unit, an instruction cache, and 64 KB shared memory / L1 cache behind an interconnect network; the 16-SM chip adds a shared L2 cache, six DRAM interfaces, the GigaThread engine, and the host interface.]

- 512 Fermi streaming processors in 16 streaming multiprocessors


Architecture of Kepler GPU (e.g., Tesla K20)

- 2,880 streaming processors in 15 streaming multiprocessors


Architecture of Kepler Streaming Multiprocessor

- Each streaming multiprocessor contains 192 single-precision cores and 64 double-precision cores


Comparison among Three Architectures

GPU                              G80            GT200          Fermi
Transistors                      681 million    1.4 billion    3.0 billion
CUDA Cores                       128            240            512
Double Precision Capability      None           30 FMAs/clk    256 FMAs/clk
Single Precision Capability      128 MADs/clk   240 MADs/clk   512 FMAs/clk
Special Function Units / SM      2              2              4
Shared Memory / SM               16 KB          16 KB          48 KB/16 KB
L1 Cache / SM                    None           None           16 KB/48 KB
L2 Cache                         None           None           768 KB
ECC Support                      No             No             Yes

[Figure: Multiply-Add (MAD) vs. Fused Multiply-Add (FMA). Both compute Result = A × B + C; MAD rounds the product A × B first (truncating extra digits) and then adds C, while FMA retains all digits of the product and rounds only once, after the addition.]
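To see the difference numerically, here is a minimal host-side C sketch (not from the slides) contrasting a separate multiply-then-add with the C99 library function fma(), which keeps the full-precision product until a single final rounding. It assumes the compiler does not itself contract a*b + c into an FMA (e.g., build with GCC's -ffp-contract=off).

#include <stdio.h>
#include <math.h>

int main(void) {
    /* Chosen so that rounding a*b before the add loses the low bits:
       a*b = 1 + 2^-26 + 2^-54, which does not fit in a double. */
    double a = 1.0 + 0x1p-27;
    double b = 1.0 + 0x1p-27;
    double c = -(1.0 + 0x1p-26);

    double mad   = a * b + c;    /* product rounded, then added */
    double fused = fma(a, b, c); /* one rounding, after the addition */

    printf("separate round: %g\n", mad);   /* prints 0 */
    printf("fused (FMA):    %g\n", fused); /* prints ~5.55e-17 (2^-54) */
    return 0;
}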


Fermi vs. Kepler

                                      FERMI     FERMI     KEPLER       KEPLER
                                      GF100     GF104     GK104        GK110
Compute Capability                    2.0       2.1       3.0          3.5
Threads / Warp                        32        32        32           32
Max Warps / Multiprocessor            48        48        64           64
Max Threads / Multiprocessor          1536      1536      2048         2048
Max Thread Blocks / Multiprocessor    8         8         16           16
32-bit Registers / Multiprocessor     32768     32768     65536        65536
Max Registers / Thread                63        63        63           255
Max Threads / Thread Block            1024      1024      1024         1024
Shared Memory Size Configurations     16K/48K   16K/48K   16K/32K/48K  16K/32K/48K
Max X Grid Dimension                  2^16-1    2^16-1    2^32-1       2^32-1
Hyper-Q                               No        No        No           Yes
Dynamic Parallelism                   No        No        No           Yes

Compute Capability of Fermi and Kepler GPUs


Compute Capability

- The compute capability of a device is defined by a major revision number and a minor revision number
- Devices with the same major revision number are of the same core architecture
  - Kepler architecture: 3.x
  - Fermi architecture: 2.x
  - Prior devices: 1.x
- “NVIDIA CUDA C Programming Guide (v5.5)”
  - Appendix A lists all CUDA-enabled devices along with their compute capability
  - Appendix G gives the technical specifications of each compute capability
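As a quick way to check these numbers on an actual machine, a small host-only sketch (not from the slides) using the CUDA runtime calls cudaGetDeviceCount() and cudaGetDeviceProperties():

#include <stdio.h>
#include <cuda_runtime.h>

int main(void) {
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int dev = 0; dev < count; dev++) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        // prop.major / prop.minor encode the compute capability,
        // e.g., 2.x for Fermi and 3.x for Kepler
        printf("Device %d: %s, compute capability %d.%d\n",
               dev, prop.name, prop.major, prop.minor);
    }
    return 0;
}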


Outline

Evolution of NVIDIA GPUs

CUDA Basics

Detailed Steps
- Device Memories and Data Transfer
- Kernel Functions and Threading


Data Parallelism: Square Matrix Multiplication Example

[Figure: matrices M, N, and P, each of size WIDTH × WIDTH]

- P = M × N, each matrix of size WIDTH × WIDTH
- The approach in C:

for (i = 0; i < WIDTH; i++) {
  for (j = 0; j < WIDTH; j++) {
    ......
    P[i][j] = ......;
    ......
  }
}

- The approach in CUDA (the straightforward way):
  - One thread calculates one element of P
  - Issue WIDTH × WIDTH threads simultaneously


The Heterogeneous Platform

- A typical GPGPU-capable platform consists of
  - One or more microprocessors (CPUs): the host
  - One or more GPUs: the device
- GPU devices are connected to CPUs through a PCIe 2.0 bus
- A CUDA program consists of
  - The code on the CPU: the software part
  - The code on the GPU: the hardware part

[Figure: two CPUs linked by QPI (25.6 GB/s each); each GPU attached over a 16x PCIe 2.0 link (5.8 GB/s unidirectional CPU-GPU); memory bandwidths of 102 GB/s and 144 GB/s shown at the devices]
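A minimal sketch of host code selecting among the attached devices (not from the slides; cudaGetDeviceCount() and cudaSetDevice() are standard CUDA runtime calls):

#include <cuda_runtime.h>

int main(void) {
    int count = 0;
    cudaGetDeviceCount(&count);  // how many GPU devices are attached
    if (count > 0)
        cudaSetDevice(0);        // host code picks the device to use
    /* ... host (CPU) code runs here; any kernel launched after this
       point executes on device 0, across the PCIe bus ... */
    return 0;
}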


CUDA Program Structure

Serial Code (host)
    . . .
Parallel Kernel (device): KernelA<<< nBlk, nTid >>>(args);   [Grid 0]
Serial Code (host)
    . . .
Parallel Kernel (device): KernelB<<< nBlk, nTid >>>(args);   [Grid 1]

- Integrated host+device application C program
- Sequential or modestly parallel parts in host C code
- Highly parallel parts in device SPMD kernel C code
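A minimal compilable sketch of this structure (hypothetical kernels KernelA/KernelB, single-block launches for brevity; not from the slides):

#include <stdio.h>
#include <cuda_runtime.h>

// Stand-ins for the KernelA/KernelB of the diagram
__global__ void KernelA(float *d) { d[threadIdx.x] *= 2.0f; }
__global__ void KernelB(float *d) { d[threadIdx.x] += 1.0f; }

int main(void) {
    const int n = 256;
    float h[256], *d;
    for (int i = 0; i < n; i++) h[i] = (float)i;  // serial host code

    cudaMalloc((void**)&d, n * sizeof(float));
    cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);

    KernelA<<<1, n>>>(d);  // Grid 0: one block of n threads
    KernelB<<<1, n>>>(d);  // Grid 1: runs after Grid 0 completes

    cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d);
    printf("h[1] = %f\n", h[1]);  // expect 2*1 + 1 = 3
    return 0;
}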


CUDA Thread Organization

[Figure: the host launches Kernel 1 on Grid 1 (blocks (0,0), (1,0), (0,1), (1,1)) and Kernel 2 on Grid 2; Block (1, 1) is expanded to show its threads, Thread (0,0,0) through Thread (3,1,0) plus a second layer (0,0,1)..(3,0,1). Courtesy: NVIDIA]

- A kernel is implemented as a grid of threads
- Threads in a grid are further decomposed into blocks
- A grid can be up to three-dimensional
- A thread block is a batch of threads that can cooperate with each other by:
  - Synchronizing their execution
  - Efficiently sharing data through shared memory
- A block can be one-, two-, or three-dimensional
- Each block and each thread has its own ID
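The IDs combine into a unique global coordinate per thread; a kernel sketch of this common pattern (not from the slides):

__global__ void IndexDemo(int *out, int pitch) {
    // blockIdx selects the block, blockDim scales it, threadIdx
    // selects the thread within the block
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    // (row, col) is unique across the grid, so each thread can own
    // one element of a 2-D domain stored with the given row pitch
    out[row * pitch + col] = row * pitch + col;
}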


Matrix Multiplication: the main() function

[Figure: matrices M, N, and P, each WIDTH × WIDTH]

int main(void) {
    // 1.
    // Allocate and initialize the matrices M, N, P
    // I/O to read the input matrices M and N
    ......

    // 2.
    // M*N on the processor
    MatrixMultiplication(M, N, P, width);

    // 3.
    // I/O to write the output matrix P
    // Free matrices M, N, P
    ......
    return 0;
}


Matrix Multiplication: A Simple Host Version in C

void MatrixMultiplication(float* M, float* N, float* P, int width)
{
    for (int i = 0; i < width; i++) {
        for (int j = 0; j < width; j++) {
            float sum = 0;
            for (int k = 0; k < width; k++) {
                float a = M[i*width + k];
                float b = N[k*width + j];
                sum += a * b;
            }
            P[i*width + j] = sum;
        }
    }
}

[Figure: row-major layout of a 4 × 4 matrix: row 0 (M0,0..M0,3) is followed by rows 1, 2, and 3 in linear memory, so element Mi,j (i = row, j = column) is stored at M[i*WIDTH + j]]
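A quick standalone check of the i*width + j linearization (a sketch, not from the slides):

#include <stdio.h>

#define WIDTH 4

int main(void) {
    float M2D[WIDTH][WIDTH];   // C stores 2-D arrays row by row
    float *M = &M2D[0][0];     // flat view of the same storage
    M2D[2][1] = 42.0f;
    // row 2, column 1 sits at linear index 2*WIDTH + 1 = 9
    printf("%f\n", M[2*WIDTH + 1]);  // prints 42.000000
    return 0;
}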


Matrix Multiplication: Move computation to the device

void MatrixMultiplication(float* M, float* N, float* P, int width)
{
    int size = width*width*sizeof(float);
    float *Md, *Nd, *Pd;
    ......
    // 1.
    // Allocate device memory for M, N, and P
    // Copy M and N to allocated device memory locations

    // 2.
    // Kernel invocation code - to have the device perform
    // the actual matrix multiplication

    // 3.
    // Copy P from the device memory
    // Free device matrices
}


Outline

Evolution of NVIDIA GPUs

CUDA Basics

Detailed Steps
- Device Memories and Data Transfer
- Kernel Functions and Threading


Memory Hierarchy on Device

- Memory hierarchy on device
  - Global Memory
    - Main means of communicating between host and device
    - Long latency access
  - Shared Memory
    - Short latency
  - Register
    - Per-thread local variables

[Figure: a grid with two blocks; each block has its own shared memory, each thread its own registers, and all blocks share the global memory, which the host can access]

- Data access
  - Device code can read/write per-thread registers, per-block shared memory, and per-grid global memory
  - Host code can transfer data to/from per-grid global memory
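A small kernel sketch (not from the slides) that touches all three levels; __shared__ and __syncthreads() are the standard CUDA C mechanisms:

__global__ void ReverseBlock(float *g)  // g: per-grid global memory
{
    __shared__ float s[256];            // per-block shared memory
    int t = threadIdx.x;                // t lives in a per-thread register

    s[t] = g[blockIdx.x * blockDim.x + t];  // global -> shared
    __syncthreads();                        // synchronize the block

    // Threads cooperate through shared memory: each one stores a
    // value another thread of the same block loaded
    // (assumes blockDim.x <= 256)
    g[blockIdx.x * blockDim.x + t] = s[blockDim.x - 1 - t];
}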


CUDA Device Memory Allocation

- cudaMalloc()
  - Allocates an object in the device Global Memory
  - Requires two parameters:
    1. Address of a pointer to the allocated object
    2. Size of the allocated object
- cudaFree()
  - Frees an object from the device Global Memory
  - Takes a pointer to the freed object

float* Md;  // d indicates device data
int size = width*width*sizeof(float);

cudaMalloc((void**)&Md, size);
cudaFree(Md);
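Like most CUDA runtime calls, cudaMalloc() returns a cudaError_t, so allocations can be checked; a minimal sketch (not from the slides):

#include <stdio.h>
#include <cuda_runtime.h>

int main(void) {
    float *Md = NULL;
    int width = 1024;
    size_t size = width * width * sizeof(float);

    cudaError_t err = cudaMalloc((void**)&Md, size);
    if (err != cudaSuccess) {
        // cudaGetErrorString() turns the code into readable text
        fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    cudaFree(Md);
    return 0;
}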


CUDA Host-Device Data Transfer

- cudaMemcpy()
  - Memory data transfer
  - Requires four parameters:
    1. Pointer to destination
    2. Pointer to source
    3. Number of bytes copied
    4. Type of transfer
- Types of transfer
  - cudaMemcpyHostToHost
  - cudaMemcpyHostToDevice
  - cudaMemcpyDeviceToHost
  - cudaMemcpyDeviceToDevice

cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);
cudaMemcpy(M, Md, size, cudaMemcpyDeviceToHost);


Matrix Multiplication: After Integrating the Data Transfer

void MatrixMultiplication(float* M, float* N, float* P, int width)
{
    int size = width*width*sizeof(float);
    float *Md, *Nd, *Pd;

    // 1. Allocate and load M, N to device memory
    cudaMalloc((void**)&Md, size);
    cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);
    cudaMalloc((void**)&Nd, size);
    cudaMemcpy(Nd, N, size, cudaMemcpyHostToDevice);

    // Allocate P on the device
    cudaMalloc((void**)&Pd, size);

    // 2. Kernel invocation code -- to be shown later

    // 3. Read P from the device
    cudaMemcpy(P, Pd, size, cudaMemcpyDeviceToHost);

    // Free device matrices
    cudaFree(Md); cudaFree(Nd); cudaFree(Pd);
}


Index (i.e., coordinates) of Block and Thread

[Figure: thread coordinates (tx, ty, tz) in an (X, Y, Z) coordinate system with origin O, Thread (0,0,0) through Thread (3,1,0) plus a second layer (0,0,1)..(3,0,1) in a 4 × 2 × 2 block, alongside a 4 × 4 matrix M0,0..M3,3 indexed by (row, column)]

- Each block and each thread is assigned an index, i.e., blockIdx and threadIdx
  - blockIdx.x, blockIdx.y, blockIdx.z
  - threadIdx.x, threadIdx.y, threadIdx.z
- threadIdx.y → row, threadIdx.x → column


Define the Dimension of Grid and Block

- Use the pre-defined type dim3 to define the dimensions of the grid and the block
- Use the built-in variables gridDim and blockDim to get the dimensions of the grid and the block inside a kernel

dim3 dimGrid(width_g, height_g, depth_g);
dim3 dimBlock(width_b, height_b, depth_b);

- Total number of threads issued: width_g × height_g × depth_g × width_b × height_b × depth_b
- Launch the device computation threads:

MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);
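A concrete instance with illustrative values (not from the slides): a 2 × 2 × 1 grid of 16 × 16 × 1 blocks issues 2·2·1 × 16·16·1 = 1024 threads.

dim3 dimGrid(2, 2, 1);     // 4 blocks arranged 2 x 2
dim3 dimBlock(16, 16, 1);  // 256 threads per block, arranged 16 x 16
// total threads issued = (2*2*1) * (16*16*1) = 1024
MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);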


Kernel Function Specification

// Matrix multiplication kernel
// -- per-thread code
__global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int width)
{
    int tx = threadIdx.x;
    int ty = threadIdx.y;

    // Pvalue stores the element of the matrix
    // that is computed by this thread
    float Pvalue = 0;

    for (int k = 0; k < width; ++k) {
        float Melement = Md[ty*width + k];
        float Nelement = Nd[k*width + tx];
        Pvalue += Melement * Nelement;
    }

    Pd[ty*width + tx] = Pvalue;
}
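Note this kernel indexes P only with threadIdx, so it assumes the whole matrix is computed by a single block (width is then limited by the maximum threads per block). A sketch of the multi-block generalization, combining blockIdx with threadIdx as on the thread-organization slide (an extension, not the version shown in this lecture):

__global__ void MatrixMulKernelGrid(float* Md, float* Nd, float* Pd, int width)
{
    // Global row/column of the P element this thread computes
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;

    if (row < width && col < width) {  // guard for partial blocks
        float Pvalue = 0;
        for (int k = 0; k < width; ++k)
            Pvalue += Md[row*width + k] * Nd[k*width + col];
        Pd[row*width + col] = Pvalue;
    }
}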


CUDA Function Declarations

                                  Executed on    Only callable from
__device__ float DeviceFunc()     device         device
__global__ void  KernelFunc()     device         host
__host__   float HostFunc()       host           host

- __global__ defines a kernel function
  - Must be a void function
- __device__ and __host__ can be used together
  - Generates two versions of the same function during compilation
- For functions executed on the device:
  - No recursion
  - No static variable declarations inside the function
  - No variable number of arguments
  - No indirect function calls through pointers
- All functions are host functions by default
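A small sketch of the combined qualifiers (not from the slides): one source-level function compiled into both a host and a device version, plus a kernel that uses the device version.

#include <stdio.h>
#include <cuda_runtime.h>

// Compiled twice: once for the host, once for the device
__host__ __device__ float square(float x) { return x * x; }

__global__ void SquareKernel(float *d) {
    d[threadIdx.x] = square(d[threadIdx.x]);  // device version
}

int main(void) {
    printf("%f\n", square(3.0f));  // host version, prints 9.000000
    return 0;
}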
