GPU History CUDA Intro

Graphics Pipeline Elements

1. A scene description: vertices, triangles, colors, lighting
2. Transformations that map the scene to a camera viewpoint (see the sketch after this list)
3. "Effects": texturing, shadow mapping, lighting calculations
4. Rasterizing: converting geometry into pixels
5. Pixel processing: depth tests, stencil tests, and other per-pixel operations
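
As a concrete illustration of step 2, each vertex is typically multiplied by a combined 4x4 model-view-projection matrix in homogeneous coordinates. This is only a rough sketch in plain C (the function name and row-major layout are our own assumptions, not from the slides):

// Transform one vertex (x, y, z, 1) by a 4x4 row-major matrix m,
// e.g. the combined model-view-projection transform.
void transform_vertex(const float m[16], const float in[4], float out[4])
{
    for (int row = 0; row < 4; row++)
    {
        out[row] = m[row*4 + 0] * in[0]
                 + m[row*4 + 1] * in[1]
                 + m[row*4 + 2] * in[2]
                 + m[row*4 + 3] * in[3];
    }
}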


A Fixed Function GPU Pipeline

[Figure: block diagram of a fixed-function GPU pipeline. The CPU (host) feeds the GPU host interface, followed by vertex control with a vertex cache, vertex shading / transform & lighting (VS/T&L), triangle setup, raster, shader (with a texture cache), raster operations (ROP), and the frame buffer interface (FBI) connected to frame buffer memory.]

Texture Mapping Example

Texture mapping example: painting a world map texture image onto a globe object.


Programmable Vertex and Pixel Processors

[Figure: an example of separate vertex and fragment processors in a programmable graphics pipeline. A 3D application or game issues 3D API commands to the 3D API (OpenGL or Direct3D); the GPU command & data stream crosses the CPU–GPU boundary into the GPU front end. Pre-transformed vertices (fed by a vertex index stream) go to the programmable vertex processor; transformed vertices go to primitive assembly; assembled polygons, lines, and points go to rasterization & interpolation; rasterized pre-transformed fragments (with a pixel location stream) go to the programmable fragment processor; transformed fragments go to raster ops, which write pixel updates to the framebuffer.]

What is (Historical) GPGPU?

• General-purpose computation using the GPU and graphics API in applications other than 3D graphics
  – GPU accelerates the critical path of the application
• Data-parallel algorithms leverage GPU attributes
  – Large data arrays, streaming throughput
  – Model is SPMD (single program, multiple data)
  – Low-latency floating point (FP) computation
• Applications – see http://gpgpu.org
  – Game effects (FX) physics, image processing
  – Physical modeling, computational engineering, matrix algebra, convolution, correlation, sorting


Tesla GPU

• NVIDIA developed a more general-purpose GPU
• Can be programmed like a regular processor
• Must explicitly declare the data-parallel parts of the workload
  – Shader processors become fully programmable processors with instruction memory, cache, and sequencing logic
  – Memory load/store instructions with random byte addressing capability
  – Parallel programming model primitives: threads, barrier synchronization, atomic operations

CUDA

• "Compute Unified Device Architecture"
• General-purpose programming model
  – User kicks off batches of threads on the GPU
  – GPU = dedicated super-threaded, massively data-parallel co-processor
• Targeted software stack
  – Compute-oriented drivers, language, and tools
• Driver for loading computation programs into the GPU
  – Standalone driver, optimized for computation
  – Interface designed for compute – graphics-free API
  – Data sharing with OpenGL buffer objects
  – Guaranteed maximum download & readback speeds
  – Explicit GPU memory management


CUDA Devices and Threads

• A compute device
  – Is a coprocessor to the CPU, or host
  – Has its own DRAM (device memory) – see the query sketch below
  – Runs many threads in parallel
  – Is typically a GPU but can also be another type of parallel processing device
• Data-parallel portions of an application are expressed as device kernels which run on many threads
• Differences between GPU and CPU threads
  – GPU threads are extremely lightweight
    • Very little creation overhead
  – GPU needs 1000s of threads for full efficiency
    • Multi-core CPU needs only a few
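
A hedged illustration of the "compute device" bullets above: the sketch below queries the device's own DRAM size and parallelism through the standard cudaGetDeviceProperties() call (the printed fields and formatting are our own choice, not from the slides):

#include <stdio.h>

int main()
{
    // Ask the CUDA runtime about device 0: its own DRAM and parallel resources.
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    printf("Device: %s\n", prop.name);
    printf("Device memory (DRAM): %lu bytes\n", (unsigned long)prop.totalGlobalMem);
    printf("Multiprocessors: %d\n", prop.multiProcessorCount);
    printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
    return 0;
}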

G80 CUDA mode – A Device Example

• Processors execute computing threads
• New operating mode/HW interface for computing

[Figure: G80 in CUDA mode. The host feeds an input assembler and a thread execution manager, which dispatch threads to an array of processors, each with a parallel data cache and texture unit; load/store paths connect them to global memory.]


Arrays of Parallel Threads

• A CUDA kernel is executed by an array of threads
  – All threads run the same code (SPMD)
  – Each thread has an ID that it uses to compute memory addresses and make control decisions

[Figure: an array of threads with IDs 0–7, divided into Thread Block 0, Thread Block 1, …, Thread Block N - 1; every thread executes the same body:]

float x = input[threadID];
float y = func(x);
output[threadID] = y;

(A complete kernel version of this body appears below.)
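
A minimal, self-contained kernel sketch of the body above (our own illustration; the array names, func(), and the bounds check are assumptions, not from the slides). The global threadID is built from the block and thread indices:

__device__ float func(float x)                 // stand-in for whatever per-element work func() does
{
    return 2.0f * x + 1.0f;
}

__global__ void apply_func(const float *input, float *output, int n)
{
    // One element per thread; the global ID combines block and thread indices.
    int threadID = blockIdx.x * blockDim.x + threadIdx.x;
    if (threadID < n)
    {
        float x = input[threadID];
        float y = func(x);
        output[threadID] = y;
    }
}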

Thread Blocks: Scalable Cooperation

• Divide the monolithic thread array into multiple blocks
  – Threads within a block cooperate via shared memory, atomic operations, and barrier synchronization (see the sketch below)
  – Threads in different blocks cannot cooperate
  – Up to 65535 blocks, 512 threads/block
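
As a hedged illustration of in-block cooperation (our own example, not from the slides): each block loads a tile of the input into shared memory, waits at a barrier, and then writes its tile out reversed.

#define BLOCK_SIZE 256

// Assumes the launch uses BLOCK_SIZE threads per block and the
// array length is a multiple of BLOCK_SIZE.
__global__ void reverse_within_block(const int *in, int *out)
{
    __shared__ int tile[BLOCK_SIZE];               // shared memory: visible to the whole block

    int local  = threadIdx.x;
    int global = blockIdx.x * blockDim.x + threadIdx.x;

    tile[local] = in[global];
    __syncthreads();                               // barrier: wait until the tile is fully loaded

    out[global] = tile[blockDim.x - 1 - local];    // cooperate: read an element another thread wrote
}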


[Figure 3.2: An Example of CUDA Thread Organization (courtesy: NVIDIA). The host launches Kernel 1 on the device as Grid 1 and Kernel 2 as Grid 2. Grid 1 contains Blocks (0,0), (1,0), (0,1), and (1,1); Block (1,1) contains threads with 3D indices (0,0,0) through (3,1,0), with a second layer (0,0,1) through (3,0,1).]

Block IDs and Thread IDs

• We launch a "grid" of "blocks" of "threads"
• Each thread uses IDs to decide what data to work on
  – Block ID: 1D or 2D
  – Thread ID: 1D, 2D, or 3D
• Simplifies memory addressing when processing multidimensional data (see the sketch below)
  – Image processing
  – Solving PDEs on volumes
  – …
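
A minimal sketch of the usual 2D addressing pattern for image processing (the kernel name, the width/height parameters, and the brighten operation are our own illustration):

__global__ void brighten(unsigned char *image, int width, int height)
{
    // 2D block and thread IDs map naturally onto pixel coordinates.
    int x = blockIdx.x * blockDim.x + threadIdx.x;   // column
    int y = blockIdx.y * blockDim.y + threadIdx.y;   // row

    if (x < width && y < height)
    {
        int offset = y * width + x;                  // row-major pixel index
        unsigned char v = image[offset];
        image[offset] = (v > 205) ? 255 : v + 50;    // brighten with clamping
    }
}

A launch such as dim3 block(16, 16); dim3 grid((width + 15) / 16, (height + 15) / 16); covers the whole image.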

CUDA Memory Model Overview

• Global memory (see the sketch below)
  – Main means of communicating R/W data between host and device
  – Contents visible to all threads
  – Long latency access

[Figure: CUDA memory model. The host reads and writes global memory on the device (the grid); each block has its own shared memory, and each thread has its own registers.]
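
A small sketch (our own, under the assumptions in the comments) showing where each memory space from the figure appears in kernel code:

__global__ void memory_spaces_demo(const float *g_in, float *g_out)
{
    // g_in / g_out point into global memory: visible to all threads, long latency.
    __shared__ float s_tile[128];        // shared memory: one copy per block
    float r_val;                         // ordinary local variable: typically held in a register

    // Assumes the launch uses 128 threads per block and the arrays are large enough.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    s_tile[threadIdx.x] = g_in[i];
    __syncthreads();

    r_val = s_tile[threadIdx.x] * 2.0f;
    g_out[i] = r_val;
}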


CUDA Device Memory Allocation

• cudaMalloc()
  – Allocates an object in the device global memory
  – Requires two parameters
    • Address of a pointer to the allocated object
    • Size of the allocated object
• cudaFree()
  – Frees an object from device global memory
    • Pointer to the freed object

DON'T use a CPU pointer in a GPU function!

CUDA Device Memory Allocation (cont.)

• Code example:
  – Allocate a 64 * 64 single-precision float array
  – Attach the allocated storage to Md
  – "d" is often used to indicate a device data structure

int TILE_WIDTH = 64;
float *Md;
int size = TILE_WIDTH * TILE_WIDTH * sizeof(float);

cudaMalloc((void**)&Md, size);
cudaFree(Md);
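
cudaMalloc() and the other runtime calls return a cudaError_t, so a slightly more defensive variant of the example might check it (this checking style is our own addition, not part of the original slides, and assumes the code runs inside a function that can return on failure):

float *Md;
int size = 64 * 64 * sizeof(float);

cudaError_t err = cudaMalloc((void**)&Md, size);
if (err != cudaSuccess)
{
    // cudaGetErrorString() converts the error code into readable text.
    printf("cudaMalloc failed: %s\n", cudaGetErrorString(err));
    return 1;
}

cudaFree(Md);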


CUDA Host-Device Data Transfer

• cudaMemcpy()
  – Memory data transfer
  – Requires four parameters
    • Pointer to destination
    • Pointer to source
    • Number of bytes copied
    • Type of transfer
      – Host to Host
      – Host to Device
      – Device to Host
      – Device to Device
• cudaMemcpy() blocks the host; for non-blocking/asynchronous transfers, use cudaMemcpyAsync()

CUDA Host-Device Data Transfer (cont.)

• Code example:
  – Transfer a 64 * 64 single-precision float array
  – M is in host memory and Md is in device memory
  – cudaMemcpyHostToDevice and cudaMemcpyDeviceToHost are symbolic constants

cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);
cudaMemcpy(M, Md, size, cudaMemcpyDeviceToHost);
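
For the asynchronous variant mentioned above, cudaMemcpyAsync() queues the copy on a stream and returns to the host immediately. A minimal sketch (our own; it assumes M was allocated with cudaMallocHost() so the copy can actually overlap with host work):

cudaStream_t stream;
cudaStreamCreate(&stream);

// Queue the host-to-device copy; the host does not wait here.
cudaMemcpyAsync(Md, M, size, cudaMemcpyHostToDevice, stream);

// ... the host can do other work while the copy is in flight ...

cudaStreamSynchronize(stream);   // wait until the queued copy has finished
cudaStreamDestroy(stream);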


CUDA Function Declarations

                                     Executed on the:    Only callable from the:
  __device__ float DeviceFunc()      device              device
  __global__ void KernelFunc()       device              host
  __host__   float HostFunc()        host                host

• __global__ defines a kernel function
  – Must return void
• __device__ and __host__ can be used together

Code Example

#include <stdio.h>

__global__ void add(int a, int b, int *c)
{
    *c = a + b;
}

int main()
{
    int a, b, c;
    int *dev_c;

    a = 3;
    b = 4;

    cudaMalloc((void**)&dev_c, sizeof(int));

    add<<<1,1>>>(a, b, dev_c);    // 1 block and 1 thread/block

    cudaMemcpy(&c, dev_c, sizeof(int), cudaMemcpyDeviceToHost);
    printf("%d + %d is %d\n", a, b, c);

    cudaFree(dev_c);
    return 0;
}
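
Illustrating the last bullet above, __device__ and __host__ can be combined so one helper compiles for both sides. A small sketch with a hypothetical square() helper (our own example, not from the slides):

// Compiled twice: once for the CPU, once for the GPU.
__host__ __device__ int square(int x)
{
    return x * x;
}

__global__ void square_on_gpu(int x, int *result)
{
    *result = square(x);      // device-side call
}

// On the host side, square(5) can also be called directly, e.g. from main().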


Sequential Code – Adding Arrays

#include <stdio.h>

#define N 10

void add(int *a, int *b, int *c)
{
    int tID = 0;
    while (tID < N)
    {
        c[tID] = a[tID] + b[tID];
        tID += 1;
    }
}

int main()
{
    int a[N], b[N], c[N];

    // Fill arrays
    for (int i = 0; i < N; i++)
    {
        a[i] = i;
        b[i] = 1;
    }

    add(a, b, c);

    for (int i = 0; i < N; i++)
    {
        printf("%d + %d = %d\n", a[i], b[i], c[i]);
    }
    return 0;
}

#include "stdio.h"

#define N 10

__global__ void add(int *a, int *b, int *c)

{

int tID = blockIdx.x;

if (tID < N)

{

c[tID] = a[tID] + b[tID];

}

}

CUDA Code – Adding Arrays

int main()

{

int a[N], b[N], c[N];

int *dev_a, *dev_b, *dev_c;

cudaMalloc((void **) &dev_a, N*sizeof(int));

cudaMalloc((void **) &dev_b, N*sizeof(int));

cudaMalloc((void **) &dev_c, N*sizeof(int));

// Fill Arrays

for (int i = 0; i < N; i++)

{

a[i] = i,

b[i] = 1;

}

cudaMemcpy(dev_a, a, N*sizeof(int), cudaMemcpyHostToDevice);

cudaMemcpy(dev_b, b, N*sizeof(int), cudaMemcpyHostToDevice);

add<<<N,1>>>(dev_a, dev_b, dev_c);

cudaMemcpy(c, dev_c, N*sizeof(int), cudaMemcpyDeviceToHost);

for (int i = 0; i < N; i++)

{

printf("%d + %d = %d\n", a[i], b[i], c[i]);

}

return 0;

}
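
The same computation could instead use one block of N threads; a hedged sketch of the two changes (threadIdx.x replaces blockIdx.x, and the launch configuration becomes <<<1, N>>>):

__global__ void add_threads(int *a, int *b, int *c)
{
    int tID = threadIdx.x;           // thread index within the single block
    if (tID < N)
    {
        c[tID] = a[tID] + b[tID];
    }
}

// Launch: one block, N threads per block.
// add_threads<<<1, N>>>(dev_a, dev_b, dev_c);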


Julia Fractal

• Evaluates an iterative equation for points in the complex plane
  – A point is not in the set if the iteration diverges toward infinity
  – A point is in the set if the iteration remains bounded
• Equation: Z_{n+1} = Z_n^2 + C
  – where Z is a point in the complex plane and C is a constant
• Our implementation uses the FreeImage library

CPU Fractal Implementation

• Structure to store, multiply, and add complex numbers

#include "FreeImage.h"
#include "stdio.h"

#define DIM 1000

struct cuComplex {
    float r;
    float i;
    cuComplex( float a, float b ) : r(a), i(b) {}
    float magnitude2( void ) { return r * r + i * i; }
    cuComplex operator*(const cuComplex& a) {
        return cuComplex(r*a.r - i*a.i, i*a.r + r*a.i);
    }
    cuComplex operator+(const cuComplex& a) {
        return cuComplex(r+a.r, i+a.i);
    }
};


CPU Fractal Implementation

• Julia function

int julia(int x, int y)
{
    const float scale = 1.5;
    float jx = scale * (float)(DIM/2 - x)/(DIM/2);
    float jy = scale * (float)(DIM/2 - y)/(DIM/2);

    cuComplex c(-0.8, 0.156);
    cuComplex a(jx, jy);

    for (int i = 0; i < 200; i++)
    {
        a = a*a + c;
        if (a.magnitude2() > 1000) return 0;
    }
    return 1;
}

CPU Fractal Implementation

• What will become our kernel
  – Array of char is 0 or 1 to indicate pixel or no pixel

void kernel(char *ptr)
{
    for (int y = 0; y < DIM; y++)
        for (int x = 0; x < DIM; x++)
        {
            int offset = x + y * DIM;
            ptr[offset] = julia(x, y);
        }
}


CPU Fractal Implementation

int main()
{
    FreeImage_Initialise();
    FIBITMAP *bitmap = FreeImage_Allocate(DIM, DIM, 32);

    char charmap[DIM][DIM];
    kernel(&charmap[0][0]);

    RGBQUAD color;
    for (int i = 0; i < DIM; i++) {
        for (int j = 0; j < DIM; j++) {
            color.rgbRed = 0;
            color.rgbGreen = 0;
            color.rgbBlue = 0;
            if (charmap[i][j] != 0)
                color.rgbBlue = 255;
            FreeImage_SetPixelColor(bitmap, i, j, &color);
        }
    }

    FreeImage_Save(FIF_BMP, bitmap, "output.bmp");
    FreeImage_Unload(bitmap);
    return 0;
}

GPU Fractal Implementation

• Assign the computation of each point to a processor
• Use a 2D grid of blocks and the blockIdx.x and blockIdx.y variables to determine which pixel we should be working on


GPU Fractal

• __device__ on the member functions makes them callable from the compute device

struct cuComplex {
    float r;
    float i;
    __device__ cuComplex( float a, float b ) : r(a), i(b) {}
    __device__ float magnitude2( void ) { return r * r + i * i; }
    __device__ cuComplex operator*(const cuComplex& a) {
        return cuComplex(r*a.r - i*a.i, i*a.r + r*a.i);
    }
    __device__ cuComplex operator+(const cuComplex& a) {
        return cuComplex(r+a.r, i+a.i);
    }
};

GPU Fractal

__device__ int julia(int x, int y)
{
    // Same as CPU version
}

__global__ void kernel(char *ptr)
{
    int x = blockIdx.x;
    int y = blockIdx.y;
    int offset = x + y * DIM;

    ptr[offset] = julia(x, y);
}


GPU Fractal

int main()
{
    FreeImage_Initialise();
    FIBITMAP *bitmap = FreeImage_Allocate(DIM, DIM, 32);

    char charmap[DIM][DIM];
    char *dev_charmap;
    cudaMalloc((void**)&dev_charmap, DIM*DIM*sizeof(char));

    dim3 grid(DIM, DIM);                  // one block per pixel
    kernel<<<grid,1>>>(dev_charmap);

    cudaMemcpy(charmap, dev_charmap, DIM*DIM*sizeof(char),
               cudaMemcpyDeviceToHost);

    RGBQUAD color;
    for (int i = 0; i < DIM; i++) {
        for (int j = 0; j < DIM; j++) {
            color.rgbRed = 0;
            color.rgbGreen = 0;
            color.rgbBlue = 0;
            if (charmap[i][j] != 0)
                color.rgbBlue = 255;
            FreeImage_SetPixelColor(bitmap, i, j, &color);
        }
    }

    FreeImage_Save(FIF_BMP, bitmap, "output.bmp");
    FreeImage_Unload(bitmap);

    cudaFree(dev_charmap);
    return 0;
}