λ Fernando Magno Quintão Pereira
PROGRAMMING LANGUAGES LABORATORY
Universidade Federal de Minas Gerais - Department of Computer Science
PROGRAM ANALYSIS AND OPTIMIZATION – DCC888
MEMORY OPTIMIZATIONS FOR GRAPHICS PROCESSING UNITS
The material in these slides has been taken from the NVIDIA manuals (Best Practices Guide & Optimizing Matrix Transpose in CUDA), and from a paper by Ryoo et al [Ryoo12]. See "A bit of History" in the last slide.
DCC 888
λ Universidade Federal de Minas Gerais – Department of Computer Science – Programming Languages Laboratory
WHAT ARE GRAPHICS PROCESSING UNITS
The Wheel of Reincarnation
• In the good old days, the graphics hardware was just the VGA. All the processing was done in software.
• People started complaining: software is slow…
  – But what do you want to run at the hardware level?
A scandalously brief history of GPUs
1) Do you know how the frame buffer works?
2) Can you program the VGA standard in any way?
The Wheel of Reincarnation
• Some functions, like the rasterizer, are heavily used. What is rasterization?
• Better to implement these functions in hardware.
1) How can we implement a function at the hardware level?
2) What is the advantage of implementing a function at the hardware level?
3) Is there any drawback?
4) Can we program (in any way) this hardware used to implement a specific function?
Graphics Pipeline
• Graphics can be processed in a pipeline.
  – Transform, project, clip, display, etc…
• Some functions, although different, can be implemented by very similar hardware.
• Add a graphics API to program the shaders.
  – But this API is so specific… and the hardware is so powerful… what a waste!
Shading is an example. Do you know what a shader is?
General Purpose Hardware
• Let’s add an instruction set to the shader.
  – Let’s augment this hardware with general purpose integer operations.
  – What about adding some branching machinery too?
• Hum… add a high-level language on top of this stuff.
  – Plus a lot of documentation. Advertise it! It should look cool!
• Oh boy: we now have two general purpose processors.
  – We should unify them. The rant starts all over again…
1.5 turns around the wheel
• Let’s add a display processor to the display processor.
  – After all, there are some operations that are really specific, and performance critical…
Dedicated rasterizer
Brief Timeline
Year   Transistors   Model             Tech
1999   25M           GeForce 256       DX7, OpenGL
2001   60M           GeForce 3         Programmable Shader
2002   125M          GeForce FX        Cg programs
2006   681M          GeForce 8800      C for CUDA
2008   1.4G          GeForce GTX 280   IEEE FP
2010   3.0G          Fermi             Cache, C++
Computer Organization
• GPUs exhibit different types of parallelism:
  – Single Instruction Multiple Data (SIMD)
  – Single Program Multiple Data (SPMD)
• In the end, we have an MSIMD hardware.
1) Why are GPUs so parallel?
2) Why don’t traditional CPUs exhibit all this parallelism?
We can think of a SIMD hardware as a firing squad: we have a captain and a row of soldiers. The captain issues orders, such as set, aim, fire! And all the soldiers, upon hearing one of these orders, perform an action. They all do the same action, yet they use different guns and bullets.
The Programming Environment
An outrageously concise overview of the programming model
• There are two main programming languages used to program graphics processing units today: OpenCL and C for CUDA
• These are not the first languages developed for GPUs. They came after Cg or HLSL, for instance.
  – But they are much more general and expressive.
• We will focus on C for CUDA.
• This language lets the programmer explicitly write code that will run in the CPU, and code that will run in the GPU.
  – It is a heterogeneous programming language.
From C to CUDA in one Step
void saxpy_serial(int n, float alpha, float *x, float *y) {
  for (int i = 0; i < n; i++)
    y[i] = alpha * x[i] + y[i];
}
// Invoke the serial function:
saxpy_serial(n, 2.0, x, y);
• This program, written in C, performs a typical vector operation: it reads two arrays and writes the result back into one of them.
• We will translate this program to C for CUDA.
1) What is the asymptotic complexity of this program?
2) How much can we parallelize this program? In a world with many – really many – processors, e.g., the PRAM world, what would be the complexity of this program?
The first CUDA program
void saxpy_serial(int n, float alpha, float *x, float *y) {
  for (int i = 0; i < n; i++)
    y[i] = alpha * x[i] + y[i];
}
// Invoke the serial function:
saxpy_serial(n, 2.0, x, y);
__global__
void saxpy_parallel(int n, float alpha, float *x, float *y) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n)
    y[i] = alpha * x[i] + y[i];
}
// Invoke the parallel kernel:
int nblocks = (n + 255) / 256;
saxpy_parallel<<<nblocks, 256>>>(n, 2.0, x, y);
What happened to the loop in the CUDA program?
Understanding the Code
• Threads are grouped in warps, blocks and grids.
• Threads in different grids do not talk to each other.
  – Grids are divided in blocks.
• Threads in the same block share memory and barriers.
  – Blocks are divided in warps.
• Threads in the same warp follow the SIMD model.
__global__
void saxpy_parallel(int n, float alpha, float *x, float *y) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n)
    y[i] = alpha * x[i] + y[i];
}
// Invoke the parallel kernel:
int nblocks = (n + 255) / 256;
saxpy_parallel<<<nblocks, 256>>>(n, 2.0, x, y);
Raising the level
• CUDA programs contain CPU code plus kernels.
• Kernels are called via a special syntax:

  kernel<<<dGrd, dBck>>>(A, B, w, C);
• The C part of the program is compiled as traditional C.
• The kernel part is first translated into PTX, and then this high level assembly is translated into SASS.
__global__ void matMul1(float* B, float* C, float* A, int w) {
  float Pvalue = 0.0;
  for (int k = 0; k < w; ++k) {
    Pvalue += B[threadIdx.y * w + k] * C[k * w + threadIdx.x];
  }
  A[threadIdx.x + threadIdx.y * w] = Pvalue;
}

void Mul(const float* A, const float* B, int width, float* C) {
  int size = width * width * sizeof(float);
  // Load A and B to the device:
  float* Ad;
  cudaMalloc((void**)&Ad, size);
  cudaMemcpy(Ad, A, size, cudaMemcpyHostToDevice);
  float* Bd;
  cudaMalloc((void**)&Bd, size);
  cudaMemcpy(Bd, B, size, cudaMemcpyHostToDevice);
  // Allocate C on the device:
  float* Cd;
  cudaMalloc((void**)&Cd, size);
  // Compute the execution configuration, assuming the matrix
  // dimensions are multiples of BLOCK_SIZE. (matMul1 indexes only
  // with threadIdx, so this configuration is correct only when the
  // grid contains a single block.)
  dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
  dim3 dimGrid(width / dimBlock.x, width / dimBlock.y);
  // Launch the device computation:
  matMul1<<<dimGrid, dimBlock>>>(Ad, Bd, Cd, width);
  // Read C from the device:
  cudaMemcpy(C, Cd, size, cudaMemcpyDeviceToHost);
  // Free device memory:
  cudaFree(Ad);
  cudaFree(Bd);
  cudaFree(Cd);
}
Lowering the level
• CUDA assembly is called Parallel Thread Execution (PTX)
__global__ void
saxpy_parallel(int n, float a, float *x, float *y) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n)
    y[i] = a * x[i] + y[i];
}
What do you think an assembly language for parallel programming should have?
DCC 888
λ Universidade Federal de Minas Gerais – Department of Computer Science – Programming Languages Laboratory
MEMORY OPTIMIZATIONS
A Brief Overview of the GPU Threading Model
• Each thread has local registers and local memory
• Threads are grouped in warps.
  – Warps run in SIMD execution.
• Warps are grouped in blocks.
  – Shared memory + syncs.
• Blocks are grouped in grids.
  – Each grid represents a kernel.
1) How do different grids communicate?
2) How do threads in the same block communicate?
3) What determines the size of the block of threads?
4) What determines the size of the grid of threads?
5) What is the effect of branches in the warp execution?
A Brief Overview of the GPU Threading Model
Going to the Archives
• GPUs are much more memory intensive than traditional CPUs. Let’s look at an example:
  – The GeForce 8800 processes 32 pixels per clock. Each pixel contains a color (3 bytes) and a depth (4 bytes), which are read and written. On average, 16 extra bytes of information are read for each pixel. How many bytes are processed per clock?
To put these numbers in perspective, how much data is processed in each cycle of an ordinary x86 CPU?
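Working out the numbers in the example above: each pixel moves (3 + 4) bytes of reads plus (3 + 4) bytes of writes plus 16 extra bytes of reads, i.e., 30 bytes; at 32 pixels per clock that amounts to 32 × 30 = 960 bytes per clock.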
The GPU Archive
• Registers: fast, yet few. Private to each thread
• Shared memory: used by threads in the same block
• Local memory: off-chip and slow. Private to each thread
• Global memory: off-chip and slow. Used to provide communication between blocks and grids.
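To make the four levels of this archive concrete, here is a minimal sketch in C for CUDA (the kernel name, array sizes and the data movement are illustrative, not taken from the slides) marking where each kind of variable lives:

__global__ void archive_demo(float* in, float* out) {  // in and out point to global memory
  int i = blockIdx.x * blockDim.x + threadIdx.x;        // scalar locals such as i live in registers
  __shared__ float buffer[256];                         // shared memory: visible to all threads of the block
  float spill[1000];                                    // large per-thread arrays are typically placed in local memory
  buffer[threadIdx.x] = in[i];                          // global -> shared
  __syncthreads();                                      // barrier among the threads of the block
  spill[threadIdx.x] = buffer[threadIdx.x];             // shared -> local
  out[i] = spill[threadIdx.x];                          // back to global memory
}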
The GPU Archive
1) Why can't we leave all the data in registers?
2) Why can't we leave all the data in shared memory?
3) The CPU also has a memory hierarchy. Do you remember what this hierarchy looks like?
4) Why do we have a memory hierarchy also in the CPU?
• Registers: fast, yet few. Private to each thread
• Shared memory: used by threads in the same block
• Local memory: off-chip and slow. Private to each thread
• Global memory: off-chip and slow. Used to provide communication between blocks and grids.
The Traveler’s Journal
• The interplanetary trip: commuting data between host and device.
• The silk road trip: commuting data between global and shared memory.
• The bakery walk: commuting data between shared memory and registers.
• Reaching a book on the table: reading/writing data to registers.
DCC 888
λ Universidade Federal de Minas Gerais – Department of Computer Science – Programming Languages Laboratory
INTER-DEVICE COMMUNICATION
The Interplanetary Trip
• Copying data between GPU and CPU is pretty slow. CUDA provides some library functions for this (a usage sketch follows the list below):
– cudaMalloc: allocates data in the GPU memory space
– cudaMemset: fills a memory area with a value
– cudaFree: frees the data in the GPU memory space
– cudaMemcpy: copies data from CPU to GPU, or vice-versa
• Once data is sent to the GPU, it stays in the device DRAM, even after the kernel is done executing.
• Try invoking kernels on data already on the GPU
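A minimal sketch of how these calls fit together, reusing the saxpy_parallel kernel from the earlier slides (the host and device pointer names and the array size are illustrative):

int n = 1 << 20;
int size = n * sizeof(float);
float *hx = (float*) malloc(size);   // host arrays, assumed to be filled by the host
float *hy = (float*) malloc(size);
float *dx, *dy;                      // device arrays
cudaMalloc((void**)&dx, size);
cudaMalloc((void**)&dy, size);
cudaMemset(dy, 0, size);             // initialize y on the device
cudaMemcpy(dx, hx, size, cudaMemcpyHostToDevice);
saxpy_parallel<<<(n + 255) / 256, 256>>>(n, 2.0, dx, dy);  // the kernel only touches device memory
cudaMemcpy(hy, dy, size, cudaMemcpyDeviceToHost);
cudaFree(dx);
cudaFree(dy);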
1) Can you think about a situation in which it is better to leave a kernel, do some computation on the CPU, and then call another kernel?
2) By the way, can you think about a problem that is inherently sequential?
The GPU deserves complex work
__global__
void matSumKernel(float* S, float* A, float* B, int side) {
  int ij = threadIdx.x + threadIdx.y * side;
  S[ij] = A[ij] + B[ij];
}
__global__
void matMul1(float* B, float* C, float* A, int w) {
  float v = 0.0;
  for (int k = 0; k < w; ++k) {
    v += B[threadIdx.y * w + k] * C[k * w + threadIdx.x];
  }
  A[threadIdx.x + threadIdx.y * w] = v;
}
1) What is the complexity of copying data from the CPU to the GPU?
2) Is it worth doing matrix sum on the GPU?
3) Is it worth doing matrix multiplication on the GPU?
Matrix Sum × Matrix Mul
• Matrix Sum: copies O(n²) data between host and device to perform only O(n²) additions; the data transfer dominates.
• Matrix Mul: copies O(n²) data to perform O(n³) operations; the computation amortizes the cost of the transfer.
The ballerina’s waltz
• Start working as soon as data is available: use a pipeline!

cudaMemcpy(dst, src, N * sizeof(float), dir);
kernel<<<N / nThreads, nThreads>>>(dst);

• C for CUDA has an API for asynchronous transfers:

sz = N * sizeof(float) / nStreams;
for (i = 0; i < nStreams; i++) {
  offset = i * N / nStreams;
  cudaMemcpyAsync(dst + offset, src + offset, sz, dir, stream[i]);
}
for (i = 0; i < nStreams; i++) {
  gridSize = N / (nThreads * nStreams);
  offset = i * N / nStreams;
  kernel<<<gridSize, nThreads, 0, stream[i]>>>(dst + offset);
}
What is the glue between data and computation in this example?
The ballerina’s waltz
• Asynchronous memory copy overlaps data transfer and GPU processing
This technique to obtain parallelism, e.g., pipeline parallelism, is a pattern used in many different scenarios. Could you name other examples of pipeline parallelism?
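The asynchronous snippet above assumes the streams already exist. A minimal sketch of the surrounding setup and teardown (the value of nStreams is illustrative) could be:

const int nStreams = 4;
cudaStream_t stream[nStreams];
for (int i = 0; i < nStreams; i++)
  cudaStreamCreate(&stream[i]);        // one stream per chunk of the input

// ... issue the cudaMemcpyAsync calls and kernel launches shown before ...

for (int i = 0; i < nStreams; i++) {
  cudaStreamSynchronize(stream[i]);    // wait for the copy and the kernel of stream i
  cudaStreamDestroy(stream[i]);
}

Remember that cudaMemcpyAsync only overlaps with kernel execution when the host buffer is page-locked, e.g., allocated with cudaMallocHost.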
DCC 888
λ Universidade Federal de Minas Gerais – Department of Computer Science – Programming Languages Laboratory
GLOBAL MEMORY ACCESS
The Silk Road Trip
• Reading or writing to the global memory is also slow.
  – But not as slow as reading/writing between host and device.
  – The global memory is on-board (it sits on the graphics card).
The Matrix Multiplication Kernel
void matmult(float* B, float* C, float* A, int w) {
  for (unsigned int i = 0; i < w; ++i) {
    for (unsigned int j = 0; j < w; ++j) {
      A[i * w + j] = 0.0;
      for (unsigned int k = 0; k < w; ++k) {
        A[i * w + j] += B[i * w + k] * C[k * w + j];
      }
    }
  }
}
1) In the PRAM model, what is the asymptotic complexity of the matrix multiplication problem?
2) Could you translate this program to C for CUDA?
Matrix Multiplication Kernel
__global__ void matMul1(float* B, float* C, float* A, int Width) {
  float Pvalue = 0.0;
  int tx = blockIdx.x * blockDim.x + threadIdx.x;
  int ty = blockIdx.y * blockDim.y + threadIdx.y;
  for (int k = 0; k < Width; ++k) {
    Pvalue += B[tx * Width + k] * C[k * Width + ty];
  }
  A[ty + tx * Width] = Pvalue;
}

From matMul, available in the course webpage.
1) What is the asymptotic complexity of this program?
2) Given width = 10, how many accesses to the global memory does this program perform?
3) How can we find out how many floating-point operations per second this program performs?
In this example, each thread is responsible for multiplying one line of B by one column of C, to produce an element of A.
1) What is the proportion of floating-point to non-floating-point operations in the innermost loop of the optimized program?
2) What do we gain, at the hardware level, from having fewer branches to execute?
3) About ½ of the instructions in the innermost loop are floating-point operations. What is the new GFlops figure on the GeForce 8800, which has an upper limit of 172.8 GFlops?
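The optimized kernel that these questions refer to is not reproduced in this transcript. Purely as a point of reference, a typical tiled version that stages the operands in shared memory (in the spirit of the optimizations of Ryoo et al.) looks roughly like the sketch below; the name matMulTiled, the TILE constant and the exact indexing are illustrative, not the slides' own code.

#define TILE 16

__global__ void matMulTiled(float* B, float* C, float* A, int Width) {
  __shared__ float Bs[TILE][TILE];
  __shared__ float Cs[TILE][TILE];
  int row = blockIdx.y * TILE + threadIdx.y;  // row of A and of B
  int col = blockIdx.x * TILE + threadIdx.x;  // column of A and of C
  float Pvalue = 0.0;
  for (int m = 0; m < Width / TILE; ++m) {
    // Each thread loads one element of the B tile and one of the C tile:
    Bs[threadIdx.y][threadIdx.x] = B[row * Width + m * TILE + threadIdx.x];
    Cs[threadIdx.y][threadIdx.x] = C[(m * TILE + threadIdx.y) * Width + col];
    __syncthreads();
    for (int k = 0; k < TILE; ++k)
      Pvalue += Bs[threadIdx.y][k] * Cs[k][threadIdx.x];  // all operands now come from shared memory
    __syncthreads();
  }
  A[row * Width + col] = Pvalue;  // assumes Width is a multiple of TILE
}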
Shared Bank Conflicts
• The shared memory is divided into 16 banks (think of them as 16 access doors).
• If two threads of a half-warp access different addresses in the same bank, a conflict happens; if they read the very same address, the value is broadcast and there is no conflict.
[Figure: the 16 banks (0–15) drawn three times, illustrating two ideal access patterns with no conflicts and one very bad case; accesses to the same address are broadcast and do not conflict.]
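To make the notion of a conflict concrete, here is a small sketch (illustrative, not from the slides) of shared-memory access patterns on this 16-bank hardware:

__global__ void bank_demo(float* out) {
  __shared__ float s[512];
  s[threadIdx.x] = threadIdx.x;
  __syncthreads();
  float a = s[threadIdx.x];      // consecutive threads hit consecutive banks: no conflict
  float b = s[threadIdx.x * 2];  // threads 0 and 8 of a half-warp both hit bank 0: a 2-way conflict
  float c = s[0];                // every thread reads the same address: broadcast, no conflict
  out[threadIdx.x] = a + b + c;  // keep the reads alive
}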
Matrix transpose
__global__ void transpose1(float* In, float* Out, int Width) {
  __shared__ float tile[BLOCK_WIDTH][BLOCK_WIDTH];
  // Compute the index of the data in the input matrix:
  int xIndex = blockIdx.x * blockDim.x + threadIdx.x;
  int yIndex = blockIdx.y * blockDim.y + threadIdx.y;
  int index_in = yIndex * Width + xIndex;
  // Compute the index of the data in the output matrix:
  xIndex = blockIdx.y * blockDim.x + threadIdx.x;
  yIndex = blockIdx.x * blockDim.y + threadIdx.y;
  int index_out = yIndex * Width + xIndex;
  // Copy one tile through shared memory, transposing it on the way out.
  // Reading the tile column-wise makes all threads of a half-warp hit the same bank.
  tile[threadIdx.y][threadIdx.x] = In[index_in];
  __syncthreads();
  Out[index_out] = tile[threadIdx.x][threadIdx.y];
}
• You may not quite believe me…

__global__ void transpose2(float* In, float* Out, int Width) {
  __shared__ float tile[BLOCK_WIDTH][BLOCK_WIDTH + 1];  // the extra column removes the bank conflicts
  // Compute the index of the data in the input matrix:
  int xIndex = blockIdx.x * blockDim.x + threadIdx.x;
  int yIndex = blockIdx.y * blockDim.y + threadIdx.y;
  int index_in = yIndex * Width + xIndex;
  // Compute the index of the data in the output matrix:
  xIndex = blockIdx.y * blockDim.x + threadIdx.x;
  yIndex = blockIdx.x * blockDim.y + threadIdx.y;
  int index_out = yIndex * Width + xIndex;
  // Same copy as in transpose1; only the padding of the tile changed.
  tile[threadIdx.y][threadIdx.x] = In[index_in];
  __syncthreads();
  Out[index_out] = tile[threadIdx.x][threadIdx.y];
}