Page 1
Graphics Processing Units

References:
•Computer Architecture, 5th Edition, Hennessy and Patterson, 2012
•http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf
•http://www.realworldtech.com/page.cfm?ArticleID=RWT093009110932&p=1
•http://www.moderngpu.com/intro/performance.html
•http://heather.cs.ucdavis.edu/parprocbook
Page 2
CPU vs. GPU
http://chip-architect.com/news/2003_04_20_Looking_at_Intels_Prescott_part2.html
• CPU: small fraction of chip used for arithmetic
Page 3
CPU vs GPU
http://www.pcper.com/reviews/Graphics-Cards/NVIDIA-GT200-Revealed-GeForce-GTX-280-and-GTX-260-Review/NVIDIA-GT200-Archite
• GPU: large fraction of chip used for arithmetic
Page 4
CPU vs GPU
Intel Haswell: 170 GFLOPS, quad-core at 3.4 GHz
AMD Radeon R9 290: 4800 GFLOPS at 0.95 GHz
Nvidia GTX 970: 5000 GFLOPS at 1.05 GHz
Page 5
GPGPU
General-purpose GPU programming: massively parallel
Scientific computing, brain simulations, etc.
In supercomputers: 53 of the top500.org supercomputers used NVIDIA/AMD GPUs (Nov 2014 ranking)
Including the 2nd and 6th places
Page 6
OpenCL vs CUDA
Both for GPGPU
OpenCL: open standard, supported on AMD, NVIDIA, Intel, Altera, …
CUDA: proprietary (Nvidia); losing ground to OpenCL?
Similar performance
Page 7
CUDA
Programming on Parallel Machines, Norm Matloff, Chapter 5
http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf
Uses a thread hierarchy (see the index sketch below):
Thread
Block
Grid
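As a rough sketch (the kernel name and output array are illustrative, not from the slides), a kernel sees this hierarchy through built-in index variables:

// threadIdx.x - this thread's ID within its block
// blockIdx.x  - this block's ID within the grid
// blockDim.x  - threads per block; gridDim.x - blocks per grid
__global__ void whoami(int *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // unique grid-wide ID
    if (i < n)
        out[i] = i;
}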
Page 8
Thread
Executes an instance of a kernel (program)
Has a ThreadID (within its block), a program counter, registers, private memory, and input and output parameters
Private memory holds register spills, function calls, and array variables
Nvidia Fermi Whitepaper pg 6
Page 9
Block
Set of concurrently executing threads
Cooperate via barrier synchronization and shared memory (fast but small); see the sketch below
BlockID (within the grid)
Nvidia Fermi Whitepaper pg 6
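A minimal sketch of block-level cooperation, assuming 512 threads per block and n a multiple of the block size (kernel name is illustrative):

// Each block reverses its own 512-element tile: threads stage values
// in fast shared memory, wait at a barrier, then read partner slots.
__global__ void reverse_tile(double *d, int n) {
    __shared__ double tile[512];                  // shared by the whole block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = d[i];
    __syncthreads();                              // barrier: all writes visible
    d[i] = tile[blockDim.x - 1 - threadIdx.x];    // read another thread's value
}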
Page 10
Grid
Array of thread blocks running the same kernel
Read and write global memory (slow: hundreds of cycles)
Synchronize between dependent kernel calls (sketch below)
Nvidia Fermi Whitepaper pg 6
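Launches on the default stream execute in order, so a dependent kernel sees the previous grid's writes to global memory; a sketch (step1/step2 are illustrative names):

step1<<<nblocks,512>>>(d_data, n);   // grid 1 writes global memory
step2<<<nblocks,512>>>(d_data, n);   // grid 2 starts after grid 1 completes
cudaDeviceSynchronize();             // host waits for both grids to finish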
Page 11
Hardware Mapping
GPU executes one or more kernel (program) grids
A Streaming Multiprocessor (SM) executes one or more thread blocks
A CUDA core executes a thread
Page 12
Fermi Architecture
Debuted in 2010
512 CUDA cores, each executing one FP or integer instruction per cycle
32 CUDA cores per SM
16 SMs per GPU
6 64-bit memory ports
PCI-Express interface to CPU
GigaThread scheduler distributes blocks to SMs; each SM has a hardware thread scheduler
Fast context switches
3 billion transistors
Page 13
[Figure: Fermi architecture diagram]
Nvidia Fermi Whitepaper pg 7
Page 14
CUDA core
Pipelined integer and FP units
IEEE 754-2008 FP fused multiply-add
Integer unit: boolean, shift, move, compare, …
Nvidia Fermi Whitepaper pg 8
Page 15
Streaming Multiprocessor (SM)
32 CUDA cores
16 ld/st units calculate source/destination addresses
Special Function Units (SFUs): sin, cosine, reciprocal, sqrt
Nvidia Fermi Whitepaper pg 8
Page 16
Warps
32 threads from a block are bundled into a warp, which executes the same instruction each cycle
This becomes the minimum size of SIMD data
Warps are implicitly synchronized
If threads branch in different directions, the warp steps through both paths using predicated instructions (see the sketch below)
Two warp schedulers each select one instruction from a warp to issue to 16 cores, 16 ld/st units, or 4 SFUs
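A sketch of divergence within a warp (illustrative kernel): the warp executes both sides of the branch, with inactive lanes masked off by predication:

__global__ void diverge(int *out) {
    int i = threadIdx.x;
    if (i % 2 == 0)
        out[i] = 2*i;     // warp runs this with odd lanes masked off,
    else
        out[i] = i + 1;   // then this with even lanes masked off
}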
Page 17
Maxwell Architecture
2014
16 streaming multiprocessors * 128 cores/SM = 2048 cores
Page 18
Programming CUDA
C code
daxpy(n, 2.0, x, y);   // invoke

void daxpy(int n, double a, double *x, double *y) {
    for (int i = 0; i < n; i++)
        y[i] = a*x[i] + y[i];
}
Page 19
Programming CUDA
CUDA code
// host code: compute grid size and launch
__host__
int nblocks = (n+511)/512;              // grid size
daxpy<<<nblocks,512>>>(n, 2.0, x, y);   // 512 threads/block

// device code: one thread per element
__global__
void daxpy(int n, double a, double *x, double *y) {
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a*x[i] + y[i];
}
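The grid size (n+511)/512 is an integer ceiling division: it rounds up so every element gets a thread. For n=8192 it gives exactly 16 blocks; for n=8000 it still gives 16 blocks (8192 threads), and the if (i < n) guard keeps the 192 surplus threads from touching memory past the arrays.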
Page 20
n=8192, 512 threads/block

grid
  block0
    warp0:  Y[0]=A*X[0]+Y[0]          ... Y[31]=A*X[31]+Y[31]
    ...
    warp15: Y[480]=A*X[480]+Y[480]    ... Y[511]=A*X[511]+Y[511]
  ...
  block15
    warp0:  Y[7680]=A*X[7680]+Y[7680] ... Y[7711]=A*X[7711]+Y[7711]
    ...
    warp15: Y[8160]=A*X[8160]+Y[8160] ... Y[8191]=A*X[8191]+Y[8191]
Page 21
Moving data between host and GPU
int main() {
    double *x, *y, a, *dx, *dy;
    x = (double *)malloc(sizeof(double)*n);
    y = (double *)malloc(sizeof(double)*n);
    // initialize x and y
    …
    cudaMalloc((void **)&dx, n*sizeof(double));
    cudaMalloc((void **)&dy, n*sizeof(double));
    cudaMemcpy(dx, x, n*sizeof(double), cudaMemcpyHostToDevice);
    …
    daxpy<<<nblocks,512>>>(n, 2.0, dx, dy);   // pass device pointers
    cudaThreadSynchronize();
    cudaMemcpy(y, dy, n*sizeof(double), cudaMemcpyDeviceToHost);
    cudaFree(dx); cudaFree(dy);
    free(x); free(y);
}
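Every CUDA API call returns a cudaError_t, and kernel-launch errors surface through cudaGetLastError(); a minimal checking sketch (the CHECK macro is our own, not part of CUDA):

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// Wrap each CUDA call; print the error string and exit on failure.
#define CHECK(call) do {                                   \
    cudaError_t e = (call);                                \
    if (e != cudaSuccess) {                                \
        fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__, \
                cudaGetErrorString(e));                    \
        exit(1);                                           \
    }                                                      \
} while (0)

// Usage:
//   CHECK(cudaMalloc((void **)&dx, n*sizeof(double)));
//   daxpy<<<nblocks,512>>>(n, 2.0, dx, dy);
//   CHECK(cudaGetLastError());       // catches launch failures
//   CHECK(cudaDeviceSynchronize());  // surfaces async kernel errors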