GPU Libraries Alan Gray EPCC The University of Edinburgh

Jan 02, 2016
Page 1: GPU  Libraries

GPU Libraries

Alan Gray

EPCC

The University of Edinburgh

Page 2: GPU  Libraries

Overview

• Motivation for Libraries

• Simple example: Matrix Multiplication

• Overview of available GPU libraries


Page 3: GPU  Libraries

Computational Libraries

• There are many “common” computational operations that are relevant for multiple problems

• It is not productive for each user to implement their own version from scratch

• It is also usually very complex to implement in a way that gets optimal performance

• Solution: re-usable libraries
  – The user just integrates a call to the library function within their code
  – The library implementation is optimised for the platform in use

• Obviously this only works if the desired library exists
  – Many CPU libraries have been developed and have been in use for many years
  – An increasing number of GPU libraries are now available

Page 4: GPU  Libraries

Simple Example: Matrix Multiplication

    [1 2]     [5 6]     [1x5 + 2x7   1x6 + 2x8]     [19 22]
    [3 4]  x  [7 8]  =  [3x5 + 4x7   3x6 + 4x8]  =  [43 50]

Page 5: GPU  Libraries

Simple Example: Matrix Multiplication

for (i = 0; i < 2; i++) {
  for (j = 0; j < 2; j++) {
    matrix3[i][j] = 0.;
    for (k = 0; k < 2; k++) {
      matrix3[i][j] += matrix1[i][k] * matrix2[k][j];
    }
  }
}

   matrix1    matrix2                 matrix3

    [1 2]     [5 6]     [1x5 + 2x7   1x6 + 2x8]     [19 22]
    [3 4]  x  [7 8]  =  [3x5 + 4x7   3x6 + 4x8]  =  [43 50]

Page 6: GPU  Libraries

Matrix multiplication for large N

for (i = 0; i < N; i++) {
  for (j = 0; j < N; j++) {
    matrix3[i][j] = 0.;
    for (k = 0; k < N; k++) {
      matrix3[i][j] += matrix1[i][k] * matrix2[k][j];
    }
  }
}

• Each element of the result matrix is built up as the sum of a number of multiplications

• This naïve implementation is not the only order in which the sum can be accumulated

• It is much faster (when N is large) to rearrange the nested loop structure such that small sub-blocks of matrix1 and matrix2 are operated on in turn
  – Because these can be kept resident in fast on-chip caches and/or registers
  – This removes memory access bottlenecks

Page 7: GPU  Libraries

Linear Algebra Libraries

• Matrix multiplication (and similar operations) can be implemented easily by hand, but the results will be sub-optimal

• The Basic Linear Algebra Subprograms (BLAS) library has been around since 1979, and provides a range of basic linear algebra operations
  – With implementations optimised for modern CPUs

• cuBLAS, a GPU-accelerated implementation, is available as part of the CUDA distribution

• Other, more complex linear algebra operations, e.g. matrix inversion and eigenvalue determination, are built out of multiple BLAS operations and are available in LAPACK (CPU)
  – with MAGMA (free) and CULA (commercial) being two alternative GPU-accelerated implementations

Page 8: GPU  Libraries

cuBLAS Matrix Multiplication

• First, note that cuBLAS uses linear indexing with column-major storage: 2D arrays need to be “flattened”, with element (i,j) stored at index [j*ld+i]

int ld = N; // leading dimension (number of rows)

for (i = 0; i < N; i++) {
  for (j = 0; j < N; j++) {
    matrix3[j*ld+i] = 0.;
    for (k = 0; k < N; k++) {
      matrix3[j*ld+i] += matrix1[k*ld+i] * matrix2[j*ld+k];
    }
  }
}

Page 9: GPU  Libraries

cuBLAS Matrix Multiplication


http://docs.nvidia.com/cuda/cublas


Page 12: GPU  Libraries

cuBLAS Matrix Multiplication

• For our simple 2x2 example earlier

double alpha = 1.0;
double beta = 0.0;
int ld = 2; // leading dimension
int N = 2;

cublasHandle_t handle;
cublasCreate(&handle);

// allocate memory for d_matrix1, d_matrix2 and d_matrix3 on the GPU

// copy data to d_matrix1 and d_matrix2 on the GPU

cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, N, N, N, &alpha,
            d_matrix1, ld, d_matrix2, ld, &beta, d_matrix3, ld);

// also some additional code needed to check the success of each operation

// copy result d_matrix3 back from the GPU

// free GPU memory

cublasDestroy(handle);

Page 13: GPU  Libraries

GPU Accelerated Libraries

• developer.nvidia.com/gpu-accelerated-libraries

