© APC CUDA math libraries

CUDA math libraries - dkrz.de · CUDA Libraries ... •CUBLAS – linear algebra •CUSPARSE – linear algebra with sparse ... more efficient than in library Sometimes everything

May 03, 2018


truongliem

CUDA Libraries

http://developer.nvidia.com/cuda-tools-ecosystem

CUDA Toolkit

• CUBLAS – linear algebra

• CUSPARSE – linear algebra with sparse matrices

• CUFFT – fast discrete Fourier transform

• CURAND – random number generation

• Thrust – STL-like template library

• NPP – signal and image processing

• NVCUVENC/NVCUVID – video encoder and decoder libraries


3rd party libraries

MAGMA – heterogeneous LAPACK and BLAS

CUSP – algorithms for sparse linear algebra and graph computations

ArrayFire – comprehensive GPU matrix library

CULA Tools

IMSL Fortran Numerical Library

GPU AI – path finding

GPU AI for board games


CUBLAS

BLAS interface implementation

Column-major addressing, 0- and 1-based indexing

C compatibility macros

Level | Complexity | Examples
1 (vector-vector) | $O(n)$ | AXPY: $y = a x + y$; DOT: $s = (x, y)$
2 (matrix-vector) | $O(n^2)$ | GEMV – matrix-vector multiplication
3 (matrix-matrix) | $O(n^3)$ | GEMM – matrix-matrix multiplication
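As a CPU-side reference for the levels listed above, the following sketch spells out AXPY, DOT and GEMV on plain arrays with column-major, 0-based storage. The names `idx`, `axpy_ref`, `dot_ref` and `gemv_ref` are illustrative helpers introduced here, not CUBLAS functions; the real CUBLAS routines additionally take a handle, strides and device pointers.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Column-major, 0-based position of element (i, j) in a matrix with leading dimension ld.
inline std::size_t idx(std::size_t i, std::size_t j, std::size_t ld) { return j * ld + i; }

// Level 1, O(n): y = a * x + y
void axpy_ref(double a, const std::vector<double>& x, std::vector<double>& y) {
    assert(x.size() == y.size());
    for (std::size_t i = 0; i < x.size(); ++i) y[i] += a * x[i];
}

// Level 1, O(n): s = (x, y)
double dot_ref(const std::vector<double>& x, const std::vector<double>& y) {
    assert(x.size() == y.size());
    double s = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) s += x[i] * y[i];
    return s;
}

// Level 2, O(n^2): y = A * x for an m x n column-major matrix A.
std::vector<double> gemv_ref(std::size_t m, std::size_t n,
                             const std::vector<double>& A,
                             const std::vector<double>& x) {
    assert(A.size() == m * n && x.size() == n);
    std::vector<double> y(m, 0.0);
    for (std::size_t j = 0; j < n; ++j)      // walk column by column: contiguous memory
        for (std::size_t i = 0; i < m; ++i)
            y[i] += A[idx(i, j, m)] * x[j];
    return y;
}
```

Note that the 2 x 2 matrix with rows (1, 2) and (3, 4) is stored as `{1, 3, 2, 4}` in this layout, which is the opposite of the usual C row-major order.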


CUBLAS

Naming convention: cublas<T><func>

• <T> – data type:

S – single precision, real number

D – double precision, real number

C – single precision, complex number

Z – double precision, complex number

• <func> – BLAS routine name

• Example: cublasDgemm (double precision, real, GEMM)

In API v.2 (CUDA 4.0+) handles are used for thread safety


CUBLAS

Additional types:

• cuComplex, cuDoubleComplex

• cublasHandle_t

• cublasStatus_t

Helper functions

• cublasCreate() / cublasDestroy()

• cublas{Get|Set}Stream()

• cublas{Get|Set}{Vector|Matrix}[Async]()


CUBLAS - workflow

Initialize the CUBLAS handle (cublasCreate())

Allocate GPU memory and upload data

Call all the necessary CUBLAS functions

Copy results from the GPU to host memory

Free the CUBLAS handle (cublasDestroy())


CUSPARSE

BLAS-like interface implementation for sparse matrices

Sparse = a lot of zero elements

Formats:

• Dense format (often inefficient)

• COO: Coordinate

• CSR/CSC: Compressed Sparse Row/Column

• ELL: Ellpack-Itpack

• HYB: Hybrid

• BSR: Block Compressed Sparse Row


Sparse Formats: COO

nnz = 9

cooValA = [1 4 2 3 5 7 8 9 6]

cooRowIndA = [0 0 1 1 2 2 2 3 3]

cooColIndA = [0 1 1 2 0 3 4 2 4]

A =
    1 4 0 0 0
    0 2 3 0 0
    5 0 0 7 8
    0 0 9 0 6
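To make the COO arrays concrete, here is a minimal host-side sketch that collects the nonzeros of a dense matrix in row order. The `Coo` struct and `dense_to_coo` are hypothetical names for illustration only, and the dense input is assumed to be row-major for readability; CUSPARSE has its own conversion routines.

```cpp
#include <cassert>
#include <vector>

struct Coo {
    std::vector<double> val;   // nonzero values
    std::vector<int> rowInd;   // 0-based row index of each value
    std::vector<int> colInd;   // 0-based column index of each value
};

// Collect the nonzeros of a row-major dense matrix, scanning row by row.
Coo dense_to_coo(const std::vector<double>& dense, int rows, int cols) {
    assert(static_cast<int>(dense.size()) == rows * cols);
    Coo out;
    for (int i = 0; i < rows; ++i)
        for (int j = 0; j < cols; ++j)
            if (dense[i * cols + j] != 0.0) {
                out.val.push_back(dense[i * cols + j]);
                out.rowInd.push_back(i);
                out.colInd.push_back(j);
            }
    return out;
}
```

Applied to the 4 x 5 matrix A on this slide, the function yields the three arrays listed above with nnz = 9.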


Sparse Formats: CSR

nnz = 9

csrValA = [1 4 2 3 5 7 8 9 6]

csrRowPtrA = [0 2 4 7 9]

csrColIndA = [0 1 1 2 0 3 4 2 4]

A =
    1 4 0 0 0
    0 2 3 0 0
    5 0 0 7 8
    0 0 9 0 6
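The compressed row array [0 2 4 7 9] on this slide is the prefix sum of the number of stored elements per row, so it can be derived from row-sorted COO row indices while the value and column arrays stay unchanged. The sketch below shows that step together with a CSR matrix-vector product; `coo_to_csr_rowptr` and `csr_spmv` are illustrative host-side helpers, not CUSPARSE calls.

```cpp
#include <cassert>
#include <vector>

// Build CSR row pointers (0-based) from COO row indices.
// rowPtr[i] is the offset of the first stored element of row i; rowPtr[rows] equals nnz.
// The value and column arrays can be reused as-is only if the COO data is sorted by row.
std::vector<int> coo_to_csr_rowptr(const std::vector<int>& cooRowInd, int rows) {
    std::vector<int> rowPtr(rows + 1, 0);
    for (int r : cooRowInd) {
        assert(r >= 0 && r < rows);
        ++rowPtr[r + 1];                    // count the elements of each row
    }
    for (int i = 0; i < rows; ++i)
        rowPtr[i + 1] += rowPtr[i];         // prefix sum turns counts into offsets
    return rowPtr;
}

// y = A * x for a matrix stored in CSR.
std::vector<double> csr_spmv(const std::vector<double>& val,
                             const std::vector<int>& rowPtr,
                             const std::vector<int>& colInd,
                             const std::vector<double>& x) {
    const int rows = static_cast<int>(rowPtr.size()) - 1;
    std::vector<double> y(rows, 0.0);
    for (int i = 0; i < rows; ++i)
        for (int p = rowPtr[i]; p < rowPtr[i + 1]; ++p)
            y[i] += val[p] * x[colInd[p]];
    return y;
}
```

Multiplying the example matrix by a vector of ones returns the row sums, which is a quick sanity check for the index arrays.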


Sparse Formats: CSC

nnz = 9

cscValA = [1 5 4 2 3 9 7 8 6]

cscRowIndA = [0 2 0 1 1 3 2 2 3]

cscColPtrA = [0 2 4 6 7 9]

A =
    1 4 0 0 0
    0 2 3 0 0
    5 0 0 7 8
    0 0 9 0 6
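CSC stores the same nonzeros as CSR but grouped by column, so one can be obtained from the other with a counting sort on the column index. The following host-side sketch illustrates the idea; the `Csc` struct and `csr_to_csc` are hypothetical names introduced here and do not reproduce the CUSPARSE conversion API.

```cpp
#include <cassert>
#include <vector>

struct Csc {
    std::vector<double> val;   // nonzero values, column by column
    std::vector<int> rowInd;   // 0-based row index of each value
    std::vector<int> colPtr;   // offset of the first element of each column, plus nnz at the end
};

// Convert a 0-based CSR matrix with `cols` columns to CSC using a counting sort.
Csc csr_to_csc(const std::vector<double>& val, const std::vector<int>& rowPtr,
               const std::vector<int>& colInd, int cols) {
    const int rows = static_cast<int>(rowPtr.size()) - 1;
    const int nnz = static_cast<int>(val.size());
    assert(static_cast<int>(colInd.size()) == nnz && rowPtr[rows] == nnz);

    Csc out{std::vector<double>(nnz), std::vector<int>(nnz), std::vector<int>(cols + 1, 0)};
    for (int c : colInd) ++out.colPtr[c + 1];                           // elements per column
    for (int c = 0; c < cols; ++c) out.colPtr[c + 1] += out.colPtr[c];  // prefix sum

    std::vector<int> next(out.colPtr.begin(), out.colPtr.end() - 1);    // next free slot per column
    for (int i = 0; i < rows; ++i)
        for (int p = rowPtr[i]; p < rowPtr[i + 1]; ++p) {
            const int dst = next[colInd[p]]++;
            out.val[dst] = val[p];
            out.rowInd[dst] = i;
        }
    return out;
}
```

Because rows are visited in increasing order, the row indices inside each column come out sorted, which matches the arrays shown on this slide.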


CUSPARSE – features

4 groups of functions: cusparse<T><func>

• Level 1: sparse and dense vectors

• Level 2: sparse matrices and dense vectors

• Level 3: sparse matrices and dense matrices

• Format conversions

Single/Double Precision, Real/Complex values


CUSPARSE – workflow

Initialize the CUSPARSE handle (cusparseCreate())

Allocate GPU memory and upload data

Call all the necessary CUSPARSE functions

Copy results from the GPU to host memory

Free the CUSPARSE handle (cusparseDestroy())


CUFFT - Fast Discrete Fourier Transform

$$F_k = \sum_{n=0}^{N-1} f_n \exp\left(-\frac{2\pi i}{N} k n\right)$$

$$f_n = \frac{1}{N} \sum_{k=0}^{N-1} F_k \exp\left(\frac{2\pi i}{N} k n\right)$$


CUFFT

Interface similar to FFTW (FFTW compatibility)

1D, 2D and 3D forward and inverse DFT

Single/Double Real/Complex

Up to 128M single precision elements in each dimension, 64M for double precision

CUDA Streams support (asynchronous transforms)

IFFT(FFT(A)) = len(A)*A
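The identity above says that the transforms are not normalized: a forward transform followed by an inverse one scales the input by its length. The sketch below reproduces this on the host with a naive O(N^2) DFT; `dft` is an illustrative helper written for this note (the sign convention is an assumption), not a CUFFT function.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <vector>

using cplx = std::complex<double>;

// Naive O(N^2) DFT without any normalization.
// sign = -1 is used as the forward transform, sign = +1 as the inverse one.
std::vector<cplx> dft(const std::vector<cplx>& in, int sign) {
    assert(sign == 1 || sign == -1);
    const int N = static_cast<int>(in.size());
    const double pi = std::acos(-1.0);
    std::vector<cplx> out(N);
    for (int k = 0; k < N; ++k) {
        cplx sum = 0.0;
        for (int n = 0; n < N; ++n)
            sum += in[n] * std::polar(1.0, sign * 2.0 * pi * ((k * n) % N) / N);
        out[k] = sum;
    }
    return out;
}
```

Calling `dft(dft(a, -1), +1)` returns `a` multiplied by `a.size()`, so a caller who wants the original data back has to divide by the length explicitly.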


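CUFFT - Example

Poisson equation:

$$\nabla^2 u(p) = f(p), \quad p = (x, y) \in [0, 1]^2$$

$$f(x, y) = \frac{s(x, y) - 2\sigma^2}{\sigma^4} \exp\left(-\frac{s(x, y)}{2\sigma^2}\right)$$

Exact solution:

$$u(x, y) = \exp\left(-\frac{s(x, y)}{2\sigma^2}\right)$$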


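CUFFT - Example

Numeric solution:

RHS expanded in Fourier harmonics, with $W = \exp(2\pi i / N)$:

$$\hat f(n, m) = \frac{1}{N^2} \sum_{j,k=0}^{N-1} f(x_k, y_j)\, W^{-(nk + mj)}$$

$$\hat u(n, m) = \hat f(n, m)\, h^2 \left(W^{n} + W^{-n} + W^{m} + W^{-m} - 4\right)^{-1}$$

$$u(x_k, y_j) = \sum_{n,m=0}^{N-1} \hat u(n, m)\, W^{nk + mj}$$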
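The three steps of the numeric solution (transform the right-hand side, divide each harmonic by the symbol of the five-point Laplacian, transform back) can be sketched on the host as follows. This is a minimal illustration under stated assumptions, not the code behind the slides: it uses a naive O(N^4) DFT instead of CUFFT, periodic boundaries, a forward transform with negative exponent, and it sets the (0, 0) harmonic to zero, which presumes a right-hand side with zero mean. `Grid`, `dft2` and `solve_poisson` are names invented for this sketch.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <vector>

using cplx = std::complex<double>;
using Grid = std::vector<cplx>;  // N x N values, node (k, j) stored at k * N + j

// Naive 2D DFT: out(n, m) = sum over (k, j) of in(k, j) * W^(sign * (n*k + m*j)),
// with W = exp(2*pi*i / N). No normalization is applied here.
Grid dft2(const Grid& in, int N, int sign) {
    assert(static_cast<int>(in.size()) == N * N);
    const double pi = std::acos(-1.0);
    Grid out(in.size());
    for (int n = 0; n < N; ++n)
        for (int m = 0; m < N; ++m) {
            cplx sum = 0.0;
            for (int k = 0; k < N; ++k)
                for (int j = 0; j < N; ++j)
                    sum += in[k * N + j] *
                           std::polar(1.0, sign * 2.0 * pi * ((n * k + m * j) % N) / N);
            out[n * N + m] = sum;
        }
    return out;
}

// Solve the periodic five-point discrete Poisson problem with grid spacing h.
Grid solve_poisson(const Grid& f, int N, double h) {
    const double pi = std::acos(-1.0);
    Grid spec = dft2(f, N, -1);                      // harmonics of the RHS (times N^2)
    for (int n = 0; n < N; ++n)
        for (int m = 0; m < N; ++m) {
            // W^n + W^-n + W^m + W^-m - 4, written with cosines
            const double symbol = 2.0 * std::cos(2.0 * pi * n / N) +
                                  2.0 * std::cos(2.0 * pi * m / N) - 4.0;
            cplx& v = spec[n * N + m];
            v = (n == 0 && m == 0) ? cplx(0.0) : v * (h * h) / (symbol * N * N);
        }
    return dft2(spec, N, +1);                        // back to grid values of u
}
```

A simple check is to apply the five-point stencil to the returned grid and compare with the right-hand side; for a zero-mean input the two agree up to round-off.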


CURAND

Pseudo- and Quasi-Random Number Generation

XORWOW, MRG32K3A, MTGP32 and SOBOL generators

Distributions:

• Uniform

• [Log]Normal

• Poisson

Has 2 interfaces: for device and for host


NPP: Image & Signal Processing

Similar to IPP

Arithmetic and logical operations

Color model conversion

Compression

Filtering Functions

Geometry transforms

Statistics functions


ArrayFire

A comprehensive GPU matrix library:

• Linear Algebra

• Signal&image processing

• Statistics

• Code timing

• Graphics

Unified array container type:

• Single/Double Real/Complex [Un]signed + boolean

• ND support

• Easy index manipulation (Matlab-like)

• Parallel gfor loops and multi-gpu scaling


ArrayFire

Example: Conway’s Game of Life

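array dA(nx+2, ny+2, nz+2, A, afHost, 1);    // Initialization from the host array A
array dC(nx+2, ny+2, nz+2, s32);             // Neighbor counts
array kernel = constant(1, 3, 3, 3, s32);    // 3x3x3 convolution kernel of ones

for (int step = 1; step <= num_steps; step++) {
    // Neighbors count: the convolution sums the whole 3x3x3 block,
    // so the cell itself is subtracted afterwards
    dC = convolve(dA.as(f32), kernel.as(f32)).as(s32);
    dC -= dA;

    // Evolution: a dead cell is born with 6 or 7 neighbors,
    // a live cell survives with 4 to 7 neighbors
    dA = ((dA==0)*((dC==6) || (dC==7)) +
          (dA==1)*((dC<=7) && (dC>=4))).as(s32);
}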
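For comparison, the same update rule can be written as a plain host loop. The slides do not show their host implementation, so the sketch below is an assumed reference: `life_step` is a hypothetical function, cells outside the grid are treated as dead, and the memory layout is chosen freely.

```cpp
#include <cassert>
#include <vector>

// One step of the 3D life rule on an nx x ny x nz grid of 0/1 cells.
// A dead cell is born with 6 or 7 live neighbors, a live cell survives with 4 to 7.
std::vector<int> life_step(const std::vector<int>& a, int nx, int ny, int nz) {
    assert(static_cast<int>(a.size()) == nx * ny * nz);
    auto at = [&](int x, int y, int z) {
        const bool inside = x >= 0 && x < nx && y >= 0 && y < ny && z >= 0 && z < nz;
        return inside ? a[(z * ny + y) * nx + x] : 0;   // outside the grid counts as dead
    };
    std::vector<int> next(a.size(), 0);
    for (int z = 0; z < nz; ++z)
        for (int y = 0; y < ny; ++y)
            for (int x = 0; x < nx; ++x) {
                int count = -at(x, y, z);               // exclude the cell itself
                for (int dz = -1; dz <= 1; ++dz)
                    for (int dy = -1; dy <= 1; ++dy)
                        for (int dx = -1; dx <= 1; ++dx)
                            count += at(x + dx, y + dy, z + dz);
                const bool alive = at(x, y, z) == 1;
                next[(z * ny + y) * nx + x] =
                    alive ? (count >= 4 && count <= 7) : (count == 6 || count == 7);
            }
    return next;
}
```

Under this rule a solid 2 x 2 x 2 block is a still life: each of its cells has 7 live neighbors, while no adjacent dead cell sees more than 4.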


ArrayFire

Example: Conway’s Game of Life

Steps | Host, sec | AF, sec
10 | 3.2 | 1.798
100 | 32.1 | 1.987
1000 | 314 | 3.324
10000 | - | 12.353


Conclusion

If you are not a professional in some area – use libraries

If you think you are a professional in a particular area – use libraries to begin with

Do not worry if you cannot implement a routine more efficiently than the library does

Sometimes everything above is wrong. But only sometimes.