NVIDIA's CUDA Libraries Performance Report

Posted on 25-Mar-2016


DESCRIPTION

NVIDIA's new CUDA Libraries Performance Report is now available. The report gives an overview of the performance improvements delivered by the current CUDA Toolkit, for example in the areas of FFT, BLAS, sparse matrix multiplication, and random number generation.

Transcript

CUDA Toolkit 3.2

Math Library Performance

January, 2011


CUDA Toolkit 3.2 Libraries

NVIDIA GPU-accelerated math libraries:

cuFFT – Fast Fourier Transforms Library

cuBLAS – Complete BLAS library

cuSPARSE – Sparse Matrix Library

cuRAND – Random Number Generation (RNG) Library

For more information on CUDA libraries: http://www.nvidia.com/object/gtc2010-presentation-archive.html#session2216


cuFFT

Multi-dimensional Fast Fourier Transforms

New in CUDA 3.2:

Higher performance for 1D, 2D, and 3D transforms with dimensions that are powers of 2, 3, 5, or 7

Higher performance and accuracy for 1D transform sizes that contain large prime factors
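The transforms described above are driven through cuFFT's host API. A minimal sketch of a 1D complex-to-complex forward transform (error handling and data initialization omitted; requires a CUDA-capable GPU and linking with -lcufft):

```c
#include <cufft.h>
#include <cuda_runtime.h>

int main(void)
{
    const int nx = 1 << 20;          /* 2^20-point transform (power of 2) */
    cufftComplex *data;
    cudaMalloc((void **)&data, sizeof(cufftComplex) * nx);
    /* ... fill `data` with input samples, e.g. cudaMemcpy from host ... */

    cufftHandle plan;
    cufftPlan1d(&plan, nx, CUFFT_C2C, 1);           /* batch of 1 */
    cufftExecC2C(plan, data, data, CUFFT_FORWARD);  /* in-place forward FFT */

    cufftDestroy(plan);
    cudaFree(data);
    return 0;
}
```

The same plan can be reused for repeated transforms of the same size, which amortizes the planning cost.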


cuFFT 3.2 up to 8.8x Faster than MKL

[Figure: GFLOPS vs. N (transform size = 2^N, N = 6-23) for single-precision and double-precision 1D radix-2 transforms, comparing cuFFT, MKL, and FFTW]

* cuFFT 3.2, Tesla C2050 (Fermi) with ECC on
* MKL 10.1r1, 4-core Core i7 Nehalem @ 3.07GHz
* FFTW single-thread on same CPU

1D transforms are used in audio processing and as a foundation for 2D and 3D FFTs

Performance may vary based on OS version and motherboard configuration


cuFFT 1D Radix-3 up to 18x Faster than MKL

[Figure: GFLOPS vs. N (transform size = 3^N, N = 2-16) for single-precision and double-precision 1D radix-3 transforms, comparing cuFFT (ECC off), cuFFT (ECC on), and MKL]

18x for single-precision, 15x for double-precision

Similar acceleration for radix-5 and radix-7

* cuFFT 3.2 on Tesla C2050 (Fermi)
* MKL 10.1r1, 4-core Core i7 Nehalem @ 3.07GHz

Performance may vary based on OS version and motherboard configuration


cuFFT 3.2 up to 100x Faster than cuFFT 3.1

[Figure: Speedup (log scale, 1x-100x) vs. 1D transform size (0-16384) for C2C single-precision transforms]

cuFFT 3.2 vs. cuFFT 3.1 on Tesla C2050 (Fermi)

Performance may vary based on OS version and motherboard configuration


cuBLAS: Dense Linear Algebra on GPUs

Complete BLAS implementation

Supports all 152 routines for single, double, complex, and double-complex precision

New in CUDA 3.2:

7x faster GEMM (matrix multiply) on Fermi GPUs

Higher performance on SGEMM and DGEMM for all matrix sizes and all transpose cases (NN, TT, TN, NT)
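A GEMM call of the kind benchmarked here looks as follows with the CUDA 3.2-era legacy cuBLAS API. This is a hedged sketch, not a complete program: host data setup and error checks are omitted, matrices are column-major, and it requires a CUDA-capable GPU and linking with -lcublas.

```c
#include <cublas.h>
#include <cuda_runtime.h>

int main(void)
{
    const int n = 4096;              /* m = n = k = 4096, as in the benchmark */
    float *dA, *dB, *dC;
    cublasInit();
    cudaMalloc((void **)&dA, sizeof(float) * n * n);
    cudaMalloc((void **)&dB, sizeof(float) * n * n);
    cudaMalloc((void **)&dC, sizeof(float) * n * n);
    /* ... copy column-major host matrices into dA, dB, dC ... */

    /* 'N','N' selects the NN transpose case; TT, TN, NT are also tuned */
    cublasSgemm('N', 'N', n, n, n, 1.0f, dA, n, dB, n, 0.0f, dC, n);

    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    cublasShutdown();
    return 0;
}
```

Swapping cublasSgemm for cublasDgemm, cublasCgemm, or cublasZgemm (with matching element types) covers the other three GEMM variants in the chart.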


Up to 8x Speedup for all GEMM Types

GEMM performance on 4K x 4K matrices (GFLOPS):

             SGEMM   CGEMM   DGEMM   ZGEMM
cuBLAS 3.2     636     775     301     295
MKL             78      80      39      40

* cuBLAS 3.2, Tesla C2050 (Fermi), ECC on
* MKL 10.2.3, 4-core Core i7 @ 2.66GHz

Performance may vary based on OS version and motherboard configuration


cuBLAS 3.2: Consistently 7x Faster DGEMM Performance

[Figure: DGEMM GFLOPS vs. matrix dimension (m = n = k) for cuBLAS 3.2, cuBLAS 3.1, and MKL (4 threads); cuBLAS 3.2 is consistently about 7x faster than MKL, while cuBLAS 3.1 shows large performance variance across sizes]

* cuBLAS 3.2, Tesla C2050 (Fermi), ECC on
* MKL 10.2.3, 4-core Core i7 @ 2.66GHz

Performance may vary based on OS version and motherboard configuration


cuSPARSE

New library for sparse linear algebra

Conversion routines for dense, COO, CSR and CSC formats

Optimized sparse matrix-vector multiplication

[Figure: sparse matrix-vector multiplication y = α·op(A)·x + β·y, illustrated with a 4x4 sparse matrix in which only the nonzero entries are stored]


cuSPARSE up to 32x Faster than MKL

[Figure: Speedup of cuSPARSE over MKL (log scale, 1x-64x) for 1 sparse matrix x 6 dense vectors, across single, double, complex, and double-complex precision]

* cuSPARSE 3.2, NVIDIA C2050 (Fermi), ECC on
* MKL 10.2.3, 4-core Core i7 @ 3.07GHz

Performance may vary based on OS version and motherboard configuration


Performance of 1 Sparse Matrix x 6 Dense Vectors

[Figure: GFLOPS for single, double, complex, and double-complex precision across test cases, roughly in order of increasing sparsity]

* cuSPARSE 3.2, NVIDIA C2050 (Fermi), ECC on
Performance may vary based on OS version and motherboard configuration


cuRAND

Library for generating random numbers

Features supported in CUDA 3.2:

XORWOW pseudo-random generator

Sobol’ quasi-random number generators

Host API for generating random numbers in bulk

Inline implementation allows use inside GPU functions/kernels

Single- and double-precision, uniform and normal distributions
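The host API mentioned above fills a device buffer with random numbers in one bulk call. A minimal sketch using the XORWOW pseudo-random generator (error checks omitted; requires a CUDA-capable GPU and linking with -lcurand):

```c
#include <curand.h>
#include <cuda_runtime.h>

int main(void)
{
    const size_t n = 1 << 24;        /* 16M samples in one bulk call */
    float *devData;
    cudaMalloc((void **)&devData, n * sizeof(float));

    curandGenerator_t gen;
    curandCreateGenerator(&gen, CURAND_RNG_PSEUDO_XORWOW);
    curandSetPseudoRandomGeneratorSeed(gen, 1234ULL);
    curandGenerateUniform(gen, devData, n);   /* uniform floats in (0, 1] */

    curandDestroyGenerator(gen);
    cudaFree(devData);
    return 0;
}
```

Substituting CURAND_RNG_QUASI_SOBOL32 for the generator type, or curandGenerateNormal for the generate call, covers the quasi-random and normal-distribution cases benchmarked on the next slide.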


cuRAND Performance

[Figure: GigaSamples/second for the Sobol' quasi-RNG (1 dimension) and the XORWOW pseudo-RNG, each generating uniform int, uniform float, uniform double, normal float, and normal double values]

* cuRAND 3.2, NVIDIA C2050 (Fermi), ECC on
Performance may vary based on OS version and motherboard configuration
