CUDA Toolkit 3.2 Math Library Performance January, 2011
NVIDIA's CUDA Libraries Performance Report

Mar 25, 2016

NVIDIA's new CUDA Libraries Performance Report is now available. The report gives an overview of the performance improvements offered by the current CUDA Toolkit, for example in FFT, BLAS, sparse matrix multiplication and random number generation.
Transcript
Page 1: NVIDIA's CUDA Libraries Performance Report

CUDA Toolkit 3.2

Math Library Performance

January, 2011

Page 2: NVIDIA's CUDA Libraries Performance Report

CUDA Toolkit 3.2 Libraries

NVIDIA GPU-accelerated math libraries:

cuFFT – Fast Fourier Transforms Library

cuBLAS – Complete BLAS library

cuSPARSE – Sparse Matrix Library

cuRAND – Random Number Generation (RNG) Library

For more information on CUDA libraries: http://www.nvidia.com/object/gtc2010-presentation-archive.html#session2216

Page 3: NVIDIA's CUDA Libraries Performance Report

cuFFT

Multi-dimensional Fast Fourier Transforms

New in CUDA 3.2:

Higher performance of 1D, 2D, 3D transforms with dimensions of powers of 2, 3, 5 or 7

Higher performance and accuracy for 1D transform sizes that contain large prime factors
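
The two size classes named above can be told apart with a few lines of code. The following is a minimal sketch in Python and is not part of cuFFT. It reads "dimensions of powers of 2, 3, 5 or 7" as sizes of the form 2^a * 3^b * 5^c * 7^d, which is an assumption about the slide's wording, and the helper names `strip_small_radices` and `is_small_radix_size` are hypothetical.

```python
def strip_small_radices(n, radices=(2, 3, 5, 7)):
    """Divide out all factors of 2, 3, 5 and 7 and return the remaining cofactor."""
    if n < 1:
        raise ValueError("transform size must be a positive integer")
    for r in radices:
        while n % r == 0:
            n //= r
    return n


def is_small_radix_size(n):
    """True when n = 2^a * 3^b * 5^c * 7^d for non-negative integers a, b, c, d."""
    return strip_small_radices(n) == 1


if __name__ == "__main__":
    # Hypothetical sample sizes chosen for illustration only.
    for size in (4096, 6561, 10000, 10007):
        kind = "small-radix" if is_small_radix_size(size) else "has a larger prime factor"
        print(size, kind)
```

A cofactor greater than 1 returned by `strip_small_radices` means the size falls into the second class on the slide, the one with larger prime factors.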

Page 4: NVIDIA's CUDA Libraries Performance Report

cuFFT 3.2 up to 8.8x Faster than MKL

Figure: two line charts of GFLOPS versus N (transform size = 2^N), N = 6 to 23, each comparing cuFFT, MKL and FFTW. Panels: Single-Precision 1D Radix-2 (axis 0 to 400 GFLOPS) and Double-Precision 1D Radix-2 (axis 0 to 120 GFLOPS).

* cuFFT 3.2, Tesla C2050 (Fermi) with ECC on

* MKL 10.1r1, 4-core Core i7 Nehalem @ 3.07 GHz

* FFTW single-thread on same CPU

1D transforms are used in audio processing and as a foundation for 2D and 3D FFTs

Performance may vary based on OS version and motherboard configuration
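
The charts report FFT throughput in GFLOPS, but the report does not say how the operation count is derived. A common benchmarking convention for a complex transform of length N is 5 * N * log2(N) floating-point operations. The sketch below assumes that convention. The function name `fft_gflops` and the timing value in the example are hypothetical and are not measurements from the report.

```python
import math


def fft_gflops(n, seconds, batch=1):
    """Nominal GFLOPS for `batch` complex FFTs of length n finished in `seconds`.

    Assumes the 5 * n * log2(n) operation-count convention, which the
    report itself does not state.
    """
    if n < 2 or seconds <= 0:
        raise ValueError("need n >= 2 and a positive elapsed time")
    flops = 5.0 * n * math.log2(n) * batch
    return flops / seconds / 1e9


if __name__ == "__main__":
    # Hypothetical timing: one transform of size 2^20 in 1 millisecond.
    print(fft_gflops(2 ** 20, 1e-3))
```

Because the count is nominal, two libraries measured this way are compared on elapsed time alone, regardless of how many operations each one really executes.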

Page 5: NVIDIA's CUDA Libraries Performance Report

cuFFT 1D Radix-3 up to 18x Faster than MKL

Figure: two line charts of GFLOPS versus N (transform size = 3^N), N = 2 to 16, each comparing cuFFT (ECC off), cuFFT (ECC on) and MKL. Panels: Single-Precision (axis 0 to 240 GFLOPS) and Double-Precision (axis 0 to 80 GFLOPS).

18x for single-precision, 15x for double-precision

Similar acceleration for radix-5 and -7

* cuFFT 3.2 on Tesla C2050 (Fermi)

* MKL 10.1r1, 4-core Core i7 Nehalem @ 3.07 GHz

Page 6: NVIDIA's CUDA Libraries Performance Report

cuFFT 3.2 up to 100x Faster than cuFFT 3.1

Figure: speedup for C2C single-precision transforms versus 1D transform size (0 to 16384), plotted on a log-scale axis with ticks up to 100x.

cuFFT 3.2 vs. cuFFT 3.1 on Tesla C2050 (Fermi)

Page 7: NVIDIA's CUDA Libraries Performance Report

cuBLAS: Dense Linear Algebra on GPUs

Complete BLAS implementation

Supports all 152 routines for single, double, complex and double complex

New in CUDA 3.2

7x Faster GEMM (matrix multiply) on Fermi GPUs

Higher performance on SGEMM & DGEMM for all matrix sizes and all transpose cases (NN, TT, TN, NT)
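
The four transpose cases refer to the standard BLAS GEMM definition, C = alpha * op(A) * op(B) + beta * C, where op(X) is either X or its transpose. The following pure-Python reference is a minimal sketch of that definition for small matrices. It is not cuBLAS code, and the function names and the 2 x 2 inputs are illustrative assumptions.

```python
def transpose(m):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]


def gemm(trans_a, trans_b, alpha, a, b, beta, c):
    """Reference GEMM: returns alpha * op(A) * op(B) + beta * C.

    trans_a and trans_b are "N" (use the matrix as is) or "T" (transpose it).
    """
    op_a = transpose(a) if trans_a == "T" else a
    op_b = transpose(b) if trans_b == "T" else b
    m, k, n = len(op_a), len(op_b), len(op_b[0])
    if len(op_a[0]) != k:
        raise ValueError("inner dimensions of op(A) and op(B) do not match")
    return [
        [alpha * sum(op_a[i][p] * op_b[p][j] for p in range(k)) + beta * c[i][j]
         for j in range(n)]
        for i in range(m)
    ]


if __name__ == "__main__":
    a = [[1, 2], [3, 4]]
    b = [[5, 6], [7, 8]]
    zero = [[0, 0], [0, 0]]
    for ta, tb in (("N", "N"), ("T", "T"), ("T", "N"), ("N", "T")):
        print(ta + tb, gemm(ta, tb, 1, a, b, 0, zero))
```

All four cases perform the same number of multiply-adds. They differ only in how the operands are traversed in memory, which is why a library can be tuned separately for each case.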

Page 8: NVIDIA's CUDA Libraries Performance Report

Up to 8x Speedup for all GEMM Types

Figure: bar chart of GEMM performance on 4K by 4K matrices, in GFLOPS (axis 0 to 900), comparing cuBLAS 3.2 with MKL.

Bar labels in chart order (GFLOPS):

             SGEMM  CGEMM  DGEMM  ZGEMM
cuBLAS 3.2     636    775    301    295
MKL             78     80     39     40

* cuBLAS 3.2, Tesla C2050 (Fermi), ECC on

* MKL 10.2.3, 4-core Core i7 @ 2.66 GHz
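
To relate the GFLOPS figures to run time, the usual operation count for a real GEMM is 2 * m * n * k (8 * m * n * k for complex data). The sketch below applies that count to the chart's bar labels. It assumes that "4K" means 4096, that the labels 636 and 78 belong to SGEMM, and that 301 and 39 belong to DGEMM, for cuBLAS 3.2 and MKL respectively. None of this is stated explicitly in the report, and the function names are hypothetical.

```python
def gemm_flops(m, n, k, complex_valued=False):
    """Nominal operation count of one GEMM: 2*m*n*k real, 8*m*n*k complex."""
    per_multiply_add = 8 if complex_valued else 2
    return per_multiply_add * m * n * k


def seconds_at(gflops, flops):
    """Elapsed time implied by a sustained rate given in GFLOPS."""
    return flops / (gflops * 1e9)


def speedup(gpu_gflops, cpu_gflops):
    """Ratio of two throughput figures measured on the same problem."""
    return gpu_gflops / cpu_gflops


if __name__ == "__main__":
    n = 4096  # assumption: "4K by 4K" read as 4096 x 4096
    work = gemm_flops(n, n, n)
    # Bar labels taken from the chart; the label-to-bar mapping is an assumption.
    print("SGEMM seconds at 636 GFLOPS:", seconds_at(636, work))
    print("SGEMM speedup:", speedup(636, 78))
    print("DGEMM speedup:", speedup(301, 39))
```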

Page 9: NVIDIA's CUDA Libraries Performance Report

cuBLAS 3.2: Consistently 7x Faster DGEMM Performance

Figure: line chart of DGEMM performance in GFLOPS (axis 0 to 350) versus matrix dimension (m = n = k), comparing DGEMM in cuBLAS 3.2, DGEMM in cuBLAS 3.1 and DGEMM in MKL 10.2.3 with 4 threads. Chart annotations: "Large perf variance in cuBLAS 3.1", "7x Faster than MKL" and "8%".

* cuBLAS 3.2, Tesla C2050 (Fermi), ECC on

* MKL 10.2.3, 4-core Core i7 @ 2.66 GHz

Page 10: NVIDIA's CUDA Libraries Performance Report

cuSPARSE

New library for sparse linear algebra

Conversion routines for dense, COO, CSR and CSC formats

Optimized sparse matrix-vector multiplication

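Figure: the operation $y = \alpha A x + \beta y$, illustrated with a 4 x 4 sparse matrix $A$ holding the seven nonzero values 1.0 to 7.0, the dense vector $x = (1.0, 2.0, 3.0, 4.0)$ and $y = (y_1, y_2, y_3, y_4)$. The positions of the nonzeros within $A$ are not recoverable from the transcript.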
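
The dense-to-CSR conversion and the sparse matrix-vector product y = alpha * A * x + beta * y can be sketched in a few lines. This is a plain Python illustration of the CSR layout, not the cuSPARSE API. The function names are hypothetical, and the placement of the seven nonzeros in the 4 x 4 example matrix is an assumption, since the slide's layout did not survive extraction.

```python
def dense_to_csr(dense):
    """Convert a dense matrix (list of rows) to CSR arrays.

    Returns (values, col_idx, row_ptr); row i owns the entries
    values[row_ptr[i]:row_ptr[i + 1]].
    """
    values, col_idx, row_ptr = [], [], [0]
    for row in dense:
        for j, a in enumerate(row):
            if a != 0:
                values.append(a)
                col_idx.append(j)
        row_ptr.append(len(values))
    return values, col_idx, row_ptr


def csr_spmv(alpha, values, col_idx, row_ptr, x, beta, y):
    """Return alpha * A * x + beta * y for a matrix A stored in CSR form."""
    result = []
    for i in range(len(row_ptr) - 1):
        acc = 0.0
        for p in range(row_ptr[i], row_ptr[i + 1]):
            acc += values[p] * x[col_idx[p]]
        result.append(alpha * acc + beta * y[i])
    return result


if __name__ == "__main__":
    # Assumed placement of the seven nonzero values 1.0 to 7.0.
    a = [
        [1.0, 0.0, 0.0, 0.0],
        [2.0, 3.0, 0.0, 0.0],
        [0.0, 0.0, 4.0, 0.0],
        [5.0, 0.0, 6.0, 7.0],
    ]
    values, col_idx, row_ptr = dense_to_csr(a)
    x = [1.0, 2.0, 3.0, 4.0]
    print(csr_spmv(1.0, values, col_idx, row_ptr, x, 0.0, [0.0] * 4))
```

Only the nonzero entries are stored and visited, so the work per product is proportional to the number of nonzeros rather than to the full matrix size.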

Page 11: NVIDIA's CUDA Libraries Performance Report

32x Faster

Figure: chart titled "Speedup of cuSPARSE vs MKL" for 1 sparse matrix x 6 dense vectors, showing speedup over MKL on a log-scale axis from 1 to 64, with series for single, double, complex and double-complex.

* cuSPARSE 3.2, NVIDIA C2050 (Fermi), ECC on

* MKL 10.2.3, 4-core Core i7 @ 3.07 GHz

Page 12: NVIDIA's CUDA Libraries Performance Report

Performance of 1 Sparse Matrix x 6 Dense Vectors

Figure: chart of GFLOPS (axis 0 to 160) per test case, with test cases roughly in order of increasing sparseness and series for single, double, complex and double-complex.

* cuSPARSE 3.2, NVIDIA C2050 (Fermi), ECC on

Page 13: NVIDIA's CUDA Libraries Performance Report

cuRAND

Library for generating random numbers

Features supported in CUDA 3.2

XORWOW pseudo-random generator

Sobol’ quasi-random number generators

Host API for generating random numbers in bulk

Inline implementation allows use inside GPU functions/kernels

Single- and double-precision, uniform and normal distributions
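
For readers unfamiliar with XORWOW, the recurrence published by Marsaglia (an xorshift generator combined with a Weyl sequence) fits in a few lines. The sketch below is plain Python using the seed constants from Marsaglia's paper. It is not the cuRAND implementation, whose seeding and state layout differ, and the class and method names are hypothetical. The mapping to a uniform value in (0, 1] is likewise an illustrative choice.

```python
MASK32 = 0xFFFFFFFF


class Xorwow:
    """Marsaglia's xorwow generator with the default state from his paper."""

    def __init__(self):
        self.x, self.y, self.z = 123456789, 362436069, 521288629
        self.w, self.v = 88675123, 5783321
        self.d = 6615241  # Weyl sequence counter

    def next_u32(self):
        """Advance the state and return the next 32-bit unsigned integer."""
        t = self.x ^ (self.x >> 2)
        # Shift the five-word state down by one word.
        self.x, self.y, self.z, self.w = self.y, self.z, self.w, self.v
        self.v = (self.v ^ ((self.v << 4) & MASK32)) ^ (t ^ ((t << 1) & MASK32))
        self.d = (self.d + 362437) & MASK32
        return (self.d + self.v) & MASK32

    def next_uniform(self):
        """Map the next integer to a float in the interval (0, 1]."""
        return (self.next_u32() + 1) / 4294967296.0


if __name__ == "__main__":
    rng = Xorwow()
    print([rng.next_u32() for _ in range(3)])
    print(rng.next_uniform())
```

The state is only six 32-bit words and each step uses a handful of shifts, XORs and one addition, which is what makes this kind of generator cheap to run independently in many GPU threads.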

Page 14: NVIDIA's CUDA Libraries Performance Report

cuRAND performance

Figure: two charts of generation rate in GigaSamples / second (axis 0 to 18), one for the Sobol' Quasi-RNG (1 dimension) and one for the XORWOW Pseudo-RNG, each with entries for uniform int, uniform float, uniform double, normal float and normal double.

* cuRAND 3.2, NVIDIA C2050 (Fermi), ECC on
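
The one-dimensional Sobol' case measured above has a particularly simple form: its direction numbers are the powers of one half, and each new point differs from the previous one by a single XOR (the Gray-code construction of Antonov and Saleev). The sketch below illustrates that construction in plain Python. It is not the cuRAND implementation, and the function name `sobol_1d` is hypothetical.

```python
def sobol_1d(count, bits=32):
    """Return the first `count` points of the 1D Sobol' sequence in [0, 1).

    Uses the Gray-code update: point n+1 is point n XOR the direction
    number selected by the lowest zero bit of n.
    """
    # Direction numbers of the first dimension: 1/2, 1/4, 1/8, ... as integers.
    directions = [1 << (bits - 1 - i) for i in range(bits)]
    scale = float(1 << bits)
    x = 0
    points = []
    for n in range(count):
        points.append(x / scale)
        # Find the index of the lowest zero bit of n.
        c = 0
        m = n
        while m & 1:
            m >>= 1
            c += 1
        x ^= directions[c]
    return points


if __name__ == "__main__":
    print(sobol_1d(8))
```

Each update is a bit scan and one XOR, with no multiplications, which is consistent with a quasi-random generator reaching rates comparable to a pseudo-random one.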