VOLTA (TESLA V100)
Akira Naruse, 9th Nov. 2017
VOLTA
GPU ROADMAPS
[Chart: SGEMM/W across GPU generations (Tesla, Fermi, Kepler, Maxwell, Pascal, Volta), 2008-2018]
VOLTA: TESLA V100
The fastest GPU for both HPC and Deep Learning
• Volta Architecture: Most Productive GPU
• Tensor Core: 120 Programmable TFLOPS for Deep Learning
• Improved SIMT Model: New Algorithms
• Volta MPS: Inference Utilization
• Improved NVLink & HBM2: Efficient Bandwidth
VOLTA: TESLA V100
21B transistors, 815 mm²
80 SMs, 5120 CUDA Cores, 640 Tensor Cores
16 GB HBM2, 900 GB/s
300 GB/s NVLink
*full GV100 chip contains 84 SMs
GPU PERFORMANCE COMPARISON

                             P100        V100        Ratio
Compute
  DL ops (FP16 or Mixed)     21 TOPS     120 TOPS    6x
  FP32                       10 TFLOPS   15 TFLOPS   1.5x
  FP64                       5 TFLOPS    7.5 TFLOPS  1.5x
Capacity
  L1 Caches                  1.3 MB      10 MB       7.7x
  L2 Cache                   4 MB        6 MB        1.5x
Bandwidth
  Memory                     720 GB/s    900 GB/s    1.2x
  NVLink                     160 GB/s    300 GB/s    1.9x
NEW HBM2 MEMORY ARCHITECTURE
1.5x effective bandwidth
[Chart: STREAM Triad delivered GB/s, P100 (76% DRAM utilization) vs. V100 (95% DRAM utilization); HBM2 stack]
V100 measured on pre-production hardware.
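For reference, STREAM Triad measures delivered bandwidth with a simple fused multiply-add over three large arrays; a minimal sketch (the kernel name and launch configuration are illustrative, not the benchmark's actual code):

    // Minimal STREAM Triad sketch: a[i] = b[i] + scalar * c[i].
    // Delivered bandwidth = 3 * n * sizeof(double) / elapsed_time.
    __global__ void triad(double *a, const double *b, const double *c,
                          double scalar, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            a[i] = b[i] + scalar * c[i];
    }

    // Example launch: triad<<<(n + 255) / 256, 256>>>(a, b, c, 3.0, n);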
ROAD TO EXASCALE
Volta to Fuel Most Powerful US Supercomputers
[Chart: Volta HPC application performance, ~1.5x relative to Tesla P100]
System Config Info: 2X Xeon E5-2690 v4, 2.6GHz, w/ 1X Tesla P100 or V100. V100 measured on pre-production hardware.
Summit Supercomputer: 200+ PetaFlops, ~3,400 Nodes, 10 Megawatts
VOLTA GV100 SM
GV100 SM resources:
  FP32 units                64
  FP64 units                32
  INT units                 64
  Tensor Cores              8
  Register File             256 KB
  Unified L1/Shared memory  128 KB
  Active Threads            2048
VOLTA GV100 SM
Completely new ISA
Twice the schedulers
Simplified Issue Logic
Large, fast L1 cache
Improved SIMT model
Tensor acceleration
= The easiest SM in GPU history to get performance from: a user-friendly architecture
MPS: MULTI-PROCESS SERVICE
Sharing a GPU across multiple processes, safely and efficiently

[Diagram: Pascal vs. Volta]
  Pascal GP100: CPU processes A, B, C submit work through the CUDA Multi-Process Service; limited isolation during GPU execution.
  Volta GV100: CPU processes A, B, C run under Volta Multi-Process Service control; hardware isolation during GPU execution.
VOLTA: INDEPENDENT THREAD SCHEDULING
Communicating Algorithms
  Pascal: lock-free algorithms only; threads cannot wait for messages.
  Volta: starvation-free algorithms; threads may wait for messages.
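A minimal sketch of what starvation freedom buys, assuming a simple global mutex (the function and variable names are illustrative, not from the slides): on Pascal, a warp could hang if one lane holds the lock while its sibling lanes spin; Volta's independent thread scheduling keeps the lock holder making forward progress.

    // Illustrative intra-warp spin lock (assumed names, not from the slides).
    // Safe on Volta: the lock holder keeps making progress even while
    // sibling lanes of the same warp spin. On Pascal this could deadlock.
    __device__ void locked_increment(int *lock, int *value)
    {
        while (atomicCAS(lock, 0, 1) != 0)
            ;                    // spin until the lock is acquired
        *value += 1;             // critical section
        __threadfence();         // publish the update before release
        atomicExch(lock, 0);     // release the lock
    }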
PASCAL SIMT EXECUTION MODEL
Diverged threads within a warp cannot exchange data.

[Diagram: over time, the warp diverges, executes A; B; and X; Y; serially, then reconverges]

    if (threadIdx.x < 4) {
        A;
        __syncwarp();
        B;
    } else {
        X;
        __syncwarp();
        Y;
    }
VOLTA SIMT EXECUTION MODEL
Even diverged threads within a warp can exchange data.

[Diagram: over time, the two branches interleave A; B; and X; Y; and meet at the __syncwarp() points]

    if (threadIdx.x < 4) {
        A;
        __syncwarp();
        B;
    } else {
        X;
        __syncwarp();
        Y;
    }
    __syncwarp();
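A hedged sketch of the pattern this enables, assuming a one-warp block (the buffer layout and indices are illustrative): each branch publishes data to shared memory, the branches meet at __syncwarp(), then each reads what the other wrote.

    // Illustrative only; assumes blockDim.x == 32 (a single warp).
    __global__ void diverged_exchange(int *out)
    {
        __shared__ int buf[32];
        int lane = threadIdx.x;

        if (lane < 4) {
            buf[lane] = lane;           // A: publish
            __syncwarp();               // meet the other branch
            out[lane] = buf[lane + 4];  // B: read the other branch's data
        } else {
            buf[lane] = lane * 10;      // X: publish
            __syncwarp();               // meet the first branch
            out[lane] = buf[lane - 4];  // Y: read earlier lanes' data
        }
    }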
VOLTA TENSOR CORE
TENSOR CORE
128 ops/cycle, mixed precision

D = A × B + C

where A and B are 4x4 FP16 matrices, and C and D are 4x4 matrices in FP16 or FP32.
TENSOR SYNCHRONIZATION
Full Warp 16x16 Matrix Math
  • Synchronized across the threads of a warp (32 threads)
  • A 16x16 matrix multiply is executed as a combination of 4x4 matrix multiplies
  • The result is distributed across the threads
VOLTA TENSOR OPERATION
  FP16 storage/input
  → full-precision product (FP16 × FP16)
  → sum with FP32 accumulator (looped over more products)
  → convert to FP32 result
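A scalar model of one output element may make the arithmetic concrete (illustrative only; the hardware performs this as a full 4x4 matrix operation per cycle):

    #include <cuda_fp16.h>

    // Illustrative scalar model of one tensor-core output element:
    // FP16 inputs, full-precision products, FP32 accumulation.
    __device__ float tensor_core_elem(const __half a_row[4],
                                      const __half b_col[4], float c)
    {
        float acc = c;                  // FP32 accumulator
        for (int k = 0; k < 4; ++k)     // "more products"
            acc += __half2float(a_row[k]) * __half2float(b_col[k]);
        return acc;                     // FP32 result
    }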
USING TENSOR CORES

Volta Optimized Libraries: NVIDIA cuDNN, cuBLAS, TensorRT

CUDA C++: Warp-Level Matrix Operations (WMMA)

    #include <mma.h>
    using namespace nvcuda;

    __device__ void tensor_op_16_16_16(float *d, half *a, half *b, float *c)
    {
        // 16x16x16 fragments; the accumulator is zero-filled here, so c is unused
        wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> Amat;
        wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> Bmat;
        wmma::fragment<wmma::accumulator, 16, 16, 16, float> Cmat;

        wmma::load_matrix_sync(Amat, a, 16);
        wmma::load_matrix_sync(Bmat, b, 16);
        wmma::fill_fragment(Cmat, 0.0f);

        wmma::mma_sync(Cmat, Amat, Bmat, Cmat);

        wmma::store_matrix_sync(d, Cmat, 16, wmma::row_major);
    }
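A hedged usage sketch (the kernel name and launch are illustrative): WMMA operations are warp-wide, so all 32 threads of a warp must reach the call together.

    // Illustrative wrapper: one warp cooperatively computes one 16x16 tile.
    __global__ void wmma_example(float *d, half *a, half *b, float *c)
    {
        tensor_op_16_16_16(d, a, b, c);   // all 32 lanes participate
    }

    // Example launch with a single warp:
    //   wmma_example<<<1, 32>>>(d, a, b, c);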
VOLTA: A GIANT LEAP FOR DEEP LEARNING
[Charts: images per second, P100 vs. V100]
  ResNet-50 Training: V100 (Tensor Cores) 2.4x faster than P100 (FP32)
  ResNet-50 Inference (TensorRT, 7ms latency): V100 (Tensor Cores) 3.7x faster than P100 (FP16)
V100 measured on pre-production hardware.
Is accuracy really OK when training in FP16?
Yes, it is fine, if you use Tensor Cores
• Mixed Precision Training
  • Forward and backward passes can run almost entirely in FP16 without problems (with Tensor Cores)
  • The update (weight update) is better done in FP32 (the update is only a small fraction of the time)
  • Some models need a technique called loss scaling (small overhead; see the sketch below)
• Available in the major DL frameworks
  • TensorFlow, MXNet, PyTorch, Caffe2, Theano, MS Cognitive Toolkit, Chainer
http://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/index.html
Training with Mixed-Precision User Guide
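As a hedged illustration of the FP32 update plus loss scaling described above (plain SGD and all names here are assumptions, not the guide's code):

    #include <cuda_fp16.h>

    // Hypothetical sketch: FP32 master weights with an FP16 working copy.
    // Gradients arrive in FP16, pre-multiplied by loss_scale during backward.
    __global__ void sgd_update_mixed(float *master_w, __half *w16,
                                     const __half *grad16,
                                     float lr, float loss_scale, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            float g = __half2float(grad16[i]) / loss_scale; // unscale gradient
            master_w[i] -= lr * g;                          // FP32 weight update
            w16[i] = __float2half(master_w[i]);             // refresh FP16 copy
        }
    }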
How much faster is Volta? P100 FP32, V100 FP32 vs. V100 Tensor Core
ResNet-50
[Diagram: ResNet-50 bottleneck block - Conv 1x1 64 → BN → ReLU → Conv 3x3 64 → BN → ReLU → Conv 1x1 256 → BN → +x → ReLU]
(*) Using Chainer 3.0.0rc1+ and CuPy 2.0.0rc1+
How much faster is Volta? P100 FP32, V100 FP32 vs. V100 Tensor Core
ImageNet, ResNet-50, Batch: 128; time per iteration [ms]
[Chart: per-iteration time broken down into Conv, BN, Relu, Cupy_*, Misc.]
  P100 FP32:         570 ms
  V100 FP32:         360 ms
  V100 Tensor Core:  197 ms  (about 3x faster than P100 FP32)
(*) Using Chainer 3.0.0rc1+ and CuPy 2.0.0rc1+
MULTI-GPU PERFORMANCE
ImageNet, ResNet-50, Batch/GPU: 128; images per second

                      1 GPU   2 GPUs   4 GPUs   8 GPUs
  P100 FP32             224      430      857    1,657
  V100 FP32             355      675    1,331    2,530
  V100 Tensor Core      649    1,199    2,359    4,064

(*) Using CUDA 9, cuDNN 7, NCCL 2, Chainer 3.0.0rc1+, CuPy 2.0.0rc1+; machine: DGX-1(V)