VOLTA (TESLA V100)
Akira Naruse, 9th Nov. 2017
VOLTA
GPU ROADMAPS
[Chart: SGEMM/W across GPU generations (Tesla, Fermi, Kepler, Maxwell, Pascal, Volta), 2008-2018]
VOLTA: TESLA V100
The fastest GPU for both HPC and Deep Learning
• Volta Architecture: Most Productive GPU
• Tensor Core: 120 Programmable TFLOPS for Deep Learning
• Improved SIMT Model: New Algorithms
• Volta MPS: Inference Utilization
• Improved NVLink & HBM2: Efficient Bandwidth
VOLTA: TESLA V100
21B transistors, 815 mm²
80 SMs, 5120 CUDA Cores, 640 Tensor Cores
16 GB HBM2, 900 GB/s
300 GB/s NVLink
*full GV100 chip contains 84 SMs
GPU PERFORMANCE COMPARISON

                             P100        V100        Ratio
Compute
  DL ops (FP16 or Mixed)     21 TOPS     120 TOPS    6x
  FP32                       10 TFLOPS   15 TFLOPS   1.5x
  FP64                       5 TFLOPS    7.5 TFLOPS  1.5x
Capacity
  L1 Caches                  1.3 MB      10 MB       7.7x
  L2 Cache                   4 MB        6 MB        1.5x
Bandwidth
  Memory                     720 GB/s    900 GB/s    1.2x
  NVLink                     160 GB/s    300 GB/s    1.9x
NEW HBM2 MEMORY ARCHITECTURE
1.5x effective bandwidth
[Chart: STREAM Triad delivered GB/s, P100 (76% DRAM utilization) vs. V100 (95% DRAM utilization); HBM2 stack]
V100 measured on pre-production hardware.
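For reference, STREAM Triad measures delivered bandwidth with a simple fused multiply-add over three large arrays; a minimal sketch (the kernel name and launch configuration are illustrative, not the benchmark's actual code):

    // Minimal STREAM Triad sketch: a[i] = b[i] + scalar * c[i].
    // Delivered bandwidth = 3 * n * sizeof(double) / elapsed_time.
    __global__ void triad(double *a, const double *b, const double *c,
                          double scalar, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            a[i] = b[i] + scalar * c[i];
    }

    // Example launch: triad<<<(n + 255) / 256, 256>>>(a, b, c, 3.0, n);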
ROAD TO EXASCALE
Volta to Fuel Most Powerful US Supercomputers
[Chart: Volta HPC application performance, ~1.5x relative to Tesla P100]
System Config Info: 2X Xeon E5-2690 v4, 2.6GHz, w/ 1X Tesla P100 or V100. V100 measured on pre-production hardware.
Summit Supercomputer: 200+ PetaFlops, ~3,400 Nodes, 10 Megawatts
VOLTA GV100 SM
GV100 SM resources:
  FP32 units                64
  FP64 units                32
  INT units                 64
  Tensor Cores              8
  Register File             256 KB
  Unified L1/Shared memory  128 KB
  Active Threads            2048
VOLTA GV100 SM
Completely new ISA
Twice the schedulers
Simplified Issue Logic
Large, fast L1 cache
Improved SIMT model
Tensor acceleration
= The easiest SM in GPU history to get performance from: a user-friendly architecture
MPS: MULTI-PROCESS SERVICE
Sharing a GPU across multiple processes, safely and efficiently

[Diagram: Pascal vs. Volta]
  Pascal GP100: CPU processes A, B, C submit work through the CUDA Multi-Process Service; limited isolation during GPU execution.
  Volta GV100: CPU processes A, B, C run under Volta Multi-Process Service control; hardware isolation during GPU execution.
VOLTA: INDEPENDENT THREAD SCHEDULING
Communicating Algorithms
  Pascal: lock-free algorithms only; threads cannot wait for messages.
  Volta: starvation-free algorithms; threads may wait for messages.
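A minimal sketch of what starvation freedom buys, assuming a simple global mutex (the function and variable names are illustrative, not from the slides): on Pascal, a warp could hang if one lane holds the lock while its sibling lanes spin; Volta's independent thread scheduling keeps the lock holder making forward progress.

    // Illustrative intra-warp spin lock (assumed names, not from the slides).
    // Safe on Volta: the lock holder keeps making progress even while
    // sibling lanes of the same warp spin. On Pascal this could deadlock.
    __device__ void locked_increment(int *lock, int *value)
    {
        while (atomicCAS(lock, 0, 1) != 0)
            ;                    // spin until the lock is acquired
        *value += 1;             // critical section
        __threadfence();         // publish the update before release
        atomicExch(lock, 0);     // release the lock
    }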
PASCAL SIMT EXECUTION MODEL
Diverged threads within a warp cannot exchange data.

[Diagram: over time, the warp diverges, executes A; B; and X; Y; serially, then reconverges]

    if (threadIdx.x < 4) {
        A;
        __syncwarp();
        B;
    } else {
        X;
        __syncwarp();
        Y;
    }
VOLTA SIMT EXECUTION MODEL
Even diverged threads within a warp can exchange data.

[Diagram: over time, the two branches interleave A; B; and X; Y; and meet at the __syncwarp() points]

    if (threadIdx.x < 4) {
        A;
        __syncwarp();
        B;
    } else {
        X;
        __syncwarp();
        Y;
    }
    __syncwarp();
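A hedged sketch of the pattern this enables, assuming a one-warp block (the buffer layout and indices are illustrative): each branch publishes data to shared memory, the branches meet at __syncwarp(), then each reads what the other wrote.

    // Illustrative only; assumes blockDim.x == 32 (a single warp).
    __global__ void diverged_exchange(int *out)
    {
        __shared__ int buf[32];
        int lane = threadIdx.x;

        if (lane < 4) {
            buf[lane] = lane;           // A: publish
            __syncwarp();               // meet the other branch
            out[lane] = buf[lane + 4];  // B: read the other branch's data
        } else {
            buf[lane] = lane * 10;      // X: publish
            __syncwarp();               // meet the first branch
            out[lane] = buf[lane - 4];  // Y: read earlier lanes' data
        }
    }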
VOLTA TENSOR CORE
TENSOR CORE
128 ops/cycle, mixed precision

D = A × B + C

where A and B are 4x4 FP16 matrices, and C and D are 4x4 matrices in FP16 or FP32.
TENSOR SYNCHRONIZATION
Full Warp 16x16 Matrix Math
  • Synchronized across the threads of a warp (32 threads)
  • A 16x16 matrix multiply is executed as a combination of 4x4 matrix multiplies
  • The result is distributed across the threads
VOLTA TENSOR OPERATION
  FP16 storage/input
  → full-precision product (FP16 × FP16)
  → sum with FP32 accumulator (looped over more products)
  → convert to FP32 result
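A scalar model of one output element may make the arithmetic concrete (illustrative only; the hardware performs this as a full 4x4 matrix operation per cycle):

    #include <cuda_fp16.h>

    // Illustrative scalar model of one tensor-core output element:
    // FP16 inputs, full-precision products, FP32 accumulation.
    __device__ float tensor_core_elem(const __half a_row[4],
                                      const __half b_col[4], float c)
    {
        float acc = c;                  // FP32 accumulator
        for (int k = 0; k < 4; ++k)     // "more products"
            acc += __half2float(a_row[k]) * __half2float(b_col[k]);
        return acc;                     // FP32 result
    }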
USING TENSOR CORES

Volta Optimized Libraries: NVIDIA cuDNN, cuBLAS, TensorRT

CUDA C++: Warp-Level Matrix Operations (WMMA)

    #include <mma.h>
    using namespace nvcuda;

    __device__ void tensor_op_16_16_16(float *d, half *a, half *b, float *c)
    {
        // 16x16x16 fragments; the accumulator is zero-filled here, so c is unused
        wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> Amat;
        wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> Bmat;
        wmma::fragment<wmma::accumulator, 16, 16, 16, float> Cmat;

        wmma::load_matrix_sync(Amat, a, 16);
        wmma::load_matrix_sync(Bmat, b, 16);
        wmma::fill_fragment(Cmat, 0.0f);

        wmma::mma_sync(Cmat, Amat, Bmat, Cmat);

        wmma::store_matrix_sync(d, Cmat, 16, wmma::row_major);
    }
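A hedged usage sketch (the kernel name and launch are illustrative): WMMA operations are warp-wide, so all 32 threads of a warp must reach the call together.

    // Illustrative wrapper: one warp cooperatively computes one 16x16 tile.
    __global__ void wmma_example(float *d, half *a, half *b, float *c)
    {
        tensor_op_16_16_16(d, a, b, c);   // all 32 lanes participate
    }

    // Example launch with a single warp:
    //   wmma_example<<<1, 32>>>(d, a, b, c);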
VOLTA: A GIANT LEAP FOR DEEP LEARNING
[Charts: images per second, P100 vs. V100]
  ResNet-50 Training: V100 (Tensor Cores) 2.4x faster than P100 (FP32)
  ResNet-50 Inference (TensorRT, 7ms latency): V100 (Tensor Cores) 3.7x faster than P100 (FP16)
V100 measured on pre-production hardware.
Is accuracy really OK when training in FP16?
Yes, it is fine, if you use Tensor Cores
• Mixed Precision Training
  • Forward and backward passes can run almost entirely in FP16 without problems (with Tensor Cores)
  • The update (weight update) is better done in FP32 (the update is only a small fraction of the time)
  • Some models need a technique called loss scaling (small overhead; see the sketch below)
• Available in the major DL frameworks
  • TensorFlow, MXNet, PyTorch, Caffe2, Theano, MS Cognitive Toolkit, Chainer
http://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/index.html
Training with Mixed-Precision User Guide
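As a hedged illustration of the FP32 update plus loss scaling described above (plain SGD and all names here are assumptions, not the guide's code):

    #include <cuda_fp16.h>

    // Hypothetical sketch: FP32 master weights with an FP16 working copy.
    // Gradients arrive in FP16, pre-multiplied by loss_scale during backward.
    __global__ void sgd_update_mixed(float *master_w, __half *w16,
                                     const __half *grad16,
                                     float lr, float loss_scale, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            float g = __half2float(grad16[i]) / loss_scale; // unscale gradient
            master_w[i] -= lr * g;                          // FP32 weight update
            w16[i] = __float2half(master_w[i]);             // refresh FP16 copy
        }
    }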
How much faster is Volta? P100 FP32, V100 FP32 vs. V100 Tensor Core
ResNet-50
[Diagram: ResNet-50 bottleneck block - Conv 1x1 64 → BN → ReLU → Conv 3x3 64 → BN → ReLU → Conv 1x1 256 → BN → +x → ReLU]
(*) Using Chainer 3.0.0rc1+ and CuPy 2.0.0rc1+
How much faster is Volta? P100 FP32, V100 FP32 vs. V100 Tensor Core
ImageNet, ResNet-50, Batch: 128; time per iteration [ms]
[Chart: per-iteration time broken down into Conv, BN, Relu, Cupy_*, Misc.]
  P100 FP32:         570 ms
  V100 FP32:         360 ms
  V100 Tensor Core:  197 ms  (about 3x faster than P100 FP32)
(*) Using Chainer 3.0.0rc1+ and CuPy 2.0.0rc1+
MULTI-GPU PERFORMANCE
ImageNet, ResNet-50, Batch/GPU: 128; images per second

                      1 GPU   2 GPUs   4 GPUs   8 GPUs
  P100 FP32             224      430      857    1,657
  V100 FP32             355      675    1,331    2,530
  V100 Tensor Core      649    1,199    2,359    4,064

(*) Using CUDA 9, cuDNN 7, NCCL 2, Chainer 3.0.0rc1+, CuPy 2.0.0rc1+; machine: DGX-1(V)