vMotion for NVIDIA GRID vGPU Virtual Machines: Case Study of vMotion Using MLaaS
Hari Sivaraman, Dimitrios Skarlatos, Lan Vu, Uday Kurkure
GTC 2019
Confidential │ ©2019 VMware, Inc.


Apr 02, 2020



vMotion for NVIDIA GRID vGPU - Agenda

• GPUs in vSphere.

• vMotion for vGPU Architecture.

• Performance of vMotion for vGPU.

• MLaaS – a case study for vMotion performance.

• Conclusions and future work.


vMotion for NVIDIA GRID vGPU – GPUs in vSphere

[Figure: three ways to use GPUs in vSphere. VMware DirectPath I/O: each virtual machine (guest OS, GPU driver, applications) accesses a physical GPU in pass-through mode. Nvidia GRID vGPU: the Nvidia GRID vGPU manager in the hypervisor exposes vGPUs on a GRID GPU to virtual machines (guest OS, GPU driver, applications). vSGA: virtual machines run the VMware GPU driver in the guest and the Nvidia driver runs in the hypervisor. The labels "vMotion" and "Sharing" appear beside the options.]


vMotion for NVIDIA GRID vGPU – vGPU

[Figure: the Nvidia GRID vGPU manager in the hypervisor exposes vGPUs to virtual machines (guest OS, GPU driver, applications). A scheduler shares the GPU among the vGPUs, and each vGPU has its own dedicated device memory.]

• GPU Memory is statically shared

• GPU memory per VM is called vGPU Profile

• For example: with the P40-1q profile for the P40 GPU, each vGPU has 1 GB of device memory, which allows 24 vGPUs per physical P40

• CUDA cores are time-shared
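
The static memory split can be checked with a little arithmetic. The sketch below assumes the P40 has 24 GB of device memory (consistent with 24 vGPUs of 1 GB each); the function name and the profile sizes other than 1 GB are illustrative, not part of any NVIDIA API.

```python
def vgpus_per_gpu(gpu_memory_gb: int, profile_memory_gb: int) -> int:
    """Number of vGPUs of one profile that fit on a physical GPU.

    Device memory is statically partitioned, so the count is a plain
    integer division. CUDA cores are time-shared and are not modeled here.
    """
    if profile_memory_gb <= 0:
        raise ValueError("profile memory must be positive")
    return gpu_memory_gb // profile_memory_gb


P40_MEMORY_GB = 24  # assumption: total device memory of one P40

# P40-1q: 1 GB per vGPU, as in the example above.
print("1 GB per vGPU ->", vgpus_per_gpu(P40_MEMORY_GB, 1), "vGPUs")

# Hypothetical larger profile sizes, for comparison only.
for size_gb in (2, 4, 8, 24):
    print(size_gb, "GB per vGPU ->", vgpus_per_gpu(P40_MEMORY_GB, size_gb), "vGPUs")
```

Larger profiles give each VM more device memory but reduce how many VMs can share one physical GPU.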


vMotion for NVIDIA GRID vGPU – Types of vMotion

[Figure: vMotion moves a virtual machine from a source ESX host to a destination ESX host (VMware ESXi & ESX) over the vMotion network; a datastore is also shown.]


vMotion for NVIDIA GRID vGPU – vMotion

1. Pre-copy memory pages
2. Stun the VM
3. Checkpoint devices
4. Transfer device checkpoint data (includes vGPU memory data)
5. Power on the VM and transfer pages from main memory

[Figure: vMotion between two VMware ESXi & ESX hosts.]
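
Step 4 moves vGPU memory across the vMotion network, so its duration grows with the vGPU profile size. The following is a rough lower-bound sketch, assuming a 10 Gb/s link (as in a 10GbE test-bed), decimal gigabytes, and no protocol overhead; the function name and the memory sizes other than 1 GB are illustrative.

```python
def transfer_seconds(data_gb: float, link_gbps: float = 10.0) -> float:
    """Ideal time to move data_gb gigabytes over a link_gbps link.

    Assumes 1 GB = 8 Gb and a fully utilized link, so real transfers
    will take longer than this estimate.
    """
    if link_gbps <= 0:
        raise ValueError("link speed must be positive")
    return data_gb * 8.0 / link_gbps


# vGPU memory sizes in GB; only the 1 GB case comes from the slides.
for vgpu_memory_gb in (1, 8, 24):
    seconds = transfer_seconds(vgpu_memory_gb)
    print(f"{vgpu_memory_gb:>2} GB of vGPU memory: at least {seconds:.1f} s on 10 Gb/s")
```

The estimate ignores the checkpoint itself and any data beyond vGPU memory, so it only bounds the transfer part of the step from below.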


vMotion for NVIDIA GRID vGPU - Workloads

Workloads on VMware vSphere:

• VDI

• Cloud Hosted CAD

• MLaaS


vMotion for NVIDIA GRID vGPU – Test-bed

Two hosts connected through a 10GbE switch. Each host:

• VMware ESXi 6.7u1

• Dell R730 with Intel Broadwell CPUs: E5-2698 v4, 40 cores (2 x 20-core sockets)

• 768 GB RAM

• 1 x NVIDIA GRID P40

• ESX: 6.7u1; Nvidia driver: 410.68


vMotion for NVIDIA GRID vGPU – Performance of Word

Increase in vMotion time due to vGPU is just marginally more than measurement noise.


vMotion for NVIDIA GRID vGPU – Performance of SPECapc for 3dsmax 2015

Benchmark: SPECapc for 3dsmax 2015

Software: Autodesk 3dsmax 2015

Negligible increase in run-time due to vMotion!


Revenues from the Artificial Intelligence (AI) market worldwide from 2016 to 2025

The largest proportion of revenue comes from ML/AI enterprise applications


ML/AI Enterprise Application Deployment

[Figure: in enterprise datacenters and clouds, ML/AI apps run on top of Machine Learning as a Service, which is backed by GPUs, FPGAs, and CPUs.]


Machine Learning as a Service

Example #1 of deploying MLaaS on VMware vSphere

[Figure: virtual machines running ML frameworks on VMware vSphere, on a physical server with CPUs and GPUs. The GPUs are attached in pass-through mode.]


Machine Learning as a Service

Example #2 of deploying MLaaS on VMware vSphere

[Figure: virtual machines running ML frameworks on VMware vSphere, on a physical server with CPUs and GPUs. The GPUs are shared through mediated pass-through as NVIDIA GRID vGPUs.]


Machine Learning as a Service

Example #3 of deploying MLaaS on VMware vSphere with Container

[Figure: virtual machines on VMware vSphere run ML frameworks inside Docker containers, on a physical server with CPUs and GPUs. The GPUs are shared as NVIDIA GRID vGPUs.]


Machine Learning as a Service

Example #4 of deploying MLaaS on VMware vSphere with Container & Kubernetes

[Figure: on VMware vSphere, a Kubernetes worker virtual machine runs ML frameworks in Docker containers with an NVIDIA GRID vGPU, alongside a Kubernetes master virtual machine, on a physical server with CPUs and GPUs.]


Machine Learning as a Service

Example #4 of deploying MLaaS on VMware vSphere with Container & Kubernetes

[Figure: the same deployment across two VMware vSphere physical servers with CPUs and GPUs. One server hosts a Kubernetes worker virtual machine and the Kubernetes master virtual machine; the other hosts Kubernetes worker virtual machines. Each worker runs ML frameworks in Docker containers with an NVIDIA GRID vGPU.]


Experiments of MLaaS on VMware vSphere: Hardware and Software

• VMware ESXi 6.5 on a Dell R730 with Intel Haswell CPUs (36 cores) + NVIDIA P40 GPU

• VMware ESXi 6.5 on Intel Haswell CPUs, 1 VM with 18 vCPU

• MLaaS clients request predictions and receive responses


Experiment #1: Inference Throughput

Deep Neural Network: Inception V3 vs. MobileNet (higher is better)

Models:

• Inception V3: 48 layers, 5000 million MAC

• MobileNet: 28 layers, 569 million MAC


Experiment #1: Inference Mean Latency

Deep Neural Network: Inception V3 vs. MobileNet

Models:

• Inception V3: 48 layers, 5000 million MAC

• MobileNet: 28 layers, 569 million MAC
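
Throughput and mean latency, the two metrics in Experiment #1, can be computed from per-request timings. This is a minimal sketch, not the harness used for the experiments: `fake_predict` is a hypothetical stand-in for a real request to the MLaaS server, and requests are issued sequentially from a single client.

```python
import time
from statistics import mean
from typing import Callable, List, Tuple


def measure(predict: Callable[[int], object], num_requests: int) -> Tuple[float, float]:
    """Return (throughput in requests/s, mean latency in s)."""
    if num_requests <= 0:
        raise ValueError("num_requests must be positive")
    latencies: List[float] = []
    start = time.perf_counter()
    for request_id in range(num_requests):
        sent = time.perf_counter()
        predict(request_id)  # request a prediction and wait for the response
        latencies.append(time.perf_counter() - sent)
    elapsed = time.perf_counter() - start
    return num_requests / elapsed, mean(latencies)


def fake_predict(request_id: int) -> int:
    """Hypothetical placeholder for an inference call."""
    time.sleep(0.001)
    return request_id


throughput, latency = measure(fake_predict, 50)
print(f"throughput: {throughput:.1f} requests/s, mean latency: {latency * 1000:.2f} ms")
```

With a single sequential client, throughput is roughly the inverse of mean latency; with many concurrent clients the two metrics diverge, which is why they are usually reported separately.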


Experiment #2: Inference Throughput

[Chart: inference throughput for two configurations, 36 CPU cores and 8 CPU cores & 1 GPU. Higher is better.]


Experiment #2: Mean Inference Latency

[Chart: mean inference latency for two configurations, 36 CPU cores and 8 CPU cores & 1 GPU. Lower is better.]


Machine Learning as a Service

vMotion for NVIDIA GRID vGPU - MLaaS

[Figure: a Kubernetes worker virtual machine running ML frameworks in a Docker container with an NVIDIA GRID vGPU is moved with vMotion between two VMware vSphere physical servers (CPUs and GPUs) while clients are connected.]


vMotion Stun Time

vMotion for NVIDIA GRID vGPU - MLaaS


vMotion for Nvidia GRID vGPU: Conclusions and Upcoming Improvements

Conclusions:

• vMotion for Nvidia GRID vGPU is now available.

• The performance impact of vMotion on VDI, CAD and ML applications is negligible or small.

• The performance impact of multiple vMotions running concurrently is small.

Upcoming Improvements:

• Speed up the transfer rate of device checkpoint and vGPU memory data.

• Pre-copy vGPU memory data to reduce stun time to meet or exceed vMotion’s standard of 1 second.
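
The pre-copy improvement can be framed as a budget: only the vGPU data still left to copy once the VM is stunned counts against the 1-second target. The following is a back-of-the-envelope sketch, assuming a 10 Gb/s vMotion link with no overhead and ignoring checkpoint and resume costs; the function names and the 24 GB example are illustrative assumptions, not measurements.

```python
STUN_TARGET_S = 1.0  # vMotion's 1-second standard, per the slide


def max_stun_transfer_gb(budget_s: float = STUN_TARGET_S, link_gbps: float = 10.0) -> float:
    """Most data (decimal GB) that fits in the stun budget on an ideal link."""
    return budget_s * link_gbps / 8.0


def min_precopy_fraction(vgpu_memory_gb: float, link_gbps: float = 10.0) -> float:
    """Fraction of vGPU memory that must be copied before the stun."""
    if vgpu_memory_gb <= 0:
        raise ValueError("vGPU memory must be positive")
    allowed_gb = max_stun_transfer_gb(link_gbps=link_gbps)
    return max(0.0, 1.0 - allowed_gb / vgpu_memory_gb)


print(f"data that fits in the stun window: {max_stun_transfer_gb():.2f} GB")
for size_gb in (1, 24):
    print(f"{size_gb} GB vGPU: pre-copy at least {min_precopy_fraction(size_gb):.0%}")
```

Under these assumptions a small vGPU already fits in the budget, while a large one needs almost all of its memory copied ahead of the stun, which is where a faster transfer rate and pre-copy complement each other.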