Dr. Adolf Hohl, SA AUTO Datacenter EMEA

IBM SPECTRUM STORAGE FOR AI WITH NVIDIA DGX

DGX REFERENCE ARCHITECTURE SOLUTION

IBM SPECTRUM STORAGE FOR AI WITH NVIDIA DGX · 2019-05-08 · IBM SPECTRUM STORAGE FOR AI WITH NVIDIA DGX ... Inception-v3, ResNet-50, GoogLeNet and AlexNet models IBM STORAGE WITH

May 28, 2020

HOW DO WE TRAIN THESE NETWORKS?

• SINGLE GPU CODE is a dying species

• All our AV DL code is made for MULTI-GPU and is scalable:

• Runs on Single GPU

• Runs on Multi GPU

• Runs on Multi Nodes with Multiple GPUs

• We use a Cluster for DL Training

• Just ONE codebase

• Just ONE way to orchestrate

I talked about these in a previous IBM Meetup (https://www.youtube.com/watch?v=8xj4CK4ZUMQ)
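The "just ONE codebase" point above can be sketched in plain Python: the same function describes a single-GPU, a multi-GPU, and a multi-node run, and only the launch parameters change. The names `WorkerLayout` and `layout_for` are hypothetical, introduced here for illustration; they are not from the talk or from any library.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class WorkerLayout:
    """Placement of one training process (hypothetical helper type)."""
    rank: int        # global index of the process across the whole job
    local_rank: int  # index of the GPU on its own node
    node: int        # index of the node hosting the process
    world_size: int  # total number of processes in the job


def layout_for(num_nodes: int, gpus_per_node: int) -> List[WorkerLayout]:
    """Enumerate one process per GPU for any cluster shape."""
    if num_nodes < 1 or gpus_per_node < 1:
        raise ValueError("need at least one node and one GPU per node")
    world_size = num_nodes * gpus_per_node
    return [
        WorkerLayout(rank=node * gpus_per_node + gpu,
                     local_rank=gpu, node=node, world_size=world_size)
        for node in range(num_nodes)
        for gpu in range(gpus_per_node)
    ]


if __name__ == "__main__":
    # Same code path for single GPU, one 8-GPU node, and two 8-GPU nodes.
    for nodes, gpus in [(1, 1), (1, 8), (2, 8)]:
        workers = layout_for(nodes, gpus)
        print(f"{nodes} node(s) x {gpus} GPU(s): {len(workers)} processes")
```

The training script itself would only read its own rank and world size, which is what lets one codebase cover all three cases.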

THE TRUE TCO OF AN AI PLATFORM

1. Designing and Building an AI Compute Platform – from Scratch

[Figure: timeline from Day 1 to Month 3, with CAPEX and OPEX making up the “DIY” TCO]

Stages shown on the timeline:

• Study & exploration

• Platform design

• Productive experimentation

• HW & SW integration

• Troubleshooting

• Software engineering

• Software optimization

• Design and build for scale

• Software re-optimization

• Training at scale

• Insights

Time and budget spent on things other than data science

NVIDIA DGX-1: THE ESSENTIAL TOOL OF AI

Fastest Start, Effortless Productivity, Revolutionary Performance

1 PFLOPS | 8x Tesla V100 32GB | 300 GB/s NVLink Hybrid Cube Mesh

2x Xeon | 8 TB RAID 0 | Quad 100Gbps, Dual 10GbE | 3U — 3500W

8 TB SSD | 8x Tesla V100 16GB / 32GB

STACKING DGX

Aggregating Resources – Scaling Out

InfiniBand/Converged Ethernet

Interconnected Nodes

• Precondition to Scale

• Precondition for effective MultiNode-MultiGPU scaling

• Precondition to aggregate leftover resources

Storage

SCALING WITH HOROVOD

One Process per GPU – One Data Pipeline per GPU

InfiniBand/Converged Ethernet

Storage

Tower (individual process)
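The "one data pipeline per GPU" layout above can be illustrated with a small sketch in which each process reads its own strided slice of the dataset from shared storage. This is plain Python, not Horovod's API; `shard_for_rank` and the file names are hypothetical, and a strided split is only one of several reasonable sharding choices.

```python
from typing import List, Sequence


def shard_for_rank(samples: Sequence[str], rank: int, world_size: int) -> List[str]:
    """Return the strided slice of `samples` that one process reads."""
    if world_size < 1:
        raise ValueError("world_size must be at least 1")
    if not 0 <= rank < world_size:
        raise ValueError("rank must be in [0, world_size)")
    # Process `rank` takes samples rank, rank + world_size, rank + 2 * world_size, ...
    return list(samples[rank::world_size])


if __name__ == "__main__":
    files = [f"img_{i:03d}.jpg" for i in range(10)]  # placeholder file names
    for rank in range(4):
        print(rank, shard_for_rank(files, rank, 4))
```

Because every process pulls its own shard, the aggregate read load on the storage system grows with the number of GPUs, which is why the storage tier sits on the same high-speed fabric as the nodes.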

SOFTWARE STACK TO SCALE OUT

NVIDIA GPU CLOUD (NGC)

Ready to scale

Optimized

MPI, Horovod

NCCL

ngc.nvidia.com

IBM PowerAI

ibmcom/powerai

Ready to scale

Optimized

hub.docker.com/r/ibmcom/powerai/

IBM SPECTRUM STORAGE FOR AI WITH NVIDIA DGX

The Engine to Power Your AI Data Pipeline

HARDWARE

• NVIDIA DGX-1 | up to 9x DGX-1 Systems

• IBM Spectrum Scale NVMe Appliance | 40GB/s per node, 120GB/s in 6RU | 300TB per node

• NETWORK: Mellanox SB7700 Switch | 2x EDR IB with RDMA

SOFTWARE

• NVIDIA DGX SOFTWARE STACK | NVIDIA Optimized Frameworks

• IBM: High performance, low latency, parallel file system

• IBM: Extensible and composable
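A rough back-of-the-envelope check of the figures on this slide, as a sketch only. The storage numbers and the DGX-1 count come from the slide; reading "2x EDR IB" as two 100 Gb/s links per DGX-1 is an assumption, and protocol overhead is ignored. The function names are hypothetical.

```python
STORAGE_GBPS_PER_NODE = 40.0   # GB/s per appliance node (from the slide)
STORAGE_GBPS_TOTAL = 120.0     # GB/s in 6RU (from the slide)
MAX_DGX1 = 9                   # up to 9x DGX-1 systems (from the slide)
EDR_GBITPS_PER_LINK = 100.0    # nominal EDR InfiniBand rate per 4x port, in Gb/s
LINKS_PER_DGX1 = 2             # ASSUMPTION: "2x EDR IB" means two links per DGX-1


def storage_share_gbps(total_gbps: float, num_clients: int) -> float:
    """Even split of aggregate storage bandwidth across clients, in GB/s."""
    if num_clients < 1:
        raise ValueError("need at least one client")
    return total_gbps / num_clients


def link_ceiling_gbps(gbit_per_link: float, links: int) -> float:
    """Raw network ceiling in GB/s (8 bits per byte, no protocol overhead)."""
    return gbit_per_link * links / 8.0


if __name__ == "__main__":
    nodes = STORAGE_GBPS_TOTAL / STORAGE_GBPS_PER_NODE
    share = storage_share_gbps(STORAGE_GBPS_TOTAL, MAX_DGX1)
    ceiling = link_ceiling_gbps(EDR_GBITPS_PER_LINK, LINKS_PER_DGX1)
    print(f"storage nodes implied by 120 / 40: {nodes:.0f}")
    print(f"storage share per DGX-1 with {MAX_DGX1} systems: {share:.1f} GB/s")
    print(f"network ceiling per DGX-1 under the link assumption: {ceiling:.1f} GB/s")
```

Under these assumptions the even storage share per DGX-1 at full scale (about 13.3 GB/s) stays below the raw link ceiling (25 GB/s), so the network would not be the first bottleneck.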

IBM SPECTRUM STORAGE FOR AI WITH NVIDIA DGX: SCALABLE REFERENCE ARCHITECTURES

Scaling with NVIDIA DGX-1

• Start with a single IBM Spectrum Scale NVMe appliance and a single DGX-1

• Grow capacity in a cost-effective, modular approach

• Each config delivers balanced performance, capacity and scale

• The IBM Spectrum Scale NVMe all-flash appliance is power efficient, allowing maximum flexibility when designing rack space and addressing power requirements

Performance at Scale

For multiple DGX-1 servers, the IBM Spectrum Scale on NVMe architecture demonstrates linear scaling up to full saturation of all DGX-1 server GPUs.

The multi-DGX server image processing rates shown demonstrate scalability for the Inception-v4, ResNet-152, VGG-16, Inception-v3, ResNet-50, GoogLeNet and AlexNet models.
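One common way to quantify "linear scaling" is to divide the measured throughput on n servers by n times the single-server throughput. The sketch below does that; `scaling_efficiency` is a hypothetical helper, and the numbers in the example are made-up placeholders, not measurements from the reference architecture.

```python
from typing import Dict


def scaling_efficiency(throughput: Dict[int, float]) -> Dict[int, float]:
    """Map server count -> measured rate / ideal linear rate.

    `throughput` maps the number of servers to images per second and must
    contain a single-server baseline under key 1.
    """
    if 1 not in throughput:
        raise ValueError("need a single-server baseline (key 1)")
    baseline = throughput[1]
    if baseline <= 0:
        raise ValueError("baseline throughput must be positive")
    return {n: rate / (n * baseline) for n, rate in sorted(throughput.items())}


if __name__ == "__main__":
    # PLACEHOLDER values for illustration only, not results from the talk.
    example = {1: 1000.0, 2: 2000.0, 4: 3800.0}
    for servers, eff in scaling_efficiency(example).items():
        print(f"{servers} server(s): {eff:.0%} of linear")
```

An efficiency close to 1.0 at every server count is what the slide means by linear scaling; a drop would point to a bottleneck in storage, network, or the input pipeline.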

IBM STORAGE WITH NVIDIA DGX: FULLY-OPTIMIZED AND QUALIFIED

BUSINESS IMPACT OF IBM SPECTRUM STORAGE FOR AI WITH NVIDIA DGX

THE IMPACT OF IBM STORAGE + NVIDIA DGX ON TIMELINE

2. Deploying an Integrated, Full-Stack AI Solution using DGX Systems

[Figure: timeline from Day 1 to Month 3, comparing “DIY” TCO with DGX TCO; CAPEX marked]

Stages shown on the timeline:

• Study & exploration

• Platform design

• Productive experimentation

• Install and Deploy DGX RA SOLUTION

• Troubleshooting

• Software engineering

• Software optimization

• Design and build for scale

• Software re-optimization

• Training at scale

• Insights

Annotations:

• DGX TCO: deployment cycle shortened

• Wasted time/effort - eliminated

THE IMPACT OF IBM STORAGE + NVIDIA DGX ON TIMELINE

2. Deploying an Integrated, Full-Stack AI Solution using DGX Systems

[Figure: timeline from Day 1 to Week 1, comparing “DIY” TCO with DGX TCO; CAPEX marked]

Stages shown on the timeline:

• Study & exploration

• Install and Deploy DGX RA SOLUTION

• Productive experimentation

• Training at scale

• Insights

IBM & NVIDIA REFERENCE ARCHITECTURE

Validated design for deploying DGX at-scale with IBM Storage

Download at https://bit.ly/2GcYbgO

Learn more about DGX RA Solutions at: https://bit.ly/2OpXYeC
