Software Labs. / SK Telecom: A Case Study in Building a GPU Cloud for Efficient AI Infrastructure

Aug 29, 2019


trandang
Transcript
Page 1

Software Labs. / SK Telecom

A Case Study in Building a GPU Cloud for Efficient AI Infrastructure

Page 2

AI Speaker, HD Map, Self-driving, Navigation, Surveillance, Media Services, Network

Big Data

Network Datacenter System

High Performance Computing

AI

AI Services

Machine Learning Infra

Page 3

SK Telecom AI Infra

Legacy AI development environment

Speech recognition, image recognition, language understanding, knowledge technology, search technology

User (developer)

Resource administrator

AI development environment request

AI development environment provisioning

Infra Resource (CPU/GPU)

Container creation/reclamation
Docker image management
Infra resource management

[Diagram: the group of user (developer), resource administrator, and infra resource (CPU/GPU) appears five times in total]

GPU Cloud build-out, GPU Cloud enhancement

Page 4

SK Telecom AI Infra

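Speech recognition, image recognition, language understanding, knowledge technology, search technology

User (developer) ×5

Resource Pool

GPU Cloud solution

AI development environment request

AI development environment provisioning

Container creation/reclamation
Docker image management
Infra resource management

Legacy AI development environment, GPU Cloud build-out, GPU Cloud enhancement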

Page 5

SK Telecom AI Infra

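Speech recognition, image recognition, language understanding, knowledge technology, search technology

User (developer) ×5

Resource Pool

GPU Cloud solution

Training job submission

Training results returned

Job scheduling
Resource allocation
Distributed and parallel processing

Legacy AI development environment, GPU Cloud build-out, GPU Cloud enhancement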

Page 6

What is SCALE?

Maximizing GPU infrastructure efficiency

High-speed parallel and distributed training

Easy-to-use AI development environment

Ease of AI infrastructure management

SKT Cloud for AI Learning: Cloud Solution for Private GPU Cluster

Static Allocation, Dynamic Scheduling (IaaS, PaaS)

Docker Image Registry & Build Server

Parallel Execution w/ Parameter Optimization

Storage & Caching

Preemptive Scheduling

Virtual Scale-Out

Page 7

Utilizing EXPENSIVE GPUs

Page 8

IaaS

Static Allocation

Page 9

Dynamic Scheduling

PaaS

Page 10

IaaS + PaaS Example

GPU, Developer

Team A

Team B

Team C

IaaS (Static Docker Allocation), PaaS (Dynamic Job Scheduling)

Parallel Execution
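The slide contrasts static allocation (IaaS) with dynamic job scheduling (PaaS) on one shared GPU pool. The sketch below is only an illustration of that split, not SCALE's implementation: the class `GpuPool`, its method names, and the FIFO dispatch rule are all assumptions made for the example.

```python
from collections import deque


class GpuPool:
    """Hypothetical shared pool: static (IaaS) and dynamic (PaaS) GPU use."""

    def __init__(self, total_gpus):
        self.free = total_gpus
        self.static = {}      # team -> GPUs pinned to long-lived containers
        self.running = {}     # job -> GPUs held only while the job runs
        self.queue = deque()  # (job, gpus) waiting for free GPUs

    def allocate_static(self, team, gpus):
        # IaaS path: GPUs stay with the team until explicitly released.
        if gpus > self.free:
            raise ValueError("not enough free GPUs")
        self.free -= gpus
        self.static[team] = self.static.get(team, 0) + gpus

    def release_static(self, team):
        self.free += self.static.pop(team, 0)
        self._dispatch()

    def submit(self, job, gpus):
        # PaaS path: the job waits in a queue and runs when GPUs are free.
        self.queue.append((job, gpus))
        self._dispatch()

    def finish(self, job):
        self.free += self.running.pop(job)
        self._dispatch()

    def _dispatch(self):
        # Assumed policy: strict FIFO, start jobs while the head fits.
        while self.queue and self.queue[0][1] <= self.free:
            job, gpus = self.queue.popleft()
            self.free -= gpus
            self.running[job] = gpus
```

With an 8-GPU pool, a team holding 2 GPUs statically leaves 6 for queued jobs; a second 4-GPU job waits until the first one finishes and returns its GPUs.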

Page 11

Docker Image Registry & Build Server

Automated Docker Image Build Server

Off-the-shelf Images

• Kernel Version

• CUDA Version

• TensorFlow Version

Customized Images

Image Registry, Pull, Development

Page 12

D C B A

Parallel Execution

PaaS

Submit Queue

Job A

Job B

Job C

Job D

Data A

Data B

Data C

Data D
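The slide shows four independent jobs, each with its own data, leaving the submit queue and running at the same time. A minimal sketch of that pattern follows; `run_job` is a hypothetical placeholder for a real training job, and a thread pool stands in for the GPU workers.

```python
from concurrent.futures import ThreadPoolExecutor


def run_job(job_name, data):
    # Hypothetical placeholder for a training job: it only sums its data.
    return job_name, sum(data)


def run_queue(submit_queue, num_workers=4):
    # Each queued (job, data) pair is handed to its own worker.
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        futures = [pool.submit(run_job, name, data) for name, data in submit_queue]
        return dict(f.result() for f in futures)


submit_queue = [
    ("Job A", [1, 2, 3]),  # Data A
    ("Job B", [4, 5]),     # Data B
    ("Job C", [6]),        # Data C
    ("Job D", [7, 8, 9]),  # Data D
]
results = run_queue(submit_queue)
```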

Page 13

A-4 A-3 A-2 A-1

Parallel Execution w/ Hyperparameter Optimization

PaaS

Submit Queue

Job A-1

Job A-2

Job A-3

Job A-4

Data A

Data A

Data A

Data A

Hyperparameters

- Act. Function = ReLU, Learning Rate = 0.5, Momentum = 0.7
- Act. Function = ReLU, Learning Rate = 0.7, Momentum = 0.9
- Act. Function = Leaky ReLU, Learning Rate = 0.5, Momentum = 0.7
- Act. Function = Leaky ReLU, Learning Rate = 0.7, Momentum = 0.9
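The same model and data are submitted four times, once per hyperparameter set. The sketch below shows one way such a sweep could be expressed. The pairing of each set with Job A-1 through A-4, the helper names, and `toy_loss` (a stand-in objective, not a training result) are assumptions for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

# The four hyperparameter sets listed on the slide.
CONFIGS = [
    {"activation": "relu", "learning_rate": 0.5, "momentum": 0.7},
    {"activation": "relu", "learning_rate": 0.7, "momentum": 0.9},
    {"activation": "leaky_relu", "learning_rate": 0.5, "momentum": 0.7},
    {"activation": "leaky_relu", "learning_rate": 0.7, "momentum": 0.9},
]


def build_jobs(base_name, configs):
    # Assumed naming: Job A-1 ... Job A-4, in the order the sets are listed.
    return [(f"{base_name}-{i}", cfg) for i, cfg in enumerate(configs, start=1)]


def run_sweep(jobs, train_fn, num_workers=4):
    # Run every configuration concurrently; train_fn returns a loss to minimize.
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        losses = list(pool.map(lambda job: train_fn(job[1]), jobs))
    results = {name: loss for (name, _), loss in zip(jobs, losses)}
    best = min(results, key=results.get)
    return best, results


def toy_loss(cfg):
    # Toy objective for demonstration only; a real train_fn would train a model.
    return cfg["learning_rate"] * cfg["momentum"]


jobs = build_jobs("Job A", CONFIGS)
best_job, all_results = run_sweep(jobs, toy_loss)
```

A real `train_fn` would launch one training run per job against the shared dataset and return its validation loss.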

Page 14

Parallel Execution - Distributed Learning

A A A A

PaaS

Submit Queue

Job A

Job A

Job A

Job A

Data A-1

Data A-2

Data A-3

Data A-4
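Here one job is replicated four times and the data is split into Data A-1 through A-4, which is the data-parallel pattern. The sketch below illustrates synchronous data parallelism on a one-parameter model in plain Python; the model, the function names, and the synchronous averaging are assumptions, since the slide does not say how gradients are combined.

```python
def make_shards(data, num_workers):
    # Split the dataset into num_workers interleaved shards (Data A-1..A-N).
    return [data[i::num_workers] for i in range(num_workers)]


def local_gradient(w, shard_data):
    # Gradient of mean squared error for the model y = w * x on one shard.
    return sum(2 * (w * x - y) * x for x, y in shard_data) / len(shard_data)


def train_data_parallel(data, num_workers=4, lr=0.01, steps=200):
    shards = make_shards(data, num_workers)
    w = 0.0
    for _ in range(steps):
        # In a real cluster each worker would compute this on its own GPU.
        grads = [local_gradient(w, s) for s in shards]
        # Synchronous step: average the workers' gradients, then update.
        # With equal-sized shards this equals the full-batch gradient.
        w -= lr * sum(grads) / len(grads)
    return w


# Toy dataset generated from y = 3x, so training should recover w close to 3.
dataset = [(float(x), 3.0 * x) for x in range(1, 9)]
learned_w = train_data_parallel(dataset)
```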

Page 15

Parallel Execution - Distributed Learning

Super-Convergence: Shortest Training Time

Sync-Async Hybrid: Local Sync, Global Async

Communication Optimization: Compute-Communication Overlap, Gradient Compression

Dynamic Resources: w/ Dynamically Scheduled Resources

Optimizer: Optimizer for Distributed DL

Heterogeneous Infra: Heterogeneous GPU, Server, Network
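The slide lists gradient compression without naming a method. As an illustration only, the sketch below uses top-k sparsification, one common technique that is assumed here rather than taken from the source: each worker sends only the k largest-magnitude gradient entries as (index, value) pairs.

```python
def compress_topk(grad, k):
    # Keep the k entries with the largest absolute value, as (index, value).
    top = sorted(range(len(grad)), key=lambda i: abs(grad[i]), reverse=True)[:k]
    return sorted((i, grad[i]) for i in top)


def decompress(pairs, size):
    # Rebuild a dense gradient; entries that were not sent become zero.
    dense = [0.0] * size
    for i, value in pairs:
        dense[i] = value
    return dense


gradient = [0.1, -2.0, 0.03, 1.5, -0.2]
sent = compress_topk(gradient, k=2)
restored = decompress(sent, len(gradient))
```

Practical schemes usually also accumulate the dropped entries locally and add them to later gradients; that step is omitted here for brevity.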

Page 16

Storage & Caching

Flash Storage (NAS w/ GlusterFS)

[Diagram: network links labeled 56GbE and 10GbE]

Caching
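The slide pairs shared flash storage with a caching layer but does not describe the cache. The sketch below shows a generic read-through LRU cache as one possible reading. Its placement in front of the NAS, the `fetch` callback, and the eviction policy are all assumptions.

```python
from collections import OrderedDict


class ReadThroughCache:
    """Hypothetical cache sitting between a training job and shared storage."""

    def __init__(self, fetch, capacity):
        self.fetch = fetch          # function that reads from shared storage
        self.capacity = capacity    # maximum number of cached items
        self.store = OrderedDict()  # least recently used item comes first
        self.hits = 0
        self.misses = 0

    def read(self, key):
        if key in self.store:
            self.store.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return self.store[key]
        self.misses += 1
        value = self.fetch(key)          # slow path: go to shared storage
        self.store[key] = value
        if len(self.store) > self.capacity:
            self.store.popitem(last=False)  # evict the least recently used
        return value
```

Repeated epochs over the same dataset would then read from the cache after the first pass, as long as the working set fits within `capacity`.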

Page 17

Preemptive Scheduling

PaaS

Submit Queue, Waiting Queue

[Diagram: GPU slots and queues holding job blocks labeled Ap2, Bp3, and Cp1]

Job A: 4 hours

Job B: 6 hours
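The diagram shows a running job being moved to a waiting queue when another job arrives. The sketch below illustrates priority-based preemption under stated assumptions: the labels p1, p2, p3 are read as priorities with a smaller number meaning higher priority, and the GPU counts in the usage note are made up. None of this is confirmed by the slide.

```python
import heapq


class PreemptiveScheduler:
    """Hypothetical scheduler: a smaller priority number wins."""

    def __init__(self, total_gpus):
        self.free = total_gpus
        self.running = {}  # name -> (priority, gpus)
        self.waiting = []  # heap of (priority, seq, name, gpus)
        self._seq = 0      # tie-breaker so equal priorities stay FIFO

    def _enqueue(self, name, priority, gpus):
        heapq.heappush(self.waiting, (priority, self._seq, name, gpus))
        self._seq += 1

    def submit(self, name, priority, gpus):
        # Candidates for preemption: strictly lower-priority running jobs,
        # lowest priority (largest number) first.
        victims = sorted(
            (n for n, (p, _) in self.running.items() if p > priority),
            key=lambda n: self.running[n][0],
            reverse=True,
        )
        reclaimable = self.free + sum(self.running[n][1] for n in victims)
        if gpus > reclaimable:
            self._enqueue(name, priority, gpus)  # cannot fit even after preempting
            return
        for n in victims:
            if self.free >= gpus:
                break
            p, g = self.running.pop(n)  # preempt and park in the waiting queue
            self.free += g
            self._enqueue(n, p, g)
        self.free -= gpus
        self.running[name] = (priority, gpus)

    def finish(self, name):
        _, g = self.running.pop(name)
        self.free += g
        # Resume waiting jobs in priority order while the head of the queue fits.
        while self.waiting and self.waiting[0][3] <= self.free:
            p, _, n, g = heapq.heappop(self.waiting)
            self.free -= g
            self.running[n] = (p, g)
```

For example, on 6 GPUs with A (priority 2, 4 GPUs) and B (priority 3, 2 GPUs) running, submitting C (priority 1, 2 GPUs) parks B in the waiting queue; B resumes when C finishes.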

Page 18

Virtual Scale-Out

Company A Company B Company C

SKT

Page 19

Under The Hood

Image Registry

Image Build Server

ML Infra Manager

API server (gateway)

PaaS, IaaS, Admin Tool

Web Portal, CLI, Monitoring

Web Interface

Services

Container Orchestration

Cluster Management

Logging, Auth, Monitoring

Storage, Networking, GPU, CPU, Memory

Application Frameworks

Cluster Resource Manager

Task Workers

Framework, Framework, Framework

Agent #1, Agent #2, Agent #3, ..., Agent #N

Infrastructure

Mesos, Kubernetes

Page 20

IaaS usage example

User portal

Steps: resource allocation, development environment setup, machine learning execution, development environment commit

▪ GPU: V100 * 2
▪ CPU: 4 Core
▪ Mem: 16GB
▪ OS: Ubuntu 16.04
▪ Docker Image: TensorFlow 1.8, CUDA 9.0, Jupyter Lab

Resource Pool

Resource request

Resource allocation

▪ Ubuntu 16.04
▪ TensorFlow 1.8
▪ CUDA 9.0
▪ MATLAB R2018
▪ MATLAB Deep Learning Toolbox

▪ Ubuntu 16.04
▪ TensorFlow 1.8
▪ CUDA 9.0
▪ ~~~
▪ ~~~

[Installing MATLAB]

cd /usr/local/src
sudo tar xf matlab_linux_R2018b.tgz
cd MATHWORKS_R2018b
sudo ./install

[Installing Machine Learning Toolbox]

~~~~~~~~~

clear

% Open the webcam and load the pretrained AlexNet network
camera = webcam;
nnet = alexnet;

while true
    picture = camera.snapshot;
    % AlexNet expects 227x227 input images
    picture = imresize(picture, [227,227]);
    label = classify(nnet, picture);
    image(picture);
    title(char(label));
    drawnow;
end

Page 21

PaaS usage example

Steps: prepare code and data, run machine learning, check results

[Code]

[Data]

# TensorFlow 1.x API (the IaaS example above uses TensorFlow 1.8)
import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data

def train():
    # FLAGS is defined elsewhere in the full script; the slide shows only an excerpt
    mnist = input_data.read_data_sets(
        FLAGS.data_dir, fake_data=FLAGS.fake_data)
    sess = tf.InteractiveSession()
    with tf.name_scope('input'):
        x = tf.placeholder(tf.float32, [None, 784], name='x-input')
        y_ = tf.placeholder(tf.int64, [None], name='y-input')

step 0: acc(0.0796), loss(2.6134636)
step 10: acc(0.5102), loss(1.8237301)
step 20: acc(0.6811), loss(1.3650213)
step 30: acc(0.7567), loss(1.0394053)
step 40: acc(0.809), loss(0.8181167)
step 50: acc(0.8268), loss(0.6837858)
step 60: acc(0.8494), loss(0.5858479)
step 70: acc(0.8675), loss(0.52006394)
step 80: acc(0.8774), loss(0.4760332)
step 90: acc(0.8857), loss(0.44460815)
Adding run metadata for 99
………

Job Queue

GPU Resource

Page 22

University A: AI infrastructure deployment case

Legacy AI R&D Infra, AI Cloud Infra

Infra Resource (CPU/GPU)

A Lab

Infra Resource (CPU/GPU)

B Lab

A Lab

B Lab

C Lab

D Lab

IaaS

GPU

CPU

Memory

PaaS

Job Scheduler, Distributed DL

Page 23

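University B: AI infrastructure deployment case

Legacy AI Class Infra, AI Cloud Infra

Resource administrator

AI class students

Workstation w/ GPU

▪ OS installation

▪ Hostname / IP configuration

▪ ID creation, password setup

▪ CUDA driver installation

▪ ML/DL framework installation

▪ Student management

▪ Sharing host connection details (IP, ID)

Resource administrator

AI class students

Virtualized Cloud Infra

▪ SCALE solution operation/management

▪ Request resources

▪ Attend class

▪ Return resources

▪ Connect to server

▪ Attend class

▪ Server unreachable

▪ Resource misuse/abuse

▪ Malware installation

(coin-mining code, etc.)

IP / ID leakage

Professors / researchers

▪ Request resources

▪ Conduct research

▪ Return resources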

Page 24

Thank you

[email protected] / [email protected]