Software Labs / SK Telecom
Building a GPU Cloud for Efficient AI Infrastructure
[Diagram] SKT AI services (AI Speaker, HD Map, Self-driving, Navigation, Surveillance, Media Services, Network, Big Data) running on AI / High Performance Computing over the Network Datacenter System: the Machine Learning Infra.
SK Telecom AI Infra
Legacy AI Development Environment
AI domains: Speech Recognition, Image Recognition, Language Understanding, Knowledge Technology, Search Technology

[Diagram] Each user (developer) requests an AI development environment from a resource manager, who provisions it by hand: container creation/reclamation, Docker image management, and management of the infra resources (CPU/GPU). This manual loop repeats for every single user.

Stages: Legacy AI Development Environment → GPU Cloud Build → GPU Cloud Advancement
SK Telecom AI Infra
GPU Cloud Build
AI domains: Speech Recognition, Image Recognition, Language Understanding, Knowledge Technology, Search Technology

[Diagram] Users (developers) now request AI development environments from the GPU Cloud solution, which automates container creation/reclamation, Docker image management, and infra resource management over a shared Resource Pool.

Stages: Legacy AI Development Environment → GPU Cloud Build → GPU Cloud Advancement
SK Telecom AI Infra
GPU Cloud Advancement
AI domains: Speech Recognition, Image Recognition, Language Understanding, Knowledge Technology, Search Technology

[Diagram] Users submit training jobs to the GPU Cloud solution and get the training results back; the solution handles job scheduling, resource allocation, and distributed/parallel processing over the Resource Pool.

Stages: Legacy AI Development Environment → GPU Cloud Build → GPU Cloud Advancement
What is SCALE?
SCALE: SKT Cloud for AI Learning, a cloud solution for private GPU clusters.
Goals: maximize GPU infra efficiency, high-speed parallel/distributed training, an easy-to-use AI development environment, easy AI infra management.
Features:
▪ Static Allocation & Dynamic Scheduling (IaaS, PaaS)
▪ Docker Image Registry & Build Server
▪ Parallel Execution w/ Parameter Optimization
▪ Storage & Caching
▪ Preemptive Scheduling
▪ Virtual Scale-Out
Utilizing EXPENSIVE GPUs
IaaS
Static Allocation
Dynamic Scheduling
PaaS
IaaS + PaaS Example
Legend: GPU / Developer
[Diagram] Teams A, B, and C share one GPU cluster: part is handed out as IaaS (static Docker allocation, a fixed environment per team) and part as PaaS (dynamic job scheduling with parallel execution).
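The IaaS/PaaS split above can be sketched as a toy scheduler. All names here (GpuPool, submit, finish) are hypothetical illustrations, not the SCALE API:

```python
from collections import deque

class GpuPool:
    """Toy model of one cluster split between IaaS and PaaS."""
    def __init__(self, total_gpus):
        self.free = list(range(total_gpus))
        self.static = {}      # team -> GPUs pinned for IaaS containers
        self.queue = deque()  # PaaS job queue (FIFO)
        self.running = {}     # job -> GPUs currently assigned

    def allocate_static(self, team, n):
        # IaaS: GPUs stay pinned to a team until explicitly released.
        gpus, self.free = self.free[:n], self.free[n:]
        self.static[team] = gpus
        return gpus

    def submit(self, job, n):
        # PaaS: jobs wait in a queue and run when enough GPUs free up.
        self.queue.append((job, n))
        self._schedule()

    def finish(self, job):
        self.free += self.running.pop(job)
        self._schedule()

    def _schedule(self):
        while self.queue and len(self.free) >= self.queue[0][1]:
            job, n = self.queue.popleft()
            self.running[job], self.free = self.free[:n], self.free[n:]

pool = GpuPool(8)
pool.allocate_static("Team A", 2)   # IaaS: 2 GPUs pinned to Team A
pool.submit("Job B", 4)             # PaaS: runs immediately
pool.submit("Job C", 4)             # waits: only 2 GPUs are left
pool.finish("Job B")                # Job C now gets the freed GPUs
```

The point of the split is that statically allocated GPUs never return to the pool on their own, while PaaS jobs recycle GPUs automatically as they finish.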
Docker Image Registry & Build Server
Automated Docker Image Build Server
Off-the-shelf images, keyed by:
• Kernel Version
• CUDA Version
• Tensorflow Version
…
plus customized images.
[Diagram] The build server pushes both kinds of images to the Image Registry; developers pull them into their development environments (Image Registry → Pull → Development).
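A build server like the one above essentially templates Dockerfiles from a version matrix. A minimal sketch, assuming an `nvidia/cuda` base image; the exact tags and package pins are illustrative, not taken from the slides:

```python
def render_dockerfile(cuda="9.0", tf="1.8.0", extras=()):
    """Template a Dockerfile for an off-the-shelf image keyed by
    CUDA and TensorFlow versions (tags are illustrative)."""
    lines = [
        f"FROM nvidia/cuda:{cuda}-cudnn7-devel-ubuntu16.04",
        "RUN apt-get update && apt-get install -y python3-pip",
        f"RUN pip3 install tensorflow-gpu=={tf}",
    ]
    for pkg in extras:  # customized images layer user packages on top
        lines.append(f"RUN pip3 install {pkg}")
    return "\n".join(lines) + "\n"

print(render_dockerfile(extras=["jupyterlab"]))
```

Each rendered Dockerfile would then be built and pushed to the Image Registry under a tag encoding its version combination.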
Parallel Execution
PaaS
[Diagram] Jobs A, B, C, and D, each with its own dataset (Data A–D), are submitted to the queue and run in parallel on separate GPUs.
Parallel Execution w/ Hyperparameter Optimization
PaaS
[Diagram] The same Data A feeds four parallel jobs (Job A-1 … A-4), each with a different hyperparameter combination:
▪ Act. Function = ReLU, Learning Rate = 0.5, Momentum = 0.7
▪ Act. Function = ReLU, Learning Rate = 0.7, Momentum = 0.9
▪ Act. Function = Leaky ReLU, Learning Rate = 0.5, Momentum = 0.7
▪ Act. Function = Leaky ReLU, Learning Rate = 0.7, Momentum = 0.9
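The four sub-jobs above are just a grid over the listed hyperparameters. A sketch of expanding one submitted job into per-combination sub-jobs (names are hypothetical):

```python
from itertools import product

def expand_grid(base_job, grid):
    """Expand one job into one sub-job per hyperparameter combination."""
    keys = list(grid)
    jobs = []
    for i, values in enumerate(product(*grid.values()), start=1):
        jobs.append({"name": f"{base_job}-{i}", **dict(zip(keys, values))})
    return jobs

grid = {
    "activation": ["relu", "leaky_relu"],
    # learning rate and momentum move together on the slide,
    # so they are paired rather than fully crossed
    "lr_momentum": [(0.5, 0.7), (0.7, 0.9)],
}
jobs = expand_grid("Job A", grid)
# 2 activations x 2 (lr, momentum) pairs -> 4 jobs: Job A-1 .. Job A-4
```

The scheduler can then submit each dict as an independent job against the shared Data A.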
Parallel Execution - Distributed Learning
PaaS
[Diagram] A single Job A is submitted and replicated across four workers, each training on its own shard of the data (Data A-1 … A-4).
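Splitting Data A into Data A-1 … A-4 as above is plain data-parallel sharding. A minimal round-robin sketch (the real splitting strategy is not specified on the slide):

```python
def shard(dataset, num_workers):
    """Round-robin split of a dataset into one shard per worker,
    as in data-parallel training (each worker sees a disjoint slice)."""
    return [dataset[i::num_workers] for i in range(num_workers)]

data_a = list(range(100))     # stand-in for Data A
shards = shard(data_a, 4)     # Data A-1 .. Data A-4
# every sample lands in exactly one shard
assert sorted(sum(shards, [])) == data_a
```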
Parallel Execution - Distributed Learning
▪ Super-Convergence: shortest training time
▪ Sync-Async Hybrid: local sync, global async
▪ Communication Optimization: compute-communication overlap, gradient compression
▪ Dynamic Resources: works with dynamically scheduled resources
▪ Optimizer: an optimizer designed for distributed DL
▪ Heterogeneous Infra: heterogeneous GPUs, servers, and networks
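Gradient compression, as listed above, commonly means transmitting only the largest-magnitude gradient entries. A minimal top-k sparsification sketch in pure Python (one common technique; the slide does not say which scheme SCALE uses):

```python
def topk_compress(grad, k):
    """Keep only the k largest-magnitude entries of a gradient vector.
    Returns (indices, values); everything else is treated as zero.
    Real systems also accumulate the dropped residual locally."""
    idx = sorted(range(len(grad)), key=lambda i: abs(grad[i]), reverse=True)[:k]
    return idx, [grad[i] for i in idx]

def topk_decompress(idx, vals, length):
    """Rebuild a dense vector from the sparse (indices, values) message."""
    out = [0.0] * length
    for i, v in zip(idx, vals):
        out[i] = v
    return out

g = [0.01, -2.5, 0.3, 0.0, 1.7, -0.02]
idx, vals = topk_compress(g, 2)            # transmit 2 of 6 entries
restored = topk_decompress(idx, vals, len(g))
# only the two largest-magnitude entries (-2.5 and 1.7) survive
```

The bandwidth saving is what makes the compute-communication overlap on the slide effective: less to send per step means communication hides behind computation more easily.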
Storage & Caching
[Diagram] Flash storage (NAS w/ GlusterFS) serves training data over 56GbE and 10GbE links; a caching layer keeps hot data close to the compute nodes.
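The caching layer above behaves like a read-through cache: pull a file from shared storage once, then serve later epochs from fast local storage. A sketch under that assumption (function and path names are hypothetical):

```python
import os
import shutil

def cached_open(nas_path, cache_dir):
    """Read-through cache: copy a file from shared storage (NAS)
    to fast local storage on first access, reuse it afterwards."""
    os.makedirs(cache_dir, exist_ok=True)
    local = os.path.join(cache_dir, os.path.basename(nas_path))
    if not os.path.exists(local):      # cache miss: pull from NAS
        shutil.copy(nas_path, local)
    return open(local, "rb")           # cache hit on later epochs
```

Since training re-reads the same dataset every epoch, even this naive scheme shifts most traffic off the 10GbE NAS links after the first pass.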
Preemptive Scheduling
PaaS
[Diagram] Jobs carry priorities: Job A (priority 2, 4 hours), Job B (priority 3, 6 hours), Job C (priority 1). When the higher-priority Job C is submitted, the lower-priority Job B is preempted into the waiting queue and resumed once GPUs free up.
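The preemption on the slide can be sketched with two heaps. This is a toy model, and "lower number = higher priority" is an assumption read off the labels (Cp1 preempting Bp3):

```python
import heapq

class PreemptiveScheduler:
    """Toy preemptive scheduler. Assumption: lower number = higher
    priority, so Cp1 preempts Bp3 as on the slide."""
    def __init__(self, slots):
        self.slots = slots
        self.running = []   # heap of (-priority, job): root = least urgent
        self.waiting = []   # heap of (priority, job): root = most urgent

    def submit(self, job, prio):
        if len(self.running) < self.slots:
            heapq.heappush(self.running, (-prio, job))
            return
        least_urgent = -self.running[0][0]
        if prio < least_urgent:                  # preempt the least urgent job
            _, bumped = heapq.heapreplace(self.running, (-prio, job))
            heapq.heappush(self.waiting, (least_urgent, bumped))
        else:                                    # not urgent enough: wait
            heapq.heappush(self.waiting, (prio, job))

    def finish(self, job):
        self.running = [(p, j) for p, j in self.running if j != job]
        heapq.heapify(self.running)
        if self.waiting:                         # resume the most urgent waiter
            prio, nxt = heapq.heappop(self.waiting)
            heapq.heappush(self.running, (-prio, nxt))

sched = PreemptiveScheduler(slots=2)
sched.submit("Job A", 2)   # Ap2
sched.submit("Job B", 3)   # Bp3
sched.submit("Job C", 1)   # Cp1 arrives: Job B is bumped to waiting
```

A real scheduler would also checkpoint the preempted job so Job B resumes from where it stopped rather than restarting its 6-hour run.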
Virtual Scale-Out
[Diagram] SKT and Company A, B, C: the GPU cluster scales out virtually across organizations.
Under The Hood
Web Interface: Web Portal, CLI, Monitoring (PaaS / IaaS / Admin Tool)
Services: ML Infra Manager API server (gateway), Image Registry, Image Build Server, Logging, Auth, Monitoring
Container Orchestration / Cluster Management: Kubernetes and Mesos (a Cluster Resource Manager schedules framework task workers across Agents #1 … #N)
Infrastructure: CPU, GPU, Memory, Storage, Networking
IaaS Usage Example
From the user portal: resource allocation → development environment setup → machine learning execution → development environment commit.

Resource request / allocation from the Resource Pool, e.g.:
▪ GPU : V100 * 2
▪ CPU : 4 Core
▪ Mem : 16GB
▪ OS : Ubuntu 16.04
▪ Docker Image : Tensorflow 1.8, Cuda 9.0, Jupyter Lab

Environment setup on top of the base image (Ubuntu 16.04, Tensorflow 1.8, Cuda 9.0), e.g. adding Matlab R2018 and the Matlab Deep Learning Toolbox:

[Installing MatLab]
cd /usr/local/src
sudo tar xf matlab_linux_R2018b.tgz
cd MATHWORKS_R2018b
sudo ./install

[Installing Machine Learning Toolbox]
~~~~~~~~~

Machine learning execution, e.g. webcam classification with AlexNet in MATLAB:

clear
camera = webcam;
nnet = alexnet;
while true
    picture = camera.snapshot;
    picture = imresize(picture, [227, 227]);
    label = classify(nnet, picture);
    image(picture);
    title(char(label));
    drawnow;
end
PaaS Usage Example
Prepare code and data → run machine learning → check the results.

[Code]
def train():
  mnist = input_data.read_data_sets(
      FLAGS.data_dir, fake_data=FLAGS.fake_data)
  sess = tf.InteractiveSession()
  with tf.name_scope('input'):
    x = tf.placeholder(tf.float32, [None, 784], name='x-input')
    y_ = tf.placeholder(tf.int64, [None], name='y-input')

[Data]

[Output]
step 0: acc(0.0796), loss(2.6134636)
step 10: acc(0.5102), loss(1.8237301)
step 20: acc(0.6811), loss(1.3650213)
step 30: acc(0.7567), loss(1.0394053)
step 40: acc(0.809), loss(0.8181167)
step 50: acc(0.8268), loss(0.6837858)
step 60: acc(0.8494), loss(0.5858479)
step 70: acc(0.8675), loss(0.52006394)
step 80: acc(0.8774), loss(0.4760332)
step 90: acc(0.8857), loss(0.44460815)
Adding run metadata for 99
…

[Diagram] The job is submitted to the Job Queue and scheduled onto the GPU Resource pool.
University A AI Infra Build Case
Legacy AI R&D Infra: each lab (A Lab, B Lab, …) owns its own infra resources (CPU/GPU).
AI Cloud Infra: Labs A, B, C, and D share one pool through IaaS (GPU, CPU, Memory) and PaaS (Job Scheduler, Distributed DL).
University B AI Infra Build Case

Legacy AI Class Infra:
▪ Resource manager: installs the OS, sets hostname/IP, creates IDs and passwords, installs CUDA drivers and ML/DL frameworks, manages students, announces the access hosts (IP, ID)
▪ AI class students: connect to workstations w/ GPU and take the class
▪ Problems: servers become unreachable, resources are misused or abused, malware gets installed (e.g. coin-mining code), IPs/IDs leak

AI Cloud Infra (virtualized):
▪ Resource manager: operates and manages the SCALE solution
▪ AI class students: request resources, take the class, return the resources
▪ Professors/researchers: request resources, carry out research, return the resources