Serverless GPU Cloud with Job Scheduler and Container
Andrew Yongjoon Kong
Cloud Computing Cell, Kakao
[email protected]
Who am I
Andrew Yongjoon Kong
• Cloud Technical Advisory for Government Broadcast Agency
• Adjunct Professor, Ajou University
• Korea Database Agency, acting professor for big data
• Member of the National Information Agency big data advisory committee
• Kakao → Daum Kakao → Kakao Corp, Cloud Computing Cell lead
• Talks:
  • Embrace Clouds (2017, OpenStack Days, Korea)
  • Full route-based network with Linux (2016, Netdev, Tokyo)
  • SDN without SDN (2015, OpenStack Summit, Vancouver)
Supervised the Korean editions; 2nd editions are coming…
Serverless computing is rising
Serverless computing, GPU
Serverless computing, GPU, Docker
Serverless framework
Lots of serverless frameworks:
• Apache OpenWhisk
• Iron.io
• OpenStack's Picasso
• Gestalt (based on DC/OS)
• Fission (based on Kubernetes)
• Runway (Kakao's private FaaS)
What is these frameworks' purpose?
• Connecting, mostly
• Flow and automation
Connection is a very good virtue in the public cloud:
• There is no resource depletion in a public cloud.
• Connection/automation is directly related to cost savings.
In a private cloud, however, there are screams for resources (especially GPU) from the engineers.
• The thing is that "winner takes it all"
• → so care for scheduling is needed.
Job scheduler
Scheduling users' jobs based on an algorithm:
• FIFO
• Fair Share
• Backfill
• Preemption
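As a rough sketch of how two of these orderings differ, the snippet below dispatches the same queue first in submission order (FIFO) and then by priority (fair-share style). The queue file, job names, and priorities are all invented for illustration; a real scheduler computes priorities dynamically.

```shell
#!/bin/bash
# Hypothetical queue file, one job per line: <submit_order> <priority> <job_name>.
cat > /tmp/queue.txt <<'EOF'
1 10 train-resnet
2 50 preprocess-logs
3 30 batch-inference
EOF

# FIFO: dispatch strictly in submission order.
echo "FIFO order:"
sort -k1,1n /tmp/queue.txt | awk '{print $3}'

# Fair-share/priority style: dispatch by priority, highest first.
echo "Priority order:"
sort -k2,2nr /tmp/queue.txt | awk '{print $3}'
```

Backfill and preemption build on the same idea: backfill lets a small low-priority job slip in while a big job waits for resources, and preemption suspends a running job when a higher-priority one arrives.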
Job
A job comprises two parts:
• The resources
  • CPU, compute nodes, memory, disk, and even walltime
  • The job scheduling system manages the quota per queue, user, and user group
• The runnable execution
  • Traditionally, the executable command
  • e.g. saved_model_cli run --dir /tmp/saved_model_dir --tag_set serve
Job sample
Sample Job script
The traditional issue is how we distribute the command and the data (you can't specify a node in a batch system).
#!/bin/bash
#PBS -l nodes=1:ppn=2
#PBS -l walltime=00:00:59
cd /home/rcf-proj3/pv/test/
mkdir /test/test/dir
source /usr/usc/sas/default/setup.sh
sas my.sas
The #PBS lines above are the resource part; the commands that follow are the execution part.
Job scheduler system layout
A shared filesystem can handle the file-locating issue.
→ But a shared filesystem is too expensive.
→ In a modern environment it is much easier with a container.
http://beagle.ci.uchicago.edu/using-beagle/
This could be changed by a container and a registry.
Job scheduler system, GPU and Container
Add a GPU resource to the job script and use NVIDIA Docker for the command… then the scheduler will do the job.
#!/bin/bash
#PBS -l nodes=1:ppn=2
#PBS -l walltime=00:00:59
#PBS -l gpus=8
NV_GPU=$NV_GPU nvidia-docker run --net host -e PASSWORD=root -e USERNAME=root -e PORT=$PORT idock.daumkakao.io/dkos/nvidia-cuda-sshd:dev
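A minimal sketch of how a node-side prologue might turn the `#PBS -l gpus=N` request into the `NV_GPU` device list that nvidia-docker consumes. The file path is invented, and a real scheduler would pick the specific free device IDs rather than counting from zero.

```shell
#!/bin/bash
# Hypothetical prologue: read the requested GPU count from the job script
# and build an NV_GPU device list (a real scheduler tracks free devices).
cat > /tmp/gpu_job.sh <<'EOF'
#!/bin/bash
#PBS -l nodes=1:ppn=2
#PBS -l gpus=8
EOF

ngpus=$(grep -oE 'gpus=[0-9]+' /tmp/gpu_job.sh | cut -d= -f2)
NV_GPU=$(seq -s, 0 $((ngpus - 1)))
echo "NV_GPU=$NV_GPU"
```

With `gpus=8` this yields `NV_GPU=0,1,2,3,4,5,6,7`, which the `nvidia-docker run` line in the job script then picks up from its environment.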
[Diagram: a master (scheduler) dispatches jobs to GPU compute nodes, which pull images from the Docker registry]
AI Development Cycle over compute resource
[Diagram: GPU compute nodes across the cycle]
• Develop the model on a personal environment
• Train the model at large scale with massive data
• Run inference through the model
Abstract these into "job(resource, executor)". The output is abstracted into a container.
[Diagram: JOB abstraction over the master (scheduler) and Docker registry, feeding an AI service]
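A hedged sketch of what "the output is abstracted into a container" can look like in practice: wrapping a trained model directory in an image for the registry. The base image, directory layout, model name, and image tag below are invented for illustration (the registry host reuses the one shown in the earlier job script).

```shell
#!/bin/bash
# Hypothetical packaging step: put a trained model directory into an image.
# All names below (directory, model name, image tag) are invented.
mkdir -p /tmp/model-image/saved_model_dir
cd /tmp/model-image

cat > Dockerfile <<'EOF'
FROM tensorflow/serving
COPY saved_model_dir /models/mymodel/1
ENV MODEL_NAME=mymodel
EOF

# In a real pipeline the next steps would build and push the image:
#   docker build -t idock.daumkakao.io/team/mymodel:v1 .
#   docker push idock.daumkakao.io/team/mymodel:v1
```

Once the model is an image in the registry, any compute node can pull and serve it, with no shared filesystem involved.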
By the way, you also need GPUs and other IT resources to show your effort to the public.
And what about monitoring & alerting?
The good thing is that if you make your effort with containers → the kakao cloud can help you.
kakao cloud
[Diagram: the management plane (service repo, service catalog, notification, scheduling) sits over the data center control/data plane: IaaS: KRANE; Centralized Measuring System: KEMI; Centralized Deploying System: DKOS. Event/alert, initial setup, and change flows connect IT operations and IT services]
Some Numbers about kakao cloud
                              2016.8     2017.9
Projects                      1,563      2,xxx
Pull requests (since 2014.9)  632        913
VMs created/deleted per day   ~88        ~100
Total VMs                     8,703      17,xxx
Active cores                  -          9x,xxx
Kakao Corp: some information about the kakao cloud
2016.8: upgraded 5 times, from Grizzly to Kilo; total 4 regions; additional services: Heat/Trove/Sahara
2017.10: upgraded 7 times, from Grizzly to Mitaka; total 4 regions; additional services: Heat/Trove/Octavia/Barbican
KEMI, the kakao event monitoring/alert platform
[Diagram: KEMI stats and KEMI log monitor physical servers, virtual instances, containers, and others (switches, logs). A rule engine, ETL, and notifications sit on top. IMS (the kakao CMDB API) provides a data center information abstraction layer, whose API supports predicting, scheduling, and control of data center (or service) management activity through OpenStack Heat and other service APIs]
Deployment abstraction in Kakao, DKOS
[Diagram: the user defines resources through the service catalogue; the Centralized Deploying System (DKOS), with its resource pool, queue, scheduler, and manager, deploys containers onto VMs and PMs in the data center]
DKOS Architecture
Services over DKOS
DKOS Situation
• Active clusters: 3 digits
• Total compute nodes: 4 digits (VM + PM)
• Container count: 5 digits
• Managed by?
DKOS Situation
• Why use DKOS (containers)?
  • Containers are easy
  • Containers are cool
  • DC/OS is great
• Nope! It is the summit of an integrated/automated infra service API:
  • Authentication, authorization, compute resources, network, volumes
  • Metering, logging
  • Monitoring, notifications
The kakao cloud now supports GPU as well
Thanks
Where are you from a CMMI-Cloud perspective?
For CMM4, it is time to embrace clouds, not a cloud.
Stage   Usage                        Output
CMM0    Legacy                       cloud TF
CMM1    Self-service dev resources   KRANE (OpenStack cloud)
CMM2    Limited prod resources       KEMI (MaaS)
CMM3    Automated cloud usage        DKOS (CaaS)
CMM4    Manual cloud usage           --
CMM5    Federated cloud usage        --