Serverless GPU Cloud with Job Scheduler and Container
Andrew Yongjoon Kong
Cloud Computing Cell, Kakao
[email protected]
Who am I
Andrew Yongjoon Kong
• Cloud Technical Advisory for Government Broadcast Agency
• Adjunct Professor, Ajou University
• Korea Database Agency, acting professor for big data
• Member of the National Information Agency big data advisory committee
• Kakao → Daum Kakao → Kakao Corp, Cloud Computing Cell lead
• Talks:
  • Embrace Clouds (2017, OpenStack Days, Korea)
  • Full route-based network with Linux (2016, Netdev, Tokyo)
  • SDN without SDN (2015, OpenStack Summit, Vancouver)
Supervised the Korean editions; 2nd editions are coming…
Serverless computing is rising
Serverless computing, GPU
Serverless computing, GPU, Docker
Serverless framework
Lots of serverless frameworks:
• Apache OpenWhisk
• Iron.io
• OpenStack's Picasso
• Gestalt (based on DC/OS)
• Fission (based on Kubernetes)
• Runway (Kakao's private FaaS)
What is these frameworks' purpose?
• Connecting, mostly
• Flow and automation
Connection is a very good virtue in the public cloud:
• There is no resource depletion in a public cloud.
• Connection/automation is directly related to cost savings.
In a private cloud, however, there are screams for resources (especially GPU) from the engineers.
• The thing is that "winner takes it all"
• → so care for scheduling is needed.
Job scheduler
Scheduling users' jobs based on an algorithm:
• FIFO
• Fair Share
• Backfill
• Preemption
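As a rough sketch of how two of these orderings differ, the snippet below dispatches the same queue first in submission order (FIFO) and then by priority (fair-share style). The queue file, job names, and priorities are all invented for illustration; a real scheduler computes priorities dynamically.

```shell
#!/bin/bash
# Hypothetical queue file, one job per line: <submit_order> <priority> <job_name>.
cat > /tmp/queue.txt <<'EOF'
1 10 train-resnet
2 50 preprocess-logs
3 30 batch-inference
EOF

# FIFO: dispatch strictly in submission order.
echo "FIFO order:"
sort -k1,1n /tmp/queue.txt | awk '{print $3}'

# Fair-share/priority style: dispatch by priority, highest first.
echo "Priority order:"
sort -k2,2nr /tmp/queue.txt | awk '{print $3}'
```

Backfill and preemption build on the same idea: backfill lets a small low-priority job slip in while a big job waits for resources, and preemption suspends a running job when a higher-priority one arrives.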
Job
A job comprises two parts:
• The resources
  • CPU, compute nodes, memory, disk, and even walltime
  • The job scheduling system manages the quota per queue, user, and user group
• The runnable execution
  • Traditionally, the executable command
  • e.g. saved_model_cli run --dir /tmp/saved_model_dir --tag_set serve
Job sample
Sample Job script
The traditional issue is how we distribute the command and the data (you can't specify a node in a batch system).
#!/bin/bash
#PBS -l nodes=1:ppn=2
#PBS -l walltime=00:00:59
cd /home/rcf-proj3/pv/test/
mkdir /test/test/dir
source /usr/usc/sas/default/setup.sh
sas my.sas
The #PBS lines above are the resource part; the commands that follow are the execution part.
Job scheduler system layout
A shared filesystem can handle the file-locating issue.
→ But a shared filesystem is too expensive.
→ In a modern environment it is much easier with a container.
http://beagle.ci.uchicago.edu/using-beagle/
This could be changed by a container and a registry.
Job scheduler system, GPU and Container
Add a GPU resource to the job script and use NVIDIA Docker for the command… then the scheduler will do the job.
#!/bin/bash
#PBS -l nodes=1:ppn=2
#PBS -l walltime=00:00:59
#PBS -l gpus=8
NV_GPU=$NV_GPU nvidia-docker run --net host -e PASSWORD=root -e USERNAME=root -e PORT=$PORT idock.daumkakao.io/dkos/nvidia-cuda-sshd:dev
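A minimal sketch of how a node-side prologue might turn the `#PBS -l gpus=N` request into the `NV_GPU` device list that nvidia-docker consumes. The file path is invented, and a real scheduler would pick the specific free device IDs rather than counting from zero.

```shell
#!/bin/bash
# Hypothetical prologue: read the requested GPU count from the job script
# and build an NV_GPU device list (a real scheduler tracks free devices).
cat > /tmp/gpu_job.sh <<'EOF'
#!/bin/bash
#PBS -l nodes=1:ppn=2
#PBS -l gpus=8
EOF

ngpus=$(grep -oE 'gpus=[0-9]+' /tmp/gpu_job.sh | cut -d= -f2)
NV_GPU=$(seq -s, 0 $((ngpus - 1)))
echo "NV_GPU=$NV_GPU"
```

With `gpus=8` this yields `NV_GPU=0,1,2,3,4,5,6,7`, which the `nvidia-docker run` line in the job script then picks up from its environment.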
[Diagram: a master (scheduler) dispatches jobs to GPU compute nodes, which pull images from the Docker registry]
AI Development Cycle over compute resource
[Diagram: GPU compute nodes across the cycle]
• Develop the model on a personal environment
• Train the model at large scale with massive data
• Run inference through the model
Abstract these into "job(resource, executor)". The output is abstracted into a container.
[Diagram: JOB abstraction over the master (scheduler) and Docker registry, feeding an AI service]
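A hedged sketch of what "the output is abstracted into a container" can look like in practice: wrapping a trained model directory in an image for the registry. The base image, directory layout, model name, and image tag below are invented for illustration (the registry host reuses the one shown in the earlier job script).

```shell
#!/bin/bash
# Hypothetical packaging step: put a trained model directory into an image.
# All names below (directory, model name, image tag) are invented.
mkdir -p /tmp/model-image/saved_model_dir
cd /tmp/model-image

cat > Dockerfile <<'EOF'
FROM tensorflow/serving
COPY saved_model_dir /models/mymodel/1
ENV MODEL_NAME=mymodel
EOF

# In a real pipeline the next steps would build and push the image:
#   docker build -t idock.daumkakao.io/team/mymodel:v1 .
#   docker push idock.daumkakao.io/team/mymodel:v1
```

Once the model is an image in the registry, any compute node can pull and serve it, with no shared filesystem involved.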
By the way, you also need GPUs and other IT resources to show your effort to the public.
And what about monitoring & alerting?
The good thing is that if you make your effort with containers → the kakao cloud can help you.
kakao cloud
[Diagram: the management plane (service repo, service catalog, notification, scheduling) sits over the data center control/data plane: IaaS: KRANE; Centralized Measuring System: KEMI; Centralized Deploying System: DKOS. Event/alert, initial setup, and change flows connect IT operations and IT services]
Some Numbers about kakao cloud
                              2016.8     2017.9
Projects                      1,563      2,xxx
Pull requests (since 2014.9)  632        913
VMs created/deleted per day   ~88        ~100
Total VMs                     8,703      17,xxx
Active cores                  -          9x,xxx
Kakao Corp: some information about the kakao cloud
2016.8: upgraded 5 times, from Grizzly to Kilo; total 4 regions; additional services: Heat/Trove/Sahara
2017.10: upgraded 7 times, from Grizzly to Mitaka; total 4 regions; additional services: Heat/Trove/Octavia/Barbican
KEMI, the kakao event monitoring/alert platform
[Diagram: KEMI stats and KEMI log monitor physical servers, virtual instances, containers, and others (switches, logs). A rule engine, ETL, and notifications sit on top. IMS (the kakao CMDB API) provides a data center information abstraction layer, whose API supports predicting, scheduling, and control of data center (or service) management activity through OpenStack Heat and other service APIs]
Deployment abstraction in Kakao, DKOS
[Diagram: the user defines resources through the service catalogue; the Centralized Deploying System (DKOS), with its resource pool, queue, scheduler, and manager, deploys containers onto VMs and PMs in the data center]
DKOS Architecture
Services over DKOS
DKOS Situation
• Active clusters: 3 digits
• Total compute nodes: 4 digits (VM + PM)
• Container count: 5 digits
• Managed by?
DKOS Situation
• Why use DKOS (containers)?
  • Containers are easy
  • Containers are cool
  • DC/OS is great
• Nope! It is the summit of an integrated/automated infra service API:
  • Authentication, authorization, compute resources, network, volumes
  • Metering, logging
  • Monitoring, notifications
The kakao cloud now supports GPU as well
Thanks
Where are you from a CMMI-Cloud perspective?
For CMM4, it is time to embrace clouds, not a cloud.
Stage   Usage                        Output
CMM0    Legacy                       cloud TF
CMM1    Self-service dev resources   KRANE (OpenStack cloud)
CMM2    Limited prod resources       KEMI (MaaS)
CMM3    Automated cloud usage        DKOS (CaaS)
CMM4    Manual cloud usage           --
CMM5    Federated cloud usage        --