Container Orchestration and Management Systems Comparison from Technical View Harry Zhang, Member of #CNCF
The Scope of This Talk
• Kubernetes
  • by Cloud Native Computing Foundation
• Docker 1.12+
  • by Docker Inc.
  • Compose + Swarm is kind of legacy, so they will not be included in this talk
• Mesos
  • by Apache Software Foundation
  • only with Marathon; DC/OS is not included (the scope of the latter is larger)
Chapter 1: Core Idea and Architecture
Kubernetes
• Build the right things with containers by following concepts and conventions
  • like a “Spring Framework” in the container ecosystem
• Design
  • master
    • api-server, scheduler, controller-manager
  • node
    • kubelet, kube-proxy
  • independent binaries
    • Pros: modular, transparent, manageable
    • Cons: a little bit complex to set up (1.4 is much better now)
  • network & volume plugins
  • driven by control loops
[Diagram, animated: Kubernetes components api-server, etcd, scheduler, and per-node kubelet (SyncLoop) and proxy]
1 Pod created
2 Pod object added
3.1 New pod object detected
3.2 Bind pod with node
4.1 Detected pod bind with me
4.2 Start containers in pod
[Diagram: controller-manager (ControlLoop) added next to the kubelet SyncLoops, with a handler]
Objects: pod, replica, namespace, service, endpoint, job, deployment, volume, petset …
Reconcile: desired world vs. real world
Tips: Control Theory*
• It’s the basic model for:
  • Kubernetes controller and all other event loops
  • SwarmKit orchestrator
  • …
*Andrei, Neculai (2005). "Modern Control Theory – A historical Perspective"
[Diagram: ControlLoop]
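The control-loop model (compare the desired world with the real world, then act on the difference) can be sketched in a few lines. This is only an illustration of the idea, not code from Kubernetes or SwarmKit; the function names and the dict-of-replica-counts state are assumptions made for the example.

```python
def reconcile(desired, actual):
    """Compare desired replica counts with observed ones.

    Returns a list of (verb, name, count) corrective actions.
    Both arguments are hypothetical dicts mapping a workload name
    to a replica count.
    """
    actions = []
    for name in sorted(set(desired) | set(actual)):
        want = desired.get(name, 0)
        have = actual.get(name, 0)
        if want > have:
            actions.append(("start", name, want - have))
        elif want < have:
            actions.append(("stop", name, have - want))
    return actions


def control_loop(desired, actual, max_rounds=10):
    """Apply corrective actions until the observed state matches the desired one."""
    for _ in range(max_rounds):
        actions = reconcile(desired, actual)
        if not actions:
            break  # real world == desired world, nothing to do
        for verb, name, count in actions:
            delta = count if verb == "start" else -count
            actual[name] = actual.get(name, 0) + delta
    return actual


if __name__ == "__main__":
    observed = control_loop({"web": 2}, {"web": 1, "old": 1})
    print(observed)
```

In a real system the "apply" step is asynchronous and can fail, which is why the loop keeps re-observing instead of assuming its actions succeeded.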
Docker 1.12+
• Built-in cluster support for Docker containers
  • powered by SwarmKit
• SwarmKit Design
  • built-in data store
  • manager
    • several components built into one binary
    • control loop driven
  • worker
    • uses a pull model to connect with the manager
WARNING: SwarmKit is still an early-stage project; expect this part to change
SwarmKit Manager: API and Store
• API: accept commands from client
  • $ docker service create
• Create object in raft based memory store
  • github.com/coreos/etcd/raft for consensus
  • github.com/hashicorp/go-memdb for in-memory object storage
  • state, cluster, node, service, task, network …
[Diagram: SwarmKit Manager components: API, Store, Orchestrator, Allocator, Scheduler, Dispatcher]
SwarmKit Manager: Orchestrator
• Create Tasks from Service object
  • Task: “start a container” etc.
• Reconcile loop for Service objects
  • Control Theory again
[Diagram: SwarmKit Manager with the Orchestrator highlighted; a Service (replica=2) with two Tasks; label: check if replica=2 or not]
SwarmKit Manager: Allocator
• Allocates IP addresses to Services and Tasks
  • (and will allocate volumes in the future)
• VIP and ports for Service
• IP for all endpoints (veth pairs) in the network the task is attached to
[Diagram: SwarmKit Manager with the Allocator highlighted; label: Network Create]
SwarmKit Manager: Scheduler
• Assign Task to Node
  • unassignedTasks
  • nodeHeap
  • search the heap to find the best node: one that meets the constraints and has the lightest workload
  • ReadyFilter, ResourceFilter, ConstraintFilter
[Diagram: SwarmKit Manager with the Scheduler highlighted]
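The scheduling step described above (run each node through a pipeline of filters, then take the least-loaded survivor from a heap) can be sketched as follows. This is not SwarmKit's code, which is written in Go; the `Node` and `Task` fields and the use of running-task count as the "workload" measure are assumptions made for the example.

```python
import heapq
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    ready: bool
    free_cpus: float
    labels: dict = field(default_factory=dict)
    running_tasks: int = 0


@dataclass
class Task:
    name: str
    cpus: float
    constraints: dict = field(default_factory=dict)


def ready_filter(node, task):
    return node.ready


def resource_filter(node, task):
    return node.free_cpus >= task.cpus


def constraint_filter(node, task):
    # Every constraint must match a node label exactly (a simplification).
    return all(node.labels.get(k) == v for k, v in task.constraints.items())


PIPELINE = [ready_filter, resource_filter, constraint_filter]


def schedule(unassigned_tasks, nodes):
    """Assign each task to the least-loaded node that passes all filters."""
    assignments = {}
    for task in unassigned_tasks:
        candidates = [n for n in nodes if all(f(n, task) for f in PIPELINE)]
        if not candidates:
            assignments[task.name] = None  # stays unassigned
            continue
        # Heap keyed on workload; the node name breaks ties deterministically.
        heap = [(n.running_tasks, n.name, n) for n in candidates]
        heapq.heapify(heap)
        _, _, best = heap[0]
        best.running_tasks += 1
        best.free_cpus -= task.cpus
        assignments[task.name] = best.name
    return assignments
```

Rebuilding the heap per task keeps the sketch short; a long-lived heap that is updated in place would avoid the repeated work.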
SwarmKit Manager: Dispatcher
• Nodes (agents) management
• Dispatch assigned Task to corresponding Node
[Diagram: SwarmKit Manager with the Dispatcher highlighted; the Dispatcher is connected to three Agents, each over a grpc stream, and a Task is sent to an Agent]
• Worker:
  • connects to the Dispatcher to check assigned tasks
  • executor: executes tasks (containers) on this Node
[Diagram: three Agents, each containing a Worker and an Executor; an Adapter connects to the Docker Daemon through docker.sock]
Mesos 1.0
• A distributed systems kernel
  • originally designed to run big data jobs
  • core idea: fine-grained resource sharing
• Mesos Design
  • Master + Slave + ZooKeeper
  • two level scheduling
    • scheduler + executor = framework
    • need to use frameworks like Marathon for orchestration and management
  • containerizer
    • multiple container runtime & image support (>=1.0)
[Diagram, three frames: an MPI job and a Hadoop job with their framework schedulers (MPI scheduler, Hadoop scheduler); a Mesos master with its allocation module; Mesos slaves running MPI and Hadoop executors with their tasks]
• Allocation module: pick framework to offer resources to
• Resource offer = list of (node, availableResources)
  • E.g. { (node1, <2 CPUs, 4 GB>), (node2, <3 CPUs, 2 GB>) }
• Framework scheduler: framework-specific scheduling
• Mesos slave: launches and isolates executors
*Animate: Operating Systems and Systems Programming Lecture 24 Anthony D. Joseph https://cs162.eecs.berkeley.edu/
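Two level scheduling can be illustrated with the resource offer from the slide. In this sketch the master only decides what to offer, and the framework decides which offered resources to use. It is not the Mesos API; the function names, the dict layout, and the first-fit policy inside the framework are assumptions made for the example.

```python
def make_offer(free):
    """Level 1 (master): offer each node's free resources to one framework.

    `free` maps node name -> {"cpus": ..., "mem_gb": ...}.
    Returns the offer as a list of (node, availableResources) pairs.
    """
    return [(node, dict(res)) for node, res in free.items()]


def framework_schedule(offer, pending_tasks):
    """Level 2 (framework scheduler): first-fit its own tasks onto the offer.

    Returns a list of (task_name, node) launches; tasks that do not fit
    are simply left pending for a later offer.
    """
    remaining = [(node, dict(res)) for node, res in offer]
    launches = []
    for task in pending_tasks:
        for node, res in remaining:
            if res["cpus"] >= task["cpus"] and res["mem_gb"] >= task["mem_gb"]:
                res["cpus"] -= task["cpus"]
                res["mem_gb"] -= task["mem_gb"]
                launches.append((task["name"], node))
                break
    return launches


def master_apply(free, launches, tasks_by_name):
    """Master side: account for the resources used by accepted launches."""
    for name, node in launches:
        task = tasks_by_name[name]
        free[node]["cpus"] -= task["cpus"]
        free[node]["mem_gb"] -= task["mem_gb"]
    return free


if __name__ == "__main__":
    free = {"node1": {"cpus": 2, "mem_gb": 4}, "node2": {"cpus": 3, "mem_gb": 2}}
    tasks = [
        {"name": "t1", "cpus": 1, "mem_gb": 3},
        {"name": "t2", "cpus": 2, "mem_gb": 2},
    ]
    launches = framework_schedule(make_offer(free), tasks)
    print(launches)
    print(master_apply(free, launches, {t["name"]: t for t in tasks}))
```

The point of the split is that the master never needs to understand a framework's placement logic; it only tracks which resources are free.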
How Does Docker Plug into Mesos?
• Before 1.0
  • Docker Containerizer
    • Docker image -> task -> mesos-docker-executor -> Docker Daemon
• Mesos 1.0
  • Supporting multiple runtimes & images
  • MesosContainerizer
    • “Mesos native container stack”
    • Isolators
    • Launcher
[Diagram: Mesos slave running a Hadoop executor with its task, next to a mesos-docker-executor]
Checkpoint
 | Kubernetes | Docker SwarmKit | Mesos+Marathon
Design | control loops driven | control loops driven (but in single binary) | two level scheduling
Coordination | etcd | built-in raft | ZooKeeper
Container Runtime | multiple | single, but has potential for more OCI runtimes | multiple
Container Image | Docker Image, ACI, more in future | Docker Image | Docker Image, ACI, more in future
Docker Daemon | no need | need | no need
About Built-In Data Store
Pros | Cons
easy to set up | hard to understand & debug
fewer round trips | hard to do backup/restore, migration, monitoring/audit
easy to do performance tuning | lack of mgmt API like: etcd admin guide
Chapter 2: Control Panel
Control Panel: Orchestration + Management
• “Defines when and what to do next throughout the automated workflow”
  • workload management
  • secret management
  • configuration management
  • scale and autoscaling
  • stateful workload
  • … and more
Workload Management
e.g. “a web server with 2 replicas”
 | Kubernetes | Docker SwarmKit | Mesos+Marathon
Description | Deployment | Service | Application
Version Control | yes (revision) | not yet | yes (deployments)
• Kubernetes “Deployment”
  • $ kubectl create -f <deployment-yaml>
  • $ kubectl edit <deployment>
    • this will open and edit the object stored in etcd
    • an update will trigger a rolling update
  • $ kubectl set image <deployment>
  • $ kubectl scale --replicas=5 <deployment> …
  • $ kubectl rollout history <deployment>
  • $ kubectl rollout undo <deployment> --to-revision=<version>
[Screenshot: $ kubectl edit <deployment> …]
• Docker SwarmKit “Service”
  • $ docker service create SERVICE --replicas=5 …
  • $ docker service scale SERVICE=REPLICAS
  • $ docker service update [OPTIONS] SERVICE
    • rolling update
    • 30+ update options are supported
      • --container-label-add value
      • --container-label-rm value
      • --env-add value
      • --env-rm value
      • --image string
      • …
• Mesos + Marathon “Application”
  • $ dcos marathon app start [--force] <app-id> [<instances>]
  • $ dcos marathon app update [--force] <app-id> [<properties>…]
    • rolling update
    • app dependencies are respected
  • $ dcos marathon app version list [--max-count=<max-count>] <app-id> …
  • $ dcos marathon deployment list [--json <app-id>]
  • $ dcos marathon deployment rollback <deployment-id>
Secret Management
• Kubernetes
  • Secret volume
    • stored in etcd (base64-encoded; not encrypted at rest by default)
    • consumed by ENV or volume
• Docker SwarmKit
  • under discussion: https://github.com/docker/swarmkit/issues/1329
• Mesos + Marathon
  • only in DC/OS
    • stored in ZooKeeper, exposed as ENV in Marathon
• Another similar feature is Configuration Management
Configuration Management
• Kubernetes
  • ConfigMap
    • stored in etcd, consumed by ENV or volume
    • $ kubectl create configmap example-redis-config --from-file=docs/redis-config
• Docker SwarmKit
  • under discussion: https://github.com/docker/swarmkit/issues/1329
• Mesos + Marathon
  • not yet
Autoscaling
• Kubernetes
  • HorizontalPodAutoscaler
    • default: CPU
    • Custom Metrics:
      • user-defined endpoint, e.g. http://localhost:9100/metrics
      • shares the same metric data structure with CNCF projects like Prometheus
• Docker SwarmKit
  • not yet: https://github.com/docker/swarmkit/issues/486#issuecomment-219133613
• Mesos + Marathon
  • a standalone script, `marathon-autoscale.py`
    • autoscales applications based on the utilization metrics from Mesos
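As a rough illustration of how a CPU-based horizontal autoscaler picks a replica count: the usual rule is to scale the current count by the ratio of observed to target utilization and round up. The sketch below follows that rule. The clamping bounds and the omission of any tolerance band are simplifying assumptions, and the function name is hypothetical.

```python
import math


def desired_replicas(current_replicas, current_utilization, target_utilization,
                     min_replicas=1, max_replicas=10):
    """Return the replica count that would bring utilization back to target.

    Utilizations are percentages of the requested CPU, averaged over pods.
    The result is clamped to [min_replicas, max_replicas].
    """
    raw = math.ceil(current_replicas * current_utilization / target_utilization)
    return max(min_replicas, min(max_replicas, raw))


if __name__ == "__main__":
    # 4 pods at 90% average CPU with a 60% target
    print(desired_replicas(4, 90, 60))
```

A production autoscaler also waits between scaling decisions so that short spikes do not cause thrashing.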
Stateful Workload
• Kubernetes
  • PetSet: Replicas with stable membership and volumes
    • stable hostname
    • ordinal index
    • stable storage
• Docker SwarmKit
  • not yet, and stateful services are not recommended
• Mesos + Marathon
  • Stateful Applications
    • dynamic reservations, reservation labels, and persistent volumes
[Diagram: PetSet example]
cassandra-0 | volume 0 | cassandra-0.cassandra.default.svc.cluster.local
cassandra-1 | volume 1 | cassandra-1.cassandra.default.svc.cluster.local
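The stable identities in the Cassandra example follow a simple pattern: an ordinal index appended to the set name, plus a DNS name under the governing service. The sketch below reproduces that pattern. The helper name and the "volume N" labels (taken from the diagram) are illustrative, and the default namespace and cluster domain are assumptions.

```python
def pet_identities(petset, service, replicas,
                   namespace="default", cluster_domain="cluster.local"):
    """Build the stable name, DNS entry and volume label for each replica."""
    identities = []
    for ordinal in range(replicas):
        pod = f"{petset}-{ordinal}"
        identities.append({
            "pod": pod,
            "dns": f"{pod}.{service}.{namespace}.svc.{cluster_domain}",
            "volume": f"volume {ordinal}",  # placeholder label from the diagram
        })
    return identities


if __name__ == "__main__":
    for ident in pet_identities("cassandra", "cassandra", 2):
        print(ident["pod"], ident["volume"], ident["dns"])
```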
Chapter 3: Service Discovery & Load Balance
Service Discovery & LB
• Kubernetes
  • Load Balancer
    • iptables
  • External Access
    • <externalIP route to node(s)>:<port>
    • NodePort: <ip of any node>:<NodePort>
    • External LoadBalancer
    • Ingress (L7)
      • Ingress Pod: Nginx, HAProxy
      • SSL
  • Name Service
    • built-in SkyDNS pod
[Diagram: two Nodes hosting Pod 1 and Pod 2, plus a Node with an Ingress Pod. Labels: internal traffic, outside traffic, ingress traffic http://foo.bar.com, portal iptables rule 10.10.0.116:8001, random mode iptables rules, pod rule 1, pod rule 2]
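The "random mode iptables rules" spread traffic evenly even though rules are evaluated one after another: with n backends, rule i matches with probability 1/(n-i), and the last rule always matches. The sketch below computes those per-rule probabilities and checks that each backend ends up with an equal share. It models only the arithmetic, not actual iptables rules, and the function names are hypothetical.

```python
from fractions import Fraction


def rule_probabilities(n_endpoints):
    """Match probability attached to each rule in a chain of n backends."""
    return [Fraction(1, n_endpoints - i) for i in range(n_endpoints)]


def effective_probabilities(rule_probs):
    """Chance that a packet is actually handled by each rule in order."""
    reach = Fraction(1)  # probability the packet reaches the current rule
    shares = []
    for p in rule_probs:
        shares.append(reach * p)
        reach *= 1 - p
    return shares


if __name__ == "__main__":
    probs = rule_probabilities(3)
    print(probs)
    print(effective_probabilities(probs))
```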
Service Discovery & Load Balance
• Docker SwarmKit
  • Load Balancer
    • ipvs NAT mode
  • External Access
    • Routing Mesh
  • Name Service
    • embedded DNS server
      • for service and task
  • Two kinds of sandboxes
    • ingress: on every worker
    • container: on workers where task lives
  • Two networks are needed
    • ingress overlay
    • user-defined overlay
[Diagram: two Workers, each with an ingress sandbox and a container sandbox, running Container 1 and Container 2. Labels: outside traffic (when service created with -p), port mapping, iptables, ipvs, internal traffic, DNS: svc->vip, Gossip to update the iptables & ipvs rules]
Service Discovery & Load Balance
• Mesos + Marathon
  • Load Balancer
    • Marathon-lb: HAProxy based
    • virtual addresses (VIPs) in DC/OS
  • External Access
    • http://<public agent ip>:<servicePort>
    • external load balancer
  • Name Service
    • Mesos-DNS
[Diagram: two Slaves running Container 1 and Container 2, with Marathon-lb and Mesos-DNS]
Checkpoint
 | Kubernetes | Docker SwarmKit | Mesos+Marathon
Filter | iptables VIP | iptables VIP | no need
LB | iptables random mode | ipvs NAT mode | HAProxy
External Access | nodeIP:port, Ingress, external IP/LB | Routing Mesh (ingress overlay) | same as exposing HAProxy to the public
Update | watch etcd | gossip | marathon_lb.py & template
Chapter 4: Scheduling
Kubernetes
• Pod as schedule unit
  • this is unique, but why?
• Multi-Scheduler
  • pod1: scheduler1, pod2: scheduler2
• QoS tiers
  • anyone remember the core idea of Borg?
  • Guaranteed (requests == limits)
  • Burstable (requests < limits)
  • Best-Effort (no requests & limits)
• More Borg features are on the way
  • equivalence class, pod level resource boundary …
[Diagram: Burstable Pod]
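The three QoS tiers can be derived from a pod's resource requests and limits. The sketch below classifies a pod along the lines listed above. It is a simplification, not kubelet code: quantities are compared as plain strings (so "1" and "1000m" would not count as equal), only cpu and memory are considered, and the dict layout is an assumption made for the example.

```python
def qos_class(containers):
    """Classify a pod as Guaranteed, Burstable or BestEffort.

    `containers` is a list of dicts with optional "requests" and "limits"
    keys, each mapping a resource name ("cpu", "memory") to a quantity.
    """
    # Best-Effort: no container sets any request or limit.
    if all(not c.get("requests") and not c.get("limits") for c in containers):
        return "BestEffort"

    for c in containers:
        limits = c.get("limits") or {}
        # A request that is not set explicitly defaults to the limit.
        requests = dict(limits)
        requests.update(c.get("requests") or {})
        # Guaranteed needs cpu and memory limits, with requests equal to them.
        if set(limits) != {"cpu", "memory"} or requests != limits:
            return "Burstable"
    return "Guaranteed"


if __name__ == "__main__":
    pod = [{"requests": {"cpu": "500m"}, "limits": {"cpu": "1"}}]
    print(qos_class(pod))
```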
Docker SwarmKit
• Task (container) as schedule unit
• Multi-Scheduler
  • not yet
• Strategy
  • pipeline of filters
    • ReadyFilter, ResourceFilter, ConstraintFilter
  • to sort nodeHeap
• QoS tiers
  • not yet
Mesos + Marathon
• Task as schedule unit (Pod support is planned)
• Multi-Scheduler
  • Mesos is designed to run multiple frameworks (schedulers)
• Strategy
  • Two level scheduling (the killer feature of Mesos)
    • Twitter scale …
    • fine-grained resource sharing (like Borg)
• QoS tiers
  • of course
• And much more
  • task eviction, data locality, max-min fairness, priority, offer reject, Delay Scheduling
  • and Big Data of course
Chapter 5: Summary
A Use Case: hyper.sh
• hyper.sh is “Docker Done the Right Way”
  • $ hyper run mysql
  • $ hyper run --link mysql wordpress
  • $ hyper fip attach 22.33.44.55 wordpress
• But Hyper.sh is powered by Kubernetes
  • and also maintains Kubernetes features
Extensibility Really Matters
• Hypernetes (h8s = k8s + HyperContainer) is what’s backing Hyper.sh:
  • HyperContainer runtime
  • Multi-tenant network based on Neutron
  • Custom Cinder plugin with Ceph backend
  • Custom HAProxy based Service
• Kubernetes is truly extensible and configurable
Just a Personal Opinion
• So, if
  • I am an individual developer/org, trying to find something that is friendly and just works
    • I use Docker SwarmKit
  • I have a “Twitter scale” cluster to manage or I am a Big Data user
    • I need Mesos
• But if what I need is an infrastructure layer to build my systems on top of in the right way
  • Kubernetes is the choice
THE END @resouer