Everything You Thought You Already Knew About Orchestration
Laura Frank Director of Engineering, Codeship
Agenda
• Quorum 101
• Managing Distributed State with Raft: Leader Election & Log Replication
• Service Scheduling
• Failure Recovery
• bonus debugging tips!
The Big Problem(s)
What are tools like Swarm and Kubernetes trying to do? They’re trying to get a collection of nodes to behave like a single node.
• How does the system maintain state?
• How does work get scheduled?
[diagram: a raft consensus group of three Managers (one leader, two followers) and their Worker nodes]
So You Think You Have Quorum
Quorum
The minimum number of votes that a consensus group needs in order to be allowed to perform an operation.
Without quorum, your system can’t do work.
Math!

Managers | Quorum | Fault Tolerance
1 | 1 | 0
2 | 2 | 0
3 | 2 | 1
4 | 3 | 1
5 | 3 | 2
6 | 4 | 2
7 | 4 | 3

Quorum = (N/2) + 1
In simpler terms, it means a majority.
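The table can be reproduced with a couple of lines of shell; integer division makes (N/2) + 1 the majority:

```shell
#!/bin/sh
# Quorum and fault tolerance for a given number of managers.
# quorum = floor(N/2) + 1; fault tolerance = N - quorum
for n in 1 2 3 4 5 6 7; do
  quorum=$(( n / 2 + 1 ))
  tolerance=$(( n - quorum ))
  echo "$n managers: quorum $quorum, can lose $tolerance"
done
```

Note that an even N never buys you anything: 4 managers tolerate the same single failure as 3, while adding one more node that can fail.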
Having two managers instead of one actually doubles your chances of losing quorum.
Pay attention to datacenter topology when placing managers.
Quorum With Multiple Regions

Manager Nodes | Distribution across 3 Regions
3 | 1-1-1
5 | 1-2-2
7 | 3-2-2
9 | 3-3-3
magically works with Docker for AWS
Let’s talk about Raft!
I think I’ll just write my own distributed consensus algorithm.
-no sensible person
Raft is responsible for…
• Leader election
• Log replication
• Safety (won’t talk about this much today)
• Being easier to understand
Orchestration systems typically use a key/value store backed by a consensus algorithm
In a lot of cases, that algorithm is Raft!
Raft is used everywhere… …that etcd is used
SwarmKit implements the Raft algorithm directly.
In most cases, you don’t want to run work on your manager nodes. Participating in a Raft consensus group is work, too. Make your manager nodes unavailable for tasks:

docker service update is not what you want here; drain the node itself:

docker node update --availability drain <NODE>

*I will run work on managers for educational purposes
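To drain every manager at once, you can feed `docker node ls` into the update command. A sketch, assuming the docker CLI is on your PATH and you are running it against a manager:

```shell
#!/bin/sh
# Drain all managers so the scheduler stops placing tasks on them.
for node in $(docker node ls --filter role=manager --format '{{.Hostname}}'); do
  docker node update --availability drain "$node"
done
```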
Leader Election & Log Replication
Manager leader
Manager candidate
Manager follower
Manager offline
demo.consensus.group
The log is the source of truth for your application.
In the context of distributed computing (and this talk), a log is an append-only, time-based record of data.
[diagram: an append-only log with entries 2, 10, 30, 25, 5, 12; first entry at the left, new entries appended at the end]
This log is for computers, not humans.
[diagram: a Client sends the value 12 to a Server, which appends it to the log]
In simple systems, the log is pretty straightforward.
In a manager group, that log entry can only “become truth” once it is confirmed from the majority of followers (quorum!)
[diagram: a Client sends 12 to the Manager leader, which replicates the entry to the Manager followers]
demo.consensus.group
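The commit rule can be illustrated with a toy check (this is not real Raft, just the majority arithmetic the leader applies before acknowledging a write):

```shell
#!/bin/sh
# Toy illustration: an entry only "becomes truth" once the leader
# has acknowledgements from a majority of the consensus group.
managers=3
acks=2          # the leader's own write counts as one ack
quorum=$(( managers / 2 + 1 ))
if [ "$acks" -ge "$quorum" ]; then
  echo "entry committed"
else
  echo "entry pending"
fi
```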
In distributed computing, it’s essential that you understand log replication.
bit.ly/logging-post
Debugging Tip
Watch the Raft logs.
Monitor via inotifywait OR just read them directly!
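One way to do this, assuming the default Docker root directory (adjust the path if yours differs; `inotifywait` comes from the inotify-tools package):

```shell
# Watch the swarm Raft state directory for writes as they happen
sudo inotifywait -m -e modify -e create /var/lib/docker/swarm/raft/

# ...or just inspect the files directly
sudo ls -l /var/lib/docker/swarm/raft/
```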
Scheduling
HA application problems
scheduling problems
orchestrator problems
Scheduling constraints
Restrict services to specific nodes, such as specific architectures, security levels, or types
docker service create \
  --constraint 'node.labels.type==web' my-app
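Putting it together, you label a node and then constrain the service to nodes carrying that label. A sketch; the node name, label, and image are examples:

```shell
# Tag a node as a web-class machine
docker node update --label-add type=web node1

# Tasks for this service will only be scheduled on matching nodes
docker service create \
  --constraint 'node.labels.type==web' \
  --name my-app nginx
```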
New in 17.04.0-ce: Topology-aware scheduling!!1!
Implements a spread strategy over nodes that belong to a certain category.
Unlike --constraint, this is a “soft” preference
--placement-pref 'spread=node.labels.dc'
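For example, with a `dc` label on each node, replicas are spread evenly across the label's values. A sketch; the label values, replica count, and image are examples:

```shell
# Label nodes with the datacenter they live in
docker node update --label-add dc=east node1
docker node update --label-add dc=west node2

# Spread replicas evenly across the dc values (a soft preference:
# if one dc goes down, tasks still run in the others)
docker service create \
  --replicas 6 \
  --placement-pref 'spread=node.labels.dc' \
  --name my-app nginx
```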
Swarm will not rebalance healthy tasks when a new node comes online
Debugging Tip
Add a manager to your Swarm running with --availability drain and in Engine debug mode
Failure Recovery
Losing quorum
• Bring the downed nodes back online (derp)

Regain quorum
• On a healthy manager, run docker swarm init --force-new-cluster
• This will create a new cluster with one healthy manager
• You need to promote new managers
The datacenter is on fire
Restore from a backup in 5 easy steps!
• Bring up a new manager and stop Docker
• sudo rm -rf /var/lib/docker/swarm
• Copy backup to /var/lib/docker/swarm
• Start Docker
• docker swarm init (--force-new-cluster)
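The five steps, as a script. A sketch: `BACKUP_DIR` is an assumed location for your backup, and the systemd unit name may differ on your distro:

```shell
#!/bin/sh
# Restore a swarm from a backup of /var/lib/docker/swarm
BACKUP_DIR=/mnt/backups/swarm   # assumed path; point at your backup

systemctl stop docker
rm -rf /var/lib/docker/swarm
cp -r "$BACKUP_DIR" /var/lib/docker/swarm
systemctl start docker

# Re-form a single-manager cluster from the restored state
docker swarm init --force-new-cluster
```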
But wait, there’s a bug… or a feature
• In general, users shouldn’t be allowed to modify IP addresses of nodes
• Restoring from a backup == old IP address for node1
• Workaround is to use elastic IPs with the ability to reassign
Thank You!
@docker
#dockercon