Heart of the SwarmKit: Store, Topology & Object Model

Transcript
Page 1: Heart of the SwarmKit: Store, Topology & Object Model

Heart of the SwarmKit

Stephen Day, Andrea Luzzardi, Aaron Lehmann

Docker Distributed Systems Summit, Berlin, October 2016

v0

Page 2: Heart of the SwarmKit: Store, Topology & Object Model

Heart of the SwarmKit: Data Model

Stephen Day, Docker, Inc.
Docker Distributed Systems Summit, Berlin, October 2016

v0

Page 3: Heart of the SwarmKit: Store, Topology & Object Model

Stephen Day / Docker, Inc. / github.com/stevvooe / @stevvooe

Page 4: Heart of the SwarmKit: Store, Topology & Object Model

SwarmKit: A new framework by Docker for building orchestration systems.

Page 5: Heart of the SwarmKit: Store, Topology & Object Model

5

Orchestration: A control system for your cluster

[Control-loop diagram: the desired state D and the observed cluster state St are compared, and the Orchestrator O applies operations Δ to the Cluster to close the gap]

D = Desired State
O = Orchestrator
C = Cluster
St = State at time t
Δ = Operations to converge S to D

https://en.wikipedia.org/wiki/Control_theory

Page 6: Heart of the SwarmKit: Store, Topology & Object Model

6

Convergence: A functional view

D = Desired State
O = Orchestrator
C = Cluster
St = State at time t

f(D, Sₙ₋₁, C) → Sₙ | min(S − D)
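Read functionally: given the desired state D, the previous state Sₙ₋₁, and the cluster C, one step of the orchestrator produces the next state Sₙ that minimizes the remaining difference from D. A minimal Go sketch of such a convergence loop, with hypothetical Diff/Observe helpers rather than SwarmKit's actual code:

package orchestration

import "time"

// State is a hypothetical snapshot of cluster state; Op is one operation
// (part of Δ) that moves the cluster toward the desired state.
type State struct{ /* tasks, nodes, ... */ }
type Op func(c *Cluster)

// Cluster stands in for the real cluster the orchestrator acts on.
type Cluster struct{ /* connections to nodes, ... */ }

// Observe returns the current state S(t); Diff computes Δ for a given
// desired/observed pair. Both are stand-ins for the real logic.
func (c *Cluster) Observe() State       { return State{} }
func Diff(desired, observed State) []Op { return nil }

// reconcile is the control loop: observe S(t), compute Δ = Diff(D, S(t)),
// apply it, and repeat, so the cluster converges toward D.
func reconcile(desired State, c *Cluster) {
	for {
		observed := c.Observe()
		for _, op := range Diff(desired, observed) {
			op(c)
		}
		// A real orchestrator blocks on store events instead of polling.
		time.Sleep(time.Second)
	}
}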

Page 7: Heart of the SwarmKit: Store, Topology & Object Model

7

Observability and Controllability: The Problem

[Spectrum diagram, from Low Observability to High Observability: Failure → Process State → User Input]

Page 8: Heart of the SwarmKit: Store, Topology & Object Model

8

Data Model Requirements

- Represent difference in cluster state
- Maximize Observability
- Support Convergence
- Do this while being Extensible and Reliable

Page 9: Heart of the SwarmKit: Store, Topology & Object Model

Show me your data structures and I’ll show you your orchestration system

Page 10: Heart of the SwarmKit: Store, Topology & Object Model

10

Services
- Express desired state of the cluster
- Abstraction to control a set of containers
- Enumerates resources, network availability, placement
- Leave the details of runtime to container process
- Implement these services by distributing processes across a cluster

[Diagram: service processes distributed across Node 1, Node 2, Node 3]

Page 11: Heart of the SwarmKit: Store, Topology & Object Model

11

Declarative

$ docker network create -d overlay backend
31ue4lvbj4m301i7ef3x8022t

$ docker service create -p 6379:6379 --network backend redis
bhk0gw6f0bgrbhmedwt5lful6

$ docker service scale serene_euler=3
serene_euler scaled to 3

$ docker service ls
ID            NAME          REPLICAS  IMAGE  COMMAND
dj0jh3bnojtm  serene_euler  3/3       redis

Page 12: Heart of the SwarmKit: Store, Topology & Object Model

12

Reconciliation: Spec → Object

Object = Current State
Spec = Desired State

Page 13: Heart of the SwarmKit: Store, Topology & Object Model

Task Model

Prepare: set up resources
Start: start the task
Wait: wait until task exits
Shutdown: stop task, cleanly

Runtime

Page 14: Heart of the SwarmKit: Store, Topology & Object Model

14

Task Model: Atomic Scheduling Unit of SwarmKit

[Diagram: the Orchestrator reconciles Spec (Desired State) against Object (Current State), producing Task0, Task1, … Taskn, which are handed to the Scheduler]

Page 15: Heart of the SwarmKit: Store, Topology & Object Model

Service Spec

message ServiceSpec {
	// Task defines the task template this service will spawn.
	TaskSpec task = 2 [(gogoproto.nullable) = false];

	// UpdateConfig controls the rate and policy of updates.
	UpdateConfig update = 6;

	// Service endpoint specifies the user provided configuration
	// to properly discover and load balance a service.
	EndpointSpec endpoint = 8;
}

Protobuf Example
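For orientation, building roughly this spec from Go might look like the following. The generated field and wrapper names here (Annotations, TaskSpec_Container, ServiceSpec_Replicated) are my assumption about the gogo/protobuf output in github.com/docker/swarmkit/api, so treat this as a sketch rather than canonical usage:

package example

import "github.com/docker/swarmkit/api"

// redisSpec sketches a ServiceSpec for a replicated redis service.
func redisSpec() api.ServiceSpec {
	return api.ServiceSpec{
		Annotations: api.Annotations{Name: "redis"},
		Task: api.TaskSpec{
			// The oneof "runtime" field is set through its generated wrapper type.
			Runtime: &api.TaskSpec_Container{
				Container: &api.ContainerSpec{Image: "redis:3.2"},
			},
		},
		Mode: &api.ServiceSpec_Replicated{
			Replicated: &api.ReplicatedService{Replicas: 3},
		},
	}
}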

Page 16: Heart of the SwarmKit: Store, Topology & Object Model

Service Object

message Service {
	ServiceSpec spec = 3;

	// UpdateStatus contains the status of an update, if one is in
	// progress.
	UpdateStatus update_status = 5;

	// Runtime state of service endpoint. This may be different
	// from the spec version because the user may not have entered
	// the optional fields like node_port or virtual_ip and it
	// could be auto allocated by the system.
	Endpoint endpoint = 4;
}

Protobuf Example

Page 17: Heart of the SwarmKit: Store, Topology & Object Model

Data Flow

[Data-flow diagram, Manager to Worker: a ServiceSpec contains a TaskSpec; the Service object embeds the ServiceSpec; each Task carries its own copy of the TaskSpec and is delivered to the Worker]

Page 18: Heart of the SwarmKit: Store, Topology & Object Model

Consistency

Page 19: Heart of the SwarmKit: Store, Topology & Object Model

19

Field Ownership

Only one component of the system can write to a field

Consistency

Page 20: Heart of the SwarmKit: Store, Topology & Object Model

TaskSpec

message TaskSpec {
	oneof runtime {
		NetworkAttachmentSpec attachment = 8;
		ContainerSpec container = 1;
	}

	// Resource requirements for the container.
	ResourceRequirements resources = 2;

	// RestartPolicy specifies what to do when a task fails or finishes.
	RestartPolicy restart = 4;

	// Placement specifies node selection constraints
	Placement placement = 5;

	// Networks specifies the list of network attachment
	// configurations (which specify the network and per-network
	// aliases) that this task spec is bound to.
	repeated NetworkAttachmentConfig networks = 7;
}

Protobuf Examples

Page 21: Heart of the SwarmKit: Store, Topology & Object Model

Task

message Task {
	TaskSpec spec = 3;
	string service_id = 4;
	uint64 slot = 5;
	string node_id = 6;
	TaskStatus status = 9;
	TaskState desired_state = 10;
	repeated NetworkAttachment networks = 11;
	Endpoint endpoint = 12;
	Driver log_driver = 13;
}

Protobuf Example

[On the slide, each field is annotated with its owner: User, Orchestrator, Allocator, Scheduler, or Shared]

Page 22: Heart of the SwarmKit: Store, Topology & Object Model

Task State

[State diagram, split between Manager and Worker]
Manager-owned: New → Allocated → Assigned
Worker-owned (Pre-Run): Preparing → Ready → Starting → Running
Terminal states: Complete, Shutdown, Failed, Rejected

Page 23: Heart of the SwarmKit: Store, Topology & Object Model

Field Handoff: Task Status

State        Owner
< Assigned   Manager
>= Assigned  Worker

Page 24: Heart of the SwarmKit: Store, Topology & Object Model

24

Observability and Controllability: The Problem

[Spectrum diagram, from Low Observability to High Observability: Failure → Process State → User Input]

Page 25: Heart of the SwarmKit: Store, Topology & Object Model

25

Orchestration: A control system for your cluster

[Control-loop diagram: the desired state D and the observed cluster state St are compared, and the Orchestrator O applies operations Δ to the Cluster to close the gap]

D = Desired State
O = Orchestrator
C = Cluster
St = State at time t
Δ = Operations to converge S to D

https://en.wikipedia.org/wiki/Control_theory

Page 26: Heart of the SwarmKit: Store, Topology & Object Model

26

Reconciliation: Spec → Object

Object = Current State
Spec = Desired State

Page 27: Heart of the SwarmKit: Store, Topology & Object Model

SwarmKit doesn’t Quit

Page 28: Heart of the SwarmKit: Store, Topology & Object Model

Heart of the SwarmKit: Topology Management
So you’ve got thousands of machines… Now what?

Andrea Luzzardi / [email protected] / @aluzzardi / Docker Inc.

Page 29: Heart of the SwarmKit: Store, Topology & Object Model

Push vs Pull Model

Page 30: Heart of the SwarmKit: Store, Topology & Object Model

30

Push vs Pull

[Diagram]
Push: the Worker (1) registers with a discovery service such as ZooKeeper, the Manager (2) discovers Workers through it, and (3) pushes the payload to them.
Pull: the Worker connects directly to the Manager for both registration and payload.

Page 31: Heart of the SwarmKit: Store, Topology & Object Model

31

Push vs Pull

Push
• Pros: Provides better control over communication rate
	− Managers decide when to contact Workers
• Cons: Requires a discovery mechanism
	− More failure scenarios
	− Harder to troubleshoot

Pull
• Pros: Simpler to operate
	− Workers connect to Managers and don’t need to bind
	− Can easily traverse networks
	− Easier to secure
	− Fewer moving parts
• Cons: Workers must maintain connection to Managers at all times

Page 32: Heart of the SwarmKit: Store, Topology & Object Model

32

Push vs Pull
• SwarmKit adopted the Pull model
• Favored operational simplicity
• Engineered solutions to provide rate control in pull mode

Page 33: Heart of the SwarmKit: Store, Topology & Object Model

Rate Control: Controlling communication rate in a Pull model

Page 34: Heart of the SwarmKit: Store, Topology & Object Model

34

Rate Control: Heartbeats

• Manager dictates heartbeat rate to Workers
• Rate is configurable
• Managers agree on the same rate by consensus (Raft)
• Managers add jitter so pings are spread over time (avoid bursts; sketched below)

[Diagram: Worker asks "Ping?"; Manager replies "Pong! Ping me back in 5.2 seconds"]
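A rough sketch of how that jitter might be computed on the manager side; the period and the +25% jitter bound are illustrative assumptions, not SwarmKit's actual values or API:

package heartbeat

import (
	"math/rand"
	"time"
)

// nextPeriod returns how long a worker should wait before its next ping:
// the cluster-wide period (agreed on by the managers through Raft) plus a
// random jitter, so pings from many workers don't arrive in bursts.
func nextPeriod(period time.Duration) time.Duration {
	jitter := time.Duration(rand.Int63n(int64(period) / 4)) // up to +25%
	return period + jitter
}

With a 5 s base period this yields answers like the "ping me back in 5.2 seconds" in the diagram.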

Page 35: Heart of the SwarmKit: Store, Topology & Object Model

35

Rate Control: Workloads

• Worker opens a gRPC stream to receive workloads
• Manager can send data whenever it wants to
• Manager will send data in batches
• Changes are buffered and sent in batches of 100 or every 100 ms, whichever occurs first (see the sketch below)
• Adds little delay (at most 100 ms) but drastically reduces amount of communication

[Diagram: Worker asks "Give me work to do"; Manager replies over time: 100 ms [Batch of 12], 200 ms [Batch of 26], 300 ms [Batch of 32], 340 ms [Batch of 100], 360 ms [Batch of 100], 460 ms [Batch of 42], 560 ms [Batch of 23]]
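The batching rule ("flush at 100 changes or after 100 ms, whichever comes first") can be sketched like this; the channel, Change type, and send callback are hypothetical stand-ins for the real gRPC stream plumbing:

package dispatch

import "time"

// Change is a hypothetical stand-in for one assignment change destined for a worker.
type Change struct{ /* task added / updated / removed */ }

// batchAndSend buffers incoming changes and flushes them when the buffer
// reaches maxBatch items or maxDelay has passed, whichever happens first.
func batchAndSend(changes <-chan Change, send func([]Change)) {
	const (
		maxBatch = 100
		maxDelay = 100 * time.Millisecond
	)
	var buf []Change
	flush := func() {
		if len(buf) > 0 {
			send(buf)
			buf = nil
		}
	}
	ticker := time.NewTicker(maxDelay)
	defer ticker.Stop()
	for {
		select {
		case c, ok := <-changes:
			if !ok {
				flush()
				return
			}
			buf = append(buf, c)
			if len(buf) >= maxBatch {
				flush()
			}
		case <-ticker.C:
			flush() // adds at most ~100 ms of latency
		}
	}
}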

Page 36: Heart of the SwarmKit: Store, Topology & Object Model

Replication: Running multiple managers for high availability

Page 37: Heart of the SwarmKit: Store, Topology & Object Model

37

Replication

[Diagram: three Managers (one Leader, two Followers) and a Worker]

• Worker can connect to any Manager
• Followers will forward traffic to the Leader

Page 38: Heart of the SwarmKit: Store, Topology & Object Model

38

Replication

[Diagram: Workers connected across the Leader and Follower Managers]

• Followers multiplex all workers to the Leader using a single connection
• Backed by gRPC channels (HTTP/2 streams)
• Reduces Leader networking load by spreading the connections evenly

Example: On a cluster with 10,000 workers and 5 managers, each will only have to handle about 2,000 connections. Each follower will forward its 2,000 workers using a single socket to the leader.

Page 39: Heart of the SwarmKit: Store, Topology & Object Model

39

Replication

[Diagram: one Leader and two Follower Managers, with Workers attached]

• Upon Leader failure, a new one is elected
• All managers start redirecting worker traffic to the new one
• Transparent to workers

Page 40: Heart of the SwarmKit: Store, Topology & Object Model

40

Replication

[Diagram: same cluster after the election; a former Follower is now the Leader]

• Upon Leader failure, a new one is elected
• All managers start redirecting worker traffic to the new one
• Transparent to workers

Page 41: Heart of the SwarmKit: Store, Topology & Object Model

41

Replication

[Diagram: Manager 1, Manager 2, Manager 3 (one Leader, two Followers) and a Worker holding the address list: Manager 1 Addr, Manager 2 Addr, Manager 3 Addr]

• Manager sends list of all managers’ addresses to Workers
• When a new manager joins, all workers are notified
• Upon manager failure, workers will reconnect to a different manager

Page 42: Heart of the SwarmKit: Store, Topology & Object Model

42

Replication

[Diagram: Manager 1, Manager 2, Manager 3 and a Worker]

• Manager sends list of all managers’ addresses to Workers
• When a new manager joins, all workers are notified
• Upon manager failure, workers will reconnect to a different manager

Page 43: Heart of the SwarmKit: Store, Topology & Object Model

43

Replication

[Diagram: one manager has failed; the Worker reconnects to a random remaining manager ("Reconnect to random manager")]

• Manager sends list of all managers’ addresses to Workers
• When a new manager joins, all workers are notified
• Upon manager failure, workers will reconnect to a different manager

Page 44: Heart of the SwarmKit: Store, Topology & Object Model

44

Replication

• gRPC handles connection management
	− Exponential backoff, reconnection jitter, …
	− Avoids flooding managers on failover
	− Connections evenly spread across Managers
• Manager Weights
	− Allows Manager prioritization / de-prioritization
	− Gracefully remove Manager from rotation

Page 45: Heart of the SwarmKit: Store, Topology & Object Model

Presence: Scalable presence in a distributed environment

Page 46: Heart of the SwarmKit: Store, Topology & Object Model

46

Presence

• Leader commits Worker state (Up vs Down) into Raft
	− Propagates to all managers
	− Recoverable in case of leader re-election
• Heartbeat TTLs kept in Leader memory (sketched below)
	− Too expensive to store “last ping time” in Raft: every ping would result in a quorum write
	− Leader keeps worker <-> TTL in a heap (time.AfterFunc)
	− Upon leader failover, workers are given a grace period to reconnect
• Workers considered Unknown until they reconnect
	− If they do, they move back to Up
	− If they don’t, they move to Down
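A simplified sketch of that leader-side TTL bookkeeping using time.AfterFunc; the type and method names are hypothetical, and the real dispatcher does considerably more:

package presence

import (
	"sync"
	"time"
)

// ttlTracker marks a worker Down if it fails to ping again before its TTL expires.
type ttlTracker struct {
	mu     sync.Mutex
	timers map[string]*time.Timer
	down   func(nodeID string) // e.g. commit the Down transition through Raft
}

func newTTLTracker(down func(string)) *ttlTracker {
	return &ttlTracker{timers: make(map[string]*time.Timer), down: down}
}

// heartbeat records a ping from nodeID and re-arms its expiration timer.
// The ping itself never touches Raft; only Up/Down transitions would.
func (t *ttlTracker) heartbeat(nodeID string, ttl time.Duration) {
	t.mu.Lock()
	defer t.mu.Unlock()
	if timer, ok := t.timers[nodeID]; ok {
		timer.Stop()
	}
	t.timers[nodeID] = time.AfterFunc(ttl, func() { t.down(nodeID) })
}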

Page 47: Heart of the SwarmKit: Store, Topology & Object Model

Heart of the SwarmKit: Distributed Data Store

Aaron Lehmann, Docker

Page 48: Heart of the SwarmKit: Store, Topology & Object Model

What we store

● State of the cluster
● User-defined configuration
● Organized into objects:
	○ Cluster
	○ Node
	○ Service
	○ Task
	○ Network
	○ etc...

48

Page 49: Heart of the SwarmKit: Store, Topology & Object Model

Why embed the distributed data store?

● Ease of setup
● Fewer round trips
● Can maintain local indices

49

Page 50: Heart of the SwarmKit: Store, Topology & Object Model

In-memory data structures

● Objects are protocol buffers messages
● go-memdb used as in-memory database: https://github.com/hashicorp/go-memdb (bare usage sketched below)
● Underlying data structure: radix trees

50
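For reference, bare go-memdb usage looks roughly like this; SwarmKit hides it behind its own store package, so this is the library's API rather than SwarmKit's, and the table/index layout is illustrative:

package main

import memdb "github.com/hashicorp/go-memdb"

// Task is a minimal illustrative record; SwarmKit stores its protobuf objects instead.
type Task struct {
	ID        string
	ServiceID string
}

func main() {
	// Schema with a unique "id" index and a secondary index on ServiceID.
	schema := &memdb.DBSchema{
		Tables: map[string]*memdb.TableSchema{
			"tasks": {
				Name: "tasks",
				Indexes: map[string]*memdb.IndexSchema{
					"id":      {Name: "id", Unique: true, Indexer: &memdb.StringFieldIndex{Field: "ID"}},
					"service": {Name: "service", Indexer: &memdb.StringFieldIndex{Field: "ServiceID"}},
				},
			},
		},
	}
	db, err := memdb.NewMemDB(schema)
	if err != nil {
		panic(err)
	}

	txn := db.Txn(true) // write transaction
	if err := txn.Insert("tasks", &Task{ID: "abcd", ServiceID: "1234"}); err != nil {
		panic(err)
	}
	txn.Commit()

	// Read transactions see an immutable snapshot of the radix trees.
	read := db.Txn(false)
	it, err := read.Get("tasks", "service", "1234")
	if err != nil {
		panic(err)
	}
	for obj := it.Next(); obj != nil; obj = it.Next() {
		_ = obj.(*Task) // each task belonging to service 1234
	}
}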

Page 51: Heart of the SwarmKit: Store, Topology & Object Model

Radix trees for indexing

[Diagram: a radix tree where the shared prefix "Hel" branches to Hello and Helpful, and "Wo" branches to Won and to "Wor", which in turn branches to World, Work, Word]

51

Page 52: Heart of the SwarmKit: Store, Topology & Object Model

Radix trees for indexing

[Diagram: one tree holds several indices by key prefix. Under "id:": id:abcd, id:efgh, id:ijkl, id:mnop. Under "node:": node:1234 (node:1234:abcd, node:1234:efgh) and node:5678 (node:5678:ijkl, node:5678:mnop)]

52

Page 53: Heart of the SwarmKit: Store, Topology & Object Model

Lightweight in-memory snapshots

[Diagram: radix tree with keys id:abcd, id:efgh, id:ijkl, id:mnop; edges are actually pointers]

53

Page 54: Heart of the SwarmKit: Store, Topology & Object Model

Lightweight in-memory snapshots

[Diagram: the same tree after inserting id:qrst]

54

Page 55: Heart of the SwarmKit: Store, Topology & Object Model

Lightweight in-memory snapshots

[Diagram: animation frame, same tree as the previous slide]

55

Page 56: Heart of the SwarmKit: Store, Topology & Object Model

Lightweight in-memory snapshots

[Diagram: animation frame, same tree as the previous slide]

56

Page 57: Heart of the SwarmKit: Store, Topology & Object Model

Transactions

● We provide a transactional interface to read or write data in the store
● Read transactions are just atomic snapshots (see the library sketch below)
● Write transaction:
	○ Take a snapshot
	○ Make changes
	○ Replace tree root with modified tree’s root (atomic pointer swap)
● Only one write transaction allowed at once
● Commit of write transaction blocks until changes are committed to Raft

57
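The "snapshot is just a root pointer" idea is visible directly in go-immutable-radix, the structure underneath go-memdb; this is a library illustration rather than SwarmKit code:

package main

import (
	"fmt"

	iradix "github.com/hashicorp/go-immutable-radix"
)

func main() {
	r := iradix.New()
	r, _, _ = r.Insert([]byte("id:abcd"), 1)
	r, _, _ = r.Insert([]byte("id:efgh"), 2)

	snapshot := r // a "snapshot" is just holding on to the current root pointer

	// A write copies only the path it touches and returns a new root;
	// the old root, and any reader using it, is untouched.
	r, _, _ = r.Insert([]byte("id:qrst"), 3)

	_, inNew := r.Get([]byte("id:qrst"))
	_, inOld := snapshot.Get([]byte("id:qrst"))
	fmt.Println(inNew, inOld) // true false
}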

Page 58: Heart of the SwarmKit: Store, Topology & Object Model

Transaction example: Read

dataStore.View(func(tx store.ReadTx) {
	tasks, err = store.FindTasks(tx, store.ByServiceID(serviceID))
	if err == nil {
		for _, t := range tasks {
			fmt.Println(t.ID)
		}
	}
})

58

Page 59: Heart of the SwarmKit: Store, Topology & Object Model

Transaction example: Write

err := dataStore.Update(func(tx store.Tx) error {
	t := store.GetTask(tx, "id1")
	if t == nil {
		return errors.New("task not found")
	}
	t.DesiredState = api.TaskStateRunning
	return store.UpdateTask(tx, t)
})

59

Page 60: Heart of the SwarmKit: Store, Topology & Object Model

Watches

● Code can register to receive specific creation, update, or deletion events on a Go channel
● Selectors on particular fields in the objects
● Currently an internal feature, will expose through API in the future

60

Page 61: Heart of the SwarmKit: Store, Topology & Object Model

Watches

watch, cancelWatch = state.Watch(
	r.store.WatchQueue(),
	state.EventUpdateTask{
		Task:   &api.Task{ID: oldTask.ID, Status: api.TaskStatus{State: api.TaskStateRunning}},
		Checks: []state.TaskCheckFunc{state.TaskCheckID, state.TaskCheckStateGreaterThan},
	},
	...

61

Page 62: Heart of the SwarmKit: Store, Topology & Object Model

Watches

	state.EventUpdateNode{
		Node:   &api.Node{ID: oldTask.NodeID, Status: api.NodeStatus{State: api.NodeStatus_DOWN}},
		Checks: []state.NodeCheckFunc{state.NodeCheckID, state.NodeCheckState},
	},
	state.EventDeleteNode{
		Node:   &api.Node{ID: oldTask.NodeID},
		Checks: []state.NodeCheckFunc{state.NodeCheckID},
	},
})

62

Page 63: Heart of the SwarmKit: Store, Topology & Object Model

Replication

● Only Raft leader does writes
● During write transaction, log every change as well as updating the radix tree
● The transaction log is serialized and replicated through Raft
● Since our internal types are protobuf types, serialization is very easy
● Followers replay the log entries into radix tree

63

Page 64: Heart of the SwarmKit: Store, Topology & Object Model

Sequencer

● Every object in the store has a Version field
● Version stores the Raft index when the object was last updated
● Updates must provide a base Version; are rejected if it is out of date
● Similar to CAS
● Also exposed through API calls that change objects in the store

64

Page 65: Heart of the SwarmKit: Store, Topology & Object Model

65

Versioned Updates: Consistency

service := getCurrentService()
spec := service.Spec
spec.Image = "my.serv/myimage:mytag"
update(spec, service.Version)
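Callers typically wrap this in a retry: if the base Version is stale, re-read and try again, exactly like a compare-and-swap loop. A hypothetical sketch (the get/update helpers and errOutOfDate are stand-ins, not real SwarmKit calls):

package sequencer

import "errors"

// errOutOfDate is a hypothetical sentinel for "base Version no longer
// matches the stored object"; the real API returns its own error.
var errOutOfDate = errors.New("update out of date")

type Version struct{ Index uint64 }
type Spec struct{ Image string }
type Service struct {
	Spec    Spec
	Version Version
}

// setImage reads the current service, modifies its spec, and submits the
// update with the base Version it read; on a version conflict it retries
// from a fresh read.
func setImage(image string, get func() Service, update func(Spec, Version) error) error {
	for {
		svc := get()
		spec := svc.Spec
		spec.Image = image
		err := update(spec, svc.Version) // rejected if svc.Version is out of date
		if errors.Is(err, errOutOfDate) {
			continue
		}
		return err
	}
}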

Page 66: Heart of the SwarmKit: Store, Topology & Object Model

Sequencer

Original object:

Service ABC
Spec: Replicas = 4, Image = registry:2.3.0, ...
Version = 189

66

Page 67: Heart of the SwarmKit: Store, Topology & Object Model

Sequencer

Original object:

Service ABC
Spec: Replicas = 4, Image = registry:2.3.0, ...
Version = 189

Update request:

Service ABC
Spec: Replicas = 4, Image = registry:2.4.0, ...
Version = 189

67

Page 68: Heart of the SwarmKit: Store, Topology & Object Model

Sequencer

Original object:

Service ABC
Spec: Replicas = 4, Image = registry:2.3.0, ...
Version = 189

Update request:

Service ABC
Spec: Replicas = 4, Image = registry:2.4.0, ...
Version = 189

68

Page 69: Heart of the SwarmKit: Store, Topology & Object Model

Sequencer

Updated object:

Service ABC
Spec: Replicas = 4, Image = registry:2.4.0, ...
Version = 190

69

Page 70: Heart of the SwarmKit: Store, Topology & Object Model

Sequencer

Updated object:

Service ABC
Spec: Replicas = 4, Image = registry:2.4.0, ...
Version = 190

Update request:

Service ABC
Spec: Replicas = 5, Image = registry:2.3.0, ...
Version = 189

70

Page 71: Heart of the SwarmKit: Store, Topology & Object Model

Sequencer

Updated object:

Service ABC
Spec: Replicas = 4, Image = registry:2.4.0, ...
Version = 190

Update request (rejected: base Version 189 is out of date):

Service ABC
Spec: Replicas = 5, Image = registry:2.3.0, ...
Version = 189

71

Page 72: Heart of the SwarmKit: Store, Topology & Object Model

Write batching

● Every write transaction involves a Raft round trip to get consensus
● Costly to do many transactions, but want to limit the size of writes to Raft
● Batch primitive lets the store automatically split a group of changes across multiple writes to Raft

72

Page 73: Heart of the SwarmKit: Store, Topology & Object Model

Write batching

_, err = d.store.Batch(func(batch *store.Batch) error {
	for _, n := range nodes {
		err := batch.Update(func(tx store.Tx) error {
			node := store.GetNode(tx, n.ID)
			node.Status = api.NodeStatus{
				State:   api.NodeStatus_UNKNOWN,
				Message: `Node moved to "unknown" state`,
			}
			return store.UpdateNode(tx, node)
		})
		if err != nil {
			return err
		}
	}
	return nil
})

73

Page 74: Heart of the SwarmKit: Store, Topology & Object Model

Future work

● Multi-valued indices
● Watch API
● Version control?

74

Page 75: Heart of the SwarmKit: Store, Topology & Object Model

THANK YOU