Page 1
Jonathan Boulle | @baronboulle | [email protected]
etcd - overview and future
Pages 3-4
Uncoordinated Upgrades
(figure: machines upgrading simultaneously; service Unavailable)
Page 5
Motivation: CoreOS cluster reboot lock
- Decrement a semaphore key atomically
- Reboot and wait...
- After reboot, increment the semaphore key
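The reboot-lock recipe can be sketched as a small in-memory model (the `Semaphore` class and the `acquire`/`release` helpers are hypothetical illustrations; a real deployment performs the compare-and-swap against an etcd key):

```python
import threading

class Semaphore:
    """Toy model of the etcd-backed reboot lock: a key holding an
    integer count, updated only via compare-and-swap (CAS)."""

    def __init__(self, count):
        self._value = count
        self._lock = threading.Lock()  # stands in for etcd's atomicity

    def cas(self, expected, new):
        # Atomically set the value to `new` only if it still equals `expected`.
        with self._lock:
            if self._value == expected:
                self._value = new
                return True
            return False

    def read(self):
        with self._lock:
            return self._value

def acquire(sem):
    """Decrement the semaphore key atomically; retry on CAS conflict."""
    while True:
        v = sem.read()
        if v == 0:
            return False         # no reboot slots free; wait and retry later
        if sem.cas(v, v - 1):
            return True          # slot acquired: safe to reboot

def release(sem):
    """After reboot, increment the semaphore key (again via CAS)."""
    while True:
        v = sem.read()
        if sem.cas(v, v + 1):
            return

sem = Semaphore(1)
assert acquire(sem)      # first machine takes the only slot
assert not acquire(sem)  # second machine must wait
release(sem)             # first machine finished rebooting
assert acquire(sem)      # now the second machine may reboot
```

The CAS retry loop is what makes this safe with many machines: two nodes that read the same count race on the swap, and exactly one wins.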
Pages 6-15
CoreOS updates coordination
(animation: machines decrement the semaphore to claim a reboot slot, reboot, then increment it back; the counter falls from 3 and recovers as machines return)
Pages 16-19
Store Application Configuration
(animation: apps read config from the store and Start / Restart on Update; when the store fails, config is Unavailable)
Page 20
Requirements
Strong Consistency
- mutually exclusive access at any time, for locking purposes
Highly Available
- resilient to single points of failure & network partitions
Watchable
- push configuration updates to applications
Page 21
Requirements
CAP theorem
- Consistency, Availability, Partition tolerance: choose 2
- We want CP
- We want something like Paxos
Page 22
Common problem
GFS
Paxos
Big Table
Spanner
CFS
Chubby
Google - “All” infrastructure relies on Paxos
Page 23
Common problem
Amazon - Replicated log powers EC2
Microsoft - Boxwood powers storage infrastructure
Hadoop - ZooKeeper is the heart of the ecosystem
Page 24
COMMON PROBLEM
#GIFEE and Cloud Native Solution
Page 25
10,000 stars on GitHub
250 contributors
Google, Red Hat, EMC, Cisco, Huawei, Baidu, Alibaba...
Page 26
THE HEART OF CLOUD NATIVE
Kubernetes, Cloud Foundry's Diego, Docker's SwarmKit, and many others
Page 27
ETCD KEY VALUE STORE
Fully Replicated, Highly Available, Consistent
Page 28
PUT(foo, bar), GET(foo), DELETE(foo)
Watch(foo)
CAS(foo, bar, bar1)
Key-value Operations
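The semantics of these operations can be sketched with a toy in-memory store (the `KV` class is an illustrative assumption, not the real etcd client API):

```python
class KV:
    """Minimal in-memory sketch of the key-value operations above."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)

    def delete(self, key):
        self._data.pop(key, None)

    def cas(self, key, expected, new):
        # Compare-and-swap: write only if the current value matches.
        if self._data.get(key) == expected:
            self._data[key] = new
            return True
        return False

kv = KV()
kv.put("foo", "bar")
assert kv.get("foo") == "bar"
assert kv.cas("foo", "bar", "bar1")      # succeeds: value was "bar"
assert not kv.cas("foo", "bar", "bar2")  # fails: value is now "bar1"
kv.delete("foo")
assert kv.get("foo") is None
```

CAS is the primitive that made the reboot lock possible: it lets many clients contend on one key without a separate lock service.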
Page 29
DEMO
play.etcd.io
Page 30
Runtime Reconfiguration
Point-in-time Backup
Extensive Metrics
etcd Operability
Page 31
ETCD v3
Successor of etcd v2
Page 32
ETCD v3
Better Performance
Page 33
ETCD v3
Massively Scalable
Page 34
ETCD v3
More Efficient & Powerful APIs
Page 35
gRPC Based API
~4x Faster vs JSON
HTTP/2 Improves Efficiency
Page 36
Multi-Version
Put(foo, bar)
Put(foo, bar1)
Put(foo, bar2)
Get(foo) -> bar2
Page 37
Multi-Version
Put(foo, bar)
Put(foo, bar1)
Put(foo, bar2)
Get(foo, 1) -> bar
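A minimal sketch of the multi-version behavior shown above (the `MVCCStore` class is hypothetical; etcd's real MVCC keeps a single cluster-wide revision counter rather than a per-key version list):

```python
class MVCCStore:
    """Sketch of multi-version storage: every Put appends a new version
    instead of overwriting, so old values stay readable."""

    def __init__(self):
        self._revs = {}  # key -> list of values; index i holds version i+1

    def put(self, key, value):
        self._revs.setdefault(key, []).append(value)

    def get(self, key, rev=None):
        versions = self._revs.get(key, [])
        if not versions:
            return None
        if rev is None:
            return versions[-1]   # latest version
        return versions[rev - 1]  # historical read at version `rev`

s = MVCCStore()
s.put("foo", "bar")
s.put("foo", "bar1")
s.put("foo", "bar2")
assert s.get("foo") == "bar2"     # Get(foo)    -> bar2
assert s.get("foo", 1) == "bar"   # Get(foo, 1) -> bar
```

Keeping history is what enables consistent snapshots of the keyspace and watches that replay from a past revision.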
Page 38
Tx.If(Compare(Value("foo"), ">", "bar"),
      Compare(Version("foo"), "=", 2),
      ...)
  .Then(Put("ok", "true"), ...)
  .Else(Put("ok", "false"), ...)
  .Commit()
Mini-Transactions
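The If/Then/Else shape of a mini-transaction can be modeled in a few lines (the `txn` helper and the dict-based store are illustrative assumptions, not the gRPC API):

```python
def txn(store, compares, then_ops, else_ops):
    """Sketch of a mini-transaction: evaluate all compares against the
    store atomically, then apply either the Then or the Else branch."""
    if all(cmp(store) for cmp in compares):
        for op in then_ops:
            op(store)
        return True
    for op in else_ops:
        op(store)
    return False

store = {"foo": "bar1", "foo_version": 2}

ok = txn(
    store,
    compares=[
        lambda s: s["foo"] > "bar",       # Compare(Value("foo"), ">", "bar")
        lambda s: s["foo_version"] == 2,  # Compare(Version("foo"), "=", 2)
    ],
    then_ops=[lambda s: s.update(ok="true")],
    else_ops=[lambda s: s.update(ok="false")],
)
assert ok and store["ok"] == "true"
```

Because the compares and the chosen branch apply as one atomic unit, a client can express read-modify-write logic without a round trip per operation.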
Page 39
l = CreateLease(15 * second)
Put(foo, bar, l)
l.KeepAlive()
l.Revoke()
Leases
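The lease lifecycle above can be sketched as follows (the `Lease` and `LeasedKV` classes and the fake clock are hypothetical; real etcd leases are granted and refreshed over gRPC):

```python
import time

class Lease:
    """Sketch of a lease: keys attached to it disappear when its TTL
    expires without a KeepAlive, or when it is revoked."""

    def __init__(self, ttl, clock=time.monotonic):
        self._ttl = ttl
        self._clock = clock
        self._expires = clock() + ttl
        self.revoked = False

    def keep_alive(self):
        self._expires = self._clock() + self._ttl  # refresh the TTL

    def revoke(self):
        self.revoked = True

    def alive(self):
        return not self.revoked and self._clock() < self._expires

class LeasedKV:
    def __init__(self):
        self._data = {}  # key -> (value, lease or None)

    def put(self, key, value, lease=None):
        self._data[key] = (value, lease)

    def get(self, key):
        value, lease = self._data.get(key, (None, None))
        if lease is not None and not lease.alive():
            return None  # lease expired or revoked: key is gone
        return value

now = [0.0]  # fake clock so the example is deterministic
l = Lease(15, clock=lambda: now[0])
kv = LeasedKV()
kv.put("foo", "bar", l)
assert kv.get("foo") == "bar"
now[0] = 14
l.keep_alive()                 # KeepAlive just before expiry
now[0] = 20
assert kv.get("foo") == "bar"  # still alive: TTL was refreshed
l.revoke()
assert kv.get("foo") is None   # Revoke removes attached keys
```

One lease can guard many keys, so a client heartbeats once per lease instead of once per key.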
Page 40
w = Watch(foo)
for {
    r = w.Recv()
    print(r.Event) // PUT
    print(r.KV)    // foo,bar
}
Streaming Watch
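A streaming watch can be modeled as a per-watcher event queue (the `WatchableKV` class is an illustrative assumption; the real API streams events over gRPC/HTTP2):

```python
import queue

class WatchableKV:
    """Sketch of a streaming watch: each Put on a watched key pushes an
    event onto the watcher's queue, which the client drains with
    q.get() (playing the role of w.Recv() in the slide)."""

    def __init__(self):
        self._data = {}
        self._watchers = {}  # key -> list of event queues

    def watch(self, key):
        q = queue.Queue()
        self._watchers.setdefault(key, []).append(q)
        return q

    def put(self, key, value):
        self._data[key] = value
        for q in self._watchers.get(key, []):
            q.put(("PUT", key, value))

kv = WatchableKV()
w = kv.watch("foo")
kv.put("foo", "bar")
event, key, value = w.get()   # like r = w.Recv()
assert event == "PUT" and (key, value) == ("foo", "bar")
```

Pushing events to the watcher avoids the polling loops that clients would otherwise need for configuration updates.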
Page 41
Synchronization LoC
Page 42
ETCD v2
machine coordination -> O(10k)
Page 43
ETCD v3
app/container coordination -> O(1M)
Page 44
Reliability
99% at small scale is easy
- Failure is infrequent and human-manageable
99% at large scale is not enough
- Not manageable by humans
99.99% at large scale
- Reliable systems at the bottom layer
Page 45
HOW DO WE ACHIEVE RELIABILITY
WAL, Snapshots, Testing
Page 46
Write Ahead Log
Append-only
- Simple is good
Rolling CRC protected
- Storage & OSes can be unreliable
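The rolling-CRC idea can be sketched as follows (a toy record format, not etcd's actual WAL encoding): each record stores a CRC seeded with the previous record's CRC, so a corrupted, truncated, or reordered record breaks the chain.

```python
import zlib

def append_record(log, payload):
    """Append a record whose CRC chains from the previous record's CRC
    (a 'rolling' CRC), making corruption anywhere detectable."""
    prev_crc = log[-1][0] if log else 0
    crc = zlib.crc32(payload, prev_crc)
    log.append((crc, payload))

def verify(log):
    """Replay the CRC chain from the start; any mismatch means the log
    was damaged at or before that record."""
    prev_crc = 0
    for crc, payload in log:
        if zlib.crc32(payload, prev_crc) != crc:
            return False
        prev_crc = crc
    return True

log = []
append_record(log, b"put foo bar")
append_record(log, b"put foo bar1")
assert verify(log)
log[0] = (log[0][0], b"put foo baz")  # simulate on-disk corruption
assert not verify(log)
```

Chaining the CRCs (rather than checking each record independently) also catches a valid record spliced in from elsewhere in the file.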
Page 47
Snapshots
Torturing Databases for Fun and Profit (OSDI 2014)
- The simpler database is safer
- LMDB was the winner
BoltDB, an append-only B+Tree
- A simpler LMDB written in Go
Page 48
Testing Cluster Failure
Inject failures into running clusters
White-box runtime checking
- Hash state of the system
- Progress of the system
Page 49
Testing Cluster Health with Failures
Issue lock operations across the cluster
Ensure the correctness of the client library
Page 50
TESTING CLUSTER
dash.etcd.io
Page 51
Punishing Functional Tests
Page 52
Punishing Functional Tests
Page 53
etcd/raft Reliability
Designed for testability and flexibility
Used by large-scale database systems and others
- CockroachDB, TiKV, Dgraph
Page 54
etcd vs others
Do one thing
Page 55
etcd vs others
Only do the One Thing
Page 56
etcd vs others
Do it Really Well
Page 57
etcd Reliability
Do it Really Well
Page 58
ETCD v3.0 BETA
Efficient and Scalable
Page 59
BETA AVAILABLE TODAY
github.com/coreos/etcd
Page 60
FUTURE WORK
Proxy, Caching, Watch Coalescing, Secondary Index
Page 61
ETCD and KUBERNETES
The Data Store
Page 62
(figure: a scheduler & API node with many worker nodes, each running kubelet)
Page 63
etcd and Kubernetes
- Kubernetes currently uses the V2 API
- Work to migrate to V3 is actively in progress
- Opt-in currently, default in the future
Page 64
etcd v3 and Kubernetes
- Follow along: https://github.com/kubernetes/kubernetes/issues/22448
- Try it out!
Page 65
etcd v3 will support Kubernetes as it scales to 5,000 nodes and beyond
Page 66
Performance 1K keys
Page 67
Performance
Snapshot caused performance degradation
etcd2 - 600K keys
Page 68
Performance etcd2 - 600K keys
Snapshot triggered elections
Page 69
ZooKeeper Performance
Non-blocking full snapshot
Efficient memory management
Page 70
Performance ZooKeeper default
Page 71
Performance
Snapshot triggered election
ZooKeeper default
Page 72
Performance
Snapshot
ZooKeeper default
Page 73
Performance
GC
ZooKeeper snapshot disabled
Page 74
Reliable Performance
- Similar to ZooKeeper with snapshot disabled
  - Incremental snapshot
- No Garbage Collection pauses
  - Off-heap storage
Page 75
Performance: etcd3 / ZooKeeper snapshot disabled
Page 76
Performance: etcd3 / ZooKeeper snapshot disabled
Page 77
Memory
(chart: storing 512MB of data as 2M 256B keys; memory footprints of 10GB, 2.4GB, and 0.8GB)
Page 78
GET INVOLVED
github.com/coreos/etcd