Anti-Entropy using CRDTs on HA Datastores

Sailesh Mukil, Senior Software Engineer, Netflix

Posted on 27-Jun-2020

Timeline

NETFLIX

[Figure: timeline with markers at 2011, 2013 and 2016; labels: Cassandra adoption, Multi-region Dynomite]

Dynomite

Makes non-distributed datastores distributed

[Figure: a datastore split into three shares of 33% each]

Dynomite Overview

[Figure sequence: a client and three replicas (Replica 1, Replica 2, Replica 3)]

Dynomite overview

● Global replication
● High availability
● Shared nothing
● Auto-sharding
● Linear scale
● Pluggable datastores (Redis primarily)
● Multiple quorum levels
● Supports datastore API

Dynomite footprint @ Netflix

● ~1000 customer facing nodes
● ~1M OPS/s
● Largest cluster holds ~6 TB

The problem

Entropy in the system

[Figure sequence: a client and three replicas R-1, R-2, R-3]

● SET K 123 is applied on all three replicas (each holds K: 123); the client receives OK.
● SET K 456 is applied on only one replica (K: 456); the client receives ERR.
● SET K 789 is applied on only one, different replica (K: 789); the client receives ERR.
● The replicas now hold K: 123, K: 456 and K: 789. Two GET K requests served by different replicas return 789 and 456.
● GET K (w/quorum) receives differing values (123, 456) from the replicas and returns ERR: QUORUM FAILED.

Replicas will go out of sync

Timeline

[Figure: timeline with markers at 2011, 2013, 2016 and 2019; labels: Cassandra adoption, Multi-region Dynomite, Dynomite w/ CRDTs]

Achieving anti-entropy (traditionally)

Last Writer Wins
● Uses physical timestamps
● Clock skew

Vector Clocks
● Shows causal relationships
● But not for concurrent writes
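The vector clock limitation can be illustrated with a small comparison function. This is a minimal sketch, not Dynomite code: the `compare` helper, the dictionary representation, and the replica ids are all assumptions made for illustration. It shows that two clocks derived from the same base on different replicas are reported as concurrent, which tells us a conflict exists but not which write should win.

```python
from typing import Dict

# Hypothetical representation: replica id -> number of events seen from it.
VectorClock = Dict[str, int]


def compare(a: VectorClock, b: VectorClock) -> str:
    """Return 'before', 'after', 'equal' or 'concurrent' for a relative to b."""
    keys = set(a) | set(b)
    a_le_b = all(a.get(k, 0) <= b.get(k, 0) for k in keys)
    b_le_a = all(b.get(k, 0) <= a.get(k, 0) for k in keys)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"
    if b_le_a:
        return "after"
    return "concurrent"


base = {"R1": 1}
write_on_r2 = {"R1": 1, "R2": 1}  # derived from base on replica R2
write_on_r3 = {"R1": 1, "R3": 1}  # derived from base on replica R3

print(compare(base, write_on_r2))         # causal relationship is visible
print(compare(write_on_r2, write_on_r3))  # neither clock dominates the other
```

In the concurrent case some other rule still has to pick the surviving value.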

The solution

Conflict-free replicated data types

A CRDT is a data structure which can be replicated across the network, where the replicas can be updated independently and concurrently without coordination between the replicas, and where it is always mathematically possible to resolve inconsistencies which might result.

Associative: grouping of operations does not matter

(X + Y) + Z = X + (Y + Z)

Commutative: order of operations does not matter

X + Y = Y + X

Idempotent: duplication of operations does not matter

X + X = X
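To make the three properties concrete, the sketch below reads "+" as a merge of replica states and checks each law on sample data. The choice of merge (element-wise maximum over a per-replica counter map) and the names `State` and `merge` are assumptions for illustration; the slides do not prescribe a particular merge function here.

```python
from typing import Dict

State = Dict[str, int]  # hypothetical: replica id -> count


def merge(x: State, y: State) -> State:
    """Element-wise maximum of two per-replica counter maps."""
    return {k: max(x.get(k, 0), y.get(k, 0)) for k in set(x) | set(y)}


x = {"R1": 2, "R2": 0}
y = {"R1": 1, "R2": 3}
z = {"R3": 5}

assert merge(merge(x, y), z) == merge(x, merge(y, z))  # associative
assert merge(x, y) == merge(y, x)                      # commutative
assert merge(x, x) == x                                # idempotent
```

Because all three hold, replicas can receive the same states in any order, in any grouping, and more than once, and still end up with the same result.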

Types of operations on CRDTs

Update
● Updates local state

Merge
● Converges replica states

When we write, we update

When we repair, we merge

Read repair = merge on read path

Introduction to CRDTs

CRDTs provide strong eventual consistency

Naive distributed counter

[Figure sequence: three replicas R-1, R-2, R-3]

● After INCR CTR, all three replicas hold CTR: 1.
● A DECR CTR reaches only one replica (CTR: 0) and a concurrent INCR CTR reaches only another (CTR: 2).
● Repair based on timestamp? The latest value is 2, which is incorrect (both operations applied to 1 should give 1).
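The failure of timestamp-based repair for counters can be reproduced in a few lines. This is a sketch under assumptions: the `Stamped` type, the `lww_repair` helper, the integer timestamps, and the choice of which replica sees which operation are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Stamped:
    value: int
    ts: int  # hypothetical write timestamp


def lww_repair(replicas: List[Stamped]) -> List[Stamped]:
    """Copy the value with the highest timestamp to every replica."""
    winner = max(replicas, key=lambda r: r.ts)
    return [Stamped(winner.value, winner.ts) for _ in replicas]


# All replicas agree on CTR = 1 at ts=1.
r1, r2, r3 = Stamped(1, 1), Stamped(1, 1), Stamped(1, 1)
# Assumed scenario: DECR reaches only r2 at ts=2, INCR reaches only r3 at ts=3.
r2 = Stamped(r2.value - 1, 2)
r3 = Stamped(r3.value + 1, 3)

repaired = lww_repair([r1, r2, r3])
intended = 1 - 1 + 1  # both operations applied to the starting value

print([r.value for r in repaired], intended)
```

Last-writer-wins keeps the value written last and silently discards the decrement, because it compares whole values rather than combining operations.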

CRDT: PNCounters

Each replica maintains 2 “local” counters
● Positive counter: Tracks increments
● Negative counter: Tracks decrements

Final counter value: (Sum of all PCounters - Sum of all NCounters)
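A PNCounter along these lines can be sketched as follows. The class and method names are assumptions for illustration and do not reflect Dynomite's implementation; the merge rule used here (per-replica maximum) is the standard one for state-based counters.

```python
from typing import Dict


class PNCounter:
    """Illustrative PN-counter: per-replica increment and decrement tallies."""

    def __init__(self, replica_id: str) -> None:
        self.replica_id = replica_id
        self.p: Dict[str, int] = {}  # replica id -> increments seen
        self.n: Dict[str, int] = {}  # replica id -> decrements seen

    def incr(self) -> None:
        self.p[self.replica_id] = self.p.get(self.replica_id, 0) + 1

    def decr(self) -> None:
        self.n[self.replica_id] = self.n.get(self.replica_id, 0) + 1

    def value(self) -> int:
        return sum(self.p.values()) - sum(self.n.values())

    def merge(self, other: "PNCounter") -> None:
        """Take the per-replica maximum of both tallies."""
        for mine, theirs in ((self.p, other.p), (self.n, other.n)):
            for rid, count in theirs.items():
                mine[rid] = max(mine.get(rid, 0), count)


r1, r2, r3 = PNCounter("R1"), PNCounter("R2"), PNCounter("R3")
r1.incr()      # INCR CTR handled by R1 ...
r2.merge(r1)   # ... and replicated to R2
r3.merge(r1)   # ... and to R3
r2.decr()      # DECR CTR seen only by R2
r3.incr()      # INCR CTR seen only by R3
local_values = [r.value() for r in (r1, r2, r3)]

for a in (r1, r2, r3):       # repair: merge every replica with every other
    for b in (r1, r2, r3):
        if a is not b:
            a.merge(b)
merged_values = [r.value() for r in (r1, r2, r3)]

print(local_values, merged_values)
```

Before repair the replicas disagree; after merging, every replica computes the same value, and neither the increment nor the decrement is lost.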

CRDT: PNCounter

[Figure sequence: three replicas R-1, R-2, R-3; each stores CTR as a row of three positive counters and a row of three negative counters, one slot per replica, all initially 0]

● INCR CTR is replicated to all three replicas; every replica reads CTR = 1.
● A DECR CTR reaches only one replica and a concurrent INCR CTR reaches only another. The local values now differ: CTR = 0, CTR = 1, CTR = 2.
● GET CTR collects the counter state from the replicas, merges it and returns 1.
● repair(merge) is sent to the replicas, after which every replica reads CTR = 1.

CRDT: LWW-Element Set

Used to maintain key metadata
● Add set: Latest update timestamps for keys
● Remove set: Timestamps at which keys were removed

Registers can take arbitrary values
● Hence we still require LWW to resolve conflicts

Used for registers, hashmaps and sorted sets
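The add set and remove set can be sketched as two timestamp maps. This is a minimal illustration, not Dynomite's data layout: the class name, the integer timestamps, and the tie-breaking rule (a removal with the same timestamp as an add wins) are assumptions.

```python
from typing import Dict


class LWWElementSet:
    """Illustrative LWW-element set tracking key metadata only."""

    def __init__(self) -> None:
        self.add_set: Dict[str, int] = {}     # key -> latest update timestamp
        self.remove_set: Dict[str, int] = {}  # key -> timestamp of removal

    def add(self, key: str, ts: int) -> None:
        self.add_set[key] = max(self.add_set.get(key, ts), ts)

    def remove(self, key: str, ts: int) -> None:
        self.remove_set[key] = max(self.remove_set.get(key, ts), ts)

    def contains(self, key: str) -> bool:
        added = self.add_set.get(key)
        if added is None:
            return False
        removed = self.remove_set.get(key)
        # Assumption: on equal timestamps the removal wins.
        return removed is None or added > removed

    def merge(self, other: "LWWElementSet") -> None:
        for key, ts in other.add_set.items():
            self.add(key, ts)
        for key, ts in other.remove_set.items():
            self.remove(key, ts)


r1, r2 = LWWElementSet(), LWWElementSet()
r1.add("K2", 3)      # SET K2 at t3 seen by both replicas
r2.add("K2", 3)
r2.remove("K2", 4)   # DEL K2 at t4 seen only by r2
before = r1.contains("K2")
r1.merge(r2)         # repair
after = r1.contains("K2")

print(before, after)
```

The stale replica still reports the key until the repair merges in the later removal timestamp.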

LWW-Element Set

[Figure sequence: three replicas R-1, R-2, R-3; each holds an add set, a remove set (rem) and the key values]

● SET K1 123 (t1) reaches all three replicas: each stores K1: 123 and records K1 at t1 in its add set.
● SET K1 456 (t2) reaches only one replica, which stores K1: 456 and updates its add set entry for K1 to t2.
● SET K2 999 (t3) reaches only two replicas, which store K2: 999 and record K2 at t3 in their add sets.
● GET K1: the replicas answer K1 = 456 (t2) and K1 = 123 (t1). t2 > t1 => 456 is the latest value. “456” is returned and the replicas still holding the t1 value are repaired to K1: 456 (t2).
● GET K2: the replicas answer (nil) and K2 = 999 (t3). “999” is returned and the replica missing the key is repaired with K2: 999 (t3).
● DEL K2 (t4) does not reach every replica. Where it is applied, the value is dropped and K2 at t4 is recorded in the remove set. A GET K2 served by a replica that missed the delete still returns “999”.
● GET K2: the replicas answer K2 del @t4 and K2 = 999 (t3). The delete is newer, so (nil) is returned and DEL K2 (t4) is applied to the stale replica as a repair.

Implementation challenges (LWW-element set)

Redis doesn’t maintain timestamps
● Dynomite can track the timestamp of the client request

We’d like Dynomite to remain stateless
● Store the metadata inside Redis

Operations must modify data and metadata atomically
● Rewrite operations into Redis Lua scripts (guarantees atomicity)

Does the remove set grow forever?
● Delete metadata ASAP from remove set if ALL replicas agree
● Background thread cleans rest
● Maintain remove set as sorted set
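One way to picture the remove-set cleanup is sketched below. Everything here is an assumption for illustration: the slides do not say how agreement between replicas is detected, so the `seen_by` map, the `REPLICAS` constant and the `purge_agreed` helper are hypothetical, and a timestamp-sorted Python list stands in for a Redis sorted set.

```python
from typing import Dict, List, Set, Tuple

REPLICAS = {"R1", "R2", "R3"}  # hypothetical replica ids


def purge_agreed(
    remove_set: List[Tuple[int, str]],
    seen_by: Dict[str, Set[str]],
) -> List[Tuple[int, str]]:
    """Drop tombstones that every replica is known to hold; keep the rest.

    remove_set holds (timestamp, key) pairs ordered by timestamp, standing in
    for a sorted set scored by removal time.
    """
    return [
        (ts, key)
        for ts, key in sorted(remove_set)
        if not seen_by.get(key, set()) >= REPLICAS
    ]


remove_set = [(9, "K7"), (4, "K2")]
seen_by = {"K2": {"R1", "R2", "R3"}, "K7": {"R1"}}
remaining = purge_agreed(remove_set, seen_by)

print(remaining)
```

Keeping the entries ordered by timestamp means a background cleaner can scan the oldest tombstones first.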

What does an example Lua script look like?
● Check if update is old
● Discard if it is
● Update data + metadata otherwise
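The three steps of such a script can be modelled in plain Python. This is only a sketch of the logic, not a Lua script and not Dynomite's code: the `lww_write` function and the two dictionaries standing in for Redis data and the add set are hypothetical, the remove set is ignored, and an update with an equal timestamp is treated as old by assumption. In Redis the atomicity would come from running the whole check-and-update as one script.

```python
from typing import Dict


def lww_write(
    data: Dict[str, str],
    add_set: Dict[str, int],
    key: str,
    value: str,
    ts: int,
) -> bool:
    """Apply SET key value stamped ts; return False if it was discarded."""
    current = add_set.get(key)
    if current is not None and ts <= current:
        return False       # update is old: discard it
    data[key] = value      # update the data ...
    add_set[key] = ts      # ... and the metadata in the same step
    return True


data: Dict[str, str] = {}
add_set: Dict[str, int] = {}
applied_first = lww_write(data, add_set, "K1", "456", ts=2)
applied_stale = lww_write(data, add_set, "K1", "123", ts=1)

print(applied_first, applied_stale, data, add_set)
```

The late-arriving write with the older timestamp leaves both the value and its timestamp untouched.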

Background repairs

(Note: Ongoing work)

Repairs occur on read path in Dynomite
● Repairs for point reads only

Repairing on range reads is expensive
● E.g.: Give me all members of a set
● Return everything in this hashmap
● Return me a range from this sorted set

How do we target keys that need repairing?
● Full key walk? (like Cassandra)
● Maintain list of recently written-to keys
● Run merge operation on them (async)
● But merge operations on large structures are expensive
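Tracking recently written keys for a background repair pass could look like the sketch below. All names (`RecentWrites`, `background_repair`, `merge_key`) are hypothetical, and the capacity bound is an assumption added to keep the list small; a bounded list can drop keys that then go unrepaired until they are read or written again.

```python
from collections import OrderedDict
from typing import Callable, List


class RecentWrites:
    """Bounded, de-duplicated record of recently written keys."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self._keys: "OrderedDict[str, None]" = OrderedDict()

    def record(self, key: str) -> None:
        self._keys.pop(key, None)           # a rewrite moves the key to the end
        self._keys[key] = None
        while len(self._keys) > self.capacity:
            self._keys.popitem(last=False)  # drop the oldest entry

    def drain(self) -> List[str]:
        keys = list(self._keys)
        self._keys.clear()
        return keys


def background_repair(recent: RecentWrites, merge_key: Callable[[str], None]) -> int:
    """Run merge_key on every recorded key; return how many were repaired."""
    keys = recent.drain()
    for key in keys:
        merge_key(key)
    return len(keys)


recent = RecentWrites(capacity=2)
for k in ("K1", "K2", "K1", "K3"):
    recent.record(k)

repaired: List[str] = []
count = background_repair(recent, repaired.append)

print(count, repaired)
```

Here `merge_key` is a placeholder for whatever merge the repair process runs per key; the sketch calls it synchronously, whereas the slides describe running it asynchronously.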

Delta-state CRDTs

● Maintain list of recent mutations done to keys
● Ship only delta-state instead of entire data structure for merge
● Confirm which replicas have received it
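The difference between shipping full state and shipping a delta can be sketched with a per-replica counter map. The `incr` and `merge` helpers and the starting values are assumptions for illustration: the update returns only the entry it changed, and the same merge function accepts either a full state or a delta.

```python
from typing import Dict

State = Dict[str, int]  # hypothetical: replica id -> that replica's increments


def incr(state: State, replica_id: str) -> State:
    """Apply an increment locally and return the delta (the changed entry only)."""
    state[replica_id] = state.get(replica_id, 0) + 1
    return {replica_id: state[replica_id]}


def merge(state: State, incoming: State) -> None:
    """Element-wise maximum; works for a full state and for a delta alike."""
    for rid, count in incoming.items():
        state[rid] = max(state.get(rid, 0), count)


r1 = {"R1": 1, "R2": 1}
r2 = {"R1": 1, "R2": 1}
delta = incr(r1, "R1")  # R1 handles INCR CTR
merge(r2, delta)        # ship only the delta to R2

print(delta, r2)
```

For a small counter the saving is negligible, but for a large set or hashmap the delta stays proportional to the mutation rather than to the whole structure.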

Background repairs: What is a delta-state?

[Figure: replicas R1 and R2 each hold the counter CTR; an INCR CTR on R1 raises R1's entry to 2]

● Full state: R1 ships its entire counter structure to R2.
● Delta state: R1 ships only the changed entry (R1 = 2).

[Figure sequence: R1 keeps a list of mutations 𝜹-1 to 𝜹-4 with a column for each of R2 and R3; R2 and R3 reply with ACK, and each acknowledged mutation is marked "ack" in that replica's column]
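The acknowledgement bookkeeping for shipped deltas might be sketched as follows. The `DeltaLog` class and its methods are hypothetical, and dropping a delta once every peer has acknowledged it is an assumption; the slides only state that the sender confirms which replicas have received each mutation.

```python
from typing import Dict, List, Set


class DeltaLog:
    """Mutations awaiting acknowledgement from peer replicas."""

    def __init__(self, peers: Set[str]) -> None:
        self.peers = set(peers)
        self.pending: Dict[int, Set[str]] = {}  # delta id -> peers that acked
        self._next_id = 1

    def append(self) -> int:
        """Register a new delta and return its id."""
        delta_id = self._next_id
        self._next_id += 1
        self.pending[delta_id] = set()
        return delta_id

    def ack(self, delta_id: int, peer: str) -> None:
        if delta_id not in self.pending:
            return  # unknown, or already acknowledged by everyone
        self.pending[delta_id].add(peer)
        if self.pending[delta_id] >= self.peers:
            del self.pending[delta_id]  # assumption: forget fully acked deltas

    def unacked_for(self, peer: str) -> List[int]:
        """Ids of deltas that still need to be (re)sent to this peer."""
        return [d for d, acked in sorted(self.pending.items()) if peer not in acked]


log = DeltaLog({"R2", "R3"})
ids = [log.append() for _ in range(4)]  # four mutations, ids 1 to 4
log.ack(1, "R2")
log.ack(1, "R3")
log.ack(2, "R2")

print(log.unacked_for("R2"), log.unacked_for("R3"))
```

Because this list lives in memory in the sketch, losing the node loses the record of unshipped deltas.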

Challenges with delta-state CRDTs
● Durability
● Practical overhead of maintaining list

Sailesh Mukil
smukil@netflix

Thank You.