Distributed Systems Consistency & Replication Alberto Montresor University of Trento, Italy 2011/11/02 Redundancy is our main avenue of survival Robert Silverberg, “Shadrach in the furnace” Alberto Montresor (UniTN) DS - Replication 2011/11/02 1 / 82
Page 1: 09 Replication

Distributed Systems

Consistency & Replication

Alberto Montresor

University of Trento, Italy

2011/11/02

Redundancy is our main avenue of survival
Robert Silverberg, “Shadrach in the furnace”

Alberto Montresor (UniTN) DS - Replication 2011/11/02 1 / 82

Page 2: 09 Replication

Contents

1 Introduction to replicated systems
2 Consistency models
    Introduction
    Strict consistency
    Linearizability
    Sequential consistency
    Eventual Consistency
    Client-centric consistency
3 Replication architectures
    Overview
    Primary-Backup
    Quorum protocols
    State machines
    Client-centric consistency
4 CAP Theorem
5 Bibliography

Page 3: 09 Replication

Introduction to replicated systems

Introduction

Definition (Availability)

The probability that a system will provide its required service, or the ratio of the total time a system is capable of being used during a given interval to the length of the interval:

$A = \frac{E[\text{uptime}]}{E[\text{uptime} + \text{downtime}]}$

Example

One single server

On average, crashes once per week (mtbf: 10,080 minutes)

Two minutes to reboot (mtbr: 2 minutes)

$A = \frac{10080}{10080 + 2} \approx 0.9998$

Page 4: 09 Replication


Example

Ten servers

mtbf, mtbr as before

All needed at the same time to perform the service

$p_f = \frac{2}{10082}$

$A = (1 - p_f)^{10} \approx 0.998$

Page 5: 09 Replication


Example

Ten servers

mtbf, mtbr as before

One replica needed to perform the service

$p_f = \frac{2}{10082}$

$A = 1 - p_f^{10} \approx 1 - 9.4 \times 10^{-38}$
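The three availability examples can be reproduced with a few lines of Python. This is a minimal sketch that assumes server failures are independent; the function names are illustrative and do not come from the slides. Since 1 − p_f^10 rounds to exactly 1.0 in double precision, the sketch reports the probability that all replicas are down instead.

```python
MTBF_MIN = 7 * 24 * 60   # one crash per week, in minutes (10080)
MTBR_MIN = 2             # two minutes to reboot


def availability(mtbf, mtbr):
    """A = E[uptime] / E[uptime + downtime] for a single server."""
    return mtbf / (mtbf + mtbr)


def failure_probability(mtbf, mtbr):
    """p_f: probability that a single server is down at a random instant."""
    return mtbr / (mtbf + mtbr)


def all_needed_availability(p_f, n):
    """All n servers are needed at the same time (assumes independence)."""
    return (1 - p_f) ** n


def all_down_probability(p_f, n):
    """One replica out of n suffices: the service is down only if all n are."""
    return p_f ** n


p_f = failure_probability(MTBF_MIN, MTBR_MIN)
single = availability(MTBF_MIN, MTBR_MIN)
ten_needed = all_needed_availability(p_f, 10)
ten_replicas_down = all_down_probability(p_f, 10)

print(f"single server:        A = {single:.6f}")
print(f"ten servers, all:     A = {ten_needed:.6f}")
print(f"ten replicas, any:    1 - A = {ten_replicas_down:.2e}")
```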


Page 6: 09 Replication

Introduction to replicated systems

Replication

How to increase availability:

Avoid single points of failure

Use replication (time/space)

Replication in space:

Run parallel copies

Vote on replica output

High-availability, high-cost

Replication in time:

When a replica fails, restart it (or replace it)

Lower maintenance, lower availability


Page 7: 09 Replication

Introduction to replicated systems

Replication

Replication advantages:

Replicating a service increases its availability

Performance benefits:
- Geographical co-location
- Load-balancing
- No bottlenecks

Replication drawbacks:

Trade-off between availability and consistency

Transparent replication is difficult


Page 8: 09 Replication

Introduction to replicated systems

Consistency problem

The consistency problem:

Whenever a copy is modified, that copy becomes different from the rest

Modifications have to be carried out on all copies to ensure consistency

Conflicting operations - from the world of transactions:

Read–write conflict: concurrent read operation and write operation

Write–write conflict: two concurrent write operations


Page 9: 09 Replication

Introduction to replicated systems

Consistency problem

The goal

We generally need to ensure that all conflicting operations are done in the same order everywhere

The problem

Guaranteeing a global ordering on conflicting operations may be costly, degrading scalability

The solution

Weaken consistency requirements so that, hopefully, global synchronization can be avoided

Page 10: 09 Replication

Introduction to replicated systems

Consistency example

Example (Flight reservation database)

At 9.36, all seats of flight 48 are booked

At 9.37, Jane cancels her reservation on flight 48

At 9.38, Michael tries to reserve a seat on flight 48
- the answer is “fully booked”

At 9.39, George tries to reserve a seat on flight 48
- the seat is granted

What do you think?


Page 11: 09 Replication

Consistency models Introduction

Consistency models

Definition (Consistency model)

A contract between a distributed data store and a set of processes, which specifies what the results of read/write operations are in the presence of concurrency

Definition (Distributed data store)

A distributed collection of storage entities accessible to clients

Distributed database, file system

Shared memory in a parallel system

Page 12: 09 Replication

Consistency models Introduction

Consistency models

Werner Vogels

Whether or not inconsistencies are acceptable depends on the client application. In all cases the developer must be aware that consistency guarantees are provided by the storage systems and must be taken into account when developing applications.

W. Vogels. Eventually consistent. Comm. of the ACM, 52(1):40–44, 2009

Werner Vogels is Amazon’s vice-president and Chief Technology Officer

Page 13: 09 Replication

Consistency models Introduction

Consistency models

Strong consistency models

Strict consistency

Linearizability

Sequential consistency

Weak consistency models

Eventual consistency

Causal consistency

Client-centric consistency models

Read-after-read (monotonic read)

Write-after-write (monotonic write)

Read-after-write (read your writes)

Write-after-read (write follows read)


Page 14: 09 Replication

Consistency models Introduction

Notation

Write operation: wi(s, a)
Process i has written a on variable s

Read operation: ri(s) → a
Process i has read a from variable s

[Figure: space-time diagram for processes p1, p2, p3 with the operations w1(s, 100), w1(s, 99), r2(s) → 100, r2(s) → 99, w3(s, 100)]

Page 15: 09 Replication

Consistency models Strict consistency

Strict consistency

Definition (Strict consistency)

A read operation must return the result of the latest write operation which occurred on the data item

Implementation:

Only possible with a global, perfectly synchronized clock

Only possible if all writes are instantaneously visible to all

It makes sense, though:

it is the model of uniprocessor systems!


Page 16: 09 Replication

Consistency models Linearizability

Linearizability

Definition (Linearizability, Herlihy and Wing, 1990)

1 The result of any execution is the same as if the operations by all processes on the data store were executed in some sequential order

2 The operations of each process appear in this sequence in the order specified by its program

3 If t1 < t2 are the times at which two distinct processes perform operations o1 and o2, then o1 must appear before o2 in the sequence

Page 17: 09 Replication

Consistency models Linearizability

Linearizability

Example

Is the example below linearizable? (Read: given a replication protocol that produces these actions, is this protocol linearizable?)

Are the following linear sequences possible linearizations?
- w1(s, 100) w1(s, 99) r2(s) → 100 r3(s) → 99 w3(s, 100)
- w1(s, 100) r2(s) → 100 w1(s, 99) r3(s) → 99 w3(s, 100)

[Figure: space-time diagram for processes p1, p2, p3 with the operations w1(s, 100), w1(s, 99), r2(s) → 100, r2(s) → 99, w3(s, 100)]

Page 18: 09 Replication

Consistency models Sequential consistency

Sequential Consistency

Definition (Sequential Consistency, Lamport, 1979)

1 The result of any execution is the same as if the operations by all processes on the data store were executed in some sequential order

2 The operations of each process appear in this sequence in the order specified by its program

Condition 3 of linearizability is dropped: if t1 < t2 are the times at which two distinct processes perform operations o1 and o2, o1 is not required to appear before o2 in the sequence

Comments:

Much more commonly provided than linearizability

Page 19: 09 Replication

Consistency models Sequential consistency

Sequential Consistency

Example

Is the example below sequentially consistent?

Is the following sequence a sequentially consistent one?
- w1(s, 100) r2(s) → 100 w1(s, 99) r3(s) → 99 w3(s, 100)

[Figure: space-time diagram for processes p1, p2, p3 with the operations w1(s, 100), w1(s, 99), r2(s) → 100, r2(s) → 99, w3(s, 100)]

Page 20: 09 Replication

Consistency models Sequential consistency

Sequential Consistency

Example

Is the example below sequentially consistent?

Is the following sequence a sequentially consistent one?
- From 1,2: w1(s, 99) r2(s) → 99 w2(s, 100)
- From 3: r3(s) → 100 r3(s) → 99

[Figure: space-time diagram for processes p1, p2, p3 with the operations w1(s, 98), w1(s, 99), r2(s) → 99, w2(s, 100), r3(s) → 100, r3(s) → 99]
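Questions of this kind can be checked mechanically for small histories. The sketch below is a brute-force search, not an algorithm from the slides: it looks for one interleaving that respects each process's program order (condition 2) and in which every read returns the most recently written value (condition 1). The history encoding, the function name and the assignment of operations to processes are assumptions based on one reading of the example above.

```python
def sequentially_consistent(history, initial=None):
    """Return True if some interleaving of the per-process operation lists
    is a legal execution of a single read/write register.

    history maps a process name to its list of operations in program order;
    an operation is ("w", value) or ("r", value_returned).
    """
    procs = list(history)

    def search(positions, value):
        # All operations placed: a legal sequential order exists.
        if all(positions[p] == len(history[p]) for p in procs):
            return True
        for p in procs:
            i = positions[p]
            if i == len(history[p]):
                continue
            kind, v = history[p][i]
            if kind == "r" and v != value:
                continue  # this read cannot be scheduled now
            nxt = dict(positions)
            nxt[p] = i + 1
            if search(nxt, v if kind == "w" else value):
                return True
        return False

    return search({p: 0 for p in procs}, initial)


# One reading of the example: p3 sees 100 first and 99 afterwards.
example = {
    "p1": [("w", 98), ("w", 99)],
    "p2": [("r", 99), ("w", 100)],
    "p3": [("r", 100), ("r", 99)],
}
# Hypothetical variant in which p3 reads the values in the other order.
variant = {
    "p1": [("w", 98), ("w", 99)],
    "p2": [("r", 99), ("w", 100)],
    "p3": [("r", 99), ("r", 100)],
}

print(sequentially_consistent(example))
print(sequentially_consistent(variant))
```

The search is exponential in the number of operations, which is acceptable only for slide-sized histories.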


Page 21: 09 Replication

Consistency models Sequential consistency

Sequential Consistency

Process p1: x ← 1; print y, z

Process p2: y ← 1; print x, z

Process p3: z ← 1; print x, y

Initially, all variables have value 0

How many “potential executions” (without conditions 1,2)? 6! = 720

How many “valid executions” (without condition 1)? (5!/4) · 3 = 90

How many “potential outputs” (signatures ordered by p1, p2, p3)? $2^6 = 64$
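The counts above can be verified by enumeration. The following sketch is an illustration rather than part of the original slides: it generates all 6! orderings of the six statements, keeps those that respect each process's program order, and executes them on a shared memory to collect the output signatures (printed values concatenated in the order p1, p2, p3). All names are chosen for this example.

```python
from itertools import permutations

# Each process performs one write followed by one print of the other two variables.
PROGRAMS = {
    "p1": [("write", "x"), ("print", ("y", "z"))],
    "p2": [("write", "y"), ("print", ("x", "z"))],
    "p3": [("write", "z"), ("print", ("x", "y"))],
}
STEPS = [(p, i) for p in PROGRAMS for i in range(2)]


def respects_program_order(schedule):
    """The write of each process must precede its print."""
    return all(schedule.index((p, 0)) < schedule.index((p, 1)) for p in PROGRAMS)


def signature(schedule):
    """Run the schedule on a shared memory and return the 6-digit output."""
    mem = {"x": 0, "y": 0, "z": 0}
    printed = {}
    for proc, step in schedule:
        kind, arg = PROGRAMS[proc][step]
        if kind == "write":
            mem[arg] = 1
        else:
            printed[proc] = "".join(str(mem[v]) for v in arg)
    return printed["p1"] + printed["p2"] + printed["p3"]


all_schedules = list(permutations(STEPS))
valid = [s for s in all_schedules if respects_program_order(s)]
outputs = {signature(s) for s in valid}

print(len(all_schedules), len(valid), len(outputs))
print("000000" in outputs, "001001" in outputs, "001011" in outputs)
```

Every signature in `outputs` is produced by some sequential order that respects program order, so this set is exactly the set of sequentially consistent outputs for this program.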


Page 22: 09 Replication

Consistency models Sequential consistency

Sequential Consistency

Process p1: x ← 1; print y, z

Process p2: y ← 1; print x, z

Process p3: z ← 1; print x, y

How many “sequentially consistent outputs”? < 64

Example: Is 000000 sequentially consistent? No
- All print operations “happen before” the updates: impossible

Example: Is 001001 sequentially consistent? No
- print yz = 00 must come after x ← 1 and before y ← 1, z ← 1
- x ← 1, print yz = 00, y ← 1, print xz = 10, z ← 1, print xy = 11: No!
- x ← 1, print yz = 00, z ← 1: no (z was never equal to 1)

Page 23: 09 Replication

Consistency models Sequential consistency

Causal Consistency – (Hutto and Ahamad, 1990)

Definition (Causal Consistency)

All writes that are (potentially) causally related must be seen by every process in causal order

Define “causally related”:

a read followed by a write, on the same process:
- the write is (potentially) causally related to the read

a write followed by a read of the same value, on a different process:
- the read is (potentially) causally related to the write

Example of use:

Bulletin board


Page 24: 09 Replication

Consistency models Sequential consistency

Causal Consistency

Example

Is the following example causally consistent?

Is the following example sequentially consistent?

[Figure: space-time diagram for processes p1, p2, p3, p4 with the operations w1(s, 99), w2(s, 100), r3(s) → 100, r3(s) → 99, r4(s) → 99, r4(s) → 100]

Page 25: 09 Replication

Consistency models Sequential consistency

Causal Consistency

Example

Is the following example causally consistent?

Is the following example sequentially consistent?

[Figure: space-time diagram for processes p1, p2, p3, p4 with the operations w1(msg1, 99), w2(msg2, 100), r2(msg1) → 99, r3(msg1) → 99, r3(msg2) → 100, r4(msg1) → 99, r4(msg2) → 100]

Page 26: 09 Replication

Consistency models Sequential consistency

Several other models

FIFO/PRAM Consistency (Lipton and Sandberg, 1988)

Release Consistency (Gharachorloo et al, 1990)

Entry Consistency (Bershad et al, 1993)

. . .


Page 27: 09 Replication

Consistency models Eventual Consistency

Eventual Consistency

Scenario: consider a system where

updates are rare

concurrent updates are absent, or can be easily resolved in an automatic way

Example: DNS

Updates are rare with respect to reads!

Only a centralized authority can update the system; no concurrent updates.

Do we need sequential consistency in this case?


Page 28: 09 Replication

Consistency models Eventual Consistency

Eventual Consistency

Definition (Eventual consistency)

If no updates take place for a long time, all replicas will gradually become consistent (i.e., the same)

Comment:

The consistency policy of epidemic protocols

This is not a safety property; it is a liveness one

What happens in our three-process example with prints?

Page 29: 09 Replication

Consistency models Eventual Consistency

Eventual Consistency

Process p1: x ← 1; print y, z

Process p2: y ← 1; print x, z

Process p3: z ← 1; print x, y

Example: Is 000000 eventually consistent? Yes

In general, all 64 potential outputs are possible

Page 30: 09 Replication

Consistency models Eventual Consistency

Consistency for mobile users

Consider a replicated database that you access through your notebook. The notebook acts as a front-end to the database

Page 31: 09 Replication

Consistency models Client-centric consistency

Consistency for mobile users

Problem: Eventual Consistency is not sufficient

You move from location A to location B

Unless you use the same server, you may detect inconsistencies:
- your updates at A may not have yet been propagated to B
- you may be reading newer entries than the ones available at A
- your updates at B may eventually conflict with those at A

What can we do?
The only thing you really care about is that the entries you updated and/or read at A are in B the way you left them in A. In that case, the database will appear to be consistent to you

Page 32: 09 Replication

Consistency models Client-centric consistency

Client-centric consistency

Idea

In some cases, we can avoid system-wide consistency, by concentrating on what specific clients want, instead of what should be provided by servers

Models:

Read-after-read / Monotonic reads

Write-after-write / Monotonic writes

Read-after-write / Read-your-writes

Write-after-read / Write-follows-reads


Page 33: 09 Replication

Consistency models Client-centric consistency

Notation

xi denotes the version of data item x at local copy Li

WS(xi) denotes the set of the write/update operations that have caused xi to assume such value on Li

WS(xi; xj) denotes the fact that the operations in WS(xi) have also been performed at local copy Lj

Time specifications should be added to this notation; in the next slides we will use a space-time diagram instead

Page 34: 09 Replication

Consistency models Client-centric consistency

Monotonic reads – Read-after-read

Definition (Monotonic reads)

If a process reads the value of a data item x, any successive read operation on x by that process will always return that same or a more recent value

Example

Reading incoming mail on a web server. Each time you connect to a different e-mail server, that server fetches (at least) all the updates from the server you previously visited

[Figure: two space-time diagrams over local copies L1 and L2; the first is labelled WS(x1), r(x1), WS(x1; x2), r(x2), the second WS(x1), r(x1), r(x2)]

Page 35: 09 Replication

Consistency models Client-centric consistency

Monotonic writes – Write-after-write

Definition (Monotonic writes)

A write operation by a process on a data item x is completed before any successive write operation on x by the same process

Example

Maintaining versions of replicated files in the correct order everywhere (propagate the previous version to the server where the newest version is installed)

[Figure: two space-time diagrams over local copies L1 and L2; the first is labelled w(x1), WS(x1; x2), w(x2), the second w(x1), w(x2), WS(x1), WS(x1)]

Page 36: 09 Replication

Consistency models Client-centric consistency

Read your writes – Read-after-write

Definition (Read your writes)

The effect of a write operation by a process on data item x will always be seen by a successive read operation on x by the same process

Example

Updating your Web page and guaranteeing that your Web browser shows the newest version instead of its cached copy

[Figure: two space-time diagrams over local copies L1 and L2; the first is labelled w(x1), WS(x1), WS(x1; x2), r(x2), the second w(x1), WS(x1), r(x2)]

Page 37: 09 Replication

Consistency models Client-centric consistency

Writes follow reads – Write-after-read

Definition (Writes follow reads)

A write operation by a process P on a data item x following a previous read operation on x by P is guaranteed to take place on the same or a more recent value of x than the one that was read

Example (Newsgroup)

To guarantee that users of a network newsgroup see a posting of a reaction to an article only after they have seen the original article

[Figure: two space-time diagrams over local copies L1 and L2; both are labelled WS(x1), r(x1), WS(x1; x2), w(x2)]

Page 38: 09 Replication

Consistency models Client-centric consistency

Session consistency

Definition (Session consistency)

A practical version of read-your-writes, where processes access a data storage in the context of a session

As long as the session exists, the system guarantees read-your-writes

If the session terminates because of a failure, a new session must be created

Guarantees are limited to sessions
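A session guarantee of this kind can be enforced on the client side by remembering how recent the data seen in the session was. The sketch below is a deliberately simplified illustration, not the mechanism of any specific system: it assumes a single global version counter per replica (as if all writes were ordered by one replica and copied to the others), whereas systems such as Bayou track write sets or version vectors. `Replica`, `Session` and `StaleReplica` are hypothetical names.

```python
class StaleReplica(Exception):
    """Raised when a replica has not yet seen what the session has seen."""


class Replica:
    def __init__(self):
        self.version = 0   # number of writes applied (simplifying assumption)
        self.data = {}

    def write(self, key, value):
        self.version += 1
        self.data[key] = value
        return self.version

    def sync_from(self, other):
        """Anti-entropy step: copy the state of a more recent replica."""
        if other.version > self.version:
            self.data = dict(other.data)
            self.version = other.version


class Session:
    """Provides read-your-writes and monotonic reads within one session."""

    def __init__(self):
        self.min_version = 0   # most recent version read or written so far

    def _check(self, replica):
        if replica.version < self.min_version:
            raise StaleReplica("replica is behind this session")

    def write(self, replica, key, value):
        self._check(replica)
        self.min_version = replica.write(key, value)

    def read(self, replica, key):
        self._check(replica)
        self.min_version = max(self.min_version, replica.version)
        return replica.data.get(key)


a, b = Replica(), Replica()
session = Session()
session.write(a, "x", 1)
try:
    session.read(b, "x")          # b has not received the write yet
    rejected = False
except StaleReplica:
    rejected = True
b.sync_from(a)
value_after_sync = session.read(b, "x")
print(rejected, value_after_sync)
```

If the session object is lost, for example after a client failure, `min_version` is lost with it, which matches the remark that guarantees are limited to sessions.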


Page 39: 09 Replication

Consistency models Client-centric consistency

Client-centric consistency

Relevant bibliography

A. S. Tanenbaum and M. van Steen. Distributed Systems: Principles and Paradigms. Prentice-Hall, 2nd edition, 2007. [Chapter 7]

D. Terry, M. Theimer, K. Petersen, A. Demers, M. Spreitzer, and C. Hauser. Managing update conflicts in Bayou, a weakly connected replicated storage system. In Proc. of the 15th ACM Symposium on Operating Systems Principles, SOSP’95, pages 172–182. ACM, 1995. http://www.disi.unitn.it/~montreso/ds/papers/bayou.pdf

Page 40: 09 Replication

Consistency models Client-centric consistency

Reality Check

Amazon S3

Amazon S3 (Simple Storage Service) is an online storage web service offered by Amazon Web Services. S3 is designed to provide 99.99% availability and 99.999999999% durability of objects over a given year.

From Amazon S3’s FAQ

Q: What data consistency model does Amazon S3 employ?
Amazon S3 buckets in the US West (Northern California), EU (Ireland), Asia Pacific (Singapore), and Asia Pacific (Tokyo) Regions provide read-after-write consistency for PUTS of new objects and eventual consistency for overwrite PUTS and DELETES. Amazon S3 buckets in the US Standard Region provide eventual consistency.

Page 41: 09 Replication

Consistency models Client-centric consistency

Reality Check

Berkeley DB

Oracle’s Berkeley DB is a computer software library that provides a high-performance embedded database for key/value data. Used in Postfix, Subversion, SpamAssassin, Bitcoin.

From the Berkeley DB manual

In a distributed system, the changes made at the master are not always instantaneously available at every replica, although they eventually will be. In general, replicas not directly involved in contributing to a transaction commit will lag behind other replicas because they do not synchronize their commits with the master. For this reason, you might want to make use of the read-your-writes consistency feature.

Page 42: 09 Replication

Consistency models Client-centric consistency

Reality Check

Apache ZooKeeper

Apache ZooKeeper is a software project of the Apache Software Foundation, providing an open source centralized configuration service and naming registry for large distributed systems. ZooKeeper is a subproject of Hadoop.

From ZooKeeper

Sequential Consistency: Updates from a client will be appliedin the order that they were sent.

What?


Page 43: 09 Replication

Consistency models Client-centric consistency

Reality Check

Relevant bibliography

H. Wada, A. Fekete, L. Zhao, K. Lee, and A. Liu. Data consistency properties and the trade-offs in commercial cloud storage: the consumers’ perspective. In Proc. of 5th Biennial Conference on Innovative Data Systems Research (CIDR’11), pages 134–143, Asilomar, CA, USA, Jan. 2011. http://www.disi.unitn.it/~montreso/ds/papers/ConsistencyCloud.pdf

Page 44: 09 Replication

Replication architectures Overview

Passive replication

Clients communicate with primary server

Updates are forwarded from primary to backups

Queries are answered by the primary

[Figure: clients connected to a primary server, which is connected to two backup servers]

Page 45: 09 Replication

Replication architectures Overview

Active replication

Several (all) replicas handle the invocation and send the response

Updates must be applied in the same order – total order broadcast

[Figure: two groups of clients connected to three replicas]

Page 46: 09 Replication

Replication architectures Overview

Passive vs Active

Passive replication

Computation is performed only at primary

If state updates are large, can waste network bandwidth

Can handle non-determinism

Active replication

Small recovery delay after failures

If operations are compute-intensive, can waste computational resources

Only deterministic

Page 47: 09 Replication

Replication architectures Overview

Consistency protocols

Primary-based protocols
- Definition
- Lower bounds

Replicated-write protocols
- Majority, quorum-based
- State machine approach

Client-centric protocols
- Monotonic reads
- Read-your-writes

Page 48: 09 Replication

Replication architectures Primary-Backup

Primary-Backup

The idea

Clients communicate with a single replica (the primary)

The primary updates the other replicas (backup)

Backups detect the failure of the primary using a timeout mechanism

Clients learn from the service when the primary fails and the service “fails over” to a backup

Note: non-deterministic events are executed only at the primary

Page 49: 09 Replication

Replication architectures Primary-Backup

How to evaluate a primary-backup protocol

Definition (Degree of replication)

Number of servers used to implement the service; the smaller, the better

Definition (Blocking time)

The worst-case period between a request and its response in any failure-free execution

Definition (Failover time)

The worst-case period during which requests can be lost because there is no primary

Page 50: 09 Replication

Replication architectures Primary-Backup

Definitions

Definition (Service outage)

The service has a server outage at t if some correct client sends a request at time t to the service, but does not receive a response

Definition ((k,∆)-bofo service - “bounded outage, finitely often”)

A service in which all server outages can be grouped into at most k intervals of time, each of at most length ∆

Page 51: 09 Replication

Replication architectures Primary-Backup

Specification

PB1 At any time, there is at most one server pi that acts as a primary

PB2 If a client request arrives at a server that is not the current primary, then the request is ignored

PB3 There exist fixed values k and ∆ such that the service behaves like a single (k,∆)-bofo service

Page 52: 09 Replication

Replication architectures Primary-Backup

Primary-backup – Simple protocol

System model:

point-to-point communication

no communication failures

upper bound δ on message delivery time

FIFO channels

at most one server crashes

Two servers:

The primary p1

The backup p2

Variables:

At server pi, primary = true if pi acts as the current primary

At clients, primary is equal to the identifier of the current primary


Page 53: 09 Replication

Replication architectures Primary-Backup

Primary-backup – Simple protocol

Protocol executed by the primary p1

upon initialization do
    primary ← true

upon receive 〈req, r〉 from c do
    state ← update(state, r)        % Update local state
    send 〈state, state〉 to p2       % Send update to backup
    send 〈rep, reply(r)〉 to c       % Reply to client

repeat every τ seconds
    send 〈hb〉 to p2                 % Heartbeat message

upon recovery after a failure do
    { start behaving like a backup }

Page 54: 09 Replication

Replication architectures Primary-Backup

Primary-backup – Simple protocol

Protocol executed by the backup p2

upon initialization do
    primary ← false

upon receive 〈state, s〉 do
    state ← s                       % Update local state

upon not receiving a heartbeat for τ + δ seconds do
    primary ← true                  % Becomes new primary
    send 〈newp〉 to c                % Inform the client of new primary
    { start behaving like a primary }

Page 55: 09 Replication

Replication architectures Primary-Backup

Primary-backup – Client code

Protocol executed by client c

upon initialization do
    primary ← p1                    % Initial primary

upon receive 〈newp〉 from p2 do
    primary ← p2                    % Backup

upon operation(r) do
    while not received a reply do
        send 〈req, r〉 to primary
        wait receive 〈rep, v〉 or receive 〈newp〉
    return v
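The interplay between the primary, the backup and the timeout can be illustrated with a toy, single-process Python sketch. It is not a faithful implementation of the protocol: messages are replaced by direct method calls, time is passed in explicitly, and the `Server` class with its method names is an assumption made for this example.

```python
class Server:
    """Toy server for the two-server protocol (no real network or clock)."""

    def __init__(self, name, primary):
        self.name = name
        self.primary = primary   # True if acting as the current primary
        self.state = []          # replicated state: list of applied requests
        self.crashed = False
        self.last_hb = 0.0       # time of the last heartbeat received

    def on_request(self, r, backup=None):
        """Primary side: update the state, push it to the backup, reply."""
        if self.crashed or not self.primary:
            return None          # PB2: a non-primary ignores requests
        self.state = self.state + [r]
        if backup is not None and not backup.crashed:
            backup.state = list(self.state)   # stands in for the state message
        return ("rep", len(self.state))

    def on_heartbeat(self, now):
        self.last_hb = now

    def on_tick(self, now, tau, delta):
        """Backup side: take over after tau + delta without a heartbeat."""
        if not self.crashed and not self.primary and now - self.last_hb > tau + delta:
            self.primary = True  # a newp message would be sent to the client here
        return self.primary


TAU, DELTA = 1.0, 0.1            # heartbeat period and message delay bound
p1, p2 = Server("p1", True), Server("p2", False)

first = p1.on_request("a", backup=p2)   # failure-free request
p2.on_heartbeat(now=1.0)                # last heartbeat before the crash
p1.crashed = True
early = p2.on_tick(now=1.5, tau=TAU, delta=DELTA)   # timeout not yet expired
late = p2.on_tick(now=2.2, tau=TAU, delta=DELTA)    # timeout expired
second = p2.on_request("b")             # retried by the client at the new primary
print(first, early, late, second, p2.state)
```

Because the state is pushed to the backup before the reply is returned, the new primary continues from the state that the client has already observed.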


Page 56: 09 Replication

Replication architectures Primary-Backup

Simple protocol – Proof of correctness

PB1 At any time, there is at most one server pi that acts as a primary

Proof
- primary1 = true ∧ primary2 = false until the failure of p1
- primary2 = false until the expiration of the timeout
- primary2 = true after the expiration of the timeout
- Failover time: τ + 2δ

[Figure: space-time diagram for c, p1, p2 showing the messages REQ, STATE, REP, the heartbeats HB sent every τ with delay δ, and NEWP sent after the timeout]

Page 57: 09 Replication

Replication architectures Primary-Backup

Simple protocol – Proof of correctness

PB2 If a client request arrives at a server that is not the current primary, then the request is ignored

Proof: Trivially follows from the protocol

Page 58: 09 Replication

Replication architectures Primary-Backup

Simple protocol – Proof of correctness

PB3 There exist fixed values k and ∆ such that the service behaves like a single (k,∆)-bofo service

Proof: Find k, ∆
- At most one process can fail: k = 1
- ∆ = τ + 4δ:
    - assume p1 crashes at tc
    - any client request sent to p1 at time tc − δ or later may be lost
    - p2 may not become the new primary until tc + τ + 2δ
    - client may not learn that p2 is new primary for another δ

[Figure: space-time diagram for c, p1, p2 showing a lost REQ, the heartbeats HB, NEWP, the retried REQ and its REP, with the outage interval τ + 4δ marked]

Page 59: 09 Replication

Replication architectures Primary-Backup

Simple protocol – Questions

Question

What kind of consistency model is provided by this simple protocol?

Answer: Linearizability

1 The result of any execution is the same as if the operations by all processes on the data store were executed in some sequential order

2 The operations of each process appear in this sequence in the order specified by its program

3 If t1 < t2 are the times at which two distinct processes perform operations o1 and o2, then o1 must appear before o2 in the sequence

Page 60: 09 Replication

Replication architectures Primary-Backup

Primary-backup – Multiple backups

System model:

point-to-point communication

Perfect Channels

perfect failure detector P

FIFO channels

at most f < n servers crash

n servers:

p1, . . . , pn


Page 61: 09 Replication

Replication architectures Primary-Backup

Primary-backup – Multiple backups

Protocol executed by process pi

upon receive 〈req, id, r〉 from c do
    servers ← servers − {pj : pj ∈ servers ∧ j < i}
    if id ∉ state then
        state ← update(state, r)
    send 〈state, id, state〉 to servers
    wait receive 〈state, id〉 from servers
    send 〈rep, id, reply(r)〉 to c

upon suspect(pj) do
    servers ← servers − {pj}

upon receive 〈state, id, s〉 from pk do
    servers ← servers − {pj : pj ∈ servers ∧ j < k}
    if pk ∈ servers then
        state ← s
        send 〈state, id〉 to pk

Page 62: 09 Replication

Replication architectures Primary-Backup

Primary-Backup – Client code

Protocol executed by client c

upon initialization do
    Map response ← new Map()

upon receive 〈rep, id, v〉 do
    response[id] ← v

upon suspect(pj) do
    servers ← servers − {pj}

upon operation(r) do
    id ← newId()
    while servers ≠ ∅ and response[id] = nil do
        pk ← min(servers)
        send 〈req, id, r〉 to pk
        wait response[id] ≠ nil or pk ∉ servers
    return response[id]

Page 63: 09 Replication

Replication architectures Primary-Backup

Primary-backup – Multiple backups

How large is the failover time?
τ + 2δ, as before (hidden in the failure detector)

How large is the outage period ∆?
(τ + 2δ)(n − 1)

What kind of consistency model do we obtain if all operations are handled by the primary?
Linearizability

What kind of consistency model do we obtain if only write operations are handled by the primary?
Sequential consistency

Page 64: 09 Replication

Replication architectures Primary-Backup

Lower bounds

Assuming that no more than f components can fail, what are the smallest possible values (lower bounds) of
- the degree of replication?
- the failover time?
- the blocking time?

Knowing the lower bounds for a problem makes it possible to evaluate the quality of a protocol

Tight lower bounds → optimal protocols

Components:
- Processes
- Point-to-point links
- Up to f crash+link failures → at most f processes may crash, or f links may crash, or f1 links + f2 processes = f components

Page 65: 09 Replication

Replication architectures Primary-Backup

Lower bounds

Failure model    Degree of replication    Blocking time    Failover time

crash            n > f                    0                fδ

crash+link       n > f + 1                0                2fδ

rec-omission     n > ⌊3f/2⌋               2δ               2fδ

send-omission    n > f                    2δ               2fδ

omission         n > 2f                   2δ               2fδ

Page 66: 09 Replication

Replication architectures Primary-Backup

Lower bounds

Crash+link

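To tolerate up to f crash+link failures, more than f + 1 servers are needed

Proof – by contradiction

Suppose n = f + 1 servers are sufficient

Divide the n servers into two subsets: {A} and B = {B1, ..., Bf}

If all servers in B crash, A must become primary

If A crashes, one of the servers Bi must become primary

What if all f links between A and the servers Bi fail? A cannot distinguish this from a crash of all servers in B, and the servers in B cannot distinguish it from a crash of A, so both A and some Bi become primary: a contradiction

[Figure: server A and servers B1, B2, B3]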


Page 67: 09 Replication

Replication architectures Primary-Backup

Multiple primaries


Page 68: 09 Replication

Replication architectures Quorum protocols

Quorum protocols (Gifford, 1979)

Definition

Quorum-based protocols guarantee that each operation is carried out in such a way that a majority vote (a quorum) is established.

Write quorum nW: the number of replicas that need to acknowledge the receipt of the update to complete the update

Read quorum nR: the number of replicas that are contacted when a data object is accessed through a read operation

Constraints

nR + nW > n (prevent R-W conflicts)

nW > n/2 (prevent W-W conflicts)

The algorithm

To read, the most up-to-date entry is taken

Quorums guarantee that the last written entry will be present
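The constraints and the read rule above can be sketched with versioned replicas. This is a minimal in-memory illustration, not Gifford's full weighted-voting protocol: the class name, the (version, value) pairs, and the random choice of quorum members are assumptions made for the example.

```python
import random
from typing import List, Optional, Tuple

Entry = Tuple[int, Optional[str]]  # (version, value)


class QuorumStore:
    """n replicas, read quorum n_r, write quorum n_w."""

    def __init__(self, n: int, n_r: int, n_w: int) -> None:
        # nR + nW > n prevents R-W conflicts; nW > n/2 prevents W-W conflicts.
        if not (n_r + n_w > n and 2 * n_w > n):
            raise ValueError("quorum constraints violated")
        self.n, self.n_r, self.n_w = n, n_r, n_w
        self.replicas: List[Entry] = [(0, None)] * n

    def write(self, value: str) -> int:
        quorum = random.sample(range(self.n), self.n_w)
        # Any two write quorums overlap, so the latest version is visible here.
        version = max(self.replicas[i][0] for i in quorum) + 1
        for i in quorum:
            self.replicas[i] = (version, value)
        return version

    def read(self) -> Entry:
        quorum = random.sample(range(self.n), self.n_r)
        # The read quorum overlaps the last write quorum: take the newest entry.
        return max((self.replicas[i] for i in quorum), key=lambda e: e[0])
```

With n = 5 and nR = nW = 3, every read returns the last written value regardless of which replicas are sampled, whereas a configuration such as nR = 2, nW = 3 is rejected.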


Page 69: 09 Replication

Replication architectures Quorum protocols

Quorum protocols (Gifford, 1979)


Page 70: 09 Replication

Replication architectures Quorum protocols

Quorum protocols (Gifford, 1979)


Page 71: 09 Replication

Replication architectures State machines

State machine

Definition (State machine)

A state machine consists of:

State variables

Commands, which transform its state
  - Implemented by deterministic programs
  - Atomic with respect to other commands

Specification

Agreement: every correct replica receives the same set of commands

Order: every non-faulty state machine processes the commands it receives in the same order
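A small sketch of why Agreement and Order together keep replicas identical: deterministic commands applied to the same log in the same order always produce the same state. The command format and the `run` helper are assumptions chosen for the example.

```python
from typing import Dict, List, Tuple

Command = Tuple[str, str, int]  # (operation, variable, argument)


class StateMachine:
    """State variables plus deterministic, atomic commands."""

    def __init__(self) -> None:
        self.state: Dict[str, int] = {}

    def apply(self, cmd: Command) -> None:
        op, var, arg = cmd
        if op == "set":
            self.state[var] = arg
        elif op == "add":
            self.state[var] = self.state.get(var, 0) + arg
        else:
            raise ValueError(f"unknown command: {op}")


def run(log: List[Command]) -> Dict[str, int]:
    """Replay a command log on a fresh replica and return its final state."""
    replica = StateMachine()
    for cmd in log:
        replica.apply(cmd)
    return replica.state
```

Two replicas that replay the same log agree on the final state; a replica that processes the same commands in a different order may diverge, which is why the Order property is required.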


Page 72: 09 Replication

Replication architectures State machines

Implementing linearizability – General scheme

Implementation

The initiator A-broadcasts (atomic broadcast) all read and write requests to all servers

When the message is A-delivered at the initiator, the initiator replies to the client

Correctness

All replicas execute read and write operations in the same order

Assumptions

Synchronous system

Asynchronous system with ◇S failure detector


Page 73: 09 Replication

Replication architectures State machines

Implementing sequential consistency – General scheme

Implementation

The initiator A-broadcasts write requests to all servers

When the message is A-delivered, the replica updates its local copy

Read requests are answered immediately by the initiator

Correctness

Writes are executed in the same order everywhere

Reads are consistent with local order

Assumptions

Synchronous system

Asynchronous system with ◇S failure detector
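The scheme for sequential consistency can be sketched as follows. The example assumes that each write already carries a global sequence number, as a fixed sequencer would assign in a failure-free run; this is only one possible way to obtain A-delivery and is not taken from the slides. The `Replica` class and its hold-back buffer are illustrative names.

```python
from typing import Dict, Optional, Tuple

Write = Tuple[str, int]  # (variable, value)


class Replica:
    """Applies writes in sequence-number order; serves reads locally."""

    def __init__(self) -> None:
        self.store: Dict[str, int] = {}
        self.next_seq = 0                    # next sequence number to A-deliver
        self.pending: Dict[int, Write] = {}  # hold-back buffer

    def receive(self, seq: int, write: Write) -> None:
        # Buffer the message, then deliver every write that is now in order.
        self.pending[seq] = write
        while self.next_seq in self.pending:
            var, value = self.pending.pop(self.next_seq)
            self.store[var] = value
            self.next_seq += 1

    def read(self, var: str) -> Optional[int]:
        # Reads never wait: they return the local copy.
        return self.store.get(var)
```

Replicas that receive the same writes in different network orders still apply them in the same order; a replica that has not yet received an earlier write simply keeps returning its older local value.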


Page 74: 09 Replication

Replication architectures State machines

Implementing causal consistency – General scheme

Implementation

The initiator C-broadcasts (causal broadcast) write requests to all servers

When the message is C-delivered, the replica updates its local copy

Read requests are answered immediately by the initiator

Correctness

Writes are executed in a causal order

Reads are consistent with local (and causal) order

Assumptions

Asynchronous system
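One common way to realize C-delivery is with vector clocks: a write is applied only after every write it causally depends on. The sketch below assumes this technique (the slides do not prescribe one); the class name and message layout are illustrative.

```python
from typing import Dict, List, Tuple

Message = Tuple[int, List[int], str, int]  # (sender, vector clock, variable, value)


class CausalReplica:
    def __init__(self, pid: int, n: int) -> None:
        self.pid = pid
        self.delivered = [0] * n   # delivered[j] = writes from process j applied here
        self.store: Dict[str, int] = {}
        self.pending: List[Message] = []

    def write(self, var: str, value: int) -> Message:
        """Apply a local write and return the message to C-broadcast."""
        self.delivered[self.pid] += 1
        self.store[var] = value
        return (self.pid, list(self.delivered), var, value)

    def _ready(self, msg: Message) -> bool:
        sender, vc, _, _ = msg
        # Next write from the sender, and all its causal predecessors applied.
        return vc[sender] == self.delivered[sender] + 1 and all(
            vc[k] <= self.delivered[k] for k in range(len(vc)) if k != sender
        )

    def receive(self, msg: Message) -> None:
        self.pending.append(msg)
        progress = True
        while progress:            # keep delivering until nothing else is ready
            progress = False
            for m in list(self.pending):
                if self._ready(m):
                    sender, _, var, value = m
                    self.store[var] = value
                    self.delivered[sender] += 1
                    self.pending.remove(m)
                    progress = True
```

If a replica receives a write before the write it depends on, the later write waits in `pending` and is applied as soon as its predecessor arrives.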


Page 75: 09 Replication

Replication architectures State machines

Hypervisor-based fault tolerance

Implement state machine on virtual machines running on the same instruction set as the underlying hardware

Undetectable by higher layers of software

One of the great come-backs in systems research!
  - CP-67 for IBM 360 [1970]
  - Xen [SOSP 2003], VMware

State transition should be deterministic

...but some VM instructions are not (e.g. time-of-day)!

Two types of commands
  - Virtual-machine instructions
  - Virtual-machine interrupts (with DMA input)

Interrupts must be delivered at the same point in cmd sequence


Page 76: 09 Replication

Replication architectures State machines

Hypervisor-based fault tolerance

Thomas C. Bressoud and Fred B. Schneider. Hypervisor-based Fault Tolerance. ACM TOCS, 14(1):80–107, 1996
  - The foundational paper on the approach

John R. Douceur and Jon Howell. Replicated Virtual Machines. Microsoft Research TR-2005-119
  - Technical paper associated with a patent

Brendan Cully et al. Remus: High Availability via Asynchronous Virtual Machine Replication. NSDI’08.
  - Best paper award
  - Real implementation for Xen


Page 77: 09 Replication

Replication architectures Client-centric consistency

Client-centric consistency - Naive implementations

Each write operation is assigned a unique identifier
  - Done by the server where the operation is requested

For each client c, we keep track of:
  - Read set WS_r: contains write operations relevant to the read operations performed by c
  - Write set WS_w: contains write operations relevant to the write operations performed by c

For each server, we keep track of:
  - Write set WS: contains the write operations executed so far


Page 78: 09 Replication

Replication architectures Client-centric consistency

Monotonic reads - Naive implementation

To perform a read operation o_r, a client:
  - Sends o_r and its read set WS_r to the server

The server:
  - Checks whether all the writes in WS_r have been executed locally (WS_r ⊆ WS?)
  - If not, asks the appropriate servers for the missing operations O
  - Applies the operations O and adds them to WS
  - Returns the requested value and the WS set to the client

The client:
  - Adds WS to its local read set: WS_r = WS_r ∪ WS
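The steps above can be sketched in a few lines. The sketch makes simplifying assumptions that are not in the slides: write identifiers are integers whose order matches the order of the writes, servers can read each other's write sets directly, and the current value of a variable is obtained by replaying the write set.

```python
from typing import Dict, List, Optional, Set, Tuple


class Server:
    def __init__(self) -> None:
        self.ws: Dict[int, Tuple[str, int]] = {}  # WS: write id -> (variable, value)

    def execute(self, wid: int, var: str, value: int) -> None:
        self.ws[wid] = (var, value)

    def value(self, var: str) -> Optional[int]:
        # Replay WS in id order (assumes ids reflect write order).
        result = None
        for wid in sorted(self.ws):
            v, val = self.ws[wid]
            if v == var:
                result = val
        return result

    def read(self, var: str, ws_r: Set[int],
             peers: List["Server"]) -> Tuple[Optional[int], Set[int]]:
        # Fetch every write in WS_r that has not been executed locally.
        for wid in ws_r - set(self.ws):
            for peer in peers:
                if wid in peer.ws:
                    self.ws[wid] = peer.ws[wid]
                    break
        return self.value(var), set(self.ws)


class Client:
    def __init__(self) -> None:
        self.ws_r: Set[int] = set()   # read set WS_r

    def read(self, server: Server, var: str, peers: List[Server]) -> Optional[int]:
        value, ws = server.read(var, self.ws_r, peers)
        self.ws_r |= ws               # WS_r = WS_r ∪ WS
        return value
```

A client that has read a value at one server and then moves to a stale server still sees that value, because the stale server first fetches the writes listed in the client's read set.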


Page 79: 09 Replication

Replication architectures Client-centric consistency

Read your writes - Naive implementation

To perform a read operation o_r, a client:
  - Sends o_r and its write set WS_w to the server

The server:
  - Checks whether all the writes in WS_w have been executed locally (WS_w ⊆ WS?)
  - If not, asks the appropriate servers for the missing operations O
  - Applies the operations O and adds them to WS
  - Returns the requested value to the client

To perform a write operation o_w, a client c:
  - Sends o_w to the server
  - Adds o_w to the write set WS_w
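A compact sketch of the read-your-writes steps follows. As a stand-in for unique identifiers assigned by the contacted server, it uses a shared `itertools.count`, which is purely an assumption of the example; it also assumes that some peer always holds each missing write.

```python
import itertools
from typing import Dict, List, Optional, Set, Tuple

_ids = itertools.count(1)   # hypothetical source of unique, ordered write ids


class Server:
    def __init__(self) -> None:
        self.ws: Dict[int, Tuple[str, int]] = {}  # WS: write id -> (variable, value)

    def write(self, var: str, value: int) -> int:
        wid = next(_ids)              # the contacted server assigns the identifier
        self.ws[wid] = (var, value)
        return wid

    def read(self, var: str, ws_w: Set[int], peers: List["Server"]) -> Optional[int]:
        # Fetch the client's own writes that are missing locally.
        for wid in ws_w - set(self.ws):
            source = next(p for p in peers if wid in p.ws)
            self.ws[wid] = source.ws[wid]
        values = [val for _, (v, val) in sorted(self.ws.items()) if v == var]
        return values[-1] if values else None


class Client:
    def __init__(self) -> None:
        self.ws_w: Set[int] = set()   # write set WS_w

    def write(self, server: Server, var: str, value: int) -> None:
        self.ws_w.add(server.write(var, value))

    def read(self, server: Server, var: str, peers: List[Server]) -> Optional[int]:
        return server.read(var, self.ws_w, peers)
```

A client that writes at one server and reads at another observes its own write, because the second server pulls the writes named in WS_w before answering.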


Page 80: 09 Replication

CAP Theorem

CAP theorem

Theorem (Impossibility of CAP)

It is impossible for a web service to provide more than two of the following three guarantees:

Consistency

Availability

Partition-tolerance

This is the reason why Amazon Web Services only provide eventual consistency
  - W. Vogels. Eventually consistent. Comm. of the ACM, 52(1):40–44, 2009

Similar stands have been taken, for example, by HP
  - HP. There is no free lunch with distributed systems. http://www.disi.unitn.it/~montreso/ds/papers/NoFreeLunchDS.pdf


Page 81: 09 Replication

CAP Theorem

CAP theorem

History:

First introduced by Eric Brewer in a keynote at PODC’00
  - E. A. Brewer. Towards robust distributed systems (abstract). In Proc. of the 19th ACM symposium on Principles of distributed computing, PODC’00, page 7. ACM, 2000

Formally proved by Gilbert and Lynch two years later
  - S. Gilbert and N. Lynch. Brewer’s conjecture and the feasibility of consistent, available, partition-tolerant web services. SIGACT News, 33:51–59, June 2002. http://www.disi.unitn.it/~montreso/ds/papers/CapProof.pdf


Page 82: 09 Replication

Bibliography

Reading material

W. Vogels. Eventually consistent. Comm. of the ACM, 52(1):40–44, 2009. http://www.disi.unitn.it/~montreso/ds/papers/EventualConsistent.pdf

N. Budhiraja, K. Marzullo, F. Schneider, and S. Toueg. The primary-backup approach. In S. Mullender, editor, Distributed Systems (2nd ed.). Addison-Wesley, 1993. http://www.disi.unitn.it/~montreso/ds/papers/PrimaryBackup.pdf

F. Schneider. Replication management using the state machine approach. In S. Mullender, editor, Distributed Systems (2nd ed.). Addison-Wesley, 1993. http://www.disi.unitn.it/~montreso/ds/papers/StateMachine.pdf

E. A. Brewer. Towards robust distributed systems (abstract). In Proc. of the 19th ACM symposium on Principles of distributed computing, PODC’00, page 7. ACM, 2000.

N. Budhiraja, K. Marzullo, F. Schneider, and S. Toueg. The primary-backup approach. In S. Mullender, editor, Distributed Systems (2nd ed.). Addison-Wesley, 1993. http://www.disi.unitn.it/~montreso/ds/papers/PrimaryBackup.pdf.

S. Gilbert and N. Lynch. Brewer’s conjecture and the feasibility of consistent, available, partition-tolerant web services. SIGACT News, 33:51–59, June 2002. http://www.disi.unitn.it/~montreso/ds/papers/CapProof.pdf.

HP. There is no free lunch with distributed systems. http://www.disi.unitn.it/~montreso/ds/papers/NoFreeLunchDS.pdf.

F. Schneider. Replication management using the state machine approach. In S. Mullender, editor, Distributed Systems (2nd ed.). Addison-Wesley, 1993. http://www.disi.unitn.it/~montreso/ds/papers/StateMachine.pdf.

A. S. Tanenbaum and M. van Steen. Distributed Systems: Principles and Paradigms. Prentice-Hall, 2nd edition, 2007.

D. Terry, M. Theimer, K. Petersen, A. Demers, M. Spreitzer, and C. Hauser. Managing update conflicts in Bayou, a weakly connected replicated storage system. In Proc. of the 15th ACM symposium on Operating systems principles, SOSP’95, pages 172–182. ACM, 1995. http://www.disi.unitn.it/~montreso/ds/papers/bayou.pdf.

W. Vogels. Eventually consistent. Comm. of the ACM, 52(1):40–44, 2009. http://www.disi.unitn.it/~montreso/ds/papers/EventualConsistent.pdf.

H. Wada, A. Fekete, L. Zhao, K. Lee, and A. Liu. Data consistency properties and the trade-offs in commercial cloud storage: the consumers’ perspective. In Proc. of 5th Biennial Conference on Innovative Data Systems Research (CIDR’11), pages 134–143, Asilomar, CA, USA, Jan. 2011. http://www.disi.unitn.it/~montreso/ds/papers/ConsistencyCloud.pdf.
