Ordering and Durability in Isis2

1

ORDERING AND DURABILITY IN ISIS2

Ken Birman, Cornell University

2

Isis2 System

Core functionality: groups of objects … fault-tolerance, speed (parallelism), coordination

Intended for use in very large-scale settings

The local object instance functions as a gateway

Read-only operations are performed on local state

Update operations update all the replicas

[Figure: group “myGroup”. Labels: “join myGroup”, state transfer, update, update.]

3

Terminology we’ve used

Process group: A term for a collection of programs that are all running (perhaps on different machines, perhaps on the same machine) and that use Isis2

Each process group has a name (you pick it)

You can have multiple groups in one application

Message: Data encoded to be sent between programs

State transfer: Data to initialize a new group member

Update: Any action that changes the shared data

Lookup: Any action that only queries the data

Multicast: A message sent to every group member

4

Example: Cloud-Hosted Service

[Figure: a standard Web-Services method invocation arrives at some service with members A, B, C and D. A distributed request that updates group “state” is multicast with SafeSend, and then the response is returned.]

5

Multicast properties

In the figure, “SafeSend” is a “multicast”: a message that can be sent to a whole group

What properties do these multicasts need to keep the group members consistent?

In Isis2 we focus on:

Ordering properties: relative to group membership changes, and relative to other multicasts

Durability guarantees: what happens if a crash occurs?

6

In Isis2 new View upcalls are synchronized relative to message delivery

Key idea: View ordering

7

Membership changes

When a group gains or loses a member, the Isis2 Oracle sequences the new view relative to other multicasts. Thus any multicast is delivered in the same view, from the perspective of all recipients.

Also, if a multicast is sent to the group in some view, it reaches all members of the group (of course, if some crash, they might not process the message)

State transfers occur after every multicast has been delivered in the prior view and before any are delivered in the new view

[Figure: timeline for processes p, q, r, s, t (time 0 to 70). The group view is synchronized relative to multicasts.]

8

Message Ordering

The basic idea of Isis2 is to deliver all multicasts in the same order at all group members receiving them

This keeps the data consistent and allows you to implement “state machine” algorithms: group members perform any desired actions in the same state and in the same order

But we offer various implementations of multicast, and if you choose wisely, some are faster than others. The caveat is that the fast versions can only be used in certain situations, which we’ll discuss.

[Figure: timeline for processes p, q, r, s, t (time 0 to 70).]

9

A multicast arrives in a group…

What information is “the same” for all recipients?

If they call g.GetView(), or remember properties of the most recently delivered view, all see the same view

Also, everyone got the message

And the requested ordering was enforced by Isis2

What aspects might differ, for different receivers?

Each has its own “rank” in the membership list, obtained by calling v.GetMyRank() or v.GetRankOf(who)

10

What if a failure happens just as a multicast is being sent?

What about failures?

11

Delayed delivery

In Isis2, a multicast send will often delay (in the platform) for a little while before delivery occurs

As a result, the sender does not know that the group view will be the same when the message is delivered

[Figure: timeline for processes p, q, r, s, t (time 0 to 70). This multicast might have been “sent” in the prior view, when r, s and t weren’t yet members!]

12

How can we know for sure?

Suppose the sender of a Query needs to know how many members processed the query, e.g. to notice that some reply is missing due to a failure. What can it do to know?

One option is to have the receivers include View information (such as how many members were in the View, what rank each replying member had) in the Reply()

The sender is also a receiver, so another approach is for the sender to wait for its own multicast or Query to be delivered and then make note of the View

13

How do we know who sent a message?

You can just include the sender’s Address in the arguments to the message

Cool Isis2 fact: After you see a View notifying you that some member has failed or voluntarily left the group, you will never receive additional multicasts from that sender!

If a process leaves a group but then tries to send in it, Isis2 throws an exception in that sender.

14

No messages from the dead

In the Isis2 system, you never receive messages from the deceased

Isis2 watches for “late” messages that came from a process which is already considered to have died

It actively blocks such messages and won’t deliver them

Thus if you reconfigure after a failure, and reassign roles, you can’t get a kind of split-brain effect due to late delivery of a message

http://www.shop.darkfuse.com/messages-from-the-dead-sandy-deluca-kindle-ebook/

15

Ordering Properties

The most important form of message ordering is “total order”

Obtained by using g.OrderedSend or g.SafeSend

They both provide the same ordering guarantee, but they have different durability properties

Everyone receives these in the same order.

[Figure: timeline for processes p, q, r, s, t (time 0 to 70) with multicasts A and B. Everyone receives A first; everyone receives B second.]
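The total-order guarantee can be illustrated with a small standalone Python simulation. This is only a sketch under stated assumptions, not Isis2 code: a single hypothetical sequencer stamps each multicast, and every member delivers strictly in stamp order even though the simulated network reorders arrivals. The lecture does not say how OrderedSend or SafeSend assign their order internally.

```python
import random

class Member:
    """Delivers multicasts in sequence-number order, buffering early arrivals."""
    def __init__(self):
        self.next_seq = 0
        self.pending = {}      # seq -> message that arrived too early
        self.delivered = []

    def receive(self, seq, msg):
        self.pending[seq] = msg
        # Deliver as long as the next expected sequence number is available.
        while self.next_seq in self.pending:
            self.delivered.append(self.pending.pop(self.next_seq))
            self.next_seq += 1

def run(messages, member_names, seed=0):
    rng = random.Random(seed)
    stamped = list(enumerate(messages))    # hypothetical sequencer: 0, 1, 2, ...
    members = {name: Member() for name in member_names}
    for member in members.values():
        arrivals = stamped[:]
        rng.shuffle(arrivals)              # each member sees its own arrival order
        for seq, msg in arrivals:
            member.receive(seq, msg)
    return {name: m.delivered for name, m in members.items()}

orders = run(["A", "B", "C"], ["p", "q", "r", "s", "t"])
```

Every entry of `orders` is the same list, which is the property a state-machine style application relies on.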

16

Weaker ordering

Some applications want the lowest possible message latency

OrderedSend will usually achieve the lowest delay, but not always. (Slower case: when multiple group members are calling OrderedSend concurrently)

SafeSend uses a much slower approach.

For the very best speed, protocols guaranteed to be faster are available: Send and RawSend

17

A FIFO Ordering situation

Suppose one process sends all the multicasts that update some variable in a group. What ordering is really needed?

In this group, only the oldest living member sends multicasts

FIFO suffices!

[Figure: timeline for processes p, q, r, s, t (time 0 to 70). We say that p is the leader; it has rank 0. After p and q fail, r is the leader; it has rank 0 in the new view.]

18

A FIFO Ordering Situation

In this group we really only need to deliver messages in the order the leader sent them

For this purpose, the Send primitive is ideal

Send respects the FIFO order its sender used

Guaranteed to be extremely fast

RawSend: like Send, but with no effort to guarantee reliability. Respects FIFO order… unless a message is lost

19

What if two senders use Send?

When different senders use Send, the ordering will depend on when the messages showed up!

Different members might see different orderings

Example: r sees A B … but p sees B A

[Figure: timeline for processes p, q, r, s, t (time 0 to 70) with multicasts A and B from different senders.]
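To make the difference between per-sender FIFO and total order concrete, here is a small standalone Python check. It is a sketch, not Isis2 code; the sender and message names are hypothetical. Two members deliver the same multicasts in different interleavings, yet both deliveries respect the order each sender used.

```python
def respects_fifo(delivery, sent_by):
    """True if, for every sender, its messages appear in the order it sent them."""
    for sent in sent_by.values():
        seen = [m for m in delivery if m in sent]
        if seen != sent:
            return False
    return True

# Hypothetical scenario: two senders, two multicasts each.
sent_by = {"p": ["A1", "A2"], "q": ["B1", "B2"]}

at_r = ["A1", "B1", "A2", "B2"]   # one legal interleaving
at_s = ["B1", "B2", "A1", "A2"]   # a different, equally legal interleaving
bad = ["A2", "A1", "B1", "B2"]    # violates p's FIFO order

fifo_at_r = respects_fifo(at_r, sent_by)
fifo_at_s = respects_fifo(at_s, sent_by)
same_order = at_r == at_s
fifo_bad = respects_fifo(bad, sent_by)
```

Both `at_r` and `at_s` pass the FIFO check even though they differ from each other, which is exactly the situation where replicas can diverge if updates from different senders touch the same data.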

20

When is FIFO good enough?

Suppose our group manages a collection of data items

Each item has its own leader, and only the leader sends updates for that item

Consistency: It suffices to apply updates in the order they were sent. g.Send() will do this!

But beware… Multicasts from different senders can interleave in unpredictable ways

http://www.terrainparksafety.org/wp-content/uploads/2013/04/Caution.jpg

21

When would you use RawSend?

This primitive doesn’t guarantee reliability

We use it when reporting data from real-time sensors

We want the data delivered in order (new data replaces older data). RawSend is still FIFO ordered

But if data is lost, there is no point “wasting time” in the platform retransmitting it.
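A receiver for this kind of sensor feed can be sketched in a few lines of standalone Python. This is an illustration only, not Isis2 code: the application-level sequence number carried in each message is an assumption, used here so the receiver can notice gaps without asking for retransmission.

```python
class LatestReading:
    """Keeps only the newest sensor value; gaps from lost messages are tolerated."""
    def __init__(self):
        self.seq = -1
        self.value = None
        self.gaps = 0

    def on_message(self, seq, value):
        if seq <= self.seq:
            return False                    # stale or duplicate: drop it
        self.gaps += seq - self.seq - 1     # count messages that never arrived
        self.seq, self.value = seq, value   # newer data replaces older data
        return True

sensor = LatestReading()
for seq, value in [(0, 20.1), (1, 20.4), (3, 21.0)]:   # message 2 was lost
    sensor.on_message(seq, value)
```

The lost reading is simply counted and skipped, because by the time it could be retransmitted a newer value has already replaced it.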

22

What about Query ordering?

Each kind of multicast has an associated Query

Multicast      Matching Query
RawSend        RawQuery
Send           Query
OrderedSend    OrderedQuery
SafeSend       SafeQuery

23

CausalSend

Included mostly for academic reasons, but not used very often in Isis2

Intended for situations in which the leader role moves around for each data item

First p is in charge, then q is the leader for a while, then r, then back to p…

CausalSend will respect the FIFO order “with moving leaders”.

But we don’t recommend using it.

24

CausalSend picture: B is “after” A

[Figure: timeline for processes p, q, r, s, t (time 0 to 70) with multicasts A and B.]

25

Causality idea

If B “might have been caused by A”, then B is causally ordered after A (we write A → B)

CausalSend tracks these causality dependencies and makes sure that if A → B, then B will be delivered after A

But the Isis2 implementation of CausalSend is slow, and this is why it isn’t used very often
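One textbook way to decide whether one event “might have caused” another is to compare vector timestamps. The standalone Python sketch below shows that comparison. It is an assumption for illustration: the lecture does not say how CausalSend tracks dependencies, and the timestamps and process indices here are hypothetical.

```python
def happened_before(a, b):
    """True if vector timestamp a causally precedes b."""
    return a != b and all(x <= y for x, y in zip(a, b))

def concurrent(a, b):
    """True if neither timestamp causally precedes the other."""
    return a != b and not happened_before(a, b) and not happened_before(b, a)

# Index 0 = p, 1 = q, 2 = r (hypothetical three-process group).
A = (1, 0, 0)   # p sends A
B = (1, 1, 0)   # q delivers A, then sends B, so B depends on A
C = (0, 0, 1)   # r sends C without having seen A or B

a_before_b = happened_before(A, B)
b_before_a = happened_before(B, A)
a_c_concurrent = concurrent(A, C)
```

A causally ordered multicast must deliver A before B everywhere, while A and C may be delivered in either order.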

26

Exactly what happens in the event of a failure?

Durability

27

Durability

A durability guarantee is the property that information will survive a failure

There are several cases to think about:

What if the sender of a multicast fails but someone received the multicast?

What if the sender and every receiver (so far) fails?

What if a whole group fails, but later restarts?

What if the group is managing a replicated database or files that aren’t even on the same computers?

[Figure: timeline for processes p, q, r, s, t (time 0 to 70).]

28

Soft State in the Cloud

Many Isis2 applications run in cloud settings…

And the cloud favors “soft state”

After a node crashes, the entire VM is reloaded

Thus any local state (even local files) is restored to its original state! All local data vanishes

We say that a group manages “hard state” if the group members can fail and yet their state lives on

In the cloud a hard-state node costs more $$$

29

Two cases thus arise

Durability for soft-state scenarios

Here the entire state “lives in the group members”

They might have files, but the files won’t be preserved if those members crash and later restart, even on the same nodes.

Very common in today’s cloud

Durability for hard-state cases

Here the state really is outside the group

30

Multicast durability

Isis2 offers all-or-nothing delivery guarantees

Either every group member receives your multicast, or no group member receives it, even if the sender fails.

As we saw, if a sender fails, its messages will be delivered before Isis2 reports the failure

But this statement didn’t explain what happens when a receiver crashes “instantly”

31

Two options: Optimistic/Pessimistic

Optimistic case (Send, CausalSend, OrderedSend):

Messages are delivered instantly on arrival (low delay)

But if the sender and all receivers with copies fail, an optimistic message is lost forever, even though it might have been delivered to some processes right before they crashed

An optimistic protocol always looks like it was all-or-nothing, but if you could see the details, you might see that a message was in fact delivered and then “forgotten”

32

Optimistic delivery

Consider messages B and C

B was delivered to r, s and t. But it didn’t reach p and q because of a network failure.

C was delivered by p and q but never reached r, s and t

But notice that p and q both crashed

In a soft-state case, no evidence survived (unless they talked to someone outside the group, such as an external client)

In effect, the surviving portion of the system is consistent

[Figure: timeline for processes p, q, r, s, t (time 0 to 70) with multicasts A, B and C.]

33

Optimistic delivery is fastest

We deliver messages as soon as they arrive

But the price of this speed (which is a big benefit) is that these two “bad cases” can arise.

Nobody can tell when these things happen, unless p or q talked to an external client

… which leads to the idea of g.Flush(k)

34

How does Flush(k) work?

g.Flush(n) pauses until n group members definitely have all the prior optimistic multicasts.

g.Flush() waits for all members, but this is slow

Normally n=2 or n=3 is fine…

By calling g.Flush(2) or g.Flush(3) before talking to an external client, we can be sure these bad cases will not occur!
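The waiting condition behind Flush(k) can be modelled with a short standalone Python sketch. This is not the Isis2 implementation; it is a toy tracker under two labelled assumptions: copies are counted per message, and the sender's own copy counts toward k.

```python
class FlushTracker:
    """Tracks which members hold a copy of each optimistic multicast."""
    def __init__(self):
        self.copies = {}                     # message id -> set of holders

    def sent(self, msg_id, sender):
        self.copies[msg_id] = {sender}       # assumption: sender's copy counts

    def acked(self, msg_id, member):
        self.copies[msg_id].add(member)

    def flush_ready(self, k):
        """True once every prior multicast is held by at least k members."""
        return all(len(holders) >= k for holders in self.copies.values())

tracker = FlushTracker()
tracker.sent("B", "p")
before = tracker.flush_ready(2)         # only p holds B, so a Flush(2) would wait
tracker.acked("B", "q")
after_one_ack = tracker.flush_ready(2)  # p and q hold B
needs_three = tracker.flush_ready(3)    # a Flush(3) would still wait
```

In this model a caller would reply to the external client only after `flush_ready(k)` becomes true, so the client never observes an update that fewer than k members hold.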

35

With g.Flush(k)…

… those stray delivery events can still occur, but we know that no external observer notices them!

If g.Flush(3) is called prior to talking to the observer, then the Flush waits until there are 3 or more copies of the message.

In our example the crash would have occurred while we were waiting for g.Flush() to finish

If a tree falls in a forest… If a message is delivered but every process that saw it crashes, the effect is the same as if the message wasn’t delivered!

37

When to call g.Flush(k)

Use this primitive when working with optimistic multicast protocols like Send and OrderedSend

Call it prior to interacting with something outside of the group, like an external client who issued a request

With g.Flush after g.OrderedSend, we get the guarantee that the group won’t forget the update. Without g.Flush, an unlikely failure sequence could cause a problem (sender+first recipients all die).

38

Pessimistic Delivery

SafeSend is much more pessimistic

This protocol is a kind of 2-phase commit

Gives the message to recipients, and they hold it (Two cases: In-memory logging, or on-disk logging)

When all have confirmed receipt, then delivery is authorized

No g.Flush(): it wouldn’t ever need to wait

39

Where’s the durable state?

SafeSend raises a question of where the state lives

For our optimistic protocols, state lives in the group

But Isis2 can also support two more cases:

State lives in a checkpoint that will be reloaded if the whole group shuts down and restarts

State lives in a database or in files external to the group

SafeSend with disk logging aims at this second case

40

Should I always use SafeSend?

The SafeSend protocol is very costly and scales poorly, so it isn’t a great choice in the cloud

Also, using it correctly is a bit tricky

Better rule of thumb: use g.OrderedSend+g.Flush

41

Sidebar: Paxos family of protocols

Experts in this area will know about Leslie Lamport’s famous Paxos protocol (Wikipedia has a nice writeup)

It provides ordered, durable “actions”

These are often updates to a replicated database

SafeSend is the Isis2 name for Paxos

You don’t really need to learn about Paxos to understand how SafeSend works, but I’ll include some comments aimed at people who do know about Paxos in this lecture, simply because that work is so famous.

42

How Paxos works

Paxos is basically a kind of 2-phase commit

In the first phase a leader proposes some action (for us, a multicast)

A quorum of group members (the acceptors) need to vote in favor of the proposed ordering for the message, and they need to first save it in a durable place (usually a log that lives on the disk)

In the second phase, delivery occurs (in Paxos: the learners are informed about the new event)

43

Paxos has a notion similar to Flush(k)

In Paxos you can specify the number of “acceptors” that must have a copy of a message before it can be delivered.

In Isis2 this same parameter can be set by calling g.SetSafeSendThreshold(k)

SafeSend is a true implementation of Paxos if this number is more than half the group members.

With a smaller k, like k=2 or k=3, in a big group, SafeSend starts to act exactly like g.OrderedSend()+g.Flush(k)
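The “more than half” condition is easy to get wrong by one, so here is a tiny standalone Python check of it. The function name is hypothetical and this is not part of the Isis2 API; it only encodes the majority rule stated above.

```python
def is_majority(k, group_size):
    """True if k acceptors are strictly more than half of the group."""
    return 2 * k > group_size

four_member_k3 = is_majority(3, 4)    # 3 of 4 is a majority
four_member_k2 = is_majority(2, 4)    # exactly half is not enough
big_group_k3 = is_majority(3, 32)     # k=3 in a 32-member group is not a quorum
```

The last case is the regime in which the threshold behaves like a Flush(k) rather than like a Paxos quorum.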

44

Isis2: Send vs. SafeSend

Send scales best, but SafeSend with modern disks (RAM-like performance) and small numbers of acceptors isn’t terrible.

45

Jitter: how “steady” are latencies?

[Figure: variance from mean, 32-member case.]

The “spread” of latencies is much better (tighter) with Send: the 2-phase SafeSend protocol is sensitive to scheduling delays

46

Flush delay as function of shard size

Flush is fairly fast if we only wait for acks from 3-5 members, but is slow if we wait for acks from all members.

After we saw this graph, we changed Isis2 to let users set the threshold.

47

Putting our insights to work…

Several ways to make data durable

48

Checkpointing

Any group can be made durable using a checkpointing file

Call g.Persistent(filename)

The checkpoint will periodically be saved, or you can force the creation of checkpoints at times convenient to you

The entire group shares a single checkpoint file, and it would normally live in the global file system. It should not live in any sort of soft-state file system!

On restart from a total shutdown, checkpoint is reloaded and the group recovers to its old state

49

External databases

If a group is being used to replicate something like a set of external MySQL databases, recovering the group state just isn’t good enough

We also need to make sure the MySQL replicas are in identical states after a recovery

This is the case where we use SafeSend with the disk-logging option enabled

50

What is the disklogger?

The disklogger is a special form of logged checkpoint, similar to the one used for g.Persistent()

But whereas normally there is just one durability log, this log is replicated with one copy per acceptor

Messages delivered by SafeSend are appended to this log during phase one

When an acceptor restarts, its log is scanned and “replayed”. Isis2 will garbage collect a message once all the learners have seen it

51



[Figure: a method invocation arrives at some service with members A, B, C and D, each with its own DB. Updates are multicast with SafeSend.]

Use the Isis2 version of Paxos to replicate an external database

52



[Figure: a method invocation arrives at some service with members A, B, C and D, each holding an in-memory collection. Updates are multicast with Send, followed by g.Flush().]

Cheaper multicast+Flush suffices with in-memory replicas or other situations with soft state, like files local to the replicas on VMs that will be reloaded if a crash occurs

53

Check your understanding

Suppose we use SafeSend as shown in the figure, with 4 group members, and all are acceptors

You send 1 message. How many disk writes occur?

At least 4 (one per log) and perhaps 8 (the database may have a log too). Also, the database needs to be updated!

54

Recovery with an external database is a pain!

g.SetDurabilityMethod

55

SetDurabilityMethod

You must tell SafeSend to use the DiskLogger durability method

When you do this, SafeSend has an extremely strong guarantee: it won’t ever forget messages until it is explicitly told to do so by your code

This yields a version suitable for use when replicating a database

56

Recovering a database replica

After restarting a failed database replica, SafeSend with the DiskLogger durability method will replay all messages that it knows about

Your job is to make sure all of these updates have been applied to the database, exactly once

After that you tell SafeSend it can safely garbage collect these messages, and it does so when every group member has told it that the message is safe to garbage collect (at that point, it truncates the disk log)
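The “exactly once” obligation during replay can be sketched in standalone Python. This is a toy model, not Isis2 code: the message ids, the `Replica` class and the in-memory `applied` set are all assumptions for illustration. In a real database the record of applied ids would have to be stored atomically with the data itself.

```python
class Replica:
    """Toy database replica that applies each logged update at most once."""
    def __init__(self):
        self.balance = {}
        self.applied = set()      # ids of updates already reflected in the data

    def apply(self, msg_id, key, delta):
        if msg_id in self.applied:
            return False          # replayed duplicate: skip it
        self.balance[key] = self.balance.get(key, 0) + delta
        self.applied.add(msg_id)
        return True

def replay(replica, log):
    """Replay a recovered log; return the ids that are now safe to collect."""
    for msg_id, key, delta in log:
        replica.apply(msg_id, key, delta)
    return [msg_id for msg_id, _, _ in log]

db = Replica()
db.apply(1, "Harry", 5)                              # applied before the crash
recovered_log = [(1, "Harry", 5), (2, "Harry", 3)]   # log replayed on restart
safe_to_collect = replay(db, recovered_log)
```

Because the updates here are increments, applying message 1 twice would corrupt the balance, which is why the duplicate check matters.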

57

Why not always use SafeSend?

SafeSend is harder to use

Must write code to handle replay of the log after recovery.

And SafeSend is also slower

Many people who assume Paxos is lightweight are surprised that all Paxos systems have high costs

Paxos is really a kind of durable database: a database of messages!

58

Durability Summary

To recap:

If your application maintains data purely inside the members of the group, or purely in memory, you can use the standard “optimistic” methods

Call g.Flush(k) if worried about the tree-in-the-forest case

Use checkpointing to a log (g.Persistent()) to make the group state survive complete shutdowns

But switch to SafeSend for the strongest durability requirements. You’ll need to enable the DiskLogger durability method, and to write code to handle restarts and to tell SafeSend when it can garbage collect the log.

59

How does one make a checkpoint?

Making Checkpoints

60

State transfer

In general, group members manage data (state)

When s and t join in this example, they don’t have the current state for the group. They obtain it via a state transfer: the white arrow.

In this example, p “writes down” its state (a checkpoint)

Then s and t “load” the state (they read the checkpoint)

[Figure: timeline for processes p, q, r, s, t (time 0 to 70). The white arrow is a state transfer.]

61

Making a checkpoint

You can save any state you wish

You can call SendChkpt as many times as needed

    // Delegate types for the two loaders (declared by the application at class scope)
    delegate void loadichkpt(int what);
    delegate void loaddchkpt(double what);

    int istuff;
    double dstuff;

    g.MakeChkpt += (Isis.ChkptMaker)delegate(View nv) {
        g.SendChkpt(istuff);    // Checkpoint a single integer
        g.SendChkpt(dstuff);    // Checkpoint a single floating point value
        g.EndOfChkpt();         // Finished making the checkpoint
    };

    g.LoadChkpt += (loadichkpt)delegate(int what) {
        IsisSystem.WriteLine(name + ": Got integer checkpoint: istuff=" + what);
        istuff = what;
    };

    g.LoadChkpt += (loaddchkpt)delegate(double what) {
        IsisSystem.WriteLine(name + ": Got double checkpoint: dstuff=" + what);
        dstuff = what;
    };

62

Steps

The MakeChkpt method is called from time to time in your program. You can control exactly when this will happen

That updates the log files

Later, after restart, the LoadChkpt method(s) will be called to reload the saved state

63

To make a group persistent, store it in a global file system

It will be loaded into the NEXT instance that runs

    int istuff;
    double dstuff;

    g.MakeChkpt += (Isis.ChkptMaker)delegate(View nv) {
        g.SendChkpt(istuff);    // Checkpoint a single integer
        g.SendChkpt(dstuff);    // Checkpoint a single floating point value
        g.EndOfChkpt();         // Finished making the checkpoint
    };

    g.LoadChkpt += (loadichkpt)delegate(int what) {
        IsisSystem.WriteLine(name + ": Got integer checkpoint: istuff=" + what);
        istuff = what;
    };

    g.LoadChkpt += (loaddchkpt)delegate(double what) {
        IsisSystem.WriteLine(name + ": Got floating point checkpoint: dstuff=" + what);
        dstuff = what;
    };

Note: You must also call myGroup.Persistent(gname); this tells Isis2 to keep checkpoints in a file (in this case with the same name as the group).

There are also ways to control when the checkpoint will be made

64

Why did we register two loaders?

Isis2 is polymorphic

Each method can be defined many times with different type signatures

As events occur, upcalls are done to the ones that match

In our examples we had just one argument to SendChkpt(), but we could have given many:

    g.SendChkpt(x, y, z, ....);

Any data type is allowed, but you must register user-defined types with Isis2 first

65

State transfer uses checkpoints!

If the checkpoint methods are defined, Isis2 will ask for a checkpoint just as a new member joins

The old member makes the checkpoint

The new member loads it

This initializes the joining member

[Figure: group “myGroup”. Labels: state transfer, update, update.]

66

Can we tell what a checkpoint will be used for? Can we do “per use” checkpoints?

Persistent or just State Transfer?

67

What are checkpoints used for?

When you define a checkpoint create/load method, that automatically enables state transfer for joining members

With g.Persistent(), a checkpoint plays two roles: it is also logged into a recovery log file that will be reread after recovery from a total shutdown

68

State transfer could be s..l..o..w..

And while it happens, the group freezes up!

What if the group state is large?

[Figure: timeline for processes p, q, r, s, t (time 0 to 70) with multicasts A and B.]

69

What if the state is very large?

Really large states can be slow to transfer. While they are being sent, the group itself might hiccup

Best solution? Pre-transfer that huge state, perhaps using the highly efficient “Isis OOB” tool

Out-of-band transfer is minimally disruptive, and faster too, because the Isis2 system optimizes heavily for this

But perhaps a few updates might occur after the pre-transfer and before the member is added.

So you can include an argument to Join that tells how big the pre-transfer was, or what “time” it was made.

Then the checkpoint only needs to include the delta!

70

Pretransfer

In this picture we send data to r, s and t “out of band”

Isis2 has a tool for that, the OOB file transfer tool. Ideal for big copying

[Figure: timeline for processes p, q, r, s, t (time 0 to 70). When they join, we send just the residual delta…]

71

Enabling this feature

Instead of calling g.Join(), call g.Join(offset)

Offset tells the group how much of the state you have.

It shows up in the View argument to the make checkpoint method

Offset 0 means “send the whole state”

Example: pretransfer included updates 0… 12345. So you call g.Join(12345). The state transfer contains just updates 12346-12348…
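The checkpoint maker's side of this can be sketched in standalone Python. This is an illustration, not Isis2 code: it assumes the group keeps a numbered update log and that the offset is the number of the last update the joiner already holds, as in the example above. The function and variable names are hypothetical.

```python
def make_delta(update_log, offset):
    """Return the updates a joiner still needs, given its pre-transfer offset."""
    if offset == 0:
        return list(update_log)     # offset 0 means "send the whole state"
    return [(seq, u) for seq, u in update_log if seq > offset]

# Hypothetical tail of the group's update log.
log = [(seq, f"update-{seq}") for seq in range(12340, 12349)]

delta = make_delta(log, 12345)      # joiner already has updates 0..12345
delta_seqs = [seq for seq, _ in delta]
full = make_delta(log, 0)
```

With the offset from the example, only the three updates made after the pre-transfer need to travel in the checkpoint.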

72

What happens in an application that experiences many “events” all at the same time?

When does State Transfer occur?

Isis2 has a strong consistency model: a new form of virtual synchrony.

73

Virtual synchrony is a “consistency” model:

Membership epochs: begin when a new configuration is installed and reported by delivery of a new “view” and associated state

Protocols run “during” a single epoch: rather than overcome failure, we reconfigure when a failure occurs

[Figure: three executions compared. Non-replicated reference execution (A=3, B=7, B = B-A, A=A+1); synchronous execution; virtually synchronous execution. The latter two are timelines for processes p, q, r, s, t (time 0 to 70).]

74

What Isis2 ensures is that...

State transfer “seems” to occur at the instant when a new view is delivered (all prior multicasts have already been performed)

This means that the member preparing the state has the correct values for the state variables needed by the joining member!

It is “safe” to send this state

If desired, there is a way for you to specify which member will send state to each joining process

75

How do Queries handle failure?

Queries when failures occur…

    Group g = new Group("myGroup");
    Dictionary<string,double> Values = new Dictionary<string,double>();

    // Called each time a new view (membership) is delivered
    g.ViewHandlers += delegate(View v) {
        Console.Title = "myGroup members: " + v.members;
    };

    // Update handler: every replica applies the change
    g.Handlers[UPDATE] += delegate(string s, double v) {
        Values[s] = v;
    };

    // Lookup handler: each member replies with its local value
    g.Handlers[LOOKUP] += delegate(string s) {
        g.Reply(Values[s]);
    };

    g.Join();

    g.Send(UPDATE, "Harry", 20.75);

    List<double> resultlist = new List<double>();
    int nr = g.Query(ALL, LOOKUP, "Harry", EOL, resultlist);

First sets up group

Join makes this entity a member. State transfer isn’t shown

Then can multicast, query. Runtime callbacks to the “delegates” as events arrive

Easy to request security (g.SetSecure), persistence

“Consistency” model dictates the ordering seen for event upcalls and the assumptions user can make

76

77

This example used g.Reply

Also available:

g.AbortReply(): throws an exception in the Query caller

g.NullReply(): the member doesn’t contribute any value, but the caller won’t wait for it (useful with ALL)

g.NoReply(): a risky option, like NullReply but no message of any kind is sent to the caller

Query can also specify an Isis “Timeout”

new Timeout(delay_ms, action)

Action is: TO_NULLREPLY, TO_FAILURE, TO_ABORT

78

How can a caller sense missing replies?

The caller is told how many replies it got

If you expected 3 but got 2, either someone failed, or they used g.NullReply() to “opt out”

But when you issue the Query you won’t know who is going to be in the group at the time of delivery!

This is why it often makes sense for replies to specify that “this is reply R of N” (R=rank, N=size of view)
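On the caller's side, the “reply R of N” convention makes missing replies easy to compute. The standalone Python sketch below shows the idea. It is not Isis2 code: the reply format, a (rank, view size) pair attached to each reply, is an assumption about how an application might encode it.

```python
def missing_ranks(replies):
    """replies: list of (rank, view_size) pairs carried in each reply.

    Returns the ranks that did not answer, or None if no reply arrived
    (in which case the view size is unknown).
    """
    if not replies:
        return None
    sizes = {n for _, n in replies}
    if len(sizes) != 1:
        raise ValueError("replies disagree on view size")
    n = sizes.pop()
    return sorted(set(range(n)) - {r for r, _ in replies})

one_missing = missing_ranks([(0, 3), (2, 3)])    # rank 1 failed or opted out
none_missing = missing_ranks([(0, 2), (1, 2)])
no_replies = missing_ranks([])
```

Because every recipient receives the Query in the same view, all replies should report the same N, so a disagreement is treated as an error in this sketch.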

79

Lecture Summary

Isis2 gives you control over:

How durable multicasts and group data will be

How strongly ordered they will be

Whether to wait until a multicast has reached k of the destinations before you talk to external observers

Using these forms of control, you can program exactly the behavior you need in a given setting
