Rewind, Repair, Replay: Three R’s to improve dependability Aaron Brown and David Patterson ROC Research Group University of California at Berkeley SIGOPS European Workshop, 23 September 2002

Jan 13, 2016

Rewind, Repair, Replay: Three R’s to improve dependability

Aaron Brown and David Patterson

ROC Research Group, University of California at Berkeley

SIGOPS European Workshop, 23 September 2002

Aaron Brown
As part of ROC..., one of the topics we've been working on is building tools to better support human operators, since human operator errors are one of the biggest impediments to achieving highly-dependable systems. What I'm going to talk about today is one such tool--essentially, an undo mechanism that lets human operators recover quickly when they make mistakes or take actions that have unintended or unexpected consequences. The hope is that by providing fast recovery and letting operators go back in time to retroactively fix bad decisions, we will end up with systems that can be fixed more quickly and thus will have better overall dependability.
Aaron Brown
I'm going to talk about systems that support Rewind, Repair, and Replay...in our view, Three R's that will help produce more dependable computer systems.
Slide 2

What if computer systems could travel in time?

• We could have retroactive repair
  – travel back and fix problems before they had a chance to corrupt data
• We could eliminate human operator error
  – make a mistake? Just travel back and try it again.
• Our systems could be more robust
  – we could eliminate the dangers of upgrades
  – we could better tolerate buggy software
  – we might even be able to tolerate viruses and hackers
• We could make more dependable systems

Slide 3

Sci-fi vs. computer time travel

• Sci-fi time travel
  – our hero loses a loved one or lives through disaster
  – hero uses time machine to travel back in time
  – hero alters the past to avert the future disaster
  – hero returns to the present; past changes have been merged into the original timeline
• Computer time travel
  – human error, software bug, or attack causes data loss
  – Rewind: roll system state backwards in time
  – Repair: make changes to avert foretold disaster
  – Replay: roll system state forward, merging the original timeline with the effects of repairs
• Three R’s are the fundamental primitives of computer time travel

Slide 4

Key properties of the 3R’s

• Recovery from problems at any system layer
  – rewind, repair, replay cover OS through application
• Recovery from unanticipated problems
  – arbitrary repair
• No assumptions about correct application behavior
  – physical rewind
• Integrated interface
  – provide “undo for sysadmins”

Slide 5

What about existing approaches?

Approach | Rewind / Repair / Replay | Comments
Backups, checkpoints, snapshots, no-overwrite storage | physical | read-only view of history
RDBMS log replay | physical | application-level only; cannot alter committed transactions
Workflow w/ compensating transactions | phys/log | limited apps; mechanisms not usefully integrated for time travel
Timewarp (PARC collaborative productivity apps) | logical (limited) | application-level only; repair limited to well-understood history edits

Aaron Brown
Before we go any further, we should stop and ask if any existing systems already provide these primitives/this model of time travel?
Slide 6

Designing a 3R system

• Goals
  – application-neutrality
  – provide abstractions for reasoning about 3R behavior
• Target domain: network services
  – accessed by remote users via well-defined interfaces
  – email, messaging, e-commerce, auctions, forums, web hosting, enterprise applications (J2EE, .NET), ...
• Challenges, learned from first attempt
  – integrating history and repair during replay
  – managing inconsistency in externally-visible state

Aaron Brown
This is, as we saw, the area where most existing systems are lacking
Aaron Brown
Tell story of first implementation, BRIEFLY!
Slide 7

Basic architecture

• Application-independent undo manager
  – coordinates 3R cycle; manages external inconsistencies
  – linked via a set of APIs to application, time-travel storage, history log, and control UI

[Diagram: Undo Manager connected to the App. Service (includes user state, application, operating system), the History Log, the Time-travel storage layer, and the Control UI. Labels: “3R API”, “control”.]

Aaron Brown
Key here is in this 3R API--the linkage between this generic undo manager and the specific application. To have that link and keep the undo manager app-independent, it needs to have a model of the app.
Slide 8

Abstracting the application service

• To the undo manager, the application is:
  – a collection of state
  – a history of events affecting the state
    » an event is typically a user interaction with the service
  – a model of acceptable external consistency
• These are encoded into application-defined verbs
  – high-level encodings of user interactions (events)
    » records of intent to alter state, not actual state changes
  – reference application state by opaque UIDs
  – provide policies that define external consistency

Aaron Brown
This history defines the application's timeline
Slide 9

Verbs and the 3R cycle

• Normal operation
  – undo manager logs application-provided verbs to disk

[Diagram: basic architecture as on Slide 7 (App. Service, Undo Manager, History Log, Time-travel storage layer, Control UI), with “User interaction” and “Verbs” labels added.]

Aaron Brown
So let's take a look at how these verbs fit into the 3R cycle
Slide 10

Verbs and the 3R cycle

• Rewind
  – time-travel storage layer reverts system hard state to rewind point
  – all changes since rewind point are discarded

[Diagram: basic architecture as on Slide 7 (App. Service, Undo Manager, History Log, Time-travel storage layer, Control UI).]

Aaron Brown
Entirely physical--no verbs involved here
Slide 11

Verbs and the 3R cycle

• Repair
  – operator edits logged history and/or makes arbitrary changes to system

[Diagram: basic architecture as on Slide 7 (App. Service, Undo Manager, History Log, Time-travel storage layer, Control UI), with “Repairs” and “Edits” labels added.]

Slide 12

Verbs and the 3R cycle

• Replay
  – undo manager feeds verbs back to application for re-execution in the context of repaired system

[Diagram: basic architecture as on Slide 7 (App. Service, Undo Manager, History Log, Time-travel storage layer, Control UI), with a “Verbs” label added.]
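The four phases on Slides 9 to 12 (normal operation, rewind, repair, replay) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the prototype's code: `UndoManager`, `MailService`, and `checkpoint` are hypothetical names, a deep copy stands in for the time-travel storage layer, an in-memory list stands in for the on-disk history log, and a filter that reverses message text plays the role of the bug being repaired.

```python
import copy


class UndoManager:
    """Toy undo manager: logs verbs, rewinds state, replays verbs."""

    def __init__(self, service):
        self.service = service   # the application service
        self.history = []        # history log of verbs (intent, not state changes)
        self.snapshot = None     # stand-in for the time-travel storage layer

    def checkpoint(self):
        # Record a rewind point; verbs logged from here on can be replayed.
        self.snapshot = copy.deepcopy(self.service.state)
        self.history.clear()

    def execute(self, verb, *args):
        # Normal operation: log the verb, then let the service run it.
        self.history.append((verb, args))
        getattr(self.service, verb)(*args)

    def rewind(self):
        # Physical rewind: discard every state change since the rewind point.
        self.service.state = copy.deepcopy(self.snapshot)

    def replay(self):
        # Re-execute logged verbs in the context of the repaired system.
        for verb, args in self.history:
            getattr(self.service, verb)(*args)


class MailService:
    def __init__(self):
        self.state = {"Inbox": []}
        self.filter = lambda text: text[::-1]   # buggy filter reverses mail

    def deliver(self, text):
        self.state["Inbox"].append(self.filter(text))


service = MailService()
undo = UndoManager(service)
undo.checkpoint()
undo.execute("deliver", "Hello")
corrupted = list(service.state["Inbox"])   # state produced by the buggy filter

undo.rewind()                              # Rewind
service.filter = lambda text: text         # Repair: replace the buggy filter
undo.replay()                              # Replay
repaired = list(service.state["Inbox"])    # state after re-executing the verb
```

Because the log holds the intent ("deliver Hello") rather than the resulting state, the replayed delivery picks up the repaired filter automatically.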

Slide 13

The fundamental roles of verbs

• Providing application-independence
  – verbs encapsulate application semantics, but remain semi-opaque to undo manager
• Integration of repair into history
  – high-level specification of intent makes verbs relatively independent of system changes
  – verbs are re-executed, not restored, so they inherit effects of repairs
• Scoping restored history
  – only changes logged as verbs will be preserved by 3Rs
    » effects of bugs, corruption, human error are discarded
  – can reason about what is preserved/lost in 3R cycle

Aaron Brown
give example of swapping out email server implementations
Slide 14

Managing external inconsistency

• External inconsistency == time paradox?
  – system is internally-consistent after a 3R cycle
  – but external observers see inexplicable state changes
  – external inconsistency is OK unless affected state was externalized (observed) before the 3R cycle
• Coping with external inconsistency
  – cannot eliminate
  – must manage: ignore, explain, compensate, encompass
• Verbs let us manage external inconsistency

Aaron Brown
If you think of sci-fi time travel, the world is always internally self-consistent, but paradoxes arise when there's an outside observer that's isolated somehow from the timeline--often, this is the time-traveler himself.
Slide 15

Managing inconsistency with verbs

• To detect inconsistencies:
  – verbs specify the state that they depend upon
  – undo manager tracks signatures of that state
  – if verb is altered or if signatures don’t match, there is an inconsistency
    » applications supporting relaxed consistency can replace signature-check with arbitrary consistency predicates
• To detect state viewed externally:
  – verbs indicate what state they externalize
    » example: IMAP fetch verb externalizes email message
• To handle externalized inconsistencies:
  – verb supplies compensation functions

Aaron Brown
that can examine state of system and decide what, IF ANYTHING, to do
Aaron Brown
I can give some examples of this if you ask me offline or during the Q&A
Slide 16

Email example: original timeline

[Diagram: timeline with rows for system boundary, system state (Inbox, Folder1), verbs, and history log, plotted against time. The delivered input “Hello” appears in system state as “olleH”.]

Verbs recorded in the history log:

• DeliverMsg (Deliver m) + input “Hello”
  – Externalizes: none; ContentDep: none; ExistsDep: Inbox
• MoveMsg (Move)
  – Externalizes: none; ContentDep: none; ExistsDep: Inbox, Folder1
• FetchMsg (Fetch m) + Signature(m)=“olleH”
  – Externalizes: m; ContentDep: m; ExistsDep: m, Folder1

Aaron Brown
content dep => "content matters to external consistency"
Slide 17

Email example: replay timeline

[Diagram: same layout as the original timeline (system boundary, system state with Inbox and Folder1, verbs, history log, time). In the replay timeline the message appears in system state as “Hello”.]

Verbs replayed from the history log:

• DeliverMsg (Deliver m) + input “Hello”
  – Externalizes: none; ContentDep: none; ExistsDep: Inbox
• MoveMsg (Move)
  – Externalizes: none; ContentDep: none; ExistsDep: Inbox, Folder1
• FetchMsg (Fetch m) + Signature(m)=“olleH”
  – Externalizes: m; ContentDep: m; ExistsDep: m, Folder1
  – mismatch! => inconsistency
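The mismatch in this email example can be reproduced with a small sketch. The names (`signature`, `LoggedVerb`, `check_replay`) are hypothetical, and SHA-256 is an assumed choice of signature function; the slides only say that the undo manager tracks signatures of the state a verb depends on.

```python
import hashlib
from dataclasses import dataclass, field


def signature(content: str) -> str:
    # Assumed signature function; the slides do not name one.
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


@dataclass
class LoggedVerb:
    name: str
    externalizes: tuple = ()   # UIDs of state shown to the outside world
    content_dep: tuple = ()    # UIDs whose content matters to consistency
    exists_dep: tuple = ()     # UIDs that merely have to exist
    signatures: dict = field(default_factory=dict)  # UID -> signature at log time


def check_replay(verb: LoggedVerb, state: dict) -> list:
    """Return the UIDs whose replayed content no longer matches the log."""
    return [uid for uid in verb.content_dep
            if signature(state[uid]) != verb.signatures[uid]]


# Original timeline: message m was stored as "olleH" and then fetched.
fetch = LoggedVerb(
    name="FetchMsg",
    externalizes=("m",),
    content_dep=("m",),
    exists_dep=("m", "Folder1"),
    signatures={"m": signature("olleH")},
)

# Replay timeline: after repair, m holds "Hello".
replayed_state = {"m": "Hello", "Folder1": ["m"]}
mismatched = check_replay(fetch, replayed_state)

# An inconsistency only needs handling if the state was externalized.
needs_compensation = [uid for uid in mismatched if uid in fetch.externalizes]
```

DeliverMsg and MoveMsg declare no content dependencies, so under this model only FetchMsg can trigger a mismatch, and only because it externalized `m`.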

Slide 18

Recap: 3R architecture

• Goal: application-neutral implementation of 3R’s
  – verb abstraction couples generic undo manager to app.
  – verbs provide tools to reason about 3R behavior
• Challenges
  – integrating history and repair during replay
    » re-executing verbs restores intent of history
  – managing inconsistency in externally-visible state
    » verbs track externalization, state dependencies, and define compensations

Slide 19

Status

• Prototype implementation of 3R primitives nearly complete
  – app-independent undo manager written in Java
  – all APIs defined as Java interfaces
  – Network Appliance filer as time-travel storage layer
  – BerkeleyDB as history log
• First target app: web-based email service
  – 3R-enhanced JavaMail API provider classes
    » plus additional hooks to verb-ify operator maintenance tasks like account creation
  – JWebMail web front-end
  – RDBMS-based backend mail store (DB2 or MySQL)
  – implementation in progress

Slide 20

Open issues & future work

• Resource impact of the 3R’s
  – what are the performance/space penalties for the 3R’s?
• Verb definition
  – can we specify verbs & consistency policy declaratively?
• Providing the 3R’s at multiple granularities
  – can we track & manage cross-granularity dependencies?
• Measuring the dependability benefit of 3R’s
  – how do we build recovery/dependability benchmarks?
• Other uses for verb-based characterizations
  – easy georeplication? online self-checking? automatic verification of upgrades?

Slide 21

Conclusions

• We can build time travel for computers
  – using the 3R’s: Rewind, Repair, Replay
• An architecture for the 3R primitives
  – generic undo manager coupled to application by verbs
• Verbs are a useful abstraction for the 3R’s
  – can use to reason about effects of 3R’s on state
  – help address problem of external inconsistencies
• Prototype 3R-enabled email system under construction
  – hope to demonstrate increased dependability and faster recovery from problems

Rewind, Repair, Replay: Three R’s to improve dependability

For more information:

http://roc.cs.berkeley.edu/
[email protected]


Slide 23

Backup slides

Slide 24

Verbs vs. transactions

• Both encapsulate state-altering events
• But, unlike transactions:
  – verbs are higher-level, recording end-user intent, not specific state changes
  – verbs do not depend on internal data models (but do depend on external protocols)
    » transactions are the reverse
  – verbs do not necessarily conform to ACID consistency
    » verbs inherit consistency model provided by application at the external-protocol level

Slide 25

Implementing verbs

• Verbs are defined by a type hierarchy
  – base type defines interfaces for state dependencies, externalizations, predicates, compensations
  – applications subclass the base type for their verbs
    » additions to the type are opaque to the undo manager
• Referencing state
  – all user-visible state named by time-invariant UIDs
  – undo manager requires signature method for all state
• Consistency predicates and compensations are application-supplied functions
  – they encode the app’s external consistency model
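A type hierarchy of this shape might look like the following sketch. The prototype defines its APIs as Java interfaces; this is a Python rendering for illustration only, and every name here (`Verb`, `state_dependencies`, `externalizes`, `is_consistent`, `compensate`, `FetchMsg`) is hypothetical rather than taken from the prototype.

```python
from abc import ABC, abstractmethod


class Verb(ABC):
    """Hypothetical base type: the part of a verb the undo manager can see."""

    @abstractmethod
    def state_dependencies(self) -> list:
        """UIDs of state this verb depends on."""

    @abstractmethod
    def externalizes(self) -> list:
        """UIDs of state this verb exposes to external observers."""

    def is_consistent(self, logged_sigs: dict, replay_sigs: dict) -> bool:
        # Default predicate: every dependency's signature must match.
        return all(logged_sigs.get(uid) == replay_sigs.get(uid)
                   for uid in self.state_dependencies())

    def compensate(self, uid: str) -> str:
        # Default compensation: ignore the inconsistency.
        return "ignore"


class FetchMsg(Verb):
    """Application subclass; its fields are opaque to the undo manager."""

    def __init__(self, msg_uid: str, folder_uid: str):
        self.msg_uid = msg_uid
        self.folder_uid = folder_uid

    def state_dependencies(self) -> list:
        return [self.msg_uid]

    def externalizes(self) -> list:
        return [self.msg_uid]

    def compensate(self, uid: str) -> str:
        return f"explain: message {uid} changed after it was read"


verb = FetchMsg("m", "Folder1")
ok = verb.is_consistent({"m": "sig-olleH"}, {"m": "sig-Hello"})
action = verb.compensate("m") if not ok else "none"
```

The undo manager in this sketch would only ever call the four base-type methods, which is what keeps it application-independent.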

Slide 26

Defining verbs

• Currently, verbs are defined procedurally
  – provide dependency information via lists of state IDs
  – provide functions for special consistency predicates
  – provide functions for compensation
• Better: declarative specification
  – compile textual specification into verb code using libraries of predicates and compensation fns
  – reduces complexity of adding 3R’s to the application
  – increases confidence in undo system via easier testing

Aaron Brown
With all these verbs to define, it sure would be nice if there were an easier way to define them
Aaron Brown
declarative spec. gets turned into Verb classes with stubs for compensation
Aaron Brown
parse specs, systematically violate predicates/ dependencies, verify compensation
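One way the declarative idea could look, purely as an illustration: the spec syntax and the `parse_spec` and `compile_verb` helpers below are invented for this sketch and are not the prototype's format. A textual spec is parsed into a dict, and a verb class is generated from it with a compensation stub for the application to fill in.

```python
SPEC = """
verb: FetchMsg
externalizes: msg
content_dep: msg
exists_dep: msg, folder
"""


def parse_spec(text: str) -> dict:
    """Parse 'key: a, b' lines into {'key': ['a', 'b']}."""
    spec = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(":")
        spec[key.strip()] = [v.strip() for v in value.split(",") if v.strip()]
    return spec


def compile_verb(spec: dict) -> type:
    """Generate a verb class with a compensation stub from a parsed spec."""

    def compensate(self, uid):
        # Stub: the application supplies the real compensation.
        raise NotImplementedError(f"no compensation defined for {uid}")

    return type(spec["verb"][0], (object,), {
        "externalizes": tuple(spec.get("externalizes", [])),
        "content_dep": tuple(spec.get("content_dep", [])),
        "exists_dep": tuple(spec.get("exists_dep", [])),
        "compensate": compensate,
    })


FetchMsg = compile_verb(parse_spec(SPEC))
```

With the dependencies available as data, a test harness could walk each spec, deliberately violate each dependency, and check that the compensation runs.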
Slide 27

External consistency policies

• Verbs capture external consistency policies
• Example: email
  – message order in folder is irrelevant
    » AppendMessage verb does not express dependency on content of target folder, only its existence
  – content of messages is relevant, except for headers
    » ReadMsg verb depends on hash of target message body; if changed, compensate by inserting explanatory text
• Example: e-commerce
  – order total depends on item prices, not descriptions
    » Checkout verb depends on prices of items in cart, not their hash-values; if sum of prices changed, compensate by emailing customer for approval

Aaron Brown
so here, instead of using a standard dependency you'd write a special predicate to check just the body
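A body-only predicate for the ReadMsg policy might be sketched as follows. The function names and the wording of the explanatory note are hypothetical, SHA-256 is an assumed hash, and the standard-library `email` parser is used only to separate headers from body.

```python
import hashlib
from email import message_from_string


def body_signature(raw_message: str) -> str:
    """Hash only the body, so header changes do not count as inconsistency."""
    body = message_from_string(raw_message).get_payload()
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def read_msg_compensation(logged_sig: str, replayed_message: str) -> str:
    """Return the message to show; insert an explanation if the body changed."""
    if body_signature(replayed_message) == logged_sig:
        return replayed_message
    headers, _, body = replayed_message.partition("\n\n")
    note = "[This message changed during system recovery.]\n"
    return headers + "\n\n" + note + body


original = "Subject: greeting\nX-Filter: v1\n\nHello\n"
logged = body_signature(original)

headers_only = "Subject: greeting\nX-Filter: v2\n\nHello\n"
body_changed = "Subject: greeting\nX-Filter: v2\n\nGoodbye\n"

same = read_msg_compensation(logged, headers_only)      # headers differ, body equal
explained = read_msg_compensation(logged, body_changed)  # body differs
```

A header-only change passes through untouched, while a changed body is delivered with the explanatory text inserted ahead of it.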
Slide 28

External consistency policies (2)

• Example: auctions
  – new bid must be larger than prior bids
    » PlaceBid verb depends on content of all bids in bid set; if one is now larger than new bid, compensate by canceling new bid and informing bidder
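The auction policy can be sketched the same way. `PlaceBid` here is a hypothetical Python class, the bid set is a plain list of amounts, and "informing the bidder" is modeled as appending to an outbox list.

```python
from dataclasses import dataclass


@dataclass
class PlaceBid:
    bidder: str
    amount: int
    bid_set: str   # UID of the auction's bid set

    def is_consistent(self, bids: dict) -> bool:
        # Policy from the slide: new bid must be larger than prior bids.
        return all(self.amount > prior for prior in bids[self.bid_set])

    def replay(self, bids: dict, outbox: list) -> None:
        if self.is_consistent(bids):
            bids[self.bid_set].append(self.amount)
        else:
            # Compensate: cancel the new bid and inform the bidder.
            outbox.append((self.bidder, f"bid of {self.amount} cancelled"))


# After repair, the bid set already holds a bid of 140.
bids = {"auction-1": [100, 140]}
outbox = []
PlaceBid("alice", 120, "auction-1").replay(bids, outbox)
PlaceBid("bob", 150, "auction-1").replay(bids, outbox)
```

Here the predicate replaces a plain signature check: the bid set is allowed to change during repair as long as the replayed bid still exceeds every prior bid.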

Slide 29

Application implications

• To support the 3R’s, an application must have:
  – a high-level, verb-structured interface/API for user, operator, and external actions
  – a state model where all user-visible state:
    » is nameable via the API
    » is tagged with GUIDs
    » supports a signature/hash method
  – a relaxed external consistency model that allows compensation for externalized inconsistent verbs

Slide 30

Example: a 3R email store

• State
  – mailstores, folders, messages, user properties, aliases
• Verbs
  – transport: create/delete/alter mapping; deliver msg
  – directory: create/alter/delete user-entry; create/alter/delete filter-rule; add/remove maildrop
  – store: create/delete store; create/rename/delete folder; expunge folder; list folder; set folder flags; copy msg; append msg; fetch msg; set msg flags

[Diagram: components Store, Transport, WebUI, Directory/Auth., and UndoMgr. Connection labels: SMTP; HTTP; LDAP, internal; IMAP, internal; internal; verbs.]