Increasing Intrusion Tolerance Via Scalable Redundancy

Mike Reiter, Natassa Ailamaki, Greg Ganger, Priya Narasimhan, Chuck Cranor

Carnegie Mellon

Technical Objective

- To design, prototype and evaluate new protocols for implementing intrusion-tolerant services that scale better
  - Here, “scale” refers to efficiency as the number of servers and the number of failures tolerated grow
- Targeting three types of services
  - Read-write data objects
  - Custom “flat” object types for particular applications, notably directories for implementing an intrusion-tolerant file system
  - Arbitrary objects that support object nesting

Expected Impact

- Significant efficiency and scalability benefits over today’s protocols for intrusion tolerance
- For example, for data services, we anticipate
  - At least a twofold latency improvement over the current best, even at small configurations (e.g., tolerating 3-5 Byzantine server failures)
    - And improvements will grow as the system scales up
  - A twofold improvement in throughput, again growing with system size
- Without such improvements, intrusion tolerance will remain relegated to small deployments in narrow application areas

The Problem Space

- Distributed services manage redundant state across servers to tolerate faults
- We consider tolerance to Byzantine faults, as might result from an intrusion into a server or client
  - A faulty server or client may behave arbitrarily
- We also make no timing assumptions in this work
  - An “asynchronous” system
- Primary existing practice: replicated state machines
  - Offers no load dispersion, requires data replication, and degrades as the system scales in terms of number of messages

Evaluation

- Baseline for current work: the BFT library
  - Popular, publicly available implementation of Byzantine fault-tolerant state machine replication (by Castro & Liskov)
  - Reported to be an efficient implementation of that approach
- Two measures
  - Average latency of operations, from the client’s perspective
  - Peak sustainable throughput of operations
- Our consistency definition: linearizability of invocations

Background - Read/Write protocol

- Servers provide read/write block interface
- Servers version blocks on every write
- Decentralized, optimistic, scalable, Byzantine fault-tolerant

[Figure labels: Client, Servers, Data block]

R/W semantics

- R/W protocol appropriate for block storage
- But R/W protocol inappropriate for building general services
  - Doesn’t provide replicated state machine semantics
- A metadata service for a R/W-based block store motivated us to develop a protocol with stronger semantics

R/W semantics insufficient for metadata

- Consider 2 clients inserting a file in the same directory
- Last write wins; good for blocks, bad for directories

[Figure labels: Client A, Client B, Directory]
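The lost insert can be reproduced in a few lines. This is only an illustrative sketch: the directory is modeled as a set of names stored as one block, and `store`, `rw_insert` and the file names are hypothetical, not part of the R/W protocol itself.

```python
# Hypothetical model: a directory is a set of file names stored as one block.
def rw_insert(store, name):
    """Read the whole directory, add a name, and return the block to write back."""
    snapshot = set(store["dir"])   # read
    snapshot.add(name)
    return snapshot                # value the client will write

store = {"dir": {"a.txt"}}

# Both clients read the same old directory before either one writes.
write_a = rw_insert(store, "from_A.txt")
write_b = rw_insert(store, "from_B.txt")

store["dir"] = write_a   # client A's write lands first
store["dir"] = write_b   # client B's write lands last and overwrites A's

lost = "from_A.txt" not in store["dir"]
```

For a data block, keeping only the last write is the intended outcome; for a directory, client A's entry silently disappears.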

Query/Update (Q/U) protocol

- A protocol with replicated state machine semantics
  - Provides linearizable query and update operations
- Protocol properties
  - Decentralized
  - Handles Byzantine clients & server failures, asynchronous
  - Efficient common case operation
    - Optimistic protocol leverages versioning servers
    - Single-phase queries and updates, if concurrency- and failure-free
    - Avoids expensive cryptography (digital signatures)
  - Scalable
    - Avoids server-to-server broadcast
  - Atomic multi-object updates

Outline

- Motivation
- Query/Update protocol
  - Overview
  - Query, update operations
  - Validation, object syncing, multi-object operations
- Evaluation

Read/conditional-write primitive

- Servers accept an update operation only if the object hasn’t been modified since read

[Figure labels: Client A, Client B, Directory]
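A minimal single-server sketch of this primitive follows. It assumes an integer version counter as the "not modified since read" check; the class and method names are hypothetical and stand in for whatever the real servers use.

```python
class VersionedObject:
    """Hypothetical single-server object guarded by a version counter."""

    def __init__(self, value):
        self.value = value
        self.version = 0

    def read(self):
        # Return the version together with a copy of the state.
        return self.version, set(self.value)

    def conditional_write(self, read_version, new_value):
        # Accept only if nothing was written since the client's read.
        if read_version != self.version:
            return False
        self.value = new_value
        self.version += 1
        return True

directory = VersionedObject({"a.txt"})
ver_a, dir_a = directory.read()
ver_b, dir_b = directory.read()

ok_a = directory.conditional_write(ver_a, dir_a | {"from_A.txt"})
ok_b = directory.conditional_write(ver_b, dir_b | {"from_B.txt"})  # stale read

# Client B re-reads and retries on top of A's update.
ver_b, dir_b = directory.read()
ok_retry = directory.conditional_write(ver_b, dir_b | {"from_B.txt"})
```

Client B's first attempt is refused instead of overwriting client A, and the retry preserves both entries.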

Handling Byzantine clients

- For Byzantine fault-tolerance, clients must pass operation to servers
  - Constrains clients to narrow object interface
- Servers apply operation to old object to validate new object

[Figure labels: Op, Directory, directory + op]

Clients and objects

- Client just sends operations
  - Client does not read/write object
- Server applies operation to local object

[Figure labels: Op, history, op]
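The idea that clients ship operations rather than object state can be sketched as below. The operation table, the `DirectoryServer` class and the operation names are assumptions made for illustration; the point is only that a server computes the new object itself from a narrow, fixed interface.

```python
# Hypothetical narrow interface: the only operations a directory server accepts.
def op_insert(directory, name):
    return directory | {name}

def op_remove(directory, name):
    return directory - {name}

ALLOWED_OPS = {"insert": op_insert, "remove": op_remove}

class DirectoryServer:
    def __init__(self):
        self.directory = frozenset()
        self.history = []          # operations applied so far

    def apply(self, op_name, arg):
        # The client never supplies the new object, only an operation name
        # and argument, so it cannot install arbitrary state.
        if op_name not in ALLOWED_OPS:
            return False
        self.directory = ALLOWED_OPS[op_name](self.directory, arg)
        self.history.append((op_name, arg))
        return True

servers = [DirectoryServer() for _ in range(4)]
results = [s.apply("insert", "a.txt") for s in servers]
rejected = servers[0].apply("overwrite_everything", "junk")
```

Because every server applies the same deterministic operation to its own copy, the servers that accept it end up with the same new state.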

Query/Update protocol

- Servers host objects
  - Optimistic protocol versioning
- Export an operation interface (more than read/write)
  - Can export any deterministic operation
- Server exports three types of operations:
  - Read History (Object): returns timestamp vector
  - Query (Object, Version): read-only; returns object state; e.g., getattr
  - Update (Object, OHS, Value): mutating; updates object, conditioned on object not having been modified; e.g., setattr

[Figure labels: Server; objects A, B, C with version numbers]
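The three operation types can be pictured with a toy versioning server for a single object. This is a sketch under simplifying assumptions: timestamps are plain integers, the update is conditioned on one timestamp rather than a full OHS, and the class and method names are hypothetical.

```python
class VersioningServer:
    """Hypothetical single-object server that keeps every version."""

    def __init__(self, initial):
        self.versions = {0: initial}       # timestamp -> object state

    def read_history(self):
        # Read History: the timestamps of all versions held here.
        return sorted(self.versions)

    def query(self, version):
        # Query: read-only access to one version's state.
        return self.versions[version]

    def update(self, conditioned_on, operation):
        # Update: accepted only if the object has not been modified since
        # the version the client conditioned on.
        latest = max(self.versions)
        if conditioned_on != latest:
            return None
        new_ts = latest + 1
        self.versions[new_ts] = operation(self.versions[latest])
        return new_ts

server = VersioningServer({"mode": 0o644})
new_ts = server.update(0, lambda attrs: {**attrs, "mode": 0o600})  # setattr-like
stale = server.update(0, lambda attrs: {**attrs, "mode": 0o777})   # conditioned on old version
old_state = server.query(0)                                        # getattr-like
```

Keeping old versions means a write never destroys the state that earlier readers observed, which is what lets the protocol proceed optimistically.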

Read history operation

- Client requests version history of an object
- Each server replies with a list of timestamps

[Figure: timeline of read-history and history-reply messages; the replies form the Object History Set (OHS)]

Query operation

- Client performs read-history operation
  - Constructs OHS and identifies Latest version that is complete
- Client queries Latest version at server

[Figure: timeline of query and query-reply messages for the Latest version]
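Constructing the OHS and picking Latest might look roughly like the sketch below. The slides do not say how many servers must hold a version for it to count as complete, so `complete_threshold` is an assumed parameter, and the integer timestamps and server names are hypothetical.

```python
from collections import Counter

def build_ohs(replies):
    """replies: {server_id: [timestamps]} as returned by read-history."""
    return {server: sorted(history) for server, history in replies.items()}

def latest_complete(ohs, complete_threshold):
    """Highest timestamp present in at least complete_threshold histories.

    complete_threshold is an assumed parameter; the slides do not give it.
    """
    counts = Counter(ts for history in ohs.values() for ts in set(history))
    complete = [ts for ts, n in counts.items() if n >= complete_threshold]
    return max(complete) if complete else None

# Hypothetical replies: only server s2 has seen version 3.
replies = {"s1": [1, 2], "s2": [1, 2, 3], "s3": [1, 2], "s4": [1, 2]}
ohs = build_ohs(replies)
latest = latest_complete(ohs, complete_threshold=3)
```

Version 3 is held by a single server, so under this assumed threshold it is not complete and the client would query version 2.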


Update operation

- Client performs read-history operation
  - Constructs OHS and identifies Latest version that is complete
- Client sends operation and OHS to servers
  - Operation is conditioned on OHS

[Figure: timeline of update and update-reply messages, each update carrying the OHS]


Server validation for update operations

- A server needs to verify that the client conditioned the operation on Latest
- Validation steps:
  - Ensure read/conditional-write semantics
    - Check that local history matches that in OHS
    - Classify Latest write version: ensures operation is based on appropriate timestamp
  - Protection against Byzantine failures
    - Check authenticators: ensures integrity of OHS
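Two of these checks are easy to sketch: the authenticator check and the comparison of local history against the OHS. The version below is heavily simplified and rests on assumptions: authenticators are modeled as an HMAC-SHA256 over the history, a server checks only its own OHS entry, and the key, encoding and function names are hypothetical. Classification of Latest is not shown.

```python
import hashlib
import hmac

def authenticator(key, history):
    """HMAC over a server's history; the encoding is an illustrative choice."""
    message = ",".join(str(ts) for ts in history).encode()
    return hmac.new(key, message, hashlib.sha256).digest()

def validate_update(server_id, key, local_history, ohs):
    """ohs: {server_id: (history, authenticator)} supplied by the client."""
    history, tag = ohs[server_id]
    # Integrity: the entry must carry the authenticator this server issued.
    if not hmac.compare_digest(tag, authenticator(key, history)):
        return False
    # Read/conditional-write: no newer version since the client's read.
    return history == local_history

key = b"hypothetical-server-key"
client_ohs = {"s1": ([1], authenticator(key, [1]))}

accepted = validate_update("s1", key, [1], client_ohs)
stale = validate_update("s1", key, [1, 2], client_ohs)   # server has moved on
forged = validate_update("s1", key, [1, 2], {"s1": ([1, 2], b"\x00" * 32)})
```

A keyed hash is used here because the slides stress avoiding digital signatures; whether the real authenticators take this exact form is not stated in them.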

Server validation example

- Earlier example of 2 clients concurrently updating same directory
- Servers reject client B’s operation, due to “stale” OHS

[Figure: timeline of updates from Client A and Client B]

Q/U protocol details

- Handling Byzantine clients and server faults
  - Through validating timestamps and OHS
- During classification of Latest, may require repair
  - Incomplete operations: use barriers to fix failures
- Flexible protocol: can handle different types/# of faults
  - For asynchronous with Byzantine clients: N = 3t + 2b + 1, to tolerate t server faults, b of which are Byzantine
- Object syncing
- Multi-object operations
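As a quick worked example of the sizing formula N = 3t + 2b + 1: tolerating one fault that may be Byzantine (t = b = 1) needs 3 + 2 + 1 = 6 servers. The helper below only evaluates the formula from the slide; the function name and the check that b cannot exceed t are illustrative additions.

```python
def min_servers(t, b):
    """Servers needed per N = 3t + 2b + 1 (t faults, b of them Byzantine)."""
    if not 0 <= b <= t:
        raise ValueError("b counts the Byzantine subset of the t faults")
    return 3 * t + 2 * b + 1

sizes = {(t, b): min_servers(t, b) for t, b in [(1, 0), (1, 1), (3, 3), (5, 5)]}
```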

Object syncing

- A server may not have the latest version of an object
- If a server lacks latest version of object, the OHS contains information about which other servers have that version
  - The server must sync the object with another server
  - Hashes in OHS allow server to validate the synced object
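Hash-checked syncing can be sketched as follows. The choice of SHA-256, the callable "peers" and the byte-string object are assumptions made for illustration; the slides only state that hashes in the OHS let a server validate what it fetched.

```python
import hashlib

def digest(obj_bytes):
    return hashlib.sha256(obj_bytes).hexdigest()

def sync_object(expected_hash, peers):
    """Try the peers named in the OHS until one returns a matching object.

    peers: list of zero-argument callables that fetch the object bytes
    (a stand-in for a network request to another server).
    """
    for fetch in peers:
        candidate = fetch()
        if digest(candidate) == expected_hash:
            return candidate      # verified against the hash from the OHS
    return None                   # no peer supplied a valid copy

expected = digest(b"directory-v3")
peers = [lambda: b"corrupted-copy", lambda: b"directory-v3"]
synced = sync_object(expected, peers)
```

A faulty peer can withhold the object or return garbage, but it cannot make the syncing server accept a wrong version.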

Multi-object operation

- An update can span multiple objects
- A client must construct OHS for each object
- Servers perform validation for each object
- Operations are performed atomically across multiple objects
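On a single server, the all-or-nothing behavior can be sketched by validating every object before modifying any. The integer versions standing in for per-object OHS checks, the data layout and the function name are hypothetical simplifications.

```python
def multi_object_update(objects, read_versions, operations):
    """Apply operations to several objects atomically (all or nothing).

    objects:       {name: {"version": int, "value": frozenset}}
    read_versions: {name: version the client observed}
    operations:    {name: function from old value to new value}
    """
    # Step 1: validate every object before touching any of them.
    for name, seen in read_versions.items():
        if objects[name]["version"] != seen:
            return False
    # Step 2: all checks passed, so apply every operation.
    for name, operation in operations.items():
        objects[name]["value"] = operation(objects[name]["value"])
        objects[name]["version"] += 1
    return True

objects = {
    "src_dir": {"version": 0, "value": frozenset({"f.txt"})},
    "dst_dir": {"version": 0, "value": frozenset()},
}
move = {
    "src_dir": lambda d: d - {"f.txt"},
    "dst_dir": lambda d: d | {"f.txt"},
}
moved = multi_object_update(objects, {"src_dir": 0, "dst_dir": 0}, move)
# A second attempt based on the old versions must change nothing.
repeated = multi_object_update(objects, {"src_dir": 0, "dst_dir": 0}, move)
```

Moving a file between two directories is the kind of operation that needs this: the file must never appear in both directories or in neither.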

Evaluation

Prototype evaluation

- Built a counter object using Q/U and BFT protocols
  - inc method increments counter and returns new value
  - fetch method returns current counter value
- Light-weight operations to demonstrate network and computation overhead inherent to protocols
- Both Q/U and BFT implement efficient, optimistic queries
  - Evaluation focuses on updates
- Q/U common case: no concurrency; preferred quorums
- BFT common case: shared counter to allow batching
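For reference, the counter object's interface is small enough to write out. This is a plain local sketch of the two methods named above, not the replicated implementation; the class name is hypothetical.

```python
class CounterObject:
    """Local sketch of the benchmark object's interface."""

    def __init__(self):
        self.value = 0

    def inc(self):
        # Update: increments the counter and returns the new value.
        self.value += 1
        return self.value

    def fetch(self):
        # Query: returns the current value without modifying it.
        return self.value

counter = CounterObject()
first = counter.inc()
second = counter.inc()
current = counter.fetch()
```

Because both methods do almost no work, measured cost is dominated by the replication protocol rather than by the object itself.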

Experimental setup

- Cluster of Pentium 4 2.8 GHz, 1 GB RAM
- 1 Gb switched Ethernet, 18.3 Gbps/35.7 mpps switch
  - No background traffic
- Working size of experiments fits in server memory
  - To focus on protocol overhead, not on disk accesses
- Experiments are run for 30 seconds
  - Measurements from middle 10 seconds

Fault scalability (1)

- Investigate throughput as the number of server faults tolerated (b) increases
- Measured saturated throughput
  - Ran with 1, 3, 5, …, 20 clients with 2 outstanding requests
  - For each b, selected highest throughput value

Fault scalability (2)

Throughput and response time under load (1)

- Investigate throughput & response time under load
  - Demonstrates protocol behavior beyond saturated throughput data point
- Increased number of clients from 1 to 20 for b = 1

Throughput and response time under load (2)

Conclusions

- Developed the Q/U protocol for accessing shared objects in a distributed system
  - Fault-scalable Byzantine fault-tolerant
  - Optimistic, efficient
  - Atomic multi-object operations
- Evaluation
  - Protocol scales with number of failures tolerated
  - Throughput & response time consistent under load
