Jan 20, 2016
Carnegie Mellon
Increasing Intrusion Tolerance Via Scalable Redundancy
Mike Reiter
Natassa Ailamaki Greg Ganger Priya Narasimhan Chuck Cranor

Technical Objective
– To design, prototype, and evaluate new protocols for implementing intrusion-tolerant services that scale better
  – Here, “scale” refers to efficiency as the number of servers and the number of failures tolerated grow
– Targeting three types of services
  – Read-write data objects
  – Custom “flat” object types for particular applications, notably directories for implementing an intrusion-tolerant file system
  – Arbitrary objects that support object nesting

Expected Impact
– Significant efficiency and scalability benefits over today’s protocols for intrusion tolerance
– For example, for data services, we anticipate
  – At least a twofold latency improvement over the current best, even at small configurations (e.g., tolerating 3-5 Byzantine server failures)
    – And improvements will grow as the system scales up
  – A twofold improvement in throughput, again growing with system size
– Without such improvements, intrusion tolerance will remain relegated to small deployments in narrow application areas

The Problem Space
– Distributed services manage redundant state across servers to tolerate faults
– We consider tolerance to Byzantine faults, as might result from an intrusion into a server or client
  – A faulty server or client may behave arbitrarily
– We also make no timing assumptions in this work
  – An “asynchronous” system
– Primary existing practice: replicated state machines
  – Offers no load dispersion, requires data replication, and degrades as the system scales in terms of number of messages

Evaluation
– Baseline for current work: the BFT library
  – Popular, publicly available implementation of Byzantine fault-tolerant state machine replication (by Castro & Liskov)
  – Reported to be an efficient implementation of that approach
– Two measures
  – Average latency of operations, from the client’s perspective
  – Peak sustainable throughput of operations
– Our consistency definition: linearizability of invocations

Background - Read/Write protocol
– Servers provide read/write block interface
– Servers version blocks on every write
– Decentralized, optimistic, scalable, Byzantine fault-tolerant
[Figure: a client, the servers, and data blocks (D)]

R/W semantics
– R/W protocol appropriate for block storage
– But R/W protocol inappropriate for building general services
  – Doesn’t provide replicated state machine semantics
– A metadata service for an R/W-based block store motivated us to develop a protocol with stronger semantics

R/W semantics insufficient for metadata
– Consider 2 clients inserting a file in the same directory
– Last write wins; good for blocks, bad for directories
[Figure: Client A and Client B writing a directory to the servers]
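The lost update described on this slide can be reproduced in a few lines. This is a toy sketch, not the authors' protocol: `BlockStore`, its methods, and the file names are hypothetical, and a single in-memory object stands in for the replicated servers.

```python
class BlockStore:
    """Toy block store with plain read/write semantics (hypothetical)."""

    def __init__(self):
        self.block = frozenset()  # directory contents stored as one opaque block

    def read(self):
        return self.block

    def write(self, value):
        self.block = value  # unconditional overwrite: the last write wins


store = BlockStore()

# Both clients read the same (empty) directory before either one writes.
view_a = store.read()
view_b = store.read()

# Each client writes back its own copy with one file added.
store.write(view_a | {"a.txt"})
store.write(view_b | {"b.txt"})

lost = "a.txt" not in store.read()  # client A's insert has been overwritten
```

Overwriting a whole block is harmless when the block is the unit of data, but here it silently discards client A's directory entry.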

Query/Update (Q/U) protocol
– A protocol with replicated state machine semantics
  – Provides linearizable query and update operations
– Protocol properties
  – Decentralized
  – Handles Byzantine clients & server failures, asynchronous
  – Efficient common case operation
    – Optimistic protocol leverages versioning servers
    – Single-phase queries and updates, if concurrency- and failure-free
    – Avoids expensive cryptography (digital signatures)
  – Scalable
    – Avoids server-to-server broadcast
  – Atomic multi-object updates

Outline
– Motivation
– Query/Update protocol
  – Overview
  – Query, update operations
  – Validation, object syncing, multi-object operations
– Evaluation

Read/conditional-write primitive
– Servers accept an update operation only if the object hasn’t been modified since it was read
[Figure: Client A and Client B updating a directory held by the servers]
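A minimal sketch of the read/conditional-write idea, assuming a single integer version number per object. The class and method names are hypothetical, and one in-memory object stands in for the set of servers.

```python
class VersionedDirectory:
    """Toy object with a read/conditional-write interface (hypothetical)."""

    def __init__(self):
        self.version = 0
        self.entries = frozenset()

    def read(self):
        return self.version, self.entries

    def conditional_write(self, read_version, new_entries):
        if read_version != self.version:
            return False  # object was modified since the caller's read
        self.version += 1
        self.entries = new_entries
        return True


def insert_file(directory, name):
    """Retry the read/conditional-write cycle until the write is accepted."""
    while True:
        version, entries = directory.read()
        if directory.conditional_write(version, entries | {name}):
            return


d = VersionedDirectory()
va, ea = d.read()
vb, eb = d.read()
a_ok = d.conditional_write(va, ea | {"a.txt"})  # accepted
b_ok = d.conditional_write(vb, eb | {"b.txt"})  # rejected: version moved on
insert_file(d, "b.txt")                         # client B re-reads and retries
```

Unlike last-write-wins, the rejected client learns that its view was stale, and after the retry both files are present.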

Handling Byzantine clients
– For Byzantine fault-tolerance, clients must pass the operation to servers
  – Constrains clients to a narrow object interface
  – Servers apply the operation to the old object to validate the new object
[Figure: a client sending Op to each server; servers hold "directory" and "directory + op"]

Clients and objects
– Client just sends operations
  – Client does not read/write the object
– Server applies operation to local object
[Figure: a client sending Op to each server; labels: history, op]

Query/Update protocol
– Servers host objects
  – Optimistic protocol versioning
– Export an operation interface (more than read/write)
  – Can export any deterministic operation
– Server exports three types of operations:
  – Read History (Object): returns timestamp vector
  – Query (Object, Version): read-only; returns object state; e.g., getattr
  – Update (Object, OHS, Value): mutating; updates object, conditioned on the object not having been modified; e.g., setattr
[Figure: a server hosting versioned objects A, B, and C]
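The three operation types can be sketched for a single server as follows. This is an illustration under simplifying assumptions, not the authors' interface: `ToyServer` and its method names are hypothetical, timestamps are plain integers, and a single timestamp stands in for the OHS that a real update carries.

```python
class ToyServer:
    """Single-server sketch of the three operation types (hypothetical names)."""

    def __init__(self, initial_state):
        self.versions = {0: initial_state}  # timestamp -> object state

    def read_history(self):
        """Return the timestamps of all versions this server holds."""
        return sorted(self.versions)

    def query(self, timestamp, op):
        """Apply a read-only operation to the requested version."""
        return op(self.versions[timestamp])

    def update(self, conditioned_on, op):
        """Apply a mutating operation only if the object is unmodified."""
        latest = max(self.versions)
        if conditioned_on != latest:
            return None  # object changed since the client's read-history
        self.versions[latest + 1] = op(self.versions[latest])
        return latest + 1


server = ToyServer({"mode": 0o644})
latest = server.read_history()[-1]
new_ts = server.update(latest, lambda attrs: {**attrs, "mode": 0o600})  # setattr-like
stale = server.update(latest, lambda attrs: {**attrs, "mode": 0o777})   # rejected
mode = server.query(new_ts, lambda attrs: attrs["mode"])                # getattr-like
```

The client never writes object state directly: it hands the server a deterministic operation, and the server applies it to its own copy and keeps the older versions.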

Read history operation
– Client requests version history of an object
– Each server replies with a list of timestamps
[Figure: timeline of read-history requests and history-reply messages; the servers' timestamp lists form the Object History Set (OHS)]

Query operation
– Client performs read-history operation
  – Constructs OHS and identifies Latest version that is complete
– Client queries Latest version at server
[Figure: timeline of read-history/history-reply followed by query/query-reply messages; Latest is marked in the Object History Set (OHS)]
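Identifying Latest from an OHS can be sketched as below. The slides do not state the rule for when a version counts as complete, so this sketch assumes a version is complete once at least `quorum` servers report its timestamp. The `quorum` parameter, the server names, and the histories are illustrative assumptions.

```python
from collections import Counter


def latest_complete(ohs, quorum):
    """Return the highest timestamp reported by at least `quorum` servers."""
    counts = Counter(ts for history in ohs.values() for ts in set(history))
    complete = [ts for ts, seen in counts.items() if seen >= quorum]
    return max(complete) if complete else None


# OHS assembled from four history replies: server id -> list of timestamps.
ohs = {
    "s1": [1, 2],
    "s2": [1, 2, 3],
    "s3": [1, 2],
    "s4": [1, 2],
}
latest = latest_complete(ohs, quorum=3)
```

Version 3 appears at only one server, so under this assumed rule it is not complete and the client would query version 2.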

Update operation
– Client performs read-history operation
  – Constructs OHS and identifies Latest version that is complete
– Client sends operation and OHS to servers
  – Operation is conditioned on OHS
[Figure: timeline of read-history/history-reply followed by update/update-reply messages carrying the OHS; Latest is marked in the Object History Set (OHS)]

Server validation for update operations
– A server needs to verify that the client conditioned the operation on Latest
– Validation steps:
  – Ensure read/conditional-write semantics
    – Check that local history matches that in OHS
    – Classify Latest write version
      – Ensures operation is based on appropriate timestamp
  – Protection against Byzantine failures
    – Check authenticators
      – Ensures integrity of OHS

Server validation example
– Earlier example of 2 clients concurrently updating the same directory
– Servers reject client B’s operation, due to “stale” OHS
[Figure: timeline of read-history, history-reply, and update messages for Client A and Client B]
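The history check and the two-client example can be sketched together. This is a simplified illustration: the authenticator check is omitted, the completeness rule (a timestamp reported by at least `quorum` servers) is an assumption, and all names are hypothetical.

```python
from collections import Counter


def latest_complete(ohs, quorum):
    """Assumed rule: highest timestamp reported by at least `quorum` servers."""
    counts = Counter(ts for history in ohs.values() for ts in set(history))
    complete = [ts for ts, seen in counts.items() if seen >= quorum]
    return max(complete) if complete else None


def validate_update(server_id, local_history, ohs, conditioned_on, quorum):
    """Accept only if the OHS is current for this server and names Latest."""
    if ohs.get(server_id) != local_history:
        return False  # the client's view of this server is stale
    return conditioned_on == latest_complete(ohs, quorum)


servers = ["s1", "s2", "s3", "s4"]
histories = {s: [1] for s in servers}

# Both clients run read-history before either update is applied.
ohs_a = {s: list(h) for s, h in histories.items()}
ohs_b = {s: list(h) for s, h in histories.items()}

# Client A's update arrives first, passes validation, and creates version 2.
a_accepted = all(validate_update(s, histories[s], ohs_a, 1, quorum=3) for s in servers)
for s in servers:
    histories[s].append(2)

# Client B's update still carries the OHS gathered before A's update.
b_accepted = any(validate_update(s, histories[s], ohs_b, 1, quorum=3) for s in servers)
```

Client B would have to run read-history again and resubmit its operation against the new Latest.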

Q/U protocol details
– Handling Byzantine clients and server faults
  – Through validating timestamps and OHS
– During classification of Latest, may require repair
  – Incomplete operations: use barriers to fix failures
– Flexible protocol: can handle different types and numbers of faults
  – For asynchronous with Byzantine clients: N = 3t + 2b + 1, to tolerate t server faults, b of which are Byzantine
– Object syncing
– Multi-object operations
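The server-count formula on this slide is easy to tabulate. The sketch below evaluates N = 3t + 2b + 1 directly. The function name and the check that b does not exceed t are additions for illustration.

```python
def min_servers(t, b):
    """Servers needed to tolerate t faulty servers, b of them Byzantine."""
    if not 0 <= b <= t:
        raise ValueError("need 0 <= b <= t")
    return 3 * t + 2 * b + 1


# When every tolerated fault may be Byzantine (t == b), N = 5b + 1.
sizes = {b: min_servers(t=b, b=b) for b in range(1, 6)}
```

For example, tolerating one Byzantine server (t = b = 1) gives N = 6, while tolerating one crash-only fault (t = 1, b = 0) gives N = 4.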

Object syncing
– A server may not have the latest version of an object
– If a server lacks the latest version of an object, the OHS contains information about which other servers have that version
  – The server must sync the object with another server
– Hashes in OHS allow the server to validate the synced object
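A sketch of hash-validated syncing, under stated assumptions: SHA-256 is used as the hash (the slides do not name one), each peer is modeled as a callable that returns the object's bytes, and `digest` and `sync_object` are hypothetical helper names.

```python
import hashlib


def digest(state):
    """Hash of an object's serialized state (SHA-256 is an assumption)."""
    return hashlib.sha256(state).hexdigest()


def sync_object(expected_digest, peers):
    """Fetch the object from peers until one copy matches the expected digest."""
    for fetch in peers:
        candidate = fetch()
        if digest(candidate) == expected_digest:
            return candidate
    return None  # no listed peer supplied a valid copy


good = b"dir: a.txt, b.txt"
expected = digest(good)  # in the protocol this value would come from the OHS

# The first peer returns a bad copy, the second returns the real object.
peers = [lambda: b"corrupted", lambda: good]
synced = sync_object(expected, peers)
```

Because the expected hash comes from the OHS rather than from the peer, a faulty peer cannot pass off a bad copy as the missing version.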

Multi-object operation
– An update can span multiple objects
– A client must construct an OHS for each object
– Servers perform validation for each object
– Operations are performed atomically across multiple objects
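The all-or-nothing behavior can be sketched on a single server by validating every object before changing any of them. This is a simplification: integer versions stand in for the per-object OHS, and the function, object names, and move-a-file scenario are hypothetical.

```python
def multi_object_update(objects, conditions, ops):
    """objects: name -> (version, state). Apply all ops or none of them."""
    # Phase 1: validate every object before changing anything.
    for name, expected_version in conditions.items():
        if objects[name][0] != expected_version:
            return False
    # Phase 2: all checks passed, so apply every operation.
    for name, op in ops.items():
        version, state = objects[name]
        objects[name] = (version + 1, op(state))
    return True


objects = {"dir_a": (3, frozenset({"f.txt"})), "dir_b": (7, frozenset())}

# Move f.txt from dir_a to dir_b as one update spanning both objects.
moved = multi_object_update(
    objects,
    conditions={"dir_a": 3, "dir_b": 7},
    ops={"dir_a": lambda d: d - {"f.txt"}, "dir_b": lambda d: d | {"f.txt"}},
)

# A second update with a stale version for dir_b changes neither object.
stale = multi_object_update(
    objects,
    conditions={"dir_a": 4, "dir_b": 7},
    ops={"dir_a": lambda d: d | {"g.txt"}, "dir_b": lambda d: d | {"g.txt"}},
)
```

Splitting validation from application means a failed check on one object can never leave a partial update on another.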

Prototype evaluation
– Built a counter object using Q/U and BFT protocols
  – inc method increments counter and returns new value
  – fetch method returns current counter value
– Light-weight operations to demonstrate network and computation overhead inherent to protocols
– Both Q/U and BFT implement efficient, optimistic queries
  – Evaluation focuses on updates
– Q/U common case: no concurrency; preferred quorums
– BFT common case: shared counter to allow batching
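The benchmark object is small enough to sketch in full. Only the method names `inc` and `fetch` come from the slide. The class name and the local, unreplicated implementation are assumptions for illustration.

```python
class CounterObject:
    """Local sketch of the benchmark counter (no replication)."""

    def __init__(self):
        self.value = 0

    def inc(self):
        """Mutating operation: increment and return the new value."""
        self.value += 1
        return self.value

    def fetch(self):
        """Read-only operation: return the current value."""
        return self.value


counter = CounterObject()
results = [counter.inc(), counter.inc(), counter.fetch()]
```

Both methods do almost no work, so the measured cost of an operation is dominated by the replication protocol rather than by the object itself.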

Experimental setup
– Cluster of Pentium 4 2.8 GHz, 1 GB RAM
– 1 Gb switched Ethernet, 18.3 Gbps/35.7 mpps switch
  – No background traffic
– Working size of experiments fits in server memory
  – To focus on protocol overhead, not on disk accesses
– Experiments are run for 30 seconds
  – Measurements from middle 10 seconds
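One way to take measurements from the middle of a run is sketched below. It assumes that "middle 10 seconds" means the interval from 10 s to 20 s of a 30-second run and that each sample records a completion time and a latency. The function name is hypothetical and the sample data is synthetic, not a measurement.

```python
def middle_window_stats(samples, start=10.0, end=20.0):
    """samples: (completion_time_s, latency_s) pairs from one 30-second run."""
    window = [lat for done, lat in samples if start <= done < end]
    throughput = len(window) / (end - start)  # operations per second
    mean_latency = sum(window) / len(window) if window else None
    return throughput, mean_latency


# Synthetic samples: one operation completing every 0.5 s, each taking 2 ms.
samples = [(i * 0.5, 0.002) for i in range(60)]
throughput, mean_latency = middle_window_stats(samples)
```

Discarding the first and last 10 seconds keeps start-up and shutdown effects out of the reported numbers.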

Fault scalability (1)
– Investigate throughput as the number of server faults tolerated (b) increases
– Measured saturated throughput
  – Ran with 1, 3, 5, …, 20 clients with 2 outstanding requests
  – For each b, selected highest throughput value

Fault scalability (2)

Throughput and response time under load (1)
– Investigate throughput & response time under load
  – Demonstrates protocol behavior beyond the saturated throughput data point
– Increased number of clients from 1 to 20 for b = 1

Throughput and response time under load (2)

Conclusions
– Developed the Q/U protocol for accessing shared objects in a distributed system
  – Fault-scalable
  – Byzantine fault-tolerant
  – Optimistic, efficient
  – Atomic multi-object operations
– Evaluation
  – Protocol scales with number of failures tolerated
  – Throughput & response time consistent under load