1 Principles of Reliable Distributed Systems Tutorial 12: Frangipani Spring 2009 Alex Shraer.

1

Principles of Reliable Distributed Systems

Tutorial 12: Frangipani

Spring 2009

Alex Shraer

2

Frangipani File System

Thekkath, Mann, and Lee, SOSP 1997

3

Frangipani

• Scalable file system built at SRC-DEC

• Published in SOSP’97

• Uses failure detection, Paxos, leases,…

• Two layers:
  – Petal: virtual disk from many “storage bricks”
  – Frangipani file system and lock service

4

Motivation

• Large-scale distributed file systems are hard to administer

• Hard to add/remove machines (servers)

• Hard to add/remove disks (storage space)

• Hard to manage set of current components

• Hard to manage locks

5

Petal: Distributed Virtual Disks

C. A. Thekkath and E. K. Lee

Systems Research Center, Digital Equipment Corporation

ASPLOS’96

6

Client’s View

7

Petal Overview

• Petal provides virtual disks
  – Large (2^64 bytes), sparse virtual space
  – Disk storage allocated on demand
  – Accessible to all file servers over a network

• Virtual disks implemented by
  – Cooperating CPUs executing Petal software
  – Ordinary disks attached to the CPUs
  – A scalable interconnection network
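
The "sparse space, storage allocated on demand" idea above can be illustrated with a toy model. This is only a sketch, not Petal's implementation: the class name `SparseVirtualDisk`, the 512-byte block size, and the in-memory dictionary are all assumptions made for illustration.

```python
class SparseVirtualDisk:
    """Toy sparse disk: a block consumes storage only once it is written."""

    def __init__(self, size_bytes=2**64, block_size=512):
        self.size_bytes = size_bytes    # size of the virtual address space
        self.block_size = block_size    # hypothetical block size
        self.blocks = {}                # block index -> bytes, filled on demand

    def _check(self, index):
        if index < 0 or index * self.block_size >= self.size_bytes:
            raise ValueError("block index outside the virtual address space")

    def write_block(self, index, data):
        self._check(index)
        if len(data) != self.block_size:
            raise ValueError("data must be exactly one block long")
        self.blocks[index] = bytes(data)    # storage is committed here

    def read_block(self, index):
        self._check(index)
        # Blocks that were never written read as zeros and use no storage.
        return self.blocks.get(index, bytes(self.block_size))

    def allocated_bytes(self):
        return len(self.blocks) * self.block_size
```

Writing a single block at a very high address leaves `allocated_bytes()` at one block, which is the property that lets a file system spread its structures across the huge address space.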

8

Petal Prototype

9

Global State Management

• Uses Paxos
  – Global state is replicated across all servers
    • Metadata (disk allocation) only!
  – Consistent in the face of server and network failures
  – A majority is needed to update the global state
  – Any server can be added/removed in the presence of failed servers
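
The majority requirement above is easy to state in code. The sketch below shows only the quorum test, not Paxos itself; the function names are hypothetical.

```python
def has_majority(acks, cluster_size):
    """True if `acks` servers form a strict majority of `cluster_size`."""
    return acks > cluster_size // 2


def can_update_global_state(reachable, members):
    """An update may proceed only if a majority of members is reachable."""
    live = set(reachable) & set(members)
    return has_majority(len(live), len(set(members)))
```

With five servers, any three can update the global state, so the system tolerates two failures; two partitions can never both hold a majority, which is why updates stay consistent under network failures.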

10

Key Petal Features

• Storage is incrementally expandable

• Data is optionally mirrored over multiple servers

• Metadata is replicated on all servers

• Transparent addition and deletion of servers

• Supports read-only snapshots of virtual disks

• Client API looks like block-level disk device

• Throughput
  – Scales linearly with additional servers
  – Degrades gracefully with failures

11

Frangipani: A Scalable Distributed File System

C. A. Thekkath, T. Mann, and E. K. Lee

Systems Research Center, Digital Equipment Corporation

SOSP’97

12

Frangipani Features

• Behaves like a local file system
  – Multiple machines cooperatively manage a Petal disk
  – Users on any machine see a consistent view of data

• Exhibits good performance, scaling, and load balancing

• Easy to administer

13

Ease of Administration

• Frangipani machines are modular
  – Can be added and deleted transparently

• Common free space pool
  – Users don’t have to be moved

• Automatically recovers from crashes

• Consistent backup without halting the system

14

Frangipani Structure

• Distributed file system built atop a shared virtual disk (Petal)

• Frangipani servers do not communicate with each other directly
  – Only through Petal

• Simplifies management
  – Addition/removal of servers

15

Frangipani Layering

16

Standard Organization

17

Components of Frangipani

• File system core
  – Implements the file system (FS) interface
  – Uses FS mechanisms (buffer cache etc.)
  – Exploits Petal’s large virtual space

• Locks with leases
  – Granted for finite time, must be refreshed

• Write-ahead redo log
  – Performance optimization + failure recovery

18

Locks

• Multiple reader/single writer

• Granularity: lock per entire file or directory

• A lock is really a lease – it expires
  – After 30 seconds in their implementation

• Assumption?
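
A multiple-reader/single-writer lock whose grants expire might look like the sketch below. It is a simplification under stated assumptions: the class `LeasedRWLock` is hypothetical, expiry is attached to each individual grant, and the current time is passed in explicitly so the behavior is deterministic.

```python
class LeasedRWLock:
    """Toy reader/writer lock in which every grant expires after a lease period."""

    def __init__(self, lease_seconds=30.0):
        self.lease_seconds = lease_seconds
        self.readers = {}    # holder -> expiry time
        self.writer = None   # (holder, expiry time) or None

    def _expire(self, now):
        # Drop every grant whose lease has run out.
        self.readers = {h: t for h, t in self.readers.items() if t > now}
        if self.writer is not None and self.writer[1] <= now:
            self.writer = None

    def acquire_read(self, holder, now):
        self._expire(now)
        if self.writer is not None:
            return False                 # a live writer excludes readers
        self.readers[holder] = now + self.lease_seconds
        return True

    def acquire_write(self, holder, now):
        self._expire(now)
        if self.writer is not None and self.writer[0] != holder:
            return False                 # another live writer
        if any(h != holder for h in self.readers):
            return False                 # other live readers
        self.readers.pop(holder, None)   # upgrade own read grant, if any
        self.writer = (holder, now + self.lease_seconds)
        return True
```

Because grants simply lapse, a crashed holder cannot block others forever. The price is the assumption the slide asks about: holder and lock service must roughly agree on how much time has passed.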

19

Using Locks

• Frangipani servers are clients of lock service

• Dirty data is written to disk (Petal) before the lock is given to another machine

• Locks are cached by servers that acquire them
  – Soft state: no need to explicitly release locks
  – Uses lease timeouts for lock recovery
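
The rule that dirty data reaches Petal before a lock moves to another machine can be sketched as a revocation handler. The names here (`CachingServer`, `on_revoke`) are hypothetical, and a plain dictionary stands in for the Petal virtual disk.

```python
class CachingServer:
    """Toy file server that caches data under locks it currently holds."""

    def __init__(self, disk):
        self.disk = disk           # dict standing in for Petal
        self.cache = {}            # path -> cached data
        self.dirty = set()         # paths modified but not yet written back
        self.held_locks = set()    # locks cached by this server

    def write(self, path, data):
        if path not in self.held_locks:
            raise PermissionError("write lock not held for " + path)
        self.cache[path] = data
        self.dirty.add(path)       # the write stays in the cache for now

    def on_revoke(self, path):
        # Flush first, then give up the lock: the next holder must see the data.
        if path in self.dirty:
            self.disk[path] = self.cache[path]
            self.dirty.discard(path)
        self.cache.pop(path, None) # cached copy is invalid once the lock is gone
        self.held_locks.discard(path)
```

Ordering is the whole point: if the lock were released before the write-back, another server could read stale data from the shared disk.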

20

Distributed Lock Management

• A set of lock servers collaboratively manage locks
  – Run Paxos among them
  – Consensus on global state: set of locks each server is responsible for, list of current lock servers, lock allocation to clients
  – Need majority to make progress

• Using leases requires assuming loosely synchronized clocks
  – Expired leases should not be accepted

• Why Paxos then?
  – To overcome network partitions
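
One common way to cope with clocks that are only loosely synchronized is for the lease holder to stop relying on a lease some safety margin before its nominal expiry. The check below is a sketch of that idea; the function name and the margin parameter are assumptions for illustration, not values taken from these slides.

```python
def lease_usable(expiry_time, local_now, safety_margin):
    """True if the holder may still act on the lease.

    The holder treats the lease as expired `safety_margin` seconds early,
    so that bounded clock skew cannot make it act after the lock service
    already considers the lease expired.
    """
    return local_now + safety_margin < expiry_time
```

For example, with a lease expiring at time 30 and a margin of 5, the holder may act at time 10 but must stop by time 25.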

21

Logging

• Frangipani uses a write-ahead redo log for metadata
  – Log records are kept on Petal (why?)

• Data is written to Petal
  – On sync, fsync, or every 30 seconds
  – On lock revocation or when the log wraps

• Each server has a separate log
  – Reduces contention
  – Independent recovery
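
The write-ahead discipline can be sketched in a few lines: a record describing the metadata update is appended to the server's own log before the metadata itself is changed. `RedoLog`, `apply_update`, and the per-key sequence numbers are hypothetical names chosen for this sketch; lists and dictionaries stand in for storage on Petal.

```python
class RedoLog:
    """Toy per-server redo log; each record says what the new value should be."""

    def __init__(self):
        self.records = []    # (sequence number, key, new value)
        self.next_seq = 1

    def append(self, key, value):
        seq = self.next_seq
        self.next_seq += 1
        self.records.append((seq, key, value))
        return seq


def apply_update(log, metadata, key, value):
    """Log the update first, then modify the metadata in place."""
    seq = log.append(key, value)     # step 1: the record is in the log
    metadata[key] = (seq, value)     # step 2: only now touch the metadata
    return seq
```

If the server crashes between the two steps, the log still describes the intended update, so it can be redone later. Keeping one log per server means servers never contend for a shared log.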

22

Recovery

• Recovery initiated due to failure detection
  – By the lock service
  – Failure detection implemented using heartbeats

• Any server can recover operations for a failed server
  – Log is available via Petal
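
Because the failed server's log lives on shared storage, any surviving server can replay it. The sketch below assumes each log record and each metadata item carries a sequence number, so that replay only redoes updates that are missing; that versioning scheme and the function name `replay` are assumptions made for illustration.

```python
def replay(log_records, metadata):
    """Redo logged updates not yet reflected in metadata; return how many were applied.

    log_records: iterable of (sequence number, key, new value)
    metadata:    dict mapping key -> (sequence number, value)
    """
    applied = 0
    for seq, key, value in sorted(log_records, key=lambda record: record[0]):
        current_seq, _ = metadata.get(key, (0, None))
        if seq > current_seq:            # update never reached the metadata
            metadata[key] = (seq, value)
            applied += 1
    return applied
```

Replay written this way is idempotent: running it a second time, for instance after the recovering server itself crashes partway through, changes nothing.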

23

Conclusions

• Fault-tolerance in the real world

• Overcome crashes and network partitions using consensus-based replication
  – Paxos

• Good performance in the uncontended case
  – Using locks

• Implement locks as leases for robustness

• Logging for recovery
