
Slide 1 (COMP 734 – Fall 2011)

COMP 734 -- Distributed File Systems

With Case Studies: NFS, Andrew, and Google

Slide 2

Distributed File Systems Were Phase 0 of “Cloud” Computing

Phase 0: centralized data, distributed processing. Motivating concerns:
- Mobility (of users and data)
- Sharing
- Administration costs
- Content management
- Backup
- Security
- Performance???

Slide 3

Distributed File System Clients and Servers

[Diagram: several clients send requests (e.g., read) across the network to a server, which returns responses (e.g., a file block).]

Most distributed file systems use Remote Procedure Calls (RPC).

Slide 4

RPC Structure (Birrell & Nelson)¹

¹ Fig. 1 (slightly modified)

[Figure: the caller machine runs the client, a client stub, and the RPC runtime; the callee machine runs the RPC runtime, a server stub, and the server. The client makes what looks like a local call; the client stub packs the arguments, and the RPC runtime transmits the call packet over the network while the caller waits. On the callee side the runtime receives the packet, the server stub unpacks the arguments and calls the server, which does the work. The return path mirrors this: the server stub packs the result, the runtime transmits the result packet, the caller's runtime receives it, the client stub unpacks the result, and the client sees a local return. The stubs are generated from an interface that the server exports and the client imports.]

Birrell, A. D. and B. J. Nelson, Implementing Remote Procedure Calls, ACM Transactions on Computer Systems, Vol. 2, No. 1, February 1984, pp. 39-59
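
To make the structure concrete, here is a minimal sketch of the same call path in Python. It is illustrative only: JSON over a TCP socket stands in for the RPC runtime and packet transport, and the procedure names are made up.

    import json, socket, threading

    def exported_add(a, b):                  # the procedure the exporter makes available
        return a + b

    PROCEDURES = {"add": exported_add}       # the "exported interface"

    def server_loop(listener):
        conn, _ = listener.accept()          # RPC runtime: receive the call packet
        with conn:
            packet = json.loads(conn.recv(4096).decode())          # server stub: unpack args
            result = PROCEDURES[packet["proc"]](*packet["args"])    # call + work
            conn.sendall(json.dumps({"result": result}).encode())   # pack + transmit result

    def rpc_call(address, proc, *args):
        """Client stub: pack args, transmit the call packet, wait, unpack the result."""
        with socket.create_connection(address) as s:
            s.sendall(json.dumps({"proc": proc, "args": list(args)}).encode())
            return json.loads(s.recv(4096).decode())["result"]      # local return

    if __name__ == "__main__":
        listener = socket.socket()
        listener.bind(("127.0.0.1", 0))
        listener.listen(1)
        threading.Thread(target=server_loop, args=(listener,), daemon=True).start()
        print(rpc_call(listener.getsockname(), "add", 2, 3))   # looks like a local call; prints 5

The point of the stub layer is exactly what the sketch shows: the caller writes an ordinary function call and never sees the packing, transmission, or unpacking.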

Slide 5

Unix Local File Access Semantics – Multiple Processes Read/Write a File Concurrently

[Timeline diagram: processes A and B write the same file while process C reads it, under different happens-before orderings of the two writes.]

Writes are “atomic”: reads always get the atomic result of the most recently completed write.

Slide 6

What Happens If Clients Cache File/Directory Content?

[Diagram: several clients, each with a client cache, issue read() and write() calls over the network to the server.]

Do the cache consistency semantics match local concurrent access semantics?

Slide 7

File Usage Observation #1: Most Files Are Small

Sources: Mary Baker, et al., “Measurements of a Distributed File System,” Proceedings 13th ACM SOSP, 1991, pp. 198-212; Andrew W. Leung, et al., “Measurement and Analysis of Large-Scale Network File System Workloads,” Proceedings USENIX Annual Technical Conference, 2008, pp. 213-226.

Unix, 1991: 80% of files < 10 KB
Windows, 2008: 80% of files < 30 KB
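
Measurements like these are easy to approximate on any local tree. The sketch below (the path argument and the 10 KB threshold are arbitrary choices for illustration, not the methodology of either paper) walks a directory and reports the fraction of small files:

    import os, sys

    def small_file_fraction(root, threshold=10 * 1024):
        """Fraction of regular files under 'root' smaller than 'threshold' bytes."""
        sizes = []
        for dirpath, _, filenames in os.walk(root):
            for name in filenames:
                try:
                    sizes.append(os.path.getsize(os.path.join(dirpath, name)))
                except OSError:
                    pass                     # skip files that vanish or deny access
        small = sum(1 for s in sizes if s < threshold)
        return small / len(sizes) if sizes else 0.0

    if __name__ == "__main__":
        root = sys.argv[1] if len(sys.argv) > 1 else "."
        print(f"{100 * small_file_fraction(root):.1f}% of files are under 10 KB")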

Slide 8

File Usage Observation #2: Most Bytes Are Transferred from Large Files (and Large Files Are Larger)

Sources: Baker et al., 1991; Leung et al., 2008 (cited on Slide 7).

Unix, 1991: 80% of bytes from files > 10 KB
Windows, 2008: 70% of bytes from files > 30 KB

Slide 9

File Usage Observation #3: Most Bytes Are Transferred in Long Sequential Runs – Most Often the Whole File

Sources: Baker et al., 1991; Leung et al., 2008 (cited on Slide 7).

Unix, 1991: 85% of sequential bytes from runs > 100 KB
Windows, 2008: 60% of sequential bytes from runs > 100 KB

Slide 10

Chronology of Early Distributed File Systems – Designs for Scalability

[Chart: clients supported by one installation, plotted over time, distinguishing commercial products from open source (AFS).]

Slide 11

NFS-2 Design Goals

- Transparent file naming
- Scalable -- O(100s)
- Performance approximating local files
- Fault tolerant (client, server, and network faults)
- Unix file-sharing semantics (almost)
- No change to Unix C library/system calls

Slide 12

NFS File Naming – Exporting Names

[Diagram: three hosts (host#1, host#2, host#3), each with its own local file tree containing directories such as /usr, /usr/proj, /bin, and /tools with files like main.c, data.h, readme.doc, and report.doc. Servers make parts of their trees available with exportfs: exportfs /bin, exportfs /usr, and exportfs /usr/proj.]

Slide 13

NFS File Naming – Import (mount) Names

[Diagram: the same three hosts. A client grafts the exported trees into its own name space with mount: mount host#3:/bin /tools, mount host#1:/usr/proj /local, and mount host#3:/usr /usr. After mounting, the remote directories appear in the client's tree under the local names /tools, /local, and /usr.]

Slide 14

NFS-2 Remote Procedure Calls

The NFS server is stateless:
- each call is self-contained and independent
- this avoids complex state-recovery issues in the event of failures
- each call has an explicit file handle parameter
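
A sketch of what statelessness looks like at the interface level (the class, its handle table, and the call names are illustrative stand-ins, not the actual NFS-2 wire protocol): every call names the file handle and offset it needs, so the server can answer it without remembering anything about earlier calls.

    class StatelessFileServer:
        """Every request carries an explicit file handle and offset; the server
        keeps no per-client open-file state between calls."""

        def __init__(self, files):
            self.files = files                        # handle -> (attributes dict, bytes)

        def getattr(self, fh):
            return self.files[fh][0]                  # e.g., {"mtime": ..., "size": ...}

        def lookup(self, dir_fh, name):
            return self.files[dir_fh][0]["entries"][name]   # name -> file handle

        def read(self, fh, offset, count):
            return self.files[fh][1][offset:offset + count]

        def write(self, fh, offset, data):
            attrs, content = self.files[fh]
            content = content[:offset] + data + content[offset + len(data):]
            self.files[fh] = (dict(attrs, size=len(content)), content)

    srv = StatelessFileServer({
        1: ({"entries": {"foo.txt": 2}}, b""),        # handle 1: a directory
        2: ({"mtime": 0, "size": 5}, b"hello"),       # handle 2: a small file
    })
    fh = srv.lookup(1, "foo.txt")
    print(srv.read(fh, 0, 5))                         # b'hello'

Because nothing above depends on earlier requests, a crashed and restarted server can keep answering the same calls without any recovery protocol.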

Slide 15

NFS-2 Read/Write with Client Cache

[Diagram: an application calls read(fh, buffer, length) into the OS kernel; the kernel's NFS client satisfies it from buffer-cache blocks when possible, and otherwise issues RPCs over the network to the NFS server's file system.]

Slide 16

NFS-2 Cache Operation and Consistency Semantics

- Application writes modify blocks in the buffer cache and on the server with a write() RPC (“write-through”).
- Consistency validations compare the last-modified timestamp of the cached data with the value on the server. If the server timestamp is later, the cached data is discarded. Note the need for (approximately) synchronized clocks on client and server.
- Cached file data is validated (a getattr() RPC) each time the file is opened.
- Cached data is also validated (a getattr() RPC) if an application accesses it and a time threshold has passed since it was last validated: 3 seconds for files, 30 seconds for directories. Validations also happen whenever a last-modified attribute is returned on an RPC (read, lookup, etc.).
- If the time threshold has NOT passed, the cached file/directory data is assumed valid and the application is given access to it.
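
A sketch of the validation rule just described, with the 3-second and 30-second thresholds from the slide (the helper names and the getattr_rpc callable are assumptions for illustration, not the NFS client's real code):

    import time

    FILE_THRESHOLD, DIR_THRESHOLD = 3.0, 30.0         # seconds, per the slide

    class CacheEntry:
        def __init__(self, data, server_mtime):
            self.data = data
            self.cached_mtime = server_mtime          # last-modified time we cached
            self.last_validated = time.time()

    def read_cached(entry, is_directory, getattr_rpc):
        """Return cached data if it may be used, or None if it must be refetched."""
        threshold = DIR_THRESHOLD if is_directory else FILE_THRESHOLD
        if time.time() - entry.last_validated < threshold:
            return entry.data                         # within the threshold: no RPC at all
        server_mtime = getattr_rpc()                  # ask the server for current attributes
        entry.last_validated = time.time()
        if server_mtime > entry.cached_mtime:
            return None                               # server copy is newer: discard the cache
        return entry.data

    entry = CacheEntry(b"block data", server_mtime=100)
    print(read_cached(entry, is_directory=False, getattr_rpc=lambda: 100))   # b'block data'

Note how a stale copy can be served for up to the threshold interval: this is where NFS-2 departs from the local Unix semantics of Slide 5.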

Slide 17

Andrew Design Goals

- Transparent file naming in a single name space
- Scalable -- O(1000s)
- Performance approximating local files
- Easy administration and operation
- “Flexible” protections (directory scope)
- Clear (non-Unix) consistency semantics
- Security (authentication)

Slide 18

Users Share a Single Name Space

[Diagram: every workstation has a small local root (/etc, /bin, /cache) plus /afs; under /afs the shared tree (pkg, home, tools, win32, and user directories such as smithfd and reiter) looks identical from every machine.]

Slide 19

Server “Cells” Form a “Global” File System

[Diagram: each cell appears under /afs by its domain name, e.g., /afs/cs.unc.edu, /afs/cs.cmu.edu, and /afs/cern.ch, each with its own home directories (smithfd, reiter; alice, bob; ted, carol).]

Slide 20

Andrew File Access (Open)

[Diagram: an application calls open(/afs/usr/fds/foo.c); the Andrew client manages an on-disk cache and talks to the server over the network.]

(1) The open request is passed to the Andrew client.
(2) The client checks its cache for a valid copy of the file.
(3) If the file is not in the cache, the client fetches the whole file from the server and writes it to the cache.
(4) Once the file is in the cache, the client returns a handle to the local copy.

Slide 21

Andrew File Access (Read/Write)

[Diagram: read(fh, buffer, length) and write(fh, buffer, length) calls are handled entirely on the client.]

(5) Read and write operations take place on the local cache copy.

Slide 22

Andrew File Access (Close-Write)

[Diagram: the application calls close(fh); the Andrew client writes back to the server over the network.]

(6) The close request is passed to the Andrew client.
(7) The client writes the whole file back to the server from its cache.
(8) The server's copy of the file is entirely replaced.
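
Putting steps (1)-(8) together, here is a sketch of whole-file caching on the client side; fetch_file and store_file are hypothetical stand-ins for the client-server RPCs, not the real AFS protocol.

    import os

    class WholeFileCacheClient:
        def __init__(self, cache_dir, fetch_file, store_file):
            self.cache_dir = cache_dir
            self.fetch_file = fetch_file      # path -> whole file contents from the server
            self.store_file = store_file      # (path, contents) -> replace the server copy
            self.valid = set()                # paths whose cached copy is believed valid
            self.open_writes = {}             # fd -> path, so close() knows what to write back

        def _cache_path(self, path):
            return os.path.join(self.cache_dir, path.strip("/").replace("/", "_"))

        def open(self, path, writable=False):
            local = self._cache_path(path)
            if path not in self.valid:                      # (2)-(3): miss, fetch whole file
                with open(local, "wb") as f:
                    f.write(self.fetch_file(path))
                self.valid.add(path)
            fd = os.open(local, os.O_RDWR if writable else os.O_RDONLY)   # (4): local handle
            if writable:
                self.open_writes[fd] = path
            return fd                                       # (5): reads/writes are purely local

        def close(self, fd):
            path = self.open_writes.pop(fd, None)
            os.close(fd)
            if path is not None:                            # (6)-(8): ship the whole file back
                with open(self._cache_path(path), "rb") as f:
                    self.store_file(path, f.read())

Because open() hands back an ordinary local file descriptor, existing read()/write() code needs no changes; only open and close touch the server.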

Slide 23

Andrew Cache Operation and Consistency Semantics

- Directory lookups use a valid cached copy; directory updates (e.g., create or remove) are applied to the cache and “written through” to the server.
- When file or directory data is fetched, the server “guarantees” to notify (callback) the client before changing the server's copy.
- Cached data is used without checking until a callback is received for it or 10 minutes have elapsed without communication with its server.
- On receiving a callback for a file or directory, the client invalidates the cached copy.
- Cached data can also be revalidated (and new callbacks established) by the client with an RPC to the server; this avoids discarding all cache content after a network partition or client crash.
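
A sketch of the callback rule above (the class and the revalidate_rpc callable are illustrative; the 10-minute limit is the one on the slide):

    import time

    CALLBACK_LIFETIME = 600.0                 # 10 minutes without server contact

    class CallbackCache:
        def __init__(self):
            self.entries = {}                 # name -> cached data with a callback promise
            self.last_contact = time.time()   # time of last communication with the server

        def on_callback_break(self, name):
            """The server is about to change 'name': invalidate our cached copy."""
            self.entries.pop(name, None)

        def lookup(self, name, revalidate_rpc):
            if name not in self.entries:
                return None                                   # must be fetched from the server
            if time.time() - self.last_contact > CALLBACK_LIFETIME:
                still_valid = revalidate_rpc(name)            # re-establish the callback
                self.last_contact = time.time()
                if not still_valid:
                    del self.entries[name]
                    return None
            return self.entries[name]                         # used without any server traffic

The design trade-off versus NFS-2 is visible here: in the common case the client generates no validation traffic at all, at the cost of the server tracking which clients hold which callbacks.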

Slide 24

Andrew Benchmark -- Server CPU Utilization

Source: Howard, et al., “Scale and Performance in a Distributed File System,” ACM TOCS, vol. 6, no. 1, February 1988, pp. 51-81.

[Chart: benchmark load vs. server CPU utilization.] Andrew server utilization increases slowly with load; NFS server utilization saturates quickly with load.

Slide 25

Google is Really Different....

- Huge datacenters in 30+ worldwide locations, even nearby in Lenoir, NC
- Datacenters house multiple server clusters
- Each building is larger than a football field, with 4-story cooling towers
- [Photos: The Dalles, OR (2006), and its expansion in 2007 and 2008]

Slide 26

Google is Really Different....

- Each cluster has hundreds/thousands of Linux systems on Ethernet switches
- 500,000+ total servers

Slide 27

Custom Design Servers

- Clusters of low-cost commodity hardware
- Custom design using high-volume components
- SATA disks, not SAS (high capacity, low cost, somewhat less reliable)
- No “server-class” machines
- Battery power backup

Slide 28

Facebook Enters the Custom Server Race (April 7, 2011)

Announces the Open Compute Project (the Green Data Center)

Slide 29

Google File System Design Goals

- Familiar operations, but NOT the Unix/POSIX standard
- A specialized operation for Google applications: record_append()
- Scalable -- O(1000s) of clients per cluster
- Performance optimized for throughput
- No client caches (big files, little cache locality)
- Highly available and fault tolerant
- Relaxed file consistency semantics; applications are written to deal with consistency issues

Slide 30

File and Usage Characteristics

- Many files are 100s of MB or 10s of GB: results from web crawls, query logs, archives, etc.
- Relatively small number of files (millions per cluster)
- File operations: large sequential (streaming) reads/writes; small random reads (rare random writes)
- Files are mostly “write-once, read-many”
- File writes are dominated by appends, many from hundreds of concurrent processes (e.g., web crawlers)

[Diagram: many processes appending concurrently to a single file.]

Slide 31

GFS Basics

- Files are named with a conventional pathname hierarchy (but there are no actual directory files), e.g., /dir1/dir2/dir3/foobar
- Files are composed of 64 MB “chunks” (Linux typically uses 4 KB blocks)
- Each GFS cluster has many servers (Linux processes): one primary Master Server, several “shadow” Master Servers, and hundreds of “chunk” servers
- Each chunk is represented by a normal Linux file; the Linux file system buffer provides caching and read-ahead, and the file system extends file space as needed up to the chunk size
- Each chunk is replicated (3 replicas by default)
- Chunks are check-summed in 64 KB blocks for data integrity
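
The chunk and checksum sizes above imply simple arithmetic on the client and chunk server. A sketch (CRC32 is used here only as a stand-in for a 32-bit block checksum):

    import zlib

    CHUNK_SIZE = 64 * 1024 * 1024        # 64 MB chunks
    CHECKSUM_BLOCK = 64 * 1024           # check-summed in 64 KB blocks

    def locate(file_offset):
        """Map a byte offset in a file to (chunk index, offset within that chunk)."""
        return file_offset // CHUNK_SIZE, file_offset % CHUNK_SIZE

    def block_checksums(chunk_data):
        """One checksum per 64 KB block, so corruption is caught at block granularity."""
        return [zlib.crc32(chunk_data[i:i + CHECKSUM_BLOCK])
                for i in range(0, len(chunk_data), CHECKSUM_BLOCK)]

    print(locate(200 * 1024 * 1024))     # byte 200 MB falls in chunk 3, at offset 8 MB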

Slide 32

GFS Protocols for File Reads

The protocol minimizes client interaction with the master:
- data operations go directly to the chunk servers
- clients cache chunk metadata until a new open or a timeout

Ghemawat, S., H. Gobioff, and S-T. Leung, The Google File System, Proceedings of ACM SOSP 2003, pp. 29-43.
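
A client-side sketch of that read path; ask_master and read_from_chunkserver are hypothetical stand-ins for the RPCs in the figure.

    CHUNK_SIZE = 64 * 1024 * 1024

    class GfsLikeReadClient:
        def __init__(self, ask_master, read_from_chunkserver):
            self.ask_master = ask_master                         # (path, chunk index) -> (handle, [replica locations])
            self.read_from_chunkserver = read_from_chunkserver   # (location, handle, offset, length) -> bytes
            self.metadata_cache = {}                             # (path, chunk index) -> (handle, locations)

        def read(self, path, offset, length):
            # A read that crosses a chunk boundary would be split into pieces; omitted here.
            chunk_index, chunk_offset = divmod(offset, CHUNK_SIZE)
            key = (path, chunk_index)
            if key not in self.metadata_cache:                   # only a cache miss talks to the master
                self.metadata_cache[key] = self.ask_master(path, chunk_index)
            handle, locations = self.metadata_cache[key]
            return self.read_from_chunkserver(locations[0], handle, chunk_offset, length)

Keeping the master out of the data path is what lets a single master serve thousands of clients.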

Slide 33

Master Server Functions

- Maintain the file name space (atomic create and delete of names)
- Maintain chunk metadata:
  - assign an immutable, globally-unique 64-bit identifier to each chunk
  - map each file name to its chunk(s)
  - track current chunk replica locations, refreshed dynamically from chunk servers
- Maintain access control data
- Manage replicas and other chunk-related actions:
  - assign the primary replica and version number
  - garbage collect deleted chunks and stale replicas
  - migrate chunks for load balancing
  - re-replicate chunks when servers fail
- Exchange heartbeat and state messages with chunk servers
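
A sketch of how that metadata might be laid out (illustrative only: a sequential counter stands in for the 64-bit chunk identifiers, and replica locations are soft state rebuilt from heartbeats):

    import itertools

    class MasterState:
        def __init__(self):
            self._ids = itertools.count(1)    # stand-in for immutable 64-bit chunk handles
            self.namespace = {}               # "/dir1/dir2/foobar" -> [chunk handle, ...]
            self.chunks = {}                  # handle -> {"version": n, "replicas": set(), "primary": None}

        def create(self, path):
            self.namespace[path] = []         # atomic creation of a name (no directory files)

        def add_chunk(self, path):
            handle = next(self._ids)
            self.chunks[handle] = {"version": 1, "replicas": set(), "primary": None}
            self.namespace[path].append(handle)
            return handle

        def heartbeat(self, chunkserver, held_handles):
            """Refresh replica locations from what a chunk server reports it holds."""
            for handle in held_handles:
                if handle in self.chunks:
                    self.chunks[handle]["replicas"].add(chunkserver)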

Slide 34

GFS Relaxed Consistency Model

- Writes that are large or cross chunk boundaries may be broken into multiple smaller ones by GFS.
- Sequential writes (successful): one-copy semantics*, writes serialized.
- Concurrent writes (successful): one-copy semantics, but writes are not serialized in overlapping regions.
- Sequential or concurrent writes (failure): replicas may differ; the application should retry.
- In the successful cases, all replicas are equal.

* Informally, there exists exactly one current value at all replicas, and that value is returned for a read of any replica.

Slide 35

GFS Applications Deal with Relaxed Consistency

Writes:
- retry in case of failure at any replica
- take regular checkpoints after successful sequences
- include application-generated record identifiers and checksums

Reads:
- use checksum validation and record identifiers to discard padding and duplicates
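
A sketch of that reader-side discipline; the record framing used here (id, checksum, length, payload) is an assumption for illustration, not a GFS or application format.

    import struct, zlib

    HEADER = struct.Struct("!QII")            # record id, CRC32, payload length

    def pack_record(record_id, payload):
        return HEADER.pack(record_id, zlib.crc32(payload), len(payload)) + payload

    def read_valid_records(blob):
        seen, offset, records = set(), 0, []
        while offset + HEADER.size <= len(blob):
            record_id, checksum, length = HEADER.unpack_from(blob, offset)
            payload = blob[offset + HEADER.size: offset + HEADER.size + length]
            offset += HEADER.size + length
            if length == 0 or len(payload) < length or zlib.crc32(payload) != checksum:
                continue                      # padding, truncation, or corruption: skip it
            if record_id in seen:
                continue                      # duplicate left behind by a retried append
            seen.add(record_id)
            records.append(payload)
        return records

    blob = pack_record(1, b"alpha") + pack_record(1, b"alpha") + pack_record(2, b"beta")
    print(read_valid_records(blob))           # [b'alpha', b'beta']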

Slide 36

GFS Chunk Replication (1/2)

[Diagram: a client, the master, and three chunk servers C1, C2 (primary), and C3, each holding replicas of chunks 1 and 2; pushed data sits in LRU buffers at the chunk servers.]

1. The client contacts the master to find the replica state -- C1, C2 (primary), C3 -- and caches it.
2. The client picks any chunk server and pushes the data; servers forward the data along the “best” path to the others and ACK it.

Slide 37

GFS Chunk Replication (2/2)

[Diagram: the client sends its write request to the primary, which forwards it to the other replicas; ACKs flow back to the primary and success/failure to the client.]

3. The client sends the write request to the primary.
4. The primary assigns a write order and forwards the request to the replicas.
5. The primary collects ACKs and responds to the client with success or failure. Applications must retry the write if there is any failure.
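
The client's side of steps 1-5 can be sketched as below; all three callables are hypothetical stand-ins for the RPCs in the figures.

    def replicated_write(get_replica_state, push_data, send_write_to_primary, data, retries=3):
        """Returns True once every replica has applied the write, else False."""
        for _ in range(retries):
            primary, secondaries = get_replica_state()          # step 1: from the master, normally cached
            push_data([primary] + secondaries, data)            # step 2: data flows along a forwarding chain
            ok = send_write_to_primary(primary, secondaries)    # steps 3-5: the primary assigns the write
            if ok:                                              # order, forwards it, and collects ACKs
                return True
        return False                                            # the application must handle the failure

Separating the data push (step 2) from the control message (steps 3-5) lets the bulk data take the best network path while the ordering decision stays with the primary.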

Slide 38

GFS record_append()

- The client specifies only the data content and region size; the server returns the actual offset of the region.
- The append is guaranteed to happen at least once, atomically.
- The file may contain padding and duplicates:
  - padding if the region size won't fit in the current chunk
  - duplicates if the append fails at some replicas and the client must retry record_append()
- If record_append() completes successfully, all replicas will contain at least one copy of the region at the same offset.

Slide 39

GFS Record Append (1/3)

[Diagram: as in the write case, the client, the master, and chunk servers C1, C2 (primary), and C3, with LRU buffers at the chunk servers.]

1. The client contacts the master to find the replica state -- C1, C2 (primary), C3 -- and caches it.
2. The client picks any chunk server and pushes the data; servers forward the data along the “best” path to the others and ACK it.

Slide 40

GFS Record Append (2/3)

[Diagram: the client sends the request to the primary, which appends the record to the last chunk, forwards the write order and offset to the other replicas, and collects ACKs; the client receives the assigned offset or a failure.]

3. The client sends the write request to the primary.
4. If the record fits in the last chunk, the primary assigns a write order and offset and forwards them to the replicas.
5. The primary collects ACKs and responds to the client with the assigned offset. Applications must retry the write if there is any failure.

Slide 41

GFS Record Append (3/3)

[Diagram: the record does not fit in the last chunk, so the primary and replicas pad the chunk and the append is retried on the next chunk.]

3. The client sends the write request to the primary.
4. If the record overflows the last chunk, the primary and replicas pad the last chunk, and the returned offset points to the next chunk.
5. The client must retry the write from the beginning.
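
From the client's point of view, the whole record-append flow collapses to a retry loop. A sketch, where append_rpc is a stand-in for steps 3-5 above:

    def record_append(append_rpc, data, max_retries=5):
        """Return the offset at which 'data' is known to exist at all replicas."""
        for _ in range(max_retries):
            status, offset = append_rpc(data)          # stands in for steps 3-5 above
            if status == "ok":
                return offset                          # same offset at every replica
            # On padding or failure at any replica we simply retry; the file may be
            # left with padding and duplicates, which readers discard (Slide 35).
        raise IOError("record_append failed after retries")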
