COMP 734 -- Distributed File Systems
With Case Studies: NFS, Andrew and Google
COMP 734 – Fall 2011

Transcript
Page 1:

COMP 734 -- Distributed File Systems

With Case Studies:

NFS, Andrew and Google

Page 2:

Distributed File Systems Were Phase 0 of “Cloud” Computing

Drivers: Mobility (user & data), Sharing, Administration Costs, Content Management, Backup, Security, Performance???

Phase 0: Centralized Data, Distributed Processing

Page 3:

Distributed File System Clients and Servers

[Figure: several Clients connected over a network to a Server; a client sends a request (e.g., read) and the server returns a response (e.g., a file block).]

Most Distributed File Systems Use Remote Procedure Calls (RPC)

Page 4:

RPC Structure (Birrell & Nelson) [1]

[1] Fig. 1 (slightly modified)

[Figure: on the caller machine, the client (importer of the interface) makes a local call; the client stub packs the arguments and the RPC runtime transmits the call packet over the network. On the callee machine, the RPC runtime receives the packet, the server stub unpacks the arguments, and the server (exporter of the interface) does the work and returns; the result is packed, transmitted back, received, unpacked, and returned to the waiting client as a local return.]

Birrell, A. D. and B. J. Nelson, Implementing Remote Procedure Calls, ACM Transactions on Computer Systems, Vol. 2, No. 1, February 1984, pp. 39-59
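To make the stub structure concrete, here is a minimal sketch (not the Birrell & Nelson system) using Python's standard xmlrpc library, which plays the role of the client stub and server-side dispatch; the procedure name read_block, the port, and the host name fileserver are made up for illustration.

    # --- server side (callee machine): exports read_block over RPC ---
    from xmlrpc.server import SimpleXMLRPCServer
    from xmlrpc.client import Binary

    def read_block(path, offset, length):
        # "work": read one block of a local file and return it
        with open(path, "rb") as f:
            f.seek(offset)
            return Binary(f.read(length))   # wrap raw bytes for the wire format

    server = SimpleXMLRPCServer(("0.0.0.0", 8000), allow_none=True)
    server.register_function(read_block)    # exporter side of the interface
    # server.serve_forever()                # run this in its own process

    # --- client side (caller machine): the proxy object acts as the client stub;
    # it packs arguments, transmits the call packet, waits, and unpacks the result ---
    import xmlrpc.client

    proxy = xmlrpc.client.ServerProxy("http://fileserver:8000/", allow_none=True)
    block = proxy.read_block("/export/foo", 0, 4096).data   # looks like a local call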

Page 5:

Unix Local File Access Semantics – Multiple Processes Read/Write a File Concurrently

[Figure: timelines showing write by A, write by B, and read by C in both orderings (A happens before B, and B happens before A).]

Writes are “atomic”; reads always get the atomic result of the most recently completed write.
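A minimal sketch of one way an application can obtain this "reads see the result of the most recently completed write" behavior on a local Unix file system: each writer prepares a complete new version and publishes it with an atomic rename, so a concurrent reader sees either the old or the new contents, never a mix. The file name report.txt and the helper names are made up for illustration; this is not how the kernel implements write() atomicity.

    import os, tempfile

    def atomic_write(path, data: bytes):
        # Write a complete new version, then atomically replace the old one.
        # os.replace() is atomic on POSIX, so readers never see a partial write.
        dir_name = os.path.dirname(path) or "."
        fd, tmp = tempfile.mkstemp(dir=dir_name)
        try:
            with os.fdopen(fd, "wb") as f:
                f.write(data)
                f.flush()
                os.fsync(f.fileno())
            os.replace(tmp, path)
        except BaseException:
            os.unlink(tmp)
            raise

    def read_current(path) -> bytes:
        # Always returns the result of the most recently completed atomic_write().
        with open(path, "rb") as f:
            return f.read()

    # Writer A and writer B can call atomic_write("report.txt", ...) concurrently;
    # reader C's read_current("report.txt") sees one complete version or the other.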

Page 6:

What Happens If Clients Cache File/Directory Content?

[Figure: multiple clients, each with a client cache, issue read() and write() calls over the network to the Server.]

Do the Cache Consistency Semantics Match Local Concurrent Access Semantics?

Page 7:

File Usage Observation #1: Most Files Are Small

Source: Mary Baker, et al., “Measurements of a Distributed File System,” Proceedings 13th ACM SOSP, 1991, pp. 198-212.
Source: Andrew W. Leung, et al., “Measurement and Analysis of Large-Scale Network File System Workloads,” Proceedings USENIX Annual Technical Conference, 2008, pp. 213-226.

Unix, 1991

80% of files < 10 KB

Windows, 2008

80% of files < 30 KB

Page 8:

File Usage Observation #2: Most Bytes Are Transferred from Large Files (and Large Files Are Larger)

Source: Mary Baker, et al., “Measurements of a Distributed File System,” Proceedings 13th ACM SOSP, 1991, pp. 198-212.
Source: Andrew W. Leung, et al., “Measurement and Analysis of Large-Scale Network File System Workloads,” Proceedings USENIX Annual Technical Conference, 2008, pp. 213-226.

Unix, 1991

80% of bytes from files > 10 KB

Windows, 2008

70% of bytes from files > 30 KB

Page 9:

File Usage Observation #3: Most Bytes Are Transferred in Long Sequential Runs – Most Often the Whole File

Source: Mary Baker, et al., “Measurements of a Distributed File System,” Proceedings 13th ACM SOSP, 1991, pp. 198-212.
Source: Andrew W. Leung, et al., “Measurement and Analysis of Large-Scale Network File System Workloads,” Proceedings USENIX Annual Technical Conference, 2008, pp. 213-226.

Unix, 1991 Windows, 2008

85% of sequential bytes from runs > 100 KB

60% of sequential bytes from runs > 100 KB

Page 10:

Chronology of Early Distributed File Systems – Designs for Scalability

[Figure: timeline of early distributed file systems; x-axis is Time, y-axis is Clients supported by one installation; systems are marked as products or open source (AFS).]

Page 11:

NFS-2 Design Goals

- Transparent file naming
- Scalable -- O(100s)
- Performance approximating local files
- Fault tolerant (client, server, network faults)
- Unix file-sharing semantics (almost)
- No change to Unix C library/system calls

Page 12:

NFS File Naming – Exporting Names

[Figure: local directory trees on host#1, host#2, and host#3, with directories such as /usr/proj (source with main.c and data.h, doc with readme.doc and report.doc), /bin with tools, and per-user directories for sam, joe, fred, and bob. Each server exports part of its tree:
    exportfs /bin
    exportfs /usr
    exportfs /usr/proj]

Page 13:

NFS File Naming – Import (mount) Names

[Figure: the same exported trees grafted into each client's local name space:
    mount host#3:/bin /tools
    mount host#1:/usr/proj /local
    mount host#3:/usr /usr]

Page 14:

NFS-2 Remote Procedure Calls

NFS server is Stateless
- each call is self-contained and independent

- avoids complex state recovery issues in the event of failures

- each call has an explicit file handle parameter
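A sketch of what statelessness means on the server side: every request carries the file handle and offset it needs, so the handler below keeps no per-client open-file state between calls. The handle table, attribute helper, and return layout are simplified illustrations, not the actual NFS-2 wire protocol (see RFC 1094 for that).

    import os

    # Hypothetical handle table: a real server encodes fs id + inode + generation
    # in the opaque handle; here a handle is just a key into this dict.
    HANDLE_TABLE = {b"fh-0001": "/export/usr/proj/doc/readme.doc"}

    def get_attributes(path):
        st = os.stat(path)
        return {"size": st.st_size, "mtime": st.st_mtime}

    def nfs_read(file_handle, offset, count):
        # Stateless: every argument the call needs is in the call itself,
        # and nothing about the client is remembered afterwards.
        path = HANDLE_TABLE[file_handle]
        with open(path, "rb") as f:       # open/close per call: no open-file state
            f.seek(offset)
            data = f.read(count)
        return {"status": "OK", "attributes": get_attributes(path), "data": data}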

Page 15:

NFS-2 Read/Write with Client Cache

[Figure: the application issues read(fh, buffer, length) to the OS kernel; the NFS client code in the kernel file system checks its buffer cache blocks and, on a miss, sends the request over the network to the NFS server's file system.]

Page 16:

NFS-2 Cache Operation and Consistency Semantics

Application writes modify blocks in buffer cache and on the server with write() RPC (“write-through”)

Consistency validations of cached data compare the last-modified timestamp in the cache with the value on the server
- If the server timestamp is later, the cached data is discarded
- Note the need for (approximately) synchronized clocks on client and server

Cached file data is validated (a getattr() RPC) each time the file is opened.

Cached data is validated (a getattr() RPC) if an application accesses it and a time threshold has passed since it was last validated
- (validations are also done each time a last-modified attribute is returned on an RPC for read, lookup, etc.)
- 3 second threshold for files
- 30 second threshold for directories

If the time threshold has NOT passed, the cached file/directory data is assumed valid and the application is given access to it.
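A sketch of the validation rule just described, assuming a cache entry that remembers the server's last-modified time and when it was last validated; server_getattr stands in for the getattr() RPC and the entry layout is illustrative.

    import time

    FILE_THRESHOLD = 3       # seconds, per the slide
    DIR_THRESHOLD = 30       # seconds, per the slide

    def cached_data_is_usable(entry, server_getattr, is_directory):
        """entry holds: data, server_mtime (from last validation), validated_at."""
        threshold = DIR_THRESHOLD if is_directory else FILE_THRESHOLD
        if time.time() - entry["validated_at"] < threshold:
            return True                      # within threshold: assume still valid
        attrs = server_getattr()             # getattr() RPC to the server
        # Comparing timestamps assumes client and server clocks are roughly in sync.
        if attrs["mtime"] > entry["server_mtime"]:
            return False                     # server copy is newer: discard cache
        entry["validated_at"] = time.time()  # revalidated: restart the clock
        return True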

Page 17:

Andrew Design Goals

- Transparent file naming in a single name space
- Scalable -- O(1000s)
- Performance approximating local files
- Easy administration and operation
- “Flexible” protections (directory scope)
- Clear (non-Unix) consistency semantics
- Security (authentication)

Page 18:

Users Share a Single Name Space

[Figure: several client machines, each with a local root (/) containing etc, bin, and cache, all attach the same shared afs subtree, which holds pkg, home (users such as smithfd and reiter, with doc and proj), tools, and win32.]

Page 19:

Server “Cells” Form a “Global” File System

[Figure: client machines with local roots (/ with etc, bin, cache) attach a global /afs tree that names cells such as /afs/cs.unc.edu, /afs/cern.ch, and /afs/cs.cmu.edu; each cell holds its own home directories (e.g., smithfd, reiter, alice, bob, ted, carol).]

Page 20:

Andrew File Access (Open)

[Figure: the application calls open(/afs/usr/fds/foo.c); the request goes through the OS kernel to the Andrew client, which keeps an on-disk cache and talks over the network to the file system on the Server.]

(1) - open request passed to Andrew client
(2) - client checks cache for valid file copy
(3) - if not in cache, fetch whole file from server and write to cache; else (4)
(4) - when file is in cache, return handle to local file

Page 21:

Andrew File Access (Read/Write)

[Figure: the application issues read(fh, buffer, length) and write(fh, buffer, length); the Andrew client satisfies them from the on-disk cache without contacting the server.]

(5) - read and write operations take place on local cache copy

Page 22:

Andrew File Access (Close-Write)

[Figure: the application calls close(fh); the Andrew client sends the cached file back over the network to the Server.]

(6) - close request passed to Andrew client
(7) - client writes whole file back to server from cache
(8) - server copy of file is entirely replaced
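Pulling pages 20-22 together, a sketch of the whole-file caching flow, assuming hypothetical fetch_whole_file/store_whole_file RPCs and an on-disk cache directory; a real AFS client does this inside the kernel cache manager.

    import os

    CACHE_DIR = "/var/cache/afs-sketch"          # hypothetical on-disk cache location

    def afs_open(path, fetch_whole_file, cache_is_valid):
        cached = os.path.join(CACHE_DIR, path.replace("/", "_"))
        # (2) check the cache for a valid copy; (3) otherwise fetch the whole file
        if not (os.path.exists(cached) and cache_is_valid(path)):
            data = fetch_whole_file(path)        # one RPC, entire file
            os.makedirs(CACHE_DIR, exist_ok=True)
            with open(cached, "wb") as f:
                f.write(data)
        # (4) return a handle to the local copy; (5) reads/writes are purely local
        return open(cached, "r+b")

    def afs_close(path, local_file, store_whole_file, was_modified):
        local_file.seek(0)
        data = local_file.read()
        local_file.close()
        # (6)-(8) on close, write the whole file back; the server copy is replaced
        if was_modified:
            store_whole_file(path, data)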

Page 23:

Andrew Cache Operation and Consistency Semantics

Directory lookups use a valid cache copy; directory updates (e.g., create or remove) to the cache are “written through” to the server.

When file or directory data is fetched, the server “guarantees” to notify (callback) the client before changing the server’s copy.

Cached data is used without checking until a callback is received for it or 10 minutes have elapsed without communication with its server.

On receiving a callback for a file or directory, the client invalidates the cached copy.

Cached data can also be revalidated (and new callbacks established) by the client with an RPC to the server; this avoids discarding all cache content after a network partition or client crash.
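A sketch of the client-side callback bookkeeping described above, with made-up names; it tracks which cached entries hold a callback promise and treats a promise as suspect after 10 minutes without hearing from the server.

    import time

    CALLBACK_LIFETIME = 10 * 60      # 10 minutes without server contact

    class CallbackCache:
        def __init__(self):
            self.entries = {}        # path -> {"data": ..., "last_contact": t}

        def on_fetch(self, path, data):
            # Server has promised a callback before changing its copy.
            self.entries[path] = {"data": data, "last_contact": time.time()}

        def on_callback(self, path):
            # Server is about to change the file/directory: invalidate our copy.
            self.entries.pop(path, None)

        def on_server_contact(self, paths):
            for p in paths:
                if p in self.entries:
                    self.entries[p]["last_contact"] = time.time()

        def usable(self, path):
            e = self.entries.get(path)
            if e is None:
                return False
            # Use without checking until a callback arrives or 10 minutes pass
            # without communication with the server (then revalidate via RPC).
            return time.time() - e["last_contact"] < CALLBACK_LIFETIME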

Page 24:

Andrew Benchmark -- Server CPU Utilization

Source: Howard, et al, “Scale and Performance in a Distributed File System”, ACM TOCS, vol. 6, no. 1, February 1988, pp. 51-81.

Andrew server utilization increases slowly with load

NFS server utilization saturates quickly with load

Page 25:

Google is Really Different….

Huge Datacenters in 30+ Worldwide Locations

Datacenters house multiple server clusters

- Even nearby in Lenoir, NC
- Each larger than a football field
- 4-story cooling towers

[Photos: The Dalles, OR (2006); 2007; 2008]

Page 26:

Google is Really Different….

Each cluster has hundreds/thousands of Linux systems on Ethernet switches

500,000+ total servers

Page 27:

Custom Design Servers

- Clusters of low-cost commodity hardware
- Custom design using high-volume components
- SATA disks, not SAS (high capacity, low cost, somewhat less reliable)
- No “server-class” machines
- Battery power backup

Page 28:

Facebook Enters the Custom Server Race (April 7, 2011)

Announces the Open Compute Project (the Green Data Center)

Page 29:

Google File System Design Goals

- Familiar operations, but NOT the Unix/POSIX standard
- Specialized operation for Google applications: record_append()
- Scalable -- O(1000s) of clients per cluster
- Performance optimized for throughput
  - No client caches (big files, little cache locality)
- Highly available and fault tolerant
- Relaxed file consistency semantics
  - Applications written to deal with consistency issues

Page 30:

File and Usage Characteristics

- Many files are 100s of MB or 10s of GB
  - Results from web crawls, query logs, archives, etc.
- Relatively small number of files (millions/cluster)
- File operations:
  - Large sequential (streaming) reads/writes
  - Small random reads (rare random writes)
- Files are mostly “write-once, read-many.”
- File writes are dominated by appends, many from hundreds of concurrent processes (e.g., web crawlers)

[Figure: many processes appending concurrently to a single appended file.]

Page 31:

GFS Basics

- Files are named with a conventional pathname hierarchy (but there are no actual directory files), e.g., /dir1/dir2/dir3/foobar
- Files are composed of 64 MB “chunks” (Linux typically uses 4 KB blocks)
- Each GFS cluster has many servers (Linux processes):
  - One primary Master Server
  - Several “Shadow” Master Servers
  - Hundreds of “Chunk” Servers
- Each chunk is represented by a normal Linux file
  - Linux file system buffer provides caching and read-ahead
  - Linux file system extends file space as needed up to the chunk size
- Each chunk is replicated (3 replicas by default)
- Chunks are check-summed in 64 KB blocks for data integrity
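A sketch of the chunk arithmetic implied above: a byte offset in a file maps to a 64 MB chunk index and an offset within that chunk, and each chunk's data is checksummed in 64 KB blocks. The function names are illustrative, and crc32 stands in for whatever checksum GFS actually uses.

    import zlib

    CHUNK_SIZE = 64 * 1024 * 1024        # 64 MB GFS chunks
    CHECKSUM_BLOCK = 64 * 1024           # checksums cover 64 KB blocks

    def locate(offset):
        """Map a file byte offset to (chunk index, offset within chunk)."""
        return offset // CHUNK_SIZE, offset % CHUNK_SIZE

    def block_checksums(chunk_data):
        """Per-64 KB checksums for one chunk's data."""
        return [zlib.crc32(chunk_data[i:i + CHECKSUM_BLOCK])
                for i in range(0, len(chunk_data), CHECKSUM_BLOCK)]

    # Example: locate(200_000_000) == (2, 65_782_272)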

Page 32:

GFS Protocols for File Reads

Minimizes client interaction with the master:
- Data operations go directly to chunk servers.
- Clients cache chunk metadata until a new open or a timeout.

Ghemawat, S., H. Gobioff, and S-T. Leung, The Google File System, Proceedings of ACM SOSP 2003, pp. 29-43
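A sketch of that read path, assuming hypothetical master.find_chunk and chunkserver_read calls: the master is asked once per chunk for (chunk handle, replica locations), the answer is cached, and the data bytes come straight from a chunk server rather than through the master.

    CHUNK_SIZE = 64 * 1024 * 1024

    class GFSClientSketch:
        def __init__(self, master):
            self.master = master
            self.chunk_cache = {}        # (path, chunk index) -> (handle, replicas)

        def read(self, path, offset, length, chunkserver_read):
            chunk_index = offset // CHUNK_SIZE
            key = (path, chunk_index)
            if key not in self.chunk_cache:
                # One small request to the master: (file name, chunk index) ->
                # (chunk handle, replica locations); cached until open/timeout.
                self.chunk_cache[key] = self.master.find_chunk(path, chunk_index)
            handle, replicas = self.chunk_cache[key]
            # Data flows directly from a chunk server, not through the master.
            return chunkserver_read(replicas[0], handle, offset % CHUNK_SIZE, length)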

Page 33:

Master Server Functions

- Maintain file name space (atomic create, delete of names)
- Maintain chunk metadata
  - Assign immutable, globally-unique 64-bit identifiers
  - Mapping from file name to chunk(s)
  - Current chunk replica locations (refreshed dynamically from chunk servers)
- Maintain access control data
- Manage replicas and other chunk-related actions
  - Assign primary replica and version number
  - Garbage collect deleted chunks and stale replicas
  - Migrate chunks for load balancing
  - Re-replicate chunks when servers fail
- Heartbeat and state-exchange messages with chunk servers
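A sketch of the metadata the master could keep in memory for the functions listed above, with illustrative structures: the namespace map, the file-to-chunk-handles map, and the chunk-to-replica-locations map refreshed from chunk server heartbeats. This is not the paper's data layout, just a compact model of it.

    import itertools

    class MasterStateSketch:
        def __init__(self):
            self.namespace = {}            # full pathname -> list of chunk handles
            self.chunk_locations = {}      # chunk handle -> set of chunk server ids
            self.chunk_version = {}        # chunk handle -> version number
            self._next_handle = itertools.count(1)   # stands in for 64-bit ids

        def create(self, path):
            # Namespace mutations (create/delete of names) are atomic on the master.
            if path in self.namespace:
                raise FileExistsError(path)
            self.namespace[path] = []

        def add_chunk(self, path):
            handle = next(self._next_handle)         # immutable, globally unique
            self.namespace[path].append(handle)
            self.chunk_version[handle] = 1
            self.chunk_locations[handle] = set()
            return handle

        def heartbeat(self, server_id, handles_held):
            # Replica locations are refreshed dynamically from chunk servers.
            for h in handles_held:
                self.chunk_locations.setdefault(h, set()).add(server_id)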

Page 34:

GFS Relaxed Consistency Model

Writes that are large or cross chunk boundaries may be broken into multiple smaller ones by GFS

- Sequential writes (successful): one-copy semantics*; writes serialized.
- Concurrent writes (successful): one-copy semantics; writes not serialized in overlapping regions.
- Sequential or concurrent writes (failure): replicas may differ; application should retry.

* Informally (“all replicas equal”): there exists exactly one current value at all replicas, and that value is returned for a read of any replica.

Page 35:

GFS Applications Deal with Relaxed Consistency

Writes
- Retry in case of failure at any replica
- Regular checkpoints after successful sequences
- Include application-generated record identifiers and checksums

Reads
- Use checksum validation and record identifiers to discard padding and duplicates.
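A sketch of that reader-side discipline: records carry an application-generated id and checksum, and the reader drops records that fail the checksum (padding or garbage) or whose id it has already seen (duplicates from retried appends). The record layout is made up for illustration.

    import zlib

    def valid_records(records, seen_ids=None):
        """records: iterable of dicts {"id": ..., "checksum": ..., "payload": bytes}."""
        seen_ids = set() if seen_ids is None else seen_ids
        for rec in records:
            if zlib.crc32(rec["payload"]) != rec["checksum"]:
                continue                  # padding or corrupted region: discard
            if rec["id"] in seen_ids:
                continue                  # duplicate from a retried record_append()
            seen_ids.add(rec["id"])
            yield rec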

Page 36:

GFS Chunk Replication (1/2)

[Figure: a Client, the Master, and chunk servers C1, C2 (primary), and C3, each holding a replica of the chunk; the client asks the master to find the locations (C1, C2 (primary), C3); pushed data sits in LRU buffers at the chunk servers and is ACKed back along the forwarding chain.]

1. Client contacts master to get replica state and caches it.
2. Client picks any chunk server and pushes data. Servers forward data along the “best” path to others.

Page 37:

GFS Chunk Replication (2/2)

[Figure: the Client sends the write to the primary (C2); the primary assigns the write order, forwards it to C1 and C3, collects their ACKs, and returns success/failure to the client.]

3. Client sends write request to primary.
4. Primary assigns write order and forwards to replicas.
5. Primary collects ACKs and responds to client. Applications must retry the write if there is any failure.
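A sketch of the two-phase client write shown on pages 36-37: push the data to any replica (it is forwarded along the chain and buffered), then ask the primary to commit, and retry the whole write on any failure. The push_data and commit_at_primary callables are assumed stand-ins, not a real GFS client library.

    def gfs_write_sketch(replicas, primary, data, push_data, commit_at_primary,
                         max_retries=3):
        """push_data(server, data) -> data id; commit_at_primary(primary, data_id,
        replicas) asks the primary to assign the write order and forward it."""
        for attempt in range(max_retries):
            # Phase 1: push data to any chunk server; servers forward it along
            # the "best" path and hold it in LRU buffers until the commit arrives.
            data_id = push_data(replicas[0], data)
            # Phase 2: the primary assigns the write order, forwards to the other
            # replicas, collects their ACKs, and reports success or failure.
            if commit_at_primary(primary, data_id, replicas):
                return True
            # Any replica failure surfaces here; the application must retry.
        return False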

Page 38:

GFS record_append()

Client specifies only data content and region size; server returns actual offset to region

- Guaranteed to append at least once atomically
- File may contain padding and duplicates
  - Padding if the region size won’t fit in the chunk
  - Duplicates if it fails at some replicas and the client must retry record_append()
- If record_append() completes successfully, all replicas will contain at least one copy of the region at the same offset
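A sketch of how an application might use record_append() given the at-least-once guarantee: it tags the record with its own id and checksum (so readers can discard padding and duplicates, as on page 35) and retries until some append succeeds. The record_append callable and the envelope format are assumptions for illustration, not a real client API.

    import uuid, zlib, json

    def append_record(record_append, payload: bytes, max_retries=5):
        """record_append(data) -> offset on success, or None on failure."""
        envelope = json.dumps({
            "id": str(uuid.uuid4()),               # lets readers drop duplicates
            "checksum": zlib.crc32(payload),       # lets readers drop padding/garbage
            "payload": payload.decode("latin-1"),  # keep the sketch JSON-serializable
        }).encode()
        for attempt in range(max_retries):
            offset = record_append(envelope)       # at-least-once, atomic per attempt
            if offset is not None:
                return offset                      # same offset at every replica
        raise RuntimeError("record_append failed after retries")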

Page 39:

GFS Record Append (1/3)

[Figure: as in chunk replication, the Client asks the Master for the replica locations (C1, C2 (primary), C3); pushed data sits in LRU buffers at the chunk servers and is ACKed back along the forwarding chain.]

1. Client contacts master to get replica state and caches it.
2. Client picks any chunk server and pushes data. Servers forward data along the “best” path to others.

Page 40:

GFS Record Append (2/3)

[Figure: the Client sends the append to the primary (C2); the primary assigns the write order and the record offset, forwards to the other replicas, collects ACKs, and returns the assigned offset (or failure) to the client.]

3. Client sends write request to primary.
4. If the record fits in the last chunk, the primary assigns the write order and offset and forwards to the replicas.
5. Primary collects ACKs and responds to the client with the assigned offset. Applications must retry the write if there is any failure.

Page 41:

GFS Record Append (3/3)

[Figure: the record does not fit in the last chunk; the primary and the replicas pad the last chunk to its end, and the client retries on the next chunk.]

3. Client sends write request to primary.
4. If the record overflows the last chunk, the primary and replicas pad the last chunk, and the returned offset points to the next chunk.
5. Client must retry the write from the beginning.