The Google File System (GFS)
Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung
Presented by Jianfeng Zhan
Acknowledgement
Parts of the content are from CSE 490H – Introduction to Distributed Computing, Winter 2008, University of Washington
Distributed File Systems
Tradeoffs in distributed file systems:
Performance
Scalability
Reliability
Availability
Two core approaches: a supercomputer? Or many cheap computers?
Motivation
Google went the cheap commodity route… Lots of data on cheap machines!
Why not use an existing file system?
Google's problems are unique
GFS is designed for Google workloads
Google apps are designed for GFS
Design constraints (1/2)
Component failures are the norm
Large-scale systems built from cheap hardware
Bugs, human errors, and failures of memory, disks, connectors, networking, and power supplies
Requires monitoring, error detection, fault tolerance, and automatic recovery
Files are huge by traditional standards
Multi-GB files are common
But there aren't THAT many files
Design constraints (2/2)
Mutations typically append new data
Random writes are rare
Once written, files are only read, and typically sequentially
Optimize for this!
Large sequential reads, small random reads
Want high sustained bandwidth
Low latency is not that important
Google co-designs the apps AND the file system
GFS Interface
Supports the usual operations:
create, delete, open, close, read, write
Snapshot: copies a file or a directory tree
Record append: allows multiple clients to append to the same file concurrently
GFS Architecture
A single master
Multiple chunkservers
Architectural Design (1/4)
A GFS cluster:
a single master
multiple chunkservers per master
accessed by multiple clients
running on commodity Linux machines
Each file is divided into fixed-size chunks
Each chunk is labeled with a 64-bit globally unique ID (the chunk handle)
Chunks are stored on chunkservers and replicated 3 ways across chunkservers by default (sketched below)
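To make the mapping concrete, here is a minimal Python sketch of the master's core structures; the names (Chunk, MasterMetadata) and fields are illustrative assumptions, not GFS's actual internals, and locking and persistence are not shown.

```python
# A minimal sketch of the master's metadata, with illustrative names.
import dataclasses
import itertools

CHUNK_SIZE = 64 * 2**20            # fixed 64 MB chunks
_next_handle = itertools.count(1)  # stand-in for 64-bit unique chunk IDs

@dataclasses.dataclass
class Chunk:
    handle: int                    # globally unique chunk ID (chunk handle)
    version: int = 1               # used to detect stale replicas
    replicas: list = dataclasses.field(default_factory=list)  # chunkserver addrs

class MasterMetadata:
    def __init__(self):
        self.files = {}            # pathname -> [Chunk, ...] in file order

    def create(self, path):
        self.files[path] = []

    def chunk_for_offset(self, path, offset):
        """Translate (file, byte offset) into a chunk, as a client request would."""
        index = offset // CHUNK_SIZE
        chunks = self.files[path]
        while len(chunks) <= index:      # allocate lazily, for the sketch
            chunks.append(Chunk(handle=next(_next_handle)))
        return chunks[index]
```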
Architectural Design (2/4)
Master server
Maintains all metadata:
namespace, access control info, file-to-chunk mappings, chunk locations
Controls system-wide activities:
chunk lease management, garbage collection of orphaned chunks, and chunk migration between chunkservers
Periodically communicates with each chunkserver via HeartBeat messages to give it instructions and collect its state (sketched below)
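A rough, in-process sketch of the HeartBeat exchange follows; network RPC is replaced by direct method calls, and the message shapes and instruction format are invented for illustration.

```python
# HeartBeat sketch: the master collects each chunkserver's state and
# piggybacks instructions; all names and message shapes are illustrative.
class Chunkserver:
    def __init__(self, addr):
        self.addr = addr
        self.chunks = {}                     # handle -> version
        self.disk_used = 0

    def heartbeat(self, instructions):
        for handle in instructions.get("delete", []):
            self.chunks.pop(handle, None)    # e.g., garbage-collected chunks
        return {"chunks": dict(self.chunks), "disk_used": self.disk_used}

def master_heartbeat_round(master_state, chunkservers):
    for cs in chunkservers:
        instrs = master_state.get("instructions", {}).get(cs.addr, {})
        state = cs.heartbeat(instrs)
        # Chunk locations are rebuilt from what servers actually report.
        for handle in state["chunks"]:
            master_state.setdefault("locations", {}).setdefault(handle, set()).add(cs.addr)
        master_state.setdefault("disk_used", {})[cs.addr] = state["disk_used"]
```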
Architectural Design (3/4)
GFS clients
Consult the master for metadata
Access data directly from chunkservers
Do not go through the Linux VFS layer, since GFS does not provide the POSIX API (see the read-path sketch below)
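A self-contained sketch of that read path, with plain dictionaries standing in for the master and the chunkservers; the single-chunk case only.

```python
# Client read path sketch: metadata from the master, data from a chunkserver.
CHUNK_SIZE = 64 * 2**20

def gfs_read(master_lookup, chunkservers, path, offset, length):
    # 1. Metadata from the master: which chunk, and which replicas hold it.
    handle, replica_addrs = master_lookup(path, offset // CHUNK_SIZE)
    # 2. Data directly from a chunkserver; file data never flows through the master.
    data = chunkservers[replica_addrs[0]][handle]   # a real client picks the closest
    start = offset % CHUNK_SIZE
    return data[start:start + length]

# Usage: one file of a single chunk (handle 7) stored on two replicas.
chunkservers = {"cs1": {7: b"x" * 100}, "cs2": {7: b"x" * 100}}
lookup = lambda path, idx: (7, ["cs1", "cs2"])
assert gfs_read(lookup, chunkservers, "/logs/a", 10, 5) == b"xxxxx"
```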
Architectural Design (4/4)
No file-data caching at clients or chunkservers, because the common case is streaming
Clients: most applications stream through huge files whose working sets are too large to cache
No caching simplifies the client and the overall system by eliminating cache-coherence issues
(Clients do cache metadata, however.)
Chunkservers need not cache file data; they rely on Linux's buffer cache
Single-Master Design
Known risks of a single master in distributed systems:
single point of failure
scalability bottleneck
GFS solutions:
shadow masters
minimize master involvement:
never move data through the master; use it only for metadata
large chunk size (64 MB)
the master delegates authority over data mutations to primary replicas (chunk leases)
Simple, and good enough!
Master’s responsibilities (1/2)
Metadata storage
Namespace management and locking
Periodic communication with chunkservers:
give instructions, collect state, track cluster health
Garbage collection
Master’s responsibilities (2/2)
Chunk creation:
place new replicas on chunkservers with below-average disk-space utilization
limit the number of recent creations on each chunkserver
spread replicas across racks
Re-replication when the number of replicas falls below the user's goal
Periodic rebalancing:
better disk-space usage
load balancing
(See the placement sketch below.)
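The creation-time placement policy might look roughly like this sketch; the Server record, the thresholds, and the two-pass rack spreading are assumptions for illustration, and the real policy weighs more factors.

```python
# Placement sketch: under-utilized, not creation-hot, spread across racks.
import dataclasses

@dataclasses.dataclass
class Server:
    addr: str
    rack: str
    disk_used_frac: float       # fraction of disk in use
    recent_creations: int = 0   # recent chunk creations on this server

def place_replicas(servers, n=3, max_recent=5):
    """Pick n servers for a new chunk's replicas."""
    avg = sum(s.disk_used_frac for s in servers) / len(servers)
    candidates = sorted(
        (s for s in servers
         if s.disk_used_frac <= avg and s.recent_creations < max_recent),
        key=lambda s: s.disk_used_frac)
    chosen, racks = [], set()
    for s in candidates:                      # pass 1: one replica per rack
        if len(chosen) < n and s.rack not in racks:
            chosen.append(s); racks.add(s.rack)
    for s in candidates:                      # pass 2: fill remaining slots
        if len(chosen) < n and s not in chosen:
            chosen.append(s)
    for s in chosen:
        s.recent_creations += 1
    return chosen

servers = [Server("cs1", "r1", 0.3), Server("cs2", "r2", 0.4),
           Server("cs3", "r1", 0.2), Server("cs4", "r3", 0.9)]
print([s.addr for s in place_replicas(servers)])   # ['cs3', 'cs2', 'cs1']
```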
Chunk Size
64 MB
Fewer chunk-location requests to the master
Reduced overhead to access a chunk:
with a large chunk, a client performs many operations on the same chunk, so it can reduce network overhead by keeping a persistent TCP connection to the chunkserver
Fewer metadata entries, which can all be kept in memory
Metadata (1/5)
Global metadata is stored on the master:
file and chunk namespaces
mapping from files to chunks
locations of each chunk's replicas
All in memory (< 64 bytes per chunk)
Fast, easily accessible
Any problems?
Metadata (2/5)
The master has an operation log for persistent logging of critical metadata updates:
persistent on local disk
replicated on remote machines
checkpoints for faster recovery
Metadata (3/5)
Three major types:
File and chunk namespaces
File-to-chunk mappings
These two are kept persistent by logging mutations to an operation log stored on the master's local disk and replicated on remote machines
Locations of each chunk's replicas
The master does not store chunk location information persistently; it asks each chunkserver about its chunks at master startup and whenever a chunkserver joins the cluster
Metadata (4/5)
All metadata is kept in memory. Fast!
Enables quick global scans, used for:
garbage collection
reorganizations: re-replication in the presence of chunkserver failures, and chunk migration to balance load and disk-space usage across chunkservers
Less than 64 bytes of metadata per 64 MB of data (see the arithmetic sketch below)
File names are stored compactly using prefix compression
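A quick back-of-the-envelope check of what these figures imply for master memory, using only the numbers on this slide (64-byte entries, 64 MB chunks):

```python
# Master memory footprint per petabyte of stored data.
PB = 2**50
CHUNK_SIZE = 64 * 2**20
BYTES_PER_CHUNK_ENTRY = 64       # upper bound from the slide

chunks = PB // CHUNK_SIZE                    # 16,777,216 chunks per PB
metadata = chunks * BYTES_PER_CHUNK_ENTRY    # = 1 GiB of master memory
print(f"{chunks:,} chunks -> {metadata / 2**30:.0f} GiB of metadata per PB")
```

So even petabyte-scale clusters need only gigabytes of master RAM for chunk metadata, which is why keeping it all in memory is practical.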
Metadata (5/5)
Chunk locations: no persistent state
The master polls chunkservers at startup
Heartbeat messages monitor servers afterwards
Simplicity: on-demand polling vs. continuous coordination
On-demand wins when changes (failures) are frequent
There is no point in maintaining a consistent view on the master, because errors on a chunkserver may cause chunks to vanish, or an operator may rename a chunkserver
Operation Logs (1/2)
Central to GFS: it contains a historical record of critical metadata changes
Not only is it the only persistent record of metadata, it also serves as a logical timeline that defines the order of concurrent operations
Files and chunks, as well as their versions, are all uniquely and eternally identified by the logical times at which they were created
Operation Logs (2/2)
Metadata updates are logged, e.g., as <old value, new value> pairs
The log is replicated on remote machines
Global snapshots (checkpoints) truncate the log
Checkpoints are in a compact, memory-mappable, B-tree-like form
Checkpoints take a while, so they are created while updates keep arriving:
the master switches to a new log file and creates the new checkpoint in a separate thread
Recovery = latest checkpoint + replay of subsequent log files (see the sketch below)
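A minimal sketch of the log-plus-checkpoint scheme, assuming JSON records on local disk; for simple replay it logs (key, new value) records, and remote replication, the B-tree checkpoint format, and the background checkpoint thread are omitted.

```python
# Log + checkpoint sketch: durable appends, snapshots, and recovery by replay.
import json, os

def append_to_log(log_path, update):
    """update is a (key, new_value) pair; it is durable before we ACK."""
    with open(log_path, "a") as f:
        f.write(json.dumps(update) + "\n")
        f.flush()
        os.fsync(f.fileno())

def checkpoint(state, ckpt_path):
    with open(ckpt_path, "w") as f:
        json.dump(state, f)        # real GFS uses a compact B-tree-like form

def recover(ckpt_path, log_path):
    """Rebuild state = latest checkpoint + replay of subsequent log records."""
    with open(ckpt_path) as f:
        state = json.load(f)
    with open(log_path) as f:
        for line in f:
            key, value = json.loads(line)
            state[key] = value     # re-apply each logged metadata update
    return state
```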
Mutations
Mutation = a write or an append
Must be applied at all replicas
Goal: minimize master involvement
Lease mechanism:
the master picks one replica as primary and gives it a "lease" for mutations
the primary defines a serial order of mutations
all replicas follow this order
Data flow is decoupled from control flow (see the ordering sketch below)
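The essence of lease-based ordering, sketched in-process; the pipelined data flow along a chain of chunkservers and all failure handling are omitted, and the class names are illustrative.

```python
# Lease-based mutation ordering: the primary assigns serial numbers,
# all replicas apply mutations in that order.
class Replica:
    def __init__(self):
        self.mutations = []        # applied in serial-number order

    def apply(self, serial, data):
        self.mutations.append((serial, data))

class Primary(Replica):
    """The lease holder; it alone chooses the serial order."""
    def __init__(self, secondaries):
        super().__init__()
        self.secondaries = secondaries
        self.next_serial = 0

    def mutate(self, data):
        serial = self.next_serial
        self.next_serial += 1
        self.apply(serial, data)        # apply locally
        for r in self.secondaries:      # forward the order, not the choice
            r.apply(serial, data)
        return serial

# Two mutations end up in the same order everywhere.
s1, s2 = Replica(), Replica()
p = Primary([s1, s2])
p.mutate(b"A"); p.mutate(b"B")
assert s1.mutations == s2.mutations == p.mutations
```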
Data Mutations
A write causes data to be written at an application-specified file offset.
A record append causes data (the “record”) to be appended atomically at least once even in the presence of concurrent mutations.
Atomic record append
Client specifies data
GFS appends it to the file atomically, at least once
GFS picks the offset and returns it to the client
In contrast, a "regular" append is merely a write at an offset that the client believes to be the current end of file
Used heavily by Google apps, e.g., for files that serve as multiple-producer/single-consumer queues (see the sketch below)
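An illustration of why at-least-once appends produce occasional duplicates: the client retries whenever it is unsure the append was acknowledged. The failure model here is invented for the sketch, and a single in-memory list stands in for the replicated file tail.

```python
# At-least-once record append with client retries.
import random

chunk = []   # the shared file tail; GFS, not the client, picks the offset

def record_append(record, fail_prob=0.3):
    """On an unclear outcome the client retries, so the record can land twice."""
    while True:
        chunk.append(record)               # the append itself is atomic
        if random.random() > fail_prob:    # did the ACK reach the client?
            return len(chunk) - 1          # GFS returns the chosen offset
        # ACK lost: retry; the earlier copy stays behind as a duplicate

for i in range(5):
    record_append(f"rec-{i}")
print(chunk)   # every record appears at least once, possibly duplicated
```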
Consistency Model (1/3)
A file region is consistent if all clients will always see the same data, regardless of which replicas they read from.
A region is defined after a file data mutation if it is consistent and clients will see what the mutation writes in its entirety.
Consistency Model (2/3)
"Consistent" = all replicas have the same value
"Defined" = consistent, and the region reflects the mutation in its entirety
Some properties:
concurrent successful writes leave the region consistent, but possibly undefined
failed writes leave the region inconsistent
Some work has moved into the applications: e.g., self-validating, self-identifying records
Consistency Model (3/3)
Relaxed consistency
Concurrent changes are consistent but undefined
An append is atomically committed at least once
occasional duplications are possible; readers discard them (see the sketch below)
All changes to a chunk are applied in the same order at all replicas
Chunk version numbers detect missed updates (stale replicas)
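A sketch of how an application-side reader copes with this model using self-validating, self-identifying records; the record format (MD5 checksum + 64-bit record ID + payload) is invented for illustration.

```python
# Reader-side handling of padding/corruption and duplicates.
import hashlib

def make_record(record_id, payload: bytes) -> bytes:
    body = record_id.to_bytes(8, "big") + payload
    return hashlib.md5(body).digest() + body   # checksum makes it self-validating

def read_valid_records(records):
    """Skip corrupt records; drop duplicates by record ID."""
    seen, out = set(), []
    for rec in records:
        checksum, body = rec[:16], rec[16:]
        if hashlib.md5(body).digest() != checksum:
            continue                            # padding or corruption: skip
        rid = int.from_bytes(body[:8], "big")
        if rid not in seen:                     # duplicate from a retry: drop
            seen.add(rid)
            out.append(body[8:])
    return out

recs = [make_record(1, b"a"), make_record(1, b"a"), make_record(2, b"b")]
assert read_valid_records(recs) == [b"a", b"b"]
```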
System Interactions
The master grants a chunk lease to one replica (the primary)
The replica holding the lease determines the order of updates at all replicas
Lease:
60-second timeout
can be extended indefinitely
extension requests are piggybacked on heartbeat messages
after a lease expires, the master can grant a new one
Replica Placement
Goals:
maximize data reliability and availability
maximize network bandwidth utilization
Spread chunk replicas across machines and across racks
Give higher re-replication priority to chunks with fewer remaining replicas
Limit the resources spent on replication
Fault Tolerance and Diagnosis (1/2)
High availability:
fast recovery: master and chunkservers are restartable in a few seconds
chunk replication: 3 replicas by default
shadow masters provide read-only access when the primary master is down
Fault Tolerance and Diagnosis (2/2)
Data integrity:
each chunk is divided into 64-KB blocks
each block has its own checksum
checksums are verified at read and write time
background scans also verify rarely accessed data (see the sketch below)
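A sketch of per-block verification on the read path; CRC32 is used here as a stand-in checksum, and the 64-KB block size follows the slide.

```python
# Per-block checksums: compute on write, verify every block a read touches.
import zlib

BLOCK = 64 * 1024

def checksums(chunk_data: bytes):
    return [zlib.crc32(chunk_data[i:i + BLOCK])
            for i in range(0, len(chunk_data), BLOCK)]

def verified_read(chunk_data, stored_sums, offset, length):
    """Verify every block the read touches before returning any data."""
    first, last = offset // BLOCK, (offset + length - 1) // BLOCK
    for b in range(first, last + 1):
        if zlib.crc32(chunk_data[b * BLOCK:(b + 1) * BLOCK]) != stored_sums[b]:
            raise IOError(f"corrupt block {b}: report to master, re-replicate")
    return chunk_data[offset:offset + length]

data = b"\xab" * (3 * BLOCK)
sums = checksums(data)
assert verified_read(data, sums, 100, 200) == data[100:300]
```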
Deployment in Google
50+ GFS clusters
Each with thousands of storage nodes
Managing petabytes of data
GFS sits underneath BigTable, etc.
Conclusion
GFS demonstrates how to support large-scale processing workloads on commodity hardware:
design to tolerate frequent component failures
optimize for huge files that are mostly appended to, then read
feel free to relax and extend the FS interface as required
go for simple solutions (e.g., single master)
GFS has met Google’s storage needs… it must be good!
Thank you
Course Evaluation
Goal: systems understanding + presentation skills
Describe a system in 4 pages
Describe the system with 4 figures (half a page each, drawn in Microsoft Visio)
Describe the figures in the most concise text possible
Work in groups of 2
Your submission must not be identical to anything I have already seen
Otherwise the score will not exceed 75