The Google File System (GFS)
Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung
Presented by Jianfeng Zhan
Acknowledgement
Parts of the content are from CSE 490H – Introduction to Distributed Computing, Winter 2008, University of Washington
Distributed File Systems
Tradeoffs in distributed file systems:
Performance
Scalability
Reliability
Availability
Two core approaches: a supercomputer? Or many cheap computers?
Motivation
Google went the cheap commodity route… Lots of data on cheap machines!
Why not use an existing file system?
Google's problems are unique
GFS is designed for Google workloads
Google apps are designed for GFS
Design constraints (1/2)
Component failures are the norm
Large-scale systems built from cheap hardware
Bugs, human errors, and failures of memory, disks, connectors, networking, and power supplies
Requires monitoring, error detection, fault tolerance, and automatic recovery
Files are huge by traditional standards
Multi-GB files are common
But there aren't THAT many files
Design constraints (2/2)
Mutations typically append new data
Random writes are rare
Once written, files are only read, and typically sequentially
Optimize for this!
Large sequential reads, small random reads
Want high sustained bandwidth
Low latency is not that important
Google co-designs the apps AND the file system
GFS Interface
Supports the usual operations:
create, delete, open, close, read, write
Snapshot: copies a file or a directory tree
Record append: allows multiple clients to append to the same file concurrently
GFS Architecture
A single master
Multiple chunkservers
Architectural Design (1/4)
A GFS cluster:
a single master
multiple chunkservers per master
accessed by multiple clients
running on commodity Linux machines
Each file is divided into fixed-size chunks
Each chunk is labeled with a 64-bit globally unique ID (the chunk handle)
Chunks are stored on chunkservers and replicated 3 ways across chunkservers by default (sketched below)
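To make the mapping concrete, here is a minimal Python sketch of the master's core structures; the names (Chunk, MasterMetadata) and fields are illustrative assumptions, not GFS's actual internals, and locking and persistence are not shown.

```python
# A minimal sketch of the master's metadata, with illustrative names.
import dataclasses
import itertools

CHUNK_SIZE = 64 * 2**20            # fixed 64 MB chunks
_next_handle = itertools.count(1)  # stand-in for 64-bit unique chunk IDs

@dataclasses.dataclass
class Chunk:
    handle: int                    # globally unique chunk ID (chunk handle)
    version: int = 1               # used to detect stale replicas
    replicas: list = dataclasses.field(default_factory=list)  # chunkserver addrs

class MasterMetadata:
    def __init__(self):
        self.files = {}            # pathname -> [Chunk, ...] in file order

    def create(self, path):
        self.files[path] = []

    def chunk_for_offset(self, path, offset):
        """Translate (file, byte offset) into a chunk, as a client request would."""
        index = offset // CHUNK_SIZE
        chunks = self.files[path]
        while len(chunks) <= index:      # allocate lazily, for the sketch
            chunks.append(Chunk(handle=next(_next_handle)))
        return chunks[index]
```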
Architectural Design (2/4)
Master server
Maintains all metadata:
namespace, access control info, file-to-chunk mappings, chunk locations
Controls system-wide activities:
chunk lease management, garbage collection of orphaned chunks, and chunk migration between chunkservers
Periodically communicates with each chunkserver via HeartBeat messages to give it instructions and collect its state (sketched below)
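A rough, in-process sketch of the HeartBeat exchange follows; network RPC is replaced by direct method calls, and the message shapes and instruction format are invented for illustration.

```python
# HeartBeat sketch: the master collects each chunkserver's state and
# piggybacks instructions; all names and message shapes are illustrative.
class Chunkserver:
    def __init__(self, addr):
        self.addr = addr
        self.chunks = {}                     # handle -> version
        self.disk_used = 0

    def heartbeat(self, instructions):
        for handle in instructions.get("delete", []):
            self.chunks.pop(handle, None)    # e.g., garbage-collected chunks
        return {"chunks": dict(self.chunks), "disk_used": self.disk_used}

def master_heartbeat_round(master_state, chunkservers):
    for cs in chunkservers:
        instrs = master_state.get("instructions", {}).get(cs.addr, {})
        state = cs.heartbeat(instrs)
        # Chunk locations are rebuilt from what servers actually report.
        for handle in state["chunks"]:
            master_state.setdefault("locations", {}).setdefault(handle, set()).add(cs.addr)
        master_state.setdefault("disk_used", {})[cs.addr] = state["disk_used"]
```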
Architectural Design (3/4)
GFS clients
Consult the master for metadata
Access data directly from chunkservers
Do not go through the Linux VFS layer, since GFS does not provide the POSIX API (see the read-path sketch below)
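A self-contained sketch of that read path, with plain dictionaries standing in for the master and the chunkservers; the single-chunk case only.

```python
# Client read path sketch: metadata from the master, data from a chunkserver.
CHUNK_SIZE = 64 * 2**20

def gfs_read(master_lookup, chunkservers, path, offset, length):
    # 1. Metadata from the master: which chunk, and which replicas hold it.
    handle, replica_addrs = master_lookup(path, offset // CHUNK_SIZE)
    # 2. Data directly from a chunkserver; file data never flows through the master.
    data = chunkservers[replica_addrs[0]][handle]   # a real client picks the closest
    start = offset % CHUNK_SIZE
    return data[start:start + length]

# Usage: one file of a single chunk (handle 7) stored on two replicas.
chunkservers = {"cs1": {7: b"x" * 100}, "cs2": {7: b"x" * 100}}
lookup = lambda path, idx: (7, ["cs1", "cs2"])
assert gfs_read(lookup, chunkservers, "/logs/a", 10, 5) == b"xxxxx"
```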
Architectural Design (4/4)
No file-data caching at clients or chunkservers, because the common case is streaming
Clients: most applications stream through huge files whose working sets are too large to cache
No caching simplifies the client and the overall system by eliminating cache-coherence issues
(Clients do cache metadata, however.)
Chunkservers need not cache file data; they rely on Linux's buffer cache
Single-Master Design
Known risks of a single master in distributed systems:
single point of failure
scalability bottleneck
GFS solutions:
shadow masters
minimize master involvement:
never move data through the master; use it only for metadata
large chunk size (64 MB)
the master delegates authority over data mutations to primary replicas (chunk leases)
Simple, and good enough!
Master’s responsibilities (1/2)
Metadata storage
Namespace management and locking
Periodic communication with chunkservers:
give instructions, collect state, track cluster health
Garbage collection
Master’s responsibilities (2/2)
Chunk creation:
place new replicas on chunkservers with below-average disk-space utilization
limit the number of recent creations on each chunkserver
spread replicas across racks
Re-replication when the number of replicas falls below the user's goal
Periodic rebalancing:
better disk-space usage
load balancing
(See the placement sketch below.)
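The creation-time placement policy might look roughly like this sketch; the Server record, the thresholds, and the two-pass rack spreading are assumptions for illustration, and the real policy weighs more factors.

```python
# Placement sketch: under-utilized, not creation-hot, spread across racks.
import dataclasses

@dataclasses.dataclass
class Server:
    addr: str
    rack: str
    disk_used_frac: float       # fraction of disk in use
    recent_creations: int = 0   # recent chunk creations on this server

def place_replicas(servers, n=3, max_recent=5):
    """Pick n servers for a new chunk's replicas."""
    avg = sum(s.disk_used_frac for s in servers) / len(servers)
    candidates = sorted(
        (s for s in servers
         if s.disk_used_frac <= avg and s.recent_creations < max_recent),
        key=lambda s: s.disk_used_frac)
    chosen, racks = [], set()
    for s in candidates:                      # pass 1: one replica per rack
        if len(chosen) < n and s.rack not in racks:
            chosen.append(s); racks.add(s.rack)
    for s in candidates:                      # pass 2: fill remaining slots
        if len(chosen) < n and s not in chosen:
            chosen.append(s)
    for s in chosen:
        s.recent_creations += 1
    return chosen

servers = [Server("cs1", "r1", 0.3), Server("cs2", "r2", 0.4),
           Server("cs3", "r1", 0.2), Server("cs4", "r3", 0.9)]
print([s.addr for s in place_replicas(servers)])   # ['cs3', 'cs2', 'cs1']
```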
Chunk Size
64 MB
Fewer chunk-location requests to the master
Reduced overhead to access a chunk:
with a large chunk, a client performs many operations on the same chunk, so it can reduce network overhead by keeping a persistent TCP connection to the chunkserver
Fewer metadata entries, which can all be kept in memory
Metadata (1/5)
Global metadata is stored on the master:
file and chunk namespaces
mapping from files to chunks
locations of each chunk's replicas
All in memory (< 64 bytes per chunk)
Fast, easily accessible
Any problems?
Metadata (2/5)
The master has an operation log for persistent logging of critical metadata updates:
persistent on local disk
replicated on remote machines
checkpoints for faster recovery
Metadata (3/5)
Three major types:
File and chunk namespaces
File-to-chunk mappings
These two are kept persistent by logging mutations to an operation log stored on the master's local disk and replicated on remote machines
Locations of each chunk's replicas
The master does not store chunk location information persistently; it asks each chunkserver about its chunks at master startup and whenever a chunkserver joins the cluster
Metadata (4/5)
All metadata is kept in memory. Fast!
Enables quick global scans, used for:
garbage collection
reorganizations: re-replication in the presence of chunkserver failures, and chunk migration to balance load and disk-space usage across chunkservers
Less than 64 bytes of metadata per 64 MB of data (see the arithmetic sketch below)
File names are stored compactly using prefix compression
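A quick back-of-the-envelope check of what these figures imply for master memory, using only the numbers on this slide (64-byte entries, 64 MB chunks):

```python
# Master memory footprint per petabyte of stored data.
PB = 2**50
CHUNK_SIZE = 64 * 2**20
BYTES_PER_CHUNK_ENTRY = 64       # upper bound from the slide

chunks = PB // CHUNK_SIZE                    # 16,777,216 chunks per PB
metadata = chunks * BYTES_PER_CHUNK_ENTRY    # = 1 GiB of master memory
print(f"{chunks:,} chunks -> {metadata / 2**30:.0f} GiB of metadata per PB")
```

So even petabyte-scale clusters need only gigabytes of master RAM for chunk metadata, which is why keeping it all in memory is practical.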
Metadata (5/5)
Chunk locations: no persistent state
The master polls chunkservers at startup
Heartbeat messages monitor servers afterwards
Simplicity: on-demand polling vs. continuous coordination
On-demand wins when changes (failures) are frequent
There is no point in maintaining a consistent view on the master, because errors on a chunkserver may cause chunks to vanish, or an operator may rename a chunkserver
Operation Logs (1/2)
Central to GFS: it contains a historical record of critical metadata changes
Not only is it the only persistent record of metadata, it also serves as a logical timeline that defines the order of concurrent operations
Files and chunks, as well as their versions, are all uniquely and eternally identified by the logical times at which they were created
Operation Logs (2/2)
Metadata updates are logged, e.g., as <old value, new value> pairs
The log is replicated on remote machines
Global snapshots (checkpoints) truncate the log
Checkpoints are in a compact, memory-mappable, B-tree-like form
Checkpoints take a while, so they are created while updates keep arriving:
the master switches to a new log file and creates the new checkpoint in a separate thread
Recovery = latest checkpoint + replay of subsequent log files (see the sketch below)
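A minimal sketch of the log-plus-checkpoint scheme, assuming JSON records on local disk; for simple replay it logs (key, new value) records, and remote replication, the B-tree checkpoint format, and the background checkpoint thread are omitted.

```python
# Log + checkpoint sketch: durable appends, snapshots, and recovery by replay.
import json, os

def append_to_log(log_path, update):
    """update is a (key, new_value) pair; it is durable before we ACK."""
    with open(log_path, "a") as f:
        f.write(json.dumps(update) + "\n")
        f.flush()
        os.fsync(f.fileno())

def checkpoint(state, ckpt_path):
    with open(ckpt_path, "w") as f:
        json.dump(state, f)        # real GFS uses a compact B-tree-like form

def recover(ckpt_path, log_path):
    """Rebuild state = latest checkpoint + replay of subsequent log records."""
    with open(ckpt_path) as f:
        state = json.load(f)
    with open(log_path) as f:
        for line in f:
            key, value = json.loads(line)
            state[key] = value     # re-apply each logged metadata update
    return state
```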
Mutations
Mutation = a write or an append
Must be applied at all replicas
Goal: minimize master involvement
Lease mechanism:
the master picks one replica as primary and gives it a "lease" for mutations
the primary defines a serial order of mutations
all replicas follow this order
Data flow is decoupled from control flow (see the ordering sketch below)
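The essence of lease-based ordering, sketched in-process; the pipelined data flow along a chain of chunkservers and all failure handling are omitted, and the class names are illustrative.

```python
# Lease-based mutation ordering: the primary assigns serial numbers,
# all replicas apply mutations in that order.
class Replica:
    def __init__(self):
        self.mutations = []        # applied in serial-number order

    def apply(self, serial, data):
        self.mutations.append((serial, data))

class Primary(Replica):
    """The lease holder; it alone chooses the serial order."""
    def __init__(self, secondaries):
        super().__init__()
        self.secondaries = secondaries
        self.next_serial = 0

    def mutate(self, data):
        serial = self.next_serial
        self.next_serial += 1
        self.apply(serial, data)        # apply locally
        for r in self.secondaries:      # forward the order, not the choice
            r.apply(serial, data)
        return serial

# Two mutations end up in the same order everywhere.
s1, s2 = Replica(), Replica()
p = Primary([s1, s2])
p.mutate(b"A"); p.mutate(b"B")
assert s1.mutations == s2.mutations == p.mutations
```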
Data Mutations
A write causes data to be written at an application-specified file offset.
A record append causes data (the “record”) to be appended atomically at least once even in the presence of concurrent mutations.
Atomic record append
Client specifies data
GFS appends it to the file atomically, at least once
GFS picks the offset and returns it to the client
In contrast, a "regular" append is merely a write at an offset that the client believes to be the current end of file
Used heavily by Google apps, e.g., for files that serve as multiple-producer/single-consumer queues (see the sketch below)
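An illustration of why at-least-once appends produce occasional duplicates: the client retries whenever it is unsure the append was acknowledged. The failure model here is invented for the sketch, and a single in-memory list stands in for the replicated file tail.

```python
# At-least-once record append with client retries.
import random

chunk = []   # the shared file tail; GFS, not the client, picks the offset

def record_append(record, fail_prob=0.3):
    """On an unclear outcome the client retries, so the record can land twice."""
    while True:
        chunk.append(record)               # the append itself is atomic
        if random.random() > fail_prob:    # did the ACK reach the client?
            return len(chunk) - 1          # GFS returns the chosen offset
        # ACK lost: retry; the earlier copy stays behind as a duplicate

for i in range(5):
    record_append(f"rec-{i}")
print(chunk)   # every record appears at least once, possibly duplicated
```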
Consistency Model (1/3)
A file region is consistent if all clients will always see the same data, regardless of which replicas they read from.
A region is defined after a file data mutation if it is consistent and clients will see what the mutation writes in its entirety.
Consistency Model (2/3)
"Consistent" = all replicas have the same value
"Defined" = consistent, and the region reflects the mutation in its entirety
Some properties:
concurrent successful writes leave the region consistent, but possibly undefined
failed writes leave the region inconsistent
Some work has moved into the applications: e.g., self-validating, self-identifying records
Consistency Model (3/3)
Relaxed consistency
Concurrent changes are consistent but undefined
An append is atomically committed at least once
occasional duplications are possible; readers discard them (see the sketch below)
All changes to a chunk are applied in the same order at all replicas
Chunk version numbers detect missed updates (stale replicas)
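A sketch of how an application-side reader copes with this model using self-validating, self-identifying records; the record format (MD5 checksum + 64-bit record ID + payload) is invented for illustration.

```python
# Reader-side handling of padding/corruption and duplicates.
import hashlib

def make_record(record_id, payload: bytes) -> bytes:
    body = record_id.to_bytes(8, "big") + payload
    return hashlib.md5(body).digest() + body   # checksum makes it self-validating

def read_valid_records(records):
    """Skip corrupt records; drop duplicates by record ID."""
    seen, out = set(), []
    for rec in records:
        checksum, body = rec[:16], rec[16:]
        if hashlib.md5(body).digest() != checksum:
            continue                            # padding or corruption: skip
        rid = int.from_bytes(body[:8], "big")
        if rid not in seen:                     # duplicate from a retry: drop
            seen.add(rid)
            out.append(body[8:])
    return out

recs = [make_record(1, b"a"), make_record(1, b"a"), make_record(2, b"b")]
assert read_valid_records(recs) == [b"a", b"b"]
```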
System Interactions
The master grants a chunk lease to one replica (the primary)
The replica holding the lease determines the order of updates at all replicas
Lease:
60-second timeout
can be extended indefinitely
extension requests are piggybacked on heartbeat messages
after a lease expires, the master can grant a new one
Replica Placement
Goals:
maximize data reliability and availability
maximize network bandwidth utilization
Spread chunk replicas across machines and across racks
Give higher re-replication priority to chunks with fewer remaining replicas
Limit the resources spent on replication
Fault Tolerance and Diagnosis (1/2)
High availability:
fast recovery: master and chunkservers are restartable in a few seconds
chunk replication: 3 replicas by default
shadow masters provide read-only access when the primary master is down
Fault Tolerance and Diagnosis (2/2)
Data integrity:
each chunk is divided into 64-KB blocks
each block has its own checksum
checksums are verified at read and write time
background scans also verify rarely accessed data (see the sketch below)
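A sketch of per-block verification on the read path; CRC32 is used here as a stand-in checksum, and the 64-KB block size follows the slide.

```python
# Per-block checksums: compute on write, verify every block a read touches.
import zlib

BLOCK = 64 * 1024

def checksums(chunk_data: bytes):
    return [zlib.crc32(chunk_data[i:i + BLOCK])
            for i in range(0, len(chunk_data), BLOCK)]

def verified_read(chunk_data, stored_sums, offset, length):
    """Verify every block the read touches before returning any data."""
    first, last = offset // BLOCK, (offset + length - 1) // BLOCK
    for b in range(first, last + 1):
        if zlib.crc32(chunk_data[b * BLOCK:(b + 1) * BLOCK]) != stored_sums[b]:
            raise IOError(f"corrupt block {b}: report to master, re-replicate")
    return chunk_data[offset:offset + length]

data = b"\xab" * (3 * BLOCK)
sums = checksums(data)
assert verified_read(data, sums, 100, 200) == data[100:300]
```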
Deployment in Google
50+ GFS clusters
Each with thousands of storage nodes
Managing petabytes of data
GFS sits underneath BigTable, etc.
Conclusion
GFS demonstrates how to support large-scale processing workloads on commodity hardware:
design to tolerate frequent component failures
optimize for huge files that are mostly appended to, then read
feel free to relax and extend the FS interface as required
go for simple solutions (e.g., single master)
GFS has met Google’s storage needs… it must be good!
Thank you
Course Evaluation
Goal: systems understanding + presentation skills
Describe a system in 4 pages
Describe the system with 4 figures (half a page each, drawn in Microsoft Visio)
Describe the figures in the most concise text possible
Work in groups of 2
Your submission must not be identical to anything I have already seen
Otherwise the score will not exceed 75