11: Google Filesystem
Zubair Nabi
[email protected]
April 20, 2013
Outline
1 Introduction
2 Google Filesystem
3 Hadoop Distributed Filesystem
Filesystem
The purpose of a filesystem is to:
1 Organize and store data
2 Support sharing of data among users and applications
3 Ensure persistence of data after a reboot
Examples include FAT, NTFS, ext3, ext4, etc.
Distributed filesystem
Self-explanatory: the filesystem is distributed across many machines
The DFS provides a common abstraction over the dispersed files
Each DFS exposes an API that provides clients with the normal file operations, such as create, read, write, etc.
Maintains a namespace which maps logical names to physical names
  Simplifies replication and migration
Examples include the Network Filesystem (NFS), Andrew Filesystem (AFS), etc.
Outline
1 Introduction
2 Google Filesystem
3 Hadoop Distributed Filesystem
Introduction
Designed by Google to meet its massive storage needs
Shares many goals with previous distributed filesystems, such as performance, scalability, reliability, and availability
At the same time, the design is driven by key observations of Google's workloads and infrastructure, both current and future
Design Goals
1 Failure is the norm rather than the exception: GFS must constantly introspect and automatically recover from failure
2 The system stores a fair number of large files: Optimize for large files, on the order of GBs, but still support small files
3 Applications prefer to do large streaming reads of contiguous regions: Optimize for this case
Design Goals (2)
4 Most applications perform large, sequential writes that are mostly append operations: Support small writes but do not optimize for them
5 Many workloads are producer-consumer queues or many-way merging: Support concurrent reads and writes by hundreds of clients simultaneously
6 Applications process data in bulk at a high rate: Favour throughput over latency
Interface
The interface is similar to traditional filesystems, but there is no support for a standard POSIX-like API
Files are organized hierarchically into directories and identified by pathnames
Support for create, delete, open, close, read, and write operations
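To make the interface concrete, here is a minimal Java sketch of what a GFS-style client API could look like; the interface name and signatures are invented for illustration (the slide lists only the operation names, and the real client library is not public).

    // Hypothetical sketch of a GFS-style client API. Names and signatures are
    // assumptions for exposition only.
    public interface GfsClient {
        void create(String path);                        // create an empty file at a pathname
        void delete(String path);                        // remove the file
        long open(String path);                          // returns an opaque file handle
        void close(long handle);
        int read(long handle, long offset, byte[] buf);  // read into buf starting at a byte offset
        void write(long handle, long offset, byte[] data);
    }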
Architecture
Consists of a single master and multiple chunkservers
The system can be accessed by multiple clients
Both the master and chunkservers run as user-space server processes on commodity Linux machines
Files
Files are sliced into fixed-size chunks
Each chunk is identified by an immutable and globally unique 64-bit handle
Chunks are stored by chunkservers as local Linux files
Reads and writes to a chunk are specified by a handle and a byte range
Each chunk is replicated on multiple chunkservers
  3 by default
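A minimal Java sketch of how a read could be addressed under this scheme: a 64-bit chunk handle plus a byte range. The type and field names are assumptions, not the real GFS wire format.

    public record ChunkReadRequest(long chunkHandle, long offset, int length) {

        public ChunkReadRequest {
            // Basic sanity checks on the byte range.
            if (offset < 0 || length <= 0) {
                throw new IllegalArgumentException("offset must be >= 0 and length > 0");
            }
        }

        // Example: read 4 KB starting at byte offset 1 MB of chunk 0x1234.
        public static ChunkReadRequest example() {
            return new ChunkReadRequest(0x1234L, 1L << 20, 4096);
        }
    }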
Master
In charge of all filesystem metadata
  Namespace, access control information, mapping between files and chunks, and current locations of chunks
  Holds this information in memory and regularly syncs it with a log file
Also in charge of chunk leasing, garbage collection, and chunk migration
Periodically sends each chunkserver a heartbeat signal to check its state and send it instructions
Clients interact with it to access metadata, but all data-bearing communication goes directly to the relevant chunkservers
  As a result, the master does not become a performance bottleneck
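A rough Java sketch of the metadata the master holds in memory, as listed above; the class and map layout are assumptions for illustration (a real master would also persist mutations to the operation log).

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class MasterMetadata {
        // Namespace: full pathname -> ordered chunk handles making up the file.
        private final Map<String, List<Long>> fileToChunks = new HashMap<>();
        // Chunk handle -> chunkservers currently holding a replica of that chunk.
        private final Map<Long, List<String>> chunkLocations = new HashMap<>();

        public List<Long> chunksOf(String path) {
            return fileToChunks.getOrDefault(path, List.of());
        }

        public List<String> locationsOf(long chunkHandle) {
            return chunkLocations.getOrDefault(chunkHandle, List.of());
        }
    }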
Consistency Model: Master
All namespace mutations (such as file creation) are atomic, as they are exclusively handled by the master
Namespace locking guarantees atomicity and correctness
The operation log maintained by the master defines a global total order of these operations
Consistency Model: Data
The state of a file region after a mutation depends on:
  Mutation type: write or append
  Whether it succeeds or fails
  Whether there are other concurrent mutations
A file region is consistent if all clients see the same data, regardless of the replica
A region is defined after a mutation if it is still consistent and clients see the mutation in its entirety
Consistency Model: Data (2)
If there are no other concurrent writers, the region is defined and consistent
Concurrent and successful mutations leave the region undefined but consistent
  Mingled fragments from multiple mutations
A failed mutation makes the region both inconsistent and undefined
Mutation Operations
Each chunk has many replicas
The primary replica holds a lease from the master
It decides the order of all mutations for all replicas
Write Operation
The client obtains the locations of the replicas and the identity of the primary replica from the master
It then pushes the data to all replica nodes
The client issues an update request to the primary
The primary forwards the write request to all replicas
It waits for a reply from all replicas before returning to the client
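The flow above can be summarised in a short Java sketch; the interfaces and method names are invented for illustration and only mirror the steps listed, not the actual GFS protocol.

    import java.util.List;

    public class WriteFlowSketch {

        interface Master  { ReplicaSet lookup(long chunkHandle); }
        interface Replica { void pushData(byte[] data); boolean apply(long mutationId); }
        record ReplicaSet(Replica primary, List<Replica> secondaries) {}

        // Returns true only once the primary reports that every replica applied the write.
        static boolean write(Master master, long chunkHandle, byte[] data, long mutationId) {
            ReplicaSet rs = master.lookup(chunkHandle);   // 1. replica locations + primary from master
            rs.primary().pushData(data);                  // 2. push data to all replicas
            for (Replica r : rs.secondaries()) {
                r.pushData(data);
            }
            return rs.primary().apply(mutationId);        // 3-5. primary orders the mutation, forwards
                                                          //      it, and waits for all replicas to ack
        }
    }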
Record Append Operation
Performed atomically
Append location is chosen by GFS and communicated to the client
The primary forwards the write request to all replicas
It waits for a reply from all replicas before returning to the client
1 If the record fits in the current chunk, it is written and communicated to the client
2 If it does not, the chunk is padded and the client is told to try the next chunk
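A small Java sketch of the primary's fit-or-pad decision described above. The 64 MB chunk size is taken from the GFS paper, and the result types are assumptions for illustration.

    public class RecordAppendSketch {
        static final long CHUNK_SIZE = 64L * 1024 * 1024;  // 64 MB chunks (per the GFS paper)

        sealed interface AppendResult permits Appended, RetryNextChunk {}
        record Appended(long offset) implements AppendResult {}
        record RetryNextChunk() implements AppendResult {}

        // currentEnd: bytes already used in the chunk; recordLen: size of the incoming record.
        static AppendResult tryAppend(long currentEnd, int recordLen) {
            if (currentEnd + recordLen <= CHUNK_SIZE) {
                return new Appended(currentEnd);  // record fits: written at the offset GFS chose
            }
            // Otherwise the remainder of the chunk is padded and the client retries on the next chunk.
            return new RetryNextChunk();
        }
    }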
Application Safeguards
Use record append rather than write
Insert checksums in record headers to detect fragments
Insert sequence numbers to detect duplicates
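A Java sketch of such a self-describing record; the layout is an assumption, but it shows how a per-record checksum catches padding and fragments while a sequence number exposes duplicates.

    import java.util.zip.CRC32;

    public record SafeRecord(long sequenceNumber, long checksum, byte[] payload) {

        static SafeRecord of(long sequenceNumber, byte[] payload) {
            return new SafeRecord(sequenceNumber, crc(payload), payload);
        }

        boolean isIntact() {
            return checksum == crc(payload);  // fails on padding or mingled fragments
        }

        private static long crc(byte[] data) {
            CRC32 crc = new CRC32();
            crc.update(data);
            return crc.getValue();
        }
    }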
Chunk Placement
Put new chunks on chunkservers with below-average disk space usage
Limit the number of “recent” creations on a chunkserver, to ensure that it does not experience a traffic spike due to its fresh data
For reliability, spread replicas across racks
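A Java sketch of these placement heuristics; the fields, the recent-creation cap, and the rack-spreading strategy are assumptions for illustration, not the master's actual policy.

    import java.util.ArrayList;
    import java.util.Comparator;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    public class PlacementSketch {
        record ChunkServer(String host, String rack, double diskUsage, int recentCreations) {}

        static final int MAX_RECENT_CREATIONS = 10;  // assumed cap on "recent" creations

        static List<ChunkServer> pickReplicas(List<ChunkServer> servers, int replicas) {
            double avgUsage = servers.stream().mapToDouble(ChunkServer::diskUsage).average().orElse(1.0);
            List<ChunkServer> candidates = servers.stream()
                    .filter(s -> s.diskUsage() < avgUsage)                    // below-average disk usage
                    .filter(s -> s.recentCreations() < MAX_RECENT_CREATIONS)  // avoid creation hotspots
                    .sorted(Comparator.comparingDouble(ChunkServer::diskUsage))
                    .toList();

            List<ChunkServer> chosen = new ArrayList<>();
            Set<String> racksUsed = new HashSet<>();
            for (ChunkServer s : candidates) {  // first pass: at most one replica per rack
                if (chosen.size() == replicas) break;
                if (racksUsed.add(s.rack())) chosen.add(s);
            }
            for (ChunkServer s : candidates) {  // second pass: fill up if there are not enough racks
                if (chosen.size() == replicas) break;
                if (!chosen.contains(s)) chosen.add(s);
            }
            return chosen;
        }
    }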
Garbage Collection
Chunks become garbage when they are orphaned
A lazy reclamation strategy is used: chunks are not reclaimed at delete time
Each chunkserver reports the subset of chunks it currently holds to the master in the heartbeat message
The master pinpoints chunks which have been orphaned
The chunkserver finally reclaims that space
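A minimal Java sketch of the heartbeat handshake above, assuming the master keeps a set of live chunk handles: anything a chunkserver reports that the master no longer knows about is orphaned and can be reclaimed.

    import java.util.HashSet;
    import java.util.Set;

    public class GarbageCollectionSketch {

        // Chunk handles the master still has metadata for.
        private final Set<Long> liveChunks = new HashSet<>();

        // Called while processing a chunkserver heartbeat: returns the orphaned handles.
        public Set<Long> findOrphans(Set<Long> reportedByChunkserver) {
            Set<Long> orphans = new HashSet<>(reportedByChunkserver);
            orphans.removeAll(liveChunks);  // unknown to the master => garbage
            return orphans;
        }
    }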
Stale Replica Detection
Each chunk is assigned a version number
Each time a new lease is granted, the version number is incremented
Stale replicas will have outdated version numbers
They are simply garbage collected
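In Java, the version-number check above boils down to a comparison; this sketch (illustrative only) bumps the version on each lease grant and flags any replica reporting an older one.

    public class StaleReplicaSketch {
        private long masterVersion = 0;

        public long grantLease() {
            return ++masterVersion;  // granting a new lease increments the chunk version
        }

        public boolean isStale(long replicaVersion) {
            return replicaVersion < masterVersion;  // outdated version => stale replica
        }
    }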
Outline
1 Introduction
2 Google Filesystem
3 Hadoop Distributed Filesystem
Introduction
Open-source clone of GFS
Comes packaged with Hadoop
The master is called the NameNode and chunkservers are called DataNodes
Chunks are known as blocks
Exposes a Java API and a command-line interface
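As a taste of the Java API, the snippet below writes a small file and reads it back through org.apache.hadoop.fs.FileSystem; the path is illustrative and the cluster configuration is assumed to be on the classpath.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();         // reads core-site.xml / hdfs-site.xml
            FileSystem fs = FileSystem.get(conf);

            Path file = new Path("/user/demo/hello.txt");     // illustrative path
            try (FSDataOutputStream out = fs.create(file)) {  // create a new file
                out.writeUTF("Hello, HDFS");
            }

            try (FSDataInputStream in = fs.open(file)) {      // streaming read
                System.out.println(in.readUTF());
            }

            fs.delete(file, false);                           // non-recursive delete
            fs.close();
        }
    }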
Command-line API
Accessible through: bin/hdfs dfs -command args
Useful commands: cat, copyFromLocal, copyToLocal, cp, ls, mkdir, moveFromLocal, moveToLocal, mv, rm, etc. (see http://hadoop.apache.org/docs/r1.0.4/file_system_shell.html)
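For example, a typical session using a few of the commands above might look like this (the paths are illustrative):

    bin/hdfs dfs -mkdir /user/demo
    bin/hdfs dfs -copyFromLocal input.txt /user/demo/input.txt
    bin/hdfs dfs -ls /user/demo
    bin/hdfs dfs -cat /user/demo/input.txt
    bin/hdfs dfs -rm /user/demo/input.txt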
References
1 Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. 2003. The Google File System. In Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles (SOSP ’03). ACM, New York, NY, USA, 29-43.