Page 1: Google File System

Google File System

Amir H. Payberah [email protected]

Amirkabir University of Technology (Tehran Polytechnic)

Amir H. Payberah (Tehran Polytechnic) GFS 1393/7/14 1 / 63

Page 2: Google File System

What is the Problem?

Amir H. Payberah (Tehran Polytechnic) GFS 1393/7/14 2 / 63

Page 3: Google File System

What is the Problem?

I Crawl the whole web.

I Store it all on one big disk.

I Process users’ searches on one big CPU.

I Does not scale.

Amir H. Payberah (Tehran Polytechnic) GFS 1393/7/14 3 / 63

Page 5: Google File System

Motivation and Assumptions (1/3)

I Lots of cheap PCs, each with disk and CPU.
  • How to share data among the PCs?

Amir H. Payberah (Tehran Polytechnic) GFS 1393/7/14 4 / 63

Page 6: Google File System

Motivation and Assumptions (2/3)

I 100s to 1000s of PCs in a cluster.
  • Each PC can fail.
  • Monitoring, fault tolerance, and auto-recovery are essential.

Amir H. Payberah (Tehran Polytechnic) GFS 1393/7/14 5 / 63

Page 7: Google File System

Motivation and Assumptions (3/3)

I Large files: ≥ 100 MB in size.

I Large streaming reads and small random reads.

I Append to files rather than overwrite.

Amir H. Payberah (Tehran Polytechnic) GFS 1393/7/14 6 / 63

Page 10: Google File System

Reminder

Amir H. Payberah (Tehran Polytechnic) GFS 1393/7/14 7 / 63

Page 11: Google File System

What is a Filesystem?

I Controls how data is stored in and retrieved from disk.

Amir H. Payberah (Tehran Polytechnic) GFS 1393/7/14 8 / 63

Page 13: Google File System

Distributed Filesystems

I When data outgrows the storage capacity of a single machine: partition it across a number of separate machines.

I Distributed filesystems: manage the storage across a network of machines.

Amir H. Payberah (Tehran Polytechnic) GFS 1393/7/14 9 / 63

Page 14: Google File System

Google File System (GFS)

Amir H. Payberah (Tehran Polytechnic) GFS 1393/7/14 10 / 63

Page 15: Google File System

GFS

I Appears as a single disk.

I Runs on top of a native filesystem.

I Fault tolerant: can handle disk crashes, machine crashes, ...

I Hadoop Distributed File System (HDFS) is an open-source Java product similar to GFS.

Amir H. Payberah (Tehran Polytechnic) GFS 1393/7/14 11 / 63

Page 16: Google File System

GFS is Good for ...

I Storing large files
  • Terabytes, petabytes, etc.
  • 100MB or more per file.

I Streaming data access
  • Data is written once and read many times.
  • Optimized for batch reads rather than random reads.

I Cheap commodity hardware
  • No need for super-computers; use less reliable commodity hardware.

Amir H. Payberah (Tehran Polytechnic) GFS 1393/7/14 12 / 63

Page 19: Google File System

GFS is Not Good for ...

I Low-latency reads
  • High throughput rather than low latency for small chunks of data.

I Large amounts of small files
  • Better for millions of large files instead of billions of small files.

I Multiple writers
  • Single writer per file.
  • Writes only at the end of a file; no support for arbitrary offsets.

Amir H. Payberah (Tehran Polytechnic) GFS 1393/7/14 13 / 63

Page 22: Google File System

Files and Chunks (1/2)

I Files are split into chunks.

I Chunks
  • Single unit of storage: a contiguous piece of information on a disk.
  • Transparent to the user.
  • Chunks are traditionally either 64MB or 128MB: the default is 64MB.

Amir H. Payberah (Tehran Polytechnic) GFS 1393/7/14 14 / 63

Page 23: Google File System

Files and Chunks (2/2)

I Why is a chunk in GFS so large?

• To minimize the cost of seeks.

I Time to read a chunk = seek time + transfer time

I Keeping the ratio seek time / transfer time small: we are reading data from the disk almost as fast as the physical limit imposed by the disk.

I Example: if seek time is 10ms and the transfer rate is 100MB/s, to make the seek time 1% of the transfer time, we need to make the chunk size around 100MB.

Amir H. Payberah (Tehran Polytechnic) GFS 1393/7/14 15 / 63
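The slide's arithmetic can be checked in a few lines. This is a worked example using the slide's own numbers, not GFS code:

```python
# Slide's numbers: 10 ms seek time, 100 MB/s transfer rate.
SEEK_TIME = 0.010              # seconds
TRANSFER_RATE = 100 * 10**6    # bytes per second

def min_chunk_size(seek_time, transfer_rate, seek_fraction=0.01):
    """Smallest chunk size for which seek time is at most `seek_fraction`
    of transfer time: seek_time <= seek_fraction * (size / transfer_rate)."""
    return seek_time * transfer_rate / seek_fraction

print(min_chunk_size(SEEK_TIME, TRANSFER_RATE) / 10**6)  # 100.0 (MB)
```

Doubling the seek fraction budget to 2% halves the required chunk size, which is why GFS can afford a 64MB default.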

Page 26: Google File System

GFS Architecture

I Main components:
  • GFS master
  • GFS chunk server
  • GFS client

Amir H. Payberah (Tehran Polytechnic) GFS 1393/7/14 16 / 63

Page 27: Google File System

GFS Master

I Manages file namespace operations.

I Manages file metadata (holds all metadata in memory).
  • Access control information
  • Mapping from files to chunks
  • Locations of chunks

I Manages chunks in chunk servers.
  • Creation/deletion
  • Placement
  • Load balancing
  • Maintains replication
  • Garbage collection

Amir H. Payberah (Tehran Polytechnic) GFS 1393/7/14 17 / 63
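A toy model of the metadata the master holds in memory — the file-to-chunk mapping and the chunk replica locations. The class and field names are illustrative, not Google's code:

```python
class ToyMaster:
    """In-memory metadata only, as on the slide: file-to-chunk mapping
    plus chunk replica locations."""

    def __init__(self):
        self.file_chunks = {}      # file path -> ordered list of chunk handles
        self.chunk_locations = {}  # chunk handle -> set of chunk server ids

    def add_chunk(self, path, handle, servers):
        self.file_chunks.setdefault(path, []).append(handle)
        self.chunk_locations[handle] = set(servers)

    def lookup(self, path, chunk_index):
        """What a client asks for: the chunk handle and its replica locations."""
        handle = self.file_chunks[path][chunk_index]
        return handle, self.chunk_locations[handle]

master = ToyMaster()
master.add_chunk("/logs/web.0", "h1", ["cs1", "cs2", "cs3"])
handle, where = master.lookup("/logs/web.0", 0)
print(handle, sorted(where))  # h1 ['cs1', 'cs2', 'cs3']
```

Keeping everything in memory is what makes master operations fast, and it is also why GFS favors millions of large files over billions of small ones.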

Page 30: Google File System

GFS Chunk Server

I Manages chunks.

I Tells the master what chunks it has.

I Stores chunks as files.

I Maintains data consistency of chunks.

Amir H. Payberah (Tehran Polytechnic) GFS 1393/7/14 18 / 63

Page 34: Google File System

GFS Client

I Issues control (metadata) requests to master server.

I Issues data requests directly to chunk servers.

I Caches metadata.

I Does not cache data.

Amir H. Payberah (Tehran Polytechnic) GFS 1393/7/14 19 / 63

Page 38: Google File System

The Master Operations

Amir H. Payberah (Tehran Polytechnic) GFS 1393/7/14 20 / 63

Page 39: Google File System

The Master Operations

I Namespace management and locking

I Replica placement

I Creating, re-replicating and re-balancing replicas

I Garbage collection

I Stale replica detection

Amir H. Payberah (Tehran Polytechnic) GFS 1393/7/14 21 / 63

Page 40: Google File System

Namespace Management and Locking

I No per-directory data structure.

I No hard or symbolic links.

I Represents its namespace as a lookup table mapping full pathnames to metadata.

I Each master operation acquires a set of locks before it runs.

I Read lock on internal nodes and read/write lock on the leaf.

I This allows concurrent mutations in the same directory.

I A read lock on a directory prevents its deletion, renaming, or snapshotting.

Amir H. Payberah (Tehran Polytechnic) GFS 1393/7/14 22 / 63
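The locking scheme above can be sketched as a function computing the lock set for one operation. This is a toy model; the tuple representation is an assumption, not the paper's data structure:

```python
def lock_set(path, write=True):
    """Locks a master operation on `path` acquires: read locks on every
    ancestor pathname, plus a read or write lock on the full pathname."""
    parts = path.strip("/").split("/")
    ancestors = ["/" + "/".join(parts[:i]) for i in range(1, len(parts))]
    locks = [(p, "read") for p in ancestors]
    locks.append((path, "write" if write else "read"))
    return locks

# Creating /home/user/file write-locks only the leaf pathname; /home and
# /home/user get read locks, so other files in /home/user can be mutated
# concurrently, while the read locks block deleting or renaming the dirs.
print(lock_set("/home/user/file"))
```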

Page 44: Google File System

Replica Placement

I Maximize data reliability, availability and bandwidth utilization.

I Replicas are spread across machines and racks, for example:
  • 1st replica on the local rack.
  • 2nd replica on the local rack but a different machine.
  • 3rd replica on a different rack.

I The master determines replica placement.

Amir H. Payberah (Tehran Polytechnic) GFS 1393/7/14 23 / 63
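The slide's example policy can be written down as a small function. This is only a sketch of the 3-replica example; real placement also weighs disk usage and load, as the next slide notes:

```python
def place_replicas(racks, local_rack):
    """Slide's example policy: 1st replica on the local rack, 2nd on a
    different machine in the same rack, 3rd on a machine in another rack.
    `racks` maps rack id -> list of chunk server ids."""
    same_rack = racks[local_rack]
    first, second = same_rack[0], same_rack[1]
    other_rack = next(r for r in racks if r != local_rack)
    third = racks[other_rack][0]
    return [first, second, third]

racks = {"rack1": ["cs1", "cs2"], "rack2": ["cs3", "cs4"]}
print(place_replicas(racks, "rack1"))  # ['cs1', 'cs2', 'cs3']
```

Two replicas on the local rack keep write bandwidth cheap; the third rack protects against a whole-rack failure.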

Page 45: Google File System

Creation, Re-replication and Re-balancing

I Creation
  • Place new replicas on chunk servers with below-average disk usage.
  • Limit the number of recent creations on each chunk server.

I Re-replication
  • When the number of available replicas falls below a user-specified goal.

I Re-balancing
  • Periodically, for better disk utilization and load balancing.
  • The distribution of replicas is analyzed.

Amir H. Payberah (Tehran Polytechnic) GFS 1393/7/14 24 / 63

Page 48: Google File System

Garbage Collection

I File deletion logged by master.

I File renamed to a hidden name with deletion timestamp.

I The master regularly deletes hidden files older than 3 days (configurable).

I Until then, hidden file can be read and undeleted.

I When a hidden file is removed, its in-memory metadata is erased.

Amir H. Payberah (Tehran Polytechnic) GFS 1393/7/14 25 / 63
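The rename-then-sweep scheme can be sketched like this. It is a toy model; the hidden-name format is an assumption for illustration:

```python
HIDDEN_TTL = 3 * 24 * 3600   # 3 days, configurable per the slide

def delete(namespace, path, now):
    """Deletion is just a rename to a hidden name carrying a timestamp."""
    hidden = f".deleted{path}@{now}"
    namespace[hidden] = namespace.pop(path)
    return hidden

def sweep(namespace, now):
    """The master's regular scan: drop hidden files older than the TTL."""
    for name in list(namespace):
        if name.startswith(".deleted"):
            ts = int(name.rsplit("@", 1)[1])
            if now - ts > HIDDEN_TTL:
                del namespace[name]   # metadata erased; chunks reclaimable

ns = {"/a": "chunk-list-of-a"}
hidden = delete(ns, "/a", now=0)
sweep(ns, now=24 * 3600)       # 1 day later: still readable and undeletable
assert hidden in ns
sweep(ns, now=4 * 24 * 3600)   # 4 days later: gone
assert ns == {}
```

Making deletion lazy turns it into a cheap metadata rename, and the sweep batches the real reclamation work.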

Page 51: Google File System

Stale Replica Detection

I Chunk replicas may become stale: if a chunk server fails and misses mutations to the chunk while it is down.

I Need to distinguish between up-to-date and stale replicas.

I Chunk version number:
  • Increased when the master grants a new lease on the chunk.
  • Not increased if a replica is unavailable.

I Stale replicas deleted by master in regular garbage collection.

Amir H. Payberah (Tehran Polytechnic) GFS 1393/7/14 26 / 63
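A minimal illustration of detecting stale replicas by version number (toy data; the server names are made up):

```python
def stale_replicas(master_version, replica_versions):
    """Replicas whose recorded chunk version lags the master's are stale
    and are removed in the master's regular garbage collection."""
    return [cs for cs, v in replica_versions.items() if v < master_version]

# cs2 was down when a new lease was granted, so it missed the version bump.
print(stale_replicas(3, {"cs1": 3, "cs2": 2, "cs3": 3}))  # ['cs2']
```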

Page 55: Google File System

System Interactions

Amir H. Payberah (Tehran Polytechnic) GFS 1393/7/14 27 / 63

Page 56: Google File System

GFS API

I Not POSIX compliant
  • Supports only popular FS operations, and semantics are different.

I API:
  • Read operation: read
  • Update operations: write and append
  • Delete operation

Amir H. Payberah (Tehran Polytechnic) GFS 1393/7/14 28 / 63

Page 57: Google File System

Read Operation (1/2)

I 1. Application originates the read request.

I 2. GFS client translates request and sends it to the master.

I 3. The master responds with chunk handle and replica locations.

Amir H. Payberah (Tehran Polytechnic) GFS 1393/7/14 29 / 63

Page 60: Google File System

Read Operation (2/2)

I 4. The client picks a location and sends the request.

I 5. The chunk server sends requested data to the client.

I 6. The client forwards the data to the application.

Amir H. Payberah (Tehran Polytechnic) GFS 1393/7/14 30 / 63
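Putting steps 1-6 together, a toy end-to-end read might look like the following. The data structures and the replica-choice rule are illustrative assumptions, not the real protocol:

```python
CHUNK_SIZE = 64 * 2**20   # default 64 MB chunks

def gfs_read(master, chunk_servers, path, offset):
    """Steps 2-6: translate (path, offset) to a chunk index, ask the
    master for the handle and locations, then read from one replica."""
    index = offset // CHUNK_SIZE                 # client-side translation
    handle, locations = master[(path, index)]    # master's reply (steps 2-3)
    replica = sorted(locations)[0]               # step 4: pick a location
    return chunk_servers[replica][handle]        # steps 5-6: fetch data

master = {("/f", 0): ("h1", {"cs1", "cs2", "cs3"})}
chunk_servers = {cs: {"h1": b"hello"} for cs in ("cs1", "cs2", "cs3")}
print(gfs_read(master, chunk_servers, "/f", 0))  # b'hello'
```

Note that the master only answers the metadata question; the bytes never pass through it, which keeps it off the data path.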

Page 63: Google File System

Update Order (1/2)

I Update (mutation): an operation that changes the contents or metadata of a chunk.

I For consistency, updates to each chunk must be ordered in the same way at the different chunk replicas.

I Consistency means that replicas will end up with the same version of the data and not diverge.

Amir H. Payberah (Tehran Polytechnic) GFS 1393/7/14 31 / 63

Page 66: Google File System

Update Order (2/2)

I For this reason, for each chunk, one replica is designated as the primary.

I The other replicas are designated as secondaries.

I The primary defines the update order.

I All secondaries follow this order.

Amir H. Payberah (Tehran Polytechnic) GFS 1393/7/14 32 / 63

Page 70: Google File System

Primary Leases (1/2)

I For correctness, at any time, there needs to be a single primary for each chunk.

I At any time, at most one server is primary for each chunk.

I The master selects a chunk server and grants it a lease for a chunk.

Amir H. Payberah (Tehran Polytechnic) GFS 1393/7/14 33 / 63

Page 73: Google File System

Primary Leases (2/2)

I The chunk server holds the lease for a period T after it gets it, and behaves as the primary during this period.

I The chunk server can refresh the lease endlessly, but if it cannot successfully refresh the lease from the master, it stops being the primary.

I If the master does not hear from the primary chunk server for a period, it gives the lease to someone else.

Amir H. Payberah (Tehran Polytechnic) GFS 1393/7/14 34 / 63
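The lease lifecycle can be sketched as follows. This is a toy model, and the 60-second period is an assumption for illustration — the slide only calls it T:

```python
LEASE_T = 60   # seconds; illustrative value for the slide's period T

class Lease:
    def __init__(self, granted_at):
        self.expires = granted_at + LEASE_T

    def refresh(self, now):
        """Extends the lease, but only while it is still valid."""
        if now < self.expires:
            self.expires = now + LEASE_T
            return True
        return False   # could not refresh in time: no longer primary

    def is_primary(self, now):
        return now < self.expires

lease = Lease(granted_at=0)
assert lease.refresh(now=50)        # refreshed before expiry
assert lease.is_primary(now=100)    # still primary (now expires at 110)
assert not lease.refresh(now=200)   # silent too long; master may re-grant
```

Because the old primary stops acting on its own expired clock while the master waits out the same period, the two sides never overlap and the at-most-one-primary property holds.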

Page 76: Google File System

Write Operation (1/3)

I 1. Application originates the request.

I 2. The GFS client translates request and sends it to the master.

I 3. The master responds with chunk handle and replica locations.

Amir H. Payberah (Tehran Polytechnic) GFS 1393/7/14 35 / 63

Page 79: Google File System

Write Operation (2/3)

I 4. The client pushes write data to all locations. Data is stored in the chunk servers' internal buffers.

Amir H. Payberah (Tehran Polytechnic) GFS 1393/7/14 36 / 63

Page 80: Google File System

Write Operation (3/3)

I 5. The client sends write command to the primary.

I 6. The primary determines the serial order for the data instances in its buffer and writes the instances in that order to the chunk.

I 7. The primary sends the serial order to the secondaries and tells them to perform the write.

Amir H. Payberah (Tehran Polytechnic) GFS 1393/7/14 37 / 63
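Steps 6-7 above can be sketched as a toy primary-side routine. The `seq` tags and list-based replicas are illustrative assumptions, not the paper's wire format:

```python
def apply_writes(primary, secondaries, buffered):
    """The primary picks one serial order for its buffered data instances,
    applies it locally, then tells every secondary to apply the same order."""
    order = sorted(buffered, key=lambda m: m["seq"])   # primary's decision
    data = [m["data"] for m in order]
    primary.extend(data)
    for s in secondaries:
        s.extend(data)            # identical order at every replica
    return data

p, s1, s2 = [], [], []
apply_writes(p, [s1, s2], [{"seq": 2, "data": "B"}, {"seq": 1, "data": "A"}])
print(p, s1, s2)  # ['A', 'B'] ['A', 'B'] ['A', 'B']
```

The point is that the order is chosen once, by the primary, rather than negotiated among replicas.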

Page 83: Google File System

Write Consistency

I The primary enforces one update order across all replicas for concurrent writes.

I It also waits until a write finishes at the other replicas before it replies.

I Therefore:
  • We will have identical replicas.
  • But a file region may end up containing mingled fragments from different clients: e.g., writes to different chunks may be ordered differently by their different primary chunk servers.
  • Thus, writes are consistent but leave the file region in an undefined state in GFS.

Amir H. Payberah (Tehran Polytechnic) GFS 1393/7/14 38 / 63

Page 85: Google File System

Record Append Operation (1/3)

I Operations that append data to a file.
  • Same as write, but with no offset (GFS chooses the offset).

I An important operation at Google:
  • Merging results from multiple machines in one file.
  • Using a file as a producer-consumer queue.

Amir H. Payberah (Tehran Polytechnic) GFS 1393/7/14 39 / 63

Page 86: Google File System

Record Append Operation (2/3)

I 1. Application originates record append request.

I 2. The client translates request and sends it to the master.

I 3. The master responds with chunk handle and replica locations.

I 4. The client pushes write data to all locations.

Amir H. Payberah (Tehran Polytechnic) GFS 1393/7/14 40 / 63

Page 90: Google File System

Record Append Operation (3/3)

I 5. The primary checks if the record fits in the specified chunk.

I 6. If the record does not fit, then the primary:
  • Pads the chunk,
  • Tells the secondaries to do the same,
  • And informs the client.
  • The client then retries the append with the next chunk.

I 7. If the record fits, then the primary:
  • Appends the record,
  • Tells the secondaries to do the same,
  • Receives responses from the secondaries,
  • And sends the final response to the client.

Amir H. Payberah (Tehran Polytechnic) GFS 1393/7/14 41 / 63
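Steps 5-7 can be condensed into a toy primary-side routine. Padding with `None` and counting bytes by list length are illustrative simplifications:

```python
CHUNK_SIZE = 64   # unrealistically small, just to show the padding case

def record_append(chunks, record):
    """If the record does not fit in the last chunk, pad that chunk and
    start a new one; then append and return the offset GFS chose."""
    last = chunks[-1]
    if len(last) + len(record) > CHUNK_SIZE:
        last.extend([None] * (CHUNK_SIZE - len(last)))   # pad (step 6)
        chunks.append([])                                # retry on next chunk
        last = chunks[-1]
    offset = (len(chunks) - 1) * CHUNK_SIZE + len(last)
    last.extend(record)                                  # append (step 7)
    return offset

chunks = [[0] * 60]                      # 60 of 64 slots already used
print(record_append(chunks, [1] * 6))    # 64: record landed in a new chunk
print(record_append(chunks, [2] * 3))    # 70: appended right after it
```

Returning the chosen offset is what lets many clients append concurrently without coordinating offsets among themselves.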

Page 93: Google File System

Delete Operation

I A metadata operation.

I Renames the file to a special hidden name.

I After a certain time, deletes the actual chunks.

I Supports undelete for a limited time.

I Actual deletion happens through lazy garbage collection.

Amir H. Payberah (Tehran Polytechnic) GFS 1393/7/14 42 / 63
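
The lazy-deletion scheme can be modeled with a small Python sketch (the Namespace class, hidden-name format, and grace period are hypothetical illustrations of the idea, not the master's real data structures):

```python
import time

GRACE_PERIOD = 3 * 24 * 3600  # e.g. three days; the real interval is configurable

class Namespace:
    """Toy model of lazy deletion: delete = rename, reclaim = garbage collect."""
    def __init__(self):
        self.files = {}       # path -> list of chunk handles
        self.deleted_at = {}  # hidden name -> deletion timestamp

    def delete(self, path):
        # Metadata-only operation: rename the file to a hidden name.
        hidden = f".deleted.{path}"
        self.files[hidden] = self.files.pop(path)
        self.deleted_at[hidden] = time.time()
        return hidden

    def undelete(self, hidden, path):
        # Possible until garbage collection removes the hidden file.
        self.files[path] = self.files.pop(hidden)
        del self.deleted_at[hidden]

    def garbage_collect(self, now):
        # Periodically drop hidden files older than the grace period;
        # only here are the actual chunks reclaimed.
        for hidden, ts in list(self.deleted_at.items()):
            if now - ts > GRACE_PERIOD:
                del self.files[hidden]
                del self.deleted_at[hidden]
```

Deletion itself is cheap and instant; the expensive reclamation of chunk storage is deferred to a background pass.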

Page 98: Google File System

Fault Tolerance

Amir H. Payberah (Tehran Polytechnic) GFS 1393/7/14 43 / 63

Page 99: Google File System

Fault Tolerance for Chunks

I Chunk replication (re-replication and re-balancing)

I Data integrity
• A checksum for each 64KB block of every chunk.
• The checksum is checked every time an application reads the data.

Amir H. Payberah (Tehran Polytechnic) GFS 1393/7/14 44 / 63
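
A minimal sketch of block-level checksumming in Python, using zlib.crc32 as a stand-in for the chunk server's checksum function (the block size matches the 64KB granularity above; the function names are assumptions for illustration):

```python
import zlib

BLOCK_SIZE = 64 * 1024  # checksums cover 64 KB blocks within each chunk

def block_checksums(chunk):
    """Compute one CRC32 per 64 KB block of a chunk."""
    return [zlib.crc32(chunk[i:i + BLOCK_SIZE])
            for i in range(0, len(chunk), BLOCK_SIZE)]

def verify_read(chunk, checksums, offset, length):
    """Verify only the blocks a read touches, as a chunk server would,
    before returning any data to the application."""
    first = offset // BLOCK_SIZE
    last = (offset + length - 1) // BLOCK_SIZE
    for b in range(first, last + 1):
        block = chunk[b * BLOCK_SIZE:(b + 1) * BLOCK_SIZE]
        if zlib.crc32(block) != checksums[b]:
            # Corruption: report to the master and read from another replica.
            raise IOError(f"corrupt block {b}")
    return chunk[offset:offset + length]
```

Checking only the touched blocks keeps verification cost proportional to the read size, not the chunk size.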

Page 100: Google File System

Fault Tolerance for Chunk Server

I All chunks are versioned.

I Version number updated when a new lease is granted.

I Chunks with old versions are not served and are deleted.

Amir H. Payberah (Tehran Polytechnic) GFS 1393/7/14 45 / 63
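
Stale-replica detection from version numbers can be sketched as follows (a hypothetical helper; the real master also increments the version whenever it grants a new lease):

```python
def classify_replicas(master_version, replica_versions):
    """Partition a chunk's replicas by version, as the master does when
    chunk servers report their state (simplified sketch)."""
    current, stale = [], []
    for server, version in replica_versions.items():
        if version < master_version:
            stale.append(server)   # never served to clients; scheduled for deletion
        else:
            current.append(server)
    return current, stale
```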

Page 101: Google File System

Fault Tolerance for Master

I Master state replicated for reliability on multiple machines.

I When the master fails:
• It can restart almost instantly.
• A new master process is started elsewhere.

I Shadow (not mirror) master provides only read-only access to filesystem when primary master is down.

Amir H. Payberah (Tehran Polytechnic) GFS 1393/7/14 46 / 63

Page 104: Google File System

High Availability

I Fast recovery
• The master and chunk servers restore their state and restart in seconds, no matter how they terminated.

I Heartbeat messages:
• Checking liveness of chunk servers
• Piggybacking garbage collection commands
• Lease renewal

Amir H. Payberah (Tehran Polytechnic) GFS 1393/7/14 47 / 63
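
Piggybacking work on heartbeat replies can be illustrated with a small Python sketch (the message format and field names are assumptions for illustration, not the actual GFS protocol):

```python
def make_heartbeat_reply(chunkserver_id, state):
    """Build the master's reply to a chunk-server heartbeat, piggybacking
    garbage collection and lease renewal onto the liveness exchange."""
    reply = {"ack": chunkserver_id}
    # Garbage collection: chunks the server holds but the master no longer
    # references can be reclaimed by the server at its leisure.
    orphans = state["held_chunks"] - state["master_chunks"]
    if orphans:
        reply["delete_chunks"] = sorted(orphans)
    # Lease renewal for chunks this server is currently primary for.
    if state["primary_for"]:
        reply["renew_leases"] = sorted(state["primary_for"])
    return reply
```

Folding this work into the existing heartbeat avoids extra control messages between the master and hundreds of chunk servers.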

Page 107: Google File System

HDFS

I Sub-project of Apache Hadoop project

I Inspired by the Google File System

I Namenode: master

I Datanode: chunk server

I Block: chunk

Amir H. Payberah (Tehran Polytechnic) GFS 1393/7/14 49 / 63

Page 108: Google File System

HDFS Installation and Shell

Amir H. Payberah (Tehran Polytechnic) GFS 1393/7/14 50 / 63

Page 109: Google File System

HDFS Installation

I Three options

• Local (Standalone) Mode

• Pseudo-Distributed Mode

• Fully-Distributed Mode

Amir H. Payberah (Tehran Polytechnic) GFS 1393/7/14 51 / 63

Page 110: Google File System

Installation - Local

I Default configuration after the download.

I Executes as a single Java process.

I Works directly with local filesystem.

I Useful for debugging.

Amir H. Payberah (Tehran Polytechnic) GFS 1393/7/14 52 / 63

Page 111: Google File System

Installation - Pseudo-Distributed (1/6)

I Still runs on a single node.

I Each daemon runs in its own Java process:
• Namenode
• Secondary Namenode
• Datanode

I Configuration files:
• hadoop-env.sh
• core-site.xml
• hdfs-site.xml

Amir H. Payberah (Tehran Polytechnic) GFS 1393/7/14 53 / 63

Page 112: Google File System

Installation - Pseudo-Distributed (2/6)

I Specify environment variables in hadoop-env.sh

export JAVA_HOME=/opt/jdk1.7.0_51

Amir H. Payberah (Tehran Polytechnic) GFS 1393/7/14 54 / 63

Page 113: Google File System

Installation - Pseudo-Distributed (3/6)

I Specify the location of the Namenode in core-site.xml

<property>

<name>fs.defaultFS</name>

<value>hdfs://localhost:8020</value>

<description>NameNode URI</description>

</property>

Amir H. Payberah (Tehran Polytechnic) GFS 1393/7/14 55 / 63

Page 114: Google File System

Installation - Pseudo-Distributed (4/6)

I Configuration of the Namenode in hdfs-site.xml

I Path on the local filesystem where the Namenode stores the namespace and transaction logs persistently.

<property>

<name>dfs.namenode.name.dir</name>

<value>/opt/hadoop-2.2.0/hdfs/namenode</value>

<description>description...</description>

</property>

Amir H. Payberah (Tehran Polytechnic) GFS 1393/7/14 56 / 63

Page 115: Google File System

Installation - Pseudo-Distributed (5/6)

I Configuration of the Secondary Namenode in hdfs-site.xml

I Path on the local filesystem where the Secondary Namenode stores the temporary images to merge.

<property>

<name>dfs.namenode.checkpoint.dir</name>

<value>/opt/hadoop-2.2.0/hdfs/secondary</value>

<description>description...</description>

</property>

Amir H. Payberah (Tehran Polytechnic) GFS 1393/7/14 57 / 63

Page 116: Google File System

Installation - Pseudo-Distributed (6/6)

I Configuration of the Datanode in hdfs-site.xml

I Comma-separated list of paths on the local filesystem of a Datanode where it should store its blocks.

<property>

<name>dfs.datanode.data.dir</name>

<value>/opt/hadoop-2.2.0/hdfs/datanode</value>

<description>description...</description>

</property>

Amir H. Payberah (Tehran Polytechnic) GFS 1393/7/14 58 / 63

Page 117: Google File System

Start HDFS and Test

I Format the Namenode directory (do this only once, the first time).

hdfs namenode -format

I Start the Namenode, Secondary namenode and Datanode daemons.

hadoop-daemon.sh start namenode

hadoop-daemon.sh start secondarynamenode

hadoop-daemon.sh start datanode

jps

I Verify the daemons are running:
• Namenode: http://localhost:50070
• Secondary Namenode: http://localhost:50090
• Datanode: http://localhost:50075

Amir H. Payberah (Tehran Polytechnic) GFS 1393/7/14 59 / 63

Page 120: Google File System

HDFS Shell

hdfs dfs -<command> -<option> <path>

hdfs dfs -ls /

hdfs dfs -ls file:///home/big

hdfs dfs -ls hdfs://localhost/

hdfs dfs -cat /dir/file.txt

hdfs dfs -cp /dir/file1 /otherDir/file2

hdfs dfs -mv /dir/file1 /dir2/file2

hdfs dfs -mkdir /newDir

hdfs dfs -put file.txt /dir/file.txt # can also use copyFromLocal

hdfs dfs -get /dir/file.txt file.txt # can also use copyToLocal

hdfs dfs -rm /dir/fileToDelete

hdfs dfs -help

Amir H. Payberah (Tehran Polytechnic) GFS 1393/7/14 60 / 63

Page 122: Google File System

Summary

Amir H. Payberah (Tehran Polytechnic) GFS 1393/7/14 61 / 63

Page 123: Google File System

Summary

I Google File System (GFS)

I Files and chunks

I GFS architecture: master, chunk servers, client

I GFS interactions: read and update (write and record append)

I Master operations: metadata management, replica placement, and garbage collection

Amir H. Payberah (Tehran Polytechnic) GFS 1393/7/14 62 / 63

Page 124: Google File System

Questions?

Amir H. Payberah (Tehran Polytechnic) GFS 1393/7/14 63 / 63