Presented by Haoran Ma, Yifan Qiao
The Hadoop Distributed File System
Konstantin Shvachko, Hairong Kuang, Sanjay Radia, Robert Chansler Yahoo!
Sunnyvale, California USA
Outline
• Introduction
• Architecture
• File I/O Operations and Replica Management
• Practice at Yahoo!
• Future Work
Introduction
• A single dataset can be too large to store on one machine → divide it into blocks and store them on a cluster of commodity hardware.
- What if one of the physical machines fails?
• Some applications, such as MapReduce, need high-throughput access to the data.
Introduction
• HDFS is the file system component of Hadoop. It is designed to store very large data sets reliably, and to stream those data sets at high bandwidth to user applications.
• Both goals are achieved by replicating file contents on multiple machines (DataNodes).
Introduction
• Very Large Distributed File System
• Assumes Commodity Hardware
- Files are replicated to handle hardware failure
• Optimized for Batch Processing
- Data locations exposed so that computations can move to where data resides
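Hadoop exposes block locations through its public FileSystem API. Below is a minimal sketch (the path /data/input.txt and a configured fs.defaultFS are assumptions) that asks the NameNode where each block of a file lives:

```java
// Sketch: list the DataNodes that hold each block of a file, using the
// standard org.apache.hadoop.fs API. Assumes an HDFS reachable via the
// default configuration; the input path is hypothetical.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocations {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // reads core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);
        FileStatus status = fs.getFileStatus(new Path("/data/input.txt"));
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation b : blocks) {
            // Each entry names the hosts holding a replica of that block range.
            System.out.println(b.getOffset() + "+" + b.getLength()
                    + " -> " + String.join(",", b.getHosts()));
        }
        fs.close();
    }
}
```

This is exactly the information a scheduler like MapReduce uses to place tasks on (or near) the nodes that hold their input blocks.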
Introduction
[Figure: a file is divided into blocks; each block is usually 128 MB]
Source: HDFS Tutorial – A Complete Hadoop HDFS Overview. DATAFLAIR TEAM.
Architecture
Source: Hadoop HDFS Architecture Explanation and Assumptions. DATAFLAIR TEAM.
Architecture
NameNode:
• Stores metadata, such as the number of data blocks, replica locations, and other details, in memory
• Maintains and manages the DataNodes, and assigns tasks to them
Architecture
DataNodes: store application data
Source: Hadoop HDFS Architecture Explanation and Assumptions. DATAFLAIR TEAM.
Architecture
HDFS Client: a code library that exports the HDFS file system interface
Architecture
• How does this architecture achieve high fault tolerance?
• DataNode failures
• NameNode failure
Architecture: Failure Recovery for DataNodes
Source: Understanding Hadoop Clusters and the Network. Brad Hedlund.
Architecture: Failure Recovery for DataNodes
• Block Report: each DataNode periodically reports the block replicas it holds, so the NameNode knows where every block lives
• Heartbeat: each DataNode regularly signals the NameNode that it is alive; a DataNode that stays silent too long (10 minutes by default) is considered dead, and its replicas are re-replicated elsewhere
Source: Understanding Hadoop Clusters and the Network. Brad Hedlund.
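As an illustration of that liveness rule, here is a toy sketch (invented names, not Hadoop's internal code) of the bookkeeping the NameNode performs, using the 10-minute default timeout from the paper:

```java
// Toy sketch of DataNode liveness tracking on the NameNode side.
// Real Hadoop internals differ; names here are invented for illustration.
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class HeartbeatMonitor {
    static final long TIMEOUT_MS = 10 * 60 * 1000;  // 10 minutes of silence => dead
    private final Map<String, Long> lastSeen = new ConcurrentHashMap<>();

    // Called whenever a heartbeat arrives from a DataNode.
    void onHeartbeat(String dataNodeId) {
        lastSeen.put(dataNodeId, System.currentTimeMillis());
    }

    // Scanned periodically; a dead node's replicas get re-replicated.
    boolean isDead(String dataNodeId) {
        Long t = lastSeen.get(dataNodeId);
        return t == null || System.currentTimeMillis() - t > TIMEOUT_MS;
    }
}
```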
Architecture: Failure Recovery for DataNodes
What if NameNode fails?
Architecture: Failure Recovery for NameNode
Image = Checkpoint + Journal
• Image: The file system metadata that describes the organization of application data as directories and files.
• Checkpoint: A persistent record of the image written to disk.
• Journal: The modification log of the image. It is also stored in the local host’s native file system.
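The recovery rule follows directly from these definitions: load the checkpoint, then replay the journal. A toy sketch (invented names, not Hadoop internals):

```java
// Toy sketch of image recovery: image = checkpoint + replayed journal.
// The "namespace" here is just a list of applied operations, for illustration.
import java.util.ArrayList;
import java.util.List;

class NamespaceImage {
    private final List<String> appliedOps = new ArrayList<>();

    void apply(String op) { appliedOps.add(op); }

    static NamespaceImage recover(List<String> checkpointOps, List<String> journalOps) {
        NamespaceImage image = new NamespaceImage();
        checkpointOps.forEach(image::apply);  // load the persisted checkpoint
        journalOps.forEach(image::apply);     // replay every mutation logged since
        return image;                         // in-memory image is now up to date
    }
}
```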
Architecture: Failure Recovery for NameNode
• CheckpointNode:
• Periodically combines the existing checkpoint and journal to create a new checkpoint and an empty journal.
• BackupNode:
• A read-only NameNode.
• Maintains an in-memory, up-to-date image of the file system namespace that is always synchronized with the state of the NameNode.
• If the NameNode fails, the BackupNode's in-memory image together with the checkpoint on disk records the latest namespace state.
Architecture: Failure Recovery for NameNode
• Snapshots
• To minimize potential damage to the data stored in the system during upgrades.
• Persistently save the current state of the file system (both data and metadata), so that if the upgrade results in data loss or corruption, it is possible to roll back the upgrade and return HDFS to the namespace and storage state as they were at the time of the snapshot.
• Implemented as copy-on-write: DataNodes hard-link existing block files instead of copying the data.
Architecture: Failure Recovery for NameNode
[Diagram: the NameNode keeps the image in memory and a checkpoint + journal on disk; the CheckpointNode periodically combines them and returns a new checkpoint with an empty journal; the BackupNode keeps its in-memory image synchronized with the NameNode; a DataNode snapshot duplicates the storage directory using only hard links]
Usage & Management of HDFS Cluster
• Basic File I/O operations
• Rack Awareness
• Replication Management
File I/O Operations
• Write Files to HDFS: single writer, multiple readers
(1) addBlock
(2) unique block IDs
(3) write to block
1. The client consults the NameNode to get a lease and the destination DataNodes
2. The client writes the block to the DataNodes in a pipelined fashion
3. The DataNodes replicate the block
4. The client writes a new block after finishing the previous one
The visibility of a modification is not guaranteed until the file is closed (or the data are explicitly flushed)!
Source: Understanding Hadoop Clusters and the Network. Brad Hedlund.
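From the client's point of view, all of this happens behind Hadoop's public FileSystem API. A minimal write sketch (the output path is an assumption; lease handling, block allocation, and pipelining occur inside create() and write()):

```java
// Sketch: write a file to HDFS with the standard org.apache.hadoop.fs API.
// The path is hypothetical; an HDFS reachable via the default config is assumed.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import java.nio.charset.StandardCharsets;

public class WriteFile {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // create() obtains the lease: HDFS allows a single writer per file.
        try (FSDataOutputStream out = fs.create(new Path("/data/out.txt"))) {
            out.write("hello hdfs\n".getBytes(StandardCharsets.UTF_8));
            out.hflush();  // makes data written so far visible to new readers
        }                  // close() releases the lease; visibility is then guaranteed
        fs.close();
    }
}
```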
File I/O Operations
• Read Files from HDFS
1. The client consults the NameNode to get the list of blocks and the locations of their replicas
2. It tries the nearest replica first, then the next nearest, and so on
• Identifying corrupted data: checksums (CRC32) are verified on every read
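The corresponding read path through the public API (the path is an assumption; replica selection and checksum verification happen inside open() and read()):

```java
// Sketch: read a file from HDFS with the standard org.apache.hadoop.fs API.
// Reads are served from the nearest replica; CRC32 checksums are verified
// transparently, and a corrupt replica causes a fallback to another one.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReadFile {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        byte[] buf = new byte[4096];
        try (FSDataInputStream in = fs.open(new Path("/data/out.txt"))) {
            int n;
            while ((n = in.read(buf)) > 0) {
                System.out.write(buf, 0, n);
            }
        }
        System.out.flush();
        fs.close();
    }
}
```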
File I/O Operations
In-cluster Client Reads a File
Source: Understanding Hadoop Clusters and the Network. Brad Hedlund.
Outside Client Reads a File
Source: Understanding Hadoop Clusters and the Network. Brad Hedlund.
Rack Awareness
Source: Understanding Hadoop Clusters and the Network. Brad Hedlund.
Rack Awareness
• Benefits:
  • Higher throughput
  • Higher reliability: an entire rack failure never loses all replicas of a block
  • Better network bandwidth utilization: reduce inter-rack and inter-node write traffic as much as possible
• The default HDFS replica placement policy:
  1. No DataNode contains more than one replica of any block
  2. No rack contains more than two replicas of the same block, provided there are sufficient racks on the cluster
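A toy checker for these two rules (invented names; Hadoop's real logic lives in its pluggable block placement policy):

```java
// Toy sketch: validate one block's replica locations against the two rules
// above. The node IDs and the node-to-rack map are hypothetical inputs.
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class PlacementCheck {
    static boolean satisfiesDefaultPolicy(List<String> replicaNodes,
                                          Map<String, String> rackOf) {
        Map<String, Integer> perNode = new HashMap<>();
        Map<String, Integer> perRack = new HashMap<>();
        for (String node : replicaNodes) {
            perNode.merge(node, 1, Integer::sum);
            perRack.merge(rackOf.get(node), 1, Integer::sum);
        }
        boolean onePerNode    = perNode.values().stream().allMatch(c -> c == 1);
        boolean maxTwoPerRack = perRack.values().stream().allMatch(c -> c <= 2);
        return onePerNode && maxTwoPerRack;  // rule 1 && rule 2
    }
}
```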
Rack Awareness
Replication Management
• To avoid blocks becoming under- or over-replicated
Source: Understanding Hadoop Clusters and the Network. Brad Hedlund.
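A toy sketch of that bookkeeping (invented names, not Hadoop internals): using block reports, the NameNode compares each block's live replica count to its target and queues under-replicated blocks, prioritizing those closest to being lost:

```java
// Toy sketch of replication management on the NameNode side.
import java.util.Comparator;
import java.util.PriorityQueue;

class ReplicationManager {
    record BlockState(long blockId, int liveReplicas, int target) {}

    // Fewest live replicas first: those blocks are closest to being lost.
    private final PriorityQueue<BlockState> underReplicated =
        new PriorityQueue<>(Comparator.comparingInt(BlockState::liveReplicas));

    void inspect(BlockState b) {
        if (b.liveReplicas() < b.target()) {
            underReplicated.add(b);       // schedule a new copy
        } else if (b.liveReplicas() > b.target()) {
            removeExcessReplica(b);       // over-replicated: drop one copy
        }
    }

    private void removeExcessReplica(BlockState b) { /* choose a replica to drop */ }
}
```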
Practice at Yahoo!
Cluster Basic Information
• Clusters at Yahoo! can be as large as ~3500 nodes, with a typical configuration of:
  • 2 quad-core Xeon processors @ 2.5 GHz
  • 4 directly attached SATA drives (1 TB each, 4 TB total)
  • 16 GB RAM
  • 1-Gbit Ethernet
• 9.8 PB of total storage; 3.3 PB available for user applications when replicating blocks 3 times
Practice at Yahoo!
Data Durability
• Uncorrelated node failures:
  • Chance of a node failing during a month: ~0.8% (a naive estimate of the probability that a node fails during a year is ~9.2%)
  • Chance of losing a block during a year: < 0.5%
• Correlated node failures:
  • HDFS tolerates a rack switch failure
  • But a core switch failure or a cluster power loss can lose some blocks
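As a sanity check of the naive yearly estimate (assuming node failures in different months are independent):

```latex
P(\text{node fails within a year}) = 1 - (1 - 0.008)^{12} \approx 1 - 0.908 \approx 9.2\%
```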
Practice at Yahoo!
• Benchmarks

Scenario                    Read (MB/s per node)      Write (MB/s per node)
DFSIO                       66                        40
7200 RPM desktop HDD [6]    < 130 (typical 50-120)    < 130 (typical 50-120)
Table 1: Contrived benchmark compared with typical HDD performance

Scenario        Read (MB/s per node)    Write (MB/s per node)
Busy Cluster    1.02                    1.09
Table 2: HDFS performance in a production cluster
Practice at Yahoo!
• Benchmarks

Bytes (TB)   Nodes   Maps    Reduces   Time (s)   Aggregate HDFS I/O (GB/s)   Per Node (MB/s)
1            1460    8000    2700      62         32                          22.1
1000         3658    80000   20000     58500      34.2                        9.35
Table 3: Sort benchmark

1000 TB is too large to fit in the nodes' memory, so intermediate results spill to disk and occupy disk bandwidth, which lowers per-node HDFS throughput.
Practice at Yahoo!
• Benchmarks

Operation                   Throughput (ops/s)
Open file for read          126 100
Create file                 5 600
Rename file                 8 300
Delete file                 20 700
DataNode heartbeat          300 000
Blocks report (blocks/s)    639 700
Table 4: NameNode throughput benchmark

• Operations that modify the namespace (create, rename, delete) have much lower throughput
• The single NameNode can become the bottleneck at large scale
Summary: HDFS: Two Easy Pieces*
• Reliability
• Throughput
*: The title is from two great books: Six Easy Pieces: Essentials Of Physics Explained By Its Most Brilliant Teacher, by Richard P. Feynman, and Operating Systems: Three Easy Pieces, by Remzi H. Arpaci-Dusseau and Andrea C. Arpaci-Dusseau
Summary: HDFS: Reliability
• System Design:
  • Split files into blocks and replicate them (typically 3 copies)
• For the NameNode:
  • Checkpoint + Journal can restore the latest image
  • BackupNode
  • Snapshot
  • The NameNode is the single point of failure of the whole system - NOT GOOD!
• For DataNodes:
  • Rack awareness + replica placement policy: never lose a block if a rack fails
  • Replication management, to avoid blocks becoming under-replicated
  • Snapshot
Summary: HDFS: Throughput
• System Design
  • Split files into large blocks (128 MB) - good for streaming and parallel access
  • Provide APIs that expose the locations of blocks - facilitating applications to schedule computation tasks to where the data reside
• NameNode - not good for high throughput and scalability
  • A single node handles all requests from clients and manages all DataNodes
• DataNodes
  • Rack awareness & replica placement policy - better utilization of network bandwidth
  • Write files in a pipelined way
  • Read files from the nearest replica first
Future Work (Out of Date!)
• Automated failover solution
  • ZooKeeper
• Scalability of the NameNode
  • Multiple namespaces sharing the physical storage
  • Advantages:
    • Isolate namespaces
    • Improve overall availability
    • Generalize the block storage abstraction
  • Drawbacks:
    • Management cost
Thank you.
• References:
[1] “The Hadoop Distributed File System”. Konstantin Shvachko, Hairong Kuang, Sanjay Radia, Robert Chansler. IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST), 2010.
[2] “Hadoop HDFS Architecture Explanation and Assumptions”. DATAFLAIR TEAM. https://data-flair.training/blogs/hadoop-hdfs-architecture/
[3] “HDFS Tutorial – A Complete Hadoop HDFS Overview”. DATAFLAIR TEAM. https://data-flair.training/blogs/hadoop-hdfs-tutorial/
[4] “HDFS Architecture”. http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html
[5] “Understanding Hadoop Clusters and the Network”. Brad Hedlund. http://bradhedlund.com/2011/09/10/understanding-hadoop-clusters-and-the-network/
[6] "Speed Considerations". Seagate. https://web.archive.org/web/20110920075313/http://www.seagate.com/www/en-us/support/before_you_buy/speed_considerations