Transcript
HDFS (Hadoop Distributed File System)
Thiru
Agenda
- Typical workflow
- Writing a file into HDFS
- Reading a file from HDFS
- Rack awareness
- Planning for a cluster
- Q & A
Hadoop Server Roles
[Diagram: the Masters run the MapReduce Job Tracker and the HDFS Name Node (plus a Secondary Name Node); each Slave runs a Data Node and a Task Tracker; the Client talks to both layers.]
Hadoop Cluster
[Diagram: a cluster with the Name Node, Job Tracker, and Secondary NN as masters, racks of slave nodes each running DN + TT (Data Node + Task Tracker), and Hadoop Clients connecting from outside.]
Sample HDFS Workflow
1. Write data into the cluster (HDFS)
2. Analyze the data (MapReduce)
3. Store the result in the cluster (HDFS)
4. Read the result from the cluster (HDFS)
Sample scenario: how many times did customers call customer care enquiring about a recently launched product? Compare that against the ad campaign on television, correlate the two, and find the best time to run the ad.
[Diagram: CRM data entry → Sqoop → HDFS → Map Reduce → Result]
Write data into the cluster (HDFS)
[Diagram: the Hadoop Client ("File size is 200 MB. I want to write a file") consults the Name Node, which replies "OK! Block size is 64 MB. Split the file into 3 and write to nodes 1, 4, 5"; Data Nodes 1–6.]
- Client consults the Name Node
- Client writes the data to one Data Node
- That Data Node replicates the block as per the replication factor and informs the Name Node
- The cycle repeats for every block
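The write path above can be sketched as follows. This is an illustrative simulation, not Hadoop's actual API: the node IDs and the dict standing in for the Name Node's metadata are assumptions. Note that 200 MB in 64 MB blocks is strictly four blocks (three full ones plus an 8 MB tail); the slide rounds this to "split into 3".

```python
# Sketch of the HDFS write path described above (illustrative, not Hadoop's API).
BLOCK_SIZE = 64 * 1024 * 1024   # 64 MB, the block size the Name Node announces

def split_into_blocks(file_size):
    """Return the sizes of the blocks a file is split into."""
    full, tail = divmod(file_size, BLOCK_SIZE)
    return [BLOCK_SIZE] * full + ([tail] if tail else [])

def write_file(file_size, target_nodes, replication=3):
    """For each block, the client writes to one data node; that node
    replicates down a pipeline and informs the Name Node, which records
    the block locations (simulated here as a dict)."""
    placements = {}
    for block_id, _ in enumerate(split_into_blocks(file_size)):
        primary = target_nodes[block_id % len(target_nodes)]
        replicas = [primary] + [n for n in target_nodes if n != primary]
        placements[block_id] = replicas[:replication]
    return placements

blocks = split_into_blocks(200 * 1024 * 1024)
placements = write_file(200 * 1024 * 1024, target_nodes=[1, 4, 5])
print(len(blocks), placements[0])  # 4 [1, 4, 5]
```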
Rack Awareness
- Never lose data when a rack goes down
- Keep bulky flows within a rack when possible
- Assumption: in-rack traffic has higher bandwidth and lower latency
[Diagram: Data Nodes 1–12 spread across racks; the Name Node holds the rack-awareness mapping (Rack 1: Data Node 1, Data Node 2; Rack 2: Data Node 5); replicas of blocks A, B, and C are distributed so that no single rack holds all copies of a block.]
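A minimal sketch of rack-aware replica placement follows. The rack map extends the mapping shown on the slide to all twelve nodes and is an assumption; the policy itself (first replica on the writer's node, the other two on a single different rack) is HDFS's default behavior.

```python
# Sketch of rack-aware replica placement (rack map is an illustrative assumption).
RACKS = {  # data node -> rack, as the Name Node's rack-awareness script reports
    1: "rack1", 2: "rack1", 3: "rack1", 4: "rack1",
    5: "rack2", 6: "rack2", 7: "rack2", 8: "rack2",
    9: "rack3", 10: "rack3", 11: "rack3", 12: "rack3",
}

def place_replicas(writer_node):
    """Default HDFS policy: 1st replica on the writer's node, 2nd on a
    node in a different rack, 3rd on another node in that second rack.
    Losing one whole rack therefore never loses all copies of a block."""
    first = writer_node
    remote = [n for n in RACKS if RACKS[n] != RACKS[first]]
    second = remote[0]
    third = next(n for n in remote if RACKS[n] == RACKS[second] and n != second)
    return [first, second, third]

print(place_replicas(1))  # [1, 5, 6] -- spans rack1 and rack2
```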
Multi-Block Replication
[Diagram: the Hadoop Client writes the 200 MB File.txt as blocks A, B, and C; for block A the Name Node answers "Replicate in 3, 8", and replicas of each block are placed accordingly across Data Nodes 1–12.]
Name Node
- Data nodes send heartbeats; every 10th heartbeat is a block report
- The Name Node builds its metadata from the block reports
- If the Name Node is down, HDFS is down
- Missing heartbeats signify lost nodes
- The Name Node consults its metadata and finds the affected data
- The Name Node consults the rack-awareness script
- The Name Node tells data nodes to re-replicate
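The failure-handling steps above can be sketched as follows; the class, the timeout value, and the in-memory dicts are illustrative assumptions, not Hadoop internals.

```python
# Sketch of the Name Node's heartbeat bookkeeping described above (illustrative).
DEAD_AFTER = 10 * 60  # seconds of silence before a node is declared lost (assumed)

class NameNodeMonitor:
    def __init__(self):
        self.last_seen = {}   # data node -> timestamp of last heartbeat
        self.block_map = {}   # block -> set of data nodes holding a replica

    def heartbeat(self, node, now, block_report=None):
        """Record a heartbeat; every 10th one carries a block report,
        from which the Name Node builds its metadata."""
        self.last_seen[node] = now
        for block in block_report or []:
            self.block_map.setdefault(block, set()).add(node)

    def under_replicated(self, now, replication=3):
        """Missing heartbeats signify lost nodes: find blocks whose live
        replica count dropped below the replication factor, so the Name
        Node can tell surviving data nodes to re-replicate them."""
        dead = {n for n, t in self.last_seen.items() if now - t > DEAD_AFTER}
        return [(b, replication - len(h - dead))
                for b, h in self.block_map.items()
                if 0 < len(h - dead) < replication]

nn = NameNodeMonitor()
for node in (1, 2, 3):
    nn.heartbeat(node, now=0, block_report=["blk_a"])
nn.heartbeat(1, now=650); nn.heartbeat(2, now=650)   # node 3 went silent
print(nn.under_replicated(now=700))  # [('blk_a', 1)] -> one new copy needed
```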
Name Node & Secondary Name Node
- The Secondary Name Node is not a hot standby for the Name Node* (a hot standby needs something like ZooKeeper)
- Connects to the Name Node every hour* (configurable)
- Does housekeeping and backs up the Name Node metadata
- The saved metadata can be used to rebuild the Name Node
[Diagram: the Secondary Name Node tells the Primary Name Node "It's been 1 hr, give me your data". File system metadata: File.txt = A0 {1,5,7}, A1 {1,7,9}, A2 {5,10,15}]
Understanding Secondary Name Node housekeeping
[Diagram: the primary rolls its edits log over into edits-new; the Secondary Name Node copies fsimage and edits, merges them into fsimage.ckpt, and ships fsimage.ckpt back; the primary replaces its fsimage with the checkpoint and renames edits-new to edits.]
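The merge step of the checkpoint above can be sketched as an edits-log replay. The file names (fsimage, edits, fsimage.ckpt) follow the slide; representing the namespace as a dict and the log as create/delete tuples is an assumption for illustration.

```python
# Sketch of the Secondary Name Node checkpoint described above.
# fsimage = last saved snapshot of the namespace; edits = changes since then.
def checkpoint(fsimage, edits):
    """Replay the edits log on top of fsimage to produce fsimage.ckpt,
    which the primary then adopts as its new fsimage."""
    namespace = dict(fsimage)           # start from the old snapshot
    for op, path, value in edits:       # replay each logged operation
        if op == "create":
            namespace[path] = value
        elif op == "delete":
            namespace.pop(path, None)
    return namespace                    # this is fsimage.ckpt

fsimage = {"/File.txt": ["A0", "A1", "A2"]}
edits = [("create", "/new.txt", ["B0"]), ("delete", "/File.txt", None)]
ckpt = checkpoint(fsimage, edits)
print(ckpt)  # {'/new.txt': ['B0']} -- snapshot plus replayed edits
```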
Reading data from HDFS Cluster
[Diagram: the Hadoop Client asks the Name Node "I want to read file.txt"; the Name Node replies "OK! file.txt = blk_a {1,5,6}, blk_b {8,1,2}, blk_c {5,8,9}"; Data Nodes 1–9 hold replicas of blocks A, B, and C.]
- Client consults the Name Node
- Client receives a data node list for each block
- Client picks the first node of each list
- Client reads the data sequentially
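The read steps above can be sketched as follows. The block map uses the values from the slide; `fetch` is a hypothetical stand-in for the actual data-node transfer, not a real Hadoop call.

```python
# Sketch of the HDFS read path described above (illustrative).
# The Name Node's answer for file.txt, as on the slide: block -> data node list.
BLOCK_LOCATIONS = {"blk_a": [1, 5, 6], "blk_b": [8, 1, 2], "blk_c": [5, 8, 9]}

def read_file(block_locations, fetch):
    """The client picks the first data node on each block's list and
    reads the blocks sequentially, concatenating them into the file."""
    data = b""
    for block in block_locations:            # blocks in file order
        node = block_locations[block][0]     # first node on the list
        data += fetch(node, block)           # one sequential block read
    return data

# fetch() is a hypothetical stand-in that labels each block with its source.
fake_fetch = lambda node, block: f"<{block}@dn{node}>".encode()
print(read_file(BLOCK_LOCATIONS, fake_fetch))
```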
Choosing the right hardware
Master node:
- Single point of failure
- Dual power supplies for redundancy
- No commodity hardware
- Regular data backups
- RAM thumb rule: 1 GB per million blocks of data
Tasks per node:
- 1 core can run 1.5 mappers or reducers
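The two rules of thumb above, worked as arithmetic; the workload numbers fed in (100 TB, 64 MB blocks, an 8-core node) are example inputs, not recommendations from the deck.

```python
# Applying the sizing rules of thumb above (example inputs, illustrative only).
def namenode_ram_gb(total_blocks):
    """Thumb rule: ~1 GB of Name Node RAM per million blocks."""
    return total_blocks / 1_000_000

def task_slots(cores):
    """Thumb rule: one core can run ~1.5 mappers or reducers."""
    return int(cores * 1.5)

blocks = 100 * 1024 * 1024 // 64     # 100 TB of data in 64 MB blocks
print(namenode_ram_gb(blocks))       # ~1.6 GB of Name Node RAM
print(task_slots(8))                 # 12 task slots on an 8-core node
```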
Practice at Yahoo!
HDFS clusters at Yahoo! include about 3500 nodes
A typical cluster node has:
- 2 quad-core Xeon processors @ 2.5 GHz
- Red Hat Enterprise Linux Server Release 5.1
- Sun Java JDK 1.6.0_13-b03
- 4 directly attached SATA drives (one terabyte each)
- 16 GB RAM
- 1-gigabit Ethernet
70 percent of the disk space is allocated to HDFS. The remainder is reserved for the operating system (Red Hat Linux), logs, and space to spill the output of map tasks. (MapReduce intermediate data are not stored in HDFS.)
For each cluster, the NameNode and the BackupNode hosts are specially provisioned with up to 64GB RAM; application tasks are never assigned to those hosts.
In total, a cluster of 3500 nodes has 9.8 PB of storage available as blocks that are replicated three times yielding a net 3.3 PB of storage for user applications. As a convenient approximation, one thousand nodes represent one PB of application storage.
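The storage figures above check out arithmetically:

```python
# Verifying the Yahoo! storage arithmetic quoted above.
nodes = 3500
raw_tb = nodes * 4 * 1.0          # 4 one-terabyte SATA drives per node
hdfs_pb = raw_tb * 0.70 / 1000    # 70% of disk space is allocated to HDFS
net_pb = hdfs_pb / 3              # 3x replication

# "One thousand nodes represent one PB": 4000 TB * 0.7 / 3 ≈ 0.93 PB net.
print(round(hdfs_pb, 1), round(net_pb, 1))  # 9.8 3.3
```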
Durability of data:
- Uncorrelated node failures: replicating each block three times is a robust guard against loss of data due to uncorrelated node failures.
- Correlated node failures (the failure of a rack or core switch): HDFS can tolerate losing a rack switch, since each block has a replica on some other rack.
- Loss of electrical power to the cluster: a large cluster will lose a handful of blocks during a power-on restart.
Benchmarks
NameNode Throughput benchmark
Future work
- Automated failover. Plan: use ZooKeeper, Yahoo's distributed consensus technology, to build an automated failover solution.
- Scalability of the NameNode. Solution: the near-term solution to scalability is to allow multiple namespaces (and NameNodes) to share the physical storage within a cluster. Drawback: the main drawback of multiple independent namespaces is the cost of managing them.
Thank you