Hadoop and HDFS in CMRI China Mobile Research Institute WANG, Xu [wangxu(at)chinamobile.com]
Page 1: 20100130 hadoop apache

Hadoop and HDFS in CMRI

China Mobile Research Institute
WANG, Xu [wangxu(at)chinamobile.com]

Page 2: 20100130 hadoop apache

Apache Hadoop

http://hadoop.apache.org/
Open source clone of Google infrastructure
De facto standard MapReduce framework; has won the TeraSort benchmark several times
Used for search engines, data mining, and log analysis
Clusters scale up to 4,000 nodes
Users: Yahoo!, Facebook, Cloudera; Baidu, Alibaba, China Mobile

Internal material. Keep confidential.

Page 3: 20100130 hadoop apache

Hadoop in China 2009

Beijing, Nov 15, 2009

Page 4: 20100130 hadoop apache

Subprojects of Hadoop

[Diagram: the Hadoop software stack, top to bottom]

Pig
Hive: data warehouse
HBase (Google BigTable): K-K-V store / column-based DB
ZooKeeper (Google Chubby): distributed lock
Avro (ipc): serialized data format & RPC
Hadoop: basic platform
  Hadoop Core
    MapReduce (Google MapReduce)
    HDFS (Google GFS)
    Hadoop Common (io, ipc, ...)
JVM

Page 5: 20100130 hadoop apache

HDFS Principles

Follows the Google GFS paper
Designed for big data storage and processing
Write once, read many times
  Modification is not permitted; append will be supported soon
  Reads take priority over writes
Runs on commodity PCs
  Hardware may fail at any time
  Multiple replicas keep data safe
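The write-once and multiple-replica principles above can be pictured with a toy Python model. This is only a sketch under stated assumptions: `WriteOnceStore` and its methods are hypothetical names, whole files stand in for blocks, replica placement is simplified to "the first N datanodes", and the replication factor of 3 is the usual HDFS default rather than a figure from these slides.

```python
class WriteOnceStore:
    """Toy write-once store with replicated files (hypothetical, not HDFS code)."""

    def __init__(self, datanodes, replication=3):
        # Each datanode is modelled as a dict: path -> file content.
        self.datanodes = {name: {} for name in datanodes}
        self.replication = replication
        self.placement = {}  # path -> names of datanodes holding a replica

    def create(self, path, data):
        # Write once: an existing file can never be modified or overwritten.
        if path in self.placement:
            raise FileExistsError(path)
        # Simplified placement: pick the first `replication` living datanodes.
        targets = sorted(self.datanodes)[: self.replication]
        for name in targets:
            self.datanodes[name][path] = data
        self.placement[path] = targets

    def fail(self, name):
        # Commodity hardware may fail at any time: drop the node and its data.
        self.datanodes.pop(name, None)

    def read(self, path):
        # Any surviving replica can serve the read.
        for name in self.placement[path]:
            if name in self.datanodes:
                return self.datanodes[name][path]
        raise IOError("all replicas of %s are lost" % path)
```

With four datanodes and three replicas, a file stays readable after two of its replica holders fail, while a second `create` on the same path is rejected.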

Page 6: 20100130 hadoop apache

HDFS Architecture

Page 7: 20100130 hadoop apache

Data in HDFS NameNode’s Memory

Namespace info
  FS hierarchical tree
  Map(file, blocks)
DataNode map
  Map(living datanode, blocks)
Blocks map
  Map(block, file/datanodes)
Other runtime info
  Locks held by clients
  Blocks being processed (replication, invalid…)
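The three maps above can be sketched as plain Python dictionaries. This is an illustration only: the class and method names (`NameNodeMemory`, `add_block`, `block_report`, `datanode_dead`, `locate`) are hypothetical and do not correspond to the actual Hadoop classes.

```python
from collections import defaultdict


class NameNodeMemory:
    """Toy model of the maps a NameNode keeps in memory (hypothetical names)."""

    def __init__(self):
        self.file_blocks = defaultdict(list)     # Map(file, blocks)
        self.datanode_blocks = defaultdict(set)  # Map(living datanode, blocks)
        self.block_file = {}                     # Map(block, file)
        self.block_datanodes = defaultdict(set)  # Map(block, datanodes)

    def add_block(self, path, block_id):
        # Namespace side: the file gains a block.
        self.file_blocks[path].append(block_id)
        self.block_file[block_id] = path

    def block_report(self, datanode, block_ids):
        # Record which known blocks a living datanode says it holds.
        for block_id in block_ids:
            if block_id in self.block_file:
                self.datanode_blocks[datanode].add(block_id)
                self.block_datanodes[block_id].add(datanode)

    def datanode_dead(self, datanode):
        # Forget a datanode; return the blocks that lost a replica.
        lost = self.datanode_blocks.pop(datanode, set())
        for block_id in lost:
            self.block_datanodes[block_id].discard(datanode)
        return lost

    def locate(self, path):
        # For each block of the file, list the datanodes holding it.
        return [(b, sorted(self.block_datanodes[b])) for b in self.file_blocks[path]]
```

The sketch also shows why the datanode map can be rebuilt at runtime: it is filled purely from block reports, so only the namespace side has to be persisted.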

Page 8: 20100130 hadoop apache

Persistence of NameNode data

NameNode persistence
  Namespace: FSImage & EditLog
  Startup and shutdown
Secondary NameNode
  Checkpoint (merges the EditLog into the FSImage)
  Runs periodically (every hour by default)
Backup NameNode
  Introduced in 0.21 (not released yet)
  "Real-time Secondary NameNode" or remote EditLog
The DataNode map and other runtime info exist only in NameNode memory
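The FSImage/EditLog scheme above can be sketched as a snapshot file plus an append-only log. The sketch below is an assumption-laden toy, not Hadoop's on-disk format: `ToyNamespace`, the JSON files `fsimage.json` and `edits.log`, and the two operations are all hypothetical. In real HDFS the merge is done by the Secondary NameNode; here the same process performs it for brevity.

```python
import json
import os
import tempfile


class ToyNamespace:
    """Toy namespace persisted as an image file plus an edit log (hypothetical)."""

    def __init__(self, directory):
        self.image_path = os.path.join(directory, "fsimage.json")
        self.edits_path = os.path.join(directory, "edits.log")
        self.files = {}  # path -> list of block ids
        self._load()

    def _apply(self, op):
        if op["op"] == "create":
            self.files[op["path"]] = []
        elif op["op"] == "delete":
            self.files.pop(op["path"], None)

    def _load(self):
        # Startup: read the last image, then replay every logged edit.
        if os.path.exists(self.image_path):
            with open(self.image_path) as f:
                self.files = json.load(f)
        if os.path.exists(self.edits_path):
            with open(self.edits_path) as f:
                for line in f:
                    self._apply(json.loads(line))

    def log_and_apply(self, op):
        # Every namespace change is appended to the edit log first.
        with open(self.edits_path, "a") as f:
            f.write(json.dumps(op) + "\n")
        self._apply(op)

    def checkpoint(self):
        # Merge: write a fresh image, then start an empty edit log.
        tmp = self.image_path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(self.files, f)
        os.replace(tmp, self.image_path)
        open(self.edits_path, "w").close()
```

Because a restart replays the log on top of the last image, edits written after the last checkpoint survive as long as the log itself is intact; the checkpoint only bounds how long that replay takes.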

Page 9: 20100130 hadoop apache

High Availability Considerations

Availability in mainstream Hadoop
  The NameNode is a single point of failure (SPOF); a NameNode failure may cause:
    Service interruption for minutes
    Loss of up to one checkpoint period of data (worst case)
Possible solution: DRBD + Linux-HA
  Mature failover mechanism
  Service interruption for minutes
  Almost no data loss
Another solution: NameNode Cluster (NNC) extension
  Continuous service
  Almost no data loss
  Requires modifying the code
  Trade-off: consistency vs. performance

Page 10: 20100130 hadoop apache

HDFS+NNC Architecture

Page 11: 20100130 hadoop apache

NNC Design

Master and slaves: 1:N
The master synchronizes the FSNamesystem to the slaves
ZooKeeper works as a registry; clients and DataNodes can look up the NameNode list from it
DFSClient can access multiple NameNodes for read operations
Failover is so far controlled by Linux-HA, which gets NameNode status info through ClientProtocol
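The design points above can be pictured with a small Python sketch. Everything in it is an assumption made for illustration: `Registry` is an in-memory stand-in for the ZooKeeper registry (it makes no ZooKeeper calls), `ToyNameNode` and `ToyClient` are hypothetical names, writes are assumed to go to the master only, and reads are spread round-robin, which the slides do not specify.

```python
import itertools


class Registry:
    """In-memory stand-in for the ZooKeeper registry (not a ZooKeeper client)."""

    def __init__(self):
        self.master = None
        self.slaves = []

    def namenodes(self):
        return [self.master] + list(self.slaves)


class ToyNameNode:
    def __init__(self, name):
        self.name = name
        self.namespace = set()
        self.slaves = []  # only used when this node acts as master

    def mkdir(self, path):
        self.namespace.add(path)
        # The master pushes each update directly to every slave (1:N).
        for slave in self.slaves:
            slave.namespace.add(path)

    def exists(self, path):
        return path in self.namespace


class ToyClient:
    def __init__(self, registry):
        self.registry = registry
        # Look up the NameNode list once and rotate over it for reads.
        self._readers = itertools.cycle(registry.namenodes())

    def mkdir(self, path):
        # Assumption: write operations always go to the master.
        self.registry.master.mkdir(path)

    def exists(self, path):
        # Read operations may be served by any NameNode.
        node = next(self._readers)
        return node.name, node.exists(path)
```

Pushing updates synchronously inside `mkdir` keeps every slave consistent before the write returns, which is one way to see where the consistency vs. performance trade-off comes from.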

Page 12: 20100130 hadoop apache

Update Events

NNU_NOP            // nothing to do
NNU_BLK            // add or remove a block
NNU_INODE          // add or remove or modify an inode (add or remove file; new block allocation)
NNU_NEWFILE        // start new file
NNU_CLSFILE        // close new file
NNU_MVRM           // move or remove file
NNU_MKDIR          // mkdir
NNU_LEASE          // add/update or release a lease
NNU_LEASE_BATCH    // update batch of leases
NNU_DNODEHB_BATCH  // batch of datanode heartbeat
NNU_DNODEREG       // dnode register
NNU_DNODEBLK       // block report
NNU_DNODERM        // remove dnode
NNU_BLKRECV        // block received message from datanode
NNU_REPLICAMON     // replication monitor work
NNU_WORLD          // bootstrap a slave node
NNU_MASSIVE        // bootstrap a slave node
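How a slave might consume such events can be sketched in Python. Only the event names come from the list above; the subset chosen, the payload shapes (a path, or a `(src, dst)` pair where `dst` is `None` for removal), and the `SlaveNamespace` class are assumptions made for illustration, not the NNC wire format.

```python
from enum import Enum, auto


class UpdateEvent(Enum):
    # A subset of the update events listed above.
    NNU_NOP = auto()
    NNU_MKDIR = auto()
    NNU_NEWFILE = auto()
    NNU_CLSFILE = auto()
    NNU_MVRM = auto()


class SlaveNamespace:
    """Toy slave state that replays events sent by the master (hypothetical)."""

    def __init__(self):
        self.dirs = set()
        self.open_files = set()
        self.closed_files = set()

    def apply(self, event, payload=None):
        if event is UpdateEvent.NNU_NOP:
            return  # nothing to do
        if event is UpdateEvent.NNU_MKDIR:
            self.dirs.add(payload)
        elif event is UpdateEvent.NNU_NEWFILE:
            self.open_files.add(payload)
        elif event is UpdateEvent.NNU_CLSFILE:
            self.open_files.discard(payload)
            self.closed_files.add(payload)
        elif event is UpdateEvent.NNU_MVRM:
            # Assumed payload: (src, dst); dst is None for a removal.
            src, dst = payload
            if src in self.closed_files:
                self.closed_files.discard(src)
                if dst is not None:
                    self.closed_files.add(dst)
```

Replaying the same ordered event stream on every slave is what keeps each slave's namespace identical to the master's.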

Page 13: 20100130 hadoop apache

Performance and Other Issues

The overhead of NameNode synchronization
  For typical file I/O and MapReduce jobs (sort, wordcount)
    The NNC system reaches 95% of the performance of Hadoop without NNC
  For metadata write-only operations (parallel touchz or mkdir)
    The NNC system reaches 15% of the performance of Hadoop without NNC
Performance gain from multiple NameNodes in read-only operations
  Not observed so far, unfortunately
Other design issues
  Why synchronize from the master to the slaves directly, without an additional delivery node?
    That may introduce another SPOF and make the problem more complex.
  Why not use ZooKeeper for failover?
    Linux-HA works well, and we are also evaluating whether to change to ZK. Any suggestions?

Page 14: 20100130 hadoop apache

Q & A