HDFS 2015: Past, Present, and Future
Akira Ajisaka, NTT DATA Corporation
Apache: Big Data Europe 2015, 9/30/2015
Copyright © 2015 NTT DATA Corporation
Self introduction
Akira Ajisaka (NTT DATA)
Apache Hadoop Committer
130+ commits in 2015
Working on usability
80+ documentation patches
"Open-Source Professional Services" team
Has deployed and supported 10k+ nodes of Hadoop clusters overall for 7 years
Together with NTT, ranked 6th in the world in contributions to Apache Hadoop [1]
[1] The Activities of Apache Hadoop Community 2014 http://ajisakaa.blogspot.com/2015/02/the-activities-of-apache-hadoop.html
About
Similar to "YARN 2015" presentation by @tshooter
HDFS is developed faster than YARN
Need a summary of HDFS new features
[Figure: Resolved issues in 2015 (cumulative), HDFS vs. YARN, 1-Jan-15 to 1-Sep-15]
Agenda
Past
Present
Future
Past
Past releases
2.X is the release branch
1.X and 0.23.X are no longer maintained
[Figure: release timeline, 2009 to 2015]
branch-1 (branch-0.20): 0.20.1, 0.20.205, 1.0.0, 1.1.0, 1.2.1 (stable)
0.21.0, 0.22.0
0.23.0, 0.23.11 (final)
branch-2: 2.0.0-alpha, 2.1.0-beta, 2.2.0 (GA), 2.3.0, 2.4.0, 2.5.0, 2.6.0, 2.7.0
trunk
Figure annotations: New append, Security, NameNode Federation, YARN, NameNode HA
Hadoop 2.2 (2013-10-13)
NameNode High-Availability
No Single Point of Failure
Federation
Multiple NameNodes, multiple namespaces
Improve scalability
Snapshots
Read only point-in-time copy (Copy on Write)
NFSv3 mount
Hadoop 2.3 (2014-02-20)
Heterogeneous Storages (Phase 1)
In-memory caching
Introduce memory-locality
Make efficient use of memory in DNs
[Figure: DFSClient, NameNode, and a DataNode with DISK and Memory]
1. DFSClient asks NN to cache a file
2. NN asks DN to cache blocks
3. If cached locally, DFSClient reads directly from memory and skips checksum calculation
Hadoop 2.4 (2014-04-07)
Rolling Upgrades
No need to wait for hours
ACLs
More fine-grained permissions
Similar to POSIX ACL
-rw-rw-r-- 3 tester hadoop 129 2015-09-15 12:00 /user/tester/test.txt
$ hdfs dfs -setfacl -m group:hive:rw- /user/tester/test.txt
Grants read and write permission to the hive group
Hadoop 2.5 (2014-08-11)
Extended Attributes (XAttrs)
Similar to extended attributes in Linux
Currently used by transparent encryption
-rw-r--r-- 3 tester hadoop 129 2015-09-15 12:00 /user/tester/test.txt
Set XAttrs
$ hdfs dfs -setfattr -n user.locale -v jp /user/tester/test.txt
$ hdfs dfs -setfattr -n user.city -v tokyo /user/tester/test.txt
Get XAttrs
$ hdfs dfs -getfattr -d /user/tester/test.txt
# file: /user/tester/test.txt
user.locale="jp"
user.city="tokyo"
Hadoop 2.6 (2014-11-18)
Hot swap volumes
Recover from disk failures w/o stopping DNs
Integrate Apache HTrace (incubating)
Trace RPCs inside HDFS
Finding bottlenecks becomes easier
[Figure: trace timeline across node 1, node 2, and node 3, connected by RPCs]
Span A (node 1): trace id 12345, parent: root
Span B (node 2): trace id 12345, parent: A
Span C, Span D (node 3)
Easy to find parent-child relations
Hadoop 2.6 (2014-11-18) (cont'd)
Heterogeneous Storages (Phase 2)
Archival Storage
Memory as storage tier
Transparent Encryption
Heterogeneous Storages
Problem
SSD is getting cheaper
Want to store hot data in SSD to achieve higher throughput
Solution: Introduce storage type and block placement policy
Storage: DISK, SSD, ARCHIVE, ...
Policy: One_SSD, HOT, WARM, COLD, ...
Example: A -> One_SSD, B -> HOT
[Figure: DN1, DN2, and DN3, each with one SSD and three DISKs, holding the replicas of A and B]
Hadoop 2.6
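As a rough illustration of how a policy name translates into a storage type for each replica, the sketch below encodes the replica layouts listed in the Apache Hadoop Archival Storage documentation (HOT: all DISK; WARM: one DISK, the rest ARCHIVE; COLD: all ARCHIVE; One_SSD: one SSD, the rest DISK; All_SSD: all SSD). The names `POLICIES` and `storage_types_for` are made up for this example and are not HDFS code. Fallback storage types are ignored.

```python
# Illustrative sketch only: maps a storage policy name to the storage type
# chosen for each replica. Fallback types (used when the preferred storage
# is unavailable) are deliberately ignored here.

# policy name -> (types for the first replicas, type for all remaining replicas)
POLICIES = {
    "HOT": ([], "DISK"),
    "WARM": (["DISK"], "ARCHIVE"),
    "COLD": ([], "ARCHIVE"),
    "ONE_SSD": (["SSD"], "DISK"),
    "ALL_SSD": ([], "SSD"),
}


def storage_types_for(policy, replication=3):
    """Return a list with one storage type per replica."""
    first, rest = POLICIES[policy.upper()]
    chosen = first[:replication]  # slicing copies, so POLICIES is not mutated
    chosen += [rest] * (replication - len(chosen))
    return chosen


if __name__ == "__main__":
    # Example from the slide: A -> One_SSD, B -> HOT
    print("A:", storage_types_for("One_SSD"))
    print("B:", storage_types_for("HOT"))
```

With the default replication of 3, file A gets one replica on SSD and two on DISK, while file B keeps all three replicas on DISK.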
Heterogeneous Storages: How to use
Configure HDFS to recognize storage type for each disk
<property>
<name>dfs.datanode.data.dir</name>
<value>[SSD]file:///data/ssd,[DISK]file:///data/hdd</value>
</property>
Set block placement policy to HDFS path
$ hdfs storagepolicies -setStoragePolicy -path <path> -policy <policy>
The policy can be changed after the data has been written
Mover will move blocks to satisfy the policy considering rack awareness
Hadoop 2.6
Archival Storage
DISK or ARCHIVE?
ARCHIVE is for cold data
eBay reduces cost/GB by 5x [1]
Use low-spec DNs for ARCHIVE
No need to split cluster!
                 Regular Node   Archival Node
Drives           12 HDDs        60 HDDs
CPU              32 Cores       4 Cores
Memory           128GB          64GB
Run NodeManager  Yes            No
[1] Reduce Storage Costs by 5x Using The New HDFS Tiered Storage Feature http://www.slideshare.net/Hadoop_Summit/reduce-storage-costs-by-5x-using-the-new-hdfs-tiered-storage-feature
Hadoop 2.6
Transparent Encryption
Problem
Cannot guard data from OS-level attacks
Solution
Provide end-to-end encryption
Encrypt/decrypt data transparently
No need to rewrite user application
Hadoop 2.6
[Figure: Client and DataNode. DataTransferProtocol can be encrypted, but the data on the DataNode's DISK is NOT encrypted!]
Transparent Encryption: How to encrypt data
DEK (Data Encryption Key)
A unique key for each file in EZ (Encryption Zone)
Stored in an XAttr of the file, encrypted (EDEK)
Key Management Server
• Proxy to underlying key provider
• ACLs on per key basis
• Bundled with Hadoop package
[Figure: Client, NameNode, Key Management Server, and DataNode]
1. Client creates a file in EZ
2. NameNode gets an EDEK from the Key Management Server
3. NameNode stores the EDEK in metadata
4. NameNode returns the EDEK to the client
5. Client calls the Key Management Server to decrypt the EDEK to the DEK
6. Client writes encrypted data to the DN using the DEK
Hadoop 2.6
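The six steps above follow an envelope-encryption pattern, which the toy sketch below walks through. It is not Hadoop code. `ToyKMS` and `toy_cipher` are hypothetical names, and the cipher is a SHA-256 based XOR keystream standing in for AES, so it is not secure. It only shows which party holds which key.

```python
import hashlib
import os


def toy_cipher(key: bytes, data: bytes) -> bytes:
    """XOR data with a SHA-256 counter keystream. Stand-in for AES; NOT secure.

    Applying it twice with the same key returns the original data.
    """
    stream = bytearray()
    counter = 0
    while len(stream) < len(data):
        stream += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(a ^ b for a, b in zip(data, stream))


class ToyKMS:
    """Hypothetical key server that holds the encryption zone (EZ) key."""

    def __init__(self):
        self._ez_key = os.urandom(32)  # never leaves the KMS

    def generate_edek(self) -> bytes:
        dek = os.urandom(32)                  # fresh DEK for each file
        return toy_cipher(self._ez_key, dek)  # only the EDEK is handed out

    def decrypt_edek(self, edek: bytes) -> bytes:
        return toy_cipher(self._ez_key, edek)


kms = ToyKMS()

# Steps 1-3: on file creation the NameNode obtains an EDEK and stores it in metadata.
namenode_metadata = {"/ez/test.txt": kms.generate_edek()}

# Steps 4-5: the client receives the EDEK and asks the KMS to decrypt it.
edek = namenode_metadata["/ez/test.txt"]
dek = kms.decrypt_edek(edek)

# Step 6: the client encrypts with the DEK; the DataNode only stores ciphertext.
plaintext = b"hello encryption zone"
datanode_block = toy_cipher(dek, plaintext)

# Reading reverses the process with the same DEK.
recovered = toy_cipher(dek, datanode_block)
print(recovered == plaintext)
```

In this sketch the NameNode and the DataNode never hold the plaintext DEK, which is the property the slides emphasize.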
Transparent Encryption: Very low overhead
Very low overhead
Simple benchmark with 3 slaves (m3.xlarge, 4 core Xeon E5-2670 v2)
Use AES-NI
Known issue
Encryption is sometimes done incorrectly (HADOOP-11343)
Recommend 2.7.1 or 2.6.1
Hadoop 2.6
              Encryption Off   Encryption On
1GB Teragen   17 sec           18 sec
1GB Terasort  47 sec           49 sec
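To put "very low overhead" into numbers, the short calculation below derives the relative slowdown from the two benchmark rows. The helper name `overhead_percent` is made up for this example.

```python
# Elapsed times in seconds from the benchmark table: (encryption off, encryption on)
results = {
    "1GB Teragen": (17, 18),
    "1GB Terasort": (47, 49),
}


def overhead_percent(off_sec, on_sec):
    """Relative slowdown caused by encryption, in percent."""
    return (on_sec - off_sec) / off_sec * 100


for name, (off_sec, on_sec) in results.items():
    print(f"{name}: {overhead_percent(off_sec, on_sec):.1f}% slower with encryption")
```

Both runs come out at roughly 4 to 6 percent slower, based on this single small benchmark.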
Present
Hadoop 2.7 (2015-04-21)
Quota per storage type
Truncate API
Files with variable-length blocks
Web UI for NFS gateway
NNTop: top-like tool for NameNode
List top users for each operation
Exposed via metrics
fsck -blockId option
Print the file which the blockId belongs to
INotify
INotify for HDFS
Problem
Some components do caching
Hive caches path names
Impala caches block locations
When to invalidate cache?
Solution
Introduce a tool similar to Linux inotify
Client can monitor the events without parsing NN log or edits
Hadoop 2.7
INotify for HDFS: Technical Approach
Client polls NameNode periodically
Not push model
Client caches the highest event number
1. Client: poll any events after #XX
2. NameNode: return events after #XX
Known issue
Truncate is not notified (HDFS-8742)
Fixed in 2.8.0
Hadoop 2.7
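The polling exchange above can be sketched with two small classes. This is a simulation under stated assumptions, not the HDFS client API: `ToyNameNode` and `PollingClient` are hypothetical, and events are plain tuples.

```python
class ToyNameNode:
    """Hypothetical stand-in for the NameNode's ordered log of events."""

    def __init__(self):
        self._events = []  # list of (event_id, description); ids increase by 1

    def log(self, description):
        self._events.append((len(self._events) + 1, description))

    def events_after(self, last_id):
        """Return every event whose id is greater than last_id."""
        return [e for e in self._events if e[0] > last_id]


class PollingClient:
    def __init__(self, namenode):
        self._namenode = namenode
        self.last_seen_id = 0  # highest event number seen so far

    def poll(self):
        events = self._namenode.events_after(self.last_seen_id)
        if events:
            self.last_seen_id = events[-1][0]
        return events


nn = ToyNameNode()
client = PollingClient(nn)

nn.log("CREATE /data/a")
nn.log("RENAME /data/a -> /data/b")
first = client.poll()    # both events

nn.log("UNLINK /data/b")
second = client.poll()   # only the new event
third = client.poll()    # nothing new

print(first, second, third, sep="\n")
```

Because the client remembers only the highest event number, a cache such as Hive's or Impala's can resume after a pause without re-reading events it has already applied.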
Future
Many features are being developed
2.8 (not released)
Support OAuth2 in WebHDFS
RPC Congestion control
Feature branches
Erasure Coding (HDFS-7285)
Ozone: Object store (HDFS-7240)
BlockManager Scalability Improvements (HDFS-7836)
HTTP/2 support for DataTransferProtocol (HDFS-7966)
Implement an async pure C++ HDFS client (HDFS-8707)
RPC Congestion Control
Problem
NameNode RPC queue is FIFO
DDoS can kill entire cluster
Solution
Fair scheduling for RPC queue (2.6.0)
Retriable exception with exponential backoff (2.8.0)
Enable by default in 2.8
Don't do this!
while (true) {
  dfs.exists("/data");  // floods the NameNode with RPCs
}
Hadoop 2.8
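A minimal sketch of what "retriable exception with exponential backoff" means from the caller's side, assuming a server that signals overload with an exception. `RetriableError`, `backoff_delays`, and `call_with_backoff` are hypothetical names for this example, not Hadoop classes, and the delay values are arbitrary.

```python
import random
import time


class RetriableError(Exception):
    """Hypothetical signal from an overloaded server: 'try again later'."""


def backoff_delays(base=0.1, cap=10.0, attempts=5):
    """Delays that double on every retry, capped at `cap` seconds."""
    return [min(cap, base * 2 ** n) for n in range(attempts)]


def call_with_backoff(rpc, base=0.1, cap=10.0, attempts=5, sleep=None):
    """Call rpc(); on RetriableError wait with exponential backoff and retry."""
    sleep = sleep or time.sleep
    for delay in backoff_delays(base, cap, attempts):
        try:
            return rpc()
        except RetriableError:
            # Random jitter keeps many clients from retrying at the same instant.
            sleep(delay * random.uniform(0.5, 1.0))
    return rpc()  # final attempt; a failure here propagates to the caller


# Demo: a call that is rejected twice before it succeeds.
calls = {"count": 0}


def flaky_exists():
    calls["count"] += 1
    if calls["count"] <= 2:
        raise RetriableError("server too busy")
    return True


waits = []  # record the waits instead of really sleeping
result = call_with_backoff(flaky_exists, sleep=waits.append)
print(result, calls["count"], len(waits))
```

Unlike the tight loop on the slide, each rejected call here waits longer than the previous one before retrying, so an overloaded server gets room to drain its queue.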
Erasure Coding
Problem
Reduce costs of storage
Blocks are replicated to 3 DNs
3x storage overhead is costly
Solution
Use Erasure Code
             3-replication   (6,3)-Reed-Solomon
Tolerates    2 failures      3 failures
Disk Usage   3x              1.5x
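The figures in the comparison follow from simple arithmetic, sketched below. The function names are made up for this example. The overhead is total stored blocks divided by data blocks, and the tolerated failures equal the number of redundant blocks.

```python
def replication(copies):
    """(storage overhead, tolerated failures) for n-way replication."""
    return float(copies), copies - 1


def reed_solomon(data_blocks, parity_blocks):
    """(storage overhead, tolerated failures) for a (data, parity) Reed-Solomon code."""
    overhead = (data_blocks + parity_blocks) / data_blocks
    return overhead, parity_blocks


print("3-replication:", replication(3))
print("(6,3)-Reed-Solomon:", reed_solomon(6, 3))
```

For (6,3)-Reed-Solomon this gives (6 + 3) / 6 = 1.5x disk usage while tolerating the loss of any 3 of the 9 blocks.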
Erasure Coding: Write files using (6,3)-Reed-Solomon
Write data to 9 DNs in parallel
[Figure: ECClient splits incoming data into 6 data blocks (DN1 to DN6) and 3 parity blocks (DN7 to DN9)]
Erasure Coding: Read files
Read data from 6 DNs in parallel
[Figure: ECClient and DN1 to DN9]
Erasure Coding: Read files when DN fails
Read data from (arbitrary) 6 DNs in parallel
[Figure: ECClient and DN1 to DN9, with one DN failed (×)]
Erasure Coding: Current status
Suitable for cold data
No data locality
Very low cost/GB with archival storage
Now preparing for merge
Follow on work
Intel ISA-L support for faster encoding
Support append/truncate/hflush/hsync
More encoding schemas
Pipeline error handling
Support contiguous layout (HDFS EC Phase 2)
Summary
Many features are still in development
I cannot predict when each feature will be available
If you want a feature, I recommend contributing to it to speed up its development
There are many ways to contribute
Creating/Testing/Reviewing patches
Reporting bugs
Writing documents
Discussing architecture design
https://wiki.apache.org/hadoop/HowToContribute
References
Apache Hadoop Docs: http://hadoop.apache.org/docs/current/
In-memory caching (HDFS-4949)
In-memory Caching in HDFS: Lower Latency, Same Great Taste: http://www.slideshare.net/Hadoop_Summit/inmemory-caching-in-hdfs-lower-latency-same-great-taste-33921794
Heterogeneous Storages (HDFS-5682)
Reduce Storage Costs by 5x Using The New HDFS Tiered Storage Feature: http://www.slideshare.net/Hadoop_Summit/reduce-storage-costs-by-5x-using-the-new-hdfs-tiered-storage-feature
Transparent Encryption (HDFS-6134)
Transparent Encryption in HDFS: http://www.slideshare.net/Hadoop_Summit/transparent-encryption-in-hdfs
INotify (HDFS-6634)
Keep Me in the Loop: Introducing HDFS Inotify: http://www.slideshare.net/Hadoop_Summit/keep-me-in-the-loop-inotify-in-hdfs
RPC congestion control (HADOOP-9640, HADOOP-10597, HDFS-8820)
Improving HDFS Availability with Hadoop RPC Quality of Service: http://www.slideshare.net/MingMa4/hadoop-rpcqoshadoopsummit2015
Erasure Coding (HDFS-7285)
HDFS Erasure Code Storage - Same Reliability at Better Storage Efficiency: http://www.slideshare.net/Hadoop_Summit/hdfs-erasure-code-storage-same-reliability-at-better-storage-efficiency