Transcript
Hops – Distributed MetaData for Hadoop
Jim Dowling Associate Prof @ KTH
Senior Researcher @ SICS, CEO @ Hops AB
BDOOP Meetup, Hadoop Summit, Dublin, 12th April 2016
www.hops.io @hopshadoop
MetaData Services in Hadoop
Metadata Totem Poles in Hadoop
Eventual Consistency
With Many Hadoop Clusters
[Diagram: each cluster, Cluster 1 … Cluster N, runs its own MetaDataService, all feeding a MetaData Service aggregator]
MetaData consistency protocols have O(N) operational complexity.
Case Study: Access Control as a MetaData Service
Access Control in Relational Databases
# Multi-tenancy for alice and bob on db1 and db2
grant all privileges on db1.* to 'alice'@'%';
grant all privileges on db2.* to 'bob'@'%';
# More fine-grained privileges
grant select on db2.sensitiveTable to 'alice'@'192.168.1.2';
Databases ensure the consistency of security and policies using foreign keys.
“drop table db2.sensitiveTable” => delete associated privileges
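The cascade behaviour can be sketched with SQLite (used here only because it ships with Python; the grant statements above are MySQL syntax, and the table names are illustrative):

```python
import sqlite3

# Sketch: a foreign key with ON DELETE CASCADE keeps a privileges
# table consistent with the objects it refers to.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite requires opting in
conn.execute("CREATE TABLE tables (name TEXT PRIMARY KEY)")
conn.execute("""CREATE TABLE privileges (
    user TEXT,
    table_name TEXT,
    FOREIGN KEY (table_name) REFERENCES tables(name) ON DELETE CASCADE)""")
conn.execute("INSERT INTO tables VALUES ('sensitiveTable')")
conn.execute("INSERT INTO privileges VALUES ('alice', 'sensitiveTable')")

# Dropping the table's row cascades: alice's privilege is deleted too.
conn.execute("DELETE FROM tables WHERE name = 'sensitiveTable'")
remaining = conn.execute("SELECT COUNT(*) FROM privileges").fetchone()[0]
print(remaining)  # 0
```

No application code had to remember to clean up the privileges; the schema did it.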
Access Control in Hadoop: Apache Sentry
How do you ensure the consistency of the policies and the data?
[Mujumdar’15]
Policy Editor for Sentry
Administrators administer privileges for users
Talk Overview
• Our Story of Distributed Metadata for Hadoop
• Metadata at work in Hops: Multi-tenancy
• Metadata at work in Hops: HopsWorks
Bill Gates’ biggest product regret?*
Windows Future Storage (WinFS)*
*http://www.zdnet.com/article/bill-gates-biggest-microsoft-product-regret-winfs/
HDFS v2
[Diagram: HDFS Client, ActiveNameNode and StandbyNameNode, Journal Nodes, Zookeeper, SnapshotNode, DataNodes]
• Asynchronous replication of the NN Log
• Agreement on the Active NameNode
• Faster recovery - cut the NN Log
Max Pause Times for NameNode Heap Sizes*
[Chart: max pause times (ms, log scale from 10 to 10,000) against JVM heap sizes of 25, 50, 75, and 100 GB; unoptimized pause times grow much faster than optimized ones]
*OpenJDK or Oracle JVM
NameNode and Decreasing Memory Costs
[Chart: size (GB, 0-1000) against years 2015-2019; the size of RAM in a COTS $7,000 rack server grows well past the projected max NameNode JVM heap size]
Externalizing the NameNode State
• Problem: the NameNode cannot scale up to exploit falling RAM prices.
• Solution: move the metadata off the JVM heap.
• Move it where? An in-memory storage system that can be efficiently queried and managed, preferably open-source.
• MySQL Cluster (NDB)
HopsFS Architecture
[Diagram: HDFS and HopsFS Clients behind a Load Balancer; stateless NameNodes, one elected Leader, store their metadata in NDB; DataNodes]
Pluggable DBs: Data Abstraction Layer (DAL)
• NameNode (Apache v2) → DAL API (Apache v2) → NDB-DAL-Impl (GPL v2) or another DB (other license)
• Shipped as hops-2.5.0.jar and dal-ndb-2.5.0-7.4.7.jar
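The license split works because the NameNode codes only against an abstract API, and each database ships its own implementation. A minimal sketch of such a pluggable layer (all class and method names here are hypothetical, not the actual DAL API):

```python
from abc import ABC, abstractmethod

# Hypothetical sketch of a Data Abstraction Layer: the NameNode talks
# to an abstract interface; a database-specific jar implements it.
class InodeDataAccess(ABC):
    @abstractmethod
    def add(self, inode: dict) -> None: ...
    @abstractmethod
    def find_by_name_and_parent(self, name: str, parent_id: int): ...

class InMemoryInodeDataAccess(InodeDataAccess):
    """Stand-in for dal-ndb; a real impl would talk to MySQL Cluster."""
    def __init__(self):
        self._rows = {}
    def add(self, inode):
        self._rows[(inode["name"], inode["parent_id"])] = inode
    def find_by_name_and_parent(self, name, parent_id):
        return self._rows.get((name, parent_id))

dal: InodeDataAccess = InMemoryInodeDataAccess()
dal.add({"name": "user", "parent_id": 0, "id": 1})
print(dal.find_by_name_and_parent("user", 0)["id"])  # 1
```

Swapping databases then means swapping the implementation jar, not touching the Apache-licensed NameNode code.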
The Global Lock in the NameNode
Apache NameNode Internals
Client ops: mkdir, getBlockLocations, createFile, …
[Diagram: a Listener (NIO thread) accepts connections; Reader threads (ipc.server.read.threadpool.size, default 1) feed a Call Queue; Handler threads (dfs.namenode.service.handler.count, default 10) execute operations against the metadata and in-memory EditLog under the global FSNameSystem lock; the EditLog buffer is flushed (with ackIds) to EditLog files on the Journal Nodes; a Responder (NIO thread) returns completed RPCs to the Client]
HopsFS NameNode Internals
Client ops: mkdir, getBlockLocations, createFile, …
[Diagram: the same RPC pipeline (Listener, Readers, Call Queue, Handlers, Responder, identical thread-pool settings), but the metadata lives in NDB tables (inodes, block_infos, replicas, leases, …) accessed through the DAL API and DAL-Impl - the HARD PART - instead of under a global in-memory lock]
Consistency: Transactions & Implicit Locking
• Serializable FS ops using implicit locking of subtrees.
[Hakimzadeh, Peiro, Dowling, ”Scaling HDFS with a Strongly Consistent Relational Model for Metadata”, DAIS 2014]
Preventing Deadlock and Starvation
• Acquire FS locks in an agreed order, following the FS hierarchy.
• Block-level operations follow the same agreed order.
• No cycles => freedom from deadlock.
• Pessimistic concurrency control ensures progress.
[Diagram: a Client mv of /user/jim/myFile, a Client read, and a DataNode block_report all acquire their locks at the NameNode in the same agreed order]
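The agreed-order idea can be sketched as follows (paths and helper names are illustrative, not HopsFS code): every operation locks a path's ancestors root-first, so two operations can never wait on each other in a cycle.

```python
import threading

def ancestors(path):
    # "/user/jim/myFile" -> ["/", "/user", "/user/jim", "/user/jim/myFile"]
    parts = [p for p in path.split("/") if p]
    out, cur = ["/"], ""
    for p in parts:
        cur += "/" + p
        out.append(cur)
    return out

# One lock per path component; every operation acquires root-first.
locks = {p: threading.Lock() for p in ancestors("/user/jim/myFile")}

def lock_in_order(path):
    """Acquire locks on all ancestors in the single agreed order."""
    held = []
    for a in ancestors(path):
        locks[a].acquire()
        held.append(a)
    return held

held = lock_in_order("/user/jim/myFile")
print(held)  # ['/', '/user', '/user/jim', '/user/jim/myFile']
for a in reversed(held):
    locks[a].release()
```

Because all threads request locks in the identical order, the wait-for graph stays acyclic, which is exactly the deadlock-freedom argument on the slide.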
Per-Transaction Cache
• Reusing the HDFS codebase resulted in too many round-trips to the database per transaction.
• Cache intermediate transaction results at the NameNodes.
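A minimal sketch of the idea (the class and variable names are illustrative): rows read once inside a transaction are kept in memory, so repeated lookups of the same inode do not pay another database round-trip.

```python
# Sketch: cache rows read during a transaction so repeated lookups
# hit memory instead of issuing another database round-trip.
class TxCache:
    def __init__(self, db_read):
        self._db_read = db_read      # function doing the real round-trip
        self._cache = {}
        self.round_trips = 0

    def read(self, key):
        if key not in self._cache:
            self.round_trips += 1    # only misses touch the database
            self._cache[key] = self._db_read(key)
        return self._cache[key]

cache = TxCache(lambda key: {"inode": key})
for _ in range(3):
    cache.read("/user/jim")          # first call misses, the rest hit
print(cache.round_trips)  # 1
```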
Sometimes, Transactions Just Ain’t Enough
• Large subtree operations (delete, mv, set-quota) can’t always be executed in a single transaction.
• A 4-phase protocol provides:
- Isolation and consistency
- Aggressive batching
- Transparent failure handling
- Failed ops retried on a new NN
- Lease timeout for failed clients
Leader Election using NDB
• A Leader coordinates replication and lease management.
• NDB serves as shared memory for Leader Election among the NameNodes.
• No more Zookeeper, yay!
[Niazi, Berthou, Ismail, Dowling, ”Leader Election in a NewSQL Database”, DAIS 2015]
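A toy model of shared-memory election (heavily simplified; the actual NDB protocol is in the DAIS 2015 paper, and all names here are made up): each NameNode heartbeats a counter into a shared table, and the live node with the smallest id is the leader.

```python
# Toy sketch of leader election over shared storage: nodes heartbeat
# into a shared table; the smallest-id node that is still advancing
# its counter is elected leader.
class SharedTable:
    def __init__(self):
        self.heartbeats = {}                 # node_id -> counter

    def beat(self, node_id):
        self.heartbeats[node_id] = self.heartbeats.get(node_id, 0) + 1

    def leader(self, previous):
        # A node is considered dead if its counter stopped advancing
        # since the last observation.
        live = [n for n, c in self.heartbeats.items()
                if c > previous.get(n, -1)]
        return min(live) if live else None

table = SharedTable()
prev = {}
for node in (1, 2, 3):
    table.beat(node)
print(table.leader(prev))  # 1: smallest live node id

prev = dict(table.heartbeats)
table.beat(2)
table.beat(3)                                # node 1 stops heart-beating
print(table.leader(prev))  # 2: node 1 is declared dead
```

The database's transactions make the reads and writes on the shared row atomic, which is what lets it replace Zookeeper here.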
Path Component Caching
• A path of length N needs O(N) round-trips to resolve.
• With our cache, a hit costs O(1) round-trips.
[Diagram: without the cache, resolving /user/jim/myFile issues getInode(0, “user”), getInode(1, “jim”), and getInode(2, “myFile”) against NDB; with the cache, a single validateInodes([(0, “user”), (1, ”jim”), (2, ”myFile”)]) call suffices]
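The before/after difference can be sketched like this (class and table contents are illustrative): a cold lookup walks the path one component, and one round-trip, at a time; a warm lookup validates all cached inodes in a single batched round-trip.

```python
# Sketch: path resolution with a (parent_id, name) -> inode_id cache.
# A miss costs one round-trip per component; a hit costs one batched
# validation round-trip, regardless of path length.
class PathCache:
    def __init__(self, db):
        self.db = db                  # simulated table: (parent, name) -> id
        self.cache = {}
        self.round_trips = 0

    def resolve(self, path):
        parts = tuple(p for p in path.split("/") if p)
        if parts in self.cache:       # hit: validate all inodes at once
            self.round_trips += 1
            return self.cache[parts]
        parent, ids = 0, []           # miss: one round-trip per component
        for name in parts:
            self.round_trips += 1
            parent = self.db[(parent, name)]
            ids.append(parent)
        self.cache[parts] = ids
        return ids

db = {(0, "user"): 1, (1, "jim"): 2, (2, "myFile"): 3}
pc = PathCache(db)
pc.resolve("/user/jim/myFile")        # cold: 3 round-trips
pc.resolve("/user/jim/myFile")        # warm: 1 round-trip
print(pc.round_trips)  # 4
```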
Scalable Block Reporting
• On 100PB+ clusters, internal maintenance protocol traffic makes up much of the network traffic.
• Block reporting: the Leader load-balances reports across NameNodes, which work-steal when exiting safe-mode.
[Diagram: DataNodes send block reports; NameNodes persist Blocks and SafeBlocks in NDB; the Leader distributes the load and NameNodes work-steal]
HopsFS Performance
HopsFS Metadata Scaleout
Assuming 256MB Block Size, 100 GB JVM Heap for Apache Hadoop
HopsFS Throughput (Spotify Workload)
Experiments performed on AWS EC2 with enhanced networking and C3.8xLarge instances
Hops-YARN
YARN Architecture
[Diagram: YARN Client, ResourceMgr with StandbyResourceMgr, Zookeeper Nodes, NodeManagers]
1. Master-slave replication of the RM state
2. Agreement on the active ResourceMgr
ResourceManager - Monolithic but Modular
[Diagram: YARN Clients and App Masters talk to the ClientService, AdminService, and ApplicationMasterService; NodeManagers talk to the ResourceTrackerService; the Scheduler and Security modules operate on the Cluster State; in Hops, the HopsResourceTracker and HopsScheduler persist the Cluster State in NDB]
Hops-YARN Architecture
[Diagram: the YARN Client talks to the Scheduler ResourceMgr; NodeManagers report to dedicated Resource Trackers; all state is stored in NDB; Leader Election replaces a failed Scheduler]
What do we do with all this Metadata?
Hops MetaData Tree
[Diagram: NDB stores the metadata of HopsFS and HopsYARN alongside Hops Users, Projects, DataSets, Provenance, Search, the History Service, and Extended Metadata]
Problem: Need a Cluster per Sensitive DataSet
• Alice has access to both the NSA DataSet and the User DataSet.
• Alice can copy/cross-link between the data sets.
• Alice has only one Kerberos identity; Dynamic Roles are not supported in Hadoop.
Solution: Project-Specific UserIDs
[Diagram: Alice is a member of Project NSA as NSA__Alice and a member of Project Users as Users__Alice; HDFS enforces access control]
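The scheme can be sketched as follows (the `__` naming follows the slide; the group names and helper functions are hypothetical): one portal user maps to a distinct filesystem user per project, so ordinary HDFS group checks prevent copying across projects.

```python
# Sketch: project-specific identities. One portal user becomes a
# distinct HDFS user per project, so group-based permission checks
# isolate sensitive projects from each other.
def project_user(project, user):
    return f"{project}__{user}"

def can_access(hdfs_user, dataset_group, memberships):
    # HDFS-style check: access iff the acting user is in the group.
    return dataset_group in memberships.get(hdfs_user, set())

memberships = {
    "NSA__Alice": {"NSA__ds"},       # hypothetical DataSet group names
    "Users__Alice": {"Users__ds"},
}
print(can_access(project_user("NSA", "Alice"), "NSA__ds", memberships))    # True
print(can_access(project_user("Users", "Alice"), "NSA__ds", memberships))  # False
```

Even though both identities belong to the same person, the identity acting inside Project Users simply has no rights on the NSA DataSet.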
Sharing DataSets with HopsWorks
[Diagram: a Project owns its DataSet; to share it, members of Project NSA are added to the DataSet’s group; Alice remains NSA__Alice in Project NSA and Users__Alice in Project Users]
HopsWorks enforces Dynamic Roles.
[Diagram: Alice@gmail.com authenticates to HopsWorks; per project she acts as NSA__Alice or Users__Alice, and HopsWorks uses Secure Impersonation towards HopsFS and HopsYARN]
User
• Authentication Providers:
- JDBC Realm
- 2-Factor Authentication
- LDAP
Project
• Members
- Roles: Owner, Data Scientist
• DataSets
- Home project
- Can be shared
Project Roles
• Data Owner privileges:
- Import/export data
- Manage membership
- Share DataSets
• Data Scientist privileges:
- Write code
- Run code
- Request access to DataSets
We delegate administration of privileges to users.
Sharing DataSets between Projects
The same as Sharing Folders in Dropbox
Delegate Access Control to HDFS
• HDFS enforces access control:
- A UserID per Project
- A GroupID per Project and DataSet
• Metadata integrity using foreign keys:
- Removing a project removes all its users, groups, and (optionally) DataSets.
How ACME Inc. Handles Free-Text Search
• In theory: a unified search and update API over HDFS.
• In practice: inconsistent metadata.
Free-Text Search with Consistent Metadata
[Diagram: a MetaData Designer defines extended metadata; MetaData Entries are stored in the Distributed Database and mirrored to ElasticSearch for free-text search]
The Distributed Database is the Single Source of Truth. Foreign keys ensure the integrity of the Metadata.
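The single-source-of-truth pattern can be sketched like this (class names invented for illustration; the real pipeline feeds ElasticSearch from database change events): the search index is a derived view that follows the database, never the other way around.

```python
# Sketch: the database is authoritative; a search index (stand-in for
# ElasticSearch) is updated from the database's change events.
class Database:
    def __init__(self):
        self.rows = {}
        self.listeners = []              # change-event subscribers

    def put(self, key, doc):
        self.rows[key] = doc
        for notify in self.listeners:
            notify("put", key, doc)

    def delete(self, key):
        self.rows.pop(key, None)
        for notify in self.listeners:
            notify("delete", key, None)

class SearchIndex:
    def __init__(self, db):
        self.docs = {}
        db.listeners.append(self.apply)  # subscribe to DB changes

    def apply(self, op, key, doc):
        if op == "put":
            self.docs[key] = doc
        else:
            self.docs.pop(key, None)

db = Database()
idx = SearchIndex(db)
db.put("/Projects/NSA/readme", {"text": "classified"})
db.delete("/Projects/NSA/readme")
print(len(idx.docs))  # 0: the index follows the database
```

Because deletes propagate too, the index can never claim a file exists that the database has already removed, which is the inconsistency the previous slide complains about.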
Global Search: Projects and DataSets
Project Search: Files, Directories
Design your own Extended Metadata
Analytics in HopsWorks
Batch Job Analytics
Interactive Analytics: Zeppelin
Other Features
• Audit logs
• Erasure coding replication
• Online upgrade of Hops (and NDB)
• Automated installation with Karamel
• Tinker-friendly - easy to extend the metadata!
Conclusions
• Hops is a next-generation distribution of Hadoop.
• HopsWorks is a frontend to Hops that supports true multi-tenancy, free-text search, interactive analytics with Zeppelin/Flink/Spark, and batch jobs.
• Looking for contributors/committers - pick-me-ups on GitHub.
www.hops.io
The Team
Active: Jim Dowling, Seif Haridi, Tor Björn Minde, Gautier Berthou, Salman Niazi, Mahmoud Ismail, Kamal Hakimzadeh, Ermias Gebremeskel, Theofilos Kakantousis, Johan Svedlund Nordström, Someya Sayeh, Vasileios Giannokostas, Antonios Kouzoupis, Misganu Dessalegn, Ahmad Al-Shishtawy, Ali Gholami.
Alumni: K. “Sri” Srijeyanthan, Steffen Grohsschmiedt, Alberto Lorente, Andre Moré, Stig Viaene, Hooman Peiro, Evangelos Savvidis, Jude D’Souza, Qi Qi, Gayana Chandrasekara, Nikolaos Stanogias, Daniel Bali, Ioannis Kerkinos, Peter Buechler, Pushparaj Motamari, Hamid Afzali, Wasif Malik, Lalith Suresh, Mariano Valles, Ying Lieu.
Hops [Hadoop For Humans]
Join us!http://github.com/hopshadoop