Developing and Deploying Apache Hadoop Security
Owen O'Malley – Hortonworks Co-founder
[email protected]
@owen_omalley
July 25, 2011
© Hortonworks Inc. 2011
Who am I
• An architect working on Hadoop full time since the beginning of the project (Jan '06)
−Primarily focused on MapReduce
• Tech-lead on adding security to Hadoop
• Co-founded Hortonworks this month
• Before Hadoop – Yahoo Search WebMap
• Before Yahoo – NASA, Sun
• PhD from UC Irvine
What is Hadoop?
• A framework for storing and processing big data on lots of commodity machines.
−Up to 4,500 machines in a cluster
−Up to 20 PB in a cluster
• Open Source Apache project
• High reliability done in software
−Automated failover for data and computation
• Implemented in Java
• Primary data analysis platform at Yahoo!
−40,000+ machines running Hadoop
−More than 1,000,000 jobs every month
Case Study: Yahoo! Front Page
• Personalized for each visitor (recommended links, news, interests, top searches)
• Result: twice the engagement
−+160% clicks vs. one size fits all
−+79% clicks vs. randomly selected
−+43% clicks vs. editor selected
Problem
• Yahoo! has more yahoos than clusters.
• Hundreds of yahoos using Hadoop each month
• 40,000 computers in ~20 Hadoop clusters
• Sharing requires isolation or trust.
• Different users need different data.
• Not all yahoos should have access to sensitive data
−financial data and PII
• In Hadoop 0.20, it is easy to impersonate another user.
−Workaround: segregate sensitive data onto separate clusters
Solution
• Prevent unauthorized HDFS access
−All HDFS clients must be authenticated.
−Including tasks running as part of MapReduce jobs
−And jobs submitted through Oozie.
• Users must also authenticate servers
−Otherwise fraudulent servers could steal credentials
• Integrate Hadoop with Kerberos (sketch below)
−Provides a well-tested, open source distributed authentication system.
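A minimal sketch of what turning Kerberos on looks like from Java client code, using the standard hadoop.security.authentication and hadoop.security.authorization keys; in a real cluster these are set in core-site.xml rather than in code:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.security.UserGroupInformation;

    public class SecureClient {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // "simple" is the default; "kerberos" turns on SASL/GSSAPI for RPC.
        conf.set("hadoop.security.authentication", "kerberos");
        // Also enable service-level authorization checks.
        conf.set("hadoop.security.authorization", "true");
        // Tell the security layer which authentication mode is in force.
        UserGroupInformation.setConfiguration(conf);
        System.out.println("Current user: "
            + UserGroupInformation.getCurrentUser());
      }
    }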
Requirements
• Security must be optional.
−Not all clusters are shared between users.
• Hadoop commands must not prompt for passwords
−Must have single sign-on.
−Otherwise trojan-horse versions are easy to write.
• Must support backwards compatibility
−HFTP must be secure, but allow reading from insecure clusters
Definitions
• Authentication – determining the user
−Hadoop 0.20 completely trusted the user
• The user passed their username and groups over the wire
−We need authentication on both RPC and the Web UI.
• Authorization – what can that user do?
−HDFS has had owners, groups, and permissions since 0.16.
−MapReduce had nothing in 0.20.
• Auditing – who did what?
−Available since 0.20
Authentication
• Changes low-level transport
• RPC authentication using SASL
−Kerberos (GSSAPI)
−Token (Digest-MD5)
−Simple
• Browser HTTP secured via plugin
• Tool HTTP (e.g. fsck) via SSL/Kerberos
Authorization
• HDFS
−Command line unchanged
−Web UI enforces authentication
• MapReduce added Access Control Lists (example below)
−Lists of users and groups that have access
−mapreduce.job.acl-view-job – view the job
−mapreduce.job.acl-modify-job – kill or modify the job
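As an illustration, setting the two ACL properties at submission time. The user and group names are made up; the value format is comma-separated users, a space, then comma-separated groups:

    import org.apache.hadoop.mapred.JobConf;

    public class AclExample {
      public static void main(String[] args) {
        JobConf job = new JobConf();
        // Users alice and bob, plus anyone in group "analysts", may view the job.
        job.set("mapreduce.job.acl-view-job", "alice,bob analysts");
        // Only members of group "ops" (no individual users) may kill or modify it.
        job.set("mapreduce.job.acl-modify-job", " ops");
      }
    }

The job owner always has access to their own job regardless of these lists.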
Auditing
• A critical part of security is an accurate method for determining who did what.
−Almost useless until you have strong authentication
• HDFS audit log tracks (toy parser below)
−Reading or writing of files
• MapReduce audit log tracks
−Launching jobs or modifying job properties
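For illustration only, a toy parser assuming the HDFS audit log's key=value layout (ugi, ip, cmd, src, dst, perm); the exact line format may differ by version:

    import java.util.HashMap;
    import java.util.Map;

    public class AuditLine {
      // Parse one "key=value key=value ..." audit entry into a map.
      public static Map<String, String> parse(String line) {
        Map<String, String> fields = new HashMap<String, String>();
        for (String token : line.split("\\s+")) {
          int eq = token.indexOf('=');
          if (eq > 0) {
            fields.put(token.substring(0, eq), token.substring(eq + 1));
          }
        }
        return fields;
      }

      public static void main(String[] args) {
        String line = "ugi=alice ip=/10.1.2.3 cmd=open src=/data/f dst=null perm=null";
        System.out.println(parse(line).get("cmd")); // prints "open"
      }
    }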
Kerberos and Single Sign-on
• Kerberos allows the user to sign in once
−Obtains a Ticket Granting Ticket (TGT)
• kinit – get a new Kerberos ticket
• klist – list your Kerberos tickets
• kdestroy – destroy your Kerberos tickets
• TGTs last for 10 hours, renewable for 7 days by default (services use keytabs instead – sketch below)
−Once you have a TGT, Hadoop commands just work
• hadoop fs -ls /
• hadoop jar wordcount.jar in-dir out-dir
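Interactive users run kinit; long-running services typically log in from a keytab file instead. A minimal sketch, with a made-up principal and keytab path:

    import org.apache.hadoop.security.UserGroupInformation;

    public class ServiceLogin {
      public static void main(String[] args) throws Exception {
        // Non-interactive Kerberos login for a service process.
        UserGroupInformation.loginUserFromKeytab(
            "myservice/[email protected]",            // service principal (example)
            "/etc/security/keytabs/myservice.keytab"); // keytab path (example)
        System.out.println("Logged in as: "
            + UserGroupInformation.getLoginUser().getUserName());
      }
    }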
API Changes
• Very minimal API changes
−Most applications work unchanged
−UserGroupInformation *completely* changed
• MapReduce added secret credentials (sketch below)
−Available from JobConf and JobContext
−Never displayed via the Web UI
• Automatically get tokens for HDFS
−Primary HDFS, File{In,Out}putFormat, and DistCp
−Can set mapreduce.job.hdfs-servers for additional NameNodes
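A sketch covering both points, with a made-up secret name and NameNode address; unlike the plain configuration, job credentials are never shown in the Web UI:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;

    public class CredentialExample {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Request delegation tokens from an extra NameNode at submission time.
        conf.set("mapreduce.job.hdfs-servers", "hdfs://nn2.example.com:8020");
        Job job = new Job(conf, "credential-example");
        // Ship a secret to the tasks via the job's credentials.
        job.getCredentials().addSecretKey(new Text("my.api.key"),
                                          "s3cr3t".getBytes("UTF-8"));
        // Inside a task, read it back with:
        //   context.getCredentials().getSecretKey(new Text("my.api.key"))
      }
    }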
MapReduce task-level security
• MapReduce tasks run as the submitting user.
−No more accidentally killing TaskTrackers!
−Implemented with a setuid C program.
• Task output logs aren’t globally visible.
• Task work directories aren’t globally visible.
• Distributed cache is split (example below)
−Public – shared between all users
−Private – shared between jobs of the same user
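For example (the path is made up), the API call is the same for both halves of the cache; the file's HDFS permissions decide which one it lands in:

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.filecache.DistributedCache;

    public class CacheExample {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // World-readable files go to the public cache and are shared by
        // all users; anything else goes to this user's private cache.
        DistributedCache.addCacheFile(new URI("/shared/dict.txt"), conf);
      }
    }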
Web UIs
• Hadoop relies on Web User Interfaces served from embedded Jetty.
−These need to be authenticated also…
• Web UI authentication is pluggable. (skeleton below)
−SPNEGO or static-user plug-ins are available
−Companies may need or want their own systems
• All servlets enforce permissions based on the authenticated user.
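A skeleton of what a company-specific plugin might look like, written as a standard servlet filter; the class and header names are made up, and all real credential checking is elided:

    import java.io.IOException;
    import javax.servlet.Filter;
    import javax.servlet.FilterChain;
    import javax.servlet.FilterConfig;
    import javax.servlet.ServletException;
    import javax.servlet.ServletRequest;
    import javax.servlet.ServletResponse;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;

    public class MyAuthFilter implements Filter {
      public void init(FilterConfig config) {}

      public void doFilter(ServletRequest req, ServletResponse resp,
                           FilterChain chain) throws IOException, ServletException {
        // Hypothetical header set by a corporate SSO proxy; a real plugin
        // would validate a cookie or a SPNEGO token here instead.
        String user = ((HttpServletRequest) req).getHeader("X-Authenticated-User");
        if (user == null) {
          ((HttpServletResponse) resp).sendError(
              HttpServletResponse.SC_UNAUTHORIZED);
          return;
        }
        chain.doFilter(req, resp); // hand the authenticated request to the servlets
      }

      public void destroy() {}
    }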
Proxy-Users
• Some services access HDFS and MapReduce as other users.
• Configure the service masters (NameNode and JobTracker) with the proxy users:
−For each proxy user, the configuration defines (sketch below):
• Who the proxy service can impersonate
• Which hosts they can impersonate from
• New admin commands refresh the proxy settings
−No need to bounce the cluster
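A sketch of both halves, with made-up names: the master-side keys (hadoop.proxyuser.*.groups and hadoop.proxyuser.*.hosts, normally in core-site.xml) and the service-side impersonation call:

    import java.security.PrivilegedExceptionAction;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.security.UserGroupInformation;

    public class ProxyExample {
      public static void main(String[] args) throws Exception {
        // Master-side settings (normally in core-site.xml on the
        // NameNode/JobTracker): who "oozie" may impersonate, and from where.
        final Configuration conf = new Configuration();
        conf.set("hadoop.proxyuser.oozie.groups", "users");
        conf.set("hadoop.proxyuser.oozie.hosts", "oozie1.example.com");

        // Service-side: act as the already-authenticated end user "alice".
        UserGroupInformation proxy = UserGroupInformation.createProxyUser(
            "alice", UserGroupInformation.getLoginUser());
        proxy.doAs(new PrivilegedExceptionAction<Void>() {
          public Void run() throws Exception {
            FileSystem fs = FileSystem.get(conf); // accesses HDFS as "alice"
            return null;
          }
        });
      }
    }

Because the impersonation rights live in the masters' configuration, a compromised service can still only act as the users, and from the hosts, it was explicitly granted.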
Out of Scope
• Encryption
−RPC transport
−Block transport protocol
−On disk
• File Access Control Lists
−Still use Unix-style owner, group, other permissions
• Non-Kerberos authentication
−Much easier now that the framework is available
Deployment
• The security team worked hard to get security added to Hadoop on schedule.
−The roll-out was the smoothest major Hadoop release in a long time.
−Shipped in 0.20.203.0 and the upcoming 0.20.204.0 release.
−Measured performance degradation < 3%
• Security development team:
−Devaraj Das, Ravi Gummadi, Jakob Homan, Owen O'Malley, Jitendra Pandey, Boris Shkolnik, Vinod Vavilapalli, Kan Zhang
• Currently deployed on all shared clusters (alpha, science, and production) at Yahoo!
Incident after Deployment
• The only tense incident involved one cluster where a third of the machines dropped out after a day.
• Had to diagnose what had gone wrong.
• The dropped machines had newer keytab files!
• An operator had regenerated the keys on 1/3 of the cluster after it was running. Servers failed when they tried to renew their tickets.
Hadoop Eco-system
• Security percolates upward…
−You can only be as secure as the lower levels
−Pig finished integrating with security
−Oozie supports security
−HBase is being updated for security
• All backing data files are owned by the HBase user.
• Doesn't support applications reading or writing the files directly.
−Hive is also being updated
• Doesn't support column-level permissions
Questions?
• Questions should be sent to:
− common/hdfs/[email protected]
• Security holes should be sent to:
• Thanks!