Understanding Hadoop Performance on Lustre
Stephen Skory, PhD | Seagate Technology
Collaborators: Kelsie Betsch, Daniel Kaslovsky, Daniel Lingenfelter, Dimitar Vlassarev, and Zhenzhen Yan
LUG Conference | 15 April 2015
Seagate Confidential | Stephen Skory | Seagate Technology | LUG 15 Apr 2015
Our live demo of Hadoop on Lustre at Supercomputing November 17-20, 2014
• Why Hadoop on Lustre?
• LustreFS Hadoop Plugin
• Our Test System
• HDFS vs Lustre Benchmarks
• Performance experiments on Lustre and Hadoop Configuration
• Diskless Node Optimization
• Final Thoughts
Why Hadoop on Lustre?
• Lustre is POSIX compliant and convenient for a wide range of workflows and applications, while HDFS is a distributed object store designed narrowly for the write-once, read-many Hadoop paradigm
• Disaggregate computation and storage – HDFS requires storage to accompany compute, whereas with Lustre each part can be scaled or resized according to need
• Lower Storage Overhead – Lustre is typically backed by RAID with ~40% overhead, while HDFS has a default 200% overhead
• Speed – In our testing, TeraSort on Hadoop v2 runs nearly 30% faster on Lustre than on HDFS. When data transfer time is included, the improvement is over 50%
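To make the storage overhead figures above concrete, here is a minimal Python sketch of the raw capacity needed for a given amount of data. The function name and the 100 TB example are illustrative assumptions; only the ~40% and 200% overheads come from the slide.

```python
def raw_capacity_tb(usable_tb, overhead_pct):
    """Raw disk capacity needed to hold `usable_tb` of data.

    overhead_pct is the extra space on top of the data itself, in percent
    (e.g. 200 for HDFS's default of three copies of every block).
    """
    return usable_tb * (100 + overhead_pct) / 100


usable = 100  # TB of data to store (illustrative value, not from the slides)
lustre_raw = raw_capacity_tb(usable, 40)   # RAID-backed Lustre, ~40% overhead
hdfs_raw = raw_capacity_tb(usable, 200)    # HDFS default, 200% overhead

print(f"Lustre: {lustre_raw:.0f} TB raw, HDFS: {hdfs_raw:.0f} TB raw")
```

Under these figures the same data set needs a bit more than twice as much raw disk on HDFS as on RAID-backed Lustre.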
Hadoop on HDFS and on Lustre Data Flow
Typical Workflow: Ingestion of Data before Analysis
[Diagram: Compute, HDFS Storage, and Lustre Storage, shown for each workflow]
Hadoop on HDFS:
1) Transfer (and usually duplicate) data on HDFS
2) Analyze data
Hadoop on Lustre:
1) Analyze data on Lustre
LustreFS Plugin
• It is a single Java jar file that needs to be placed in the CLASSPATH of any service that needs to access files on Lustre
• Requires minimal changes to Hadoop configuration XML files
• Instead of file:/// or hdfs://hostname:8020/, files are accessed using lustrefs:///
• HDFS can co-exist with this plugin, and jobs can run between the two file systems seamlessly
• Hadoop service accounts (e.g. “yarn” or “mapred”) need to have permissions to a directory hierarchy on Lustre for temporary and other accounting data
• Freely available at http://github.com/seagate/lustrefs today, with documentation
• A pull request has been submitted to include the plugin in Apache Hadoop
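As a small illustration of the three URI forms mentioned above, the Python sketch below splits each one into scheme, authority, and path. The paths and user name are made up for the example, and the code only parses strings; it does not use the plugin or touch any file system.

```python
from urllib.parse import urlparse

# Example URIs; the paths are hypothetical, the schemes are from the slide.
uris = [
    "file:///tmp/input",
    "hdfs://hostname:8020/user/alice/input",
    "lustrefs:///user/alice/input",
]


def describe(uri):
    """Return (scheme, authority, path) for a Hadoop-style file system URI."""
    parts = urlparse(uri)
    return parts.scheme, parts.netloc, parts.path


for uri in uris:
    scheme, authority, path = describe(uri)
    print(scheme, repr(authority), path)
```

Like file:///, the lustrefs:/// form has an empty authority, while an hdfs:// URI names a host and port.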
All benchmarks shown were performed on this system
Our Hadoop on Lustre Development Cluster
• Hadoop Rack:
  • 1 Namenode, 18 Datanodes
  • 12 cores & 12 HDDs per Datanode, total of 216 for both
  • 19 x 10 GbE connections within the rack for Hadoop traffic
  • Each Hadoop node has a second 10 GbE connection to the ClusterStor network, for a total of 190 Gb/s full duplex bandwidth
• ClusterStor Rack:
  • 2 ClusterStor 6000 SSUs
  • 160 data HDDs
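The totals quoted for the test cluster can be checked with a few lines of Python. This is only a sanity-check sketch: the variable names are my own, and the conversion to gigabytes per second ignores protocol overhead.

```python
namenodes = 1
datanodes = 18
cores_per_datanode = 12
hdds_per_datanode = 12
link_gbit = 10  # each node's 10 GbE link to the ClusterStor network

total_cores = datanodes * cores_per_datanode
total_hdds = datanodes * hdds_per_datanode

hadoop_nodes = namenodes + datanodes
lustre_link_gbit = hadoop_nodes * link_gbit   # aggregate, per direction
lustre_link_gbyte = lustre_link_gbit / 8      # same figure in GB/s

print(total_cores, "cores,", total_hdds, "HDDs")
print(lustre_link_gbit, "Gb/s =", lustre_link_gbyte, "GB/s per direction")
```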
How Much Faster is Hadoop on Lustre?
Terasort timings
[Bar chart: completion time (s), split into transfer time (s) and analysis time (s), for Hadoop 1 on HDFS, Hadoop 1 on Lustre, Hadoop 2 on HDFS, and Hadoop 2 on Lustre. Shorter bars are faster.]
• Hadoop 1 on Lustre: 38% faster analysis, 63% faster overall (data transfer not needed)
• Hadoop 2 on Lustre: 28% faster analysis, 53% faster overall (data transfer not needed)
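The two pairs of percentages on this slide imply how large the HDFS transfer step was relative to the analysis itself. The sketch below works that out in Python. It assumes "X% faster" means the completion time is X% shorter than the HDFS baseline, which the slides do not state explicitly, and the function name is hypothetical.

```python
def implied_transfer_ratio(analysis_gain, overall_gain):
    """HDFS transfer time as a fraction of HDFS analysis time.

    With A = HDFS analysis time and T = HDFS transfer time:
        Lustre time = (1 - analysis_gain) * A
        Lustre time = (1 - overall_gain) * (A + T)
    Solving for T / A gives the expression below.
    """
    return (1 - analysis_gain) / (1 - overall_gain) - 1


for label, analysis_gain, overall_gain in [
    ("28% analysis, 53% overall", 0.28, 0.53),
    ("38% analysis, 63% overall", 0.38, 0.63),
]:
    ratio = implied_transfer_ratio(analysis_gain, overall_gain)
    print(f"{label}: transfer is about {ratio:.0%} of HDFS analysis time")
```

Under that assumption, the data transfer step takes more than half as long as the HDFS analysis in both cases.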
How Much Faster is Hadoop on Lustre?
Ecosystem Timings Compared to HDFS (transfer times not included)
HoL improvement over HDFS:
Benchmark    Hadoop 1    Hadoop 2
TeraSort     38%         28%
Sort         5%          -7%
Wordcount    15%         6%
Mahout       5%          17%
Pig          1%          -1%
Hive         8%          0%
All remaining benchmarks presented were performed on Hadoop v2
Stripe Size Performance
Hadoop on Lustre Performance
Stripe size is not a strong factor in Hadoop performance, except perhaps for 1MB stripes in some cases
Recommend 4MB or greater
max_rpcs_in_flight and max_dirty_mb
Hadoop on Lustre Performance
Terasort performance on 373GB with 4MB stripe size, 216 Mappers and 108 Reducers
RPCs        8        16       128
Dirty MB    32       128      512
Time        5m04s    5m03s    5m07s
Changing these values does not strongly affect performance.
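To quantify "does not strongly affect", the Python sketch below converts the three reported times to seconds and computes the relative gap between the fastest and slowest run. The helper name is my own; the settings and times are the ones in the table above.

```python
import re


def to_seconds(stamp):
    """Convert a duration written like '5m04s' to seconds."""
    minutes, seconds = re.fullmatch(r"(\d+)m(\d+)s", stamp).groups()
    return int(minutes) * 60 + int(seconds)


# (max_rpcs_in_flight, max_dirty_mb, Terasort time) as reported above
runs = [(8, 32, "5m04s"), (16, 128, "5m03s"), (128, 512, "5m07s")]

times = [to_seconds(stamp) for _, _, stamp in runs]
spread = (max(times) - min(times)) / min(times)

print(times)
print(f"gap between fastest and slowest run: {spread:.1%}")
```

The gap is only a few seconds on a run of about five minutes, even though both settings change by a factor of 16.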
Number of Mappers and Reducers
Hadoop on Lustre Performance
Terasort 373GB
The number of mappers and reducers has a much stronger effect on performance than any Lustre client configuration
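One common way to control the number of Mappers in a file-based Hadoop 2 job is through the input split size, while the number of Reducers is set directly. The Python sketch below computes the split size that would give the 216 Mappers quoted for the earlier 373 GB run and prints the corresponding job properties. It assumes decimal gigabytes, a single large input file, and an input format that honors `mapreduce.input.fileinputformat.split.minsize`; the slides do not say how the authors actually set these counts.

```python
def split_size_for(total_bytes, mappers):
    """Smallest split size (bytes) that covers total_bytes in `mappers` splits."""
    return -(-total_bytes // mappers)  # ceiling division on integers


total_bytes = 373 * 10**9  # 373 GB input, assuming decimal gigabytes
mappers = 216              # Mapper count quoted for the earlier benchmark
reducers = 108             # Reducer count quoted for the earlier benchmark

split = split_size_for(total_bytes, mappers)

# Standard Hadoop 2 property names; the values are specific to this example.
properties = {
    "mapreduce.input.fileinputformat.split.minsize": split,
    "mapreduce.job.reduces": reducers,
}
for name, value in properties.items():
    print(f"-D{name}={value}")
```

Each split here is roughly 1.7 GB, far larger than a typical 128 MB block, so the Mapper count becomes a tuning choice rather than something dictated by block layout.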
Some additional observations
Hadoop on Lustre Performance
• When it comes to Terasort, and likely other I/O intensive jobs, most performance gains come from tuning Hadoop rather than Lustre
• This means that, in contrast to Hadoop on HDFS, Hadoop performance on Lustre is less tied to how (or where, in the case of HDFS) the data is stored on disk, which gives users more options for tuning their applications
• It’s best to focus on tuning Hadoop parameters for best performance:
  • We have seen that Lustre does better with fewer writers than the same job going to HDFS (e.g. number of Mappers for Teragen, or Reducers in Terasort)
  • A typical Hadoop rule of thumb is to build clusters, and run jobs, following the 1:1 core:HDD rule. With Hadoop on Lustre, this rule doesn’t necessarily apply
  • Particularly for CPU-intensive jobs, performance tuning for Lustre is very similar to typical Hadoop on HDFS
Hadoop on Lustre Diskless Modification
(2011)
• Standard Hadoop uses local disks for both HDFS and temporary data
• However, when using Hadoop on Lustre on diskless compute nodes, all data needs to go to Lustre, including temporary data
• Beginning in 2011, we investigated how to optimize Map/Reduce performance when using Lustre as the temporary data store
• This required a modification to core Hadoop functionality, and we released these changes at SC14:
  • https://github.com/seagate/hadoop-on-lustre
  • https://github.com/seagate/hadoop-on-lustre2
Map/Reduce Data Flow Basics
[Diagram: Hadoop on HDFS data flow, Mapper -> Xfer Task (Reducer) -> Sort -> Reduce, with separate paths for sufficient and insufficient RAM. Legend: Lustre Disk, HDFS Disk, Local Disk, RAM, Local Xfer, Network Xfer.]
Note: In most cases HDFS data is local to the mapper, but it can be remote
Note: 1 local HDFS + 2 network HDFS copies for Reduce output
• HDFS data is split into blocks, which are analogous to Lustre’s stripes. The default block size is 128 MB for Hadoop 2
• The ResourceManager makes a best effort to assign computation (the Mapper) to nodes that already have the data
• When done, the Mapper writes intermediate data to local scratch disk (no replication)
• Reducers collect data from Mappers (reading from local disk) over HTTP (called the “shuffle” phase) and sort incoming data either in RAM or on disk (again, local scratch disk)
• When data is sorted, the Reducer applies the reduction algorithm and writes output to HDFS, where it is 3x replicated
Map/Reduce Data Flow Basics
[Diagram: Map/Reduce data flow for Hadoop on HDFS (upper) and Hadoop on Lustre with local disks (lower). Both follow Mapper -> Xfer Task (Reducer) -> Sort -> Reduce, with separate paths for sufficient and insufficient RAM. Legend: Lustre Disk, HDFS Disk, Local Disk, RAM, Local Xfer, Network Xfer.]
Note (HDFS): In most cases HDFS data is local to the mapper, but it can be remote
Note (HDFS): 1 local HDFS + 2 network HDFS copies for Reduce output
Note (Lustre): 1 Lustre network copy for Reduce output
Using Lustre instead of HDFS is very similar, except that the Mapper reads data from Lustre over the network, and the Reducer output is not triple replicated.
The lower diagram shows the setup used for all the benchmarks presented so far: each Hadoop compute node uses both Lustre and local disks.
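The difference in Reduce output traffic noted above can be tallied in a short Python sketch. The copy counts are the ones from the diagram notes (1 local + 2 network copies for HDFS, 1 network copy for Lustre); the function name and the 373 GB output size are illustrative, chosen because Terasort writes as much data as it reads.

```python
def reduce_output_copies(storage):
    """Copies of Reduce output written, split by where they go."""
    if storage == "hdfs":
        # Default 3x replication: 1 copy on the local node, 2 over the network
        return {"local": 1, "network": 2}
    if storage == "lustre":
        # A single copy, written to Lustre over the network
        return {"local": 0, "network": 1}
    raise ValueError(f"unknown storage: {storage}")


output_gb = 373  # illustrative output size
for storage in ("hdfs", "lustre"):
    copies = reduce_output_copies(storage)
    written = output_gb * sum(copies.values())
    over_network = output_gb * copies["network"]
    print(f"{storage}: {written} GB written, {over_network} GB over the network")
```

This counts only the data as seen by Hadoop; any RAID parity written inside the Lustre storage servers is not included.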
Hadoop on Lustre Map/Reduce Diskless Modification
[Diagram: Map/Reduce data flow for three configurations, each with separate paths for sufficient and insufficient RAM. Legend: Lustre Disk, HDFS Disk, Local Disk, RAM, Local Xfer, Network Xfer.]
• Hadoop on HDFS: Mapper -> Xfer Task (Reducer) -> Sort -> Reduce
• Hadoop on Lustre Diskless Unmodified: Mapper -> Xfer Task (Reducer) -> Sort -> Reduce
• Hadoop on Lustre Diskless Modified: Mapper -> Sort -> Reduce
Note (HDFS): In most cases HDFS data is local to the mapper, but it can be remote
Note (HDFS): 1 local HDFS + 2 network HDFS copies for Reduce output
Note (Lustre, both diskless cases): 1 Lustre network copy for Reduce output
Hadoop on Lustre Diskless Modification
Terasort Timings:
Size      Unmodified    Modified
100 GB    131s          155s
500 GB    538s          558s
The modified method is slower!
[Diagram: Mapper and Reducer on a Hadoop node for the Unmodified and Modified methods, showing the IO Cache and temporary files saved to Lustre]
Takeaway: IO Caching speeds up the unmodified method.
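The size of the slowdown can be read off the timings with a few lines of Python. This sketch defines "slower" as the fractional increase in run time relative to the unmodified method; the function name is my own.

```python
# Terasort timings in seconds: size -> (unmodified, modified)
timings = {"100 GB": (131, 155), "500 GB": (538, 558)}


def slowdown(unmodified_s, modified_s):
    """Fractional increase in run time of the modified method."""
    return (modified_s - unmodified_s) / unmodified_s


for size, (unmodified_s, modified_s) in timings.items():
    print(f"{size}: modified is {slowdown(unmodified_s, modified_s):.1%} slower")
```

The absolute gap is similar at both sizes (24 s and 20 s), so the relative penalty shrinks as the data set grows.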
Final Thoughts
• We have explored three methods of Hadoop on Lustre:
  • Using local disk(s) for temporary storage
  • Using Lustre for temporary storage with no modifications to Map/Reduce
  • Using Lustre for temporary storage, but with modifications to Map/Reduce
• We find that each method in this list is slower than the ones above it
• The local disks do not need to be close to “big” (where “big” is the largest dataset analyzed), but it is best to have some kind of local disk for temporary data
• Thank you to my collaborators Kelsie Betsch, Daniel Kaslovsky, Daniel Lingenfelter, Dimitar Vlassarev, and Zhenzhen Yan