* Some names and brands may be claimed as the property of others.
OpenFabrics Software User Group Workshop
Lustre* Filesystem for Cloud and Hadoop*
Robert Read, Intel
Lustre* for Cloud and Hadoop*
• Brief Lustre History and Overview
• Using Lustre with Hadoop
• Intel® Cloud Edition for Lustre
March 15 – 18, 2015 #OFADevWorkshop 2
What is Lustre*
• High performance parallel filesystem for Linux environments
• Designed to use RDMA on high performance fabrics
• Allows large number of users to share a file system
• High speed and low latencies
• Across local or wide area networks
• Designed for reliable storage
• Ideal for streaming IO and large, shared datasets
• Evolving to bring parallel filesystem to new workloads
Quick History
• Originally built to solve next gen IO problems for HPC
• An open source (GPL) project from the beginning
• Core team has survived numerous transitions and is now safely established in Intel’s HPDD.
• Used in production systems since 2002
• Intel Enterprise Edition for Lustre* since 2013
• Intel Cloud Edition now available on AWS Marketplace
Lustre* Storage Components
• Management Target (MGS / MGT)
  – Lustre mount service
  – Initial point of contact for clients
• Metadata Targets (MDS / MDT)
  – Namespace of file system
  – File layouts, no data
  – Scalable
• Object Storage Targets (OSS / OST)
  – File content stored as objects
  – Striped across multiple targets
  – Scales to 100s
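To make "striped across multiple targets" concrete, here is a minimal sketch of round-robin (RAID-0 style) striping, the usual Lustre file layout. The function name and parameters are illustrative, not part of any Lustre API; the stripe size and stripe count are whatever the file's layout specifies.

```python
def locate(offset, stripe_size, stripe_count):
    """Map a file byte offset to (object index, offset inside that object).

    Illustrative only: assumes plain round-robin striping in which the
    file is cut into stripe_size chunks handed out in turn to
    stripe_count objects, one object per OST.
    """
    chunk = offset // stripe_size              # which stripe-sized chunk of the file
    object_index = chunk % stripe_count        # chunks rotate over the objects
    full_rounds = chunk // stripe_count        # completed passes over all objects
    object_offset = full_rounds * stripe_size + offset % stripe_size
    return object_index, object_offset


MIB = 1024 * 1024
# With 1 MiB stripes over 4 objects, the sixth chunk of the file
# (offset 5 MiB) is the second chunk stored in object 1.
print(locate(5 * MIB + 10, MIB, 4))
```

Because consecutive chunks land on different targets, a large sequential read or write can keep several OSSs busy at once.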
Hadoop* Introduction
• Open source framework for data-intensive computing
• Parallelism hidden by framework
  – Highly scalable: can be applied to large datasets (Big Data) and run on commodity clusters
• Comes with its own user-space distributed file system (HDFS) based on the local storage of cluster nodes
Hadoop* with HDFS
• HDFS locality has advantages, but…
  – HDFS requires import/export to share data
  – Compute nodes require local storage
  – Hadoop nodes are both compute and IO nodes
  – Hadoop nodes are single-purpose
[Diagram: an HPC cluster on a Lustre filesystem and a Hadoop cluster on HDFS, with import/export between them.]
Hadoop* Adapter for Lustre*
• Shared data repository for all compute resources
• Use data in place (no import/export)
• Dedicated Compute and IO nodes
[Diagram: Hadoop cluster and HPC cluster sharing one Lustre filesystem.]
HPC Adapter for MapReduce
• Run Hadoop* on HPC cluster
• Replace YARN with Slurm
• Single compute cluster used for variety of workloads
[Diagram: Hadoop running on the HPC cluster, on top of the Lustre filesystem.]
Optimized Shuffle: HDFS vs Lustre*
[Diagram: MapReduce data flow. Labels: Input, Map, Sort, Spill, Map O/P, Shuffle, Merge, Reduce, Output; local disk on the MAP and REDUCE sides; HDFS; "Repetitive"; LUSTRE (Global Namespace); Spill File : Reducer → N : 1.]
Performance Comparison
• Tata Consultancy Services performed comparison in 2014
  – “Performance comparison of Lustre* and HDFS for MR implementation of FSI workload using HDDP cluster hosted in the Intel BigData Lab in Swindon (UK) and Intel® Enterprise Edition for Lustre* software”
  – Intel® EE for Lustre = 3 X HDFS for single job
  – Intel® EE for Lustre = 5.5 X HDFS for concurrent workload
• http://insidebigdata.com/2014/09/29/performance-comparison-intel-enterprise-edition-lustre-hdfs-mapreduce/
Degree of Concurrency = 1
Intel® EE for Lustre* = 3 X HDFS for optimal SC settings*
http://www.eofs.eu/fileadmin/lad2014/slides/21_Rekha_Singhal_LAD2014_Lustre_vs_HDFS_MR.pdf
Degree of Concurrency = 1
Intel® EE for Lustre* optimal SC gives 70% improvement over HDFS
http://www.eofs.eu/fileadmin/lad2014/slides/21_Rekha_Singhal_LAD2014_Lustre_vs_HDFS_MR.pdf
Intel Cloud Edition for Lustre*
• Three Intel Lustre AMIs available on AWS Marketplace*:
  • Community Version
  • Global Support
  • Global Support HVM
• Lustre v2.5.3
• Automated Lustre configuration
• Lustre monitoring with Ganglia and LMT/ltop
• Automated S3 data import
• Global Support includes additional capabilities:
  • Intel Lustre Support
  • High Availability (requires VPC)
  • Enhanced Networking (HVM + VPC)
  • Higher performance instance types
Automated Lustre* Deployment
[Diagram: CloudFormation, DynamoDB, MGS, MDS, and three OSS instances, with numbered steps:]
1. CloudFormation creates a stack of AWS resources from a template.
2. MGS initializes itself.
3. MGS updates DB with NID.
4. MDS formats MDT, registers with MGS, updates DB.
5. OSSs format local targets and update DB.
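The coordination through a shared database can be sketched as below. This is a simplified simulation under stated assumptions, not the Cloud Edition code: `Registry` is an in-memory stand-in for the DynamoDB table, the key names and target labels are hypothetical, and it assumes the MDS and OSSs need only the MGS NID before they can register.

```python
class Registry:
    """In-memory stand-in for the shared DB (DynamoDB on the slide)."""

    def __init__(self):
        self.items = {}

    def put(self, key, value):
        self.items[key] = value

    def get(self, key):
        return self.items.get(key)


def start_mgs(db, nid):
    # Steps 2-3: the MGS initializes itself, then publishes its NID.
    db.put("mgs_nid", nid)


def start_mds(db, nid):
    # Step 4: the MDS looks up the MGS NID; without it, it cannot register yet.
    mgs = db.get("mgs_nid")
    if mgs is None:
        return False  # MGS not ready; the caller would retry later
    db.put("mds", {"nid": nid, "mgs": mgs, "target": "MDT0000"})
    return True


def start_oss(db, nid, index):
    # Step 5: each OSS formats its local targets and records itself.
    mgs = db.get("mgs_nid")
    if mgs is None:
        return False
    db.put(f"oss{index}", {"nid": nid, "mgs": mgs})
    return True


db = Registry()
print(start_mds(db, "10.0.0.2@tcp"))   # MGS has not published its NID yet
start_mgs(db, "10.0.0.1@tcp")
print(start_mds(db, "10.0.0.2@tcp"))
print(start_oss(db, "10.0.0.3@tcp", 0))
```

Since every instance boots independently from the same template, a shared table gives late starters a place to discover the MGS without any fixed addresses baked into the images.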
Using Virtual Private Cloud
[Diagram labels: VPC; Public Subnet; Lustre* Private Subnet; Compute Subnet; Public Network; AWS APIs; NAT; ELB; MGS & Console; MDS; OSS (multiple); Worker Client; Compute (multiple).]
Automated S3 Import
• Option to import an S3 bucket into a new Lustre* filesystem
• Initially only the file metadata is imported
• File contents are retrieved from S3 on demand and stored on Lustre
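The metadata-first, fetch-on-demand behavior can be illustrated with a small simulation. This is a sketch under assumptions, not the actual import mechanism: a plain dict stands in for the S3 bucket, and the class and attribute names are hypothetical.

```python
class LazyImportedFS:
    """Toy model: import names and sizes up front, contents on first read."""

    def __init__(self, bucket):
        self.bucket = bucket  # dict of name -> bytes, stand-in for an S3 bucket
        # Import step: record only metadata. A real S3 listing returns sizes
        # without downloading objects; len() here just mimics that.
        self.metadata = {name: len(data) for name, data in bucket.items()}
        self.contents = {}    # file data that has been pulled in so far
        self.fetches = 0      # number of simulated S3 retrievals

    def listdir(self):
        return sorted(self.metadata)

    def read(self, name):
        if name not in self.metadata:
            raise FileNotFoundError(name)
        if name not in self.contents:
            # First access: retrieve from the bucket and keep a local copy.
            self.contents[name] = self.bucket[name]
            self.fetches += 1
        return self.contents[name]


fs = LazyImportedFS({"a.txt": b"hello", "b.txt": b"xy"})
print(fs.listdir(), fs.fetches)    # namespace is visible before any fetch
fs.read("a.txt")
fs.read("a.txt")                   # second read is served from the local copy
print(fs.fetches)
```

The namespace is usable immediately, and only the files a job actually touches incur a transfer.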
Lustre* HA in the cloud
• HA template is available
  – Storage is managed independently of instances
• Failure scenario
  – AWS AutoScaling detects and terminates a failed instance
  – AutoScaling creates a new instance
  – New instance identifies orphaned storage and network interface
  – “Adopts” the storage and NIC
  – Target is brought back online and recovers
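The failure scenario above amounts to reassigning resources whose owner no longer exists. A minimal simulation, assuming ownership is tracked as simple resource-to-instance maps (the identifiers and the function name are hypothetical, and no AWS API is called):

```python
def replace_failed(instances, volumes, nics, failed_id, new_id):
    """Simulate the recovery: drop the failed instance, add its
    replacement, and let the replacement adopt orphaned resources.

    instances is a set of live instance ids; volumes and nics map a
    resource id to the instance that owns it. Returns adopted ids.
    """
    instances.discard(failed_id)   # AutoScaling terminates the failed instance
    instances.add(new_id)          # AutoScaling creates a new instance
    adopted = []
    for table in (volumes, nics):
        for resource, owner in table.items():
            if owner not in instances:     # owner is gone: resource is orphaned
                table[resource] = new_id   # the new instance adopts it
                adopted.append(resource)
    return sorted(adopted)


instances = {"i-a", "i-b"}
volumes = {"vol-1": "i-a", "vol-2": "i-b"}
nics = {"eni-1": "i-a", "eni-2": "i-b"}
print(replace_failed(instances, volumes, nics, "i-b", "i-c"))
```

Keeping the storage and network interface outside the instance lifecycle is what lets the replacement come back with the same data and address.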
ICE-L Monitoring tools
Large File Benchmark
• Using IOR benchmark
• Comparing 3 Lustre cluster configurations
• Increase the number of OSSs
  • 4 OSS
  • 8 OSS
  • 16 OSS
• Configurations of MGS and MDS are fixed
• 1-32 clients

Instance configuration:
• MGS: m1.medium, 94 MB/sec
• MDS: m3.2xlarge, EBS Optimized, RAID0 8x 40GB Standard, 110 MB/sec
• OSS: m3.2xlarge, EBS Optimized, 8x 100GB Standard, 110 MB/sec
• Client: m3.2xlarge, 110 MB/sec
IOR Sequential Read FPP
[Chart: MB/sec (0 to 1600) vs. number of clients (1, 2, 4, 8, 16, 32) for the 4 OSS, 8 OSS, and 16 OSS configurations. Annotations: "Client’s network bottleneck"; "OSS’s network bottleneck" (twice); "Close to the OSS network".]
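The bottleneck annotations can be read as a simple bound: aggregate throughput cannot exceed the slower of the client side and the OSS side. A rough sketch, assuming the 110 MB/sec figure from the benchmark configuration is a per-instance network limit for both clients and OSSs (the function name is illustrative, and these are estimated ceilings, not measured results):

```python
def network_bound_mb_s(n_clients, n_oss, client_mb_s=110, oss_mb_s=110):
    """Upper bound on aggregate throughput when only the network limits it."""
    return min(n_clients * client_mb_s, n_oss * oss_mb_s)


for n_oss in (4, 8, 16):
    row = [network_bound_mb_s(n, n_oss) for n in (1, 2, 4, 8, 16, 32)]
    print(f"{n_oss:2d} OSS:", row)
```

Under this assumption the clients are the limit while there are fewer clients than OSSs, and the OSSs become the limit once the client count passes the OSS count.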
IOR Sequential Write FPP
[Chart: MB/sec (0 to 1600) vs. number of clients (1, 2, 4, 8, 16, 32) for the 4 OSS, 8 OSS, and 16 OSS configurations. Annotations: "Client’s network bottleneck"; "OSS’s network bottleneck" (twice); "Ops….".]
Aggregate Performance During Run
[Chart: aggregate performance vs. time. Annotations: 1920 MB/sec; long tail.]
LMT was used to record the OST metrics during the IOR run. A simple Python script turns those metrics into this “aggregate performance vs time” graph, which we use to analyze the problem.
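The original script is not shown; a minimal sketch of the aggregation step might look like this. The sample format, tuples of (timestamp, OST name, MB/sec), is an assumption made for illustration and is not LMT's actual schema.

```python
from collections import defaultdict


def aggregate_by_time(samples):
    """Sum per-OST throughput samples into one total per timestamp.

    samples: iterable of (timestamp, ost_name, mb_per_sec) tuples
    (hypothetical format). Returns [(timestamp, total_mb_per_sec), ...]
    sorted by timestamp, ready to hand to a plotting library.
    """
    totals = defaultdict(float)
    for timestamp, _ost_name, mb_per_sec in samples:
        totals[timestamp] += mb_per_sec
    return sorted(totals.items())


samples = [
    (0, "OST0000", 100.0),
    (0, "OST0001", 120.0),
    (1, "OST0000", 90.0),   # only one OST still active at t=1
]
print(aggregate_by_time(samples))
```

In a series like this, a long tail would show up as late timestamps where the total sits far below the peak because only a few OSTs are still doing I/O.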
MDTEST on 16 OSS Cluster Configuration
[Chart: MOPS (0 to 30000) vs. number of threads (32, 64, 128, 256) on 32 clients. Series: Directory creation, Directory stat, Directory removal, File creation, File stat, File removal, Tree creation, Tree removal.]
#OFSUserGroup
OpenFabrics Software User Group Workshop
Thank You