8/7/2019 Nccs Lustre
1/24
NCCS Lustre File System Overview
Presented by
Sarp Oral, Ph.D.
NCCS Scaling Workshop
Oak Ridge National Laboratory
U.S. Department of Energy
August 1st, 2007
Outline
What is Lustre
NCCS Jaguar Lustre
Other NCCS Lustre File Systems
Lustre Centre of Excellence (LCE) at ORNL
What is Lustre
Lustre is
  POSIX compliant
  A parallel file system
Lustre provides
  High scalability
  High performance
  A single global namespace
Lustre is a software-only architecture
Lustre Architecture
Lustre consists of four major components:
MDS (MetaData Server)
  Manages the namespace and directory/file operations
  Stores file system metadata
OSS (Object Storage Server)
  Manages the OSTs
OST (Object Storage Target)
  Manages the underlying block devices
  Stores file data stripes
and, of course, clients
Lustre Architecture
[Diagram: clients issue metadata operations (file creation, stats, recovery) to the MDS, and block I/O and file locking to the OSSs; each OSS serves multiple OSTs backed by block devices.]
Lustre Architecture
All servers operate on a full-blown local file system
Today, only a single active MDS is supported
  The goal is to have many MDSs in the near future
The whole file system is limited by that single MDS's performance
  Although not that bad, it can sometimes be a problem
Lustre Architecture
Failover
  Active-passive pairs for MDS and OSS
  Works fine on all *NIX-based systems except Catamount
  Failover is not supported with the current UNICOS
  Failover will be supported with CNL
Supports sparse files
We are using 2 TB partitions
Unlike all other *NIX-based systems, on Catamount clients Lustre access is achieved over liblustre
  Catamount clients are uninterruptible and I/O is not cached
  liblustre is directly linked into the application
Lustre Architecture
Striping is the key to achieving high scalability and performance
  File data is written to and read from multiple OSTs
  Provides higher aggregate read/write bandwidth than a single server can deliver
  Allows file sizes larger than a single OSS could handle
Simple tips
  Over-striping might be bad: the chunks written to each OST become too small, under-utilizing the OSTs and the network
  Under-striping might be bad: too much stress on each OST
Lustre Architecture
The stripe pattern can be changed by the user
  Before the file or directory is created
  Once created, the stripe pattern is fixed
lfs setstripe sets the stripe pattern
lfs getstripe queries the stripe pattern
Within an application
  Several low-level ioctl calls are available to set and query stripe patterns and some other extended attributes (EAs)
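As a sketch, the lfs commands above could be driven from a script. The flag spelling below (-s stripe size in bytes, -i starting OST index, -c stripe count) is an assumption based on common lfs usage of this era; verify against `lfs setstripe` on your own system:

```python
import shutil
import subprocess

def setstripe_cmd(path, stripe_size, stripe_count, stripe_index=-1):
    """Build an `lfs setstripe` command line.

    Flag names (-s/-i/-c) are an assumption; check your Lustre
    release. stripe_index -1 lets the MDS pick the starting OST.
    """
    return ["lfs", "setstripe", "-s", str(stripe_size),
            "-i", str(stripe_index), "-c", str(stripe_count), path]

def set_stripe(path, stripe_size, stripe_count, stripe_index=-1):
    """Apply the pattern if this node is a Lustre client; otherwise
    just return the command that would have run."""
    cmd = setstripe_cmd(path, stripe_size, stripe_count, stripe_index)
    if shutil.which("lfs") is not None:
        subprocess.run(cmd, check=True)
    return cmd
```

For example, `set_stripe("/lustre/scratch/out.dat", 4 * 1024 * 1024, 8)` (a hypothetical path) would request eight 4 MB stripes; remember this only takes effect before the file holds any data.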
Lustre Architecture
[Diagram: files A, B, and C laid out across OST1, OST2, and OST3, shown single-striped and two-striped.]
Stripe count (or width)
  Number of OSTs the file has been striped over
Stripe size
  Size of each stripe on an OST
  Normally the same for all OSTs for a given file
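Together, the stripe count and stripe size determine which OST object holds any given byte of a file. A minimal sketch of that round-robin mapping (a simplification for illustration; real Lustre resolves stripe slots to actual OST indices through the file's layout EA):

```python
def stripe_of(offset, stripe_size, stripe_count):
    """Map a file byte offset to (stripe slot, offset within that
    OST object) under round-robin striping."""
    stripe_index = offset // stripe_size       # which stripe of the file
    slot = stripe_index % stripe_count         # which OST slot holds it
    obj_offset = ((stripe_index // stripe_count) * stripe_size
                  + offset % stripe_size)      # offset inside the object
    return slot, obj_offset
```

With 1 MB stripes over 3 OSTs, the first megabyte lands on slot 0, the next on slot 1, the third on slot 2, and the fourth wraps back to slot 0 at object offset 1 MB.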
Lustre Architecture
Everything is based on RPCs
  Sometimes messages are dropped or lost
Timeouts
  If the error is caused by the client side
  The client simply disconnects from that particular server and keeps retrying to connect
Eviction
  If the error is caused by the server side
  The client discovers it has been evicted on its next request
  All of the client's buffer cache is invalidated; dirty data will be lost
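The client-side timeout behavior above can be sketched as a retry loop (a toy illustration of the pattern, not Lustre's actual RPC code; all names are invented):

```python
import time

def rpc_with_reconnect(send, max_attempts=5, timeout=1.0, backoff=0.01):
    """Toy version of the client-side pattern: when an RPC times out,
    drop the connection and keep retrying until it succeeds or we
    exhaust our attempts."""
    for attempt in range(max_attempts):
        try:
            return send(timeout=timeout)
        except TimeoutError:
            time.sleep(backoff * (attempt + 1))  # back off, then reconnect
    raise ConnectionError("server unreachable; giving up")
```

A server-side eviction is the inverse case: the retry loop would succeed in reconnecting, but the server tells the client it was evicted, so cached dirty data cannot be replayed.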
Lustre Architecture
The architecture changed with Lustre 1.4.6
  LNET and LNDs: independent network conduits were introduced
  A single network and recovery layer (LNET) establishes connections between the upper Lustre file system layers and the lower network conduits
  Conduits: TCP, Cray Portals, InfiniBand, Myricom, Elan
[Diagram: the Lustre file system layer sits on the Lustre networking layer (LNET); below it, per-network LNDs (TCP, Myricom, Elan), each backed by the corresponding vendor library.]
Lustre Architecture
POSIX compliant (in a cluster)
  Clients don't see stale data or metadata
  Semantics are guaranteed by strict locking
Exceptions
  flock/lockf is still not supported
Security
  Kerberos capabilities are on the way
  An encrypted Lustre file system is under development
NCCS Jaguar Lustre
Cray XT4 (Catamount) Lustre, 3-D torus
Uses 3 MDS service nodes and 72 XT4 service nodes as OSSs
  2 OSTs for the 300 TB file system
  1 OST each for the remaining 150 TB file systems
  2 single-port 4 Gb FC HBAs per OSS
  45 GB/s block I/O for the 300 TB file system
DDN 9550s
  18 racks/couplets
  Write-back cache is 1 MB on each controller
  36 TB per couplet with Fibre Channel drives
  Each LUN has a capacity of 2 TB and a 4 KB block size
NCCS Jaguar Lustre
Cray XT4 (Catamount) LUN configuration
NCCS Jaguar Lustre
Default stripe count/width on Jaguar (Catamount)
Default stripe size on Jaguar (Catamount)
NCCS Jaguar Lustre
Jaguar Compute Node Linux (CNL) Lustre
  Exact configuration details, compared to the Catamount side, are to be determined
A mechanism to transfer files between the Catamount side and the CNL side
  Can be done by users
Other NCCS Lustre File Systems
End-to-end cluster (Ewok)
  1 MDS, 6 OSSs, 2 OSTs/OSS, OFED 1.1 IB, 81 clients
  20 TB, ~3-4 GB/s
Viz cluster (Everest), coming soon
  1 MDS, 10 OSSs, 2 OSTs/OSS, OFED 1.2
  ~A couple of tens of TB, ~4-5 GB/s
Other NCCS Lustre File Systems
Center-wide Lustre cluster (Spider)
To serve all NCCS resources
  Jaguar, Everest, and Ewok by the end of 2007
  Baker by the end of 2008, and all new additions from that point on
Phase 0
  Will be in production soon over Jaguar
  10 couplets of DDN 8500s, 2 Gb FC direct links with failover configured
Phase 1: an additional 20 GB/s by the end of 2007
Other NCCS Lustre File Systems
Spider provides
  + Ability to analyze data offline
  + On-the-fly data analysis/visualization capability
  + Ease of diagnostics/decoupling
  + Lower acquisition/expansion cost
Other NCCS Lustre File Systems
Lustre router nodes on Jaguar
Route Lustre packets between
  TCP and Cray Portals
  IB and Cray Portals
~450 MB/s per XT4 SIO node over TCP/Cray Portals
~600-700 MB/s per XT4 SIO node over IB/Cray Portals
Other NCCS Lustre File Systems
[Diagram: Jaguar SIO router nodes (Cray Portals/IB) bridge to the InfiniBand and TCP networks, connecting Ewok, Everest, and the Spider backend disk systems.]
Lustre Centre of Excellence (LCE) at ORNL
The Lustre Centre of Excellence (LCE) was established in December 2006
Create an on-site presence at ORNL (1st floor back hall)
  Oleg Drokin and Wang Di
Develop a risk-mitigation Lustre package for ORNL
  MPILND
In the out-years, explore possible 1 TB/s solutions
Develop local expertise to reduce dependence on CFS and Cray
  Peter Braam gave a 3-day tutorial on Lustre internals in January
  A sysadmin training is being planned
Assist science teams in tuning their application I/O
  Focus on 2-3 key apps initially and document the results
  On-site Lustre workshops for application teams