Lustre+ZFS: Reliable/Scalable Storage
Josh Judd, CTO
May 26, 2015
© 2012 WARP Mechanics Ltd. All Rights Reserved.
ZFS+Lustre: Open Storage Layers

• ZFS:
– Volume management layer (RAID)
– Reliable storage (checksums)
– Feature rich (snapshots, compression, replication, etc.)
– Accelerators (SSD hybrid)
– Scalable (e.g., 16 exabytes for one file; see the check below)
• Lustre:
– Linux-centric scale-out filesystem
– Powers the world’s largest supercomputers
– A single FS can be 10s of PB with TBs/sec of throughput
– Can sit on top of ZFS
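For a sense of that scale claim: the 16-exabyte single-file figure is simply the 64-bit file-offset limit. A quick check of the arithmetic in Python (this is just the math, not ZFS code):

```python
# ZFS uses 64-bit file offsets, so the largest possible single
# file is 2**64 bytes = 16 EiB (~18.4 decimal exabytes).
max_file_bytes = 2 ** 64
print(max_file_bytes / 2 ** 60)   # 16.0 EiB
print(max_file_bytes / 10 ** 18)  # ~18.45 EB
```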
What does ZFS do for me?
• HPC-relevant features:
– Support for 1/10GbE, 4/8/16Gb FC, and 40Gb InfiniBand
– Multi-layer cache combines DRAM and SSDs with HDDs
– Copy-on-write eliminates write holes and accelerates writes
– Checksums eliminate silent data corruption and bit rot
– Snapshots, thin provisioning, compression, de-dupe, etc. built in
– Lustre and SNFS integration allows 40GbE networking
• Same software/hardware supports NAS and RAID
– One management code base to control all storage platforms
• Open storage. You can have the source code.
ZFS Feature Focus
• Enhanced data integrity:
– Legacy RAID is subject to silent corruption and write holes
– Little-known fact: virtually all RAIDs do not check parity on each read, so if a bit flips, you just get the wrong data back
– ZFS adds a checksum to every block to solve this (see the sketch below)
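To illustrate the principle (a hypothetical sketch, not ZFS code): the checksum lives outside the block it protects, in the pointer that references it, so every read can be verified independently of the data it returns:

```python
import hashlib

class Block:
    """A data block as it sits on disk; it may silently rot."""
    def __init__(self, data: bytes):
        self.data = data

class BlockPointer:
    """Stores the checksum *outside* the block it references,
    so corrupting the block cannot also corrupt its checksum."""
    def __init__(self, block: Block):
        self.block = block
        self.checksum = hashlib.sha256(block.data).digest()

    def read(self) -> bytes:
        # Verify on every read -- the step legacy RAID skips.
        if hashlib.sha256(self.block.data).digest() != self.checksum:
            raise IOError("checksum mismatch: silent corruption caught")
        return self.block.data

ptr = BlockPointer(Block(b"important data"))
assert ptr.read() == b"important data"

ptr.block.data = b"imp0rtant data"  # simulate a flipped bit on disk
try:
    ptr.read()
except IOError as err:
    print(err)  # the corruption is detected instead of returned
```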
• Advanced cache:
– Legacy RAID has insufficient cache to be meaningful
– ZFS-based WARPraid supports 100s of GB of DRAM (ARC), plus 10s of TB of SSD (L2ARC)
– E.g., you can “push” directories onto the SSD cache ahead of time to drastically accelerate read-intensive workloads (see the warming sketch below)
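A minimal sketch of what such cache warming could look like from the client side, assuming only that reading a file pulls its blocks into ARC, with evictions spilling to L2ARC (paths and helper names are illustrative):

```python
import os

def warm_cache(root: str, chunk_size: int = 1 << 20) -> int:
    """Read every file under `root` to pull its blocks into the
    DRAM cache (ARC); blocks evicted from ARC spill to the SSD
    cache (L2ARC), leaving the directory 'pushed' onto flash."""
    bytes_read = 0
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            try:
                with open(os.path.join(dirpath, name), "rb") as f:
                    while True:
                        buf = f.read(chunk_size)
                        if not buf:
                            break
                        # Data is discarded; the caching side effect
                        # is the point.
                        bytes_read += len(buf)
            except OSError:
                continue  # skip unreadable files
    return bytes_read

# Example with a hypothetical path: warm tonight's working set.
# warm_cache("/tank/project/hot-dataset")
```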
What does Lustre do for me?
• Category: Distributed (parallel-ish) filesystem
– Similar(-ish) to StorNext, GPFS, pNFS, etc.
• Allows performance and capacity to scale independently (see the striping sketch below)
• One FS can be 10s of PB with TBs/sec of throughput
• No theoretical upper limit on architectural scalability
– Yes, there are practical limits...
– But those expand every year, and...
– Even a “practical” Lustre FS can be 10s or 1000s of times larger than other FSs on the market
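The independent scaling comes from striping: each file is split across object storage targets (OSTs), so adding OSTs adds capacity and bandwidth at the same time. A hypothetical sketch of the idea (stripe size and names are illustrative, not Lustre internals):

```python
STRIPE_SIZE = 1 << 20  # 1 MiB, Lustre's default stripe size

def stripe_layout(file_size: int, ost_count: int):
    """Map each stripe of a file to an OST index, round-robin,
    the way a parallel filesystem spreads one file's blocks."""
    stripes = (file_size + STRIPE_SIZE - 1) // STRIPE_SIZE
    return [(i * STRIPE_SIZE, i % ost_count) for i in range(stripes)]

# A 10 MiB file across 4 OSTs: stripes land on OSTs 0,1,2,3,0,1,...
for offset, ost in stripe_layout(10 * STRIPE_SIZE, ost_count=4):
    print(f"offset {offset:>8}: OST {ost}")

# Per-OST bandwidth is roughly fixed, so aggregate streaming
# throughput grows ~linearly with OST count until the fabric saturates.
```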
How do they combine?
• ZFS is currently supportable on Solaris; Lustre is currently supportable on Linux. So... how do they mix?
• In theory, any of three ways:
– Port Lustre to Solaris – has not been done
– Port ZFS to Linux – in progress; replaces MD/RAID and EXT4
– Use ZFS on Solaris as a RAID controller under Lustre on Linux
• WARP Mechanics focuses on the third option:
– It is supportable in production now
– It maintains full separation of code
– It still allows ZFS to replace EXT4 down the road, while performing volume management on separate controllers, which aids performance and scalability
Example Architecture

[Diagram: Lustre clients connect over QDR/FDR IB or 10/40Gbps Ethernet to OSS pairs (OSS 1a/1b and OSS 2a/2b). Each pair fronts a ZFS RAID system (RAID 1a/1b = ZFS RAID 1; RAID 2a/2b = ZFS RAID 2) with ZIL, ARC, and L2ARC acceleration and ~250TB of HDDs each, interconnected point-to-point over IB or Ethernet.]
Example Architecture

Scale-out example:
• 200 ZFS RAID systems = 400 controllers = 1.2 TBytes/sec
• 16,000 NL-SAS HDDs = ~40PB usable
• ~200TB usable per Neutronium; estimated 3GB/sec per controller for well-formed IO

[Diagram: Lustre clients connect over InfiniBand or high-speed Ethernet to OSS pairs (OSS1a/OSS1b) fronting RAID controllers (R1a/R1b) on each Neutronium node, N-001 through N-200.]
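The headline figures on this slide follow directly from its own per-unit estimates; a quick arithmetic check in Python:

```python
# Sanity-check the scale-out math using the slide's own estimates.
systems = 200            # ZFS RAID (Neutronium) systems
ctrls_per_system = 2     # dual controllers per system
gb_s_per_ctrl = 3        # GB/sec per controller, well-formed IO
tb_per_system = 200      # usable TB per Neutronium
total_hdds = 16_000      # NL-SAS drives overall

controllers = systems * ctrls_per_system         # 400 controllers
throughput = controllers * gb_s_per_ctrl / 1000  # 1.2 TBytes/sec
usable_pb = systems * tb_per_system / 1000       # 40 PB usable
hdds_each = total_hdds // systems                # 80 drives/system

print(controllers, throughput, usable_pb, hdds_each)
```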
Example Architecture

[Diagram: the same scale-out build (~200TB usable per Neutronium; estimated 3GB/sec per controller for well-formed IO) also serves non-Lustre clients: SMB and NFS gateways export the filesystem over IP (LAN/CAN/WAN), alongside the Lustre clients on InfiniBand or high-speed Ethernet.]
Practical WARP Implementation
• The PetaPod Appliance: