Page 1: Lustre+ZFS: Reliable/Scalable Storage

© 2012 WARP Mechanics Ltd. All Rights Reserved.

Josh Judd, CTO

Lustre+ZFS: Reliable/Scalable Storage

Page 2: Lustre+ZFS: Reliable/Scalable Storage

ZFS+Lustre: Open Storage Layers

• ZFS:
– Volume management layer (RAID)
– Reliable storage (checksums)
– Feature rich (snap, compression, replication, etc.)
– Accelerators (SSD hybrid)
– Scalable (e.g., 16 exabytes for one file)

• Lustre:
– Linux-centric scale-out filesystem
– Powers the world’s largest supercomputers
– Single FS can be 10s of PB with TBs/sec of throughput
– Can sit on top of ZFS

Page 3: Lustre+ZFS: Reliable/Scalable Storage

What does ZFS do for me?

• HPC-relevant features:
– Support for 1/10GbE, 4/8/16Gb FC, and 40Gb Infiniband
– Multi-layer cache combines DRAM and SSDs with HDDs (see the sketch after this list)
– Copy-on-write eliminates holes and accelerates writes
– Checksums eliminate silent data corruption and bit rot
– Snap, thin provisioning, compression, de-dupe, etc. built in
– Lustre and SNFS integration allows 40GbE networking

• Same software/hardware supports NAS and RAID

– One management code base to control all storage platforms

• Open storage. You can have the source code.
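The multi-layer cache bullet on this slide can be made concrete with a small Python sketch. It is purely illustrative and not ZFS code: the class name TieredCache, the tier sizes, and the block keys are invented for the example. The idea it shows is the one ARC/L2ARC implement: reads are served from DRAM when possible, then from SSD, and only fall back to HDD on a miss, with blocks promoted upward and DRAM evictions spilling down to SSD. "Pushing" a directory ahead of time is then just warming the cache with its blocks.

```python
# Conceptual sketch only -- not ZFS source code. It mimics a multi-layer read
# cache: a small, fast DRAM tier (ARC-like), a larger SSD tier (L2ARC-like),
# and slow HDDs as the backing store.
from collections import OrderedDict

class TieredCache:
    def __init__(self, dram_blocks, ssd_blocks, hdd):
        self.dram = OrderedDict()          # fastest, smallest tier (ARC-like)
        self.ssd = OrderedDict()           # larger, slower tier (L2ARC-like)
        self.dram_blocks = dram_blocks
        self.ssd_blocks = ssd_blocks
        self.hdd = hdd                     # dict standing in for spinning disks

    def read(self, block_id):
        if block_id in self.dram:          # DRAM hit: cheapest path
            return self.dram[block_id]
        if block_id in self.ssd:           # SSD hit: promote the block to DRAM
            data = self.ssd.pop(block_id)
        else:                              # miss: go to HDD, the slow path
            data = self.hdd[block_id]
        self._insert(self.dram, block_id, data, self.dram_blocks)
        return data

    def _insert(self, tier, block_id, data, limit):
        tier[block_id] = data
        if len(tier) > limit:              # evict the oldest entry when full;
            old_id, old_data = tier.popitem(last=False)
            if tier is self.dram:          # DRAM evictions spill down to SSD
                self._insert(self.ssd, old_id, old_data, self.ssd_blocks)

# "Pushing" a directory ahead of time is just warming the cache with its blocks:
hdd = {f"blk{i}": b"..." for i in range(1000)}
cache = TieredCache(dram_blocks=8, ssd_blocks=64, hdd=hdd)
for blk in ("blk1", "blk2", "blk3"):       # pre-read so later reads hit SSD/DRAM
    cache.read(blk)
```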

Page 4: Lustre+ZFS: Reliable/Scalable Storage

ZFS Feature Focus

• Enhanced data integrity:
– Legacy RAID is subject to silent corruption and holes
– Little-known fact: virtually all RAIDs do not check parity on each read – so if a bit flipped, you just get the wrong data!
– ZFS adds a checksum to every block to solve this (see the sketch after this list)

• Advanced cache:
– Legacy RAID has insufficient cache to be meaningful
– ZFS-based WARPraid supports 100s of GB of DRAM, plus 10s of TBs of SSD
– E.g., you can “push” directories onto SSD cache ahead of time to drastically accelerate read-intensive workloads
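The per-block checksum bullet above is the key difference from legacy RAID, and a minimal Python sketch shows the mechanism. It is illustrative only: real ZFS stores each block's checksum in its parent block pointer and can self-heal from a redundant copy, while the ChecksummedStore class, the side table of digests, and SHA-256 as the hash are assumptions made just for this example. The point is simply that a read which fails verification is reported instead of silently returned.

```python
# Illustrative sketch of per-block checksumming -- not ZFS's on-disk format.
import hashlib

class ChecksummedStore:
    def __init__(self):
        self.blocks = {}       # block_id -> bytes (stands in for the disk)
        self.checksums = {}    # block_id -> digest recorded at write time

    def write(self, block_id, data: bytes):
        self.blocks[block_id] = data
        self.checksums[block_id] = hashlib.sha256(data).digest()

    def read(self, block_id) -> bytes:
        data = self.blocks[block_id]
        if hashlib.sha256(data).digest() != self.checksums[block_id]:
            # In real ZFS this triggers repair from a redundant copy; here we
            # just refuse to hand back silently corrupted data.
            raise IOError(f"checksum mismatch on {block_id}: corruption detected")
        return data

store = ChecksummedStore()
store.write("blk0", b"important data")
# Simulate a silent bit flip on the "disk": legacy RAID would return this as-is.
store.blocks["blk0"] = b"importent data"
try:
    store.read("blk0")
except IOError as e:
    print(e)   # checksum mismatch on blk0: corruption detected
```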

Page 5: Lustre+ZFS: Reliable/Scalable Storage

What does Lustre do for me?

• Category: Distributed (parallel-ish) filesystem
– Similar(-ish) to StorNext, GPFS, pNFS, etc.

• Allows performance and capacity to scale independently (see the sketch after this list)

• One FS can be 10s of PB with TBs/sec of throughput

• No theoretical upper limit on architectural scalability
– Yes, there are practical limits...
– But those expand every year, and...
– Even a “practical” Lustre FS can be 10s or 1000s of times larger than other FSs on the market
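The independent-scaling bullet above follows from how a parallel filesystem lays a file out in stripes across many object storage targets. The Python sketch below is a back-of-the-envelope model, not Lustre code: the 1 MiB stripe size and the 3 GB/s-per-server figure are placeholder assumptions. It shows that streaming bandwidth grows with the number of servers a file is striped across, while capacity grows with the disks behind them, so either dimension can grow without the other.

```python
# Illustrative model of file striping in a parallel filesystem -- not Lustre code.
# A file is cut into fixed-size stripes laid out round-robin across OSTs, so a
# large read fans out to many servers at once. Stripe size and the 3 GB/s
# per-server rate below are placeholder numbers, not product specifications.
STRIPE_SIZE = 1 << 20      # 1 MiB stripes

def ost_for_offset(offset: int, stripe_count: int) -> int:
    """Return the index of the OST that holds the byte at `offset` (round-robin)."""
    return (offset // STRIPE_SIZE) % stripe_count

def servers_touched(file_size: int, stripe_count: int) -> int:
    """How many OSTs participate when the whole file is read back."""
    stripes = (file_size + STRIPE_SIZE - 1) // STRIPE_SIZE
    return min(stripes, stripe_count)

# A 1 GiB file striped across 16 OSTs is served by 16 servers in parallel, so
# streaming bandwidth is roughly 16x the per-server rate, while adding more or
# larger disks behind each OST grows capacity without changing this at all.
file_size = 1 << 30
for stripe_count in (1, 4, 16):
    n = servers_touched(file_size, stripe_count)
    print(f"{stripe_count:>2} OST stripe: ~{n * 3} GB/s aggregate (at 3 GB/s each)")
```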

Page 6: Lustre+ZFS: Reliable/Scalable Storage

How do they combine?

• ZFS is currently supportable on Solaris. Lustre is currently supportable on Linux. So... How do they mix?

• In theory, any of three ways:
– Port Lustre to Solaris – has not been done
– Port ZFS to Linux – in progress – replace MD/RAID and EXT4
– Use ZFS on Solaris as a RAID controller under Lustre on Linux

• WARP Mechanics focuses on the third option
– Is supportable in production now
– Maintains full separation of code
– Still allows ZFS to replace EXT4 down the road, while performing volume management on separate controllers – which aids performance and scalability

Page 7: Lustre+ZFS: Reliable/Scalable Storage

Example Architecture

[Slide diagram: Lustre clients connect over QDR/FDR IB or 10/40Gbps Ethernet to four OSS nodes (OSS 1a/1b and OSS 2a/2b). Behind them sit two ZFS RAID systems, each a controller pair (RAID 1a/1b and RAID 2a/2b) with ARC, L2ARC, and ZIL in front of ~250TB of HDDs. The controllers are linked point-to-point (P2P) over IB or Ethernet.]

Page 8: Lustre+ZFS: Reliable/Scalable Storage

Example Architecture

Scale-out example:
– 200 ZFS RAID systems = 400 controllers = 1.2 TBytes/sec
– 16,000 NL-SAS HDDs = ~40PB usable
– ~200TB usable per Neutronium; estimated 3GB/sec per controller with well-formed IO (the arithmetic is sketched below)
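The totals above are straight multiplication from the per-unit figures on the slide; the short Python sketch below just reproduces that arithmetic. The 2-controllers-per-system and 80-HDDs-per-system values are not stated directly on the slide; they are the quoted totals (400 controllers, 16,000 HDDs) divided back across the 200 systems.

```python
# Reproduces the slide's scale-out arithmetic from its per-unit figures.
# 3 GB/s per controller and ~200 TB usable per system come from the slide;
# controllers_per_system and hdds_per_system are the totals divided back out.
systems = 200                      # ZFS RAID systems ("Neutronium" bricks)
controllers_per_system = 2
gb_per_sec_per_controller = 3      # estimated, for well-formed I/O
hdds_per_system = 80               # NL-SAS drives
usable_tb_per_system = 200         # approximate usable capacity after RAID

controllers = systems * controllers_per_system                           # 400
throughput_tb_per_sec = controllers * gb_per_sec_per_controller / 1000   # 1.2
total_hdds = systems * hdds_per_system                                   # 16,000
usable_pb = systems * usable_tb_per_system / 1000                        # ~40

print(f"{controllers} controllers, ~{throughput_tb_per_sec} TB/s aggregate")
print(f"{total_hdds} HDDs, ~{usable_pb:.0f} PB usable")
```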

[Slide diagram: Lustre clients connect over Infiniband or high-speed Ethernet to Neutronium bricks N-001 through N-200, each with an OSS1a/OSS1b pair in front of RAID controllers R1a/R1b.]

Page 9: Lustre+ZFS: Reliable/Scalable Storage

Example Architecture

– ~200TB usable per Neutronium; estimated 3GB/sec per controller with well-formed IO

[Slide diagram: Lustre clients connect over Infiniband or high-speed Ethernet; non-Lustre clients connect over IP (LAN/CAN/WAN) using SMB and NFS.]

Page 10: Lustre+ZFS: Reliable/Scalable Storage

Practical WARP Implementation

• The PetaPod Appliance: