SC’16 Technical Training

Transcript
Page 1: SC’16 Technical Training

Page 2:

All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest Intel product specifications and roadmaps

Tests document performance of components on a particular test, in specific systems. Differences in hardware, software, or configuration will affect actual performance. Consult other sources of information to evaluate performance as you consider your purchase. For more complete information about performance and benchmark results, visit http://www.intel.com/performance.

Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system configuration. No computer system can be absolutely secure. Check with your system manufacturer or retailer or learn more at http://www.intel.com/content/www/us/en/software/intel-solutions-for-lustre-software.html.

You may not use or facilitate the use of this document in connection with any infringement or other legal analysis concerning Intel products described herein. You agree to grant Intel a non-exclusive, royalty-free license to any patent claim thereafter drafted which includes subject matter disclosed herein.

No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.

The products described may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.

Intel disclaims all express and implied warranties, including without limitation, the implied warranties of merchantability, fitness for a particular purpose, and non-infringement, as well as any warranty arising from course of performance, course of dealing, or usage in trade.

This document contains information on products, services and/or processes in development. All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest forecast, schedule, specifications and roadmaps.

Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined". Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information.

3D XPoint, Intel, the Intel logo, Intel Core, Intel Xeon Phi, Optane and Xeon are trademarks of Intel Corporation in the U.S. and/or other countries.

* Other names and brands may be claimed as the property of others.

© 2016 Intel Corporation

Page 3:

Agenda:
– Introductions
– Lustre Overview
– Roadmap Deep Dive
– 10:30 Break
– Lustre on ZFS Update
– Lunch (provided)
– Intel® Omni-Path with Lustre
– Knights Landing with Lustre
– Introduction to Intel® HPC Orchestrator
– Lustre Performance Tuning Review

Page 4:

Intel® Scalable System Framework
A Holistic Solution for All HPC Needs

Small Clusters Through Supercomputers

Compute and Data-Centric Computing

Standards-Based Programmability

On-Premise and Cloud-Based

Intel® Xeon® Processors

Intel® Xeon Phi™ Processors

Intel® FPGAs and Server Solutions

Intel® Solutions for Lustre*

Intel® Optane™ Technology

3D XPoint™ Technology

Intel® SSDs

Intel® Omni-Path Architecture

Intel® Silicon Photonics

Intel® Ethernet

Intel® HPC Orchestrator

Intel® Software Tools

Intel® Cluster Ready Program

Intel Supported SDVis

Compute

Fabric

Memory / Storage

Software

Page 5:

Let’s go around the room!

Page 6:

December 2015: Intel’s Analysis of Top 100 Systems (top100.org)

Parallel file system share of the Top 100: Lustre 71%, GPFS 18%, Other 7%, NFS 4%

9 of Top10 Sites

71% of Top100

Most Adopted PFS

Most Scalable PFS

Open Source GPL v2

Commercial Packaging

Vibrant Community

Page 7:

1 Source: Chris Morrone, Lead of OpenSFS Lustre Working Group, April 2016

Commits per organization: Intel 65%, ORNL* 8%, Seagate* 6%, Cray* 6%, DDN* 3%, Atos* 3%, LLNL* 2%, CEA* 2%, IU 1%, Other 2%

Lines of code per organization: Intel 65%, ORNL 18%, Cray 4%, Atos 2%, Seagate 2%, DDN 2%, IU 1%, CEA 1%, Other 1%

Page 8:


Bioscience: genomic data analysis, modeling and simulations

Government research and defense: government-funded research, surveillance, signal processing, encryption, etc.

Large-scale manufacturing: mechanical, computer-aided design & computer-aided engineering systems

Weather and climate

Energy: seismic processing, reservoir modeling/characterization, sensor data analysis

Finance: fraud detection, Monte Carlo simulations, risk management analysis

Highly complex CGI rendering

Page 9:

Intel® Scalable System Framework for HPC

Intel® FOUNDATION Edition for Lustre* software
Delivers the latest functions and features, fully supported by Intel. Ideal for organizations that prefer to design and deploy their own open source configurations.

Intel® ENTERPRISE Edition for Lustre* software
Maximum performance with minimal complexity and cost for multi-petabyte file systems. Management with Intel® Manager for Lustre* software.

Intel® CLOUD Edition for Lustre* software
Cost-effective access to parallel storage on Amazon Web Services* (AWS) and Microsoft Azure* to boost cloud computing.

Page 10:

Read/Write Heat Map

OST Balance

Metadata Operations

Read/Write Bandwidth

Page 11:

Page 12:

Page 13:

Intel Manager for Lustre

Management Network

High Performance Data Network (InfiniBand*, 10GbE)

Metadata Servers (1-10s)

Object Storage Servers (10s-1000s)

Lustre Clients (1 – 100,000+)

Object Storage Targets (OSTs)

Metadata Target (MDT)

Management Target (MGT)

Native Lustre* Client for Intel® Xeon Phi™ processor

Intel® Omni-Path Support

Robin Hood

OpenZFS, RAIDz

Hadoop* Adapters

HSM

Page 14:

Page 15:

Page 16:

Lustre w/ZFS – Unique Features

ZFS System Design

Software Installation

Lustre ZFS HA Overview

Page 17:

Raidz2: data + 2 parity data protection scheme

Raidz3: data + 3 parity data protection scheme

Vdev: collection of devices (e.g., a raidz2 9+2 vdev)

Zpool: collection of vdevs

Zpools become Lustre OSTs

You can have many vdevs in a zpool

L2ARC cache: ZFS read cache
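To make the terminology concrete, here is a minimal sketch of building a zpool from a single raidz2 9+2 vdev (pool and device names are illustrative; production systems should use persistent multipath names, as covered later in this deck):

#zpool create ost1pool raidz2 sda sdb sdc sdd sde sdf sdg sdh sdi sdj sdk

#zpool status ost1pool

zpool status shows the pool, its vdevs, and the member devices of each vdev.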

Page 18:

Incredible reliability

– Data is always consistent on disk; silent data corruption is detected and corrected; smart rebuild strategy

Compression

– Maximize usable capacity for increased ROI

Snapshot – support built into Lustre

– Consistent snapshot across all the storage targets without stopping the file system.

Hybrid Storage Pool

– Data is tiered automatically across DRAM, SSD/NVMe and HDD accelerating random & small file read performance

Manageability

– Powerful storage pool management makes it easy to assemble and maintain Lustre storage targets from individual devices

Page 19:

Silent Data Corruption is a real world issue: “Data ~= Dada”

Causes:

Interface Design

Manufacturing Defects

Cable Defects

Heat/Power/Vibrations

Software defects

NetApp study*: 1.5 million drives, 41 months, 400,000 errors

* https://atg.netapp.com/wp-content/uploads/2008/03/corruption-fast08.pdf

Page 20:

On Write:

Write data + checksum

On Read:

Read data, re-compute the checksum, and compare to the original

On Error:

If running raidz, discard the read and reconstruct the data from the vdev

Notify the user and continue on
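A conceptual illustration of the same verify-on-read idea using sha256sum (purely illustrative; ZFS computes and verifies block checksums internally and transparently).

On write, store the data together with its checksum:

#sha256sum datafile > datafile.sha256

On read, re-compute and compare; a mismatch means the on-disk data changed:

#sha256sum -c datafile.sha256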

Page 21:

Enable more space allocation to users; minimize hardware costs; fit more data in the same footprint

Increase the file transfer rate: throughput increased by up to 25%. See Laval University’s presentation from HP-CAST 2015: http://www.hp-cast.org/

Compression effects on genomics files: text-based output of genomic sequencing systems; a human genome can generate a 600GB file
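A minimal sketch of enabling compression on a ZFS-backed OST and checking the achieved ratio (dataset name is illustrative; the deck later uses compression=on, which selects the default algorithm):

#zfs set compression=lz4 ost1pool/ost1

#zfs get compressratio ost1pool/ost1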

Page 22:

How Can Lustre* Snapshots Be Used?

Undo/undelete/recover file(s) from the snapshot

Removed a file by mistake, or an application failure left data invalid

Quickly back up the filesystem before a system upgrade

A Lustre/kernel upgrade may hit trouble and need to roll back

Prepare a consistent frozen data view for backup tools

Ensure system is consistent for the whole backup
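For example, a pre-upgrade snapshot using the lctl snapshot commands shown later in this deck (fsname and snapshot name are illustrative):

#lctl snapshot_create -F myfs -n before_upgrade -b

If the upgrade goes wrong, mount the snapshot as a read-only filesystem and recover files from it:

#lctl snapshot_mount -F myfs -n before_upgrade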

Page 23:

ZFS-based Lustre* Snapshot Overview

Userspace lctl snapshot commands drive two control paths: Lustre control via the lctl API to the MGS, and ZFS control on the MDSs and OSSs (Lustre kernel plus the ZFS tool set on each server).

ZFS snapshot created on each target with a new fsname

Mount as separate read-only Lustre filesystem on client(s)

Architecture details: http://wiki.lustre.org/Lustre_Snapshots

Page 24:

Global Write Barrier

“Freeze” the system while creating the snapshot pieces on every target.

Write barrier on MDTs only

No orphans, no dangling references

New lctl commands for the global write barrier

lctl barrier_freeze <fsname> [timeout (seconds)]

lctl barrier_thaw <fsname>

lctl barrier_stat <fsname>
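A minimal usage sketch (fsname and timeout are illustrative):

#lctl barrier_freeze myfs 30

#lctl barrier_stat myfs

#lctl barrier_thaw myfs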

Page 25:

Two Phase Global Write Barrier Setup

Participants: user, MGS, MDT_0 … MDT_n (OST_0 … OST_x appear in the diagram, but the barrier applies to MDTs only):

0. User runs lctl barrier_freeze.
1. MGS starts FREEZE1.
2. Each MDT gets the barrier lock from the MGS, blocks client modifications, and flushes in-flight RPCs.
3. Each MDT notifies the MGS that FREEZE1 is done.
4. The MGS waits for all FREEZE1 notifications, then starts FREEZE2.
5. Each MDT gets the barrier lock from the MGS and syncs/commits local transactions.
6. Each MDT notifies the MGS that FREEZE2 is done.
7. The MGS waits for all FREEZE2 notifications. Barrier done.

Page 26:

Fork/Erase Configuration Logs

Snapshot is independent from the original filesystem

New filesystem name (fsname) is assigned to the snapshot

The fsname is part of the configuration log names

The fsname appears in the configuration log entries

New lctl commands for fork/erase configuration logs

lctl fork_lcfg <fsname> <new_fsname>

lctl erase_lcfg <fsname>

Page 27:

Mount Snapshot Read-only

Any modification of a ZFS snapshot can trigger a backend failure/assertion

Open the ZFS dataset in read-only mode

Do NOT start the cross-server sync thread, pre-create thread, or quota thread

Skip sequence file initialization, orphan cleanup, and recovery

Ignore last_rcvd modification

Deny transaction creation

Forbid LFSCK

Page 28:

Userspace Interfaces – lctl snapshot_xxx

Create snapshot:
lctl snapshot_create [-b | --barrier] [-c | --comment comment] <-F | --fsname fsname> [-h | --help] <-n | --name ssname> [-r | --rsh remote_shell] [-t | --timeout timeout]

Destroy snapshot:
lctl snapshot_destroy [-f | --force] <-F | --fsname fsname> [-h | --help] <-n | --name ssname> [-r | --rsh remote_shell]

Modify snapshot attributes:
lctl snapshot_modify [-c | --comment comment] <-F | --fsname fsname> [-h | --help] <-n | --name ssname> [-N | --new new_ssname] [-r | --rsh remote_shell]

List the snapshots:
lctl snapshot_list [-d | --detail] <-F | --fsname fsname> [-h | --help] [-n | --name ssname] [-r | --rsh remote_shell]

Mount snapshot:
lctl snapshot_mount <-F | --fsname fsname> [-h | --help] <-n | --name ssname> [-r | --rsh remote_shell]

Umount snapshot:
lctl snapshot_umount <-F | --fsname fsname> [-h | --help] <-n | --name ssname> [-r | --rsh remote_shell]
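Putting these together, an end-to-end sketch (fsname and snapshot name are illustrative):

#lctl snapshot_create -F myfs -n nightly -b

#lctl snapshot_list -F myfs -d

#lctl snapshot_mount -F myfs -n nightly

#lctl snapshot_umount -F myfs -n nightly

#lctl snapshot_destroy -F myfs -n nightly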

Page 29:

Write Barrier Scalability

CPU: Intel® Xeon® E5620 @2.40GHz

– 4 cores * 2, HT

RAM: 64GB DDR3

Network: InfiniBand QDR

Storage: SATA disk arrays

2 MDTs per MDS

4 OSTs per OSS

[Chart] Write Barrier Scalability: barrier_freeze time (seconds) vs. MDT count (1, 2, 4, 8), measured with the filesystem idle and busy. All runs completed in roughly 14-20 seconds, with only modest growth as MDTs were added.

Page 30:

Snapshot I/O Scalability

[Charts] Snapshot Scalability with MDTs: snapshot_create time (seconds) vs. MDT count (1, 2, 4, 8). Snapshot Scalability with OSTs: snapshot_create time (seconds) vs. OST count (2, 4, 8, 16). Each configuration was measured idle and busy, with and without the global write barrier: with the barrier, snapshot_create took roughly 21-33 seconds; without it, roughly 1-2 seconds in every configuration.

Page 31:

I/O Performance With Snapshots

Limited impact on metadata performance

– Measured via mds-survey on single MDT

– Slight benefit as changed blocks not freed

No significant impact on I/O performance

– Measured via obdfilter-survey on one OST

Not Lustre*-specific; ZFS is COW-based

[Chart] Metadata Performance Impact: objects/second for destroy and create, with and without a snapshot present; all four measurements fell between roughly 13,500 and 15,900 objects/second.

Page 32:

Next Steps for Snapshot Feature

Phase I: scheduled for Community Lustre 2.10/EE 3.0 release landing

Phase II: Lustre* integrated snapshot

– Depends on users’ requirements vs. other Lustre features, performance, etc.

– More controllable and relatively independent solution

– Reuse Phase I global write barrier

– Integrate snapshot creation/mount/unmount into OSD

– Identify files/objects in each snapshot as part of File Identifier (FID)

Page 33:

L2ARC Cache is supported

Read Cache

Local NVMe/SSD

1 L2ARC per Zpool

Read Test:

3.8 Million 64K Files

16 Clients

16-HDD raidz2 zpool

1 Intel® SSD DC P3700 (NVMe)
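A minimal sketch of attaching an L2ARC read cache device to an existing pool (pool and device names are illustrative):

#zpool add ost1pool cache /dev/nvme0n1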

Page 34:

ZFS manages:

– Disks: raidz2, raidz3, mirror, …

ZFS allows:

– OST and disk management in one place

Large OST example:

– 4 x (9+2 raidz2 vdevs)

– 1 OST per OSS possible

– Single storage management interface
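A hedged sketch of that large-OST layout: one zpool built from four raidz2 vdevs of 11 drives each (9 data + 2 parity; device names are illustrative, and persistent multipath names should be used in production):

#zpool create -o ashift=12 ost1pool \
raidz2 sda sdb sdc sdd sde sdf sdg sdh sdi sdj sdk \
raidz2 sdl sdm sdn sdo sdp sdq sdr sds sdt sdu sdv \
raidz2 sdw sdx sdy sdz sdaa sdab sdac sdad sdae sdaf sdag \
raidz2 sdah sdai sdaj sdak sdal sdam sdan sdao sdap sdaq sdar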

Page 35:

Changes for using ZFS more efficiently

– Improved file create performance

– Snapshots of whole file system

Changes to core ZFS code

– Inode quota accounting

– Multi-mount protection for safety

– System and fault monitoring improvements

– Large dnodes for improved extended attribute performance

– Reduce CPU usage with hardware-assisted checksums, compression

– Declustered parity & distributed hot spares to improve re-silvering

– Metadata allocation class to store all metadata on SSD/NVRAM

Page 36:

Path to Exascale

CORAL and future follow-on architectures are scoped with ZFS.

LLNL Sequoia1 (55PB File System)

Cheaper, less complex, higher performance file system for Sequoia

With Intel, Lustre and ZFS continue to advance

Collaborate with OpenZFS community on new features.

Improve metadata performance: LAD’16 Talk

36

1 http://computation.llnl.gov/projects/zfs-lustre

Page 37:

Native Encryption: built-in encryption for data at rest to provide enhanced storage security.

Persistent Read Cache: update of the existing L2ARC read cache to persist data across reboots.

Performance Enhancements: ZFS improvements for increased metadata performance.

Fault Management: enhanced fault monitoring and management architecture for ZFS.

D-RAID: de-clustered RAIDZ provides massively improved rebuild performance after a drive failure.

Parity Acceleration: using AVX instructions to accelerate parity calculation.

(Intel IPCC / OpenZFS)

Page 38: SC’16 Technical Training...71% of Top100 Most Adopted PFS Most Scalable PFS Open Source GPL v2 Commercial Packaging Vibrant Community 6 1 Source: Chris Morrone, Lead of OpenSFS Lustre

The CDDL (the license of OpenZFS) and GPLv2 (the license of Linux) are considered incompatible by the FSF (the authors of the GPL; see https://www.gnu.org/licenses/license-list.html#CDDL), but this does not prohibit end users from using OpenZFS and Linux together in ways that don’t invoke that incompatibility. Intel does not distribute compiled binaries of OpenZFS kernel modules for Linux. Intel provides DKMS packages which help our customers automatically build OpenZFS modules from source for use on their own systems. Consider seeking legal advice for any activities that might be considered “distribution” under GPLv2.

Page 39:

Page 40:

1x 12Gb SAS port ~ 4GB/s block level

1x2 12Gb SAS ports ~ 6GB/s block level (PCIe x8 limitation)

2x2 12Gb SAS ports ~ 12GB/s block level

Decide how many spare drives to use

Understand the internal JBOD SAS layout

Develop a specific strategy for alignment

Decide which SAS port will control which group of drives

Page 41:

Raidz 9+2 or larger data-drive counts give the best write performance

60 drive JBOD ~ 13+2 x 4

90 drive JBOD ~ 9+2 x 8 (plus 2 Hot Spares)

84 drive JBOD ~ 12+2 x 7 (imbalanced)

Consider Spares

Use 1M Record Size on OSTs

Important for performance

How to connect enough SAS Cables?

Page 42:

Multi-Path:

Configure priority path failover groups

Round-robin kills performance

Align vdevs to specific paths

Not documented by Red Hat; partners have their own multipath configurations

Zoning:

Zone the JBOD to match vdev alignment

A cable pull requires HA failover

Page 43:

ECC Memory Mandatory

The CPU has more duties than with LDISKFS

Parity

Compression

CRC

ZFS Adaptive Read Cache needs Memory

128GB+ Recommended

obdfilter-survey will run a few GB/s less than IOR

Very helpful for Development

Page 44:

Install Process FE/Community

yum -y install kernel-devel dkms-2.2.0.3-30.git.7c3e7c5.el7.noarch.rpm

#install and configure Fabric (MOFED/IFS)

cd archive/artifacts/RPMS/x86_64/

yum -y install lib*.rpm lustre-osd-zfs-mount*.x86_64.rpm

yum -y install spl-dkms-*.noarch.rpm zfs-dkms-*.rpm lustre-dkms-*.noarch.rpm

yum -y install lustre-2.7.*.rpm zfs-0.6.5*.rpm spl-0.6.5*.rpm lustre-osd-zfs*.rpm
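A hedged post-install check, not part of the original steps: confirm the DKMS modules built for the running kernel, then load them.

#dkms status

#modprobe zfs && modprobe lustre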

Intel EE 3.1: Add Server in IML GUI

Lustre Master: Build ZFS and Lustre from SRC (Not Covered)

Page 45:

zpool create ost1 raidz2 /dev/mapper/mpath1 …

zpool add ost1 raidz2 /dev/mapper/mpath12 …

zpool status

Shows all zpools and their current status

zpool export ost1

Exports the zpool (makes it available for HA import)

zpool import ost1

Imports the named zpool

Zpools become Lustre targets

Page 46:

Drive Naming Considerations

ZFS Services and HA

Corosync and Pacemaker

ZFS Lustre resource type

Lustre* and OpenZFS* Installation and Configuration Guide

Complete instructions

Page 47:

/dev/sdX won’t work…

Names change on reboot / on the other HA server

Single SAS path

/dev/disk/by-id/

Defaults to the first path found

/dev/disk/by-path

Maybe, but zoning is better

/dev/mapper/mpath

Sync mpath settings between servers

Page 48:

A zpool visible on 2 nodes (via zpool status)

Can quickly corrupt the file system

No multi-mount protection (MMP feature in development)

Extreme care is required

Partial MMP via “hostid”:

#genhostid

#reboot

Requires “zpool import -f” to override

Page 49:

ZFS caches pool configuration so that pools are re-imported automatically on reboot

Disable this behavior for HA

zpool create -f -o ashift=12 -o cachefile=none ostN driveA….

rm /etc/zfs/zpool.cache

systemctl disable zfs.target

Test:

Create and Export Zpool

Import on other HA node

Reboot
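A hedged sketch of that sanity test (host, pool, and device names are illustrative).

On oss1, create the pool with no cachefile and hand it off:

#zpool create -f -o ashift=12 -o cachefile=none ost1 raidz2 /dev/mapper/mpatha /dev/mapper/mpathb /dev/mapper/mpathc /dev/mapper/mpathd

#zpool export ost1

On oss2, import the pool, reboot, and confirm it does NOT auto-import:

#zpool import ost1

#reboot

#zpool list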

Page 50:

Intel EE 3.1 (some detail still TBD)

Go into IML and do Create Filesystem with Zpools as Targets

Intel FE/Community

Create Lustre FS as normal with ZFS Syntax

mkfs.lustre --ost --backfstype=zfs --fsname=$FSNAME \

--servicenode=$OSS1@tcp --servicenode=$OSS2@tcp \

--index=$index --mgsnode=$MGS_NID <Zpool_Name>/OSTX

zfs set recordsize=1M <Zpool_Name>/OSTX

zfs set compression=on <Zpool_Name>/OSTX

Page 51:

For non HA testing

Mount and use targets as normal

mount -t lustre <pool_name>/ostx /mnt/ostx

Pacemaker Lustre ZFS Resource Setup

Install the LustreZFS file

Get the file from LU-8455

#cp LustreZFS /usr/lib/ocf/resource.d/heartbeat/

#chmod 755 /usr/lib/ocf/resource.d/heartbeat/LustreZFS

Creates ZFS Lustre Target Resource Types for HA Framework

Page 52:

Configure 2nd Direct Interface Between HA Pairs

Ring0 = Management network

Ring1 = Direct Connect between HA Nodes

#pcs cluster auth $OSS1 $OSS2 -u hacluster -p $HA_PW --force

#pcs cluster setup --start --name lustre_ha $OSS1,$OSS1RING2 $OSS2,$OSS2RING2 --token 17000 --join 100 --force

#pcs cluster enable --all

Page 53:

Shoot The Other Node In The Head (STONITH)

On an automated failover action, one node shuts the other down

IPMI or Power Distribution Unit (PDU)

(IPMI Example)

#pcs stonith create a-ipmi fence_ipmilan ipaddr="$IPMI1" lanplus=true \
passwd="$IPMI1PW" login="root" pcmk_host_list="$OSS1HN"

#pcs stonith create b-ipmi fence_ipmilan ipaddr="$IPMI2" lanplus=true \
passwd="$IPMI2PW" login="root" pcmk_host_list="$OSS2HN"

#pcs cluster sync

Page 54:

Add Lustre ZFS Resources to HA service

Export all zpools from all servers

Create mount points on both servers

#pcs resource create OSTX ocf:heartbeat:LustreZFS pool=<zpool_name> \
volume=OSTX mountpoint="/mnt/OSTX"

#pcs constraint location OSTX prefers $OSS1=10

#pcs constraint location OSTX prefers $OSS2=20

#pcs cluster sync

Repeat for all Lustre Targets

Page 55:

Create zpools that perform well

Raidz2 9+2 or larger

Use persistent drive identifiers (zoning / multipath)

Reproducible placement of OSTs

Intel EE 3.1+

Add server in IML -> create pools on the command line -> use IML to create the FS

Community / FE builds

Add Lustre targets to Pacemaker as LustreZFS resource types

Set constraints

Sync the cluster

Page 56: