• Slide 1/24

NCCS Lustre File System Overview

Presented by Sarp Oral, Ph.D.

NCCS Scaling Workshop

Oak Ridge National Laboratory

U.S. Department of Energy

August 1st, 2007

• Slide 2/24

Outline

    What is Lustre

    NCCS Jaguar Lustre

    Other NCCS Lustre File Systems

    Lustre Centre of Excellence (LCE) at ORNL


• Slide 3/24

What is Lustre

Lustre is:

POSIX compliant

A parallel file system

Lustre provides:

High scalability

High performance

A single global namespace

Lustre is a software-only architecture.


• Slide 4/24

Lustre Architecture

Lustre consists of four major components:

MetaData Server (MDS)

Object Storage Servers (OSSs)

Object Storage Targets (OSTs)

and, of course, clients

MDS: manages the namespace and directory and file operations; stores file system metadata.

OSS: manages the OSTs.

OST: manages underlying block devices; stores file data stripes.

• Slide 5/24

Lustre Architecture

[Diagram: clients talk to the MDS for metadata operations (file creation, stats, recovery) and to the OSSs for block I/O and file locking; each OSS fronts OSTs backed by block devices.]

• Slide 6/24

Lustre Architecture

All servers have a full-blown file system they operate on.

Today, only a single active MDS is supported; the goal is to have many MDSs in the near future.

The whole file system is limited by that single MDS's performance. Although not that bad, it can sometimes be a problem.


• Slide 7/24

Lustre Architecture

Failover:

Active-passive pairs for MDS and OSS.

Works fine on all *NIX-based systems except Catamount; failover is not supported with the current UNICOS, but will be supported with CNL.

Supports sparse files.

We are using 2 TB partitions.

Unlike all other *NIX-based systems, on Catamount clients Lustre access is achieved over liblustre: Catamount clients are uninterruptible and I/O is not cached, and liblustre is directly linked into the application.


• Slide 8/24

Lustre Architecture

Striping is the key for achieving high scalability and performance.

File data is written to and read from multiple OSTs. This provides higher aggregate R/W bandwidth than a single server can deliver, and allows file sizes to be larger than a single OSS could handle.

Simple tips:

Over-striping might be bad: the chunks written to each OST become too small, under-utilizing the OSTs and the network.

Under-striping might be bad: too much stress on each OST.
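
A hypothetical illustration of the trade-off: with a 1 MB stripe size, a 4 MB file striped over 64 OSTs leaves 60 of them empty and writes at most 1 MB to each OST it does touch, wasting the width; the same file striped over 4 OSTs gives every OST a full stripe. Conversely, a 1 TB file with a stripe count of 1 funnels all of its I/O through a single OST.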


• Slide 9/24

Lustre Architecture

The stripe pattern can be changed by the user, but only before the file or directory is created; once created, the stripe pattern is fixed.

lfs setstripe sets the stripe pattern.

lfs getstripe queries the stripe pattern.

Within an application, several low-level ioctl calls are available to set and query stripe patterns and some other extended attributes (EAs). An example of the command-line usage follows.
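
A minimal command-line sketch (paths are hypothetical; the flags follow Lustre 1.x usage, where -s sets the stripe size and -c the stripe count; later releases renamed -s to -S):

    lfs setstripe -s 4m -c 8 /lustre/scratch/mydir    # new files under mydir: 4 MB stripes over 8 OSTs
    lfs getstripe /lustre/scratch/mydir/out.dat       # report the stripe pattern of an existing file

Setting the pattern on a directory makes it the default for files subsequently created inside it.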


• Slide 10/24

Lustre Architecture

[Diagram: Files A, B, and C laid out across OST1-OST3, shown once single-striped and once two-striped.]

Stripe count (or width): the number of OSTs the file has been striped over.

Stripe size: the size of each stripe on an OST; normally the same for all OSTs for a given file.
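
The layout rule behind the diagram is plain round-robin arithmetic; a small self-contained sketch of it (my code, not Lustre source, reproducing only the mapping rule):

    #include <stdio.h>

    /* Illustrative round-robin striping math: which OST (by index within the
     * file's stripe set) holds a given byte offset, and where within that
     * OST's object the byte lands. */
    static void locate(long long offset, long long stripe_size, int stripe_count)
    {
        long long stripe_no = offset / stripe_size;      /* global stripe number */
        int ost_idx = (int)(stripe_no % stripe_count);   /* OST within the stripe set */
        long long obj_off = (stripe_no / stripe_count) * stripe_size
                            + offset % stripe_size;      /* offset inside that object */
        printf("offset %lld -> OST #%d, object offset %lld\n",
               offset, ost_idx, obj_off);
    }

    int main(void)
    {
        /* e.g., 1 MB stripes over 2 OSTs, as in the two-striped example above */
        locate(0,       1 << 20, 2);
        locate(1 << 20, 1 << 20, 2);
        locate(3 << 20, 1 << 20, 2);
        return 0;
    }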


• Slide 11/24

Lustre Architecture

Everything is based on RPCs, and sometimes messages are dropped or lost.

Timeouts: if the error is caused by the client side, the client will simply disconnect from that particular server and keep retrying to connect.

Eviction: if the error is caused by the server side, the client will discover it has been evicted on its next request. All of the client's buffer cache will be invalidated, and dirty data will be lost.
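
A conceptual sketch of these two failure paths (illustrative C with stub actions standing in for the real client behavior; not Lustre code):

    #include <stdio.h>

    /* The two RPC failure modes described above. */
    enum rpc_result { RPC_OK, RPC_TIMED_OUT, RPC_EVICTED };

    static void disconnect_and_retry(void) { printf("disconnect; keep retrying connect\n"); }
    static void invalidate_cache(void)     { printf("invalidate buffer cache; dirty data lost\n"); }

    static void handle_rpc_result(enum rpc_result r)
    {
        if (r == RPC_TIMED_OUT)      /* client-side error: time out, reconnect */
            disconnect_and_retry();
        else if (r == RPC_EVICTED)   /* server-side error, seen on the next request */
            invalidate_cache();
    }

    int main(void)
    {
        handle_rpc_result(RPC_TIMED_OUT);
        handle_rpc_result(RPC_EVICTED);
        return 0;
    }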


• Slide 12/24

Lustre Architecture

The architecture changed with Lustre 1.4.6: LNET and LNDs.

Independent network conduits were introduced. A single network and recovery layer establishes the connection between the upper Lustre file system layers and the lower network conduits: TCP, Cray Portals, InfiniBand, Myricom, Elan.

[Diagram: the Lustre File System Layer sits on the Lustre Networking Layer (LNET); per-fabric LNDs (TCP, Myricom, Elan) connect LNET to the corresponding vendor libraries.]
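
On Linux clients of that era, LNET selected its conduits through module parameters; a minimal sketch of such a configuration (interface and network names are hypothetical, not the NCCS settings):

    # /etc/modprobe.conf
    options lnet networks="tcp0(eth0),o2ib0(ib0)"   # TCP LND on eth0, OFED IB LND on ib0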

• Slide 13/24

Lustre Architecture

POSIX compliant (in a cluster):

Clients don't see stale data or metadata; semantics are guaranteed by strict locking.

Exceptions: flock/lockf is still not supported.

Security: Kerberos capabilities are on the way, and an encrypted Lustre file system is under development.


• Slide 14/24

NCCS Jaguar Lustre

Cray XT4 (Catamount) Lustre, 3-D torus.

Uses 3 MDS service nodes and 72 XT4 service nodes as OSSs:

2 OSTs per OSS for the 300 TB FS

1 OST per OSS for each remaining 150 TB FS

2 single-port 4 Gb FC HBAs per OSS

45 GB/s block I/O for the 300 TB FS

DDN 9550s:

18 racks/couplets; write-back cache is 1 MB on each controller

36 TB per couplet with Fibre Channel drives

Each LUN has a capacity of 2 TB and a 4 KB block size.
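
A quick sanity check on these figures (my arithmetic, not on the slide): 72 OSSs x 2 OSTs x 2 TB LUNs = 288 TB, consistent with the quoted 300 TB file system; and 72 OSSs x 2 HBAs x ~400 MB/s per 4 Gb FC link gives roughly 57 GB/s of raw link bandwidth, so 45 GB/s of delivered block I/O is close to 80% of line rate.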

• Slide 15/24

NCCS Jaguar Lustre

    Cray XT4 (Catamount) LUN configuration


• Slide 16/24

NCCS Jaguar Lustre

    Default stripe count/width on Jaguar (Catamount)

    Default stripe size on Jaguar (Catamount)
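
On a live system the defaults can be read back from a directory on the mount, e.g. (path hypothetical):

    lfs getstripe /lustre/scratch    # on a directory, reports its default stripe count and stripe size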


• Slide 17/24

NCCS Jaguar Lustre

Jaguar Compute Node Linux (CNL) Lustre:

Compared to the Catamount side, the exact configuration details are still to be determined.

A mechanism to transfer files between the Catamount side and the CNL side: can be done by users.


• Slide 18/24

Other NCCS Lustre File Systems

End-to-end cluster (Ewok):

1 MDS, 6 OSS, 2 OST/OSS, OFED 1.1 IB, 81 clients

20 TB, ~3-4 GB/s

Viz cluster (Everest), coming soon:

1 MDS, 10 OSS, 2 OST/OSS, OFED 1.2

~a couple tens of TB, ~4-5 GB/s


• Slide 19/24

Other NCCS Lustre File Systems

Center-wide Lustre cluster (Spider):

To serve all NCCS resources: Jaguar, Everest, and Ewok by the end of 2007; Baker by the end of 2008; and all new additions from that point on.

Phase 0: will be in production soon over Jaguar.

10 couplets of DDN 8500s; FC 2 Gb direct links with failover configured.

Phase 1: an additional 20 GB/s by the end of 2007.


• Slide 20/24

Other NCCS Lustre File Systems

Spider provides:

+ Ability to analyze data offline

+ On-the-fly data analysis/visualization capability

+ Ease of diagnostics/decoupling

+ Lower acquisition/expansion cost


• Slide 21/24

Other NCCS Lustre File Systems

Lustre router nodes on Jaguar route Lustre packets between:

TCP to/from Cray Portals

IB to/from Cray Portals

~450 MB/s per XT4 SIO node over TCP/Cray Portals

~600-700 MB/s per XT4 SIO node over IB/Cray Portals
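
Routing of this kind is expressed through LNET module parameters; a rough sketch of the shape such a configuration takes (network names and addresses are hypothetical, not the actual NCCS settings):

    # on an IB-network client: declare a gateway toward the Cray Portals net
    options lnet networks="o2ib0(ib0)" routes="ptl0 192.168.0.10@o2ib0"
    # on the router node itself: sit on both nets and forward between them
    options lnet networks="ptl0,o2ib0(ib0)" forwarding="enabled"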


• Slide 22/24

Other NCCS Lustre File Systems

[Diagram: Jaguar SIO router nodes bridge Cray Portals to the InfiniBand and TCP networks; Ewok and Everest reach Spider and its backend disks over those networks.]


• Slide 24/24

Lustre Centre of Excellence (LCE) at ORNL

Lustre Centre of Excellence (LCE) established in December 2006.

Create an on-site presence at ORNL (1st floor back hall): Oleg Drokin and Wang Di.

Develop a risk-mitigation Lustre package for ORNL: MPILND.

In out-years, explore possible 1 TB/s solutions.

Develop local expertise to reduce dependence on CFS and Cray: Peter Braam gave a 3-day tutorial on Lustre internals in January, and a sysadmin training is being planned.

Assist science teams in tuning their application I/O: focus on 2-3 key apps initially and document results (Wang Di).

On-site Lustre workshops for application teams.