Lustre Scalability
  • Slide 1/37

    Oak Ridge National Laboratory

    Lustre Scalability Workshop

    February 10, 2009

    Presented by:

    Galen M. Shipman

    Collaborators:

    David Dillow

    Sarp Oral

    Feiyi Wang

  • Slide 2/37

    We have increased system performance 300 times since 2004

    - Hardware scaled from single-core through dual-core to quad-core and dual-socket SMP nodes
    - Scaling applications and system software is the biggest challenge
      - NNSA and DoD have funded much of the basic system architecture research
        - Cray XT based on Sandia Red Storm
        - IBM BG designed with Livermore
        - Cray X1 designed in collaboration with DoD
      - The SciDAC program is funding scalable application work that has advanced many science applications
      - DOE-SC and NSF have funded much of the library and applied math work as well as tools
      - Computational liaisons are key to using deployed systems

    [Timeline chart, FY 2005 through FY 2009: Cray X1 (3 TF) → Cray XT3 single-core (26 TF) → Cray XT3 dual-core (54 TF) → Cray XT4 (119 TF) → Cray XT4 quad-core (263 TF) → Cray XT5 8-core, dual-socket SMP (1+ PF)]

  • Slide 3/37

    We will advance computational capability by a factor of 1,000 over the next decade

    Mission: Deploy and operate the computational resources required to tackle global challenges

    Vision: Maximize scientific productivity and progress on the largest-scale computational problems

    - Deliver transforming discoveries in materials, biology, climate, energy technologies, etc.
    - Ability to investigate otherwise inaccessible systems, from supernovae to energy grid dynamics
    - Providing world-class computational resources and specialized services for the most computationally intensive problems
    - Providing a stable hardware/software path of increasing scale to maximize productive applications development

    [Roadmap chart: FY 2009: Cray XT5, 1+ PF leadership-class system for science; FY 2011: DARPA HPCS, 20 PF leadership-class system; FY 2015: 100–250 PF; FY 2018: future system, 1 EF]

  • Slide 4/37

    Explosive Data Growth

  • Slide 5/37

    Parallel File Systems in the 21st Century

    - Lessons learned from deploying a petascale I/O infrastructure
    - Storage system hardware trends
    - File system requirements for 2012

  • Slide 6/37

    The Spider Parallel File System

    - ORNL has successfully deployed a direct-attached parallel file system for the Jaguar XT5 simulation platform
      - Over 240 GB/s of raw bandwidth
      - Over 10 petabytes of aggregate storage
      - Demonstrated file-system-level bandwidth of >200 GB/s (more optimizations to come)
    - Work is ongoing to deploy this file system in a router-attached configuration
      - Services multiple compute resources
      - Eliminates islands of data
      - Maximizes the impact of the storage investment
      - Enhances manageability
      - Demonstrated on Jaguar XT5 using 1/2 of the available storage (96 routers)

  • Slide 7/37

    Spider

    [Architecture diagram: Spider storage (10.7 PB, 240 GB/s) is served by 192 OSSs exporting 1,344 OSTs over the Scalable I/O Network (SION), a DDR InfiniBand fabric providing 889 GB/s. Jaguar XT5 reaches SION through 192 LNET routers and Jaguar XT4 through 48 routers, each across its SeaStar torus; Lens and Smokey also attach. Other resources on SION include the HPSS archive (10 PB), GridFTP servers, and Lustre-WAN gateways (10-40 Gbit/s) connecting to ESnet, USN, TeraGrid, Internet2, and NLR.]

  • Slide 8/37

    Spider facts

    - 240 GB/s of aggregate bandwidth
    - 48 DDN 9900 couplets
    - 13,440 1 TB SATA drives
    - Over 10 PB of RAID 6 capacity
    - 192 storage servers
    - Over 1,000 InfiniBand cables
    - ~0.5 MW of power
    - ~20,000 lbs of disks
    - Fits in 32 cabinets using 572 ft²

  • Slide 9/37

    Spider Configuration

  • Slide 10/37

    Spider Couplet View

  • Slide 11/37

    Lessons Learned: Network Congestion

    - The I/O infrastructure doesn't expose resource locality
      - There is currently no analog of nearest-neighbor communication that will save us
    - Multiple areas of congestion
      - InfiniBand SAN
      - SeaStar torus
      - LNET routing doesn't expose locality
        - Traffic may take a very long route unnecessarily
    - The assumption of a flat network space won't scale
      - It is the wrong assumption even for a single compute environment
      - A center-wide file system will aggravate this
    - Solution: expose locality
      - Lustre modifications allow fine-grained routing capabilities
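    The cost of ignoring locality is easy to see on a torus. The following minimal sketch (illustrative only; the torus dimensions and node coordinates are made up, not Jaguar's actual layout) compares the SeaStar hop count when a client happens to pick a nearby router versus a distant one.

    ```python
    # Illustrative only: hop counts on a 3D torus with wraparound links.
    # Dimensions and coordinates below are hypothetical.

    def torus_hops(a, b, dims):
        """Minimal number of hops between nodes a and b on a torus of size dims."""
        return sum(min(abs(x - y), d - abs(x - y)) for x, y, d in zip(a, b, dims))

    dims = (24, 32, 24)               # assumed torus dimensions
    client = (3, 4, 5)
    nearby_router = (4, 6, 5)         # a router close to the client
    distant_router = (20, 28, 17)     # a router on the far side of the machine

    print(torus_hops(client, nearby_router, dims))   # a handful of hops
    print(torus_hops(client, distant_router, dims))  # traffic crosses much of the torus
    ```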

  • Slide 12/37

    Design To Minimize Contention

    - Pair routers and object storage servers on the same line card (crossbar)
      - As long as routers only talk to OSSes on the same line card, contention in the fat tree is eliminated
      - Required small changes to OpenSM
    - Place routers strategically within the torus
      - In some use cases, routers (or groups of routers) can be thought of as a replicated resource
      - Assign clients to routers so as to minimize contention
    - Allocate objects to the nearest OST
      - Requires changes to Lustre and/or I/O libraries
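    A minimal sketch of the two placement rules above, assuming hypothetical router, OSS, and line-card names and a made-up topology table; this is illustrative Python, not the actual Lustre or OpenSM modifications.

    ```python
    # Illustrative sketch (hypothetical data structures):
    #   1. a router only serves OSSes on its own IB line card (no fat-tree hops)
    #   2. a client is assigned to the closest router group on the torus

    # line card -> routers and OSSes attached to it (contents are made up)
    LINE_CARDS = {
        0: {"routers": ["rtr0", "rtr1"], "osses": ["oss0", "oss1", "oss2", "oss3"]},
        1: {"routers": ["rtr2", "rtr3"], "osses": ["oss4", "oss5", "oss6", "oss7"]},
    }

    # router group -> representative torus coordinate of its routers (made up)
    ROUTER_GROUP_COORD = {"group_a": (2, 4, 1), "group_b": (18, 20, 10)}

    def torus_hops(a, b, dims=(24, 32, 24)):
        return sum(min(abs(x - y), d - abs(x - y)) for x, y, d in zip(a, b, dims))

    def osses_for_router(router):
        """Rule 1: a router talks only to OSSes sharing its line card."""
        for card in LINE_CARDS.values():
            if router in card["routers"]:
                return card["osses"]
        raise KeyError(router)

    def router_group_for_client(client_coord):
        """Rule 2: pick the router group closest to the client on the torus."""
        return min(ROUTER_GROUP_COORD,
                   key=lambda g: torus_hops(client_coord, ROUTER_GROUP_COORD[g]))

    print(osses_for_router("rtr2"))            # ['oss4', 'oss5', 'oss6', 'oss7']
    print(router_group_for_client((3, 3, 2)))  # 'group_a'
    ```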

  • Slide 13/37

    Intelligent LNET Routing

    [Diagram: clients prefer the specific routers that share a line card with these OSSes, which minimizes InfiniBand congestion; clients are assigned to specific router groups, which minimizes SeaStar congestion.]

  • Slide 14/37

    Performance Results

    - Even in a direct-attached configuration we have demonstrated the impact of network congestion on I/O performance
      - By strategically placing writers within the torus and pre-allocating file system objects, we can substantially improve performance
      - Performance results were obtained on Jaguar XT5 using 1/2 of the available backend storage

  • Slide 15/37

    Performance Results (1/2 of Storage)

    [Chart: backend throughput measured while bypassing the SeaStar torus (congestion-free on the IB fabric) versus throughput subject to SeaStar torus congestion.]

  • Slide 16/37

    Lessons Learned: Journaling Overhead

    - Even sequential writes can exhibit random I/O behavior due to journaling
    - A special file (contiguous block space) is reserved for journaling on ldiskfs
      - Located all together
      - Labeled as the journal device
      - Placed toward the beginning of the physical disk layout
    - After the file data portion is committed to disk, the journal metadata portion needs to be committed as well
    - An extra head seek is needed for every journal transaction commit
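    To make that overhead concrete, here is a crude back-of-the-envelope model; every number in it is hypothetical (not a Spider measurement), and it only illustrates how a per-commit seek eats into a drive's sequential streaming rate.

    ```python
    # Crude model, hypothetical numbers: each journal transaction commit forces
    # the head to seek from the data region to the journal area near the start
    # of the disk and back, stealing time from sequential streaming.

    stream_rate_mb_s = 80.0     # assumed sustained sequential rate of one SATA drive
    extra_seek_s = 0.015        # assumed round-trip seek cost per transaction commit
    data_per_commit_mb = 2.0    # assumed file data written per journal transaction

    stream_time = data_per_commit_mb / stream_rate_mb_s
    effective_rate = data_per_commit_mb / (stream_time + extra_seek_s)

    print(f"effective rate: {effective_rate:.1f} MB/s "
          f"({effective_rate / stream_rate_mb_s:.0%} of pure streaming)")
    ```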

  • Slide 17/37

    Minimizing extra disk head seeks

    - External journals on solid-state devices
      - No disk seeks
      - Trade-off between extra network transaction latency and disk seek latency
    - Tested on a RamSan-400 device
      - 4 IB SDR 4x host ports
      - 7 external journal devices per host port
    - More than doubled the per-DDN performance with respect to internal journal devices on the DDN arrays

        Journal configuration           Per-DDN throughput (MB/s)
        Internal journal                1398.99
        External journal on RamSan      3292.60

    - Encountered some scalability problems per host port inherent to the RamSan firmware
      - Reported to Texas Memory Systems, Inc.; awaiting a resolution in the next firmware release
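    For reference, the "more than doubled" figure works out as follows, using the two per-DDN numbers quoted above (units assumed to be MB/s):

    ```python
    internal_journal = 1398.99           # per-DDN throughput, internal journals
    external_journal_ramsan = 3292.60    # per-DDN throughput, external journals on RamSan
    print(f"speedup: {external_journal_ramsan / internal_journal:.2f}x")  # ~2.35x
    ```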

  • Slide 18/37

    Minimizing synchronous journal transaction commit penalty

    - Two active transactions per ldiskfs (per OST)
      - One running and one closed
      - The running transaction can't be closed until the closed transaction is fully committed to disk
    - Up to 8 RPCs (write ops) might be in flight per client
      - With synchronous journal commits, some can be blocked until the closed transaction is fully committed
      - The lower the client count, the higher the possibility of low utilization due to blocked RPCs
      - More concurrent writes are better able to utilize the pipeline
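    A rough illustration of those last two bullets, as a worst-case sketch with hypothetical rates and latencies; it assumes every client's entire RPC window is stuck waiting on the closed transaction, so with only a few clients the queued data cannot keep the backend busy for the whole commit window.

    ```python
    # Crude worst-case model with hypothetical numbers: while a closed journal
    # transaction commits, assume every client's full window of write RPCs is
    # blocked. Utilization is how much of that commit window the already-queued
    # data can keep the backend busy.

    rpc_size_mb = 1.0            # Lustre bulk write RPCs carry up to 1 MB
    rpcs_in_flight = 8           # per-client limit cited on this slide
    backend_rate_mb_s = 3000.0   # assumed backend write rate
    commit_latency_s = 0.02      # assumed time to fully commit a closed transaction

    def utilization(num_clients):
        queued_mb = num_clients * rpcs_in_flight * rpc_size_mb
        busy_time = queued_mb / backend_rate_mb_s   # time to absorb the queued data
        return min(1.0, busy_time / commit_latency_s)

    for n in (1, 4, 16, 64):
        print(f"{n:3d} clients -> backend busy ~{utilization(n):.0%} of the commit window")
    ```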

  • Slide 19/37

    Minimizing synchronous journal transaction commit penalty

    - To alleviate the problem, reply to the client when the data portion of the RPC is committed to disk
      - Existing mechanism for client completion replies with
