ANL Site Update
ScicomP/SPXXL Summer Meeting 2016
Ray Loy, Ben Allen, Gordon McPheeters, Jack O'Connell, William Scullin, Tim Williams
Site Update ¤ System overview ¤ Support topics ¤ GPFS AFM Burst Buffer ¤ GHI deployment ¤ Performance Portability
ALCF Resources
ALCF Roadmap ¤ CORAL
¥ Theta – interim system (late 2016) ¡ Cray/Intel KNL >=2500 nodes, >= 60 cores/node (8.5PF) ¡ Peak comparable to Mira
¥ Aurora – late 2018 ¡ ALCF-3, successor to Mira ¡ Cray/Intel KNH >=50K nodes
¤ Blue Gene will continue to be a primary resource through the installation of Aurora ¥ Both HW and SW considerations for extended life
Support/Requirements
GPFS on BG/Q ¤ Current: ESS 3.5.x @ GPFS 4.1.1-2 ; DDNs, BG/Q clients @ GPFS 3.5
¥ Cannot upgrade ESS without BG/Q GPFS >=4.1 ¥ End-of-service for GPFS 3.5 coming up in 2017 ¥ BG/Q will continue in service well beyond that
¤ Support for GPFS 4.1 (or 4.2) on BG/Q IONs would be very helpful ¥ Concern about whether fixes for GPFS and AFM at GPFS 4.2 will continue to be back-ported to GPFS 4.1.1 in a timely manner
¤ ANL supports SPXXL 2016 requirement (Ticket 21)
Compilers
¤ IBM XL Compilers for Blue Gene/Q ¥ XL Fortran 14.1
¡ Current BG/Q version August 2015 ¡ AIX, Linux last update April/May 2016
¥ XL C/C++ 12.1 ¡ Current BG/Q version August 2015 ¡ AIX, Linux last update April 2016
¤ Hoping for BG/Q update
PMRs ¤ PMR 74893
¥ getdents/getdents64 alias conflict with 4.7.2 toolchain [Apr 2016] ¥ Success - Fixed in V1R2M4
¤ PMR 30358 ¥ No way to list contents of node's /dev/shmem or /dev/persistent from a CNK program ¥ (Can only access from off-node using CDTI) ¥ Negotiating with IBM on a solution (stalled on ANL)
Implementing GPFS AFM as a Burst Buffer
ALCF Operational File System Configuration ¤ There are three main production-level GPFS file systems:
¥ mira-fs0 – 19 PiB ¥ mira-fs1 – 7 PiB ¥ mira-home – 1.1 PiB
¤ These are based on DDN 12K-E (embedded NSD servers) storage. ¤ These file systems are mounted on all the BG I/O servers, a Cray visualization cluster, Globus DTN nodes, and HPSS data movers. ¤ Symlinks to the project filesets are used to mask the underlying file system from the users (see the sketch below); /project contains links to mira-fs0 and mira-fs1 filesets.
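A minimal sketch of the symlink scheme described above; the project names and mount paths here are hypothetical, not ALCF's actual layout:

  # /project entries point into whichever file system actually hosts the fileset
  ln -s /gpfs/mira-fs0/projects/ProjA /project/ProjA
  ln -s /gpfs/mira-fs1/projects/ProjB /project/ProjB

Users always reference /project/<name>, so which file system hosts a given fileset stays invisible to them.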
The big picture
[Diagram: staged deployment]
Current: fs0 (16 DDN, 21 PB, 240 GB/s) holds projects A, C, D, G, H; fs1 (6 DDN, 7 PB, 90 GB/s) holds projects B, E, F.
Step 1 (AFM): fs2 (30 ESS, 8 PB, 400 GB/s) fronts fs0 and fs1 as a cache holding all projects A-H, with fs0 and fs1 as home.
Step 2 (GHI): fs0 and fs1 are additionally tiered to HPSS via GHI.
ESS 50,000-foot view ¤ Terminology:
¥ AFM: Active File Management ¥ cache – fs2, non-permanent location ¥ home – fs0 and fs1, the permanent location; not /home, which is confusing
¤ fs2 is a cache; like any other cache, files can be evicted, and AFM ensures there is a good copy in home before eviction. ¥ Eviction is policy driven; our policies will probably be fairly simple (high water / low water and LRU), but almost any file system parameter can be checked
¤ AFM is “constantly” syncing files between the cache and home (see the sketch below)
¥ Default is every 15 seconds, but this is alterable ¥ Basically “replays” events from the cache on home (metadata updates, block changes, etc.) ¥ We can influence priorities of new writes vs. replication with tuning parameters, but true QoS is not here yet (it is on the roadmap)
¤ Fundamental assumption: our file systems are big enough that recalls will be rare
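A hedged sketch of the knobs mentioned above, using standard GPFS commands; the fileset name is an illustrative assumption, not ALCF's configuration:

  # cache-to-home replay interval; 15 seconds is the default
  mmchfileset fs2 projX -p afmAsyncDelay=15

  # manual/policy-driven eviction of cached file data from a fileset
  mmafmctl fs2 evict -j projX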
Why? ¤ Overall, a better experience for both the users and the admins
¤ No single point of failure
¥ We got this by adding fs1 alongside the existing fs0 file system
¤ All users see the same performance
¥ Bandwidth is not uniform between fs0 and fs1, but fs2 and AFM give it back ¥ A cache miss might cause run-to-run variation, but all users are equally subject to this, and given our cache size the assumption is that misses will be rare
¤ Minimal user/admin intervention required
¥ Ideally policies will do the right thing at the right time ¡ Example: project data will automatically disappear, via GHI, after a project completes; the storage team doesn't need to do anything ¥ Should trivially (single command) be able to cause the right behavior ¡ Probably based on an extended attribute and policy
¤ Minimal (ideally zero) manual copying of files
¥ All handled by the file system
Implementation Overview ¤ Current state: internal testing
¥ Two ESS node pairs were removed from the ESS cluster and used to form a test “home” cluster, with a file system called fm-fs1
¤ Using the AFM mode that allows a GPFS home file system ¤ mira-fs0 and mira-fs1 will be remotely mounted on the ESS system
¥ mira-home will not be AFM managed
¤ Home file system defined as: gpfs:/// ¥ The null server list signifies that home is remotely mounted on the cache cluster (see the sketch below)
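A minimal sketch of how a cache fileset with a GPFS-protocol home might be created; the fileset name, paths, and AFM mode below are assumptions for illustration, not ALCF's actual settings:

  # home is remote-mounted on the cache (ESS) cluster, so the AFM target
  # uses the gpfs:// protocol with an empty server list
  mmcrfileset fs2 projX --inode-space new \
      -p afmMode=independent-writer \
      -p afmTarget=gpfs:///gpfs/mira-fs0/projects/projX
  mmlinkfileset fs2 projX -J /gpfs/fs2/projects/projX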
Implementation Challenges ¤ This cluster was originally based on the GSS product, which was xSeries (Intel) based
¥ The cluster was migrated from xSeries to pSeries (ESS) to address support concerns
¤ Kernel panics in the mpt2sas kernel module ¥ Red Hat bugs: 1259907 (bug) / 1318560 (backport to the 7.1 zStream) ¥ Difficult PD/recreate path; much of the work was done by ALCF dealing directly with Red Hat and Avago, with up to 3 weeks between recreates ¥ Fixed in the RH 7.2 stream, but since ESS won't support RH 7.2 in calendar year 2016 we needed to ask Red Hat for a port to the RH 7.1 zStream; this fix is under final tests now at Red Hat ¥ As soon as it is available this will be incorporated in our ESS cluster ¥ Fixed in: kernel-3.10.0-229.33.1.el7
Implementation Challenges ¤ Quotas:
¥ No communication between home and cache with respect to quota settings ¥ Data is sync'ed to home as root, and root does not get an error when over quota ¥ Found it best to turn off auto-migration on cache filesets (see the sketch below)
¡ Prevents over-running home cache hard limits
¤ Bug found: after a file was evicted and subsequently re-read from the cache cluster, it was not becoming resident again in the cache; an efix is under construction
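If "auto-migration" above refers to AFM's quota-triggered automatic eviction, a hedged sketch of disabling it on a cache fileset (the fileset name and parameter choice are assumptions, not ALCF's actual change):

  # prefer explicit/policy-driven eviction over quota-triggered eviction,
  # so cache activity does not interact badly with home-side limits
  mmchfileset fs2 projX -p afmEnableAutoEviction=no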
GHI deployment
File System Backup Overview
Ø Mira-home is intended to store executable files and configuration files. Ø User quota limits are enforced. Ø Data is fully protected through GPFS metadata and data replication, GPFS snapshots, and nightly backups.
Ø Mira-fs0 and mira-fs1 are intended as intermediate-term storage for Mira/Cetus job output such as checkpoint datasets. Ø Project-associated fileset quota limits are enforced. Ø Only metadata replication is enabled. Ø The user is responsible for archiving their own data. Ø After project expiration, quota limits are reduced, data is archived, and the fileset is removed.
Current archiving (fs0, fs1) ¤ Users manually archive files via HSI or Globus/GridFTP ¤ When a user project expires, a script invokes HSI (see the sketch below)
¥ Copy to tape, shrink disk quota ¥ 90 days after expiration
¡ unmount (but still on disk)
¥ 180 days ¡ delete from disk, copy retained on tape
¤ No other system-related archiving
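A rough sketch of the kind of HSI step such an expiration script might run; the archive layout, paths, and quota values are illustrative assumptions only:

  # copy the expired project's data to HPSS, then shrink its disk quota
  hsi "mkdir /archive/ProjA; cd /archive/ProjA; cput -R /gpfs/mira-fs0/projects/ProjA"
  mmsetquota mira-fs0:ProjA --block 1T:1T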
GHI value added ¤ Instead of scripting archiving directly, generate GPFS policy files to tell GHI when/how to migrate expired projects (see the sketch below)
¤ Improved disaster recovery
¥ Migrate everything (leave in place on disk) ¥ Optional: implement a threshold policy, e.g. when the fs reaches 90% full, start migration down to 80% (punch holes in files, leaving metadata) ¥ GHI image backup – create an image of the entire fs only as “punched out” files (metadata) ¡ Initially restore the entire fs quickly, then retrieve the remaining contents on demand until caught up
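A hedged sketch of what a threshold-style GPFS policy rule driving GHI migration could look like; the external pool name, the script path, and the 90/80 thresholds are assumptions, not ALCF's actual policy file:

  /* hand candidate file lists to GHI/HPSS via an external pool
     (the EXEC script path here is hypothetical) */
  RULE EXTERNAL POOL 'hsm' EXEC '/opt/hpss/bin/ghi_migrate'

  /* at 90% full, migrate (punch holes, keep metadata) down to 80%,
     oldest-accessed files first */
  RULE 'ghi_threshold' MIGRATE FROM POOL 'system'
       THRESHOLD(90,80)
       WEIGHT(CURRENT_TIMESTAMP - ACCESS_TIME)
       TO POOL 'hsm'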
GHI Deployment Status ¤ In progress (~6 mo)
¥ GHI installed and enabled on a test fs
¤ Some delays due to PMR (snapshots not deleted, deletion times out due to other processes) ¥ Other sites report issue resolved
¤ Expecting deployment this year
High Level GHI Integration View
[Diagram: HPSS with disk cache and tape behind data movers; a GPFS cluster with GPFS NSD nodes, a GHI session node, and GHI IOM nodes; DDN storage; interconnects include 1 GbE, 10 GbE, QDR IB, FC, and 4x 6 Gb/s SAS]
Portability/Interoperability
Coordinating SC Centers' Efforts
¤ Meetings of cross-lab applications readiness staff ¤ Tools and libraries working group (W. Joubert) ¤ Cross-lab training committee (F. Foerttner)
¥ Shared calendar ¥ Shared training events
¤ Manage nondisclosure, export control challenges ¥ CORAL partners, APEX partners (NERSC & Trinity)
¤ Portability
March 2014 Meeting • Apps readiness coordination • ~15 representatives
September 2014 Meeting • Apps portability • Apps readiness coordination • ~25 representatives
January 2015 Meeting • Next-gen hardware & software • Portable programming • ~40 center staff • ~10 vendor reps
OpenMP 4.5 – Source Code Portability
¤ Host (CPU) ¤ Device (accelerator)
¥ GPGPU ¥ Intel MIC ¥ omp_get_num_devices()
¤ map may be a shared location
¤ simd: standard vectorization
¤ #ifdef MANYCORE #else GPU #endif <code>
double x[128], y[128];
#pragma omp target data map(to: x[0:64]) map(tofrom: y[0:64])
{
  #pragma omp target
  {
    // y computed on device
  }
}

double x[128], y[128];
#pragma omp for simd aligned(x, y : 32)
for (int i = 0; i < 128; i++) {
  // thread's iterates -> SIMD lanes
}

#pragma omp target data map(to: x)
{
  #pragma omp target map(tofrom: y)
  #pragma omp teams
  #pragma omp distribute
  #pragma omp parallel for
  for (int i = 0; i < n; ++i) {
    y[i] = a * x[i] + y[i];
  }
}
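As a rough illustration of the single-source idea, the same file might be built for host execution or for device offload just by changing compiler flags; the compiler and flags below are assumptions, not ALCF's toolchain:

  # host build: target regions fall back to the host
  clang -fopenmp saxpy.c -o saxpy_host
  # offload build: target regions run on an attached NVIDIA GPU
  clang -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda saxpy.c -o saxpy_gpu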
Performance Portability
¤ When a directives-based approach performs well: OpenMP 4.5 ¤ Use libraries or frameworks to encapsulate node-level parallelism
OpenMP 4.0 • Libraries • App Modules
CUDA • Vector intrinsics • shmem
…
Charm++
ADLB
PETSc
Chombo
Trilinos
MADNESS
Kokkos
RAJA
Effort Continuing ¤ Coordination between ESP, CAAR, NESAP.
¥ ALCC allocation at ANL, OLCF, NESAP for portability work ¥ Shared training including live/videocon
¤ Bringing NNSA labs into the discussion ¥ HPCOR (9/2015), COEPP (4/2016)
¤ SC15 Workshop on Portability Among HPC Architectures for Scientific Applications
¤ New effort on kernels/mini-apps: optimize on 2 platforms, then rework with OMP 4.5 or OpenACC, compare ¥ NekBone (ALCF), BoxLib (NERSC), DSL-based library for MD (OLCF)
Training - ATPESC ¤ Argonne Training Program on Extreme Scale Computing ¤ Intensive 2-week course held off-site
¥ Audience: doctoral students, postdocs, and computational scientists ¥ Presenters are leaders in all major areas of HPC
¤ Planning in progress for the 4th session in August 2016
¤ http://extremecomputingtraining.anl.gov
The End