ANL Site Update
ScicomP/SPXXL Summer Meeting 2016
Ray Loy, Ben Allen, Gordon McPheeters, Jack O'Connell, William Scullin, Tim Williams
Site Update ¤ System overview ¤ Support topics ¤ GPFS AFM Burst Buffer ¤ GHI deployment ¤ Performance Portability
ALCF Resources
ALCF Roadmap ¤ CORAL
¥ Theta – interim system (late 2016) ¡ Cray/Intel KNL >=2500 nodes, >= 60 cores/node (8.5PF) ¡ Peak comparable to Mira
¥ Aurora – late 2018 ¡ ALCF-3, successor to Mira ¡ Cray/Intel KNH >=50K nodes
¤ Blue Gene will continue to be a primary resource through the installation of Aurora ¥ Both HW and SW considerations for extended life
Support/Requirements
GPFS on BG/Q ¤ Current: ESS 3.5.x @ GPFS 4.1.1-2 ; DDNs, BG/Q clients @ GPFS 3.5
¥ Cannot upgrade ESS without BG/Q GPFS >=4.1 ¥ End-of-service for GPFS 3.5 coming up in 2017 ¥ BG/Q will continue in service well beyond that
¤ Support for GPFS 4.1 (or 4.2) on BG/Q IONs would be very helpful ¥ Concern about whether fixes for GPFS and AFM at GPFS 4.2 will continue to be back-ported to GPFS 4.1.1 in a timely manner
¤ ANL supports SPXXL 2016 requirement (Ticket 21)
Compilers
¤ IBM XL Compilers for Blue Gene/Q ¥ XL Fortran 14.1
¡ Current BG/Q version August 2015 ¡ AIX, Linux last update April/May 2016
¥ XL C/C++ 12.1 ¡ Current BG/Q version August 2015 ¡ AIX, Linux last update April 2016
¤ Hoping for BG/Q update
PMRs ¤ PMR 74893
¥ getdents/getdents64 alias conflict with 4.7.2 toolchain [Apr 2016] ¥ Success - Fixed in V1R2M4
¤ PMR 30358 ¥ No way to list contents of node's /dev/shmem or /dev/persistent from a CNK program ¥ (Can only access from off-node using CDTI) ¥ Negotiating with IBM on a solution (stalled on ANL)
Implementing GPFS AFM as a Burst Buffer
ALCF Operational File System Configuration ¤ There are three main production-level GPFS file systems:
¥ mira-fs0 – 19 PiB ¥ mira-fs1 – 7 PiB ¥ mira-home – 1.1 PiB
¤ These are based on DDN 12K-E (embedded NSD servers) storage. ¤ These file systems are mounted on all the BG I/O servers, a Cray visualization cluster, Globus DTN nodes, and HPSS data movers. ¤ Symlinks to the project filesets are used to mask the underlying file system from the users (see the sketch below); /project contains links to mira-fs0 and mira-fs1 filesets.
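A minimal sketch of the symlink scheme described above; the project names and mount paths here are hypothetical, not ALCF's actual layout:

  # /project entries point into whichever file system actually hosts the fileset
  ln -s /gpfs/mira-fs0/projects/ProjA /project/ProjA
  ln -s /gpfs/mira-fs1/projects/ProjB /project/ProjB

Users always reference /project/<name>, so which file system hosts a given fileset stays invisible to them.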
The big picture
[Diagram: staged deployment]
Current: fs0 (16 DDN, 21 PB, 240 GB/s) holds projects A, C, D, G, H; fs1 (6 DDN, 7 PB, 90 GB/s) holds projects B, E, F.
Step 1 (AFM): fs2 (30 ESS, 8 PB, 400 GB/s) fronts fs0 and fs1 as a cache holding all projects A-H, with fs0 and fs1 as home.
Step 2 (GHI): fs0 and fs1 are additionally tiered to HPSS via GHI.
ESS 50,000-foot view ¤ Terminology:
¥ AFM: Active File Management ¥ cache – fs2, non-permanent location ¥ home – fs0 and fs1, the permanent location; not /home, which is confusing
¤ fs2 is a cache; like any other cache, files can be evicted, and AFM ensures there is a good copy in home before eviction. ¥ Eviction is policy driven; our policies will probably be fairly simple (high water / low water and LRU), but almost any file system parameter can be checked
¤ AFM is “constantly” syncing files between the cache and home (see the sketch below)
¥ Default is every 15 seconds, but this is alterable ¥ Basically “replays” events from the cache on home (metadata updates, block changes, etc.) ¥ We can influence priorities of new writes vs. replication with tuning parameters, but true QoS is not here yet (it is on the roadmap)
¤ Fundamental assumption: our file systems are big enough that recalls will be rare
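A hedged sketch of the knobs mentioned above, using standard GPFS commands; the fileset name is an illustrative assumption, not ALCF's configuration:

  # cache-to-home replay interval; 15 seconds is the default
  mmchfileset fs2 projX -p afmAsyncDelay=15

  # manual/policy-driven eviction of cached file data from a fileset
  mmafmctl fs2 evict -j projX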
Why? ¤ Overall, a better experience for both the users and the admins
¤ No single point of failure
¥ We got this by adding fs1 alongside the existing fs0 file system
¤ All users see the same performance
¥ Bandwidth is not uniform between fs0 and fs1, but fs2 and AFM give it back ¥ A cache miss might cause run-to-run variation, but all users are equally subject to this, and given our cache size the assumption is that misses will be rare
¤ Minimal user/admin intervention required
¥ Ideally policies will do the right thing at the right time ¡ Example: project data will automatically disappear, via GHI, after a project completes; the storage team doesn't need to do anything ¥ Should trivially (single command) be able to cause the right behavior ¡ Probably based on an extended attribute and policy
¤ Minimal (ideally zero) manual copying of files
¥ All handled by the file system
Implementation Overview ¤ Current state: internal testing
¥ Two ESS node pairs were removed from the ESS cluster and used to form a test “home” cluster, with a file system called fm-fs1
¤ Using the AFM mode that allows a GPFS home file system ¤ mira-fs0 and mira-fs1 will be remotely mounted on the ESS system
¥ mira-home will not be AFM managed
¤ Home file system defined as: gpfs:/// ¥ The null server list signifies that home is remotely mounted on the cache cluster (see the sketch below)
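A minimal sketch of how a cache fileset with a GPFS-protocol home might be created; the fileset name, paths, and AFM mode below are assumptions for illustration, not ALCF's actual settings:

  # home is remote-mounted on the cache (ESS) cluster, so the AFM target
  # uses the gpfs:// protocol with an empty server list
  mmcrfileset fs2 projX --inode-space new \
      -p afmMode=independent-writer \
      -p afmTarget=gpfs:///gpfs/mira-fs0/projects/projX
  mmlinkfileset fs2 projX -J /gpfs/fs2/projects/projX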
Implementation Challenges ¤ This cluster was originally based on the GSS product, which was xSeries (Intel) based
¥ The cluster was migrated from xSeries to pSeries (ESS) to address support concerns
¤ Kernel panics in the mpt2sas kernel module ¥ Red Hat bugs: 1259907 (bug) / 1318560 (backport to the 7.1 zStream) ¥ Difficult PD/recreate path; much of the work was done by ALCF dealing directly with Red Hat and Avago, with up to 3 weeks between recreates ¥ Fixed in the RH 7.2 stream, but since ESS won't support RH 7.2 in calendar year 2016 we needed to ask Red Hat for a port to the RH 7.1 zStream; this fix is under final tests now at Red Hat ¥ As soon as it is available this will be incorporated in our ESS cluster ¥ Fixed in: kernel-3.10.0-229.33.1.el7
Implementation Challenges ¤ Quotas:
¥ No communication between home and cache with respect to quota settings ¥ Data is sync'ed to home as root, and root does not get an error when over quota ¥ Found it best to turn off auto-migration on cache filesets (see the sketch below)
¡ Prevents over-running home cache hard limits
¤ Bug found: after a file was evicted and subsequently re-read from the cache cluster, it was not becoming resident again in the cache; an efix is under construction
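If "auto-migration" above refers to AFM's quota-triggered automatic eviction, a hedged sketch of disabling it on a cache fileset (the fileset name and parameter choice are assumptions, not ALCF's actual change):

  # prefer explicit/policy-driven eviction over quota-triggered eviction,
  # so cache activity does not interact badly with home-side limits
  mmchfileset fs2 projX -p afmEnableAutoEviction=no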
GHI deployment
File System Backup Overview
Ø Mira-home is intended to store executable files and configuration files. Ø User quota limits are enforced. Ø Data is fully protected through GPFS metadata and data replication, GPFS snapshots, and nightly backups.
Ø Mira-fs0 and mira-fs1 are intended as intermediate-term storage for Mira/Cetus job output such as checkpoint datasets. Ø Project-associated fileset quota limits are enforced. Ø Only metadata replication is enabled. Ø The user is responsible for archiving their own data. Ø After project expiration, quota limits are reduced, data is archived, and the fileset is removed.
Current archiving (fs0, fs1) ¤ Users manually archive files via HSI or Globus/GridFTP ¤ When a user project expires, a script invokes HSI (see the sketch below)
¥ Copy to tape, shrink disk quota ¥ 90 days after expiration
¡ unmount (but still on disk)
¥ 180 days ¡ delete from disk, copy retained on tape
¤ No other system-related archiving
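A rough sketch of the kind of HSI step such an expiration script might run; the archive layout, paths, and quota values are illustrative assumptions only:

  # copy the expired project's data to HPSS, then shrink its disk quota
  hsi "mkdir /archive/ProjA; cd /archive/ProjA; cput -R /gpfs/mira-fs0/projects/ProjA"
  mmsetquota mira-fs0:ProjA --block 1T:1T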
GHI value added ¤ Instead of scripting archiving directly, generate GPFS policy files to tell GHI when/how to migrate expired projects (see the sketch below)
¤ Improved disaster recovery
¥ Migrate everything (leave in place on disk) ¥ Optional: implement a threshold policy, e.g. when the fs reaches 90% full, start migration down to 80% (punch holes in files, leaving metadata) ¥ GHI image backup – create an image of the entire fs only as “punched out” files (metadata) ¡ Initially restore the entire fs quickly, then retrieve the remaining contents on demand until caught up
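A hedged sketch of what a threshold-style GPFS policy rule driving GHI migration could look like; the external pool name, the script path, and the 90/80 thresholds are assumptions, not ALCF's actual policy file:

  /* hand candidate file lists to GHI/HPSS via an external pool
     (the EXEC script path here is hypothetical) */
  RULE EXTERNAL POOL 'hsm' EXEC '/opt/hpss/bin/ghi_migrate'

  /* at 90% full, migrate (punch holes, keep metadata) down to 80%,
     oldest-accessed files first */
  RULE 'ghi_threshold' MIGRATE FROM POOL 'system'
       THRESHOLD(90,80)
       WEIGHT(CURRENT_TIMESTAMP - ACCESS_TIME)
       TO POOL 'hsm'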
GHI Deployment Status ¤ In progress (~6 mo)
¥ GHI installed and enabled on a test fs
¤ Some delays due to PMR (snapshots not deleted, deletion times out due to other processes) ¥ Other sites report issue resolved
¤ Expecting deployment this year
High Level GHI Integration View
[Diagram: HPSS with disk cache and tape behind data movers; a GPFS cluster with GPFS NSD nodes, a GHI session node, and GHI IOM nodes; DDN storage; interconnects include 1 GbE, 10 GbE, QDR IB, FC, and 4x 6 Gb/s SAS]
Portability/Interoperability
Coordinating SC Centers' Efforts
¤ Meetings of cross-lab applications readiness staff ¤ Tools and libraries working group (W. Joubert) ¤ Cross-lab training committee (F. Foerttner)
¥ Shared calendar ¥ Shared training events
¤ Manage nondisclosure, export control challenges ¥ CORAL partners, APEX partners (NERSC & Trinity)
¤ Portability
March 2014 Meeting • Apps readiness coordination • ~15 representatives
September 2014 Meeting • Apps portability • Apps readiness coordination • ~25 representatives
January 2015 Meeting • Next-gen hardware & software • Portable programming • ~40 center staff • ~10 vendor reps
OpenMP 4.5 – Source Code Portability
¤ Host (CPU) ¤ Device (accelerator)
¥ GPGPU ¥ Intel MIC ¥ omp_get_num_devices()
¤ map may be a shared location
¤ simd: standard vectorization
¤ #ifdef MANYCORE #else GPU #endif <code>
double x[128], y[128];
#pragma omp target data map(to: x[0:64]) map(tofrom: y[0:64])
{
  #pragma omp target
  {
    // y computed on device
  }
}

double x[128], y[128];
#pragma omp for simd aligned(x, y : 32)
for (int i = 0; i < 128; i++) {
  // thread's iterates -> SIMD lanes
}

#pragma omp target data map(to: x)
{
  #pragma omp target map(tofrom: y)
  #pragma omp teams
  #pragma omp distribute
  #pragma omp parallel for
  for (int i = 0; i < n; ++i) {
    y[i] = a * x[i] + y[i];
  }
}
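As a rough illustration of the single-source idea, the same file might be built for host execution or for device offload just by changing compiler flags; the compiler and flags below are assumptions, not ALCF's toolchain:

  # host build: target regions fall back to the host
  clang -fopenmp saxpy.c -o saxpy_host
  # offload build: target regions run on an attached NVIDIA GPU
  clang -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda saxpy.c -o saxpy_gpu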
Performance Portability
¤ When a directives-based approach performs well: OpenMP 4.5 ¤ Use libraries or frameworks to encapsulate node-level parallelism
OpenMP 4.0 • Libraries • App Modules
CUDA • Vector intrinsics • shmem
…
Charm++
ADLB
PETSc
Chombo
Trilinos
MADNESS
Kokkos
RAJA
Effort Continuing ¤ Coordination between ESP, CAAR, NESAP.
¥ ALCC allocation at ANL, OLCF, NESAP for portability work ¥ Shared training including live/videocon
¤ Bringing NNSA labs into the discussion ¥ HPCOR (9/2015), COEPP (4/2016)
¤ SC15 Workshop on Portability Among HPC Architectures for Scientific Applications
¤ New effort on kernels/mini-apps: optimize on 2 platforms, then rework with OMP 4.5 or OpenACC, compare ¥ NekBone (ALCF), BoxLib (NERSC), DSL-based library for MD (OLCF)
Training - ATPESC ¤ Argonne Training Program on Extreme Scale Computing ¤ Intensive 2-week course held off-site
¥ Audience: doctoral students, postdocs, and computational scientists ¥ Presenters are leaders in all major areas of HPC
¤ Planning in progress for the 4th session in August 2016
¤ http://extremecomputingtraining.anl.gov
The End