Critical Flags, Variables, and Other Important ALCF Minutiae
Jini Ramprakash, Technical Support Specialist
Argonne Leadership Computing Facility

Feb 24, 2016

Argonne Leadership Computing Facility - supported by the Office of Science of the U.S. Department of Energy

Presentation outline

It’s all about your job!
– Job management
– Job basics
  • Submission
  • Queuing
  • Execution
  • Termination

Software environment
Optimization for beginners
ALCF resources, outlined

Job management

Cobalt (the ALCF resource scheduler) is used on all ALCF systems
– Similar to PBS but not the same
– Find more information at http://trac.mcs.anl.gov/projects/cobalt

Job management commands:
– qsub: submit a job
– qstat: query a job status
– qdel: delete a job
– qalter: alter batched job parameters
– qmove: move job to different queue
– qhold: place queued (non-running) job on hold
– qrls: release hold on job
– showres: show current and future reservations

Job basics – submission

Two modes of submitting jobs
– Basic
– Script mode

Get all flags and options by running 'man qsub'

For example:
  qsub -A alchemy -n 40960 --mode c1 -t 720 --env "OMP_NUM_THREADS=4" lead_to_gold
– In English: Charge project “Alchemy” for this job. Run on 40960 nodes, with one MPI rank per node. Run for 720 minutes. Set the “OMP_NUM_THREADS” environment variable to 4. Run the “lead_to_gold” binary.
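
To make the size of the example request concrete, here is a minimal Python sketch that turns the qsub arguments into totals. It assumes 16 cores per node (derived from the Mira figures later in this talk), assumes mode strings follow the pattern cN meaning N ranks per node (the talk only confirms c1), and treats core-hours as nodes x cores x requested wall time, which is an upper bound and not a statement of ALCF charging policy. The function names are hypothetical.

```python
# Sketch: translate the example qsub request into resource totals.
# Assumption: 16 cores per node, from Mira's 786,432 cores / 49,152 nodes.
CORES_PER_NODE = 786_432 // 49_152


def ranks_per_node(mode: str) -> int:
    """Parse a mode string of the assumed form 'cN' into N MPI ranks per node."""
    if not mode.startswith("c") or not mode[1:].isdigit():
        raise ValueError(f"unexpected mode string: {mode!r}")
    return int(mode[1:])


def job_footprint(nodes: int, mode: str, minutes: int, omp_threads: int) -> dict:
    """Totals for a request; core_hours assumes the full wall time is used."""
    ranks = nodes * ranks_per_node(mode)
    return {
        "mpi_ranks": ranks,
        "threads": ranks * omp_threads,
        "core_hours": nodes * CORES_PER_NODE * minutes / 60,
    }


if __name__ == "__main__":
    # The request from the slide: -n 40960 --mode c1 -t 720, OMP_NUM_THREADS=4
    print(job_footprint(nodes=40960, mode="c1", minutes=720, omp_threads=4))
```

For the slide's request this gives 40,960 MPI ranks and, if the job runs the full 720 minutes, roughly 7.9 million core-hours under the stated assumptions.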

qsub checks your submission for sanity

Did you specify a node count and walltime? Are they legal?
Is the mode you specified valid?
Did you ask for more than the minimum runtime?
Are you a member of the project you specified?
Does that project have a usable allocation?
If so … all systems go! Get a JOBID, and put it in the queue
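
The checks above can be sketched as a small validation function. This is an illustration only, not Cobalt's code: the Request type, the mode list, the minimum runtime, and the membership and allocation tables are all hypothetical placeholders; only the node limit is taken from the Mira figures in this talk.

```python
from dataclasses import dataclass

# Hypothetical placeholders, not ALCF policy.
VALID_MODES = {"c1", "script"}
MIN_MINUTES = 5
MAX_NODES = 49_152  # Mira's node count, per the resources slide


@dataclass
class Request:
    user: str
    project: str
    nodes: int
    minutes: int
    mode: str


def check_request(req, members, usable_projects):
    """Return rejection reasons; an empty list means the job can be queued."""
    problems = []
    if not 0 < req.nodes <= MAX_NODES:
        problems.append("illegal node count")
    if req.minutes < MIN_MINUTES:
        problems.append("walltime below minimum runtime")
    if req.mode not in VALID_MODES:
        problems.append("invalid mode")
    if req.user not in members.get(req.project, set()):
        problems.append("user is not a member of the project")
    elif req.project not in usable_projects:
        problems.append("project has no usable allocation")
    return problems


if __name__ == "__main__":
    members = {"alchemy": {"alice"}}
    req = Request("alice", "alchemy", nodes=40960, minutes=720, mode="c1")
    print(check_request(req, members, usable_projects={"alchemy"}) or "queued")
```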

Not there yet!

Job basics - life in the queue

Periodically, your job’s score will increase
Periodically, the scheduler will decide if there are any jobs it wants to run
Check current state with qstat
At some point, your score will be high enough, and it will be YOUR TURN!

Score accrual

Large jobs are prioritized
Jobs that have been waiting long are prioritized
INCITE/ALCC projects are prioritized
Negative allocations have a score cap lower than the starting score of other jobs
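
The talk does not give the actual scoring formula, so the following is only a toy model that reproduces the four properties listed above. Every constant and weight in it is invented for illustration, and the function name is hypothetical.

```python
def toy_score(nodes, hours_waiting, incite_or_alcc, negative_allocation,
              start_score=100.0, negative_cap=50.0):
    """Toy priority score; all weights are invented, not Cobalt's policy."""
    score = start_score
    score += nodes / 1024           # larger jobs score higher
    score += 10.0 * hours_waiting   # longer waits score higher
    if incite_or_alcc:
        score *= 2                  # INCITE/ALCC projects get a boost
    if negative_allocation:
        # The cap sits below start_score, so such a job never outranks
        # a freshly submitted job from a project in good standing.
        score = min(score, negative_cap)
    return score


if __name__ == "__main__":
    print(toy_score(8192, 6, incite_or_alcc=True, negative_allocation=False))
    print(toy_score(8192, 6, incite_or_alcc=True, negative_allocation=True))
```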

Job basics - execution

Book-keeping
– Put a start record in the database. Output a log file start record. Send email of job start if --notify was requested.
Start job timers
Fire up the job for execution
– Cobalt boots partition
– runjob starts executable

Script mode jobs

All jobs launch via runjob on the service nodes
Script mode jobs launch your script on a special login node
That script is responsible for calling runjob to launch the actual compute-node job
You are charged for the duration of the script
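
As a rough illustration of what a script-mode script has to do, the sketch below assembles a runjob command line in Python. The flag spellings (--np, -p, --block, --envs, and the ":" separator before the executable) and the COBALT_PARTNAME environment variable are assumptions that should be verified against 'man runjob' and the Cobalt documentation on the target system; the helper name and the 512-node size are hypothetical.

```python
import os
import shlex


def runjob_argv(exe, nodes, ranks_per_node, env=None, block=None):
    """Build an argv list for runjob (flag names are assumptions, see above)."""
    # Assumption: Cobalt exports the booted block name as COBALT_PARTNAME.
    block = block or os.environ.get("COBALT_PARTNAME", "UNKNOWN_BLOCK")
    argv = ["runjob",
            "--np", str(nodes * ranks_per_node),
            "-p", str(ranks_per_node),
            "--block", block]
    for key, value in (env or {}).items():
        argv += ["--envs", f"{key}={value}"]
    return argv + [":", exe]


if __name__ == "__main__":
    argv = runjob_argv("./lead_to_gold", nodes=512, ranks_per_node=1,
                       env={"OMP_NUM_THREADS": 4})
    # Print rather than execute; a real script would hand argv to subprocess.run.
    print(shlex.join(argv))
```

Because the whole script is charged, any pre- or post-processing done around the runjob call counts against the allocation as well.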

Job basics – termination aka are we there yet?

Your requested wall-time ticks down.
Either your runjob returns, or you run out of wall-time and your job is forcibly removed
Job-end cleanup happens
– If your partition wasn’t cleaned up, that happens now
Job-end book-keeping happens
– Database, log file, notify if requested

Job basics – Termination, life after your job

If another job depends on yours, it can now be released to run. If your job ended with a non-zero exit code, the dependent job moves to dep_fail instead
That night, the log files will be fed into clusterbank (the ALCF accounting system) to create charges
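
The dependency rule above amounts to a single branch on the exit code. A minimal sketch, in which the function name and the "queued" state label are hypothetical and only dep_fail comes from the talk:

```python
def release_dependents(exit_code, dependents):
    """Map each dependent job id to its next state after the parent job ends."""
    # "queued" is a placeholder name for "released to run".
    new_state = "queued" if exit_code == 0 else "dep_fail"
    return {job_id: new_state for job_id in dependents}


if __name__ == "__main__":
    print(release_dependents(0, [1001, 1002]))
    print(release_dependents(1, [1001, 1002]))
```

In practice this means a job chain stops at the first step that returns a non-zero exit code, so scripts should return 0 only when the next step can safely proceed.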

Non-standard job events

Reservations and/or draining
qsub rejection
Job holds
Job redefinition (qalter)
Job removal (qdel)
Abnormal job failure
Why isn’t this job running?

Software environment - SoftEnv

A tool for managing your environment
– Sets your PATH to access desired front-end tools
– Your compiler version can be changed here

Settings:
– Maintained in the file ~/.soft
– Add/remove keywords from ~/.soft to change environment
– Make sure @default is at the very end

Commands:
– softenv
  • a list of all keywords defined on the systems
– resoft
  • reloads initial environment from ~/.soft file
– soft add|remove keyword
  • Temporarily modify environment by adding/removing keywords

http://www.mcs.anl.gov/hs/software/systems/softenv/softenv-intro.html
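
The "@default at the very end" rule is easy to get wrong when editing ~/.soft by hand. The sketch below checks it for a given file's text. It assumes that lines starting with "#" are comments, and the keyword +mpiwrapper-xl is used purely as an illustrative placeholder; run softenv to see the keywords actually defined on a system.

```python
def default_is_last(soft_text: str) -> bool:
    """True if the last non-blank, non-comment line of a ~/.soft file is @default."""
    lines = [line.strip() for line in soft_text.splitlines()]
    # Assumption: '#' starts a comment line in ~/.soft.
    entries = [line for line in lines if line and not line.startswith("#")]
    return bool(entries) and entries[-1] == "@default"


if __name__ == "__main__":
    good = "+mpiwrapper-xl\n@default\n"   # placeholder keyword, then @default
    bad = "@default\n+mpiwrapper-xl\n"    # keyword added after @default
    print(default_is_last(good), default_is_last(bad))
```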

Software libraries

ALCF supports two sets of libraries:
– IBM system and provided libraries: /bgsys/drivers/ppcfloor
  • glibc
  • mpi
– Site supported libraries and programs: /soft/
  • PETSc
  • ESSL
– And many others
  • See http://www.alcf.anl.gov/resource-guides/software-and-libraries

Compiler wrappers

MPI wrappers for IBM XL cross-compilers:

Wrapper      Thread-Safe Wrapper   Underlying Compiler   Description
mpixlc       mpixlc_r              bgxlc                 IBM BG C Compiler
mpixlcxx     mpixlcxx_r            bgxlC                 IBM BG C++ Compiler
mpixlf77     mpixlf77_r            bgxlf                 IBM BG Fortran 77 Compiler
mpixlf90     mpixlf90_r            bgxlf90               IBM BG Fortran 90 Compiler
mpixlf95     mpixlf95_r            bgxlf95               IBM BG Fortran 95 Compiler
mpixlf2003   mpixlf2003_r          bgxlf2003             IBM BG Fortran 2003 Compiler

MPI wrappers for GNU cross-compilers:

Wrapper   Underlying Compiler          Description
mpicc     powerpc-bgp-linux-gcc        GNU BG C Compiler
mpicxx    powerpc-bgp-linux-g++        GNU BG C++ Compiler
mpif77    powerpc-bgp-linux-gfortran   GNU BG Fortran 77 Compiler
mpif90    powerpc-bgp-linux-gfortran   GNU BG Fortran 90 Compiler

Optimization for beginners

Suggested set of optimization levels from least to most optimization:
  -O0                              # best level for use with a debugger
  -O2                              # good level for verifying correctness, baseline perf
  -O2 -qmaxmem=-1 -qhot=level=0
  -O3 -qstrict                     (preserves program semantics)
  -O3
  -O3 -qhot=level=1
  -O4
  -O5
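
One way to work through this ladder is to build the same source once per level and compare correctness and run time. The sketch below only generates the command lines; the flag sets are copied from the list above, while the mpixlc default, the source file name, and the ".optN" output naming are illustrative assumptions.

```python
# Flag sets copied from the suggested ladder, least to most optimization.
OPT_LEVELS = [
    ["-O0"],
    ["-O2"],
    ["-O2", "-qmaxmem=-1", "-qhot=level=0"],
    ["-O3", "-qstrict"],
    ["-O3"],
    ["-O3", "-qhot=level=1"],
    ["-O4"],
    ["-O5"],
]


def compile_commands(source, compiler="mpixlc"):
    """Return one compile command (as an argv list) per optimization level."""
    stem = source.rsplit(".", 1)[0]
    return [
        [compiler, *flags, "-o", f"{stem}.opt{index}", source]
        for index, flags in enumerate(OPT_LEVELS)
    ]


if __name__ == "__main__":
    for argv in compile_commands("lead_to_gold.c"):
        print(" ".join(argv))
```

Keeping one binary per level makes it straightforward to check results against the -O2 build before trusting a faster one.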

Optimization tips

-qlistopt generates a listing with all flags used in compilation
-qreport produces a listing, shows how code was optimized
Performance can decrease at higher levels of optimization, especially at -O4 or -O5
May specify different optimization levels for different routines/files

ALCF Resources – BG/Q systems

Mira – BG/Q system
– 49,152 nodes / 786,432 cores
– 786 TB of memory
– Peak flop rate: 10 PF
– Linpack flop rate: 8.1 PF

Cetus (T&D) – BG/Q system
– 1,024 nodes / 16,384 cores
– 16 TB of memory
– Peak flop rate: 208 TF

Vesta (T&D) – BG/Q system
– 2,048 nodes / 32,768 cores
– 32 TB of memory
– Peak flop rate: 416 TF
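
The per-node figures implied by these totals are useful when sizing a job. A small sketch that derives them from the numbers on this slide, assuming decimal prefixes (1 PF = 1,000 TF, 1 TF = 1,000 GF); the dictionary layout and function name are illustrative.

```python
# Totals as quoted on the slide; peak rates in TF.
SYSTEMS = {
    "Mira": {"nodes": 49_152, "cores": 786_432, "peak_tf": 10_000},
    "Cetus": {"nodes": 1_024, "cores": 16_384, "peak_tf": 208},
    "Vesta": {"nodes": 2_048, "cores": 32_768, "peak_tf": 416},
}

# Ratio of Mira's Linpack rate (8.1 PF) to its peak rate (10 PF).
LINPACK_EFFICIENCY = 8.1 / 10


def per_node(name):
    """Cores and peak GF per node for one system, from the slide's totals."""
    system = SYSTEMS[name]
    return {
        "cores": system["cores"] // system["nodes"],
        "peak_gf": system["peak_tf"] * 1000 / system["nodes"],
    }


if __name__ == "__main__":
    for name in SYSTEMS:
        print(name, per_node(name))
    print(f"Mira Linpack efficiency: {LINPACK_EFFICIENCY:.0%}")
```

All three systems come out at 16 cores per node, which is why a node count alone is enough to describe a BG/Q request.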

ALCF Resources – supporting systems

Tukey – Nvidia system
– 100 nodes / 1600 x86 cores / 200 M2070 GPUs
– 6.4 TB x86 memory / 1.2 TB GPU memory
– Peak flop rate: 220 TF

Storage
– Scratch: 28.8 PB raw capacity, 240 GB/s bw (GPFS)
– Home: 1.8 PB raw capacity, 45 GB/s bw (GPFS)
– Storage upgrade planned in 2015

ALCF Resources

Mira: 48 racks / 768K cores, 10 PF
Cetus (Dev): 1 rack / 16K cores, 208 TF
Vesta (Dev): 2 racks / 32K cores, 416 TF
Tukey (Viz): 100 nodes / 1600 cores, 200 NVIDIA GPUs, 220 TF
Networks: 100Gb (via ESnet, Internet2, UltraScienceNet)

Coming up next…

Data Transfers in the ALCF - Robert Scott, ALCF

Thank You!

Questions?