Compiling applications for the Cray XC
Compiler Driver Wrappers (1)
● All applications that will run in parallel on the Cray XC should be compiled with the standard language wrappers. The compiler drivers for each language are:
  ● cc  – wrapper around the C compiler
  ● CC  – wrapper around the C++ compiler
  ● ftn – wrapper around the Fortran compiler
● These scripts will choose the required compiler version, target architecture options, scientific libraries and their include files automatically from the currently loaded module environment. Use the -craype-verbose flag to see the default options.
● Use them exactly like you would the original compiler, e.g. to compile prog1.f90:
> ftn -c <any_other_flags> prog1.f90
Compiler Driver Wrappers (2)
● The scripts choose which compiler to use from the PrgEnv module loaded
● Use module swap to change the PrgEnv, e.g.
> module swap PrgEnv-cray PrgEnv-intel
● PrgEnv-cray is loaded by default at login. This may differ on other Cray systems.
● Use module list to check what is currently loaded.
● The Cray MPI module is loaded by default (cray-mpich).
● To support SHMEM, load the cray-shmem module.
PrgEnv         Description                     Real Compilers
PrgEnv-cray    Cray Compilation Environment    crayftn, craycc, crayCC
PrgEnv-intel   Intel Composer Suite            ifort, icc, icpc
PrgEnv-gnu     GNU Compiler Collection         gfortran, gcc, g++
PrgEnv-pgi     Portland Group Compilers        pgf90, pgcc, pgCC
Compiler Versions
● There are usually multiple versions of each compiler available to users.
● The most recent version is usually the default and will be loaded when swapping the PrgEnv.
● To change the version of the compiler in use, swap the compiler module, e.g.
> module swap cce cce/8.3.10
PrgEnv         Compiler Module
PrgEnv-cray    cce
PrgEnv-intel   intel
PrgEnv-gnu     gcc
PrgEnv-pgi     pgi
EXCEPTION: Cross Compiling Environment
● The wrapper scripts, ftn, cc, and CC, will create a highly optimized executable tuned for the Cray XC’s compute nodes (cross compilation).
● This executable may not run on the login nodes
  ● Login nodes do not support running distributed memory applications
  ● Some Cray architectures may have different processors in the login and compute nodes. A typical error is '… illegal instruction …'.
● If you are compiling for the login nodes
  ● You should use the original direct compiler commands, e.g. ifort, pgcc, crayftn, gcc, …
  ● PATH will change with modules; all libraries will have to be linked in manually.
● Conversely, you can use the compiler wrappers {cc,CC,ftn} with the -target-cpu= option, choosing among {abudhabi, haswell, interlagos, istanbul, ivybridge, mc12, mc8, sandybridge, shanghai, x86_64}. The x86_64 target is the most compatible but also the least specific; a sketch is shown below.
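A minimal sketch of this (the source and executable names are illustrative; x86_64 is the target value named above):
> cc -target-cpu=x86_64 -o helper_tool helper_tool.c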
About the -I, -L and -l flags
● For libraries and include files provided via module files, you should NOT add anything to your Makefile
  ● No additional MPI flags are needed (included by the wrappers)
  ● You do not need to add any -I, -l or -L flags for the Cray-provided libraries
● If your Makefile needs an input for -L to work correctly, try using '.'
● If you really, really need a specific path, try checking ‘module show <X>’ for some environment variables
Dynamic vs Static linking

● Currently static linking is the default
  ● May change in the future
  ● Already changed when linking for GPUs (XK6/XK7 nodes)
● To decide how to link, you can either
  1. set CRAYPE_LINK_TYPE to "static" or "dynamic", or
  2. pass the '-static' or '-dynamic' option to the linking wrapper (cc, CC or ftn).
● Features of dynamic linking:
  ● Smaller executable, automatic use of new libs
  ● Might need longer startup time to load and find the libs
  ● Environment (loaded modules) should be the same between your compiler setup and your batch script (e.g. when switching to PrgEnv-intel)
● Features of static linking:
  ● Larger executable (usually not a problem)
  ● Faster startup
  ● Application will run the same code every time it runs (independent of environment)
● If you want to hardcode the rpath into the executable:
  ● Set CRAY_ADD_RPATH=yes during compilation
  ● This will always load the same version of the lib when running, independent of the version loaded by modules
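For example, a minimal sketch of a dynamically linked build (the source and executable names are illustrative):
> export CRAYPE_LINK_TYPE=dynamic     # or pass -dynamic to the wrapper instead
> ftn -o simulation.exe simulation.f90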
OpenMP
● OpenMP is supported by all of the PrgEnvs.
● CCE (PrgEnv-cray) recognizes and interprets OpenMP directives by default. If you have OpenMP directives in your application but do not wish to use them, disable OpenMP recognition with -hnoomp.
● Intel OpenMP spawns an extra helper thread, which may cause oversubscription. Hints on that will follow.
PrgEnv         Enable OpenMP    Disable OpenMP
PrgEnv-cray    -homp            -hnoomp
PrgEnv-intel   -openmp
PrgEnv-gnu     -fopenmp
PrgEnv-pgi     -mp
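For example, a hybrid source file (the name is illustrative) would be built with the flag matching the loaded PrgEnv:
> ftn -homp hybrid.f90       # PrgEnv-cray
> ftn -openmp hybrid.f90     # PrgEnv-intel
> ftn -fopenmp hybrid.f90    # PrgEnv-gnu
> ftn -mp hybrid.f90         # PrgEnv-pgi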
Compiler man Pages
● For more information on individual compilers, see the man pages listed in the table below.
● To verify that you are using the correct version of a compiler, use:
  ● the -V option on a cc, CC, or ftn command with PGI, Intel and Cray
  ● the --version option on a cc, CC, or ftn command with GNU
PrgEnv         C            C++           Fortran
PrgEnv-cray    man craycc   man crayCC    man crayftn
PrgEnv-intel   man icc      man icpc      man ifort
PrgEnv-gnu     man gcc      man g++       man gfortran
PrgEnv-pgi     man pgcc     man pgCC      man pgf90
Wrappers       man cc       man CC        man ftn
Using Compilers
Quick Overview
Using Compiler Feedback
● Compilers can generate an annotated listing of your source code indicating important optimizations. This is useful for targeted use of compiler flags.
● CCE
  ● ftn -rm
  ● {cc,CC} -hlist=a
● Intel
  ● ftn/cc -opt-report 3 -vec-report6
  ● If you want this written to a file: add -opt-report-file=filename
  ● See ifort --help reports
● GNU
  ● -ftree-vectorizer-verbose=9
● PGI
  ● -Minfo=<…>
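For example, with CCE a loopmark listing for one source file could be requested like this (the file name is illustrative); the annotated listing is typically written to a corresponding .lst file:
> ftn -rm -c resid.f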
Compiler feedback: Loopmark
● For example, with the Cray compiler:

%%%    L o o p m a r k   L e g e n d    %%%

Primary Loop Type         Modifiers
-----------------         ---------
A - Pattern matched       a - vector atomic memory operation
                          b - blocked
C - Collapsed             f - fused
D - Deleted               i - interchanged
E - Cloned                m - streamed but not partitioned
I - Inlined               p - conditional, partial and/or computed
M - Multithreaded         r - unrolled
P - Parallel/Tasked       s - shortloop
V - Vectorized            t - array syntax temp used
                          w - unwound
Compiler feedback: Loopmark (cont.)

29. b-------<       do i3=2,n3-1
30. b b-----<         do i2=2,n2-1
31. b b Vr--<           do i1=1,n1
32. b b Vr                u1(i1) = u(i1,i2-1,i3) + u(i1,i2+1,i3)
33. b b Vr      *                + u(i1,i2,i3-1) + u(i1,i2,i3+1)
34. b b Vr                u2(i1) = u(i1,i2-1,i3-1) + u(i1,i2+1,i3-1)
35. b b Vr      *                + u(i1,i2-1,i3+1) + u(i1,i2+1,i3+1)
36. b b Vr-->           enddo
37. b b Vr--<           do i1=2,n1-1
38. b b Vr                r(i1,i2,i3) = v(i1,i2,i3)
39. b b Vr      *                     - a(0) * u(i1,i2,i3)
40. b b Vr      *                     - a(2) * ( u2(i1) + u1(i1-1) + u1(i1+1) )
41. b b Vr      *                     - a(3) * ( u2(i1-1) + u2(i1+1) )
42. b b Vr-->           enddo
43. b b----->         enddo
44. b------->       enddo
Compiler Feedback: Loopmark (cont.)

ftn-6289 ftn: VECTOR File = resid.f, Line = 29
  A loop starting at line 29 was not vectorized because a recurrence was found on "U1" between lines 32 and 38.
ftn-6049 ftn: SCALAR File = resid.f, Line = 29
  A loop starting at line 29 was blocked with block size 4.
ftn-6289 ftn: VECTOR File = resid.f, Line = 30
  A loop starting at line 30 was not vectorized because a recurrence was found on "U1" between lines 32 and 38.
ftn-6049 ftn: SCALAR File = resid.f, Line = 30
  A loop starting at line 30 was blocked with block size 4.
ftn-6005 ftn: SCALAR File = resid.f, Line = 31
  A loop starting at line 31 was unrolled 4 times.
ftn-6204 ftn: VECTOR File = resid.f, Line = 31
  A loop starting at line 31 was vectorized.
ftn-6005 ftn: SCALAR File = resid.f, Line = 37
  A loop starting at line 37 was unrolled 4 times.
ftn-6204 ftn: VECTOR File = resid.f, Line = 37
  A loop starting at line 37 was vectorized.
Recommended compiler optimization levels
● Cray compiler
  ● The default optimization level (i.e. no flags) is equivalent to -O3 of most other compilers. CCE optimizes rather aggressively by default, but this is also the most thoroughly tested configuration.
  ● Try with -O3 -hfp3 (also tested thoroughly)
    ● -hfp3 gives you a lot more floating-point optimization, especially 32-bit.
    ● In case of precision errors, try a lower -hfp<number> (-hfp1 first, only -hfp0 if absolutely necessary).
● GNU compiler
  ● Almost all HPC applications compile correctly with -O3, so use that instead of the cautious default.
  ● -ffast-math may give some extra performance.
● Intel compiler
  ● The default optimization level (equal to -O2) is safe.
  ● Try with -O3. If that still works, you may try -Ofast -fp-model fast=2.
● Use the -craype-verbose flag to {cc,CC,ftn} to show the options actually being used.
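As an example, a typical CCE build following these recommendations might look like this (the source file name is illustrative):
> ftn -O3 -hfp3 -craype-verbose -c solver.f90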
Inlining & inter-procedural optimization
● Cray compiler
  ● Inlining within a file is enabled by default.
  ● The command line options -OipaN (ftn) and -hipaN (cc/CC), where N=0..4, provide a set of choices for inlining behavior:
    ● 0 disables inlining, 3 is the default, 4 is even more elaborate.
  ● The -Oipafrom= (ftn) or -hipafrom= (cc/CC) option instructs the compiler to look for inlining candidates in other source files, or a directory of source files.
  ● -hwp combined with -h pl=… enables whole program automatic inlining.
● GNU compiler
  ● Quite elaborate inlining is enabled by -O3.
● Intel compiler
  ● Inlining within a file is enabled by default.
  ● Multi-file inlining is enabled by the flag -ipo.
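For example, a sketch of cross-file inlining with CCE, assuming the inlining candidates live in a directory called inline_src/ (the directory and file names are illustrative):
> ftn -Oipafrom=inline_src/ -c main.f90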
Loop transformations
● Cray compiler
  ● The most useful techniques are already in their aggressive state by default.
  ● One may try to improve loop restructuring for better vectorization with -h vector3.
● GNU compiler
  ● Loop blocking (aka tiling) with -floop-block
  ● Loop unrolling with -funroll-loops or -funroll-all-loops
● Intel compiler
  ● Loop unrolling with -funroll-loops or -unroll-aggressive
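For example (source file names illustrative), asking CCE for more aggressive loop restructuring, or GCC for blocking and unrolling:
> ftn -h vector3 -c stencil.f90
> cc -O3 -floop-block -funroll-loops -c stencil.c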
Directives for the Cray Compiler
● If you see from the compiler feedback that a loop has not been blocked, unrolled, or vectorized but you are convinced that it should be, you can use compiler directives instead of raising the optimization level -O…
● The Cray compiler supports a full and growing set of directives and pragmas, e.g.
  ● !dir$ concurrent
  ● !dir$ ivdep
  ● !dir$ interchange
  ● !dir$ unroll
  ● !dir$ loop_info [max_trips] [cache_na]
  ● !dir$ blockable
● More information is given in
  ● man directives
  ● man loop_info
!dir$ blockable(j,k)
!dir$ blockingsize(16)
do k = 6, nz-5
  do j = 6, ny-5
    do i = 6, nx-5
      ! stencil
    end do
  end do
end do
Summary
● Four compiler environments are available on the XC40:
  ● Cray (PrgEnv-cray is the default)
  ● Intel (PrgEnv-intel)
  ● GNU (PrgEnv-gnu)
  ● PGI (PrgEnv-pgi)
● All of them are accessed through the wrappers ftn, cc and CC; just do module swap to change the compiler or its version.
● There is no universally fastest compiler
  ● Performance strongly depends on the application (and even the input)
  ● We try, however, to excel with the Cray Compiler Environment
  ● If you see a case where some other compiler yields better performance, let us know!
● Compiler flags do matter
  ● Be ready to spend some effort finding the best ones for your application.
● More information is given at the end of this presentation.
Cray Scientific Libraries
Overview
Cray Scientific Libraries
[Diagram: Cray Scientific Libraries grouped by area]
  ● FFT: FFTW
  ● Dense: BLAS, LAPACK, ScaLAPACK, IRT, CASE
  ● Sparse: CASK, PETSc, Trilinos

IRT  – Iterative Refinement Toolkit
CASK – Cray Adaptive Sparse Kernels
CASE – Cray Adaptive Simplified Eigensolver
● A large variety of standard libraries is available via modules
● Optimized for Cray hardware and also for the Haswell processor
What makes Cray libraries special
1. Node performance
   ● Highly tuned routines at the low level (e.g. BLAS)
2. Network performance
   ● Optimized for network performance
   ● Overlap between communication and computation
   ● Use the best available low-level mechanism
   ● Use adaptive parallel algorithms
3. Highly adaptive software
   ● Use auto-tuning and adaptation to give the user the best known (or very good) code at runtime
4. Productivity features
   ● Simple interfaces into complex software
Library Usage Overview.
● LibSci
  ● Includes BLAS, CBLAS, BLACS, LAPACK, ScaLAPACK
  ● Module is loaded by default (man libsci)
  ● Threading is used within LibSci (controlled by OMP_NUM_THREADS). If you call it from within a parallel region, a single thread is used. More on this later.
● FFTW
  ● module load fftw and man fftw
● PETSc
  ● module load cray-petsc{-complex} and man intro_petsc
● Trilinos
  ● module load cray-trilinos and man intro_trilinos
● Third Party Scientific Libraries
  ● module load cray-tpsl (use the online documentation)
● Iterative Refinement Toolkit (IRT), available through LibSci
  ● man intro_irt
● Cray Adaptive Sparse Kernels (CASK) are used in cray-petsc and cray-trilinos (transparent to the developer).
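As a sketch of how little is needed, loading the relevant module before compiling with the wrapper is normally sufficient, since the wrappers add the include and library paths automatically (the source and executable names are illustrative):
> module load fftw
> cc -o fft_demo fft_demo.c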
Third party Scientific Libraries (cray-tpsl)
● TPSL (Third Party Scientific Libraries) contains a collection of outside mathematical libraries that can be used with PETSc and Trilinos.
● This module will increase the flexibility of PETSc and Trilinos by providing users with multiple options for solving problems in dense and sparse linear algebra.
● The cray-tpsl module is automatically loaded when PETSc or Trilinos is loaded. The libraries included are MUMPS, SuperLU, SuperLU_DIST, ParMETIS, Hypre, SUNDIALS, and Scotch.
Check you got the right library!
● Add options to the linker to make sure you have the correct library loaded.
● -Wl adds a command to the linker from the driver
● You can ask the linker to tell you where a symbol was resolved from using the -y option.
  ● E.g. -Wl,-ydgemm_ (notice the '_' at the end of the name)
Note: do not explicitly link "-lsci". It will not be found with libsci 11+, and with 10.x it means a single-core library.
.//main.o: reference to dgemm_
/opt/xt-libsci/11.0.05.2/cray/73/mc12/lib/libsci_cray_mp.a(dgemm.o): definition of dgemm_
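Output like the above would come from a link command of roughly this form (object and executable names are illustrative):
> ftn -Wl,-ydgemm_ main.o -o app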
Threading for BLAS and LAPACK
● LibSci is compatible with OpenMP
● Control the number of threads to be used in your program using OMP_NUM_THREADS
  ● e.g., in the job script: export OMP_NUM_THREADS=16
  ● Then run with srun with --cpus-per-task=16
● What behavior you get from the library depends on your code:
  1. No threading in code
     ● The BLAS call will use OMP_NUM_THREADS threads
  2. Threaded code, outside parallel regions
     ● The BLAS call will use OMP_NUM_THREADS threads
  3. Threaded code, inside parallel regions
     ● The BLAS call will use a single thread
● Threaded LAPACK works exactly the same as threaded BLAS
  ● Anywhere LAPACK uses BLAS, those BLAS calls can be threaded.
  ● Some LAPACK routines are threaded at the higher level.
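A minimal job-script sketch along these lines (the executable name and the task/thread counts are illustrative):
export OMP_NUM_THREADS=16
srun -n 4 --cpus-per-task=16 ./blas_app.exe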
Intel MKL
● The Intel Math Kernel Library (MKL) is an alternative to LibSci
  ● It also features tuned performance for Intel CPUs
● Linking is quite complicated, but the Intel MKL Link Line Advisor can tell you what to add to your link line
  ● http://software.intel.com/sites/products/mkl/
● Using MKL together with the Intel compilers (PrgEnv-intel) is usually straightforward. Simply add -mkl to your compile and linker options.
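For example, with PrgEnv-intel loaded (the source file name is illustrative):
> ftn -mkl -o solver solver.f90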
Running applications on the Cray XC
With Native SLURM
How applications are generally run on an XC
● Most Cray XCs are batch systems.
● Users submit batch job scripts to a scheduler (e.g. PBS, MOAB, SLURM) from a login node for execution at some point in the future. Each job requests resources and predicts how long it will run.
● The scheduler (running on an external server) chooses which jobs to run and allocates appropriate resources
● The batch system will then execute the user's job script on a different node than the login node.
● The scheduler monitors the job and kills any that overrun their runtime prediction.
● User job scripts typically contain two types of statements:
  1. Serial commands that are executed by the MOM node, e.g.
     ● quick setup and post-processing commands (rm, cd, mkdir, etc.)
  2. Parallel executables that run on the compute nodes
     ● Launched using the srun command.
SLURM on the XC40 (Beginner Guide)
● The main Cray system uses the Simple Linux Utility for Resource Management (SLURM)
  ● Plenty of documentation can be found at http://slurm.schedmd.com/documentation.html
● In your daily work you will mainly encounter the following commands:
  ● sbatch  – Submit a batch script to SLURM.
  ● srun    – Run parallel jobs.
  ● scancel – Signal jobs under the control of SLURM.
  ● squeue  – Show information about running jobs.
● All information about your simulation run is contained in a batch script, which is submitted via sbatch.
● The batch script contains one or more parallel job runs executed via srun (each is a job step). Nodes are used exclusively.
● The simulations have to be executed on /scratch/…
Lifecycle of a batch script

[Diagram: job.sl is submitted with "sbatch job.sl" from the CDL (login) nodes; the scheduler allocates the requested resources; the serial commands in the script run on the SLURM gateway node, while the parallel job steps launched with srun run on the Cray XC compute nodes.]

Example Batch Job Script – job.sl

#!/bin/bash
#SBATCH -p <your_workq>
#SBATCH -A <your_account>
#SBATCH -t 30
#SBATCH -N 100
cd <some_working_directory>
srun -n 640 ./simulation.exe
rm -r <my_work_dir>/<tmp_files>
The script will start by default in the directory where sbatch has been executed. This directory is available in the environment variable SLURM_SUBMIT_DIR
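For example, a script fragment that makes this explicit (a sketch; changing into the submit directory is redundant when it is already the default working directory):
echo "Submitted from: $SLURM_SUBMIT_DIR"
cd $SLURM_SUBMIT_DIR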
Useful SLURM options (Native)
● srun is the application launcher
  ● It must be used to run applications on the XC compute nodes, interactively or in a batch job.
  ● If srun is not used, the application is launched on the gateway node (and will most likely fail).
  ● srun launches groups of Processing Elements (PEs) or tasks on the compute nodes. (PE == MPI rank || Coarray image || UPC thread || …)
● Some important parameters to set are listed below:
  ● There is no need to give all of -N, -c, -n and --ntasks-per-node, but they must be consistent.
  ● They can also be specified via #SBATCH in the batch script.
Description                         Option
Total number of tasks               -n, --ntasks
Number of tasks per compute node    --ntasks-per-node
Number of threads per task          -c, --cpus-per-task
Number of nodes                     -N, --nodes
Walltime                            -t, --time
XC40 MPI-Job Examples
Single node, single task: run a job with one task on one node with full memory.

…
#SBATCH -N 1
srun -n 1 ./<exe>

Single node: run a pure MPI job with 64 ranks on one node. The user can request a value for -n smaller than 64 but not larger.

…
#SBATCH -N 1
srun -n 64 ./<exe>
#srun -n 32 ./<exe>
#srun -n 16 ./<exe>

Multi node, fully packed: run a pure MPI job on 4 nodes with 64 MPI ranks on each node. The nodes are fully packed.

…
#SBATCH -N 4
srun -n 256 ./<exe>
XC40 MPI-Job Examples
Multi node, partially filled: run a pure MPI job on 4 nodes with fewer than 64 tasks per node. If you specify the number of nodes with -N, you can either specify the total number of tasks with -n or use --ntasks-per-node.

#!<your_shell>
…
#SBATCH -N 4
srun --ntasks-per-node=32 ./<exe>
#srun -n 128 ./<exe>
srun --ntasks-per-node=16 ./<exe>
#srun -n 64 ./<exe>

Hybrid MPI/OpenMP: run a hybrid application on 4 nodes with 16 tasks per node and 4 OpenMP threads per task, using the --cpus-per-task (-c) parameter.

#!<your_shell>
…
#SBATCH -N 4
export OMP_NUM_THREADS=4
#srun -n 64 -c 4 ./<exe>
srun --ntasks-per-node=16 -c 4 ./<exe>
Hyperthreads on the XC40 with SLURM
● Intel Hyper-Threading is a method of improving the throughput of a CPU by allowing two independent program threads to share the execution resources of one CPU
  ● When one thread stalls, the processor can execute ready instructions from a second thread instead of sitting idle
  ● Because only the thread context state and a few other resources are replicated (unlike replicating entire processor cores), the throughput improvement depends on whether the shared execution resources are a bottleneck
  ● The improvement is typically much less than 2x with two hyperthreads
● With srun, hyper-threading is turned off with --hint=nomultithread
● Simply try it; if it does not help, switch back.
#SBATCH -N 4
export OMP_NUM_THREADS=4
srun --ntasks-per-node=8 -c 4 \
     --hint=nomultithread ./<exe>
SLURM Output and Error
• SLURM redirects stdout and stderr to files; the user can specify two separate files.
• By default the script output will be written to files of the form slurm-<num>.out in your submit directory, where <num> is your SLURM batch job number.
• Output is written to the files immediately, so please do not move or delete them.
• To collect stderr and stdout in a single file, specify the same file for --output and --error.
#SBATCH --output=<my_output_file_name>.out
#SBATCH --error=<my_output_file_name>.err
• You can use %j to add the SLURM batch job number to your output files.
#SBATCH --output=<my_output_file_name>-%j.all.out
#SBATCH --error=<my_output_file_name>-%j.all.out
• Finally, you can specify a job name, which will appear in the output of squeue.
#SBATCH --job-name=<my_job_name>
Monitoring your SLURM Job
• Start your job from the shell with sbatch.
• You will see the corresponding job id right away.
> sbatch <your_job>.slurm Submitted batch job <JOBID>
• While it is running you can inspect your job with squeue.
• In order to inspect only your own jobs you can use the -u option to squeue.
• Always check that the reported resources are what you expect.
• For more information you can use > scontrol show job <JOBID>, or > sstat <JOBID> from an interactive session to get the job steps.
> squeue -u <username>
JOBID  USER      ACCOUNT  NAME  ST  REASON  START_TIME           TIME  TIME_LEFT  NODES  CPUS
74914  esposito  cray     job3  R   None    2015-06-02T13:12:37  0:08  29:52      2      128
• If, after inspecting your output files, you think that your job is not running properly, you can cancel it with scancel.
• If your job exceeds the time limit specified with #SBATCH -t, it will be automatically canceled by SLURM.
> scancel <JOBID>
> ssh gateway<num>
> salloc <your_slurm_parameters>
More on SLURM
● Behavior in specific cases:
  ● If you do not specify anything, you can run a single task on one node for one hour.
  ● Specifying -n without --ntasks-per-node still spreads the tasks evenly among the nodes.
  ● The node memory limit is currently set to 32 GB. You can use --mem=131072 to access the full memory of the node.
  ● If -c is specified without -n, then enough nodes are allocated and filled to satisfy -c and -n.
  ● Be careful when you specify SLURM parameters both in the batch script via #SBATCH and on the srun line in the script. It is possible that you do not get an abort for conflicting parameters.
● More information on core binding and NUMA affinity is given later on.
● The user is responsible for choosing the right partition and account! Use sinfo.
● For debugging and other diagnostics you can request an interactive session.
Summary of SLURM commands and variables
Slurm Workload Manager
Job Submission

salloc - Obtain a job allocation.
sbatch - Submit a batch script for later execution.
srun - Obtain a job allocation (as needed) and execute an application.
--array=<indexes> (e.g. "--array=1-10")    Job array specification (sbatch command only).
--account=<name>                           Account to be charged for resources used.
--begin=<time> (e.g. "--begin=18:00:00")   Initiate job after specified time.
--clusters=<name>                          Cluster(s) to run the job (sbatch command only).
--constraint=<features>                    Required node features.
--cpus-per-task=<count>                    Number of CPUs required per task.
--dependency=<state:jobid>                 Defer job until specified jobs reach specified state.
--error=<filename>                         File in which to store job error messages.
--exclude=<names>                          Specific host names to exclude from job allocation.
--exclusive[=user]                         Allocated nodes can not be shared with other jobs/users.
--export=<name[=value]>                    Export identified environment variables.
--gres=<name[:count]>                      Generic resources required per node.
--input=<name>                             File from which to read job input data.
--job-name=<name>                          Job name.
--label                                    Prepend task ID to output (srun command only).
--licenses=<name[:count]>                  License resources required for entire job.
--mem=<MB>                                 Memory required per node.
--mem-per-cpu=<MB>                         Memory required per allocated CPU.
-N<minnodes[-maxnodes]>                    Node count required for the job.
-n<count>                                  Number of tasks to be launched.
--nodelist=<names>                         Specific host names to include in job allocation.
--output=<name>                            File in which to store job output.
--partition=<names>                        Partition/queue in which to run the job.
--qos=<name>                               Quality Of Service.
--signal=[B:]<num>[@time]                  Signal job when approaching time limit.
--time=<time>                              Wall clock time limit.
--wrap=<command_string>                    Wrap specified command in a simple "sh" shell (sbatch command only).
Accounting

sacct - Display accounting data.
--allusers              Displays all users jobs.
--accounts=<name>       Displays jobs with specified accounts.
--endtime=<time>        End of reporting period.
--format=<spec>         Format output.
--name=<jobname>        Display jobs that have any of these name(s).
--partition=<names>     Comma separated list of partitions to select jobs and job steps from.
--state=<state_list>    Display jobs with specified states.
--starttime=<time>      Start of reporting period.
sacctmgr - View and modify account information.
Options:
--immediate Commit changes immediately.
--parseable Output delimited by '|'
Commands:
add <ENTITY> <SPECS>
create <ENTITY> <SPECS>                      Add an entity. Identical to the create command.
delete <ENTITY> where <SPECS>                Delete the specified entities.
list <ENTITY> [<SPECS>]                      Display information about the specific entity.
modify <ENTITY> where <SPECS> set <SPECS>    Modify an entity.
Entities:
account        Account associated with job.
association    Group information for job.
cluster        ClusterName parameter in the slurm.conf.
qos            Quality of Service.
Job Management

sbcast - Transfer file to a job's compute nodes.
sbcast [options] SOURCE DESTINATION
--force Replace previously existing file.
--preserve    Preserve modification times, access times, and access permissions.
scancel - Signal jobs, job arrays, and/or job steps.
--account=<name>       Operate only on jobs charging the specified account.
--name=<name>          Operate only on jobs with specified name.
--partition=<names>    Operate only on jobs in the specified partition/queue.
--qos=<name>           Operate only on jobs using the specified quality of service.
--reservation=<name>   Operate only on jobs using the specified reservation.
--state=<names>        Operate only on jobs in the specified state.
--user=<name>          Operate only on jobs from the specified user.
--nodelist=<names>     Operate only on jobs using the specified compute nodes.
squeue - View information about jobs.
--account=<name>                           View only jobs with specified accounts.
--clusters=<name>                          View jobs on specified clusters.
--format=<spec> (e.g. "--format=%i %j")    Output format to display. Specify fields, size, order, etc.
--jobs<job_id_list>                        Comma separated list of job IDs to display.
--name=<name>                              View only jobs with specified names.
--partition=<names>                        View only jobs in specified partitions.
--priority                                 Sort jobs by priority.
--qos=<name>                               View only jobs with specified Qualities Of Service.
--start                                    Report the expected start time and resources to be allocated for pending jobs in order of increasing start time.
--state=<names>                            View only jobs with specified states.
--users=<names>                            View only jobs for specified users.
sinfo - View information about nodes and partitions.
--all                  Display information about all partitions.
--dead                 If set, only report state information for non-responding (dead) nodes.
--format=<spec>        Output format to display.
--iterate=<seconds>    Print the state at specified interval.
--long                 Print more detailed information.
--Node                 Print information in a node-oriented format.
--partition=<names>    View only specified partitions.
--reservation          Display information about advanced reservations.
-R                     Display reasons nodes are in the down, drained, fail or failing state.
--state=<names>        View only nodes in specified states.
scontrol - Used to view and modify configuration and state.
Also see the sview graphical user interface version.
--details Make show command print more details.
--oneliner Print information on one line.
Commands:
create SPECIFICATION     Create a new partition or reservation.
delete SPECIFICATION     Delete the entry with the specified SPECIFICATION.
reconfigure              All Slurm daemons will re-read the configuration file.
requeue JOB_LIST         Requeue a running, suspended or completed batch job.
show ENTITY ID           Display the state of the specified entity with the specified identification.
update SPECIFICATION     Update job, step, node, partition, or reservation configuration per the supplied specification.
Environment Variables
SLURM_ARRAY_JOB_ID      Set to the job ID if part of a job array.
SLURM_ARRAY_TASK_ID     Set to the task ID if part of a job array.
SLURM_CLUSTER_NAME      Name of the cluster executing the job.
SLURM_CPUS_PER_TASK     Number of CPUs requested per task.
SLURM_JOB_ACCOUNT       Account name.
SLURM_JOB_ID            Job ID.
SLURM_JOB_NAME          Job name.
SLURM_JOB_NODELIST      Names of nodes allocated to job.
SLURM_JOB_NUM_NODES     Number of nodes allocated to job.
SLURM_JOB_PARTITION     Partition/queue running the job.
SLURM_JOB_UID           User ID of the job's owner.
SLURM_JOB_USER          User name of the job's owner.
SLURM_RESTART_COUNT     Number of times job has restarted.
SLURM_PROCID            Task ID (MPI rank).
SLURM_STEP_ID           Job step ID.
SLURM_STEP_NUM_TASKS    Task count (number of MPI ranks).
Daemons
slurmctld    Executes on cluster's "head" node to manage workload.
slurmd       Executes on each compute node to locally manage resources.
slurmdbd     Manages database of resource limits, licenses, and archives accounting records.
Copyright 2015 SchedMD LLC. All rights reserved.
http://www.schedmd.com
Last Update: 3 April 2015
SLURM compared to others
(as of 28-Apr-2013)

User Commands | PBS/Torque | Slurm | LSF | SGE | LoadLeveler
Job submission | qsub [script_file] | sbatch [script_file] | bsub [script_file] | qsub [script_file] | llsubmit [script_file]
Job deletion | qdel [job_id] | scancel [job_id] | bkill [job_id] | qdel [job_id] | llcancel [job_id]
Job status (by job) | qstat [job_id] | squeue [job_id] | bjobs [job_id] | qstat -u \* [-j job_id] | llq -u [username]
Job status (by user) | qstat -u [user_name] | squeue -u [user_name] | bjobs -u [user_name] | qstat [-u user_name] | llq -u [user_name]
Job hold | qhold [job_id] | scontrol hold [job_id] | bstop [job_id] | qhold [job_id] | llhold -r [job_id]
Job release | qrls [job_id] | scontrol release [job_id] | bresume [job_id] | qrls [job_id] | llhold -r [job_id]
Queue list | qstat -Q | squeue | bqueues | qconf -sql | llclass
Node list | pbsnodes -l | sinfo -N OR scontrol show nodes | bhosts | qhost | llstatus -L machine
Cluster status | qstat -a | sinfo | bqueues | qhost -q | llstatus -L cluster
GUI | xpbsmon | sview | xlsf OR xlsbatch | qmon | xload
Environment | PBS/Torque | Slurm | LSF | SGE | LoadLeveler
Job ID | $PBS_JOBID | $SLURM_JOBID | $LSB_JOBID | $JOB_ID | $LOAD_STEP_ID
Submit Directory | $PBS_O_WORKDIR | $SLURM_SUBMIT_DIR | $LSB_SUBCWD | $SGE_O_WORKDIR | $LOADL_STEP_INITDIR
Submit Host | $PBS_O_HOST | $SLURM_SUBMIT_HOST | $LSB_SUB_HOST | $SGE_O_HOST |
Node List | $PBS_NODEFILE | $SLURM_JOB_NODELIST | $LSB_HOSTS/LSB_MCPU_HOST | $PE_HOSTFILE | $LOADL_PROCESSOR_LIST
Job Array Index | $PBS_ARRAYID | $SLURM_ARRAY_TASK_ID | $LSB_JOBINDEX | $SGE_TASK_ID |
Job Specification | PBS/Torque | Slurm | LSF | SGE | LoadLeveler
Script directive | #PBS | #SBATCH | #BSUB | #$ | #@
Queue | -q [queue] | -p [queue] | -q [queue] | -q [queue] | class=[queue]
Node Count | -l nodes=[count] | -N [min[-max]] | -n [count] | N/A | node=[count]
CPU Count | -l ppn=[count] OR -l mppwidth=[PE_count] | -n [count] | -n [count] | -pe [PE] [count] |
Wall Clock Limit | -l walltime=[hh:mm:ss] | -t [min] OR -t [days-hh:mm:ss] | -W [hh:mm:ss] | -l h_rt=[seconds] | wall_clock_limit=[hh:mm:ss]
Standard Output File | -o [file_name] | -o [file_name] | -o [file_name] | -o [file_name] | output=[file_name]
Standard Error File | -e [file_name] | -e [file_name] | -e [file_name] | -e [file_name] | error=[file_name]
Combine stdout/err | -j oe (both to stdout) OR -j eo (both to stderr) | (use -o without -e) | (use -o without -e) | -j yes |
Copy Environment | -V | --export=[ALL | NONE | variables] | | -V | environment=COPY_ALL
Event Notification | -m abe | --mail-type=[events] | -B or -N | -m abe | notification=start|error|complete|never|always
Email Address | -M [address] | --mail-user=[address] | -u [address] | -M [address] | notify_user=[address]
Job Name | -N [name] | --job-name=[name] | -J [name] | -N [name] | job_name=[name]
Job Restart | -r [y|n] | --requeue OR --no-requeue (NOTE: configurable default) | -r | -r [yes|no] | restart=[yes|no]
Working Directory | N/A | --workdir=[dir_name] | (submission directory) | -wd [directory] | initialdir=[directory]
Resource Sharing | -l naccesspolicy=singlejob | --exclusive OR --shared | -x | -l exclusive | node_usage=not_shared
Memory Size | -l mem=[MB] | --mem=[mem][M|G|T] OR --mem-per-cpu=[mem][M|G|T] | -M [MB] | -l mem_free=[memory][K|M|G] | requirements=(Memory >= [MB])
Account to charge | -W group_list=[account] | --account=[account] | -P [account] | -A [account] |
Tasks Per Node | -l mppnppn [PEs_per_node] | --tasks-per-node=[count] | | (Fixed allocation_rule in PE) | tasks_per_node=[count]
CPUs Per Task | | --cpus-per-task=[count] | | |
Job Dependency | -d [job_id] | --depend=[state:job_id] | -w [done | exit | finish] | -hold_jid [job_id | job_name] |
Job Project | | --wckey=[name] | -P [name] | -P [name] |
Job host preference | | --nodelist=[nodes] AND/OR --exclude=[nodes] | -m [nodes] | -q [queue]@[node] OR -q [queue]@@[hostgroup] |
Quality Of Service | -l qos=[name] | --qos=[name] | | |
Job Arrays | -t [array_spec] | --array=[array_spec] (Slurm version 2.6+) | -J "name[array_spec]" | -t [array_spec] |
Generic Resources | -l other=[resource_spec] | --gres=[resource_spec] | | -l [resource]=[value] |
Licenses | | --licenses=[license_spec] | -R "rusage[license_spec]" | -l [license]=[count] |
Begin Time | -A "YYYY-MM-DD HH:MM:SS" | --begin=YYYY-MM-DD[THH:MM[:SS]] | -b [[year:][month:]day:]hour:minute | -a [YYMMDDhhmm] |
http://slurm.schedmd.com/documentation.html