Compiling applications for the Cray XC
Compiler Driver Wrappers (1)
● All applications that will run in parallel on the Cray XC should be compiled with the standard language wrappers. The compiler drivers for each language are:
  ● cc  – wrapper around the C compiler
  ● CC  – wrapper around the C++ compiler
  ● ftn – wrapper around the Fortran compiler
● These scripts will choose the required compiler version, target architecture options, scientific libraries and their include files automatically from the currently loaded module environment. Use the -craype-verbose flag to see the default options.
● Use them exactly like you would the original compiler, e.g. to compile prog1.f90:
> ftn -c <any_other_flags> prog1.f90
Compiler Driver Wrappers (2)
● The scripts choose which compiler to use from the PrgEnv module loaded
● Use module swap to change the PrgEnv, e.g.
> module swap PrgEnv-cray PrgEnv-intel
● PrgEnv-cray is loaded by default at login. This may differ on other Cray systems.
● Use module list to check what is currently loaded.
● The Cray MPI module is loaded by default (cray-mpich).
● To support SHMEM, load the cray-shmem module.
PrgEnv         Description                     Real Compilers
PrgEnv-cray    Cray Compilation Environment    crayftn, craycc, crayCC
PrgEnv-intel   Intel Composer Suite            ifort, icc, icpc
PrgEnv-gnu     GNU Compiler Collection         gfortran, gcc, g++
PrgEnv-pgi     Portland Group Compilers        pgf90, pgcc, pgCC
Compiler Versions
● There are usually multiple versions of each compiler available to users.
● The most recent version is usually the default and will be loaded when swapping the PrgEnv.
● To change the version of the compiler in use, swap the compiler module, e.g.
> module swap cce cce/8.3.10
PrgEnv         Compiler Module
PrgEnv-cray    cce
PrgEnv-intel   intel
PrgEnv-gnu     gcc
PrgEnv-pgi     pgi
EXCEPTION: Cross Compiling Environment
● The wrapper scripts, ftn, cc, and CC, will create a highly optimized executable tuned for the Cray XC’s compute nodes (cross compilation).
● This executable may not run on the login nodes
  ● Login nodes do not support running distributed memory applications
  ● Some Cray architectures may have different processors in the login and compute nodes. A typical error is '… illegal instruction …'.
● If you are compiling for the login nodes
  ● You should use the original direct compiler commands, e.g. ifort, pgcc, crayftn, gcc, …
  ● PATH will change with modules; all libraries will have to be linked in manually.
● Conversely, you can use the compiler wrappers {cc,CC,ftn} with the -target-cpu= option, choosing among {abudhabi, haswell, interlagos, istanbul, ivybridge, mc12, mc8, sandybridge, shanghai, x86_64}. The x86_64 target is the most compatible but also the least specific; a sketch is shown below.
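A minimal sketch of this (the source and executable names are illustrative; x86_64 is the target value named above):
> cc -target-cpu=x86_64 -o helper_tool helper_tool.c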
About the -I, -L and -l flags
● For libraries and include files provided via module files, you should NOT add anything to your Makefile
  ● No additional MPI flags are needed (included by the wrappers)
  ● You do not need to add any -I, -l or -L flags for the Cray-provided libraries
● If your Makefile needs an input for -L to work correctly, try using '.'
● If you really, really need a specific path, try checking ‘module show <X>’ for some environment variables
Dynamic vs Static linking

● Currently static linking is the default
  ● May change in the future
  ● Already changed when linking for GPUs (XK6/XK7 nodes)
● To decide how to link, you can either
  1. set CRAYPE_LINK_TYPE to "static" or "dynamic", or
  2. pass the '-static' or '-dynamic' option to the linking wrapper (cc, CC or ftn).
● Features of dynamic linking:
  ● Smaller executable, automatic use of new libs
  ● Might need longer startup time to load and find the libs
  ● Environment (loaded modules) should be the same between your compiler setup and your batch script (e.g. when switching to PrgEnv-intel)
● Features of static linking:
  ● Larger executable (usually not a problem)
  ● Faster startup
  ● Application will run the same code every time it runs (independent of environment)
● If you want to hardcode the rpath into the executable:
  ● Set CRAY_ADD_RPATH=yes during compilation
  ● This will always load the same version of the lib when running, independent of the version loaded by modules
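For example, a minimal sketch of a dynamically linked build (the source and executable names are illustrative):
> export CRAYPE_LINK_TYPE=dynamic     # or pass -dynamic to the wrapper instead
> ftn -o simulation.exe simulation.f90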
OpenMP
● OpenMP is supported by all of the PrgEnvs.
● CCE (PrgEnv-cray) recognizes and interprets OpenMP directives by default. If you have OpenMP directives in your application but do not wish to use them, disable OpenMP recognition with -hnoomp.
● Intel OpenMP spawns an extra helper thread, which may cause oversubscription. Hints on that will follow.
PrgEnv         Enable OpenMP    Disable OpenMP
PrgEnv-cray    -homp            -hnoomp
PrgEnv-intel   -openmp
PrgEnv-gnu     -fopenmp
PrgEnv-pgi     -mp
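For example, a hybrid source file (the name is illustrative) would be built with the flag matching the loaded PrgEnv:
> ftn -homp hybrid.f90       # PrgEnv-cray
> ftn -openmp hybrid.f90     # PrgEnv-intel
> ftn -fopenmp hybrid.f90    # PrgEnv-gnu
> ftn -mp hybrid.f90         # PrgEnv-pgi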
Compiler man Pages
● For more information on individual compilers, see the man pages listed in the table below.
● To verify that you are using the correct version of a compiler, use:
  ● the -V option on a cc, CC, or ftn command with PGI, Intel and Cray
  ● the --version option on a cc, CC, or ftn command with GNU
PrgEnv         C            C++           Fortran
PrgEnv-cray    man craycc   man crayCC    man crayftn
PrgEnv-intel   man icc      man icpc      man ifort
PrgEnv-gnu     man gcc      man g++       man gfortran
PrgEnv-pgi     man pgcc     man pgCC      man pgf90
Wrappers       man cc       man CC        man ftn
Using Compilers
Quick Overview
Using Compiler Feedback
● Compilers can generate an annotated listing of your source code indicating important optimizations. This is useful for targeted use of compiler flags.
● CCE
  ● ftn -rm
  ● {cc,CC} -hlist=a
● Intel
  ● ftn/cc -opt-report 3 -vec-report6
  ● If you want this written to a file: add -opt-report-file=filename
  ● See ifort --help reports
● GNU
  ● -ftree-vectorizer-verbose=9
● PGI
  ● -Minfo=<…>
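For example, with CCE a loopmark listing for one source file could be requested like this (the file name is illustrative); the annotated listing is typically written to a corresponding .lst file:
> ftn -rm -c resid.f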
Compiler feedback: Loopmark
● For example, with the Cray compiler:

%%%    L o o p m a r k   L e g e n d    %%%

Primary Loop Type         Modifiers
-----------------         ---------
A - Pattern matched       a - vector atomic memory operation
                          b - blocked
C - Collapsed             f - fused
D - Deleted               i - interchanged
E - Cloned                m - streamed but not partitioned
I - Inlined               p - conditional, partial and/or computed
M - Multithreaded         r - unrolled
P - Parallel/Tasked       s - shortloop
V - Vectorized            t - array syntax temp used
                          w - unwound
Compiler feedback: Loopmark (cont.)

29. b-------<       do i3=2,n3-1
30. b b-----<         do i2=2,n2-1
31. b b Vr--<           do i1=1,n1
32. b b Vr                u1(i1) = u(i1,i2-1,i3) + u(i1,i2+1,i3)
33. b b Vr      *                + u(i1,i2,i3-1) + u(i1,i2,i3+1)
34. b b Vr                u2(i1) = u(i1,i2-1,i3-1) + u(i1,i2+1,i3-1)
35. b b Vr      *                + u(i1,i2-1,i3+1) + u(i1,i2+1,i3+1)
36. b b Vr-->           enddo
37. b b Vr--<           do i1=2,n1-1
38. b b Vr                r(i1,i2,i3) = v(i1,i2,i3)
39. b b Vr      *                     - a(0) * u(i1,i2,i3)
40. b b Vr      *                     - a(2) * ( u2(i1) + u1(i1-1) + u1(i1+1) )
41. b b Vr      *                     - a(3) * ( u2(i1-1) + u2(i1+1) )
42. b b Vr-->           enddo
43. b b----->         enddo
44. b------->       enddo
Compiler Feedback: Loopmark (cont.)

ftn-6289 ftn: VECTOR File = resid.f, Line = 29
  A loop starting at line 29 was not vectorized because a recurrence was found on "U1" between lines 32 and 38.
ftn-6049 ftn: SCALAR File = resid.f, Line = 29
  A loop starting at line 29 was blocked with block size 4.
ftn-6289 ftn: VECTOR File = resid.f, Line = 30
  A loop starting at line 30 was not vectorized because a recurrence was found on "U1" between lines 32 and 38.
ftn-6049 ftn: SCALAR File = resid.f, Line = 30
  A loop starting at line 30 was blocked with block size 4.
ftn-6005 ftn: SCALAR File = resid.f, Line = 31
  A loop starting at line 31 was unrolled 4 times.
ftn-6204 ftn: VECTOR File = resid.f, Line = 31
  A loop starting at line 31 was vectorized.
ftn-6005 ftn: SCALAR File = resid.f, Line = 37
  A loop starting at line 37 was unrolled 4 times.
ftn-6204 ftn: VECTOR File = resid.f, Line = 37
  A loop starting at line 37 was vectorized.
Recommended compiler optimization levels
● Cray compiler
  ● The default optimization level (i.e. no flags) is equivalent to -O3 of most other compilers. CCE optimizes rather aggressively by default, but this is also the most thoroughly tested configuration.
  ● Try with -O3 -hfp3 (also tested thoroughly)
    ● -hfp3 gives you a lot more floating-point optimization, especially 32-bit.
    ● In case of precision errors, try a lower -hfp<number> (-hfp1 first, only -hfp0 if absolutely necessary).
● GNU compiler
  ● Almost all HPC applications compile correctly with -O3, so use that instead of the cautious default.
  ● -ffast-math may give some extra performance.
● Intel compiler
  ● The default optimization level (equal to -O2) is safe.
  ● Try with -O3. If that still works, you may try -Ofast -fp-model fast=2.
● Use the -craype-verbose flag to {cc,CC,ftn} to show the options actually being used.
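As an example, a typical CCE build following these recommendations might look like this (the source file name is illustrative):
> ftn -O3 -hfp3 -craype-verbose -c solver.f90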
Inlining & inter-procedural optimization
● Cray compiler
  ● Inlining within a file is enabled by default.
  ● The command line options -OipaN (ftn) and -hipaN (cc/CC), where N=0..4, provide a set of choices for inlining behavior:
    ● 0 disables inlining, 3 is the default, 4 is even more elaborate.
  ● The -Oipafrom= (ftn) or -hipafrom= (cc/CC) option instructs the compiler to look for inlining candidates in other source files, or a directory of source files.
  ● -hwp combined with -h pl=… enables whole program automatic inlining.
● GNU compiler
  ● Quite elaborate inlining is enabled by -O3.
● Intel compiler
  ● Inlining within a file is enabled by default.
  ● Multi-file inlining is enabled by the flag -ipo.
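For example, a sketch of cross-file inlining with CCE, assuming the inlining candidates live in a directory called inline_src/ (the directory and file names are illustrative):
> ftn -Oipafrom=inline_src/ -c main.f90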
Loop transformations
● Cray compiler
  ● The most useful techniques are already in their aggressive state by default.
  ● One may try to improve loop restructuring for better vectorization with -h vector3.
● GNU compiler
  ● Loop blocking (aka tiling) with -floop-block
  ● Loop unrolling with -funroll-loops or -funroll-all-loops
● Intel compiler
  ● Loop unrolling with -funroll-loops or -unroll-aggressive
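For example (source file names illustrative), asking CCE for more aggressive loop restructuring, or GCC for blocking and unrolling:
> ftn -h vector3 -c stencil.f90
> cc -O3 -floop-block -funroll-loops -c stencil.c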
Directives for the Cray Compiler
● If you see from the compiler feedback that a loop has not been blocked, unrolled, or vectorized but you are convinced that it should be, you can use compiler directives instead of raising the optimization level -O…
● The Cray compiler supports a full and growing set of directives and pragmas, e.g.
  ● !dir$ concurrent
  ● !dir$ ivdep
  ● !dir$ interchange
  ● !dir$ unroll
  ● !dir$ loop_info [max_trips] [cache_na]
  ● !dir$ blockable
● More information is given in
  ● man directives
  ● man loop_info
!dir$ blockable(j,k)
!dir$ blockingsize(16)
do k = 6, nz-5
  do j = 6, ny-5
    do i = 6, nx-5
      ! stencil
    end do
  end do
end do
Summary
● Four compiler environments are available on the XC40:
  ● Cray (PrgEnv-cray is the default)
  ● Intel (PrgEnv-intel)
  ● GNU (PrgEnv-gnu)
  ● PGI (PrgEnv-pgi)
● All of them are accessed through the wrappers ftn, cc and CC; just do module swap to change the compiler or its version.
● There is no universally fastest compiler
  ● Performance strongly depends on the application (and even the input)
  ● We try, however, to excel with the Cray Compiler Environment
  ● If you see a case where some other compiler yields better performance, let us know!
● Compiler flags do matter
  ● Be ready to spend some effort finding the best ones for your application.
● More information is given at the end of this presentation.
Cray Scientific Libraries
Overview
Cray Scientific Libraries
[Diagram: Cray Scientific Libraries grouped by area]
  ● FFT: FFTW
  ● Dense: BLAS, LAPACK, ScaLAPACK, IRT, CASE
  ● Sparse: CASK, PETSc, Trilinos

IRT  – Iterative Refinement Toolkit
CASK – Cray Adaptive Sparse Kernels
CASE – Cray Adaptive Simplified Eigensolver
● A large variety of standard libraries is available via modules
● Optimized for Cray hardware and also for the Haswell processor
What makes Cray libraries special
1. Node performance
   ● Highly tuned routines at the low level (e.g. BLAS)
2. Network performance
   ● Optimized for network performance
   ● Overlap between communication and computation
   ● Use the best available low-level mechanism
   ● Use adaptive parallel algorithms
3. Highly adaptive software
   ● Use auto-tuning and adaptation to give the user the best known (or very good) code at runtime
4. Productivity features
   ● Simple interfaces into complex software
Library Usage Overview.
● LibSci
  ● Includes BLAS, CBLAS, BLACS, LAPACK, ScaLAPACK
  ● Module is loaded by default (man libsci)
  ● Threading is used within LibSci (controlled by OMP_NUM_THREADS). If you call it from within a parallel region, a single thread is used. More on this later.
● FFTW
  ● module load fftw and man fftw
● PETSc
  ● module load cray-petsc{-complex} and man intro_petsc
● Trilinos
  ● module load cray-trilinos and man intro_trilinos
● Third Party Scientific Libraries
  ● module load cray-tpsl (use the online documentation)
● Iterative Refinement Toolkit (IRT), available through LibSci
  ● man intro_irt
● Cray Adaptive Sparse Kernels (CASK) are used in cray-petsc and cray-trilinos (transparent to the developer).
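As a sketch of how little is needed, loading the relevant module before compiling with the wrapper is normally sufficient, since the wrappers add the include and library paths automatically (the source and executable names are illustrative):
> module load fftw
> cc -o fft_demo fft_demo.c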
Third party Scientific Libraries (cray-tpsl)
● TPSL (Third Party Scientific Libraries) contains a collection of outside mathematical libraries that can be used with PETSc and Trilinos.
● This module will increase the flexibility of PETSc and Trilinos by providing users with multiple options for solving problems in dense and sparse linear algebra.
● The cray-tpsl module is automatically loaded when PETSc or Trilinos is loaded. The libraries included are MUMPS, SuperLU, SuperLU_DIST, ParMETIS, Hypre, SUNDIALS, and Scotch.
Check you got the right library!
● Add options to the linker to make sure you have the correct library loaded.
● -Wl adds a command to the linker from the driver
● You can ask the linker to tell you where a symbol was resolved from using the -y option.
  ● E.g. -Wl,-ydgemm_ (notice the '_' at the end of the name)
Note: do not explicitly link "-lsci". It will not be found with libsci 11+, and with 10.x it means a single-core library.
.//main.o: reference to dgemm_
/opt/xt-libsci/11.0.05.2/cray/73/mc12/lib/libsci_cray_mp.a(dgemm.o): definition of dgemm_
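Output like the above would come from a link command of roughly this form (object and executable names are illustrative):
> ftn -Wl,-ydgemm_ main.o -o app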
Threading for BLAS and LAPACK
● LibSci is compatible with OpenMP
● Control the number of threads to be used in your program using OMP_NUM_THREADS
  ● e.g., in the job script: export OMP_NUM_THREADS=16
  ● Then run with srun with --cpus-per-task=16
● What behavior you get from the library depends on your code:
  1. No threading in code
     ● The BLAS call will use OMP_NUM_THREADS threads
  2. Threaded code, outside parallel regions
     ● The BLAS call will use OMP_NUM_THREADS threads
  3. Threaded code, inside parallel regions
     ● The BLAS call will use a single thread
● Threaded LAPACK works exactly the same as threaded BLAS
  ● Anywhere LAPACK uses BLAS, those BLAS calls can be threaded.
  ● Some LAPACK routines are threaded at the higher level.
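A minimal job-script sketch along these lines (the executable name and the task/thread counts are illustrative):
export OMP_NUM_THREADS=16
srun -n 4 --cpus-per-task=16 ./blas_app.exe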
Intel MKL
● The Intel Math Kernel Library (MKL) is an alternative to LibSci
  ● It also features tuned performance for Intel CPUs
● Linking is quite complicated, but the Intel MKL Link Line Advisor can tell you what to add to your link line
  ● http://software.intel.com/sites/products/mkl/
● Using MKL together with the Intel compilers (PrgEnv-intel) is usually straightforward. Simply add -mkl to your compile and linker options.
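For example, with PrgEnv-intel loaded (the source file name is illustrative):
> ftn -mkl -o solver solver.f90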
Running applications on the Cray XC
With Native SLURM
How applications are generally run on an XC
● Most Cray XCs are batch systems.
● Users submit batch job scripts to a scheduler (e.g. PBS, MOAB, SLURM) from a login node for execution at some point in the future. Each job requests resources and predicts how long it will run.
● The scheduler (running on an external server) chooses which jobs to run and allocates appropriate resources
● The batch system will then execute the user's job script on a different node than the login node.
● The scheduler monitors the job and kills any that overrun their runtime prediction.
● User job scripts typically contain two types of statements:
  1. Serial commands that are executed by the MOM node, e.g.
     ● quick setup and post-processing commands (rm, cd, mkdir, etc.)
  2. Parallel executables that run on the compute nodes
     ● Launched using the srun command.
SLURM on the XC40 (Beginner Guide)
● The main Cray system uses the Simple Linux Utility for Resource Management (SLURM)
  ● Plenty of documentation can be found at http://slurm.schedmd.com/documentation.html
● In your daily work you will mainly encounter the following commands:
  ● sbatch  – Submit a batch script to SLURM.
  ● srun    – Run parallel jobs.
  ● scancel – Signal jobs under the control of SLURM.
  ● squeue  – Show information about running jobs.
● All information about your simulation run is contained in a batch script, which is submitted via sbatch.
● The batch script contains one or more parallel job runs executed via srun (each is a job step). Nodes are used exclusively.
● The simulations have to be executed on /scratch/…
Lifecycle of a batch script

[Diagram: job.sl is submitted with "sbatch job.sl" from the CDL (login) nodes; the scheduler allocates the requested resources; the serial commands in the script run on the SLURM gateway node, while the parallel job steps launched with srun run on the Cray XC compute nodes.]

Example Batch Job Script – job.sl

#!/bin/bash
#SBATCH -p <your_workq>
#SBATCH -A <your_account>
#SBATCH -t 30
#SBATCH -N 100
cd <some_working_directory>
srun -n 640 ./simulation.exe
rm -r <my_work_dir>/<tmp_files>
The script will start by default in the directory where sbatch has been executed. This directory is available in the environment variable SLURM_SUBMIT_DIR
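For example, a script fragment that makes this explicit (a sketch; changing into the submit directory is redundant when it is already the default working directory):
echo "Submitted from: $SLURM_SUBMIT_DIR"
cd $SLURM_SUBMIT_DIR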
Useful SLURM options (Native)
● srun is the application launcher
  ● It must be used to run applications on the XC compute nodes, interactively or in a batch job.
  ● If srun is not used, the application is launched on the gateway node (and will most likely fail).
  ● srun launches groups of Processing Elements (PEs) or tasks on the compute nodes. (PE == MPI rank || Coarray image || UPC thread || …)
● Some important parameters to set are listed below:
  ● There is no need to give all of -N, -c, -n and --ntasks-per-node, but they must be consistent.
  ● They can also be specified via #SBATCH in the batch script.
Description                         Option
Total number of tasks               -n, --ntasks
Number of tasks per compute node    --ntasks-per-node
Number of threads per task          -c, --cpus-per-task
Number of nodes                     -N, --nodes
Walltime                            -t, --time
XC40 MPI-Job Examples
Single node, single task: run a job with one task on one node with full memory.

…
#SBATCH -N 1
srun -n 1 ./<exe>

Single node: run a pure MPI job with 64 ranks on one node. The user can request a value for -n smaller than 64 but not larger.

…
#SBATCH -N 1
srun -n 64 ./<exe>
#srun -n 32 ./<exe>
#srun -n 16 ./<exe>

Multi node, fully packed: run a pure MPI job on 4 nodes with 64 MPI ranks on each node. The nodes are fully packed.

…
#SBATCH -N 4
srun -n 256 ./<exe>
XC40 MPI-Job Examples
Multi node, partially filled: run a pure MPI job on 4 nodes with fewer than 64 tasks per node. If you specify the number of nodes with -N, you can either specify the total number of tasks with -n or use --ntasks-per-node.

#!<your_shell>
…
#SBATCH -N 4
srun --ntasks-per-node=32 ./<exe>
#srun -n 128 ./<exe>
srun --ntasks-per-node=16 ./<exe>
#srun -n 64 ./<exe>

Hybrid MPI/OpenMP: run a hybrid application on 4 nodes with 16 tasks per node and 4 OpenMP threads per task, using the --cpus-per-task (-c) parameter.

#!<your_shell>
…
#SBATCH -N 4
export OMP_NUM_THREADS=4
#srun -n 64 -c 4 ./<exe>
srun --ntasks-per-node=16 -c 4 ./<exe>
Hyperthreads on the XC40 with SLURM
● Intel Hyper-Threading is a method of improving the throughput of a CPU by allowing two independent program threads to share the execution resources of one CPU
  ● When one thread stalls, the processor can execute ready instructions from a second thread instead of sitting idle
  ● Because only the thread context state and a few other resources are replicated (unlike replicating entire processor cores), the throughput improvement depends on whether the shared execution resources are a bottleneck
  ● The improvement is typically much less than 2x with two hyperthreads
● With srun, hyper-threading is turned off with --hint=nomultithread
● Simply try it; if it does not help, switch back.
#SBATCH -N 4
export OMP_NUM_THREADS=4
srun --ntasks-per-node=8 -c 4 \
     --hint=nomultithread ./<exe>
SLURM Output and Error
• SLURM redirects stdout and stderr to files; the user can specify two separate files.
• By default the script output will be written to files of the form slurm-<num>.out in your submit directory, where <num> is your SLURM batch job number.
• Output is written to the files immediately, so please do not move or delete them.
• To collect stderr and stdout in a single file, specify the same file for --output and --error.
#SBATCH --output=<my_output_file_name>.out
#SBATCH --error=<my_output_file_name>.err
• You can use %j to add the SLURM batch job number to your output files.
#SBATCH --output=<my_output_file_name>-%j.all.out
#SBATCH --error=<my_output_file_name>-%j.all.out
• Finally, you can specify a job name, which will appear in the output of squeue.
#SBATCH --job-name=<my_job_name>
Monitoring your SLURM Job
• Start your job from the shell with sbatch.
• You will see the corresponding job id right away.
> sbatch <your_job>.slurm Submitted batch job <JOBID>
• While it is running you can inspect your job with squeue.
• In order to inspect only your own jobs you can use the -u option to squeue.
• Always check that the reported resources are what you expect.
• For more information you can use > scontrol show job <JOBID>, or > sstat <JOBID> from an interactive session to get the job steps.
> squeue -u <username>
JOBID  USER      ACCOUNT  NAME  ST  REASON  START_TIME           TIME  TIME_LEFT  NODES  CPUS
74914  esposito  cray     job3  R   None    2015-06-02T13:12:37  0:08  29:52      2      128
• If, after inspecting your output files, you think that your job is not running properly, you can cancel it with scancel.
• If your job exceeds the time limit specified with #SBATCH -t, it will be automatically canceled by SLURM.
> scancel <JOBID>
> ssh gateway<num>
> salloc <your_slurm_parameters>
More on SLURM
● Behavior in specific cases:
  ● If you do not specify anything, you can run a single task on one node for one hour.
  ● Specifying -n without --ntasks-per-node still spreads the tasks evenly among the nodes.
  ● The node memory limit is currently set to 32 GB. You can use --mem=131072 to access the full memory of the node.
  ● If -c is specified without -n, then enough nodes are allocated and filled to satisfy -c and -n.
  ● Be careful when you specify SLURM parameters both in the batch script via #SBATCH and on the srun line in the script. It is possible that you do not get an abort for conflicting parameters.
● More information on core binding and NUMA affinity is given later on.
● The user is responsible for choosing the right partition and account! Use sinfo.
● For debugging and other diagnostics you can request an interactive session.
Summary of SLURM commands and variables
Slurm Workload Manager
Job Submission

salloc - Obtain a job allocation.
sbatch - Submit a batch script for later execution.
srun - Obtain a job allocation (as needed) and execute an application.
--array=<indexes> (e.g. "--array=1-10")    Job array specification (sbatch command only).
--account=<name>                           Account to be charged for resources used.
--begin=<time> (e.g. "--begin=18:00:00")   Initiate job after specified time.
--clusters=<name>                          Cluster(s) to run the job (sbatch command only).
--constraint=<features>                    Required node features.
--cpus-per-task=<count>                    Number of CPUs required per task.
--dependency=<state:jobid>                 Defer job until specified jobs reach specified state.
--error=<filename>                         File in which to store job error messages.
--exclude=<names>                          Specific host names to exclude from job allocation.
--exclusive[=user]                         Allocated nodes can not be shared with other jobs/users.
--export=<name[=value]>                    Export identified environment variables.
--gres=<name[:count]>                      Generic resources required per node.
--input=<name>                             File from which to read job input data.
--job-name=<name>                          Job name.
--label                                    Prepend task ID to output (srun command only).
--licenses=<name[:count]>                  License resources required for entire job.
--mem=<MB>                                 Memory required per node.
--mem-per-cpu=<MB>                         Memory required per allocated CPU.
-N<minnodes[-maxnodes]>                    Node count required for the job.
-n<count>                                  Number of tasks to be launched.
--nodelist=<names>                         Specific host names to include in job allocation.
--output=<name>                            File in which to store job output.
--partition=<names>                        Partition/queue in which to run the job.
--qos=<name>                               Quality Of Service.
--signal=[B:]<num>[@time]                  Signal job when approaching time limit.
--time=<time>                              Wall clock time limit.
--wrap=<command_string>                    Wrap specified command in a simple "sh" shell (sbatch command only).
Accounting

sacct - Display accounting data.
--allusers              Displays all users jobs.
--accounts=<name>       Displays jobs with specified accounts.
--endtime=<time>        End of reporting period.
--format=<spec>         Format output.
--name=<jobname>        Display jobs that have any of these name(s).
--partition=<names>     Comma separated list of partitions to select jobs and job steps from.
--state=<state_list>    Display jobs with specified states.
--starttime=<time>      Start of reporting period.
sacctmgr - View and modify account information.
Options:
--immediate Commit changes immediately.
--parseable Output delimited by '|'
Commands:
add <ENTITY> <SPECS>
create <ENTITY> <SPECS>                      Add an entity. Identical to the create command.
delete <ENTITY> where <SPECS>                Delete the specified entities.
list <ENTITY> [<SPECS>]                      Display information about the specific entity.
modify <ENTITY> where <SPECS> set <SPECS>    Modify an entity.
Entities:
account        Account associated with job.
association    Group information for job.
cluster        ClusterName parameter in the slurm.conf.
qos            Quality of Service.
Job Management

sbcast - Transfer file to a job's compute nodes.
sbcast [options] SOURCE DESTINATION
--force Replace previously existing file.
--preserve    Preserve modification times, access times, and access permissions.
scancel - Signal jobs, job arrays, and/or job steps.
--account=<name>       Operate only on jobs charging the specified account.
--name=<name>          Operate only on jobs with specified name.
--partition=<names>    Operate only on jobs in the specified partition/queue.
--qos=<name>           Operate only on jobs using the specified quality of service.
--reservation=<name>   Operate only on jobs using the specified reservation.
--state=<names>        Operate only on jobs in the specified state.
--user=<name>          Operate only on jobs from the specified user.
--nodelist=<names>     Operate only on jobs using the specified compute nodes.
squeue - View information about jobs.
--account=<name>                           View only jobs with specified accounts.
--clusters=<name>                          View jobs on specified clusters.
--format=<spec> (e.g. "--format=%i %j")    Output format to display. Specify fields, size, order, etc.
--jobs<job_id_list>                        Comma separated list of job IDs to display.
--name=<name>                              View only jobs with specified names.
--partition=<names>                        View only jobs in specified partitions.
--priority                                 Sort jobs by priority.
--qos=<name>                               View only jobs with specified Qualities Of Service.
--start                                    Report the expected start time and resources to be allocated for pending jobs in order of increasing start time.
--state=<names>                            View only jobs with specified states.
--users=<names>                            View only jobs for specified users.
sinfo - View information about nodes and partitions.
--all                  Display information about all partitions.
--dead                 If set, only report state information for non-responding (dead) nodes.
--format=<spec>        Output format to display.
--iterate=<seconds>    Print the state at specified interval.
--long                 Print more detailed information.
--Node                 Print information in a node-oriented format.
--partition=<names>    View only specified partitions.
--reservation          Display information about advanced reservations.
-R                     Display reasons nodes are in the down, drained, fail or failing state.
--state=<names>        View only nodes in specified states.
scontrol - Used to view and modify configuration and state.
Also see the sview graphical user interface version.
--details Make show command print more details.
--oneliner Print information on one line.
Commands:
create SPECIFICATION     Create a new partition or reservation.
delete SPECIFICATION     Delete the entry with the specified SPECIFICATION.
reconfigure              All Slurm daemons will re-read the configuration file.
requeue JOB_LIST         Requeue a running, suspended or completed batch job.
show ENTITY ID           Display the state of the specified entity with the specified identification.
update SPECIFICATION     Update job, step, node, partition, or reservation configuration per the supplied specification.
Environment Variables
SLURM_ARRAY_JOB_ID      Set to the job ID if part of a job array.
SLURM_ARRAY_TASK_ID     Set to the task ID if part of a job array.
SLURM_CLUSTER_NAME      Name of the cluster executing the job.
SLURM_CPUS_PER_TASK     Number of CPUs requested per task.
SLURM_JOB_ACCOUNT       Account name.
SLURM_JOB_ID            Job ID.
SLURM_JOB_NAME          Job name.
SLURM_JOB_NODELIST      Names of nodes allocated to job.
SLURM_JOB_NUM_NODES     Number of nodes allocated to job.
SLURM_JOB_PARTITION     Partition/queue running the job.
SLURM_JOB_UID           User ID of the job's owner.
SLURM_JOB_USER          User name of the job's owner.
SLURM_RESTART_COUNT     Number of times job has restarted.
SLURM_PROCID            Task ID (MPI rank).
SLURM_STEP_ID           Job step ID.
SLURM_STEP_NUM_TASKS    Task count (number of MPI ranks).
Daemons
slurmctld    Executes on cluster's "head" node to manage workload.
slurmd       Executes on each compute node to locally manage resources.
slurmdbd     Manages database of resource limits, licenses, and archives accounting records.
Copyright 2015 SchedMD LLC. All rights reserved.
http://www.schedmd.com
Last Update: 3 April 2015
SLURM compared to others
(as of 28-Apr-2013)

User Commands | PBS/Torque | Slurm | LSF | SGE | LoadLeveler
Job submission | qsub [script_file] | sbatch [script_file] | bsub [script_file] | qsub [script_file] | llsubmit [script_file]
Job deletion | qdel [job_id] | scancel [job_id] | bkill [job_id] | qdel [job_id] | llcancel [job_id]
Job status (by job) | qstat [job_id] | squeue [job_id] | bjobs [job_id] | qstat -u \* [-j job_id] | llq -u [username]
Job status (by user) | qstat -u [user_name] | squeue -u [user_name] | bjobs -u [user_name] | qstat [-u user_name] | llq -u [user_name]
Job hold | qhold [job_id] | scontrol hold [job_id] | bstop [job_id] | qhold [job_id] | llhold -r [job_id]
Job release | qrls [job_id] | scontrol release [job_id] | bresume [job_id] | qrls [job_id] | llhold -r [job_id]
Queue list | qstat -Q | squeue | bqueues | qconf -sql | llclass
Node list | pbsnodes -l | sinfo -N OR scontrol show nodes | bhosts | qhost | llstatus -L machine
Cluster status | qstat -a | sinfo | bqueues | qhost -q | llstatus -L cluster
GUI | xpbsmon | sview | xlsf OR xlsbatch | qmon | xload
Environment | PBS/Torque | Slurm | LSF | SGE | LoadLeveler
Job ID | $PBS_JOBID | $SLURM_JOBID | $LSB_JOBID | $JOB_ID | $LOAD_STEP_ID
Submit Directory | $PBS_O_WORKDIR | $SLURM_SUBMIT_DIR | $LSB_SUBCWD | $SGE_O_WORKDIR | $LOADL_STEP_INITDIR
Submit Host | $PBS_O_HOST | $SLURM_SUBMIT_HOST | $LSB_SUB_HOST | $SGE_O_HOST |
Node List | $PBS_NODEFILE | $SLURM_JOB_NODELIST | $LSB_HOSTS/LSB_MCPU_HOST | $PE_HOSTFILE | $LOADL_PROCESSOR_LIST
Job Array Index | $PBS_ARRAYID | $SLURM_ARRAY_TASK_ID | $LSB_JOBINDEX | $SGE_TASK_ID |
Job Specification | PBS/Torque | Slurm | LSF | SGE | LoadLeveler
Script directive | #PBS | #SBATCH | #BSUB | #$ | #@
Queue | -q [queue] | -p [queue] | -q [queue] | -q [queue] | class=[queue]
Node Count | -l nodes=[count] | -N [min[-max]] | -n [count] | N/A | node=[count]
CPU Count | -l ppn=[count] OR -l mppwidth=[PE_count] | -n [count] | -n [count] | -pe [PE] [count] |
Wall Clock Limit | -l walltime=[hh:mm:ss] | -t [min] OR -t [days-hh:mm:ss] | -W [hh:mm:ss] | -l h_rt=[seconds] | wall_clock_limit=[hh:mm:ss]
Standard Output File | -o [file_name] | -o [file_name] | -o [file_name] | -o [file_name] | output=[file_name]
Standard Error File | -e [file_name] | -e [file_name] | -e [file_name] | -e [file_name] | error=[file_name]
Combine stdout/err | -j oe (both to stdout) OR -j eo (both to stderr) | (use -o without -e) | (use -o without -e) | -j yes |
Copy Environment | -V | --export=[ALL | NONE | variables] | | -V | environment=COPY_ALL
Event Notification | -m abe | --mail-type=[events] | -B or -N | -m abe | notification=start|error|complete|never|always
Email Address | -M [address] | --mail-user=[address] | -u [address] | -M [address] | notify_user=[address]
Job Name | -N [name] | --job-name=[name] | -J [name] | -N [name] | job_name=[name]
Job Restart | -r [y|n] | --requeue OR --no-requeue (NOTE: configurable default) | -r | -r [yes|no] | restart=[yes|no]
Working Directory | N/A | --workdir=[dir_name] | (submission directory) | -wd [directory] | initialdir=[directory]
Resource Sharing | -l naccesspolicy=singlejob | --exclusive OR --shared | -x | -l exclusive | node_usage=not_shared
Memory Size | -l mem=[MB] | --mem=[mem][M|G|T] OR --mem-per-cpu=[mem][M|G|T] | -M [MB] | -l mem_free=[memory][K|M|G] | requirements=(Memory >= [MB])
Account to charge | -W group_list=[account] | --account=[account] | -P [account] | -A [account] |
Tasks Per Node | -l mppnppn [PEs_per_node] | --tasks-per-node=[count] | | (Fixed allocation_rule in PE) | tasks_per_node=[count]
CPUs Per Task | | --cpus-per-task=[count] | | |
Job Dependency | -d [job_id] | --depend=[state:job_id] | -w [done | exit | finish] | -hold_jid [job_id | job_name] |
Job Project | | --wckey=[name] | -P [name] | -P [name] |
Job host preference | | --nodelist=[nodes] AND/OR --exclude=[nodes] | -m [nodes] | -q [queue]@[node] OR -q [queue]@@[hostgroup] |
Quality Of Service | -l qos=[name] | --qos=[name] | | |
Job Arrays | -t [array_spec] | --array=[array_spec] (Slurm version 2.6+) | -J "name[array_spec]" | -t [array_spec] |
Generic Resources | -l other=[resource_spec] | --gres=[resource_spec] | | -l [resource]=[value] |
Licenses | | --licenses=[license_spec] | -R "rusage[license_spec]" | -l [license]=[count] |
Begin Time | -A "YYYY-MM-DD HH:MM:SS" | --begin=YYYY-MM-DD[THH:MM[:SS]] | -b [[year:][month:]day:]hour:minute | -a [YYMMDDhhmm] |
http://slurm.schedmd.com/documentation.html