Page 1: Introduction to Abel and SLURM

Introduction to Abel and SLURM

Sabry Razick(Slides by: Katerina Michalickova)

The Research Computing Services Group, USIT

November 02, 2015

Page 2: Introduction to Abel and SLURM

Topics

• The Research Computing Services group

• Abel details

• Getting an account & Logging in

• Copying files

• Running a simple job

• Queuing system

• Job administration

• User administration

• Parallel jobs

Page 3: Introduction to Abel and SLURM

The Research Computing Services (Seksjon for IT i Forskning)

• The RCS group provides access to IT resources and high-performance computing to researchers at UiO and to NOTUR users

• http://www.uio.no/english/services/it/research/

• Part of USIT

• Contact: [email protected]

Page 4: Introduction to Abel and SLURM

The Research Computing Services

• Operation of Abel - a computer cluster

• User support

• Data storage

• Secure data analysis and storage - TSD

• Portals

• Lifeportal (https://lifeportal.uio.no/)

• Geo (https://geoportal-dev.hpc.uio.no)

• Language (https://lap.hpc.uio.no)

Page 5: Introduction to Abel and SLURM

Abel

• Computer cluster

– Similar computers connected by a local area network (LAN); different from a cloud or a grid

• Enables parallel computing

• Science presents many problems of a parallel nature

– Sequence database searches

– Genome assembly and annotation

– Simulations

Page 6: Introduction to Abel and SLURM

We expect that users know what they are doing!

Page 7: Introduction to Abel and SLURM

Bunch of computers -> cluster

• Hardware

– Powerful computers (nodes)

– High-speed connections between nodes

– Access to a common file system

• Software

– Operating system: Linux, 64-bit CentOS 6 (based on the Rocks Cluster Distribution)

– Identical mass installations

– A queuing system enables timely execution of many concurrent processes

Page 8: Introduction to Abel and SLURM

Abel in numbers

• Nodes - 650+

• Cores - 10000+

• Total memory - 40 TiB (tebibytes)

• Total storage - 400 TiB using BeeGFS

• Ranked 96th most powerful in 2012, now 444th (June 2015)

Page 9: Introduction to Abel and SLURM

Accessing Abel

• If you are working or studying at UiO, you can have an Abel account directly from us.

• If you are a Norwegian scientist (or need large resources), you can apply through NOTUR – http://www.notur.no

• Write to us for information:

[email protected]

• Read about getting access: http://www.uio.no/hpc/abel/help/access

Page 10: Introduction to Abel and SLURM

Connecting to Abel

• Linux

– Ubuntu

– Red Hat (RHEL)

• Windows - using PuTTY and WinSCP

• Mac OS

Page 11: Introduction to Abel and SLURM

PuTTY http://www.putty.org/

Page 12: Introduction to Abel and SLURM

File upload/download - WinSCP

• http://winscp.net/eng/download.php

Page 13: Introduction to Abel and SLURM

[Screenshot only]

Page 14: Introduction to Abel and SLURM

File transfer on command line

• Unix users can use the secure copy (scp) or rsync commands

– Copy myfile.txt from the current directory on your machine to your home area on Abel:

scp myfile.txt [email protected]:~

– For large files, use the rsync command:

rsync -z myfile.tar [email protected]:~

Page 15: Introduction to Abel and SLURM

Software on Abel

• Available on Abel:

http://www.uio.no/hpc/abel/help/software

• Software on Abel is organized in modules.

– List all software (and versions) organized in modules:

module avail

– Load software from a module:

module load module_name

• If you cannot find what you are looking for, ask us
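A typical module session might look like this (the module name blast is only an illustration; run module avail to see what is actually installed):

module avail                # list all available modules
module avail blast          # search for modules matching "blast"
module load blast           # load the default version of the module
module list                 # show the currently loaded modules
module unload blast         # unload the module again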

Page 16: Introduction to Abel and SLURM

Your own software

• You can install your own software in your home area

• This time we have a special lecture on this topic by an expert (Bjørn-Helge Mevik) - November 03, 14:15

Page 17: Introduction to Abel and SLURM

Using Abel

• When you log into Abel, you land on one of the login nodes, login0 or login1.

• It is not allowed to execute programs (jobs) directly on the login nodes.

• Jobs are submitted to Abel via the queuing system.

• The login nodes are just for logging in, copying files, editing, compiling, running short tests (no more than a couple of minutes), submitting jobs, checking job status, etc.

• For interactive execution use qlogin.

Page 18: Introduction to Abel and SLURM

SLURM

● Simple Linux Utility for Resource Management (a workload manager)

● Allocates exclusive and/or non-exclusive access to resources (compute nodes) to users for some duration of time

● Provides a framework for starting, executing, and monitoring work on a set of allocated nodes

● Manages a queue of pending work

Page 19: Introduction to Abel and SLURM

Computing on Abel

• Submit a job to the queuing system

– Software that executes jobs on available resources on the cluster (and much more)

• Communicate with the queuing system using a shell (or job) script

• Retrieve results (or errors) when the job is done

• Read tutorial: http://www.uio.no/hpc/abel/help/user-guide

Page 20: Introduction to Abel and SLURM

Job script

• A job script is a shell script containing the commands that one needs to execute

• Extra comments read by the queuing system – “#SBATCH --xxxx”

• Compulsory values:

#SBATCH --account

#SBATCH --time

#SBATCH --mem-per-cpu

• Setting up a job environment

source /cluster/bin/jobsetup

Page 21: Introduction to Abel and SLURM

Project/Account

• Each user belongs to a project on Abel

• Each project has a set of resources

• Learn about your project(s):

– Use: projects

Page 22: Introduction to Abel and SLURM

Minimal job script

#!/bin/bash

## Job name:
#SBATCH --job-name=jobname

## Project:
#SBATCH --account=your_account

## Wall time:
#SBATCH --time=hh:mm:ss

## Max memory:
#SBATCH --mem-per-cpu=max_size_in_memory

## Set up environment:
source /cluster/bin/jobsetup

## Run command:
./executable > outfile

Page 23: Introduction to Abel and SLURM

Example job script

#!/bin/bash

## Resources:
#SBATCH --job-name=RCS1115_hello
#SBATCH --account=xxx
#SBATCH --time=00:01:05
#SBATCH --mem-per-cpu=512M

## Setup:
source /cluster/bin/jobsetup
set -o errexit

## Job:
sleep 1m
python hello.py

Page 24: Introduction to Abel and SLURM

Submitting a job - sbatch

[Screenshot: a job script is submitted with sbatch and the queuing system replies with the job ID]
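In practice, submission looks like this (the script name and job ID are illustrative):

sbatch jobscript.sh
Submitted batch job 1111231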

Page 25: Introduction to Abel and SLURM

Checking a job - squeue

• squeue -u <USER_NAME>
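The output shows one line per job; the values below are made up for illustration (ST = R means the job is running):

JOBID PARTITION     NAME   USER ST        TIME  NODES NODELIST(REASON)
1111231    long mitehunt   xxxx  R 15-16:25:15      1 c18-34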

Page 26: Introduction to Abel and SLURM

scontrol show job 1111231

JobId=1111231 Name=mitehunter

UserId=XXXX(100379) GroupId=users(100) Priority=21479 Nice=0 Account=nn9244k QOS=notur

JobState=RUNNING Reason=None Dependency=(null)

Requeue=1 Restarts=0 BatchFlag=1 ExitCode=0:0

RunTime=15-16:25:15 TimeLimit=21-00:00:00 TimeMin=N/A

SubmitTime=2015-10-12T20:43:14 EligibleTime=2015-10-12T20:43:14

StartTime=2015-10-12T20:51:26 EndTime=2015-11-02T19:51:27

PreemptTime=None SuspendTime=None SecsPreSuspend=0

Partition=long AllocNode:Sid=login-0-2:14827

ReqNodeList=(null) ExcNodeList=(null) NodeList=c18-34 BatchHost=c18-34

NumNodes=1 NumCPUs=4 CPUs/Task=4 ReqB:S:C:T=0:0:*:*

Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=0

MinCPUsNode=4 MinMemoryCPU=3900M MinTmpDiskNode=0

Features=(null) Gres=(null) Reservation=(null)

Shared=0 Contiguous=0 Licenses=(null) Network=(null)

Command=/work/users/xxxxx/gadMor2/mite_hunter_1.0/mite_hunter.slurm

WorkDir=/work/users/xxxxx/gadMor2/mite_hunter_1.0

StdErr=/work/users/xxxxx/../slurm-12780231.out StdIn=/dev/null

StdOut=/work/users/XXX/gadMor2/mite_hunter_1.0/slurm-12780231.out

Page 27: Introduction to Abel and SLURM

Use of the SCRATCH area

#!/bin/sh
#SBATCH --job-name=YourJobname
#SBATCH --account=YourProject
#SBATCH --time=hh:mm:ss
#SBATCH --mem-per-cpu=max_size_in_memory

source /cluster/bin/jobsetup

## Copy files to the work directory:
cp $SUBMITDIR/YourData $SCRATCH

## Mark outfiles for automatic copying to $SUBMITDIR:
chkfile YourOutput

## Run command:
cd $SCRATCH
executable YourData > YourOutput

Page 28: Introduction to Abel and SLURM

Some useful commands

• scancel <JOBID> - Cancel a job before it ends

• dusage - find out your disk usage

• squeue - list all queued jobs and find out the position of your job

• squeue -t R | more - page through the running jobs only

Page 29: Introduction to Abel and SLURM

qlogin

• Reserves some resources for a given time

• Example - reserve one node (or 16 cores) on Abel for your interactive use for 1 hour:

qlogin --account=your_project --ntasks-per-node=16 --time=01:00:00

• Run “source /cluster/bin/jobsetup“ after receiving the allocation

http://www.uio.no/english/services/it/research/hpc/abel/help/user-guide/interactive-logins.html

Page 30: Introduction to Abel and SLURM

Interactive use of Abel - qlogin

• Send a request for a resource (e.g. 4 cores)

• Work on command line when a node becomes available

• Example - book one node (or 16 cores) on Abel for your interactive use for 1 hour:

qlogin --account=your_project --ntasks-per-node=16 --time=01:00:00

• Run “source /cluster/bin/jobsetup“ after receiving the allocation

• For more info, see:

http://www.uio.no/hpc/abel/help/user-guide/interactive-logins.html

Page 31: Introduction to Abel and SLURM

Interactive use of Abel - qlogin

[Screenshot: an interactive qlogin session on a compute node]
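A session might look roughly like this (a sketch; the exact allocation messages vary between SLURM versions):

$ qlogin --account=your_project --ntasks-per-node=16 --time=01:00:00
salloc: Granted job allocation 1111232
$ source /cluster/bin/jobsetup
$ ./executable        # now runs on the allocated compute node
$ exit                # release the allocation when done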

Page 32: Introduction to Abel and SLURM

Environment variables

• SLURM_JOBID – the job ID of the job

• SCRATCH – the name of the job-specific scratch area

• SLURM_NPROCS – total number of CPUs requested

• SLURM_CPUS_ON_NODE – number of CPUs allocated on the node

• SUBMITDIR – the directory where sbatch was issued

• TASK_ID – task number (for arrayrun jobs)
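For example, a job script can use these variables to stage and label its work (a minimal sketch; input.dat is a hypothetical file name):

echo "Job $SLURM_JOBID is using $SLURM_NPROCS CPUs"
cp $SUBMITDIR/input.dat $SCRATCH    # stage a hypothetical input file
cd $SCRATCH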

Page 33: Introduction to Abel and SLURM

Arrayrun

● Parallel jobs - executing many instances of the same executable at the same time

● Many input datasets

● Simulations with different input parameters

● It is also possible to split a large input file into chunks and parallelize your job (see the sketch below)
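A minimal sketch of the chunked approach, assuming arrayrun takes a task range and a worker script (the file names are illustrative; check the Abel user guide for the exact arrayrun syntax). Each instance of the worker reads TASK_ID to pick its own chunk:

## worker.sh - each instance processes chunk number $TASK_ID:
#!/bin/bash
#SBATCH --account=your_account
#SBATCH --time=01:00:00
#SBATCH --mem-per-cpu=512M
source /cluster/bin/jobsetup
./executable input.$TASK_ID > output.$TASK_ID

## Launch 100 instances, with TASK_ID running from 1 to 100:
arrayrun 1-100 worker.sh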

Page 34: Introduction to Abel and SLURM

MPI

● Message Passing Interface

● MPI is a language-independent communications protocol used for programming parallel computers

● We support Open MPI

○ module load openmpi

● Jobs specifying more than one node automatically get

○ #SBATCH --constraint=ib
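A minimal MPI job script might look like this (a sketch; mpi_program and the resource numbers are illustrative; with Open MPI built against SLURM, mpirun picks up the allocation automatically):

#!/bin/bash
#SBATCH --job-name=mpi_test
#SBATCH --account=your_account
#SBATCH --time=00:30:00
#SBATCH --mem-per-cpu=512M
#SBATCH --ntasks=32        ## 32 MPI ranks, possibly spread over several nodes

source /cluster/bin/jobsetup
module load openmpi

mpirun ./mpi_program > mpi_output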

Page 35: Introduction to Abel and SLURM

Thank you.

[email protected]

http://www.uio.no/english/services/it/research/hpc/abel/