Dec 24, 2015
Memory restriction, limits and heterogeneous grids. A case study.
Txema Heredia
Or an example of how to adapt your policies to your needs
DISCLAIMER
What I am going to present is neither a panacea nor guaranteed to fit or immediately solve your cluster's issues. This is just a brief description of the problems we faced and how we used different SGE options to handle them.
Also, no animal was harmed in the making of this powerpoint.
Our story
“hey, let’s buy a cluster”
- my boss
What did we need?
•Users:
•biologists, not programmers
•Processes:
•user-made scripts
•single core biological software
What did we NOT need?
•Nopes:
•threads / parallel programming (mostly)
•GPUs
•Ayes:
•thousands of single-core jobs
And thus, our baby was born
Our cluster
•8 computing nodes, each with:
•8 cores
•8 GB RAM
•1 front-end
Our cluster
•NFS
•Rocks cluster (CentOS)
•SGE
First steps with SGE
•1st try:
•One queue to rule them all
First steps with SGE
•1st try:
•all.q queue
•free for all
First steps with SGE
•1st try - conclusions:
•chaos reigned
•constant conflicts between users (especially time-related)
•FIFO queuing
•swapping
2nd try
•2nd try
•round-robin-like scheduling
•share tree/functional tickets
•split cluster by time usage:
•3 queues: fast / medium / slow
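These ticket policies live in the scheduler configuration. A minimal sketch of the kind of change meant here, with illustrative weights (not necessarily our real values):

# qconf -msconf   (scheduler configuration)
# give functional / share-tree tickets some weight so scheduling
# is no longer plain FIFO; the numbers below are only an example
weight_tickets_functional   10000
weight_tickets_share        10000
policy_hierarchy            OFS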
2nd try
•fast:
•2 hours / 2 nodes
•medium:
•48 hours / 3 nodes
•slow:
•∞ hours / 3 nodes
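In queue terms this is just a wall-clock limit plus a restricted host list per queue. A sketch of the fast queue (host names are hypothetical):

# qconf -mq fast
qname      fast
hostlist   compute-0-0 compute-0-1
h_rt       02:00:00
# medium: h_rt 48:00:00 on 3 nodes; slow: h_rt INFINITY on the other 3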
2nd try
•Conclusions:
•↓ chaos
•↓ user conflicts
•Still swapping
•High undersubscription of the cluster
2nd try
•3 types of jobs
•Don’t need to coexist at the same time
•1 user → 1 type of job
•User knowledge
•Saturation of the unlimited queue
2nd try
•Queue tinkering:
•wallclock time
•number of hosts
•Better results, but not good enough:
•Waiting jobs & idle nodes
2nd try
•There are 2 wars here:
•memory / swap
•splitting leads to undersubscription
The memory war
Memory
•Buy more memory
•from 8× 8 GB
•to 4× 32 GB, 3× 16 GB, 1× 8 GB
•This reduces our problem, but doesn’t fix it
Swap
•Swapping in a cluster is the root of all evil
Swap
•Complex attribute “h_vmem”
•Other resource limits: h_core, h_fsize, h_rss, h_stack
•h_rt ≠ h_cpu (wall-clock vs. CPU time)
•h_data = h_vmem
h_vmem
•h_vmem: hard limit → SIGKILL when exceeded
•s_vmem: soft limit → SIGXCPU first
•You can combine both
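For example (values are illustrative), a job can be warned before it is killed:

# soft limit warns the job (SIGXCPU), hard limit kills it (SIGKILL)
qsub -l s_vmem=3.5G,h_vmem=4G myjob.sh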
h_vmem
•Requestable by default
•We want them to be consumable
•qmon / qconf -mc
h_vmem
•requestable = YES
•consumable = YES / JOB
•default = whatever you want
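In the complex configuration (qconf -mc) that is one line per attribute. A sketch of how the h_vmem line could look (columns: name, shortcut, type, relop, requestable, consumable, default, urgency; the default shown is just an example):

#name    shortcut  type    relop  requestable  consumable  default  urgency
h_vmem   h_vmem    MEMORY  <=     YES          YES         1G       0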
h_vmem
•Only matters for parallel environment jobs:
•consumable = YES
•sge_shepherd memory limit = h_vmem × slots
•consumable = JOB
•sge_shepherd memory limit = h_vmem
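A concrete illustration (the "smp" parallel environment name is hypothetical):

# consumable = YES: this 4-slot job accounts for 4 x 2G = 8G of h_vmem on the host
# consumable = JOB: the same submission accounts for 2G in total
qsub -pe smp 4 -l h_vmem=2G job.sh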
h_vmem
•default = 100M
•“everything” dies
•default = 6G
•“everything” works
•(the default applies to jobs that don't request h_vmem)
h_vmem
•Now we can limit the memory
•But we can still have swapping
h_vmem
•Define h_vmem in each host
•qmon / qconf -me hostname
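Per host this means setting a complex_values entry roughly equal to the node's physical RAM; a sketch with a hypothetical host name:

# interactive: qconf -me compute-0-0, then edit the complex_values line
# non-interactive equivalent:
qconf -rattr exechost complex_values h_vmem=32G compute-0-0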
h_vmem
•Exact physical memory:
•safer
•A bigger value:
•more scheduling margin
Memory
•From now on, any job submission must contain a memory request:
•qsub ... -l h_vmem=3G ...
No more swapping!!
Undersubscription
•Dual restriction:
•8 jobs/slots per node
•32 / 16 / 8 GB mem per node
•The minimum of both will apply
Stupid scheduling
[Diagram: with naive placement, the 8 GB node ends up with 7 slots free but 0 GB of memory free, while the 32 GB node has 0 slots free but 24 GB of memory free; free slots and free memory are stranded on different nodes, so waiting jobs cannot run.]
Smart scheduling
[Diagram: with memory-aware placement, one node is fully packed (0 slots free, 0 GB free) while the other still has 7 slots and 24 GB free together, spare capacity that can actually accept more jobs.]
Smart scheduling
•We want each job to go to the node where it fits best.
(another) DISCLAIMER
This is strictly for our case and our needs. It may appeal to you, or some of the ideas may inspire you, but it is not intended to be a step-by-step solution for everyone. It is just an example of “things that can be done”.
Smart scheduling
•Create 3 hostgroups:
•@32G, @16G and @8G
•Group nodes by memory
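Hostgroups are created with qconf; a sketch (host names are hypothetical):

# qconf -ahgrp @32G, then fill in the two fields:
group_name  @32G
hostlist    compute-0-0 compute-0-1 compute-0-2 compute-0-3
# repeat for @16G and @8G with their nodes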
Smart scheduling
•Maximize the memory/core ratio (with 8 slots per node: 1, 2 or 4 GB per core):
•job < 1 GB → 8 GB nodes
•1 GB < job < 2 GB → 16 GB nodes
•2 GB < job → 32 GB nodes
Smart scheduling
•3 different queues:
•all-32
•all-16
•all-8
•assign the corresponding hostgroup
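A sketch of one of the queues bound to its hostgroup:

# qconf -aq all-32   (qconf -mq all-32 to edit it later)
qname     all-32
hostlist  @32G
slots     8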
Smart scheduling
•Same problem as before:
•Oversubscription of one queue
•Undersubscription of other queues
Sequence Numbers
Smart scheduling
•Preference for a given hostgroup
Smart scheduling
•all-32:
•@32G > @16G > @8G
•all-16:
•@16G > @32G > @8G
•all-8:
•@8G > @16G > @32G
Smart scheduling
•qmon → queue configuration → general configuration → Sequence Nr
•qconf -mq queuename
Smart scheduling
•all-32 queue:
•@32G → Seq Nr = 0
•@16G → Seq Nr = 1
•@8G → Seq Nr = 2
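In the queue configuration this can be expressed with per-hostgroup overrides of seq_no, plus queue_sort_method seqno in the scheduler configuration; a sketch for all-32:

# qconf -mq all-32
qname     all-32
hostlist  @32G @16G @8G
seq_no    0,[@32G=0],[@16G=1],[@8G=2]
# and in qconf -msconf:  queue_sort_method  seqno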
[Diagram: waiting jobs are tried against the 32 GB, 16 GB and 8 GB hostgroups in the queue's sequence-number order until one accepts them.]
Are we done?
Qsub wrapper
•Users already choose the memory
•Why ask for a queue?
•We can let the system do it
Qsub wrapper
•Wrapper script around qsub
•parse the parameters searching for queue or memory requests
if ( no memory ) { memory = default }
if ( no queues ) {
if (memory < 1Gb) { queue = all-8 }
if (1Gb < memory < 2Gb) { queue = all-16 }
if (2Gb < memory ) { queue = all-32 }
}
qsub -q $queue parameters
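A minimal bash sketch of such a wrapper; the qsub path, the default request, the thresholds and the deliberately simplified option parsing are all assumptions, not our exact script:

#!/bin/bash
# hypothetical qsub wrapper: choose the queue from the h_vmem request
REAL_QSUB=/opt/gridengine/bin/lx-amd64/qsub    # path is an assumption

args=("$@"); mem_mb=""; queue=""
for ((i = 0; i < ${#args[@]}; i++)); do
  case "${args[$i]}" in
    -q) queue="${args[$((i+1))]}" ;;
    -l) if [[ "${args[$((i+1))]}" =~ h_vmem=([0-9]+)([MG]) ]]; then
          mem_mb=${BASH_REMATCH[1]}
          [[ ${BASH_REMATCH[2]} == G ]] && mem_mb=$((mem_mb * 1024))
        fi ;;
  esac
done

# no memory request: add a default one
if [[ -z "$mem_mb" ]]; then
  mem_mb=1024
  args+=(-l h_vmem=1G)
fi

# no queue request: pick it from the memory request
if [[ -z "$queue" ]]; then
  if   (( mem_mb <= 1024 )); then queue=all-8
  elif (( mem_mb <= 2048 )); then queue=all-16
  else                            queue=all-32
  fi
  args=(-q "$queue" "${args[@]}")
fi

exec "$REAL_QSUB" "${args[@]}"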
Qsub wrapper
•You can add whatever you need
Qsub wrapper
•“home-made” parameters
•--slow / --fast
•allow access to two kinds of special nodes
•instead of
•-q all-16@compute-1-*
Qsub wrapper
•One queue to rule them all
•but...
•No swap!!!
•No undersubscription!!!
Now the icing
Punishment
•The system relies on “good behaviour”
•Teach users how to use it
•Prevent & “punish” bad usage
Punishment
•epilog script
•runs when the job finishes
•global: qconf -mconf
•or by queue
•/opt/gridengine/default/common/sge_epilog.sh
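Hooking the script in is a one-line setting; for the global configuration:

# qconf -mconf   (global cluster configuration)
epilog  /opt/gridengine/default/common/sge_epilog.sh
# or per queue: the epilog field in qconf -mq <queuename>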
Punishment
•Check memory
•requested
•maxvmem
•log
•send an email
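A rough sketch of the idea, assuming qstat -j still reports the job (and its maxvmem) while the epilog runs; otherwise the accounting file / qacct can be parsed instead. The log path and mail text are illustrative:

#!/bin/bash
# hypothetical epilog sketch: compare requested h_vmem with the peak usage
info=$(qstat -j "$JOB_ID" 2>/dev/null)
requested=$(echo "$info" | sed -n 's/.*h_vmem=\([^, ]*\).*/\1/p' | head -1)
maxvmem=$(echo "$info"   | sed -n 's/.*maxvmem=\([^, ]*\).*/\1/p' | head -1)
user=$(echo "$info"      | awk '/^owner:/ {print $2}')

echo "$(date +%F) $user $JOB_ID req=$requested max=$maxvmem" >> /var/log/sge_mem.log

if [ -z "$requested" ]; then
  echo "Job $JOB_ID ran without -l h_vmem; please request memory next time." \
    | mail -s "SGE memory request" "$user"
fi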
Punishment
•no memory requested
•email teaches how to request it properly
•too much memory requested
•email tells and advises
•reasonable memory
•no email
Punishment
•epilog writes a logfile
•a cron process “punishes” or “rewards” users according to the last day's memory usage
Punishment
•Modify user’s shared ticket policy
•For each “bad” job:
•-10 tickets
•For each “good” job:
•+5 tickets
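One way to implement this is to adjust the user's functional share (fshare) from the cron job; a sketch, where the helper script, its arguments and the clamping are assumptions:

#!/bin/bash
# hypothetical helper: bump a user's functional share up or down
user=$1; delta=$2                        # e.g. ./adjust_share.sh jdoe -10
cur=$(qconf -suser "$user" | awk '$1 == "fshare" {print $2}')
new=$(( cur + delta )); (( new < 0 )) && new=0
qconf -suser "$user" | sed "s/^fshare .*/fshare $new/" > /tmp/$user.user
qconf -Muser /tmp/$user.user             # reload the modified user object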
Punishment
•“bad users”
•delayed scheduling
•“good users”
•more priority
Control other resources
Shared disk
•NFS shared disk
•avoid filling it
•suspend all jobs before it's too late
Shared disk
•New complex attribute: scratch_pct
•type = INT
•operation >=
•requestable = NO
•consumable = NO
•default = 0
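Same qconf -mc table as before, one extra line (sketch):

#name        shortcut     type  relop  requestable  consumable  default  urgency
scratch_pct  scratch_pct  INT   >=     NO           NO          0        0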
Shared disk
•Load Report
• /opt/gridengine/default/common/sge_load_report.sh
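The script is registered as a load sensor in the host or global configuration, for example:

# qconf -mconf   (or qconf -mconf <hostname> for a single host)
load_sensor  /opt/gridengine/default/common/sge_load_report.sh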
Shared disk
#!/bin/bash
# load sensor: report this host's scratch usage (%) each time execd asks
myhost=$(hostname)
while read -r line; do                 # execd writes a line per report, "quit" at shutdown
  [ "$line" = "quit" ] && exit 0
  scratch=$(df | grep scratch | awk '{print $4}' | grep % | sed 's/%//g')
  echo begin
  echo "$myhost:scratch_pct:$scratch"
  echo end
done
Shared disk
•whenever the disk gets to 97%
•all jobs freeze
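One way to get this behaviour is a suspend threshold on the queue(s), using the scratch_pct load value reported by the sensor (sketch):

# qconf -mq all-32   (and likewise for the other queues)
suspend_thresholds   scratch_pct=97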
Conclusions
•Combining SGE options gives access to much more powerful configurations
Questions?
Special thanks:
•Angel Carreño
•Carles Perarnau
•Marc Esteve
•Jordi Rambla
•Arcadi Navarro