Dec 24, 2015
Memory restriction, limits and heterogeneous grids. A case study.
Txema Heredia
Or an example of how to adapt your policies to your needs
DISCLAIMER
What I am going to present is neither a panacea nor guaranteed to fit or immediately solve your cluster's issues. This is just a brief description of the problems we faced and how we used different SGE options to handle them.
Also, no animal was harmed in the making of this powerpoint.
Our story
“hey, let’s buy a cluster”
- my boss
What did we need?
•Users:
•biologists, not programmers
•Processes:
•user-made scripts
•single core biological software
What did we NOT need?
•Nopes:
•threads / parallel programming (mostly)
•GPUs
•Ayes:
•thousands of single-core jobs
And thus, our baby was born
Our cluster
•8 computing nodes, each with:
•8 cores
•8 GB RAM
•1 front-end
Our cluster
•NFS
•Rocks cluster (CentOS)
•SGE
First steps with SGE
•1st try:
•One queue to rule them all
First steps with SGE
•1st try:
•all.q queue
•free for all
First steps with SGE
•1st try - conclusions:
•chaos reigned
•constant conflicts between users (especially time-related)
•FIFO queuing
•swapping
2nd try
•2nd try
•round-robin-like scheduling
•share tree/functional tickets
•split cluster by time usage:
•3 queues: fast / medium / slow
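These ticket policies live in the scheduler configuration. A minimal sketch of the kind of change meant here, with illustrative weights (not necessarily our real values):

# qconf -msconf   (scheduler configuration)
# give functional / share-tree tickets some weight so scheduling
# is no longer plain FIFO; the numbers below are only an example
weight_tickets_functional   10000
weight_tickets_share        10000
policy_hierarchy            OFS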
2nd try
•fast:
•2 hours / 2 nodes
•medium:
•48 hours / 3 nodes
•slow:
•∞ hours / 3 nodes
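In queue terms this is just a wall-clock limit plus a restricted host list per queue. A sketch of the fast queue (host names are hypothetical):

# qconf -mq fast
qname      fast
hostlist   compute-0-0 compute-0-1
h_rt       02:00:00
# medium: h_rt 48:00:00 on 3 nodes; slow: h_rt INFINITY on the other 3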
2nd try
•Conclusions:
•↓ chaos
•↓ user conflicts
•Still swapping
•High undersubscription of the cluster
2nd try
•3 types of jobs
•Don’t need to coexist at the same time
•1 user → 1 type of job
•User knowledge
•Saturation of the unlimited queue
2nd try
•Queue tinkering:
•wallclock time
•number of hosts
•Better results, but not good enough:
•Waiting jobs & idle nodes
2nd try
•There are 2 wars here:
•memory / swap
•splitting leads to undersubscription
The memory war
Memory
•Buy more memory
•from 8× 8 GB
•to 4× 32 GB, 3× 16 GB, 1× 8 GB
•This reduces our problem, but doesn’t fix it
Swap
•Swapping in a cluster is the root of all evil
Swap
•Complex attribute “h_vmem”
•Other resource limits: h_core, h_fsize, h_rss, h_stack
•h_rt ≠ h_cpu (wall-clock vs. CPU time)
•h_data = h_vmem
h_vmem
•h_vmem: hard limit → SIGKILL when exceeded
•s_vmem: soft limit → SIGXCPU first
•You can combine both
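For example (values are illustrative), a job can be warned before it is killed:

# soft limit warns the job (SIGXCPU), hard limit kills it (SIGKILL)
qsub -l s_vmem=3.5G,h_vmem=4G myjob.sh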
h_vmem
•Requestable by default
•We want them to be consumable
•qmon / qconf -mc
h_vmem
•requestable = YES
•consumable = YES / JOB
•default = whatever you want
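In the complex configuration (qconf -mc) that is one line per attribute. A sketch of how the h_vmem line could look (columns: name, shortcut, type, relop, requestable, consumable, default, urgency; the default shown is just an example):

#name    shortcut  type    relop  requestable  consumable  default  urgency
h_vmem   h_vmem    MEMORY  <=     YES          YES         1G       0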
h_vmem
•Only matters for parallel environment jobs:
•consumable = YES
•sge_shepherd memory limit = h_vmem × slots
•consumable = JOB
•sge_shepherd memory limit = h_vmem
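A concrete illustration (the "smp" parallel environment name is hypothetical):

# consumable = YES: this 4-slot job accounts for 4 x 2G = 8G of h_vmem on the host
# consumable = JOB: the same submission accounts for 2G in total
qsub -pe smp 4 -l h_vmem=2G job.sh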
h_vmem
•default = 100M
•“everything” dies
•default = 6G
•“everything” works
•(the default applies to jobs that don't request h_vmem)
h_vmem
•Now we can limit the memory
•But we can still have swapping
h_vmem
•Define h_vmem in each host
•qmon / qconf -me hostname
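Per host this means setting a complex_values entry roughly equal to the node's physical RAM; a sketch with a hypothetical host name:

# interactive: qconf -me compute-0-0, then edit the complex_values line
# non-interactive equivalent:
qconf -rattr exechost complex_values h_vmem=32G compute-0-0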
h_vmem
•Exact physical memory:
•safer
•A bigger value:
•more scheduling margin
Memory
•From now on, any job submission must contain a memory request:
•qsub ... -l h_vmem=3G ...
No more swapping!!
Undersubscription
•Dual restriction:
•8 jobs/slots per node
•32 / 16 / 8 GB mem per node
•The minimum of both will apply
Stupid scheduling
[Diagram: with naive placement, the 8 GB node ends up with 7 slots free but 0 GB of memory free, while the 32 GB node has 0 slots free but 24 GB of memory free; free slots and free memory are stranded on different nodes, so waiting jobs cannot run.]
Smart scheduling
[Diagram: with memory-aware placement, one node is fully packed (0 slots free, 0 GB free) while the other still has 7 slots and 24 GB free together, spare capacity that can actually accept more jobs.]
Smart scheduling
•We want each job to go to the node where it fits best.
(another) DISCLAIMER
This is strictly for our case and our needs. It may appeal to you, or some of the ideas may inspire you, but it is not intended to be a step-by-step solution for everyone. It is just an example of “things that can be done”.
Smart scheduling
•Create 3 hostgroups:
•@32G, @16G and @8G
•Group nodes by memory
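Hostgroups are created with qconf; a sketch (host names are hypothetical):

# qconf -ahgrp @32G, then fill in the two fields:
group_name  @32G
hostlist    compute-0-0 compute-0-1 compute-0-2 compute-0-3
# repeat for @16G and @8G with their nodes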
Smart scheduling
•Maximize the memory/core ratio (with 8 slots per node: 1, 2 or 4 GB per core):
•job < 1 GB → 8 GB nodes
•1 GB < job < 2 GB → 16 GB nodes
•2 GB < job → 32 GB nodes
Smart scheduling
•3 different queues:
•all-32
•all-16
•all-8
•assign the corresponding hostgroup
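A sketch of one of the queues bound to its hostgroup:

# qconf -aq all-32   (qconf -mq all-32 to edit it later)
qname     all-32
hostlist  @32G
slots     8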
Smart scheduling
•Same problem as before:
•Oversubscription of one queue
•Undersubscription of other queues
Sequence Numbers
Smart scheduling
•Preference for a given hostgroup
Smart scheduling
•all-32:
•@32G > @16G > @8G
•all-16:
•@16G > @32G > @8G
•all-8:
•@8G > @16G > @32G
Smart scheduling
•qmon → queue configuration → general configuration → Sequence Nr
•qconf -mq queuename
Smart scheduling
•all-32 queue:
•@32G → Seq Nr = 0
•@16G → Seq Nr = 1
•@8G → Seq Nr = 2
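In the queue configuration this can be expressed with per-hostgroup overrides of seq_no, plus queue_sort_method seqno in the scheduler configuration; a sketch for all-32:

# qconf -mq all-32
qname     all-32
hostlist  @32G @16G @8G
seq_no    0,[@32G=0],[@16G=1],[@8G=2]
# and in qconf -msconf:  queue_sort_method  seqno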
[Diagram: waiting jobs are tried against the 32 GB, 16 GB and 8 GB hostgroups in the queue's sequence-number order until one accepts them.]
Are we done?
Qsub wrapper
•Users already choose the memory
•Why ask for a queue?
•We can let the system do it
Qsub wrapper
•Wrapper script around qsub
•parse the parameters searching for queue or memory requests
if ( no memory ) { memory = default }
if ( no queues ) {
if (memory < 1Gb) { queue = all-8 }
if (1Gb < memory < 2Gb) { queue = all-16 }
if (2Gb < memory ) { queue = all-32 }
}
qsub -q $queue parameters
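A minimal bash sketch of such a wrapper; the qsub path, the default request, the thresholds and the deliberately simplified option parsing are all assumptions, not our exact script:

#!/bin/bash
# hypothetical qsub wrapper: choose the queue from the h_vmem request
REAL_QSUB=/opt/gridengine/bin/lx-amd64/qsub    # path is an assumption

args=("$@"); mem_mb=""; queue=""
for ((i = 0; i < ${#args[@]}; i++)); do
  case "${args[$i]}" in
    -q) queue="${args[$((i+1))]}" ;;
    -l) if [[ "${args[$((i+1))]}" =~ h_vmem=([0-9]+)([MG]) ]]; then
          mem_mb=${BASH_REMATCH[1]}
          [[ ${BASH_REMATCH[2]} == G ]] && mem_mb=$((mem_mb * 1024))
        fi ;;
  esac
done

# no memory request: add a default one
if [[ -z "$mem_mb" ]]; then
  mem_mb=1024
  args+=(-l h_vmem=1G)
fi

# no queue request: pick it from the memory request
if [[ -z "$queue" ]]; then
  if   (( mem_mb <= 1024 )); then queue=all-8
  elif (( mem_mb <= 2048 )); then queue=all-16
  else                            queue=all-32
  fi
  args=(-q "$queue" "${args[@]}")
fi

exec "$REAL_QSUB" "${args[@]}"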
Qsub wrapper
•You can add whatever you need
Qsub wrapper
•“home-made” parameters
•--slow / --fast
•allow access to two kinds of special nodes
•instead of
•-q all-16@compute-1-*
Qsub wrapper
•One queue to rule them all
•but...
•No swap!!!
•No undersubscription!!!
Now the icing
Punishment
•The system relies on “good behaviour”
•Teach users how to use it
•Prevent & “punish” bad usage
Punishment
•epilog script
•runs when the job finishes
•global: qconf -mconf
•or by queue
•/opt/gridengine/default/common/sge_epilog.sh
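Hooking the script in is a one-line setting; for the global configuration:

# qconf -mconf   (global cluster configuration)
epilog  /opt/gridengine/default/common/sge_epilog.sh
# or per queue: the epilog field in qconf -mq <queuename>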
Punishment
•Check memory
•requested
•maxvmem
•log
•send an email
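A rough sketch of the idea, assuming qstat -j still reports the job (and its maxvmem) while the epilog runs; otherwise the accounting file / qacct can be parsed instead. The log path and mail text are illustrative:

#!/bin/bash
# hypothetical epilog sketch: compare requested h_vmem with the peak usage
info=$(qstat -j "$JOB_ID" 2>/dev/null)
requested=$(echo "$info" | sed -n 's/.*h_vmem=\([^, ]*\).*/\1/p' | head -1)
maxvmem=$(echo "$info"   | sed -n 's/.*maxvmem=\([^, ]*\).*/\1/p' | head -1)
user=$(echo "$info"      | awk '/^owner:/ {print $2}')

echo "$(date +%F) $user $JOB_ID req=$requested max=$maxvmem" >> /var/log/sge_mem.log

if [ -z "$requested" ]; then
  echo "Job $JOB_ID ran without -l h_vmem; please request memory next time." \
    | mail -s "SGE memory request" "$user"
fi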
Punishment
•no memory requested
•email teaches how to request it properly
•too much memory requested
•email tells and advises
•reasonable memory
•no email
Punishment
•epilog writes a logfile
•a cron process “punishes” or “rewards” users according to the last day's memory usage
Punishment
•Modify user’s shared ticket policy
•For each “bad” job:
•-10 tickets
•For each “good” job:
•+5 tickets
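One way to implement this is to adjust the user's functional share (fshare) from the cron job; a sketch, where the helper script, its arguments and the clamping are assumptions:

#!/bin/bash
# hypothetical helper: bump a user's functional share up or down
user=$1; delta=$2                        # e.g. ./adjust_share.sh jdoe -10
cur=$(qconf -suser "$user" | awk '$1 == "fshare" {print $2}')
new=$(( cur + delta )); (( new < 0 )) && new=0
qconf -suser "$user" | sed "s/^fshare .*/fshare $new/" > /tmp/$user.user
qconf -Muser /tmp/$user.user             # reload the modified user object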
Punishment
•“bad users”
•delayed scheduling
•“good users”
•more priority
Control other resources
Shared disk
•NFS shared disk
•avoid filling it
•suspend all jobs before it's too late
Shared disk
•New complex attribute: scratch_pct
•type = INT
•operation >=
•requestable = NO
•consumable = NO
•default = 0
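Same qconf -mc table as before, one extra line (sketch):

#name        shortcut     type  relop  requestable  consumable  default  urgency
scratch_pct  scratch_pct  INT   >=     NO           NO          0        0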
Shared disk
•Load Report
• /opt/gridengine/default/common/sge_load_report.sh
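The script is registered as a load sensor in the host or global configuration, for example:

# qconf -mconf   (or qconf -mconf <hostname> for a single host)
load_sensor  /opt/gridengine/default/common/sge_load_report.sh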
Shared disk
#!/bin/bash
# load sensor: report this host's scratch usage (%) each time execd asks
myhost=$(hostname)
while read -r line; do                 # execd writes a line per report, "quit" at shutdown
  [ "$line" = "quit" ] && exit 0
  scratch=$(df | grep scratch | awk '{print $4}' | grep % | sed 's/%//g')
  echo begin
  echo "$myhost:scratch_pct:$scratch"
  echo end
done
Shared disk
•whenever the disk gets to 97%
•all jobs freeze
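One way to get this behaviour is a suspend threshold on the queue(s), using the scratch_pct load value reported by the sensor (sketch):

# qconf -mq all-32   (and likewise for the other queues)
suspend_thresholds   scratch_pct=97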
Conclusions
•Combining SGE options gives access to much more powerful configurations
Questions?
Special thanks:
•Angel Carreño
•Carles Perarnau
•Marc Esteve
•Jordi Rambla
•Arcadi Navarro