Queueing Systems Configuration vs. Resource Management

POZNAŃ SUPERCOMPUTING AND NETWORKING


Mirosław Kupczyk



Agenda

• LSF

• NQE


LSF - overview

• Manage Networked Resources

• Run Jobs

• Manage Applications

• Control Access to System Resources

• Resource and Job Accounting

• Fault Tolerance

• Support for Heterogeneous Systems

• Checkpointing and Migration

• Parallel Processing


LSF - overview

• Scheduling Policies

• Resource Reservation

• Job Accounting

• Job Arrays

• Interactive Batch Jobs

• Scalability

• Shared Resources

• Parallel Job Processor Reservation

• Job Starters, Pre- and Post-Execution Scripts

• Configurable Job Control Actions

• LSF SNMP Agent

• Command Interpreter


LSF Suite Products

• LSF Batch

• LSF JobScheduler

• LSF Analyzer

• LSF Parallel

• LSF MultiCluster

• LSF Make


LSF Architecture


Structure of LSF Batch


Concepts

Jobs:

• job ID

• job name

• task or interactive task

• interactive batch job

• job report

• job output

• job errors

• place a job

• dispatch a job

• job states:

• PEND--Waiting to be scheduled

• RUN--Running

• DONE--Finished with zero exit value

• EXIT--Finished with non-zero exit value

• PSUSP--Pending Suspended

• SSUSP--Suspended by LSF

• USUSP--Suspended by user

• POST_DONE--Post-processing completed without errors

• POST_ERR--Post-processing completed with errors
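
As a rough illustration of the state list above, the following Python sketch models the states as an enumeration. The names `JobState`, `SUSPENDED` and `final_state` are made up for this example and are not part of LSF. The non-zero-exit state is written EXIT, the name the slides use when they describe abnormal termination.

```python
from enum import Enum


class JobState(Enum):
    """Job states as listed on the slide (descriptions paraphrased)."""
    PEND = "waiting to be scheduled"
    RUN = "running"
    DONE = "finished with zero exit value"
    EXIT = "finished with non-zero exit value"
    PSUSP = "suspended while pending"
    SSUSP = "suspended by LSF"
    USUSP = "suspended by user"
    POST_DONE = "post-processing completed without errors"
    POST_ERR = "post-processing completed with errors"


# Hypothetical grouping, used only in this sketch.
SUSPENDED = {JobState.PSUSP, JobState.SSUSP, JobState.USUSP}


def final_state(exit_code: int) -> JobState:
    """Map a process exit code to the finished state described above."""
    return JobState.DONE if exit_code == 0 else JobState.EXIT


print(final_state(0).name, final_state(2).name)
```

The mapping covers only the two "finished" states; suspension and post-processing transitions depend on scheduler events that the slide does not detail.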


Job states


Pending Jobs

A job remains pending until all conditions for its execution are met. Some of the conditions are:

• Start time specified by the user when the job is submitted

• Load conditions on qualified hosts

• Dispatch windows during which the queue can dispatch and qualified hosts can accept jobs

• Run windows during which jobs from the queue can run

• Limits on the number of job slots configured for a queue, a host, or a user

• Relative priority to other users and jobs

• Availability of the specified resources

• Job dependency and pre-execution conditions


Abnormal Termination of Jobs

An abnormally terminated job goes into EXIT state. The situations where a job terminates abnormally include:

• The job is cancelled by its owner or the LSF administrator while pending, or after being dispatched to a host.

• The job is not able to be dispatched before it reaches its termination deadline, and thus is aborted by LSF.

• The job fails to start successfully. For example, the wrong executable is specified by the user when the job is submitted.

• The job exits with a non-zero exit status.


Concepts (contd.)

• Hosts

• host types

• local and remote hosts

• master host or LSF master

• submission and execution hosts

• server host or LSF server

• client host or LSF client


Concept: Queues

• Queues represent a set of pending jobs, lined up in a defined order and waiting for their opportunity to use LSF resources.

• Queues implement different job scheduling and control policies.

• All jobs submitted to the same queue share the same scheduling and control policy.

• Queues do not correspond to individual hosts; each queue can use all server hosts in the cluster, or a configured subset of the server hosts.


Queue: example

Begin Queue
QUEUE_NAME = normal
PRIORITY = 30
NICE = 20
STACKLIMIT = 2048
DESCRIPTION = For normal low priority jobs, running only if hosts are lightly loaded.
QJOB_LIMIT = 60 # job limit of the queue
PJOB_LIMIT = 2 # job limit per processor
ut = 0.2
io = 50/240
CPULIMIT = 180/hostA # 3 hours of hostA
USERS = all
HOSTS = all
End Queue
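
To make the layout of such a queue definition concrete, here is a minimal Python sketch that reads `Begin Queue ... End Queue` blocks into dictionaries. The function name `parse_sections` and the shortened sample text are assumptions made for this illustration; the sketch ignores anything a real lsb.queues file might contain beyond one `KEY = value` pair per line with an optional `#` comment.

```python
SAMPLE = """\
Begin Queue
QUEUE_NAME = normal
PRIORITY = 30
QJOB_LIMIT = 60 # job limit of the queue
PJOB_LIMIT = 2 # job limit per processor
CPULIMIT = 180/hostA # 3 hours of hostA
USERS = all
End Queue
"""


def parse_sections(text, name="Queue"):
    """Return one dict per 'Begin <name> ... End <name>' block."""
    sections, current = [], None
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        if not line:
            continue
        if line == f"Begin {name}":
            current = {}
        elif line == f"End {name}":
            if current is not None:
                sections.append(current)
            current = None
        elif current is not None and "=" in line:
            key, value = line.split("=", 1)
            current[key.strip()] = value.strip()
    return sections


for queue in parse_sections(SAMPLE):
    print(queue["QUEUE_NAME"], queue["PRIORITY"], queue["QJOB_LIMIT"])
```

Values are kept as strings on purpose, because entries such as `180/hostA` are not plain numbers.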


Queue configuration

ADMINISTRATORS

BACKFILL

CHKPNT

CHUNK_JOB_SIZE

CORELIMIT

CPULIMIT

DATALIMIT

DEFAULT_HOST_SPEC

DISPATCH_WINDOW

EXCL_RMTJOB

EXCLUSIVE

FAIRSHARE

FILELIMIT

HJOB_LIMIT

HOSTS

IGNORE_DEADLINE

IMPT_JOBBKLG

INTERACTIVE

JOB_ACCEPT_INTERVAL

JOB_CONTROLS

JOB_STARTER

load_index

MAX_RSCHED_TIME

MC_FAST_SCHEDULE

MEMLIMIT

MIG

NEW_JOB_SCHED_DELAY

NICE

NQS_QUEUES

PJOB_LIMIT

POST_EXEC

PRE_EXEC

PREEMPTION

PRIORITY

PROCESSLIMIT

PROCLIMIT

QJOB_LIMIT

QUEUE_NAME

RCVJOBS_FROM

REQUEUE_EXIT_VALUES

RERUNNABLE

RES_REQ

RESUME_COND

RUN_WINDOW

RUNLIMIT

SLOT_RESERVE

SNDJOBS_TO

STACKLIMIT

STOP_COND

SWAPLIMIT

TERMINATE_WHEN

UJOB_LIMIT

USERS


Clusters

• Load sharing in LSF is based on clusters. A cluster is a collection of hosts running LSF. Hosts are configured centrally and managed from any machine in the LSF cluster.

• A cluster can contain a mixture of host types. By putting all host types into a single cluster, you can have easy access to the resources available on all host types.

• Clusters are normally set up based on administrative boundaries. LSF clusters work best when each user has an account on all hosts in the cluster, and user files are shared among the hosts so that they can be accessed from any host. This way LSF can send a job to any host. You need not worry about whether the job will be able to access the correct files.

• LSF can also run batch jobs when files are not shared among the hosts. LSF includes facilities to copy files to and from the host where the batch job is run, so your data will always be in the right place.


Clusters contd.

A cluster is a group of hosts that provide shared computing resources. Hosts can be grouped into clusters in a number of ways. A cluster could contain:

• All the hosts in a single administrative group

• All the hosts on one file server or sub-network

• Hosts that perform similar functions

If you have hosts of more than one type, it is often convenient to group them together in the same cluster. LSF allows you to use these hosts transparently, so applications that run on only one host type are available to the entire cluster.


- first-come, first-served (FCFS): The default type of scheduling in LSF. Jobs are considered for dispatch based on their order in the queue.

- job slot: A job slot is a bucket into which a single unit of work is assigned in the LSF system. Hosts are configured to have a number of job slots available and queues dispatch jobs to fill job slots.
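
The two terms above can be illustrated together with a small Python sketch: jobs are taken in queue order and each one occupies a free job slot. The function `dispatch_fcfs`, the host names and the rule "take the first host that has a free slot" are assumptions made for the example; the host choice LSF actually makes also depends on load and resource requirements.

```python
from collections import deque


def dispatch_fcfs(pending, free_slots):
    """Assign jobs to free job slots in strict queue order.

    pending    -- deque of job IDs, oldest first (modified in place)
    free_slots -- dict mapping host name to free slot count (modified in place)
    Returns a list of (job_id, host) placements.
    """
    placements = []
    while pending:
        # Pick the first host that still has a free slot.
        host = next((h for h, n in free_slots.items() if n > 0), None)
        if host is None:
            break  # no free slot anywhere: the remaining jobs stay pending
        free_slots[host] -= 1
        placements.append((pending.popleft(), host))
    return placements


queue = deque([101, 102, 103, 104])
slots = {"hostA": 2, "hostB": 1}
print(dispatch_fcfs(queue, slots))
print("still pending:", list(queue))
```

With three slots and four jobs, the last job submitted is the one left pending, which is the FCFS behaviour described above.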


LSF Daemons

LIM (Load Information Manager) runs on each LSF server, monitors its host's load, and forwards load information to the master LIM. LIM collects 11 built-in load indices.

Master LIM is elected to store load data collected by LIMs running on hosts in the LSF cluster.

On one host in the cluster, the LIM acts as the master. The master LIM runs on the master host and forwards load information to MBD. The master LIM collects information for all hosts and provides that information to the applications. The master LIM is chosen among all the LIMs running in the cluster. If the master LIM becomes unavailable, a LIM on another host will automatically take over the role of master.

External LIMs are site-definable to collect up to 256 different resources.

RES (Remote Execution Server) runs on each LSF server and accepts remote execution requests and provides fast, transparent and secure remote execution of interactive jobs. RES executes jobs and tasks in the background as the job owner. RES is similar to rshd (Remote Shell Daemon).


LSF Daemons contd.

SBD (Slave Batch Daemon) runs on each LSF server, receives job requests from MBD, and starts the jobs using RES. SBD is responsible for enforcing local LSF policies and maintaining the state of jobs on the machine.

MBD (Master Batch Daemon) receives job requests from LSF clients and servers and applies scheduling policies to dispatch the jobs to LSF servers in the cluster. MBD is responsible for the overall state of all jobs in the batch system. MBD keeps a file of all transactions performed on jobs throughout their lifecycle. MBD manages queues and schedules jobs on all hosts in the LSF cluster. Each cluster has one MBD, which runs on the master host.

PIM (Process Information Manager) runs on each LSF server, and is responsible for monitoring all jobs and monitoring every process created for all jobs running on the server. PIM periodically walks the process tree, and accumulates memory and CPU use data which is reported to SBD. PIM provides run time resource use for all LSF jobs.


How LSF works

1. Receive the job. Create a job file. Return the job ID to the user.

2. During the next dispatch turn, consider the job for dispatch.

3. Place the job on the best available host.

4. Set the environment on the host.

5. Start the job.


Job Submission

The job must be submitted to a queue.

How Automatic Queue Selection Works:

The criteria LSF uses for selecting a suitable queue are as follows:

– User access restriction. Queues that do not allow this user to submit jobs are not considered.

– Host restriction. If the job explicitly specifies a list of hosts on which the job can be run, then the selected queue must be configured to send jobs to all hosts in the list.

– Queue status. Closed queues are not considered.

– Exclusive execution restriction. If the job requires exclusive execution, then queues that are not configured to accept exclusive jobs are not considered.

– Job's requested resources. These must be within the resource limits of the selected queue.

If multiple queues satisfy the above requirements, then the first queue listed in the candidate queues (as defined by DEFAULT_QUEUE or LSB_DEFAULTQUEUE) that satisfies the requirements is selected.
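
The selection criteria above amount to a filter followed by "first match wins". The Python sketch below shows that shape under simplifying assumptions: the `Queue` and `Job` classes and their fields are invented for the example, and a single CPU-time limit stands in for the queue's resource limits.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Queue:
    name: str
    users: Optional[frozenset] = None   # None means all users may submit
    hosts: Optional[frozenset] = None   # None means all hosts
    is_open: bool = True
    accepts_exclusive: bool = False
    cpu_limit: Optional[int] = None     # minutes; None means no limit


@dataclass
class Job:
    user: str
    hosts: frozenset = frozenset()      # hosts explicitly requested, if any
    exclusive: bool = False
    cpu_minutes: int = 0


def eligible(queue: Queue, job: Job) -> bool:
    """Apply the five criteria in the order they are listed above."""
    if queue.users is not None and job.user not in queue.users:
        return False                    # user access restriction
    if queue.hosts is not None and not job.hosts <= queue.hosts:
        return False                    # host restriction
    if not queue.is_open:
        return False                    # queue status
    if job.exclusive and not queue.accepts_exclusive:
        return False                    # exclusive execution restriction
    if queue.cpu_limit is not None and job.cpu_minutes > queue.cpu_limit:
        return False                    # requested resources vs. queue limits
    return True


def select_queue(job: Job, candidates: list) -> Optional[str]:
    """Return the first candidate queue that satisfies every criterion."""
    return next((q.name for q in candidates if eligible(q, job)), None)


candidates = [Queue("short", cpu_limit=15), Queue("normal", cpu_limit=180)]
print(select_queue(Job(user="alice", cpu_minutes=60), candidates))
```

The order of `candidates` plays the role of the default queue list: the first eligible entry is chosen even if a later one would also fit.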


Host Selection

A number of conditions determine whether a host is eligible:

• Host dispatch windows

• Resource requirements of the job

• Resource requirements of the queue

• Host list of the queue

• Host load levels

• Job slot limits of the host


Job Dispatch

When a job is submitted to LSF, many factors control when and where the job starts to run:

• Active time window of the queue or hosts

• Resource requirements of the job

• Availability of eligible hosts

• Various job slot limits

• Job dependency conditions

• Fairshare constraints

• Load conditions


Fairshare Scheduling

• Fairshare scheduling divides the processing power of the LSF cluster among users and groups to provide fair access to resources.

• By default, LSF considers jobs for dispatch in the same order as they appear in the queue.

• If your cluster has many users competing for limited resources, the FCFS policy might not be enough. For example, one user could submit many long jobs at once and monopolize the cluster's resources for a long time, while other users submit urgent jobs that must wait in queues until all of the first user's jobs are done. To prevent this, use fairshare scheduling to control how resources should be shared by competing users.

• Fairshare is not necessarily equal share: you can assign a higher priority to the most important users. If there are two users competing for resources, you can:

- Give all the resources to the most important user

- Share the resources so the most important user gets the most resources

- Share the resources so that all users have equal importance


Global Fairshare

Global fairshare balances resource usage across the entire cluster according to one single fairshare policy. Resources used in one queue affect job dispatch order in another queue.

To configure global fairshare, you must use host partition fairshare. Use the keyword all to configure a single partition that includes all the hosts in the cluster.

Example

Begin HostPartition
HPART_NAME = GlobalPartition
HOSTS = all
USER_SHARES = [groupA@, 3] [groupB, 7] [default, 1]
End HostPartition


Chargeback Fairshare

Chargeback fairshare lets competing users share the same hardware resources according to a fixed ratio. Each user is entitled to a specified portion of the available resources.

Example

Suppose two departments contributed to the purchase of a large system. The engineering department contributed 70 percent of the cost, and the accounting department 30 percent. Each department wants to get their money's worth from the system.

1. Define two user groups in lsb.users, one listing all the engineers, and one listing all the accountants.

Begin UserGroup
Group_Name    Group_Member
eng_users     (user6 user4)
acct_users    (user2 user5)
End UserGroup

2. Configure a host partition for the host, and assign the shares appropriately.

Begin HostPartition
HPART_NAME = big_servers
HOSTS = hostH
USER_SHARES = [eng_users, 7] [acct_users, 3]
End HostPartition
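
The share values in a USER_SHARES list are relative weights, so the 7 and 3 above correspond to the 70 and 30 percent from the example. The Python sketch below does only that arithmetic. The helper names `parse_user_shares` and `share_fractions` are hypothetical, and the sketch computes the nominal entitlement only; it does not model how the scheduler weighs past usage when ordering jobs.

```python
import re


def parse_user_shares(text):
    """Turn '[name, n] [name, n] ...' into a {name: n} dictionary."""
    pairs = re.findall(r"\[\s*([^,\s\]]+)\s*,\s*(\d+)\s*\]", text)
    return {name: int(shares) for name, shares in pairs}


def share_fractions(user_shares):
    """Convert relative share counts into fractions of the whole."""
    total = sum(user_shares.values())
    return {name: shares / total for name, shares in user_shares.items()}


shares = parse_user_shares("[eng_users, 7] [acct_users, 3]")
for name, fraction in share_fractions(shares).items():
    print(f"{name}: {fraction:.0%}")
```

The same helpers applied to a list such as `[key_users@, 2000] [others, 1]` show why one group can dominate: it holds 2000 of 2001 shares.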


Priority User Fairshare

Priority user fairshare gives priority to important users, so their jobs override the jobs of other users.

Example

A queue is shared by key users and other users. As long as there are jobs from key users waiting for resources, other users' jobs will not be dispatched.

1. Define a user group called key_users in lsb.users.

2. Configure fairshare and assign the overwhelming majority of shares to the critical users:

Begin Queue
QUEUE_NAME = production
FAIRSHARE = USER_SHARES[[key_users@, 2000] [others, 1]]
...
End Queue


Resources

Boolean resources are custom resources that describe features that may not be available or identical on all machines in a cluster. For example:

• Machines may have different types and versions of operating systems.

• Machines may play different roles in the system, such as file server or compute server.

• Some machines may have special-purpose devices needed by some applications.

• Certain software packages or licenses may be available only on some of the machines.

A shared resource is a custom resource that is not tied to a specific host, but is associated with the entire cluster, or a specific subset of hosts within the cluster.

Examples of shared resources include:

• Floating licenses for software packages

• Disk space on a file server which is mounted by several machines

• The physical network connecting the hosts


Resource Use

Jobs submitted through the LSF system will have the resources they use monitored while they are running. This information is used to enforce resource limits and load thresholds as well as fairshare scheduling.

LSF collects information such as:

• Total CPU time consumed by all processes in the job

• Total resident memory usage in kilobytes of all currently running processes in a job

• Total virtual memory usage in kilobytes of all currently running processes in a job

• Currently active process group ID in a job

• Currently active processes in a job


Load Indices Collected By LIM

Load indices measure the availability of dynamic, non-shared resources on hosts in the LSF cluster. Load indices are numeric in value. Load indices built into the LIM are updated at fixed time intervals.

Viewing Info About Load Indices

% lsload
HOST_NAME  status   r15s  r1m  r15m  ut   pg   ls  it   tmp  swp   mem
hostN      ok       0.0   0.0  0.1   1%   0.0  1   224  43M  67M   3M
hostK      -ok      0.0   0.0  0.0   3%   0.0  3   0    38M  40M   7M
hostG      busy     *6.2  6.9  9.5   85%  1.1  30  0    5M   400M  385M
hostF      busy     0.1   0.1  0.3   7%   *17  6   0    9M   23M   28M
hostV      unavail
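
For scripting around this kind of output, a minimal Python sketch can split the table into one record per host. It works on the sample text from the slide rather than calling `lsload`, and the function names are made up for the example. It also assumes, based on the sample, that a leading `*` marks the load index that pushed a host into the busy state.

```python
SAMPLE = """\
HOST_NAME status r15s r1m r15m ut pg ls it tmp swp mem
hostN ok 0.0 0.0 0.1 1% 0.0 1 224 43M 67M 3M
hostK -ok 0.0 0.0 0.0 3% 0.0 3 0 38M 40M 7M
hostG busy *6.2 6.9 9.5 85% 1.1 30 0 5M 400M 385M
hostF busy 0.1 0.1 0.3 7% *17 6 0 9M 23M 28M
hostV unavail
"""


def parse_lsload(text):
    """Return {host: {column: value}} from whitespace-separated output."""
    rows = [line.split() for line in text.splitlines() if line.strip()]
    header, body = rows[0], rows[1:]
    hosts = {}
    for fields in body:
        # zip() stops at the shorter side, so a host that reports only a
        # status (such as an unavailable one) gets just that column.
        hosts[fields[0]] = dict(zip(header[1:], fields[1:]))
    return hosts


def flagged_indices(record):
    """Names of the columns whose value carries a leading asterisk."""
    return [name for name, value in record.items() if value.startswith("*")]


hosts = parse_lsload(SAMPLE)
for name, record in hosts.items():
    print(name, record["status"], flagged_indices(record))
```

Values are left as strings because the columns mix plain numbers, percentages and sizes with unit suffixes.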


Checkpointing Jobs

Checkpointing a job involves capturing the state of an executing job and the data necessary to restart it, so that the work done to reach the current stage is not wasted. The job state information is saved in a checkpoint file. There are many reasons why you would want to checkpoint a job:

Fault Tolerance

Migration

Load Balancing


Types of Checkpointing

Kernel-Level Checkpointing

Kernel-level checkpointing is provided by the operating system and can be applied to arbitrary jobs running on the system. This approach is transparent to the application: there are no source code changes and no need to re-link your application with checkpoint libraries.

User-Level Checkpointing

On systems that do not support kernel-level checkpointing, LSF provides a method called user-level checkpointing. To implement user-level checkpointing, you must have access to your application's object files (.o files), and they must be re-linked with a set of libraries provided by LSF. This approach is transparent to your application: its code does not have to be changed and the application does not know that a checkpoint and restart has occurred.

Application-Level Checkpointing

The application-level approach applies to those applications which are specially written to accommodate the checkpoint and restart.

The application checkpoints itself either periodically or in response to signals sent by other processes. When restarted, the application itself must look for the checkpoint files and restore its state.
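
A minimal Python sketch of the application-level approach is shown below: the program saves its own state after every step and, when started again, looks for the checkpoint file and resumes from it. Everything here is an assumption made for illustration, including the JSON file format, the `run` function and the `stop_after` parameter that simulates an interruption. It checkpoints periodically rather than in response to signals, and it does not use any LSF interface.

```python
import json
import os
import tempfile


def run(total_steps, ckpt_path, stop_after=None):
    """Sum the step numbers 0..total_steps-1, checkpointing after each step.

    Returns the final sum, or None if the run was interrupted
    (simulated through stop_after).
    """
    state = {"step": 0, "acc": 0}
    if os.path.exists(ckpt_path):
        # Restart: the application itself restores its saved state.
        with open(ckpt_path) as f:
            state = json.load(f)
    while state["step"] < total_steps:
        if stop_after is not None and state["step"] >= stop_after:
            return None  # simulated kill or migration
        state["acc"] += state["step"]
        state["step"] += 1
        # Write to a temporary file first, then rename, so that an
        # interrupted write never leaves a half-written checkpoint.
        tmp_path = ckpt_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump(state, f)
        os.replace(tmp_path, ckpt_path)
    return state["acc"]


with tempfile.TemporaryDirectory() as workdir:
    path = os.path.join(workdir, "job.ckpt")
    print("first run:", run(10, path, stop_after=4))
    print("restarted run:", run(10, path))
```

The restarted run continues from step 4 instead of step 0, which is the point of checkpointing: work done before the interruption is not repeated.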


MultiCluster

Resource sharing among separately managed sites

• Multiple departments/divisions in a large corporation

• Computing center supporting many sites

• Multiple cooperating organizations

Resource sharing among loosely connected sites

• Over long distance or slow links

• Across WAN with time differences


MultiCluster : Key Requirements

Autonomy

Reliability

Non-shared user accounts and file systems


MultiCluster : Inter-Cluster Batch Job Flow

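[Figure: two clusters, each with an MBD, a master LIM and its own inter-cluster policy. Labels: users; jobs and status between the MBDs; agreement between the inter-cluster policies; conf and load info between the master LIMs.]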


MultiCluster : Job Submission and Monitoring


Configuration : MultiCluster

[Figure: lsf.shared is shared or replicated across clusters; each cluster has its own lsf.cluster file and is made up of contact hosts and other hosts.]



Features in MultiCluster:

– Monitoring of load and host information of remote clusters

– Access control of inter-cluster interactive tasks

– Executing batch jobs transparently in remote clusters

– Account mapping between clusters



Enabling MultiCluster feature (step 1):

In the license.dat files of the local and remote clusters: a FEATURE line is needed to enable the LSF MultiCluster feature in both the local and the remote clusters.

FEATURE lsf_multicluster lsf_ld 3.200 1-jan-0000 800 BC53D59BDA04DE12166A "Platform"



Enabling MultiCluster feature (step 2):

In lsf.cluster.cluster-name files:

Add the LSF_MultiCluster keyword to the PRODUCTS line of the Parameters section.

Begin Parameters
PRODUCTS= LSF_Base LSF_Batch … LSF_MultiCluster
End Parameters

If the local cluster is only interested in certain remote clusters specified in the lsf.shared file, you can use the RemoteClusters section to limit which remote clusters the local cluster is interested in.

In lsf.cluster.cluster1

Begin RemoteClusters
CLUSTERNAME    EQUIV    CACHE_INTERVAL    RECV_FROM
cluster2       Y        30                N
End RemoteClusters

In lsf.cluster.cluster2

Begin RemoteClusters
CLUSTERNAME    EQUIV    CACHE_INTERVAL    RECV_FROM
cluster1       N        45                Y
End RemoteClusters


Configuration : MultiCluster

Enabling MultiCluster feature (step 3):

In lsf.shared files (should be shared or replicated):

Configure LSF Base to distribute interactive tasks across clusters. The Cluster section should list the names of all clusters. The LIM will read the lsf.shared file in LSF_CONFDIR for each remote cluster and save the first 10 host names listed in the Host section (one of them must be the master).

Begin Cluster
ClusterName    # keyword
cluster1
cluster2
End Cluster

If the lsf.shared file is not shared or replicated, then it is necessary to specify a list of valid server hosts in each cluster using the Servers option in the Cluster section.

Begin Cluster
ClusterName    Servers    # keyword
cluster1       (hostA hostB hostC)
cluster2       (hostD hostE hostF hostG hostH)
End Cluster


Configuration : MultiCluster

Enabling MultiCluster feature (step 4):

In lsb.queues file: configure LSF Batch to specify the queues sharing jobs.

Begin Queue
QUEUE_NAME=normal
PRIORITY=30
NICE=20
SNDJOBS_TO=queue2@cluster2 queue3@cluster3 … queueN@clusterN
RCVJOBS_FROM=cluster2 cluster3 … clusterN
End Queue



Enabling MultiCluster feature (step 4): Inter-cluster job flow

lsb.queues file in cluster1

Begin Queue
QUEUE_NAME=normal
PRIORITY=34
SNDJOBS_TO=normal@cluster2
RCVJOBS_FROM=cluster2
RES_REQ=r1m<0.9
HOSTS=hostA hostB
DESCRIPTION=Multicluster queue
End Queue

lsb.queues file in cluster2

Begin Queue
QUEUE_NAME=normal
PRIORITY=20
SNDJOBS_TO=normal@cluster
RCVJOBS_FROM=cluster1
RES_REQ=r1m<0.9
HOSTS=hostC hostD hostE
DESCRIPTION=Multicluster queue
End Queue


Configuration : MultiCluster

Enabling MultiCluster feature (step 5):

User level account mapping (~username/.lsfhosts):

Individual users of the LSF cluster can set up their own account mapping by setting up a .lsfhosts file in their home directories.

~userA/.lsfhosts on hosts in cluster1

cluster2 userB

~userB/.lsfhosts on hosts in cluster2

cluster1 userA

System level account mapping (lsb.users):

The LSF administrator can set up system level account mapping in the UserMap section. For example, userA in cluster1 can be mapped to user_A in cluster2.

lsb.users in cluster1

Begin UserMap
LOCAL    REMOTE                             DIRECTION
userA    userB@cluster2                     export
userC    (userD@cluster2 userE@cluster2)    export
…
End UserMap

lsb.users in cluster2

Begin UserMap
LOCAL            REMOTE            DIRECTION
userB            userA@cluster1    import
(userD userE)    userC@cluster1    import
…
End UserMap


NQS

Local Submission:


Queue Complexes

A queue complex is a set of local batch queues. Each complex has a set of associated attributes, which provide for control of the total number of concurrently running requests in member queues. This, in turn, provides a level of control between queue limits and global limits. The following queue complex limits can be set:

• Group limits

• Memory limits

• Run limits

• User limits

• MPP processing element (PE) limits (CRAY T3D systems), MPP application processing elements (CRAY T3E systems), or number of processors (IRIX systems)

To create a queue complex (a set of batch queues), use the following qmgr command:

create complex = (queuename(s)) complexname

To add or remove queues in an existing complex, use the following qmgr commands:

add queues = (queuename(s)) complexname

remove queues = (queuename(s)) complexname
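
The idea of a complex-wide run limit can be sketched in a few lines of Python: a request may start only if the total number of running requests across all member queues of its complex is below the limit. The function `can_start`, the queue names and the data layout are assumptions made for this illustration; the sketch does not reproduce qmgr or the NQS scheduler.

```python
def can_start(queue, running, complexes, run_limits):
    """Check complex-wide run limits before starting a request.

    queue      -- name of the queue the request sits in
    running    -- dict: queue name -> number of running requests
    complexes  -- dict: complex name -> set of member queue names
    run_limits -- dict: complex name -> maximum running requests
    """
    for name, members in complexes.items():
        if queue in members:
            total = sum(running.get(member, 0) for member in members)
            if total >= run_limits[name]:
                return False  # the complex as a whole is already full
    return True


complexes = {"weather": {"q_small", "q_large"}}
run_limits = {"weather": 3}
running = {"q_small": 2, "q_large": 1}

print(can_start("q_small", running, complexes, run_limits))
print(can_start("q_other", running, complexes, run_limits))
```

A queue outside every complex is not restricted by this check; its own queue limits and the global limits would still apply.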


Qmgr

IMPLEMENTATION

• All Cray Research systems

• DEC AXP systems

• HP 9000 systems

• IBM RISC system/6000 systems

• SGI systems

• SPARC systems

The qmgr command provides entry to the queue manager subsystem, which

allows authorized administrators to control requests, queues, and

daemons associated with the Network Queuing System (NQS).

Qmgr> ad[d] des[tinations] = (new_des [, new_des ...]) pipe_queue [position]

position = first | before old_des | after old_des | last

Adds valid destinations for pipe_queue at a specific position in the existing set.


NLB

The Network Load Balancer (NLB) provides status and control of work scheduling within the group of components in the NQE cluster. Sites can use the NLB to provide policy-based scheduling of work in the cluster. NLB collectors periodically collect data about the current workload on the machine where they run. The data from the collectors is sent to one or more NLB servers, which store the data and make it accessible to the NQE GUI Status and Load functions. The NQE GUI Status and Load functions display the status of all requests which are in the NQE cluster and machine load data.


Request Processing for Client Submission to NQS


Request Processing for Remote Submission
