Condor Project Computer Sciences Department University of Wisconsin-Madison [email protected] http://www.cs.wisc.edu/condor Condor Administration Paradyn-Condor Week UW Campus March 2002

Dec 30, 2015


darius-farrell

Page 1: Condor Administration  Paradyn-Condor Week UW Campus March 2002

Condor Project
Computer Sciences Department
University of Wisconsin-Madison

[email protected]
http://www.cs.wisc.edu/condor

Condor Administration

Paradyn-Condor Week
UW Campus
March 2002

Page 2:

Outline

› Other sources of Information
› User Priorities
› Policy Expressions
› Life-cycle of a job – submit to complete
  Daemons – what they do and require
  Startd states and activities
› Useful admin commands
› Authorization and Authentication
  General Security Comments/Worries

Page 3:

Outline, cont.

› Installation Layout

› Contrib Modules

› Walk-thru of UW-Madison’s condor_config files

Page 4:

Other Sources

› Condor Manual

› Condor Web Site

› “How to Build a Beowulf Cluster on Linux” by Thomas Sterling, MIT Press, published in 2001

› Email to [email protected]

Page 5:

User Priorities

› Command condor_userprio
› How it all works
› About nice_user
› Config file Settings: Priority_Halflife, Default_Prio_Factor, Nice_User_Prio_Factor, Remote_Prio_Factor, Accountant_Local_Domain

Page 6:

Introduction to Condor’s Configuration Files

› Condor’s configuration is a concatenation of multiple files, in order - definitions in later files overwrite previous definitions
› Layout and purpose of the different files:
  Global config file
  Other shared files
  Local config file
  Root config file (optional)

Page 7:

Global Config File

› All shared settings across your entire pool
› Found either in file pointed to with the CONDOR_CONFIG environment variable, /etc/condor/condor_config, or the home directory of the “condor” user
› Most settings can be in this file
› Only works as a “global” file if it is on a shared file system

Page 8:

Other shared files

› You can configure a number of other shared config files:
  files to hold common settings to make it easier to maintain (for example, all policy expressions, which we’ll see later)
  platform-specific config files

Page 9:

Local config file

› Any machine-specific settings
  local policy settings for a given owner
  different daemons to run (for example, on the Central Manager!)
› Can either be on the local disk of each machine, or have separate files in a shared directory, each named by hostname

Page 10:

Root config file (optional)

› You can specify a “root” config file, which is always processed after all other files

› This allows root to specify certain settings which cannot be changed by another user (like the path to the Condor daemons)

› Only useful if daemons are started as root but someone else has access to edit Condor’s config files

Page 11:

Basic syntax

› # is a comment

› A “\” at the end of a line is a line-continuation, so both lines are treated as one big entry

› All names are case insensitive

› “Macros” have the form: Attribute_Name = value

› You reference other macros with: A = $(B)
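The syntax rules above, together with the earlier point that definitions in later files overwrite earlier ones, can be sketched in a few lines of Python. This is only an illustration of the rules as stated on the slides, not Condor's actual parser; the function names (parse_config, load_config, expand) and the sample settings are made up for the example.

```python
import re

def parse_config(text):
    """Parse 'Name = value' lines into a dict keyed by lower-cased name."""
    macros = {}
    pending = ""
    for raw in text.splitlines():
        line = pending + raw.strip()
        pending = ""
        if line.endswith("\\"):
            # A trailing backslash joins this line with the next one.
            pending = line[:-1].rstrip() + " "
            continue
        if not line or line.startswith("#") or "=" not in line:
            continue  # blank line, comment, or not a macro definition
        name, _, value = line.partition("=")
        # Names are case insensitive; a later definition overwrites an earlier one.
        macros[name.strip().lower()] = value.strip()
    return macros

def load_config(*texts):
    """Merge several config files in order; later files win."""
    merged = {}
    for text in texts:
        merged.update(parse_config(text))
    return merged

def expand(macros, name, depth=0):
    """Return the value of `name` with every $(OTHER) reference substituted."""
    if depth > 20:
        raise ValueError("macro reference loop")
    value = macros.get(name.lower(), "")
    return re.sub(r"\$\(([^)]+)\)",
                  lambda m: expand(macros, m.group(1), depth + 1),
                  value)

# Hypothetical global and local files, for illustration only.
GLOBAL_TEXT = """# shared settings
HighLoad = 0.5
CPU_Busy = ($(NonCondorLoadAvg) >= \\
    $(HighLoad))
NonCondorLoadAvg = (LoadAvg - CondorLoadAvg)
"""
LOCAL_TEXT = "HIGHLOAD = 0.8\n"

merged = load_config(GLOBAL_TEXT, LOCAL_TEXT)
print(expand(merged, "cpu_busy"))
```

Because the local file is read last, its HIGHLOAD replaces the global HighLoad, and the expanded CPU_Busy picks up the local value.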

Page 12:

Policy Expressions

Back to Frieda

Page 13:

Policy Configuration

I am adding nodes to the Cluster… but the Engineering Department has priority on these nodes.
(Boss Fat Cat)

Page 14:

The Machine (Startd) Policy Expressions

START – When is this machine willing to start a job
RANK - Job Preferences
SUSPEND - When to suspend a job
CONTINUE - When to continue a suspended job
PREEMPT – When to nicely stop running a job
KILL - When to immediately kill a preempting job

Page 15:

Frieda’s Current Settings

START = True
RANK =
SUSPEND = False
CONTINUE =
PREEMPT = False
KILL = False

Page 16:

Frieda’s New Settings for the Chemistry nodes

START = True
RANK = Department == "Chemistry"
SUSPEND = False
CONTINUE =
PREEMPT = False
KILL = False

Page 17:

Submit file with Custom Attribute

Executable = charm-run
Universe = standard
+Department = "Chemistry"
queue

Page 18:

What if “Department” not specified?

START = True
RANK = Department =!= UNDEFINED && Department == "Chemistry"
SUSPEND = False
CONTINUE =
PREEMPT = False
KILL = False

Page 19:

Another example

START = True
RANK = Department =!= UNDEFINED && ((Department == "Chemistry")*2 + (Department == "Physics"))
SUSPEND = False
CONTINUE =
PREEMPT = False
KILL = False
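The preference order this RANK example is aiming for (Chemistry jobs first, then Physics, then everything else) can be written out as a small Python sketch. It models only the intended ordering, not how the ClassAd language evaluates the expression; the function name department_rank and the sample job ads are hypothetical.

```python
def department_rank(job_ad):
    """Numeric preference the example RANK is meant to express."""
    dept = job_ad.get("Department")  # None stands in for an undefined attribute
    if dept is None:
        return 0
    # Python booleans count as 0 or 1, so Chemistry scores 2 and Physics 1.
    return (dept == "Chemistry") * 2 + (dept == "Physics")

# Hypothetical job ads competing for the same machine.
jobs = [
    {"Owner": "alice"},
    {"Owner": "bob", "Department": "Physics"},
    {"Owner": "carol", "Department": "Chemistry"},
]
preferred = max(jobs, key=department_rank)
print(preferred["Owner"])
```

A higher number means the machine prefers that job, which is why a job with no Department attribute still runs but loses to Physics and Chemistry jobs.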

Page 20:

Policy Configuration, cont

The Cluster is fine. But not the desktop machines. Condor can only use the desktops when they would otherwise be idle.
(Boss Fat Cat)

Page 21:

So Frieda decides she wants the desktops to:

› START jobs when there has been no activity on the keyboard/mouse for 5 minutes and the load average is low

› SUSPEND jobs as soon as activity is detected

› PREEMPT jobs if the activity continues for 5 minutes or more

› KILL jobs if they take more than 5 minutes to preempt

Page 22:

Macros in the Config File

NonCondorLoadAvg = (LoadAvg - CondorLoadAvg)
BackgroundLoad = 0.3
HighLoad = 0.5
KeyboardBusy = (KeyboardIdle < 10)
CPU_Busy = ($(NonCondorLoadAvg) >= $(HighLoad))
MachineBusy = ($(CPU_Busy) || $(KeyboardBusy))
ActivityTimer = (CurrentTime - EnteredCurrentActivity)

Page 23:

Desktop Machine Policy

START = $(CPU_Idle) && KeyboardIdle > 300
SUSPEND = $(MachineBusy)
CONTINUE = $(CPU_Idle) && KeyboardIdle > 120
PREEMPT = (Activity == "Suspended") && $(ActivityTimer) > 300
KILL = $(ActivityTimer) > 300
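To see how the macros and the desktop policy fit together, the sketch below evaluates the five expressions in Python for one snapshot of machine attributes. It is a reading aid, not how the startd works internally. One assumption is needed: the slides use CPU_Idle without defining it, so the sketch takes it to mean "non-Condor load at or below BackgroundLoad". The function names and the sample values are invented for the example.

```python
BACKGROUND_LOAD = 0.3
HIGH_LOAD = 0.5

def non_condor_load(m):
    return m["LoadAvg"] - m["CondorLoadAvg"]

def cpu_idle(m):
    # Assumption: CPU_Idle is not defined on the slides; this is a guess.
    return non_condor_load(m) <= BACKGROUND_LOAD

def cpu_busy(m):
    return non_condor_load(m) >= HIGH_LOAD

def machine_busy(m):
    keyboard_busy = m["KeyboardIdle"] < 10
    return cpu_busy(m) or keyboard_busy

def activity_timer(m):
    return m["CurrentTime"] - m["EnteredCurrentActivity"]

def policy(m):
    """Evaluate every policy expression for one snapshot of machine attributes."""
    return {
        "START": cpu_idle(m) and m["KeyboardIdle"] > 300,
        "SUSPEND": machine_busy(m),
        "CONTINUE": cpu_idle(m) and m["KeyboardIdle"] > 120,
        "PREEMPT": m["Activity"] == "Suspended" and activity_timer(m) > 300,
        "KILL": activity_timer(m) > 300,
    }

# Hypothetical snapshot: the owner came back 400 seconds ago and is still typing.
suspended = {
    "LoadAvg": 0.2, "CondorLoadAvg": 0.0, "KeyboardIdle": 2,
    "Activity": "Suspended",
    "CurrentTime": 1000, "EnteredCurrentActivity": 600,
}
print(policy(suspended))
```

The sketch evaluates all expressions at once for simplicity; on a real machine each expression only matters in the state or activity where it applies (for example, KILL is only relevant once the job is being vacated).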

Page 24:

Policy Review

› Users submitting jobs can specify Requirements and Rank expressions
› Administrators can specify Startd Policy expressions individually for each machine (Start, Suspend, etc)
› Expressions can use any job or machine ClassAd attribute
› Custom attributes easily added
› Bottom Line: Enforce almost any policy!

Page 25:

Additional Policy Parameters

› WANT_SUSPEND

› WANT_VACATE

Page 26:

Road Map of the Policy Expressions

[Figure: flowchart of the policy expressions START, WANT_SUSPEND, SUSPEND, PREEMPT, WANT_VACATE, and KILL, with True/False branches leading to the Vacating and Killing activities. The legend distinguishes expressions from activities.]

Page 27:

Negotiator Policy Expressions

› PREEMPTION_REQUIREMENTS
› PREEMPTION_RANK

Examples:

PREEMPTION_REQUIREMENTS = $(StateTimer) > (1 * $(HOUR)) && RemoteUserPrio > SubmittorPrio * 1.2
PREEMPTION_RANK = (RemoteUserPrio * 1000000) - ImageSize

Page 28:

The Condor Daemons

› condor_master (controls everything else)
› condor_startd (executing jobs)
  condor_starter (helper for starting jobs)
› condor_schedd (submitting jobs)
  condor_shadow (submit-side helper)
› condor_collector (only on Central Manager)
› condor_negotiator (only on CM)
› You only have to run the daemon(s) for the service(s) you want to provide

Page 29:

condor_master

› Starts up all other Condor daemons
› If there are any problems and a daemon exits, it restarts the daemon and sends email to the administrator
› Checks the time stamps on the binaries it is configured to spawn, and if new binaries appear, the master will gracefully shut down the currently running version and start the new version

Page 30:

condor_master (cont’d)

› Provides access to many remote administration commands:
  condor_reconfig, condor_restart, condor_off, condor_on
› Default server for many other commands:
  condor_config_val, etc.
› Periodically runs condor_preen to clean up any files Condor might have left on the machine (the rest of the daemons clean up after themselves, as well)

Page 31:

condor_startd

› Represents a machine to the Condor pool
› Enforces the wishes of the machine owner (the owner’s “policy”)
› Responsible for starting, suspending, and stopping jobs
› Spawns the appropriate condor_starter, depending on the type of job
› Provides other administrative commands (for example, condor_vacate)

Page 32:

condor_starter

› Spawned by the condor_startd to handle all the details of starting and managing the job (for example, transferring the job’s binary to the executing machine or sending back exit status)

› On SMP machines, you get one condor_starter per CPU

› For PVM jobs, the starter also spawns a PVM daemon (condor_pvmd)

Page 33:

condor_schedd

› Represents users to the Condor pool
› Maintains persistent queue of jobs
  Queue is not strictly FIFO (priority based)
› Responsible for contacting available machines and spawning waiting jobs
› Services most user commands:
  condor_submit, condor_rm, condor_q

Page 34:

condor_shadow

› Represents the job on the submit machine
› Services requests from “standard” jobs for “remote system calls”, including all file I/O
› Is responsible for making decisions on behalf of the job (for example, where to store the checkpoint file)
› There will be one condor_shadow process running on your submit machine for each currently running Condor job

Page 35:

condor_shadow (cont’d)

› The shadow doesn’t put much load on your submit machine:
  Almost always blocked waiting for requests from the job or doing I/O
  Relatively small memory footprint
› Still, you can limit the impact of the shadows on a given submit machine:
  They can be started by Condor with a “nice-level” that you configure (renice)
  Can put a limit on the total number of shadows running on a machine

Page 36:

condor_collector

› Collects information from all other Condor daemons in the pool
› Each daemon sends a periodic update called a “ClassAd” to the collector
› Services queries for information:
  Queries from other Condor daemons
  Queries from users (condor_status)

Page 37:

condor_negotiator

› Performs “matchmaking” in Condor
  Gets information from the collector about all available machines and all idle jobs
  Tries to match jobs with machines that will serve them
  Both the job and the machine must satisfy each other’s requirements (this is called “2-way matching”)
› Handles User Priorities

Page 38:

Layout of a General Condor Pool

[Figure: pool layout showing which daemons run on each kind of machine. Central Manager: master, collector, negotiator, schedd, startd. Submit-Only: master, schedd. Execute-Only (two shown): master, startd. Regular Node (two shown): master, schedd, startd. The legend marks ClassAd communication pathways and spawned processes.]

Page 39:

Job Startup

[Figure: job startup diagram showing Submit, Schedd, Shadow, Startd, Starter, the Customer Job, and the Condor Syscall Lib.]

Page 40:

Machine States

[Figure: state diagram of the machine states Owner, Unclaimed, Matched, Claimed, and Preempting, with a starting point marked “begin”.]

Page 41:

Useful Admin Commands

Page 42:

Viewing things with condor_status

› condor_status has lots of different options to display various kinds of info

› Supports “-constraint” so you can only view ClassAds that match an expression you specify

› Supports “-format” so you can get the data in whatever form you want (very useful for writing scripts)

› View any kind of daemon ClassAd

Page 43:

Viewing things with condor_q

› View the job queue

› The “-long” option is useful to see the entire ClassAd for a given job

› Also supports the “-constraint” option

› Can view job queues on remote machines with the “-name” option

Page 44:

Looking at condor_q -analyze

› You specify a job or set of jobs you want to analyze

› condor_q will try to figure out why the job isn’t running

› The output is not as user-friendly as we’d like (though we’re working on it)

› Good at finding errors in Requirements expressions set by users

Page 45:

Host/IP Security in Condor

› You can configure each machine in your pool to allow or deny certain actions from different groups of machines:
  “read” access - querying information
    • condor_status, condor_q, etc
  “write” access - updating information
    • condor_submit, adding a node to the pool, etc
  “administrator” access
    • condor_on, off, reconfig, restart...
  “owner” access
    • Things a machine owner can do (vacate)

Page 46:

Setting up Host/IP-address Security in Condor (part 1)

› To configure, you list what hosts are allowed or denied to perform each action
  If you list hosts that are allowed, everything else is denied
  If you list hosts that are denied, everything else is allowed
  If you list both, only hosts that are listed in “allow” but not in “deny” are allowed

Page 47:

Setting up Host/IP-address Security in Condor (part 2)

› There are many possibilities for specifying which hosts are allowed or denied:
  Host names, domain names
  IP addresses, subnets
  Wildcards
    • ‘*’ can be used anywhere (once) in a host name (for example, “infn-corsi*.corsi.infn.it”)
    • ‘*’ can be used at the end of any IP address (e.g. “128.105.101.*” or “128.105.*”)

Page 48:

Setting up Host/IP-address Security in Condor (part 3)

› Can define values that affect all daemons:
  HOSTALLOW_WRITE, HOSTDENY_READ, HOSTALLOW_ADMINISTRATOR, etc.
› Can define daemon-specific settings:
  HOSTALLOW_READ_SCHEDD, HOSTDENY_WRITE_COLLECTOR, etc.
› Write access doesn’t automatically provide read access: you must grant both!

Page 49:

Example Host/IP Security Settings

HOSTALLOW_WRITE = *.infn.it
HOSTALLOW_ADMINISTRATOR = infn-corsi1*, \
    $(CONDOR_HOST), axpb07.bo.infn.it, \
    $(FULL_HOSTNAME)
HOSTDENY_ADMINISTRATOR = infn-corsi15
HOSTDENY_READ = *.gov, *.mil
HOSTDENY_ADMINISTRATOR_NEGOTIATOR = *
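The allow/deny rules from the preceding slides can be modelled with a short Python sketch, here applied to the ADMINISTRATOR and READ lists from the example settings. It is an approximation for illustration: it uses Python's shell-style glob matching on the name exactly as given, whereas Condor itself works with resolved host names and IP addresses. The function names are hypothetical, and the behaviour when neither list is set is an assumption, since the slides do not cover that case.

```python
from fnmatch import fnmatchcase

def matches(host, patterns):
    """True if the host name matches any pattern ('*' is the wildcard)."""
    host = host.lower()
    return any(fnmatchcase(host, p.lower()) for p in patterns)

def is_authorized(host, allow=None, deny=None):
    """Apply the allow/deny rules from the slides to one host name."""
    if deny and matches(host, deny):
        return False  # a deny entry always wins
    if allow:
        return matches(host, allow)  # with an allow list, everything else is denied
    # Only a deny list: everything else is allowed.
    # Assumption: with no lists at all this sketch also allows the host.
    return True

# ADMINISTRATOR lists from the example, without the $(...) macros.
admin_allow = ["infn-corsi1*", "axpb07.bo.infn.it"]
admin_deny = ["infn-corsi15"]
# READ has only a deny list in the example.
read_deny = ["*.gov", "*.mil"]

for host in ["infn-corsi12", "infn-corsi15", "axpb07.bo.infn.it", "pc1.infn.it"]:
    print(host, is_authorized(host, admin_allow, admin_deny))
```

In the example, infn-corsi15 matches the allow pattern infn-corsi1* but is still refused administrator access because it is also on the deny list.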

Page 50:

New Security Features in v6.3

› AUTHENTICATION_METHODS
  Kerberos, GSI (X.509 certs), FS, NTSSPI
› Strong Encryption
› Demo/BoF in 3397

Page 51:

Considerations for Installing a Condor Pool

› What machine should be your central manager?
› Does your pool have a shared file system?
› Where should you install your Condor binaries and configuration files?
› Where should you put the local directories for each machine?
› Will you start the daemons as root or as some other user?

Page 52:

What machine should be your central manager?

› The central manager (CM) is very important for the proper functioning of your pool
› You want a machine that will be online all the time, or will be rebooted quickly if there is a problem
› If the CM crashes, jobs that are currently matched will continue to run, but new jobs will not be matched
› A good network connection helps

Page 53:

Does your pool have a shared file system?

› A shared file system is essential in v6.2 if you wish to run “vanilla” jobs
  In v6.3, it is not essential, but helpful
› It can also make administration of a large pool easier
› NFS works better with Condor than AFS, since Condor does not manage AFS tokens (yet), though either one will work

Page 54:

Where should you install your binaries and configuration files?

› Putting the config files on a shared file system makes administration much easier
› Putting the binaries on a shared file system makes installing a new version easier, but it can be less stable (since problems with the network can cause daemons to crash)
› condor_master on the local disk is a good compromise

Page 55:

Where should you put the local directories for each machine?

› You need a fair amount of disk space in the spool directory for each condor_schedd (to hold the job queue and the binaries for each job submitted).
› The execute directory is used by the condor_starter to hold the binary for any Condor job running on a machine
› The log directory is used by all daemons… more space = more saved info

Page 56:

Will you start the daemons as root or some other user?

› If you have root access, we recommend you start the daemons as root
  More secure
  Less confusion for users
› If you don’t have root access, Condor will still work; users just have to take some extra steps to submit jobs
› Can have “personal Condor” installed - only you can submit jobs

Page 57:

Basic Installation Procedure

› 1) Decide what version and parts of Condor to install and download them
› 2) Install the “release directory” - all the Condor binaries and libraries
› 3) Set up the Central Manager
› 4) (optional) Set up Condor on any other machines you wish to add to the pool
› 5) Spawn the Condor daemons

Page 58:

The Different Versions of Condor

› We distribute two versions of Condor:
  Stable Series
    • Heavily tested, recommended for use
    • 2nd number of version string is even (6.0.3)
  Development Series
    • Latest features, not necessarily well-tested
    • 2nd number of version string is odd (6.1.8)
    • Not recommended unless you know what you are doing and/or need a new feature
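The even/odd numbering rule is easy to check mechanically. The Python sketch below classifies a version string by its second number, following the rule on this slide; the function name release_series is made up for the example.

```python
def release_series(version):
    """Classify a version string such as '6.0.3' by its second number."""
    second = int(version.split(".")[1])
    # Even second number: stable series. Odd: development series.
    return "stable" if second % 2 == 0 else "development"

for v in ["6.0.3", "6.1.8", "6.2.0", "6.3.1"]:
    print(v, release_series(v))
```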

Page 59:

Condor Versions (cont’d)

› All daemons advertise a CondorVersion attribute in the ClassAd they publish
› You can also view the version string by running ident on any Condor binary
› All parts of Condor on a single machine should run the same version!
› Machines in a pool can usually run different versions and communicate with each other
› It will be made very clear when a version is incompatible with older versions

Page 60:

Downloading Condor

› Go to http://www.cs.wisc.edu/condor/
› Fill out the form and download the different pieces you need
› Normally, you want the full stable release
› There are also “contrib” modules for non-standard parts of Condor, or individual pieces of the development release that you might need (e.g. SMP support)
› Distributed as compressed “tar” files
› Once you download, unpack them

Page 61:

Install the Release Directory

› In the directory where you unpacked the tar file, you’ll find a release.tar file with all the binaries and libraries
› condor_install will install this as the release directory for you
› In a pool with a shared release directory, you should run condor_install somewhere with write access to the shared directory
› You need a separate release directory for each platform!

Page 62:

Setup the Central Manager

› You must configure Condor specially on your central manager, so that it knows it needs to spawn the additional daemons

› Easiest way to do this is by using condor_install

› There’s a special option for setting up a central manager

Page 63:

Setup any other machines you wish to add to the pool

› If you have a shared file system, once you run condor_install on your file server (and again on your central manager if it’s a separate machine) you can just run condor_init on any other machine you wish to add to your pool
› Without a shared file system, you must run condor_install on each host

Page 64:

Spawn the Condor daemons

› Once Condor is configured and setup, you just have to spawn the condor_master on each host to “start” Condor

› You should startup Condor on the Central Manager first

› The user you spawn the condor_master as makes a big difference: root vs. “condor” vs. another user

Page 65:

Having a shared release directory is key

› Keep all of your config files in one place
  Allows you to have a real global config file, with common values across the whole pool
  Much easier to make changes (even for “local” config files in one shared directory)
› Keep all of your binaries in one place
  Prevents having different versions accidentally left on different machines
  Easier to upgrade

Page 66:

Thank you!

Check us out on the Web: http://www.cs.wisc.edu/condor

Email: [email protected]

Page 67:

The rest of the slides

› The rest of the slides are in no particular order, and may or may not have been used during the actual presentation – so if you’re looking at these months after this presentation, it was more organized!

Page 68:

Administering a Real Pool

› Having a shared release directory is key

› Viewing things with condor_status

› Viewing things with condor_q


Page 71:

Hands-On Exercise #3

› Please point your browser to the new instructions:
  Go back to the tutorial homepage
  Click on Shared Release Directory
› Again, read the instructions carefully and execute any commands on a line beginning with % in your xterm
› If you exited Netscape, just click on “Tutorial” from your Start menu

Page 72:

Advanced Installation Options

› Spawning the Condor daemons automatically at reboot
› “Full installation” of condor_compile
› Advertising your own attributes in the machine ClassAd
› Setting up Host/IP security in Condor (which we already talked about)
› Customizing the startd policy

Page 73:

Spawning the Condor Daemons automatically at reboot

› If you are running Condor as root, you probably want to have your boot scripts start the condor_master automatically
› Provides more robust service, less manual work for the administrators
› We provide a “SysV-style” init script: <release>/etc/examples/condor.boot
› Exact details depend on your operating system platform

Page 74:

Why Perform a “Full Installation” of condor_compile?

› condor_compile is used to re-link user jobs with the Condor libraries so they become “standard” jobs
› By default, condor_compile only works with certain commands (gcc, g++, g77, cc, CC, f77, f90, ld)
› With a “full-installation”, condor_compile will work with any command (in particular, “make”)

Page 75:

How to Perform a Full Installation of condor_compile:

› Move your real ld binary, the “linker”, to “ld.real”
  The path to “ld” varies from platform to platform… though it’s usually “/bin/ld”
› Install Condor’s “ld” script in its place
› If condor_compile is used, our ld will do the Condor-specific magic
› If not, our ld will just call the real ld and everything will work like normal

Page 76:

Advertising Your Own Attributes in the Machine ClassAd

› Add new macro(s) to the config file
  This is usually done in the local config file
  Can name the macros anything, so long as the names don’t conflict with existing ones
› Tell the condor_startd to include these other macros in the ClassAd it sends out
  Edit the STARTD_EXPRS macro to include the names of the macros you want to advertise (comma separated)

Page 77:

Hands-On Exercise #4

Defining Your Own Attributes in the Startd ClassAd

Page 78:

Hands-On Exercise #4

› Please point your browser to the new instructions:
  Go back to the tutorial homepage
  Click on Local Startd Attributes
› Again, read the instructions carefully and execute any commands on a line beginning with % in your xterm
› If you exited Netscape, just click on “Tutorial” from your Start menu

Page 79:

10 Minute Break

Questions are welcome….

Page 80:

Configuring the Startd Policy

› Allows administrators or machine owners the power to control when and if Condor starts and stops jobs on a machine

› Lots of flexibility: can base any policy expression on any attributes in the startd’s ClassAd, or the ClassAd of the currently running job

› Many mechanisms available: suspending, checkpointing, hard kill, etc.

Page 81:

Basic Progression a Job Can Pass Through When an Owner Returns

› No owner: job is running
› The owner returns: job is suspended
  If the owner leaves again shortly, the job is resumed
› The owner is still there: job is vacated
  soft-kill... do a checkpoint if possible
› The vacate is taking too long: job is hard-killed (kill -9)

Page 82: Condor Administration  Paradyn-Condor Week UW Campus March 2002

82http://www.cs.wisc.edu/condor

Introduction to the Policy Expressions

› The policy expressions control the transitions between various “states” and “activities” a machine can be in

› All expressions use boolean logic

› It is common to define macros for complicated terms in your expressions to make them easier to read

› Often, you only need to edit these macros to customize your policy

Page 83: Condor Administration  Paradyn-Condor Week UW Campus March 2002


Machine States

[Diagram: the five machine states (Owner, Unclaimed, Matched, Claimed, Preempting), with a “begin” arrow marking where a machine starts.]

Page 84: Condor Administration  Paradyn-Condor Week UW Campus March 2002


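Machine Activities

[Diagram: the five machine states with the activities inside each one. Owner: Idle. Unclaimed: Idle, Benchmarking. Matched: Idle. Claimed: Idle, Busy, Suspended. Preempting: Vacating, Killing. A “begin” arrow marks where a machine starts.]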

Page 85: Condor Administration  Paradyn-Condor Week UW Campus March 2002


The Policy Expressions

› START

› RANK

› WANT_SUSPEND

› SUSPEND

› CONTINUE

› PREEMPT

› WANT_VACATE

› KILL

Page 86: Condor Administration  Paradyn-Condor Week UW Campus March 2002


Road Map of the Policy Expressions

[Diagram: flow chart linking the expressions START, WANT_SUSPEND, SUSPEND, PREEMPT, WANT_VACATE and KILL to the Vacating and Killing activities. Each arrow is labeled True or False. The legend distinguishes boxes that are expressions from boxes that are activities.]

Page 87: Condor Administration  Paradyn-Condor Week UW Campus March 2002


The START expression

› The most important policy expression

› This is the “requirements” expression for machines

› Controls when Condor will start jobs

› Can reference attributes of the job (such as its size or the user who submitted it)

› A machine will only leave the Owner state if START evaluates to True (or Undefined)

Page 88: Condor Administration  Paradyn-Condor Week UW Campus March 2002


Example Start Expressions

KeyboardIsIdle = (KeyboardIdle > (15 * $(MINUTE)))

CPUIsIdle = (LoadAvg - CondorLoadAvg < 0.3)

START : $(KeyboardIsIdle) && $(CPUIsIdle)

or

START : Owner == "wright" || Owner == "condor" || \
    ($(KeyboardIsIdle) && $(CPUIsIdle))

or

START : True
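The logic of the second example can be sketched outside of Condor. The Python below is only an illustration of how the macros combine, not how the startd evaluates ClassAd expressions: the function names are made up for this sketch, the inputs stand in for the KeyboardIdle, LoadAvg and CondorLoadAvg attributes, and the Undefined case is ignored.

```python
MINUTE = 60  # seconds, standing in for the $(MINUTE) macro


def keyboard_is_idle(keyboard_idle):
    """KeyboardIsIdle: no keyboard activity for more than 15 minutes."""
    return keyboard_idle > 15 * MINUTE


def cpu_is_idle(load_avg, condor_load_avg):
    """CPUIsIdle: load not caused by Condor itself is below 0.3."""
    return (load_avg - condor_load_avg) < 0.3


def start(owner, keyboard_idle, load_avg, condor_load_avg):
    """Second START example: two named users may always start jobs;
    everyone else needs an idle keyboard and an idle CPU."""
    return (
        owner == "wright"
        or owner == "condor"
        or (keyboard_is_idle(keyboard_idle)
            and cpu_is_idle(load_avg, condor_load_avg))
    )
```

With this sketch, a job from "wright" is accepted even on a busy machine, while a job from any other user is accepted only after 15 minutes without keyboard activity and with a non-Condor load below 0.3.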

Page 89: Condor Administration  Paradyn-Condor Week UW Campus March 2002


The RANK Expressions

› Both machines and jobs can “rank” what they’re looking for

› If a machine is claimed, it still advertises that it’s available

    › Always looking for a higher-ranked job

    › Will preempt the current job if a better one is available

› Jobs can rank machines - in a large pool, users can prefer certain hosts

Page 90: Condor Administration  Paradyn-Condor Week UW Campus March 2002


Using Machine RANK Expressions

› The expression evaluates to a floating point number

    › Use “+” instead of “&&” to combine terms

    › (X == Y) evaluates to 0 or 1

    › Allows unlimited flexibility

› Often used in large pools made up of individual groups, so that machines owned by one group will always run jobs submitted by the users in that group

Page 91: Condor Administration  Paradyn-Condor Week UW Campus March 2002


Example Rank Expression

MachineOwner = (Owner == "wright")

Friend = (Owner == "tannenba" || \
    Owner == "ballard")

ResearchGroup = (Owner == "jbasney" || \
    Owner == "raman")

Rank : Friend + ResearchGroup*10 + \
    MachineOwner*20

Startd_Exprs = $(Startd_Exprs), Friend, \
    MachineOwner, ResearchGroup

Page 92: Condor Administration  Paradyn-Condor Week UW Campus March 2002


Example Rank Expression Explained

› First, we define different groups of people that we’re interested in (Friend, ResearchGroup and MachineOwner)

› Then, we define the Rank (it’s an expression, so we need to use “:”) to give different weights to each group

› Finally, we add these new attributes to the list of attributes we publish so that Rank can be evaluated remotely
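The arithmetic behind this Rank can be checked with a small sketch. The Python below is only an illustration of the weighting, using the user names from the example; the function name `rank` is made up here, and Python booleans stand in for the 0-or-1 values that (X == Y) produces.

```python
def rank(owner):
    """Weight a job by who submitted it, mirroring the example Rank:
    Friend + ResearchGroup*10 + MachineOwner*20."""
    machine_owner = owner == "wright"
    friend = owner in ("tannenba", "ballard")
    research_group = owner in ("jbasney", "raman")
    # True counts as 1 and False as 0, so the terms simply add up.
    return float(friend + research_group * 10 + machine_owner * 20)


def preferred(owners):
    """Pick the submitter this machine would rank highest."""
    return max(owners, key=rank)
```

Under these weights the machine owner's jobs outrank the research group's, which outrank a friend's, and everyone else ranks 0. For example, `preferred(["ballard", "raman", "someone"])` picks "raman".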

Page 93: Condor Administration  Paradyn-Condor Week UW Campus March 2002


WANT_SUSPEND vs. SUSPEND

› WANT_SUSPEND determines if the startd should even consider entering the Suspended activity:

    › If WANT_SUSPEND is True, while a job is running, SUSPEND is checked, and if it evaluates to True, the job is suspended

    › If WANT_SUSPEND is False, SUSPEND is never evaluated, and while the job is running, PREEMPT is checked
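The branch described on this slide can be sketched as a small decision function. This is a simplified model that follows the slide, not the startd's actual code: the function name `busy_step` and the returned strings are made up for illustration, and SUSPEND and PREEMPT are passed as callables so the sketch can show which one actually gets evaluated.

```python
def busy_step(want_suspend, suspend, preempt):
    """Decide what happens to a running job on one policy check.

    want_suspend: value of WANT_SUSPEND (a bool)
    suspend, preempt: zero-argument callables returning the current
    value of SUSPEND and PREEMPT
    """
    if want_suspend:
        # Only SUSPEND is consulted; PREEMPT is not checked here.
        return "suspend" if suspend() else "keep running"
    # SUSPEND is never evaluated when WANT_SUSPEND is False.
    return "preempt" if preempt() else "keep running"
```

Passing callables instead of plain values makes the "never evaluated" point visible: with `want_suspend=False`, the `suspend` callable is not invoked at all.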

Page 94: Condor Administration  Paradyn-Condor Week UW Campus March 2002


CONTINUE

› Only evaluated while in the Suspended activity (WANT_SUSPEND must therefore be True)

› If CONTINUE evaluates to True, the job is resumed and the machine goes back to the Busy activity

Page 95: Condor Administration  Paradyn-Condor Week UW Campus March 2002


PREEMPT

› Specifies when a machine enters the Preempting state

› Must handle two cases (and usually has two separate terms in the expression):

    › WANT_SUSPEND is True, and the job has been suspended longer than the owner wants

    › WANT_SUSPEND is False, and the owner is using the machine again
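The two terms can be sketched as one boolean expression. The Python below is only an illustration of the shape such an expression usually takes: the parameter names and the 10-minute default for `max_suspend_seconds` are assumptions made up for this sketch, not values taken from Condor's configuration.

```python
def preempt(want_suspend, activity, seconds_suspended, owner_is_active,
            max_suspend_seconds=600):
    """PREEMPT as two separate terms, one per case on the slide."""
    # Case 1: suspension is in use and the job has been suspended
    # longer than the owner is willing to wait (hypothetical limit).
    suspended_too_long = (
        want_suspend
        and activity == "Suspended"
        and seconds_suspended > max_suspend_seconds
    )
    # Case 2: suspension is not in use and the owner is back.
    owner_back = (not want_suspend) and owner_is_active
    return suspended_too_long or owner_back
```

Keeping the two terms separate makes it easy to tune one case, for instance the suspend time limit, without touching the other.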

Page 96: Condor Administration  Paradyn-Condor Week UW Campus March 2002


WANT_VACATE vs. KILL

› WANT_VACATE is only evaluated when PREEMPT is True and the machine is entering the Preempting state

› Determines if a vacate (checkpoint) is wanted, or if the job should be immediately hard killed

› KILL is only evaluated if the job is checkpointing (WANT_VACATE was True)

› If KILL evaluates to True, the job is hard-killed
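What happens once PREEMPT is True can be sketched as follows. This is a simplified model of the slide, not the startd's implementation: the function name is made up, KILL is consulted only once here rather than repeatedly while the vacate is in progress, and it is passed as a callable to show that it is not evaluated when WANT_VACATE is False.

```python
def preempting_activities(want_vacate, kill):
    """Return the activities a machine passes through in the
    Preempting state.

    want_vacate: value of WANT_VACATE (a bool)
    kill: zero-argument callable returning the current value of KILL
    """
    if not want_vacate:
        # No checkpoint wanted: hard-kill immediately.
        return ["Killing"]
    # A vacate (soft-kill with checkpoint, if possible) is attempted.
    if kill():
        # The vacate took too long: escalate to a hard kill.
        return ["Vacating", "Killing"]
    return ["Vacating"]
```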

Page 97: Condor Administration  Paradyn-Condor Week UW Campus March 2002


Final Notes on Startd Policy

› Please read the Administrator’s Manual to Condor for a complete explanation of the previous diagram

    › See the chapter on “Configuring the Startd Policy”

› This is all pretty confusing and complex:

    › If you have questions, please send them to [email protected]

    › We can try to translate an English explanation of the policy you want into expressions for Condor

Page 98: Condor Administration  Paradyn-Condor Week UW Campus March 2002


When something goes wrong...

› Looking at “condor_q -analyze”

› Looking at the “UserLog”

› Looking at the “ShadowLog”

› Looking at the other daemons’ log files

› Condor is a large, distributed system, so analyzing problems can be very difficult:

    › We’ll give you the basics of where to begin

    › If you can’t figure it out, send us email and we’ll be able to help you

Page 99: Condor Administration  Paradyn-Condor Week UW Campus March 2002


Looking at condor_q -analyze

› You specify a job or set of jobs you want to analyze

› condor_q will try to figure out why the job isn’t running

› The output is not as user-friendly as we’d like (though we’re working on it)

› Good at finding errors in Requirements expressions set by users

Page 100: Condor Administration  Paradyn-Condor Week UW Campus March 2002


Looking at the UserLog

› When the user submits a job, she/he can specify a “UserLog” in their submit file

› This will contain a record of if and where the job ran, if it checkpointed, if it was kicked off without a checkpoint, etc.

› Very useful in figuring out where a job was running when it was having problems, and to monitor the progress of the job

› Required by DAGMan and others

Page 101: Condor Administration  Paradyn-Condor Week UW Campus March 2002


Looking at the ShadowLog

› Of the log files generated by the Condor daemons, the ShadowLog usually has the most useful information when debugging a problem with a job

› You often want to increase the “Debug Level” of the Shadow and increase the maximum size of the file to get more info:

SHADOW_DEBUG = D_SYSCALLS D_FULLDEBUG

MAX_SHADOW_LOG = 1000000

Page 102: Condor Administration  Paradyn-Condor Week UW Campus March 2002


Analyzing the ShadowLog

› Incorrect file permissions or files that were removed are the most common errors

› Often useful to grep for a certain job ID

    grep "25\.3" ShadowLog | less

› At the end of the log, you might find an entry that looks something like “ERROR:”

    › While not always the most clear, these entries usually give a very good indication of the problem
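The same filtering can be done in a script when grep is not enough. The sketch below is an illustration only: it treats the log as a list of plain text lines and assumes nothing about the real ShadowLog format, and the function names are made up. Like the escaped dot in the grep pattern, `re.escape` keeps “.” literal; the lookarounds are an extra, so that a search for 25.3 does not also pick up 125.3 or 25.31.

```python
import re


def job_pattern(job_id):
    """Build a regex matching job_id as a whole cluster.proc ID."""
    # Not preceded by a digit or dot, not followed by a digit.
    return re.compile(r"(?<![\d.])" + re.escape(job_id) + r"(?!\d)")


def lines_for_job(lines, job_id):
    """Return the lines that mention the given job ID."""
    pattern = job_pattern(job_id)
    return [line for line in lines if pattern.search(line)]


def error_lines(lines):
    """Return the lines that contain the text 'ERROR'."""
    return [line for line in lines if "ERROR" in line]
```

To try it on a real file, read the log with `open(path).read().splitlines()` and pass the result to `lines_for_job`, then to `error_lines` to narrow the output further.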

Page 103: Condor Administration  Paradyn-Condor Week UW Campus March 2002


Looking at the Other Daemons’ Log Files

› If there is no ShadowLog, or no ShadowLog entries for the job with problems, you might have a problem even finding a match for the job

› Look in the SchedLog to see if there are errors communicating with the Negotiator

› Check for host permission problems

› Look at the NegotiatorLog on the CM: is it even negotiating jobs at all?

Page 104: Condor Administration  Paradyn-Condor Week UW Campus March 2002


Analyzing the Logs

› All daemons can display more debugging output: D_FULLDEBUG and D_COMMAND

› With D_SECONDS, the timestamps in the logs also include seconds, which can help pinpoint a problem

› Logs will often rotate quickly with heavy debugging output, so increase MAX_*_LOG as much as your disk space allows

› Unfortunately, Condor’s logs are still primarily useful only to the developers

› We’re working on changing that

Page 105: Condor Administration  Paradyn-Condor Week UW Campus March 2002


Thank you!

Check us out on the Web: http://www.cs.wisc.edu/condor

Email: [email protected]