Condor Project Computer Sciences Department University of Wisconsin-Madison [email protected] http://www.cs.wisc.edu/condor Eager, Lazy, and Just-in-Time Planning Edinburgh Workshop Oct 2003
Mar 28, 2015
Planning vs. Scheduling
› Can you control the resources? Yes? Scheduling. No? Planning.
› Planning is a ‘client’ operation.
The question of When
› Lots of planning open questions.
› An important consideration: When the planning occurs.
[Figure: timeline along a Time axis with three planning points: Eager, Lazy, Just-in-Time]
Eager Example
› First Pass of EDG Resource Broker
[Figure: component stack labeled RB, DAGMan, Condor-G, Globus, Fabric, Site Scheduler]
Eager Condor-G Submit File
universe = globus
globussite = beak.cs.wisc.edu/jobmanager-lsf
executable = find_particle
arguments = …
output = …
log = …
EDG Resource Broker Gets Lazy…
› Addition of DAGMan callouts
› DAGMan is given a command (script) to run immediately before it submits a job to Condor-G (this differs from a PRE script on a node)
› The helper command is passed a copy of the job submit file when DAGMan is about to submit that node in the graph
› This allows last-minute changes to the submit file (e.g. changing the globussite attribute)
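The callout idea can be sketched as a small helper that rewrites the globussite line of a submit file just before submission. This is only an illustration: the function names, the file path argument, and the way the new site is chosen are assumptions, not the actual EDG or DAGMan interface.

```python
import re

# Matches a "globussite = ..." line anywhere in the submit file text.
_GLOBUSSITE = re.compile(r"^([ \t]*globussite[ \t]*=[ \t]*).*$",
                         re.IGNORECASE | re.MULTILINE)


def rewrite_globussite(submit_text: str, new_site: str) -> str:
    """Return submit_text with its globussite attribute set to new_site."""
    if _GLOBUSSITE.search(submit_text):
        # Keep the "globussite = " prefix, swap only the value.
        return _GLOBUSSITE.sub(lambda m: m.group(1) + new_site, submit_text)
    # No globussite line yet: append one.
    return submit_text.rstrip("\n") + f"\nglobussite = {new_site}\n"


def run_callout(submit_path: str, new_site: str) -> None:
    """Hypothetical entry point: edit the submit file copy in place.

    Assumes the helper receives the path to its copy of the submit file;
    how new_site is chosen (the lazy planning decision) is left out.
    """
    with open(submit_path, "r", encoding="utf-8") as fh:
        text = fh.read()
    with open(submit_path, "w", encoding="utf-8") as fh:
        fh.write(rewrite_globussite(text, new_site))
```

A broker-side planner would call something like run_callout with whichever site it judges best at that moment, which is what makes the plan lazy rather than eager.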
Eager Example
› First Pass of EDG Resource Broker
[Figure: the same component stack (RB, DAGMan, Condor-G, Globus, Fabric, Site Scheduler), now with a callout added]
Moving Condor-G to Just-In-Time
› Delay the binding of the task (job) to the resource until the resource is ready.
› Need to know when the resource is ready.
› One way: the unimplemented Globus 1.1 “queue wait time” estimate
    Not really just-in-time, because of lies, lies, lies…
› Another way… Condor-G Glidein Mechanism.
How It Works
[Figure: animated build-up of the Condor-G glidein mechanism across seven slides. Labels, in order of appearance: Condor-G, Globus Resource, Schedd, LSF, Collector, 600 Condor jobs, GlideIn jobs, GridManager, JobManager, Startd, User Job]
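The difference between binding early and binding when a resource is ready can be sketched with a toy model. Everything here is a simplification under stated assumptions: round-robin stands in for an eager plan, and ready_events is a hypothetical list naming the site whose glidein has just reported in. It is not how Condor-G is implemented.

```python
from collections import deque


def eager_plan(jobs, sites):
    """Bind every job to a site up front (round-robin), before any
    site is known to be ready."""
    return {job: sites[i % len(sites)] for i, job in enumerate(jobs)}


def just_in_time_plan(jobs, ready_events):
    """Bind the next queued job only when a slot reports ready.

    ready_events lists site names in the order their slots become
    available (a stand-in for glideins announcing themselves).
    """
    queue = deque(jobs)
    binding = {}
    for site in ready_events:
        if not queue:
            break
        binding[queue.popleft()] = site
    return binding


jobs = ["job0", "job1", "job2", "job3"]
eager = eager_plan(jobs, ["A", "B"])
# Site A is slow to free up, so site B's slots take most of the work.
lazy = just_in_time_plan(jobs, ["B", "B", "B", "A"])
```

In the eager plan half the jobs wait on site A regardless of its load; in the just-in-time plan jobs flow to whichever site actually produced a ready slot.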
A Just-in-time Submit
executable = find_particle
requirements = TARGET.Arch == "Intel/Linux" || TARGET.Arch == "Sparc/Solaris"
# job describes the "power"
rank = MFlops * 10000 + Memory
Another Just-in-time Submit
executable = find_particle
requirements = TARGET.Arch == "Intel/Linux" || TARGET.Arch == "Sparc/Solaris"
rank = sam_data_overlap(MY.dataset,TARGET.sam_site_name) + (TARGET.Mflops / 100000)
+dataset = search_space_id_0133313
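A data-aware rank of this shape might behave as sketched below. The slides do not define sam_data_overlap, so its semantics here (the number of the dataset's files already present at a site) are an assumption, and both lookup tables are hypothetical.

```python
# Hypothetical catalogs: which files make up a dataset, and which
# files each site already holds. Invented for illustration only.
DATASET_FILES = {"search_space_id_0133313": {"f1", "f2", "f3", "f4"}}
SITE_CACHE = {"site_a": {"f1"}, "site_b": {"f1", "f2", "f3"}}


def sam_data_overlap(dataset, site):
    """Assumed meaning: count of the dataset's files cached at site."""
    wanted = DATASET_FILES.get(dataset, set())
    return len(wanted & SITE_CACHE.get(site, set()))


def rank(dataset, machine):
    """Data overlap dominates; compute power is a small tie-breaker."""
    overlap = sam_data_overlap(dataset, machine["sam_site_name"])
    return overlap + machine["Mflops"] / 100000


fast_but_far = {"sam_site_name": "site_a", "Mflops": 90000}
slow_but_near = {"sam_site_name": "site_b", "Mflops": 10000}
dataset = "search_space_id_0133313"
```

Dividing Mflops by 100000 keeps the compute term below one for machines in this range, so a single extra cached file outweighs any difference in speed.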
Lots of Tradeoffs…
› Just-in-Time
    Pro: Dynamic. Resources can come and go. Can take advantage of changing circumstances.
    Con: Coordination of multiple resources
› Eager
    Pro: Easier to coordinate multiple resources
    Con: Hard to scale… how to know about all the resources in advance?
    Con: Plan falls apart if assumptions change.
Some observations
› A complete separation of task from resource is difficult.
    Lots and lots of structured data required.
    But this separation is required in order to achieve Just-In-Time planning.
› Grid protocols that do not separate task from resource cannot realistically live on the grid.
    Virtualization can help.
Plan for failure
› Much effort on how to create a plan.
› How about a plan for when things fail?
Job Failure Policy Expressions
› Condor/Condor-G augmented so users can supply job failure policy expressions in the submit file.
› Can be used to describe a successful run, or what to do in the face of failure.
    on_exit_remove = <expression>
    on_exit_hold = <expression>
    periodic_remove = <expression>
    periodic_hold = <expression>
Job Failure Policy Examples
› Do not remove from queue (i.e. reschedule) if exits with a signal:
    on_exit_remove = ExitBySignal == False
› Place on hold if exits with nonzero status or ran for less than an hour:
    on_exit_hold = ((ExitBySignal == False) && (ExitCode != 0)) || ((ServerStartTime - JobStartDate) < 3600)
› Place on hold if job has spent more than 50% of its time suspended:
    periodic_hold = CumulativeSuspensionTime > (RemoteWallClockTime / 2.0)
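The three example policies can be exercised against a job record to see which action each outcome triggers. This sketch makes several assumptions: the exit status is read from an ExitCode field, the run time is supplied directly as a hypothetical RunSeconds field rather than computed from timestamps, and the hold check is applied before the remove check.

```python
def on_exit_remove(ad):
    """Leave the queue only if the job was not killed by a signal."""
    return not ad["ExitBySignal"]


def on_exit_hold(ad, min_runtime=3600):
    """Hold on a nonzero exit status or a suspiciously short run."""
    failed = (not ad["ExitBySignal"]) and ad.get("ExitCode", 0) != 0
    too_short = ad["RunSeconds"] < min_runtime  # RunSeconds is hypothetical
    return failed or too_short


def periodic_hold(ad):
    """Hold a job that has been suspended for over half its wall time."""
    return ad["CumulativeSuspensionTime"] > ad["RemoteWallClockTime"] / 2.0


def on_exit_action(ad):
    """Assumed ordering: the hold policy is consulted first."""
    if on_exit_hold(ad):
        return "hold"
    if on_exit_remove(ad):
        return "remove"
    return "requeue"


clean = {"ExitBySignal": False, "ExitCode": 0, "RunSeconds": 7200}
killed = {"ExitBySignal": True, "RunSeconds": 7200}
bad_status = {"ExitBySignal": False, "ExitCode": 2, "RunSeconds": 7200}
too_quick = {"ExitBySignal": False, "ExitCode": 0, "RunSeconds": 10}
```

Under these assumptions a clean two-hour run is removed, a job killed by a signal goes back for another attempt, and a nonzero status or a ten-second run is held for a human to inspect.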