Condor and GridShell
How to Execute 1 Million Jobs on the TeraGrid
Jeffrey P. Gardner - PSC
Edward Walker - TACC
Miron Livny - U. Wisconsin
Todd Tannenbaum - U. Wisconsin
And many others!
Scientific Motivation
Astronomy is increasingly done using large surveys containing hundreds of millions of objects.
Analyzing large astronomical datasets frequently means performing the same analysis task on >100,000 objects.
Each object may take several hours of computing.
The amount of computing time required may vary, sometimes dramatically, from object to object.
Solution: PBS?
In theory, PBS should provide the answer: submit 100,000 single-processor PBS jobs.
In practice, this does not work:
TeraGrid nodes are multiprocessor, but only 1 PBS job runs per node.
TeraGrid machines frequently restrict the number of jobs a single user may run.
Chad might get really mad if I submitted 100,000 PBS jobs!
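In script form, the naive approach would look roughly like this. This is a sketch only: the analysis binary, the resource limits, and the OBJECT_ID variable are hypothetical placeholders, and one such job would have to be submitted per object.

```shell
#!/bin/sh
#PBS -l nodes=1,walltime=12:00:00
#PBS -N analyze-object

# Hypothetical per-object analysis job: a qsub of this script would
# be needed for each of the ~100,000 objects in the survey.
cd "$PBS_O_WORKDIR"
./analyze_object "$OBJECT_ID"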
Solution: mprun?
We could submit a single job that uses many processors.
Now we have a reasonable number of PBS jobs (Chad will now be happy).
Scheduling priority would reflect our actual resource usage.
This still has problems: each work unit takes a different amount of time to run, so we are using resources inefficiently.
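The inefficiency is easy to demonstrate with a toy simulation (all numbers here are hypothetical, chosen only to mimic highly variable per-object runtimes): statically carving the objects among the processors of one big job, as mprun would, finishes much later than letting each free processor grab the next object.

```python
import random

random.seed(0)

# Hypothetical per-object runtimes in hours: highly variable,
# roughly exponential with a 4-hour mean.
runtimes = [random.expovariate(1 / 4.0) for _ in range(1000)]
workers = 64

# Static partitioning (mprun-style): each processor is handed a fixed
# subset of objects up front; the job ends when the slowest finishes.
blocks = [runtimes[i::workers] for i in range(workers)]
static_makespan = max(sum(b) for b in blocks)

# Dynamic scheduling (Condor-style): each processor pulls the next
# work unit as soon as it is free (greedy list scheduling).
loads = [0.0] * workers
for t in runtimes:
    i = min(range(workers), key=loads.__getitem__)
    loads[i] += t
dynamic_makespan = max(loads)

print(f"static makespan:  {static_makespan:.1f} h")
print(f"dynamic makespan: {dynamic_makespan:.1f} h")
```

With the seeded runtimes above, the dynamic makespan comes out well below the static one, which is the whole argument for a private scheduler.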
The Real Solution: Condor+GridShell
The real solution is to submit one large PBS job, then use a private scheduler to manage serial work units within each PBS job.
We can even submit large PBS jobs to multiple TeraGrid machines, then farm out serial work units as resources become available.
Vocabulary:
JOB: (n) a thing that is submitted via Globus or PBS.
WORK UNIT: (n) an independent unit of work (usually serial), such as the analysis of a single astronomical object.
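The mechanics of the private scheduler can be sketched in ordinary Python. This is a toy model, not GridShell or Condor itself: a big batch job holds a pool of processors (threads here), and each one repeatedly pulls the next serial work unit from a shared queue until none remain.

```python
import queue
import threading

# Shared queue of work units; each integer stands in for one
# astronomical object to analyze.
work_units = queue.Queue()
for obj in range(100):
    work_units.put(obj)

completed = []
lock = threading.Lock()

def worker():
    """One processor inside the big job: pull units until the queue is empty."""
    while True:
        try:
            obj = work_units.get_nowait()
        except queue.Empty:
            return
        # ... a real worker would run the analysis task on `obj` here ...
        with lock:
            completed.append(obj)

threads = [threading.Thread(target=worker) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(f"{len(completed)} work units completed")
```

Because workers pull rather than being assigned blocks up front, a fast processor simply processes more objects, which is how variable runtimes stop being a source of idle time.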
(Diagram labels: Condor, GridShell)
Condor Overview
Condor was first designed as a CPU cycle harvester for workstations sitting on people’s desks.
Condor is designed to schedule large numbers of jobs across a distributed, heterogeneous and dynamic set of computational resources.
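In Condor, that scheduling is driven by a submit description file; a single queue statement fans one description out into many independent work units. A minimal sketch follows, with the executable name and file names as placeholders:

```
# Hypothetical Condor submit description file: one submission
# queues 100,000 independent serial work units.
universe   = vanilla
executable = analyze_object
arguments  = $(Process)
output     = out.$(Process)
error      = err.$(Process)
log        = analysis.log
queue 100000
```

Condor substitutes `$(Process)` with the work unit's index (0 through 99999), so each instance analyzes a different object without any per-object script.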