Transcript
Tutorial on Distributed High Performance Computing
Job Schedulers
Assign work (jobs) to compute resources to meet specified job requirements within the constraints of available resources and their characteristics.
An optimization problem: the objective is usually to maximize throughput of jobs.
Scheduler with automatic data placement components
(Input/output staging)
Fig 3.4
e.g. Stork
Advance reservation
Term used for requesting actions at times in the future. In this context, requesting a job to start at some time in the future.
Both computing resources and network resources are involved. The network connection, usually being the Internet, is not reserved.
Found in recent schedulers.
Some reasons one might want advance reservation in Grid computing
• Reserved time chosen to reduce network or resource contention.
• Resources not physically available except at certain times.
• Jobs require access to a collection of resources simultaneously, e.g. data generated by experimental equipment.
• A deadline for results of work.
• Parallel programming jobs in which jobs must communicate between themselves during execution.
• Workflow tasks in which jobs depend upon the results of other jobs.
Without advance reservation, schedulers will schedule jobs from a queue with no guarantee when they actually would be scheduled to run.
Scheduler Examples
Sun Grid Engine
Condor/Condor-G
Grid Engine job submission GUI interface
Fig. 3.8
Submitting a job through GRAM and through an SGE scheduler
Fig. 3.10
Running Globus job with SGE scheduler using globusrun-ws command
Scheduler selected by name using the -Ft option (i.e. factory type).
Name for Sun Grid Engine (obviously) is SGE. Hence:
globusrun-ws -submit -Ft SGE -f prog1.xml
submits job described in job description file called prog1.xml.
Output
Delegating user credentials...Done.
Submitting job...Done.
Job ID: uuid:d23a7be0-f87c-11d9-a53b-0011115aae1f
Termination time: 07/20/2008 17:44 GMT
Current job state: Active
Current job state: CleanUp
Current job state: Done
Destroying job...Done.
Cleaning up any delegated credentials...Done.
Note: the user credentials have to be delegated
Actual machine running job
Scheduler will choose machine that job is run on, which can vary for each job. Hence
globusrun-ws -submit -s -Ft SGE -c /bin/hostname
submits executable hostname to SGE scheduler in streaming mode redirecting output to console, with usual Globus output.
Output: Hostname displayed as output will be that of machine running job and may vary.
Specifying Submit Host
Submit host and location for factory service can be specified by using -F option, e.g.:
globusrun-ws -submit -s
    -F http://coit-grid03.uncc.edu:8440
    -Ft SGE -c /bin/hostname
Condor
Developed at the University of Wisconsin-Madison in the mid-1980s to convert a collection of distributed workstations and clusters into a high-throughput computing facility.
Key concept - using wasted computer power of idle workstations.
Hugely successful.
Many institutions now operate Condor clusters.
Condor
Essentially a job scheduler: jobs are scheduled in the background on distributed computers, but without the user needing an account on the individual computers.
Users compile their programs for the computers Condor is going to use, and include Condor libraries, which among other things handle input and capture output.
Job described in a job description file. Condor then ships job off to appropriate computers.
Example job submission
# This is a comment: condor submit file for prog1 job
Universe = vanilla
Executable = prog1
Output = prog1.out
Error = prog1.error
Log = prog1.log
Queue
Condor has its own job description language to describe job in a “submit description file”
Not in XML as predates XML
Simple Submit Description File Example
One of 9 environments (“universes”). Vanilla only requires an executable; checkpointing and remote system calls are not allowed.
Submitting a job: condor_submit command
condor_submit prog1.sdl
where prog1.sdl is submit description file.
Without any other specification, Condor will attempt to find suitable executable machine from all available.
Condor works with and without a shared file system.
Most local clusters set up with shared file system and Condor will not need to explicitly transfer files.
Submitting Multiple Jobs
Done by adding number after Queue command, i.e.:
Submit Description File Example
# condor submit file for program prog1
Universe = vanilla
Executable = prog1
Queue 500
will submit 500 identical prog1 jobs at once. Can use multiple Queue commands with Arguments for each instance.
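Following that note, a sketch of a submit description file using multiple Queue commands with a different Arguments line for each instance (the argument values and output file names here are illustrative):

```text
# condor submit file: three prog1 jobs with different arguments
Universe   = vanilla
Executable = prog1

Arguments  = 10
Output     = run10.out
Queue

Arguments  = 20
Output     = run20.out
Queue

Arguments  = 30
Output     = run30.out
Queue
```

Each Queue command submits one job instance using the most recently set Arguments and Output values.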
Grid universe
Condor can be used as the environment for Grid computing:
Stand-alone without Grid middleware such as Globus
or alternatively
Integrated with the Globus toolkit.
Condor’s matchmaking mechanism
Chooses the best computer to run the job.
Condor ClassAd
Based upon notion that jobs and resources advertise themselves in “classified advertisements”, which include their characteristics and requirements.
Job ClassAd matched against resource ClassAd.
Condor’s ClassAd Matchmaking Mechanism
Fig 3.14
Machine ClassAd
Set up during system configuration.
Some attributes provided by Condor but their values can be dynamic and alter during system operation.
Machine attributes can describe such things as:
• Machine name
• Architecture
• Operating system
• Main memory available for job
• Disk memory available for job
• Processor performance
• Current load, etc.
Job ClassAd
Job is typically characterized by its resource requirements and preferences. May include:
• What job requires
• What job desires
• What job prefers, and
• What job will accept
using Boolean expressions. These details are put in the submit description file.
Matchmaking commands: Requirements and Rank
Available for both job ClassAd and machine ClassAd:
Requirements -- specify machine requirements.
Rank -- used to differentiate between multiple machines that can satisfy the requirements; can identify a preference based upon a user criterion.
Rank = <number>
Evaluates to a floating point number.
Resource with highest rank chosen.
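As an illustration, Requirements and Rank expressions might appear in a job's submit description file as follows (OpSys and Memory are standard Condor machine attributes; the 512 MB threshold is a made-up value):

```text
Requirements = (OpSys == "LINUX") && (Memory >= 512)
Rank         = Memory
```

Here any Linux machine with at least 512 MB of main memory satisfies the requirements, and among those, the machine with the most memory ranks highest and is chosen.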
Condor’s Directed Acyclic Graph Manager (DAGMan)
A meta-scheduler that allows one to specify dependencies between Condor jobs.
Example
“Do not run Job B until Job A completed successfully”
Especially important for jobs working together (as in Grid computing).
Example
# diamond.dag
Job A a.sub
Job B b.sub
Job C c.sub
Job D d.sub
Parent A Child B C
Parent B C Child D
[Fig: diamond-shaped DAG — Job A at the top, Jobs B and C in the middle, Job D at the bottom.]
Condor’s Directed Acyclic Graph (DAG) File
Running DAG
condor_submit_dag
Start a DAG with dag file diamond.dag.
condor_submit_dag diamond.dag
Submits a Scheduler Universe Job with DAGMan as executable.
Meta-schedulers
Schedule jobs across distributed sites. Highly desirable in a Grid computing environment.
For a Globus installation, interfaces to the local Globus GRAM installation, which in turn interfaces with the local job scheduler.
Uses whatever local scheduler is present at each site.
Meta-scheduler interfacing to Globus GRAM
Condor-G
A version of Condor that interfaces to the Globus environment. Jobs submitted to Condor through the Grid universe are directed to the Globus job manager (GRAM).
Fig 3.18
Communication between user, myProxy server, and Condor-G for long-running jobs
Fig 3.19
Gridway
A meta-scheduler designed specifically for a Grid computing environment.
Interfaces to Globus components.
Project began in 2002.
Now open source.
Became part of Globus distribution from version 4.0.5 onwards (June 2007).
DRMAA: standard set of APIs for submission and control of jobs to DRMs (Distributed Resource Managers).
Bindings in C/C++, Java, Perl, Python, and Ruby for a range of DRMs including (Sun) Grid Engine, Condor, PBS/Torque, LSF, and Gridway.
Scheduler with DRMAA interface
Fig 3.21
Example of the use of DRMAA
Fig 3.22
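Figure 3.22 is not reproduced here, but a minimal sketch of job submission through the DRMAA 1.0 Java binding (org.ggf.drmaa) looks like the following; it requires a DRMAA implementation such as the one shipped with Grid Engine, and the /bin/hostname executable is just an illustrative choice:

```java
import org.ggf.drmaa.DrmaaException;
import org.ggf.drmaa.JobTemplate;
import org.ggf.drmaa.Session;
import org.ggf.drmaa.SessionFactory;

public class DrmaaExample {
    public static void main(String[] args) throws DrmaaException {
        Session session = SessionFactory.getFactory().getSession();
        session.init(null);                    // connect to the default local DRM
        JobTemplate jt = session.createJobTemplate();
        jt.setRemoteCommand("/bin/hostname");  // executable to run remotely
        String jobId = session.runJob(jt);     // submit the job
        session.wait(jobId, Session.TIMEOUT_WAIT_FOREVER);  // block until done
        session.deleteJobTemplate(jt);
        session.exit();
    }
}
```

The same sequence — init, create template, run, wait, exit — carries over to the other language bindings.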
Grid-enabling an application
A poorly defined and understood term.
It does NOT mean simply executing a job on a Grid platform!
Almost all computer batch programs can be shipped to a remote Grid site and executed with little more than a remote ssh connection.
This is a model we have had since computers were first connected (via telnet).
Grid-enabling should include utilizing the unique distributed nature of the Grid platform.
Grid-enabling an application
With that in mind, a simple definition is:
Being able to execute an application on a Grid platform, using the distributed resources
available on that platform.
However, even that simple definition is not agreed upon by everyone!
A broad definition that matches our view of Grid enabling applications is:
“Grid Enabling refers to the adaptation or development of a program to provide the capability of interfacing with a grid middleware in order to schedule and utilize resources from a dynamic and distributed pool of “grid resources” in a manner that effectively meets the program’s needs”2
2 Nolan, K., “Approaching the Challenge of Grid-Enabling Applications.,” Open Source Grid & Cluster Conf., Oakland, CA, 2008.
How does one do “Grid-enabling”?
Still an open question and in the research domain without a standard approach.
Here we will describe various approaches.
We can divide the use of the computing resources in a Grid into two types:
Using multiple computers separately to solve multiple problems
Using multiple computers collectively to solve a single problem
Using Multiple Computers Separately
Parameter Sweep Applications
In some domain areas, scientists need to run the same program many times but with different input data.
“Sweep” across the parameter space with different values of the input parameters in search of a solution.
In many cases, it is not easy to compute the answer and human intervention is required to search the design space.
Implementing Parameter Sweep
Can be achieved simply by submitting multiple job description files, one for each set of parameters, but that is not very efficient.
Parameter sweep applications are so important that research projects have been devoted to making them efficient on a Grid.
Parameter sweeps appear explicitly in job description languages. (More details in UNC-C course notes.)
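For example, Condor's submit description language can express a sweep directly through the $(Process) macro, which takes the values 0 to N-1 across the N queued jobs (prog1 and the output file names are illustrative):

```text
# one prog1 job per parameter value 0..99
Universe   = vanilla
Executable = prog1
Arguments  = $(Process)
Output     = prog1_$(Process).out
Queue 100
```

Each of the 100 jobs receives a different parameter value as its command-line argument and writes to its own output file.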
Exposing an Application as a Service
“Wrap” application code to produce a Web service
“Wrapping” means application not accessed directly but through service interface
Grid computing has embraced Web service technology so natural to consider its use for accessing applications.
Web service invoking a program
If the Web service is written in Java, the service could issue a command in a separate process using the exec method of the current Runtime object with the construction:

Runtime runtime = Runtime.getRuntime();
Process process = runtime.exec("<command>");
where <command> is command to issue, capturing output with
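A minimal, self-contained sketch of issuing a command from Java and capturing its standard output follows; the echo command is used only so the example runs anywhere, in place of a real application:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;

public class ExecExample {
    // Runs a command and returns its standard output as a single string.
    static String run(String... command) throws Exception {
        Process process = Runtime.getRuntime().exec(command);
        StringBuilder output = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(process.getInputStream()))) {
            String line;
            while ((line = reader.readLine()) != null) {
                output.append(line);
            }
        }
        process.waitFor();   // wait for the child process to finish
        return output.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(run("echo", "hello"));
    }
}
```

A real wrapper service would pass the application's command line to run() and return the captured output in the service response.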