www.eu-eela.eu E-science grid facility for Europe and Latin America Watchdog: A job monitoring solution inside the EELA-2 Infrastructure Riccardo Bruno, Roberto Barbera, Elisa Ingrà INFN Sez. Catania (Italy) 2nd EELA-2 Conference Choroni (Venezuela), 25-27.11.2009
www.eu-eela.eu
E-science grid facility for Europe and Latin America
Watchdog: A job monitoring solution inside the EELA-2 Infrastructure
Riccardo Bruno, Roberto Barbera, Elisa Ingrà
INFN Sez. Catania (Italy)
2nd EELA-2 Conference
Choroni (Venezuela), 25-27.11.2009
Job Monitoring in gLite
Before gLite v3.1, no job monitoring system was available:
• Jobs running on the WNs are treated as black boxes
• No prompt retrieval of the job status (Done/Abort/…)
• Output Sandbox available only after the WMS recognizes job completion
• This situation is unsuitable for jobs that require very long computation times.
– Get in touch with the jobs running on the WN (especially long-running jobs) in order to monitor and control their execution.
• How
– Perform job control and monitoring using Grid services in the least invasive way for the application.
• Observations
– Almost all Grid jobs are piloted by a main shell script, which can:
    • Gather valuable information in case of faults
    • Pilot complex batch workflows
– Both AMGA and SE+LFC can be used as a basic Grid Info System:
    • lfc-* and lcg-* tools are already available for Grid file management
    • The mdcli AMGA command can be used by jobs on the WNs
    • The cp command can be used in case of a shared file system on the WN
    • The latency of the CLI tools is very low compared with the duration of long-running jobs
• Monitor job execution in a timely manner by watching the files produced by the job while it executes on the WN
– File snapshots will be reported to LFC+SE, AMGA servers, or mounted shared file systems
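The snapshot idea above can be sketched for the simplest reporting target, a mounted shared file system, where plain cp is enough. This is only an illustration under assumptions: the function names wd_snapshot and wd_watch and their arguments are hypothetical and are not taken from the actual Watchdog scripts.

```shell
#!/bin/sh
# Sketch only: wd_snapshot and wd_watch are hypothetical helper names,
# not functions of the real Watchdog.

# Copy every watched file that currently exists into the snapshot directory.
# Each copy gets a timestamp suffix so that successive snapshots are kept.
# Usage: wd_snapshot TARGET_DIR FILE...
wd_snapshot() {
    target_dir=$1; shift
    stamp=$(date +%Y%m%d%H%M%S)
    mkdir -p "$target_dir" || return 1
    for f in "$@"; do
        [ -f "$f" ] || continue      # the job may not have created it yet
        cp "$f" "$target_dir/$(basename "$f").$stamp"
    done
}

# Take a snapshot every INTERVAL seconds while process PID is alive,
# plus one last snapshot once the process is gone.
# Usage: wd_watch PID INTERVAL TARGET_DIR FILE...
wd_watch() {
    pid=$1; interval=$2; target_dir=$3; shift 3
    while kill -0 "$pid" 2>/dev/null; do
        wd_snapshot "$target_dir" "$@"
        sleep "$interval"
    done
    wd_snapshot "$target_dir" "$@"
}
```

A pilot script could start it with something like `wd_watch $$ 60 /shared/snapshots std.out &` before launching the application; the path and interval here are placeholders. For the LFC+SE or AMGA targets, the cp call would be replaced by the corresponding Grid CLI tool.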
• It would be useful to configure the monitoring tool according to the user's needs
– The monitoring tool will consist only of bash script files
– A few shell environment variables can be used to configure the monitoring behavior
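To illustrate configuration through a few shell environment variables, a configuration script might look like the sketch below. All variable names (WD_MODE, WD_INTERVAL, WD_FILES, WD_REPORT) and the wd_check_conf helper are assumptions made for this example; the slides do not list the real variable names.

```shell
#!/bin/sh
# Sketch of a watchdog.conf-style file. Every name below is hypothetical.
# ${VAR:-default} keeps a value already exported by the user and otherwise
# falls back to the default.
WD_MODE=${WD_MODE:-SHAREDFS}             # reporting target: LFC, AMGA or SHAREDFS
WD_INTERVAL=${WD_INTERVAL:-60}           # seconds between two snapshots
WD_FILES=${WD_FILES:-"std.out std.err"}  # space-separated list of watched files
WD_REPORT=${WD_REPORT:-FULL}             # FULL file or PARTIAL (last changes only)

# Reject values that the monitoring scripts would not understand.
wd_check_conf() {
    case $WD_MODE in
        LFC|AMGA|SHAREDFS) ;;
        *) echo "bad WD_MODE: $WD_MODE" >&2; return 1 ;;
    esac
    case $WD_REPORT in
        FULL|PARTIAL) ;;
        *) echo "bad WD_REPORT: $WD_REPORT" >&2; return 1 ;;
    esac
    case $WD_INTERVAL in
        ''|*[!0-9]*) echo "bad WD_INTERVAL: $WD_INTERVAL" >&2; return 1 ;;
    esac
    return 0
}
```

Sourcing such a file from the pilot script lets a user override a single setting, for example `WD_INTERVAL=300`, without editing the monitoring scripts themselves.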
• Control the job execution by accessing it directly on the WN
– It is possible to send user commands to the WN
– It is possible to change the monitoring while the Grid job runs
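One way to send a user command to the WN is to drop it into a file that the monitoring script polls. The sketch below assumes a shared location reachable from both sides; the function name wd_run_pending and the file layout are hypothetical and not the actual Watchdog mechanism.

```shell
#!/bin/sh
# Sketch only: wd_run_pending and the command/result files are hypothetical.

# If CMD_FILE holds a pending command, run it once and store its output
# (stdout and stderr) followed by its exit code in OUT_FILE.
# Usage: wd_run_pending CMD_FILE OUT_FILE
wd_run_pending() {
    cmd_file=$1; out_file=$2
    [ -s "$cmd_file" ] || return 0   # nothing pending
    cmd=$(cat "$cmd_file")
    rm -f "$cmd_file"                # consume it so that it runs only once
    rc=0
    # stdin comes from /dev/null, so a command that expects interactive
    # input cannot block the monitoring loop
    sh -c "$cmd" </dev/null >"$out_file" 2>&1 || rc=$?
    echo "exit=$rc" >>"$out_file"
}
```

Called once per monitoring cycle, this gives the user a simple request/response channel to the running job: write a command to the command file, then read the result file on the next cycle.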
The Watchdog
• The Watchdog consists of a set of shell scripts to be included in the JDL InputSandbox and then called by the pilot script.
• Watchdog features:
– It starts in the background before the Grid job runs
– The watchdog runs as long as the main job
– The monitoring process can be piloted as long as the pilot script has not finished
– Easily configurable and customizable
– The watchdog does not compromise the CPU power of the WN
– The watchdog can be used with MPI jobs
– Files may be reported fully or partially (only the last changes)
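Two of the features above, the background lifecycle and partial reporting, can be sketched as follows. The function names and the state-file approach are assumptions made for illustration; the slides do not describe how the real Watchdog implements them. The code relies on `tail -c +N` and `head -c N`, which are available in GNU coreutils.

```shell
#!/bin/sh
# Sketch only: both functions are hypothetical, not the Watchdog's own code.

# Print only the bytes appended to FILE since the previous call.
# The size already reported is remembered in STATE (a small state file).
# Usage: wd_report_partial FILE STATE
wd_report_partial() {
    file=$1; state=$2
    old=0
    [ -f "$state" ] && old=$(cat "$state")
    new=$(wc -c <"$file" | tr -d ' ')
    [ "$new" -lt "$old" ] && old=0   # file was truncated: report it from the start
    # bytes old+1 .. new; head caps the output at the size measured above
    tail -c +"$((old + 1))" "$file" | head -c "$((new - old))"
    echo "$new" >"$state"
}

# Start WATCHER in the background, run the job in the foreground, and stop
# the watcher when the job ends. Returns the job's exit code.
# Usage: wd_run_with_watchdog WATCHER_COMMAND JOB [ARG...]
wd_run_with_watchdog() {
    watcher=$1; shift
    sh -c "$watcher" &               # the watchdog starts first
    wd_pid=$!
    rc=0
    "$@" || rc=$?                    # then the Grid job runs
    kill "$wd_pid" 2>/dev/null || true
    wait "$wd_pid" 2>/dev/null || true
    return "$rc"
}
```

Because the watcher spends almost all of its time sleeping between snapshots, a loop of this kind costs the job very little CPU, which is consistent with the claim above that the WN's CPU power is not compromised.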
– The WD core main script: it is responsible for job monitoring, file snapshot reporting, and user command execution
• watchdog.ctrl
– This script controls the execution of the WD core script; it can start, stop, pause, and resume the WD.
– It can also be used to alter the time interval, add/remove files to watch, and change the reporting strategy (full/partial)
• watchdog.conf
– This script contains all environment variables needed to
1. Configure the Watchdog by setting the watchdog.conf file
2. Applications using the Watchdog MUST include the files: watchdog.sh, watchdog.ctrl, watchdog.conf, uuencode, uudecode (in case of AMGA reporting) or configure the PATH VO_PROD_VO_EU_EELA_EU_SW_DIR on the WN
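Step 2 can be illustrated with a small helper that writes a job description shipping the Watchdog files in the InputSandbox. The file list comes from the step above (uuencode and uudecode are listed there in connection with AMGA reporting); the pilot script name pilot.sh, the output file names, and the helper name wd_make_jdl are placeholders chosen for this sketch.

```shell
#!/bin/sh
# Sketch only: wd_make_jdl, pilot.sh, std.out and std.err are placeholders.

# Write a JDL file that sends the Watchdog scripts to the WN together with
# the pilot script.
# Usage: wd_make_jdl OUTPUT_JDL
wd_make_jdl() {
    cat >"$1" <<'EOF'
Executable    = "pilot.sh";
StdOutput     = "std.out";
StdError      = "std.err";
InputSandbox  = {"pilot.sh", "watchdog.sh", "watchdog.ctrl", "watchdog.conf",
                 "uuencode", "uudecode"};
OutputSandbox = {"std.out", "std.err"};
EOF
}
```

After `wd_make_jdl job.jdl`, the resulting file can be submitted with the usual gLite WMS submission command. If the Watchdog files are instead installed in the VO software area on the WN, the InputSandbox would only need the pilot script.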
• CLI to ease the WD user interaction
– 20091124164201 wd>
• Uses the watchdog.conf file to get the user configuration
• Principal commands:
– set: Set MODE (LFC/AMGA/mounted Shared FS)
– show jobs: Get the list of monitored jobs
– Attach to a monitored job
– show snapshots: Get the list of file snapshots
– View the snapshot content
– Get generic info: ENV, PID, CE, WN, Proxy, …
– exec: Execute a given command
    • Interactive commands are not allowed
    • It is possible to call the watchdog.ctrl command (use the -n option!)