Generic MPI Job Submission by the P-GRADE Grid Portal
Zoltán Farkas (zfarkas@sztaki.hu), MTA SZTAKI

Page 1

Generic MPI Job Submission by the P-GRADE Grid Portal

Zoltán Farkas (zfarkas@sztaki.hu), MTA SZTAKI

Page 2

Contents

• MPI
– Standards
– Implementations

• P-GRADE Grid Portal
– Workflow execution, file handling
– Direct job submission
– Brokered job submission

Page 3

MPI

• MPI stands for Message Passing Interface
– Standards 1.1 and 2.0

• MPI Standard features:
– Collective communication (1.1+)
– Point-to-Point communication (1.1+)
– Group management (1.1+)
– Dynamic Processes (2.0)
– Programming Language APIs
– …

Page 4

MPI Implementations

• MPICH
– Freely available implementation of MPI
– Runs on many architectures (even on Windows)
– Implements Standards 1.1 (MPICH) and 2.0 (MPICH2)
– Supports Globus (MPICH-G2)
– Nodes are allocated upon application execution

• LAM/MPI
– Open-source implementation of MPI
– Implements Standards 1.1 and parts of 2.0
– Many interesting features (checkpointing)
– Nodes are allocated before application execution

• Open MPI
– Implements Standard 2.0
– Uses technologies of other projects

Page 5

MPICH execution on x86 clusters

• The application can be started using ‘mpirun’, specifying:
➢ the number of requested nodes (-np <nodenumber>),
➢ a file containing the nodes to be allocated (-machinefile <arg>) [OPTIONAL; a sketch follows below],
➢ the executable,
➢ the executable’s arguments.

$ mpirun -np 7 ./cummu -N -M -p 32

• Processes are spawned using ‘rsh’ or ‘ssh’, depending on the configuration
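For illustration, a hedged sketch of the optional machine file and of running the same application on the nodes it lists (the host names are invented, not taken from the slides):

$ cat machines.txt
node01.example.org
node02.example.org
node03.example.org
$ mpirun -np 3 -machinefile machines.txt ./cummu -N -M -p 32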

Page 6

MPICH x86 execution – requirements

• The executable (and input files) must be present on the worker nodes:
– using a shared filesystem, or
– the user distributes the files before invoking ‘mpirun’ (a staging sketch follows below).

• Accessing the worker nodes from the host running ‘mpirun’:
– using ‘rsh’ or ‘ssh’,
– without user interaction (host-based authentication).
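A minimal staging sketch for the second option, assuming host-based ssh authentication and invented host names, paths and file names:

# run from the directory on the front-end that holds ./cummu, input.dat and machines.txt
cd /home/user/run
for n in node01 node02 node03; do
    ssh $n mkdir -p /home/user/run             # host-based authentication: no password prompt
    scp cummu input.dat $n:/home/user/run/     # distribute the executable and input files
done
mpirun -np 3 -machinefile machines.txt ./cummu -N -M -p 32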

Page 7

P-GRADE Grid Portal

• Technologies:
– Apache Tomcat

– GridSphere

– Java Web Start

– Condor

– Globus

– EGEE Middleware

– Scripts

Page 8

P-GRADE Grid Portal

• Workflow execution:
– DAGMan as the workflow scheduler
– pre and post scripts perform tasks around job execution
– Direct job execution using GT-2:
➢ GridFTP, GRAM
➢ pre: create a temporary storage directory, copy input files
➢ job: Condor-G executes a wrapper script
➢ post: download results
– Job execution using the EGEE broker (both LCG and gLite):
➢ pre: create the application context as the input sandbox
➢ job: a Scheduler universe Condor job executes a script which does job submission, status polling and output downloading; a wrapper script is submitted to the broker
➢ post: error checking
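Conceptually, each saved workflow is turned into a DAG and handed to DAGMan through Condor; roughly (the .dag file name is invented):

condor_submit_dag wf1.dag     # DAGMan then runs every job's pre script, the job itself
                              # (Condor-G or Scheduler universe) and its post script in dependency order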

Page 9

Workflow Manager Portlet

Page 10

Workflow example

• Jobs

• Input/output files

• Data transfers

Page 11

Portal: File handling

• „Local” files:
– User has access to these files through the Portal
– Local input files are uploaded from the user’s machine
– Local output files are downloaded to the user’s machine

• „Remote” files:
– Files reside on EGEE Storage Elements or are accessible using GridFTP
– EGEE SE files:
➢ lfn:/…
➢ guid:…
– GridFTP files: gsiftp://…
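For illustration, fully written-out remote file references of the three kinds (the VO, host and paths are invented):

lfn:/grid/seegrid/user/inputs/field.dat
guid:38ed3f60-c402-11d7-a6b0-00e0812d8a2f
gsiftp://se.example.org/data/seegrid/user/inputs/field.dat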

Page 12

Workflow Files

• File Types
– In/Out
– Local/Remote

• File Names
– Internal
– „Global”

• File Lifetime
– Permanent
– Volatile

Page 13

Portal: Direct job execution

• The resource to be used is known before job execution

• The user must have a valid, accepted certificate

• Local files are supported

• Remote GridFTP files are supported, even for grid-unaware applications

• Jobs may be sequential or MPI applications

Page 14

Direct exec: step-by-step I.

1. Pre script:
• creates a storage directory on the selected site’s front-end node, using the ‘fork’ jobmanager
• local input files are copied to this directory from the Portal machine using GridFTP
• remote input files are copied using GridFTP (in case of errors, a two-phase copy via the Portal machine is tried)

2. Condor-G job:
• a wrapper script (wrapperp) is specified as the real executable
• a single job is submitted to the requested jobmanager; for MPI jobs the ‘hostcount’ RSL attribute is used to specify the number of requested nodes (see the command-line sketch below)
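A hedged command-line sketch of the same two steps using the GT-2 client tools directly (the host name, paths, node count and the jobtype attribute are assumptions; the Portal performs these steps through its pre script and Condor-G rather than interactively):

# pre script equivalent: create the temporary directory through the 'fork'
# jobmanager, then stage a local input file into it with GridFTP
globus-job-run grid-ce.example.org/jobmanager-fork /bin/mkdir /tmp/pgrade_job1
globus-url-copy file:///var/portal/wf1/job1/input.dat \
    gsiftp://grid-ce.example.org/tmp/pgrade_job1/input.dat

# Condor-G equivalent: submit wrapperp to the PBS jobmanager, requesting 4 nodes
# with the 'hostcount' RSL attribute; wrapperp itself will locate and call mpirun
globusrun -b -r grid-ce.example.org/jobmanager-pbs \
    '&(executable=/tmp/pgrade_job1/wrapperp)(directory=/tmp/pgrade_job1)(count=4)(hostcount=4)(jobtype=single)'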

Page 15

Direct exec: step-by-step II.

3. LRMS:
• allocates the number of requested nodes (if needed)
• starts wrapperp on one of the allocated nodes (the master worker node)

4. Wrapperp (running on the master worker node; sketched below):
• copies the executable and input files from the front-end node (‘scp’ or ‘rcp’)
• in case of PBS jobmanagers, the executable and input files are copied to the allocated nodes (PBS_NODEFILE); in case of non-PBS jobmanagers a shared filesystem is required, as the host names of the allocated nodes cannot be determined
• wrapperp searches for ‘mpirun’
• the real executable is started using the found ‘mpirun’
• in case of PBS jobmanagers, output files are copied from the allocated worker nodes to the master worker node
• output files are copied to the front-end node
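A hedged sketch of what such a wrapper does for a PBS jobmanager (how the front-end host, work directory, executable and arguments reach the script, and every name used here, are assumptions; the real wrapperp ships with the Portal):

#!/bin/sh
FRONTEND=grid-ce.example.org                  # front-end node (assumed)
WORKDIR=/tmp/pgrade_job1                      # temporary directory created by the pre script

scp "$FRONTEND:$WORKDIR/*" .                  # fetch the executable and input files
chmod +x ./cummu

for n in `sort -u $PBS_NODEFILE`; do          # PBS publishes the allocated hosts ...
    ssh $n mkdir -p $PWD
    scp cummu input.dat $n:$PWD/              # ... so files can be pushed to every node
done

MPIRUN=`which mpirun`                         # the real script also searches common install paths
NP=`wc -l < $PBS_NODEFILE`
$MPIRUN -np $NP -machinefile $PBS_NODEFILE ./cummu -N -M -p 32

for n in `sort -u $PBS_NODEFILE`; do
    scp "$n:$PWD/output*" . 2>/dev/null       # collect output files from the slave nodes
done
scp output* $FRONTEND:$WORKDIR/               # stage the results back to the front-end node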

Page 16

Direct exec: step-by-step III.

5. Post script:
• local output files are copied from the temporary working directory created by the pre script to the Portal machine using GridFTP
• remote output files are copied using GridFTP (in case of errors, a two-phase copy via the Portal machine is tried); see the GridFTP sketch below

6. DAGMan: carries on scheduling the remaining workflow jobs…
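For illustration, the kind of GridFTP transfers the post script issues from the Portal machine (hosts and paths are invented): the first command downloads a local output file to the Portal machine, the second performs a third-party copy of a remote output file to its final GridFTP location (falling back to a two-phase copy through the Portal machine on error).

globus-url-copy gsiftp://grid-ce.example.org/tmp/pgrade_job1/output.dat \
    file:///var/portal/users/joe/wf1/job1/output.dat
globus-url-copy gsiftp://grid-ce.example.org/tmp/pgrade_job1/remote_out.dat \
    gsiftp://storage.example.org/data/user/remote_out.dat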

Page 17

Direct execution

[Diagram: direct execution. Numbered data-flow steps between the Portal machine, remote file storage, the front-end node (fork jobmanager, GridFTP, PBS, temporary storage) and the master and slave worker nodes, where wrapperp launches the executable with mpirun; inputs and the executable are staged in, outputs are staged back.]

Page 18

Direct Submission Summary

• Pros:
– Users can add remote file support to legacy applications
– Works for both sequential and MPI(CH) applications
– For PBS jobmanagers there is no need for a shared filesystem (support for other jobmanagers can be added; it depends on the information the jobmanagers provide)
– Works with jobmanagers that do not support MPI
– Faster than submitting through the broker

• Cons:
– the user needs to specify the execution resource
– currently does not work on non-PBS jobmanagers without a shared filesystem

Page 19

Portal: Brokered job submission

• The EGEE Resource Broker is used

• The resource to be used is unknown before job execution

• The user must have a valid, accepted certificate

• Local files are supported

• Remote files residing on Storage Elements are supported, even for grid-unaware applications

• Jobs may be sequential or MPI applications

Page 20

Broker exec: step-by-step I.

1. Pre script:
• creates the Scheduler universe Condor submit file

2. Scheduler universe Condor job:
• the job is a shell script
• the script is responsible for (sketched below):
➢ job submission: a wrapper script (wrapperrb) is specified as the real executable in the JDL file
➢ job status polling
➢ job output downloading
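For illustration, roughly what the script drives on the Portal machine, expressed with the LCG-2 user-interface commands (the gLite WMS offers glite-wms-job-* equivalents; the VO and file names are invented):

edg-job-submit --vo seegrid -o jobid.txt job1.jdl       # submit: the JDL names wrapperrb as the executable
edg-job-status -i jobid.txt                             # poll until the job reaches a final state
edg-job-get-output -i jobid.txt --dir ./job1_output     # download the output sandbox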

Page 21

Broker exec: step-by-step II.

3. Resource Broker:
• handles the requests of the Scheduler universe Condor job
• sends the job to a CE
• watches its execution
• reports errors
• …

4. LRMS on the CE:
• allocates the requested number of nodes
• starts wrapperrb on the master worker node using ‘mpirun’
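Schematically, the effect on the CE is that the jobmanager launches the wrapper under ‘mpirun’, roughly (the node count, machine file and argument handling are assumptions):

mpirun -np 3 -machinefile $PBS_NODEFILE ./wrapperrb <arguments of the real executable>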

Page 22

Broker exec: step-by-step III.

5. Wrapperrb (sketched below):
• the script is started by ‘mpirun’, so it starts on every allocated worker node like an MPICH process
• checks whether the remote input files are already present; if not, they are downloaded from the Storage Element
• if the user specified any remote output files, existing copies are removed from the storage
• the real executable is started with the arguments passed to the script; these already contain the MPICH-specific ones
• after the executable has finished, the remote output files are uploaded to the Storage Element (only in case of gLite)

6. Post script:
• nothing special…
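A hedged sketch of the data movements wrapperrb performs around the real executable (LCG data-management commands shown; the VO, SE host and file names are invented, and the real script derives them from the job description):

#!/bin/sh
if [ ! -f input.dat ]; then                   # remote input not yet present on this node?
    lcg-cp --vo seegrid lfn:/grid/seegrid/user/input.dat file://$PWD/input.dat
fi

# remove a previously registered copy of the remote output file, so the upload
# at the end does not clash with an existing catalogue entry
lcg-del -a --vo seegrid lfn:/grid/seegrid/user/output.dat 2>/dev/null

./cummu "$@"                                  # start the real executable; "$@" already carries
                                              # the MPICH-specific arguments added by mpirun

# gLite only: upload and register the remote output file on a Storage Element
lcg-cr --vo seegrid -d se.example.org -l lfn:/grid/seegrid/user/output.dat \
    file://$PWD/output.dat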

Page 23

Broker execution

[Diagram: broker execution. Numbered data-flow steps from the Portal machine through the Resource Broker to the CE front-end node (Globus, PBS) and on to the master and slave worker nodes, where mpirun starts wrapperrb and wrapperrb starts the real executable; the worker nodes exchange files with a Storage Element.]

Page 24

Broker Submission Summary

• Pros:
– adds support for remote file handling for legacy applications
– extends the functionality of the EGEE broker
– one solution supports both sequential and MPI applications

• Cons:
– slow application execution
– status polling generates high load with 500+ jobs

Page 25

Experimental results

• Selected SEEGRID CEs were tested with a job requesting 3 nodes, both through the broker from the command line and through direct job submission from the P-GRADE Portal:

CE Name                    Broker Result            Portal Direct Result
grid2.cs.bilkent.edu.tr    Failed (exe not found)   OK
grid01.rcub.bg.ac.yu (?)   Failed                   OK
cluster1.csk.kg.ac.yu      Failed                   OK
ce02.grid.acad.bg          OK                       OK
ce01.isabella.grnet.gr     OK                       Failed (job held)
ce.ulakbim.gov.tr          Scheduled                OK
ce.phy.bg.ac.yu            Failed (exe not found)   OK

Page 26

Thank you for your attention

?