Page 1: Running Parallel Simulations and Enabling Science Gateways ...

Running Parallel Simulations and Enabling Science Gateways with the NSF

MATLAB Experimental Computing Resource at Cornell

Steve Lantz, Cornell Center for Advanced Computing

Susan Mehringer, Cornell Center for Advanced Computing

The “MATLAB on the TeraGrid” experimental computing resource is funded by NSF grant

0844032 in partnership with Purdue University, Dell, The MathWorks, and Microsoft.

Presented at SC10, New Orleans, LA

November, 2010

Page 2: Running Parallel Simulations and Enabling Science Gateways ...

www.cac.cornell.edu/matlab 2

MATLAB on the TeraGrid

• This is an effort to provide a large parallel MATLAB resource to a national (and international) community in a secure, usable manner.

• Several different hardware and software components make up the system. These integrate with the MATLAB client at different levels.

• All functions are provided by various “services”, meaning you never actually log on to any CAC systems. The client software simply makes requests to CAC systems.

Page 3: Running Parallel Simulations and Enabling Science Gateways ...

High-Level Process

• Security is managed via short-lived certificates. When you log in to the system, you trade a username/password for a certificate that allows you to use the services.

• File transfer service – enables you to move files through a specialized FTP server to a network file system that is mounted on all compute nodes.

• Job submission service – enables you to submit and query jobs on the cluster; these jobs are executed by MATLAB workers on the compute nodes.

Page 4: Running Parallel Simulations and Enabling Science Gateways ...

Hardware View

[Diagram labels: MyProxy Server; GridFTP Server; HPC 2008 Head Node; DataDirect Networks 9700 Storage; Windows Server 2008; CAC 10GB Interconnect]

1. Retrieve certificate

2. Upload files to storage via GridFTP

3. Submit job to run MATLAB workers on cluster

4. Download files via GridFTP

Page 5: Running Parallel Simulations and Enabling Science Gateways ...

Software View

• File movement and job submission interactions are largely hidden by software integrated with MATLAB.

• CAC’s client code for MATLAB is a mix of Java and .m files that enable access to the TUC cluster directly from your MATLAB client through the PCT “generic scheduler” interface.

• Client code will communicate as needed with server-side software to run your parallel jobs on TUC, the 512-core cluster devoted to parallel MATLAB applications.

Page 6: Running Parallel Simulations and Enabling Science Gateways ...

[Diagram labels: JGlobus CoG; Apache CXF; Certificate Management; MyProxy; GridFTP; SSL; JSDL; matlabpool; parfor; createJob; submit; getAllOutputArguments]

Page 7: Running Parallel Simulations and Enabling Science Gateways ...

A Word on Security

• Logging on to MyProxy returns a short-lived X.509 certificate that is used to authenticate to services.

• This allows any TeraGrid user to access the system using their username/password on a TeraGrid MyProxy server. Most users will use the CAC MyProxy server.

• Job submission and status information is accessed via a web service call that is secured by a client-certificate SSL (or TLS) connection. Your data and job requests are transferred over secure channels.

Page 8: Running Parallel Simulations and Enabling Science Gateways ...

GridFTP

• GridFTP is an extension of the standard File Transfer Protocol, developed as part of the Globus Toolkit.

• GridFTP provides two key extensions that the CAC client code uses:

– GSI Security – The Grid Security Infrastructure provides file transfer authentication and encryption and interoperates with MyProxy X.509 certificates.

– Parallel Transfers (extended block mode) – Makes use of multiple simultaneous connections so a higher percentage of available bandwidth can be used.

Page 9: Running Parallel Simulations and Enabling Science Gateways ...

A Note on the Platform

• The compute nodes that run the MATLAB jobs are running Windows HPC 2008 (64 bit).

– Since a minority of people are running a Win64 platform, any files requiring compilation (e.g., mex files) will likely need to be recompiled on TUC.

– MATLAB is relatively resilient to paths with the wrong direction of slashes, but the difference can cause problems.

• C:\Users\naw47\myfiles\this.dat Windows path

• /home/naw47/myfiles/this.dat Mac, Linux path
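
One way to sidestep the slash problem is to let MATLAB assemble paths. The sketch below uses the built-in fullfile and filesep functions; the folder and file names are placeholders borrowed from the two example paths.

```matlab
% Build a path from its parts so the separator matches the platform the
% code is running on (backslash on Windows, forward slash on Mac/Linux).
parts = {'myfiles', 'this.dat'};          % placeholder names
relPath = fullfile(parts{:});

% Convert a hard-coded Windows-style relative path to the local convention.
winStyle = 'myfiles\this.dat';
localStyle = strrep(winStyle, '\', filesep);

fprintf('fullfile gives: %s\n', relPath);
fprintf('converted path: %s\n', localStyle);
```

Because a task function runs on the Windows compute nodes rather than on the client, building paths inside the task function this way avoids carrying client-side separators to TUC.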

Page 10: Running Parallel Simulations and Enabling Science Gateways ...

Support

• As a funded project, the system is free to use for research applications.

– We will ask for information on your project so that we can learn who we are supporting and how to best address problems.

• We also provide consulting support for the system.

– Troubleshooting

– Guidance on optimizing your application

– General help with parallel MATLAB

Page 11: Running Parallel Simulations and Enabling Science Gateways ...

Installing the CAC Parallel MATLAB Client Code

Page 12: Running Parallel Simulations and Enabling Science Gateways ...

Installation Overview

1. Check that prerequisites are met

2. Download the CAC client code

3. Modify your MATLAB classpath.txt file

4. Modify one function in the CAC client code

5. Register your Certificate

6. Set paths, runtime, etc

7. Run test jobs

Page 13: Running Parallel Simulations and Enabling Science Gateways ...

Prerequisites

• Linux, Mac, or Windows operating system on the client machine

• MATLAB Release 2009a or 2009b or 2010a

• MATLAB Parallel Computing Toolbox

• Access obtained by submitting the Interest Form found at http://www.cac.cornell.edu/matlab/

Page 14: Running Parallel Simulations and Enabling Science Gateways ...

Terminology

MATLABROOT: The MATLAB installation directory

• >> matlabroot

• Common locations are:

– Vista/XP/7: C:\Program Files\MATLAB\R2009a

– Mac: /Applications/MATLAB_R2009a.app

– Linux: /opt/matlab/r2009a or /usr/local/matlab/r2009a

CACHOME: Wherever you install the CAC client code

• Be sure to substitute your folder path for CACHOME in all installation steps. The folder itself can have any name.

Page 15: Running Parallel Simulations and Enabling Science Gateways ...

Helpful Links

• General: http://www.cac.cornell.edu/matlab/

• CAC client code, download and installation FAQ: http://www.cac.cornell.edu/matlab/downloads

• Help: http://www.cac.cornell.edu/help/

Page 16: Running Parallel Simulations and Enabling Science Gateways ...

Download CAC Client Code

• Choose a name and location for CACHOME. You will need write permissions. Some good choices:

– Windows: c:\username\cac

– Mac: /Users/username/cac

– Linux: /home/username/cac

• Download the .zip file:

http://www.cac.cornell.edu/matlab/downloads

• Unpack it into CACHOME. You should end up with a folder that contains .m files and subdirectories.

Page 17: Running Parallel Simulations and Enabling Science Gateways ...

classpath.txt: Java Libraries

• The CAC client code is heavily dependent on a series of Java libraries for functionality. The CACHOME/lib/*.jar files must be added to MATLAB’s Java classpath, in the text file classpath.txt.

• Find the location of classpath.txt: >> which classpath.txt

Page 18: Running Parallel Simulations and Enabling Science Gateways ...

classpath.txt: Modifications

• Find the line maci64=$matlabroot/java/jarext/aquaDecorations.jar and, after it, add the 12 lines listed below.

– The listing uses Windows backslashes; use forward slashes on Mac and Linux.

– Replace CACHOME with your install path, which must be an absolute path.

• Comment out one line: ## $matlabroot/java/jarext/ice/ib6https.jar

CACHOME\lib\littlejohn.jar

CACHOME\lib\bcprov-jdk15-1.43.jar

CACHOME\lib\bcprov-jdk16-143.jar

CACHOME\lib\cog-jglobus-1.7.0.jar

CACHOME\lib\commons-logging-1.1.1.jar

CACHOME\lib\cryptix-asn1.jar

CACHOME\lib\cryptix.jar

CACHOME\lib\cryptix32.jar

CACHOME\lib\cxf-2.2.7.jar

CACHOME\lib\log4j-1.2.15.jar

CACHOME\lib\not-yet-commons-ssl-0.3.11.jar

CACHOME\lib\puretls.jar
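
Typing twelve absolute paths by hand invites mistakes, so it may help to generate the lines. The sketch below is only an illustration: the install location is a made-up stand-in for your own CACHOME, and only the first three jar names from the list are included.

```matlab
% Hypothetical install location; substitute your own absolute CACHOME path.
cacHome = fullfile(tempdir, 'cac');

% First three jar names from the list above; extend with the remaining nine.
jars = {'littlejohn.jar', 'bcprov-jdk15-1.43.jar', 'bcprov-jdk16-143.jar'};

% Build one absolute classpath entry per jar using the local separator.
entries = cellfun(@(f) fullfile(cacHome, 'lib', f), jars, ...
                  'UniformOutput', false);

% Print the lines so they can be pasted into classpath.txt.
fprintf('%s\n', entries{:});
```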

Page 19: Running Parallel Simulations and Enabling Science Gateways ...

classpath.txt: Testing

1. Restart MATLAB.

2. List the paths of all of the jar files that MATLAB knows about. Do you see the 12 lines you added?

>> javaclasspath

3. Are you using the classpath.txt file you expected?

>> which classpath.txt

4. Test that classpath.txt is set up properly.

>> addpath('CACHOME/contrib');

>> updateContrib();

>> cacCheckClassPath();

Page 20: Running Parallel Simulations and Enabling Science Gateways ...

classpath.txt: But, what if…

You are not administrator on the machine?

• You must run explicitly as the administrator when editing classpath.txt for the changes to be saved, since it affects the global MATLAB install.

• If this is not feasible, e.g. you are on a multi-user system:

– Identify your startup directory: >> userpath

– Place a classpath.txt file in your startup directory. This file is user-specific and will only affect your MATLAB environment.

– MATLAB looks first in the startup directory for a classpath.txt file, then the default directory, using whichever it finds first.

– http://www.mathworks.com/access/helpdesk/help/techdoc/matlab_env/f8-10506.html

Page 21: Running Parallel Simulations and Enabling Science Gateways ...

cacsched.m: Modifications

• Start MATLAB

• Edit CACHOME\cacsched.m. Change USERNAME to your CAC username in line 24.

• Add the CAC client code to your MATLAB path:

>> addpath('CACHOME');

• Run cacsched to set up the scheduler object, sched. This object is passed to the createJob functions in order to initiate jobs. The scheduler settings will be output.

>> cacsched

Page 22: Running Parallel Simulations and Enabling Science Gateways ...

cacsched.m: Optional

• Review the following line from cacsched.m:

set(sched,'DataLocation',fullfile(LJHome, 'jobs'));

• The communication between the client MATLAB and the scheduler is file based. This means each job submission creates a large number of files, which need to be stored somewhere on the client machine. The default location is CACHOME\jobs.

• For a different location, change the line to an explicit path, e.g.

set(sched,'DataLocation','/home/myuserid/myJobsLocation');

Page 23: Running Parallel Simulations and Enabling Science Gateways ...

Register your Certificate

Download the CAC server certificates and register your certificate with us.

>> cacRegisterCertificate();

• Follow the dialogue box instructions

• Run this again any time you change your password

• It can take up to two minutes to complete

Page 24: Running Parallel Simulations and Enabling Science Gateways ...

Job Submission Settings

• These settings will be in effect for all subsequent job submissions, until you change them:

• >> ClusterInfo.setQueueName('Quick'); % use the Quick queue only for jobs of 10 minutes or less on 16 cores or fewer

• >> ClusterInfo.setWallTime(10); % set your wall time limit to 10 minutes

• >> ClusterInfo.getWallTime % show your current wall time setting

• >> help ClusterInfo % see more examples

Page 25: Running Parallel Simulations and Enabling Science Gateways ...

Installation Testing

runtests.m, found in CACHOME, is a tool which performs a series of functionality tests:

>> runtests(sched,1); % run local tests on file and path settings

>> runtests(sched,2); % run test on the file transfer system and on scheduler communication.

>> runtests(sched,3); % submit sample MATLAB jobs to the cluster to ensure that both parallel and distributed jobs are functioning.

>> runtests(sched,0); % run all tests

Page 26: Running Parallel Simulations and Enabling Science Gateways ...

Future Sessions

>> addpath('CACHOME');

>> cacsched

>> addpath('CACHOME/contrib');

>> updateContrib();

>> ClusterInfo.setQueueName('Quick'); % Or rely on setting from previous session

>> ClusterInfo.setWallTime(10); % Or rely on setting from previous session

>> submityourjob(sched); % “submityourjob” is your job

Page 27: Running Parallel Simulations and Enabling Science Gateways ...

How to Submit a Job

Page 28: Running Parallel Simulations and Enabling Science Gateways ...

Next Steps

• At this point, you have a fully operational install of the CAC client code for parallel MATLAB.

• Your next step should be to take a look in the examples directory to start seeing how to take advantage of TUC.

– cacsubmit – super simple distributed job example

– cacparsubmit – simple example of a parallel (MPI) job

– pooljobremote – MATLAB pool example

– cacNonBlockSubmit – example of submitting a non-blocking job (avoiding waitForState)

Page 29: Running Parallel Simulations and Enabling Science Gateways ...

Using the PCT

• MATLAB’s Parallel Computing Toolbox is the client-side toolbox that enables parallelism (including using TUC).

• PCT provides a set of interfaces that allow us (CAC) to write implementations of PCT functions that talk to TUC but look the same to you (the user) as when run locally.

• Parallel resources are selected either by using a named configuration or by using the findResource function.

– Either way, PCT function calls must be tied to specific implementations to provide resource-specific functionality.

– You don’t ever call the underlying functions directly.

Page 30: Running Parallel Simulations and Enabling Science Gateways ...

findResource

• In our examples (and in general practice) we call the findResource function via a script called cacsched.m.

• If you examine cacsched, you’ll see we also tie the PCT interface functions to specific functions provided by CAC.

Page 31: Running Parallel Simulations and Enabling Science Gateways ...

Using a Configuration

Page 32: Running Parallel Simulations and Enabling Science Gateways ...

Jobs and Tasks

• findResource creates a scheduler object, which allows you to create Jobs. In PCT, Jobs are containers for Tasks, which are where the actual work is.

[Diagram labels: sched (Scheduler Object); Jobs(24), created with j=createJob(sched); Jobs(25), created with j=createParallelJob(sched); Tasks(1) myFunction(z); Tasks(1) someFunction(x); Tasks(2) otherFunction(y); each task is added with createTask(j,…)]

Page 33: Running Parallel Simulations and Enabling Science Gateways ...

Distributed Jobs

• There are three types of jobs in the PCT: distributed, parallel, and pool.

• Distributed jobs have one or more one-core tasks and no communication between tasks. Thus, each task could be run as a one-core job through a batch scheduler. These are useful for embarrassingly parallel (EP) work or for shifting lengthy jobs to TUC.
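
A distributed job with several independent tasks can be sketched as follows. This assumes that sched already exists in the workspace (created by running cacsched) and uses only the PCT calls that appear elsewhere in these slides; the input values are arbitrary.

```matlab
% Assumes sched was created beforehand by running cacsched.
j = createJob(sched);

% One single-core task per input value; the tasks do not communicate.
inputs = [10 20 30 40];
for k = 1:numel(inputs)
    % rand is built in, so no files need to be transferred to the cluster.
    createTask(j, @rand, 1, {inputs(k)});
end

submit(j);
waitForState(j);                 % blocks until the job finishes
a = getAllOutputArguments(j);    % a{k,1} holds the output of task k
```

Each task here returns a square random matrix of the requested size, so a{2,1} would be 20-by-20.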

Page 34: Running Parallel Simulations and Enabling Science Gateways ...

Parallel and Pool Jobs

• Parallel and Pool jobs are multi-core; communication between cores is possible. These jobs have just one task!

• The number of cores must be given. The task function is responsible for implementing the actual parallelism using MPI_Rank logic (or parfor/spmd/labindex for pool jobs).

Page 35: Running Parallel Simulations and Enabling Science Gateways ...

State of Jobs

• After a job is submitted, “job.state” is just one of several different ways to learn the state of the job.

• waitForState is a PCT interface to block on job state; it’s problematic for long-running jobs or jobs that fail.

• cacNonBlockingGetJobStates is an optional, non-PCT interface that offers more control.
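
The idea behind non-blocking monitoring is a polling loop with an upper bound, so the client never waits forever on a job that has failed. The sketch below is not the CAC function (its signature is not shown here); it simulates a sequence of job states so that it runs anywhere, and a comment marks where a real job query would go.

```matlab
% Simulated sequence of states; a real script would query the job instead.
simulated = {'queued', 'running', 'running', 'finished'};

maxPolls     = 10;     % give up after this many checks
pollInterval = 0.1;    % seconds between checks (use minutes for real jobs)

state  = 'pending';
nPolls = 0;
while nPolls < maxPolls && ~any(strcmp(state, {'finished', 'failed'}))
    nPolls = nPolls + 1;
    % With a submitted job j this line would be: state = get(j,'State');
    state = simulated{min(nPolls, numel(simulated))};
    fprintf('poll %d: %s\n', nPolls, state);
    pause(pollInterval);
end
```

Because the loop is bounded, the script regains control even if the job never reaches a terminal state, which is the weakness of a bare waitForState call.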

Page 36: Running Parallel Simulations and Enabling Science Gateways ...

Retrieving Results

• Once your job completes, you need to get the results in two steps: (1) download files, (2) load into workspace.

– getAllOutputArguments returns cell array a{Task,Output}

– a{1,2} = Task 1, second output
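
The row/column convention is easy to try locally. In the sketch below the cell array is a hand-made stand-in for what getAllOutputArguments might return for a two-task job whose tasks each return two outputs; the values are invented.

```matlab
% Stand-in for a = getAllOutputArguments(j): rows are tasks, columns are
% outputs. The contents are made up for illustration.
a = { [1 2 3], 'first task, output 2'; ...
      [4 5 6], 'second task, output 2' };

[nTasks, nOutputs] = size(a);

% Second output of task 1, as on the slide.
secondOutputOfTask1 = a{1,2};

% Stack the first output of every task into one matrix (one row per task).
firstOutputs = vertcat(a{:,1});
```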

Page 37: Running Parallel Simulations and Enabling Science Gateways ...

Helpers

• The CAC client code provides a number of functions beyond the PCT interface which should be helpful to you. The hands-on labs will take advantage of these functions.

– gridFTP() – creates an object whose methods are, in effect, a command-line interface to the TUC file storage.

– littleJohnLog/qpeek – monitor the status of a running job. littleJohnLog is a server-side function that writes data to a file that qpeek reads.

– getErrors/getOutput – pretty-print any errors your tasks had, as well as the command-line output.

Page 38: Running Parallel Simulations and Enabling Science Gateways ...

Putting it All Together

We can control which resource is used to execute the job simply by swapping out the scheduler object!

Page 39: Running Parallel Simulations and Enabling Science Gateways ...

Parallelizing a Pool Code

• As we have seen, converting code to run remotely as a distributed job is fairly trivial. All you really need to do is call createTask on your function (maybe in a loop).

• Pool jobs are not hard, either. Let’s take a code that opens a pool on a multi-core workstation and alter it to exploit the many cores on TUC. The basic process:

1) Modify the pool function to run on TUC

2) Write the submitter or driver script

3) Script the movement of any needed files

Page 40: Running Parallel Simulations and Enabling Science Gateways ...

Pool Code

• Our example code opens a matlabpool, reads an input file, then uses a parfor loop to execute the peak-picking algorithm in parallel.

Page 41: Running Parallel Simulations and Enabling Science Gateways ...

Modifications to the Pool Code

– More outputs are needed

– Pool commands are removed

– Absolute paths are best for I/O

– Graphics may be moved off to the client, or may be dumped to a file

Page 42: Running Parallel Simulations and Enabling Science Gateways ...

Submitter Script

• The submitter or driver script sets up the pool job

• It starts up the matlabpool automatically (8 “labs” here)

Page 43: Running Parallel Simulations and Enabling Science Gateways ...

Moving the Files

• Both the task function and the datafile must be present on the remote server. We’ll use gridFTP to take care of it.

• The submitter also sets PathDependencies for the job.

Page 44: Running Parallel Simulations and Enabling Science Gateways ...

Parallel Jobs

• PCT supports basic MPI commands inside parallel jobs.

– Initialization is done for you (no MPI_Init)

– Size and rank are available from the start of the job; numlabs = MPI_Comm_size, labindex = MPI_Comm_rank

Page 45: Running Parallel Simulations and Enabling Science Gateways ...

More on Parallel Jobs

• All the basic communication methods are available: Send, Receive, Broadcast, Barrier, gop (gather)

• Source and tag are the same as in MPI, but MATLAB figures out data formats for you.

– labSend(data,destination,[tag]);

– labReceive(source,tag);

– labReceive(); %take anything

• Co-distributed arrays are sliced across the workers so that huge matrices can be operated on.
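
As a rough illustration of rank-based logic, the task function below has every lab except the first send its labindex to lab 1, which adds the values up. The function name sumOfRanks is hypothetical; labindex, numlabs, labSend, and labReceive are the PCT functions named above. It only makes sense as the task of a parallel job, so it cannot be run in a plain MATLAB session.

```matlab
function total = sumOfRanks()
% Hypothetical task function for a parallel job: lab 1 collects one
% number from every other lab and returns the sum.
if labindex == 1
    total = labindex;                 % start with lab 1's own rank
    for src = 2:numlabs
        % Receive from each sender in turn; MATLAB handles the data format.
        total = total + labReceive(src);
    end
else
    labSend(labindex, 1);             % send this lab's rank to lab 1
    total = [];                       % only lab 1 returns the sum
end
end
```

With N labs, lab 1 should return N*(N+1)/2. Only lab 1 ever receives and every other lab only sends, so no lab waits on a message that is never sent.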

Page 46: Running Parallel Simulations and Enabling Science Gateways ...

File Transfer

Page 47: Running Parallel Simulations and Enabling Science Gateways ...

Hardware View

[Diagram labels: MyProxy Server; GridFTP Server; HPC 2008 Head Node; DataDirect Networks 9700 Storage; Windows Server 2008; CAC 10GB Interconnect]

1. Retrieve certificate

2. Upload files to storage via GridFTP

3. Submit job to run MATLAB workers on cluster

4. Download files via GridFTP

Page 48: Running Parallel Simulations and Enabling Science Gateways ...

File Transfer

• The basic job submit operation specifies that the program will run on a remote server. When it runs, the functions and data must be available.

j = createJob(sched);

createTask(j,@rand,1,{3,3});

submit(j);

waitForState(j);

a = getAllOutputArguments(j);

• This example works as-is because rand is a built-in MATLAB function. It is always on the MATLAB path.

Page 49: Running Parallel Simulations and Enabling Science Gateways ...

File Transfer

• Have a custom function and/or require a datafile?

j = createJob(sched);

createTask(j,@myfunction,1,{3,3});

submit(j);

waitForState(j);

a = getAllOutputArguments(j);

• myfunction.m does not exist on the remote computer.

• Transfer this file and get it added to the path.

Page 50: Running Parallel Simulations and Enabling Science Gateways ...

FileDependencies

• The MATLAB FileDependencies property will move the files for you

• Best for smaller projects with only a couple of files

• Specify directories and files the worker will need. All files and directory structure will be copied; file transfer occurs for each worker running a task for that particular job on a machine

set(j,'FileDependencies',{'/home/username/src/myfunction.m', '/home/username/data/dfile.mat'});

Page 51: Running Parallel Simulations and Enabling Science Gateways ...

Move the Files Yourself

• FileDependencies is best for smaller projects with only a couple of files

• Alternative:

1. Move the file(s)

2. Add the path to the worker sessions

Page 52: Running Parallel Simulations and Enabling Science Gateways ...

1. Move the Files

• First move the file(s) needed by the job:

sendFileToCAC('filename.m');

• sendFileToCAC('filename') – super simple method for dumping a single file into your home directory on TUC.

• sendDirToCAC('mydir','tucDir') – Recursively move a directory and its contents to TUC.

Page 53: Running Parallel Simulations and Enabling Science Gateways ...

2. Add the path

• On your laptop/workstation, you commonly issue addpath('path/to/file') statements.

• The same is true when running MATLAB on TUC, but:

– The task function must be on the startup path of MATLAB.

– You may enter addpath and cd statements into your task function, but first your function must be available.

• We will use PathDependencies to make our task function available.

Page 54: Running Parallel Simulations and Enabling Science Gateways ...

PathDependencies

• PathDependencies is a property of the Job object that allows you to issue addpath statements on TUC before calling your task.

• Assuming the file has been moved to \\storage01\matlab\username\MyProjectDir

• Specify the path dependency in your job submission script:

set(j,'PathDependencies',{'\\storage01\matlab\username\MyProjectDir'});

Page 55: Running Parallel Simulations and Enabling Science Gateways ...

PathDependencies

Page 56: Running Parallel Simulations and Enabling Science Gateways ...

Scripted Solution

• Both send*ToCAC methods are primarily for one-time use: moving a big directory of data files, testing, or copying a single file.

• The gridFTP interface is more flexible. It allows you to interactively move files to TUC as well as write scripts that move files.

• For projects that involve more than one or two source files, we recommend writing a “prep” function which ensures that the most up-to-date functions are available on TUC.

Page 57: Running Parallel Simulations and Enabling Science Gateways ...

Then add PathDependencies to the job in the submission script:
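
A minimal sketch of that step, assuming sched was created by cacsched and the project files already sit in the folder used on the earlier PathDependencies slide. The username in the path is a placeholder, and myfunction is the hypothetical task function from the File Transfer example.

```matlab
% Placeholder path: the TUC folder that already holds the project files.
remoteDir = '\\storage01\matlab\username\MyProjectDir';

j = createJob(sched);

% The workers add this folder to their MATLAB path before the task runs.
set(j, 'PathDependencies', {remoteDir});

% myfunction.m is assumed to be stored in remoteDir.
createTask(j, @myfunction, 1, {3,3});
submit(j);
```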

Page 58: Running Parallel Simulations and Enabling Science Gateways ...

Lab

• Source files:

– calcLatLongDistance.m

– degrees2Radians.m

• Data file:

– Airports_boardings.txt

• Task function:

– addpath_remote.m

• Batch script:

– addpath_submit.m

Using this set of files, we will work with:

• FileDependencies

• PathDependencies

• GridFTP

Page 59: Running Parallel Simulations and Enabling Science Gateways ...

Debugging

Page 60: Running Parallel Simulations and Enabling Science Gateways ...

Debugging

• Debugging a remote process is always difficult. The situation on TUC is no different. Errors must be caught, captured and returned to the client machine to resolve.

• MATLAB generally captures any errors thrown by a task and stores them as an MException in the task output.

– Distributed jobs may store an exception for each task.

– The CAC-provided function getErrors(j) collects the errors from the tasks of a job and pretty-prints them for you.
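
To get a feel for what an MException holds before looking at one from a remote task, it can help to raise and catch one locally. The sketch below is plain MATLAB and does not touch the cluster; the identifier demo:badInput is made up.

```matlab
% Raise and catch an error locally to see what an MException contains.
try
    error('demo:badInput', 'Input %d is out of range', 42);
catch err
    fprintf('identifier: %s\n', err.identifier);
    fprintf('message:    %s\n', err.message);
    % The stack lists the functions that were active when the error occurred;
    % it can be empty when the error is raised directly at the prompt.
    for k = 1:numel(err.stack)
        fprintf('  in %s (line %d)\n', err.stack(k).name, err.stack(k).line);
    end
end
```

The same message and stack fields are what the commands on the next slide read from a task's stored error.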

Page 61: Running Parallel Simulations and Enabling Science Gateways ...

Getting Errors

• SimpleError.m has a simple error in the task function. Submit SimpleErrorSubmit.m and examine the error:

>> [j,a] = SimpleErrorSubmit(sched);

>> getErrors(j); % how do we view just one stacktrace?

>> ts = get(j,'Tasks');

>> es = get(ts,'Errors');

>> es.message

>> es.stack(1)

Page 62: Running Parallel Simulations and Enabling Science Gateways ...

Manual Retrieval

• Sometimes a job will hang or fail in such a way that the files don’t get downloaded from TUC correctly. In this case, you’ll want to retrieve those files manually.

>> downloadJob(sched,j);

• Here’s what to do first for a job defined in a prior session:

>> cacsched % re-create the sched object

>> j = sched.Jobs(12) % copy the Jobs(12) object, e.g.

>> get(j,'name') % check the job’s name

Page 63: Running Parallel Simulations and Enabling Science Gateways ...

Manual Retrieval with gridFTP

• For large parallel jobs, you may want to use gridFTP to shortcut downloadJob, which downloads all of the files.

– If you need to spot the error in a large parallel job, for example, you can just use gridFTP to grab Task1.out.mat.

– It very likely contains the exception, because the error almost always occurs either in all Tasks or in the master (Task1).

>> ftp = gridFTP();

>> ftp.get('Job4/Task1.out.mat');

>> ftp.close();

Page 64: Running Parallel Simulations and Enabling Science Gateways ...

Stdout, Stderr

• For a parallel job named JobN, the standard output and errors for all the MATLAB processes are stored in two files called JobN.ou and JobN.er respectively.

• In a distributed job, each TaskM writes its own output and error into TaskM.ou and TaskM.er in the JobN folder.

• But capturing errors doesn’t help you catch other things:

– Problems with numerics

– Running times of different sections of your code

– Progress of a long-running job…

Page 65: Running Parallel Simulations and Enabling Science Gateways ...

Printf

• There are two other ways to get diagnostic information:

– captureCommandWindowOutput - if ‘true’ for a task, this property tells MATLAB to return output from fprintf statements and other console output (e.g., from statements lacking a semicolon) to the client.

– littleJohnLog/qpeek - this pair of functions can be used to create a log for a long-running job and examine the log as the job runs. Usage ideas can be found in cacLog.m and cacLogSubmit.m, in the examples folder.

Page 66: Running Parallel Simulations and Enabling Science Gateways ...

Printf Example

• Verbose.m and VerboseSubmit.m contain both fprintf and littleJohnLog statements. Notice that the tasks must be set up to return the command window output at task creation.

• Run the jobs and make sure you can retrieve the output manually as well as using the getOutput(job) function.

>> at = get(j,'Tasks');

>> out = get(at,'CommandWindowOutput');

>> getOutput(j);

Page 67: Running Parallel Simulations and Enabling Science Gateways ...

Debug Lab – Intro

• The Traveling Salesman Problem is a classic minimization problem. A salesman has a fixed set of cities that he must visit (each only one time). What order of city visits will minimize the total distance travelled?

– ga_run.m solves this problem using a genetic algorithm (GA) on the airport dataset that we worked with in the file transfer lab. This is a relatively small dataset with about 150 locations.

– ga_run2.m is a buggy version that solves the problem for a larger dataset (cities.txt).

Page 68: Running Parallel Simulations and Enabling Science Gateways ...

Debug Lab – Instructions

• Examine the output from getErrors and getOutput in order to find and fix the problems with ga_run2.m and help our intrepid salesman out.

• The functions defined in the two .m files take the same arguments and return the same outputs, so ga_submit.m should not need to be modified, except to change the name of the function in createTask.