Introduction to research computing using Condor
Ian C. Smith, Advanced Research Computing, University of Liverpool
Feb 25, 2016
Overview
what is Condor and what can it be used for?
typical Condor pool operation
University of Liverpool Condor Pool
support for MATLAB and R applications
some research computing examples
quick introduction to UNIX with a walk-through example
What is Condor?
a specialized system for delivering High Throughput Computing
a harvester of unused computing resources
developed by Computer Science Dept at University of Wisconsin in the late 1980s
free and (now) open source software
widely used in academia and increasingly in industry
available for many platforms: Linux, Solaris, AIX, Windows XP/Vista/7, Mac OS
Types of Condor application
typically: large numbers of independent calculations ("pleasantly parallel")
data parallel applications: split large datasets into smaller parts and process them in parallel
  biological sequence analysis (e.g. BLAST)
  processing of field trial data
optimisation problems
  microprocessor design and testing
applications based on Monte Carlo methods
  radiotherapy treatment analysis
  epidemiological studies
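The data-parallel pattern listed above can be sketched in a few lines. This is only an illustration in Python and not part of the Condor tools: the function name `partition`, the record list and the chunk size are all assumptions made for the example. Each chunk would become the input of one independent job.

```python
def partition(items, chunk_size):
    """Split a sequence into consecutive chunks of at most chunk_size items."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]


# Hypothetical dataset of 10 records, split into jobs of up to 4 records each.
records = list(range(10))
chunks = partition(records, 4)
for job_index, chunk in enumerate(chunks):
    print(f"job {job_index}: {len(chunk)} records")
```

Because no chunk depends on another, the jobs can run in any order and on any execute host.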
A "typical" Condor pool
[Diagram: a desktop PC, a Condor server and a set of execute hosts, shown in four stages]
1. login and upload input data (desktop PC to Condor server)
2. jobs (Condor server to execute hosts)
3. results (execute hosts to Condor server)
4. download results (Condor server to desktop PC)
University of Liverpool Condor Pool
contains around 700 classroom PCs running the CSD Managed Windows 7 Service (mostly 64 bit from next year)
most have 2.33 GHz Intel Core 2 processors with 2 GB RAM, 80 GB disk, configured with two job slots per PC (total of 1400 job slots)
single job submission point for Condor jobs provided by powerful UNIX server
jobs continue to run while classroom PCs are unused but ...
if load (or memory use) becomes significant, job will be killed and usually any results will be lost (job will start again from scratch)
tools provided for running large numbers of MATLAB and R jobs
Condor caveats
only suitable for non-interactive applications
no communication between jobs possible
all files needed by application must be present on local disk
shorter jobs more likely to run to completion (10-20 min seems to work best)
long running jobs can be run if save/restore mechanism (checkpointing) is built into them
tricky to begin with but usually worth the initial effort
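The save/restore idea behind checkpointing can be sketched as follows. This is a minimal illustration in Python under stated assumptions, not the mechanism used by any particular application: the file name `checkpoint.json`, the function `run` and the `stop_after` argument (which simulates an eviction) are all hypothetical. Whether a checkpoint file survives a real eviction depends on how the pool transfers files, which is not covered here.

```python
import json
import os
import tempfile


def run(total_steps, checkpoint_path, stop_after=None):
    """Sum the integers 0..total_steps-1, saving progress after every step.

    stop_after simulates an eviction: the function returns early after that
    many steps in this call, leaving the checkpoint file behind.
    """
    # Restore saved state if a checkpoint exists, otherwise start from scratch.
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            state = json.load(f)
    else:
        state = {"next_step": 0, "total": 0}

    steps_this_call = 0
    while state["next_step"] < total_steps:
        if stop_after is not None and steps_this_call >= stop_after:
            return None  # "evicted" before finishing
        state["total"] += state["next_step"]
        state["next_step"] += 1
        steps_this_call += 1
        # Write to a temporary file first, then rename, so that a kill
        # during the write cannot leave a half-written checkpoint.
        tmp_path = checkpoint_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump(state, f)
        os.replace(tmp_path, checkpoint_path)
    return state["total"]


workdir = tempfile.mkdtemp()
path = os.path.join(workdir, "checkpoint.json")
first = run(10, path, stop_after=4)   # interrupted after 4 steps
second = run(10, path)                # resumes at step 4 and finishes
print(first, second)
```

The second call repeats none of the work done by the first, which is the property that makes long running jobs practical on a pool where jobs can be killed at any time.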
Running MATLAB jobs under Condor
need to create standalone application from M-file(s) using MATLAB compiler
standalone application can run without a MATLAB license
run-time libraries still need to be accessible to MATLAB jobs
nearly all toolbox functions available to standalone applications
simple (but powerful) file input/output makes checkpointing easier
tools available to simplify job submission - see Liverpool Condor website for more information
Running R jobs under Condor
limited support at present
R is installed on-the-fly as part of the job
currently only R version 2.6.2 available with standard packages
tools available to simplify job submission
checkpointing may be possible for long running jobs
Personalised Medicine example
project is a Genome-Wide Association Study
aims to identify genetic predictors of response to anti-epileptic drugs
try to identify regions of the human genome that differ between individuals (referred to as SNPs)
800 patients genotyped at 500 000 SNPs along the entire genome
test statistically the association between SNPs and outcomes (e.g. time to withdrawal of drug due to adverse effects)
very large data-parallel problem using R: ideal for Condor
divide datasets into small partitions so that individual jobs run for 15-30 minutes
batch of 26 chromosomes (2 600 jobs) required ~ 5 hours wallclock time on Condor but ~ 5 weeks on a single PC
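The figures quoted above can be cross-checked with some simple arithmetic. The sketch below is in Python and uses only the numbers on the slide. It assumes that the single-PC estimate corresponds to running the same 2 600 jobs one after another, which the slide does not state explicitly.

```python
jobs = 2600
chromosomes = 26
condor_hours = 5                 # ~5 hours wallclock on Condor
single_pc_hours = 5 * 7 * 24     # ~5 weeks on a single PC

jobs_per_chromosome = jobs // chromosomes
speedup = single_pc_hours / condor_hours
# Mean run time per job implied by the single-PC figure, in minutes
# (assumes the jobs would run strictly one at a time on that PC).
mean_job_minutes = single_pc_hours * 60 / jobs

print(jobs_per_chromosome, speedup, round(mean_job_minutes, 1))
```

Under that assumption the implied mean job length falls inside the 15-30 minute target, and the overall speed-up is roughly two orders of magnitude.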
Radiotherapy example
large 3rd party application code which simulates photon beam radiotherapy treatment using Monte Carlo methods
tried running simulation on 56 cores of high performance computing cluster but no progress after 5 weeks
divided problem into 250 then 5 000 and eventually 50 000 Condor jobs
required ~ 2 600 days of cpu time (equivalent to ~ 3.5 years on dual core PC)
Condor simulation completed in less than one week
average run time was ~ 70 min
only ~ 10 % of compute time wasted due to evictions
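The radiotherapy figures can be checked against each other in the same way. This Python sketch uses only the quoted numbers. It treats "less than one week" as an upper bound of seven days, and it is an assumption that the 2 600 days of CPU time refers to the final 50 000-job run. The slide does not say whether the wasted time is included in that total.

```python
cpu_days = 2600        # total CPU time quoted
jobs = 50_000          # final number of Condor jobs
cores_per_pc = 2       # dual core PC
wallclock_days = 7     # "less than one week", taken as an upper bound

# Mean run time per job implied by the total CPU time, in minutes.
mean_job_minutes = cpu_days * 24 * 60 / jobs
# Time for one dual core PC to do the same work with both cores busy.
dual_core_years = cpu_days / cores_per_pc / 365
# Lower bound on the average number of job slots kept busy.
min_mean_busy_slots = cpu_days / wallclock_days

print(round(mean_job_minutes, 2), round(dual_core_years, 2), round(min_mean_busy_slots))
```

The implied mean job length is close to the quoted ~ 70 min, and the dual core equivalent agrees with the quoted ~ 3.5 years.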
Condor service prerequisites
will need a Sun UNIX service account (contact CSD [email protected]) and a Condor account (http://www.liv.ac.uk/csd/registration/eScienceform.pdf)
to log in to the Condor server:
  on MWS use PuTTY: Install University Applications | Internet | PuTTY 0.60
  Mac/Linux: open terminal window and use ssh
  off campus: use Apps Anywhere (PuTTY is in Utilities group)
to upload/download files to/from the Condor server:
  on MWS use CoreFTP Lite: Install University Applications | Internet | CoreFTP LE 2.1
  Mac/Linux: open terminal window, use sftp/scp
  off campus: need to use virtual private network (VPN), then FTP
PuTTY login [screenshots not reproduced]
CoreFTP Lite [screenshots not reproduced]
CoreFTP Lite: download files [screenshots not reproduced]
Condor server directory tree
/ (or 'root')
    /usr   /bin   /sbin   /tmp   /home   /condor_data
/home (login 'home' directories)
    /home/fred   /home/smithic   /home/jim
/condor_data ('home' directories for Condor)
    /condor_data/smithic   /condor_data/jim
MATLAB Condor example
calculate the sum of p matrix-matrix products:
each product calculation is independent and can be performed in parallel
MATLAB M-file (product.m):
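function product
load input.mat;              % load matrices A and B from the job's input file
C=A*B;                       % form one matrix-matrix product
save( 'output.mat', 'C' );   % write the product to the job's output file
quit;                        % exit so that the job terminates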
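The overall calculation can be sketched outside MATLAB to show where the parallelism lies. The Python example below is only an illustration: the helper names `matmul` and `matadd` and the small 2x2 input matrices are assumptions, and the final summation step is not shown in the M-file, which computes a single product per job.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    if len(a[0]) != len(b):
        raise ValueError("inner dimensions do not match")
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def matadd(x, y):
    """Add two matrices of the same shape element by element."""
    return [[xv + yv for xv, yv in zip(xr, yr)] for xr, yr in zip(x, y)]


# Hypothetical inputs: p = 3 pairs (A, B) of 2x2 matrices, one pair per job.
pairs = [
    ([[1, 0], [0, 1]], [[1, 2], [3, 4]]),
    ([[2, 0], [0, 2]], [[1, 1], [1, 1]]),
    ([[0, 1], [1, 0]], [[5, 6], [7, 8]]),
]

# Step 1: each product depends only on its own pair, so these calls could
# run as separate jobs in any order.
products = [matmul(a, b) for a, b in pairs]

# Step 2: a final, cheap reduction adds the collected products together.
total = products[0]
for c in products[1:]:
    total = matadd(total, c)
print(total)
```

Only step 1 is worth distributing. The reduction in step 2 is cheap and would run once after all the output files have been collected.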
Job submission example
[smithic@ulgp5 multiple]$ cd /condor_data/smithic    #change directory
[smithic@ulgp5 smithic]$ tar xf /opt1/condor/examples/handson.tar    #get examples
[smithic@ulgp5 smithic]$ cd matlab    #now in /condor_data/smithic/matlab
[smithic@ulgp5 matlab]$ ls    #list files
input0.mat  input2.mat  input4.mat  product
input1.mat  input3.mat  product.m
[smithic@ulgp5 matlab]$ matlab_build product.m    #create standalone executable
Submitting job(s).
1 job(s) submitted to cluster 503.
[smithic@ulgp5 matlab]$ condor_q    #get Condor queue status
-- Schedd: [email protected] : <138.253.100.17:42003>
 ID      OWNER     SUBMITTED     RUN_TIME ST PRI SIZE CMD
 503.0   smithic   6/7  15:19   0+00:00:10 R  0   0.0  runscript.bat wrap
1 jobs; 0 idle, 1 running, 0 held
[smithic@ulgp5 matlab]$ condor_q    #job has finished when gone from queue
-- Schedd: [email protected] : <138.253.100.17:42003>
 ID      OWNER     SUBMITTED     RUN_TIME ST PRI SIZE CMD
0 jobs; 0 idle, 0 running, 0 held
[smithic@ulgp5 matlab]$ ls
input0.mat  input2.mat  input4.mat  product.bat  product.exe.manifest  product.sub
input1.mat  input3.mat  product     product.exe  product.m
[smithic@ulgp5 matlab]$ cat product    #display file contents
executable=product.exe
indexed_input_files=input.mat
indexed_output_files=output.mat
total_jobs=5
[smithic@ulgp5 matlab]$ matlab_submit product    #submit multiple Matlab jobs
Submitting job(s).....
5 job(s) submitted to cluster 511.
[smithic@ulgp5 matlab]$ condor_q    #get status of jobs
-- Schedd: [email protected] : <138.253.100.17:42003>
 ID      OWNER     SUBMITTED     RUN_TIME ST PRI SIZE CMD
 511.0   smithic   6/7  16:01   0+00:00:02 R  0   0.0  product.bat produc
 511.1   smithic   6/7  16:01   0+00:00:02 R  0   0.0  product.bat produc
 511.2   smithic   6/7  16:01   0+00:00:02 R  0   0.0  product.bat produc
 511.3   smithic   6/7  16:01   0+00:00:02 R  0   0.0  product.bat produc
 511.4   smithic   6/7  16:01   0+00:00:02 R  0   0.0  product.bat produc
5 jobs; 0 idle, 5 running, 0 held
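The contents of the `product` file, together with the file listings in this walk-through, suggest that each job index is inserted before the file extension (input0.mat to input4.mat for five jobs). The Python sketch below reproduces that naming rule. It is an inference from the listings, not a description of how `matlab_submit` is implemented, and the helper names `parse` and `indexed_name` are hypothetical.

```python
import os

# Contents of the job description file, copied from the `cat product` output.
job_description = """\
executable=product.exe
indexed_input_files=input.mat
indexed_output_files=output.mat
total_jobs=5
"""


def parse(text):
    """Turn key=value lines into a dictionary, ignoring blank lines."""
    settings = {}
    for line in text.splitlines():
        if line.strip():
            key, value = line.split("=", 1)
            settings[key.strip()] = value.strip()
    return settings


def indexed_name(filename, index):
    """Insert the job index before the extension: input.mat -> input3.mat."""
    stem, ext = os.path.splitext(filename)
    return f"{stem}{index}{ext}"


settings = parse(job_description)
total_jobs = int(settings["total_jobs"])
inputs = [indexed_name(settings["indexed_input_files"], i) for i in range(total_jobs)]
outputs = [indexed_name(settings["indexed_output_files"], i) for i in range(total_jobs)]
print(inputs)
print(outputs)
```

Under this rule the M-file can always refer to plain `input.mat` and `output.mat`, while the files on the server keep one numbered copy per job.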
[smithic@ulgp5 matlab]$ condor_q    #some jobs completed, one still running
-- Schedd: [email protected] : <138.253.100.17:42003>
 ID      OWNER     SUBMITTED     RUN_TIME ST PRI SIZE CMD
 511.0   smithic   6/7  16:01   0+00:00:25 R  0   0.0  product.bat produc
1 jobs; 0 idle, 1 running, 0 held
[smithic@ulgp5 matlab]$ condor_q    #all jobs complete
-- Schedd: [email protected] : <138.253.100.17:42003>
 ID      OWNER     SUBMITTED     RUN_TIME ST PRI SIZE CMD
0 jobs; 0 idle, 0 running, 0 held
[smithic@ulgp5 matlab]$ ls    #check output files
input0.mat  input3.mat   output1.mat  output4.mat  product.exe           product.sub
input1.mat  input4.mat   output2.mat  product      product.exe.manifest
input2.mat  output0.mat  output3.mat  product.bat  product.m
[smithic@ulgp5 matlab]$ zip output.zip output*.mat    #bundle output files
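Before bundling results it can be useful to confirm that every job produced its output file, since a wildcard such as `output*.mat` silently skips anything that is missing. The Python sketch below shows one way to do that check and then build the archive. The function names `missing_outputs` and `bundle` are hypothetical, and the demonstration uses empty placeholder files in a temporary directory, not real MATLAB output.

```python
import os
import tempfile
import zipfile


def missing_outputs(directory, total_jobs, stem="output", ext=".mat"):
    """Return the expected output file names that are not present."""
    expected = [f"{stem}{i}{ext}" for i in range(total_jobs)]
    return [name for name in expected
            if not os.path.exists(os.path.join(directory, name))]


def bundle(directory, total_jobs, archive_name="output.zip"):
    """Zip all output files, refusing to proceed if any are missing."""
    missing = missing_outputs(directory, total_jobs)
    if missing:
        raise FileNotFoundError(f"missing output files: {missing}")
    archive_path = os.path.join(directory, archive_name)
    with zipfile.ZipFile(archive_path, "w") as archive:
        for i in range(total_jobs):
            name = f"output{i}.mat"
            archive.write(os.path.join(directory, name), arcname=name)
    return archive_path


# Demonstration with empty placeholder files in a temporary directory.
demo_dir = tempfile.mkdtemp()
for i in range(5):
    open(os.path.join(demo_dir, f"output{i}.mat"), "w").close()
archive_path = bundle(demo_dir, 5)
with zipfile.ZipFile(archive_path) as archive:
    names = sorted(archive.namelist())
print(names)
```

Raising an error when files are missing makes it obvious that some jobs need to be rerun before the results are downloaded.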
Summary
Condor can speed up processing by running large numbers of jobs in parallel
shorter jobs work best but can deal with jobs of arbitrary length
user-written codes easiest to run (MATLAB, R, C/C++, FORTRAN etc)
commercial 3rd party software may work
needs to run on standard MWS PC without user interaction
all Condor jobs submitted via central UNIX server
Further Information
Condor: http://www.liv.ac.uk/e-science/condor
other research computing services: http://www.liv.ac.uk/csd/research/