Logic Networks on the Grid: Handling 15 Million Jobs
Jan Bot, Delft Bioinformatics Lab
Delft University of Technology, 07-06-10
Overview
• Explanation of the application
• Challenges for the grid
• Custom grid solution design & implementation
• More challenges (aka problems)
• Adding a desktop cluster
• Errors and statistics
• Discussion
But first...
Does anybody not know what these are?
• Life Science Grid
• Grid middleware
• ToPoS
• ORM (Object Relational Mapper)
The application: overview
Input data: ~100 mouse tumors
Grid pipeline
• Prepare inputs: prepare the data for future grid runs
• Test multiple parameter settings; the output of these tests contains the 'real' data
• Choose the best parameter settings; it should still be feasible to do at least 100 permutations
• Do permutations, 10 permutations per run
Run properties
• The number of jobs per run is fairly large: 24 * 6228 = 149,472
• Run time is unpredictable due to the optimization algorithm: jobs can take anywhere between 2 seconds and 14 hours
• Outputs are small, both for the real runs and for the permutations
Middleware problems
• Scheduling:
  – This number of jobs cannot be scheduled using the normal (gLite) middleware
  – Scheduling overhead could outweigh the run time
• Bookkeeping:
  – No method of tracking this number of jobs
• Output handling:
  – No grid resource can store large numbers of small files (dCache is not an option)
  – Other solutions (such as ToPoS) are slow when retrieving output
Scheduling jobs with ToPoS
• ToPoS takes care of the first two categories of problems but presents some new challenges:
  – ToPoS does not scale beyond 10,000 jobs per pool
  – No client software exists to spread the tokens over multiple pools
Python ToPoS clients
To deal with the limitations of ToPoS, two clients were implemented (a sketch of the grid client follows below):
• Grid client:
  – uses the basic Python httplib module
  – can fetch, lock and delete tokens
  – has a generator to transparently handle tokens in multiple pools
• Local client:
  – uses the more advanced urllib2 module
  – can create and delete pools, spread tokens over multiple pools, delete all locks in a pool, gather ToPoS statistics, etc.
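The sketch below illustrates the grid client's three capabilities. It is an assumption-laden illustration: the ToPoS host name, URL paths and lock parameter are placeholders, not the real ToPoS REST API; only the use of httplib and the multi-pool generator idea come from the slide.

# Minimal sketch of the grid client. Host name, URL paths and lock
# semantics are illustrative assumptions, not the actual ToPoS API.
import httplib

TOPOS_HOST = "topos.example.org"  # hypothetical ToPoS server

def fetch_token(pool):
    # Fetch (and lock) the next token; return (token_url, payload),
    # or None when the pool is empty.
    conn = httplib.HTTPSConnection(TOPOS_HOST)
    conn.request("GET", "/pools/%s/nexttoken?timeout=3600" % pool)
    resp = conn.getresponse()
    if resp.status != 200:
        return None
    return resp.getheader("content-location"), resp.read()

def delete_token(token_url):
    # Delete a finished token so no other pilot job repeats the work.
    conn = httplib.HTTPSConnection(TOPOS_HOST)
    conn.request("DELETE", token_url)
    conn.getresponse().read()

def tokens(pools):
    # Generator hiding the 10,000-tokens-per-pool limit: drain each pool
    # in turn so the caller sees one continuous stream of tokens.
    for pool in pools:
        while True:
            token = fetch_token(pool)
            if token is None:
                break
            yield token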
Dealing with the outputs
Outputs are small and well defined, so why not just flush them to a database?
Proposed solution (a sketch follows below):
• Python as language
• SQLAlchemy as ORM
• XML-RPC as communication channel
• MySQL (for now) as database
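To make the proposed stack concrete, here is a minimal sketch of what the result mapping could look like. The table name, columns and connection string are made up for illustration; only the choice of SQLAlchemy and MySQL comes from the slide.

# Sketch of a result mapping with SQLAlchemy's declarative extension.
# Table name, columns and the MySQL DSN are illustrative assumptions.
from sqlalchemy import create_engine, Column, Integer, String, DateTime
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker

Base = declarative_base()

class Result(Base):
    __tablename__ = "results"
    id = Column(Integer, primary_key=True)
    token = Column(String(64))      # which unit of work produced this output
    output = Column(String(1024))   # outputs are small and well defined
    started = Column(DateTime)
    finished = Column(DateTime)

engine = create_engine("mysql://user:password@localhost/grid_results")
Base.metadata.create_all(engine)    # create the table if it does not exist
Session = sessionmaker(bind=engine)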
Client design
The job has three layers: a Bash script starts a Python script, the Python script calls Matlab (MCR), and the results are sent to the result server (a sketch of the Python loop follows below).
Bash script:
● Set environment variables
● Fetch input data
● Make binaries executable
● Load modules
● Start the Python script
Python script, in a loop:
● Fetch token from ToPoS
● Call Matlab (MCR)
● Parse output & send to result server
Matlab (MCR):
● Perform algorithm
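A hedged sketch of the Python layer's main loop is shown below. The binary name, pool names, result-server URL and the store_result/store_error XML-RPC methods are placeholders; tokens and delete_token refer to the grid client sketched earlier.

# Sketch of the pilot loop in the Python script. Binary name, pool
# names, server URL and the XML-RPC method names are hypothetical.
import subprocess
import xmlrpclib

server = xmlrpclib.ServerProxy("http://results.example.org:8000")

for token_url, payload in tokens(["pool_a", "pool_b"]):
    # The token payload identifies the unit of work to compute.
    proc = subprocess.Popen(["./run_mcr_job", payload],
                            stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    out, err = proc.communicate()
    if proc.returncode == 0:
        server.store_result(payload, out)   # parse & upload the output
        delete_token(token_url)             # work done, release the token
    else:
        server.store_error(payload, err)    # see the error handling slides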
Application design
Design overview: clients (fetch token, do work, upload output) send results over XML-RPC to the application, which uses the ORM to flush them to the DB.
Thread model of the server (a sketch follows below):
• Listen loop: listen for incoming XML-RPC calls
• Flush loop: flush results to the database once every minute
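A minimal sketch of this two-loop thread model is given below. The port, method names and in-memory queueing are assumptions; Result and Session come from the ORM sketch earlier.

# Sketch of the result server's two loops. Port, method names and the
# in-memory queue are illustrative; Result/Session are from the ORM sketch.
import threading
import time
from SimpleXMLRPCServer import SimpleXMLRPCServer

results_buffer = []
buffer_lock = threading.Lock()

def store_result(token, output):
    # Listen loop side: just queue the incoming result in memory.
    with buffer_lock:
        results_buffer.append((token, output))
    return True

def flush_loop():
    # Flush loop: once every minute, write queued results to the DB.
    while True:
        time.sleep(60)
        with buffer_lock:
            pending = results_buffer[:]
            del results_buffer[:]
        session = Session()
        for token, output in pending:
            session.add(Result(token=token, output=output))
        session.commit()

flusher = threading.Thread(target=flush_loop)
flusher.setDaemon(True)
flusher.start()

server = SimpleXMLRPCServer(("0.0.0.0", 8000))
server.register_function(store_result)
server.serve_forever()   # listen loop: handle incoming XML-RPC calls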
Implementation & the weakest link
• Implemented in Python
• Hosted on a P4 in a broom closet in our department
• On power failure everything collapses (but that's not very likely, right?)
Getting ready to run: data replication
• Getting the data from one (remote) site is expensive
• Use data replication across all sites to minimize external traffic and divide the load over multiple SRMs
• Data replication can be done easily with the V-Browser
• Manual approach:

Register file:
lcg-cr -l lfn:///grid/lsgrid/jridder/MGtest/MG_Perm2_5_Datapack.zip MG_Perm2_5_Datapack.zip

Replicate file (in this example to Nikhef):
lcg-rep --vo lsgrid -d tbn18.nikhef.nl srm://gb-se-tud.ewi.tudelft.nl/dpm/ewi.tudelft.nl/home/lsgrid/generated/2010-05-26/file006bff9b-49ef-46bd-80cd-5b8110171557

On a WN, retrieve a local copy:
DATAPACK=lfn:/grid/lsgrid/jridder/MGtest/MG_Perm2_5_Datapack.zip
echo $VO_LSGRID_DEFAULT_SE
TDATA=`lcg-lr --vo lsgrid $DATAPACK | grep $VO_LSGRID_DEFAULT_SE`
lcg-cp --verbose $TDATA $DATAPACK
Adding a desktop cluster
• Practical (student) PCs are not doing anything at night
• Use these computers to increase computation power
• Compute at night & on weekends
• Our scenario (using ToPoS and an external output server) is ideal for testing such a cluster
• Use Condor to manage the work
Desktop cluster locations
• Two locations:
  – Drebbelweg: 250 practical PCs
  – Mekelweg: 50-100 PCs distributed throughout the building
• Different locations mean different VLANs: use two Condor queues
Problems during run
• Many jobs seemed to quit prematurely, while most of them ran fine
• Errors could be traced back to Deimos and Nikhef
• The middleware doesn't really provide statistics to the end user
• Output files cannot always be retrieved
Gathering statistics
• Add run information (e.g. start & end times) to the job output
• Add an additional XML-RPC method to capture error information
• Uploading error info is easy (a sketch follows below):
  – Use the return status of the external program
  – Use Python's internal error handling capabilities
  – All error messages (of the entire job) are located in one text file
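As an illustration, the job-side error capture could look like the sketch below. The do_work helper (a placeholder wrapping the MCR call), the store_result/store_error signatures (extended with start and end times, per the first bullet) and the error-log file name are all hypothetical.

# Sketch of error capture in the pilot job. Method signatures and the
# single error-log file name are illustrative assumptions.
import time
import traceback
import xmlrpclib

server = xmlrpclib.ServerProxy("http://results.example.org:8000")

def run_one(payload):
    start = time.time()
    try:
        output = do_work(payload)   # wraps the MCR call, raises on failure
        server.store_result(payload, output, start, time.time())
    except Exception:
        # Combine Python's own traceback with the job's single error log.
        errors = traceback.format_exc()
        try:
            errors += open("job_errors.txt").read()
        except IOError:
            pass
        server.store_error(payload, errors, start, time.time())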
Job running times (1)
Job running times (2)
One permutation run (10 permutations) takes:
• 415,140,369 seconds
• 115,316.77 hours
• 4,804.87 days
• 13.16 years
• Now, repeat 9 times (yes, that's a century)
Work done per site
Nikhef and Deimos mortality
Gathering error info
• Gathering error information on the grid is itself prone to error
• Again, work around the middleware:
  – Implement an additional XML-RPC call to gather error information
Error & Fix
• Jobs failed due to one error: "Could not access the MCR component cache"
• Fix:
export MCR_CACHE_ROOT=$( mktemp -d )
This basically tells the MCR to store all temporary information in a new tmpdir.
• Will be included in the next POC environment
Mortality after fix
Discussion
• We can schedule millions of jobs and capture their outputs on the grid; it just takes a custom solution
• Other fields (such as pattern recognition) can benefit from this solution
• Is there similar work being done?
• If not, can we design and implement a generic solution which does the same?
Thanks
Jeroen de Ridder
Roeland van Ochten
Marcel Reinders
Jeroen Engelberts
Pieter van Beek
Evert Lammerts
Jan Just Keijzer
Life Science Grid

Site       CPUs
SARA       2000
NIKHEF     5000
Philips    1500
RUG         160
Erasmus      32
Keygene      32
TU Delft     32
RUG          32
AMS          32
NKI          16
AMC          16
LUMC         16
WUR          16
UU           16
KUN          16
Total      8900
Grid Middleware
• The glue (or spaghetti) that unifies job management across clusters
• Different sites run different job scheduling applications (gina: Condor, Keygene: PBS, TU Delft: LSF, RUG: SGE, ...); the middleware presents these heterogeneous compute resources as a single resource
ToPoS: Token Pool Server
ToPoS is a pilot job framework. A pilot job is one job/thread which keeps running until all the work has been done. A 'token' represents one unit of work; tokens can be locked to prevent other jobs from doing the same work twice.
The submitter adds the work as tokens and submits the pilot jobs; each pilot job then loops:
1. Get token
2. Do work: translate token, call function, upload output
3. Delete token
Why is ToPoS needed? Problems with the grid middleware:
• Inability to deal with large numbers of jobs
• Failing jobs
• Job accounting
• Etc.
ORM: Object-Relational Mapper
• Mapper for persistent storage of objects in a database
• Saves you from having to write any DB code yourself
• Examples:
  – Python: SQLAlchemy, Storm
  – Java: Hibernate, Cayenne
  – Ruby: ActiveRecord
Why not Molgenis?
• We are familiar with Python, which already has all the tools to make this
• The design – XML-ify – generate – roll-out cycle is too cumbersome