07-06-10 Challenge the future Delft University of Technology Logic Networks on the Grid: Handling 15 Million Jobs Jan Bot, Delft Bioinformatics Lab

Jul 06, 2020

07-06-10

Challenge the future

Delft University of Technology

Logic Networks on the Grid: Handling 15 Million Jobs
Jan Bot, Delft Bioinformatics Lab

Overview

• Explanation of the application
• Challenges for the grid
• Custom grid solution design & implementation
• More challenges (aka problems)
• Adding a desktop cluster
• Errors and statistics
• Discussion

But first...

Does anybody not know what these are:
• Life Science Grid
• Grid middleware
• ToPoS
• ORM (Object Relational Mapper)

The application: overview

Input data: ~100 mouse tumors

Grid pipeline

• Prepare inputs: prepare data for future grid runs
• Test multiple parameter settings; the output of these tests contains the 'real' data
• Choose the best parameter settings; it should still be feasible to do at least 100 permutations
• Do permutations, 10 permutations per run

Run properties

• Number of jobs per run is fairly large: 24 * 6228 = 149472

• Run time is, due to the optimization algorithm, unpredictable: jobs can take anywhere between 2 seconds and 14 hours

• Outputs are small, both for the real runs and for the permutations

Middleware problems

• Scheduling:
  – This number of jobs cannot be scheduled using the normal (gLite) middleware
  – The overhead of scheduling could outweigh the run time
• Bookkeeping:
  – No method of tracking this number of jobs
• Output handling:
  – No grid resource can store large numbers of small files (dCache is not an option)
  – Other solutions (such as ToPoS) are slow when retrieving output

Scheduling jobs with ToPoS

• ToPoS takes care of the first two categories of problems but presents some new challenges:
  – ToPoS does not scale beyond 10,000 jobs per pool
  – There is no client software that facilitates spreading the tokens over multiple pools
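As a rough illustration of what spreading tokens over multiple pools involves, the sketch below splits one run's worth of tokens into chunks that stay under the per-pool limit. The function name, the pool naming scheme and the use of plain integers as tokens are assumptions made for this example; nothing here talks to a real ToPoS server.

```python
from math import ceil

MAX_TOKENS_PER_POOL = 10_000  # scaling limit quoted on the slide


def spread_tokens(tokens, pool_prefix, max_per_pool=MAX_TOKENS_PER_POOL):
    """Split a token list into pools of at most max_per_pool tokens.

    Returns a dict mapping (hypothetical) pool names to token lists.
    """
    pools = {}
    for start in range(0, len(tokens), max_per_pool):
        name = f"{pool_prefix}_{start // max_per_pool:03d}"
        pools[name] = tokens[start:start + max_per_pool]
    return pools


jobs_per_run = 24 * 6228  # 149472, from the "Run properties" slide
pools = spread_tokens([str(i) for i in range(jobs_per_run)], "run1")

print(len(pools))                                      # 15
print(ceil(jobs_per_run / MAX_TOKENS_PER_POOL))        # 15
print(len(pools["run1_014"]))                          # 9472 tokens in the last pool
```

Under these assumptions a single run already needs 15 pools, which is why client-side support for multiple pools matters.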

Python ToPoS clients

To deal with the limitations of ToPoS, two clients were implemented:

• Grid client:
  – uses Python's basic httplib module
  – can fetch, lock and delete tokens
  – has a generator to transparently handle tokens in multiple pools
• Local client:
  – uses the more advanced urllib2 module
  – can create and delete pools, spread tokens over multiple pools, delete all locks in a pool, gather ToPoS statistics, etc.
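A generator that hides the pool boundaries from the caller could look roughly like this. It is a minimal sketch, not the author's client: `fetch_next` is a hypothetical callable standing in for the HTTP request, and the in-memory `fake_server` only exists to make the example runnable.

```python
def tokens_from_pools(pool_names, fetch_next):
    """Yield (pool, token) pairs until every pool is exhausted.

    fetch_next(pool) is a hypothetical callable that returns the next
    available token of a pool, or None when the pool is empty.
    """
    for pool in pool_names:
        while True:
            token = fetch_next(pool)
            if token is None:
                break  # pool exhausted, move on to the next one
            yield pool, token


# In-memory stand-in for the token server, for illustration only.
fake_server = {"pool_a": ["t1", "t2"], "pool_b": [], "pool_c": ["t3"]}


def fake_fetch(pool):
    queue = fake_server[pool]
    return queue.pop(0) if queue else None


received = list(tokens_from_pools(["pool_a", "pool_b", "pool_c"], fake_fetch))
print(received)  # [('pool_a', 't1'), ('pool_a', 't2'), ('pool_c', 't3')]
```

The calling code simply iterates over tokens and never needs to know how many pools exist.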

Dealing with the outputs

Outputs:

• are small and well defined

• why not just flush them to a database?

Proposed solution:

• Python as language

• SQLAlchemy as ORM

• XML-RPC as communication channel

• MySQL (for now) as database

Client design

Bash script:
● Set environment variables
● Fetch input data
● Make binaries executable
● Load modules
● Start Python script

Python script:
● Loop:
  ● Fetch token from ToPoS
  ● Call Matlab (MCR)
  ● Parse output & send to result server

Matlab (MCR):
● Perform algorithm

Flow: the Bash script starts the Python script; the Python script calls Matlab and sends the results to the result server.
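The loop in the Python script could be sketched as follows. All four parameters are hypothetical hooks introduced for this example, and a Python one-liner stands in for the compiled Matlab program so that the sketch can run anywhere.

```python
import subprocess
import sys


def run_pilot(tokens, command_for, parse_output, send_result):
    """Process tokens one by one until none are left.

    tokens:        iterable of work units (e.g. from a token generator)
    command_for:   maps a token to the argv list of the external program
    parse_output:  turns the program's stdout into a result
    send_result:   uploads (token, result), e.g. through XML-RPC
    Returns the number of tokens that were processed successfully.
    """
    done = 0
    for token in tokens:
        proc = subprocess.run(command_for(token), capture_output=True, text=True)
        if proc.returncode != 0:
            continue  # skip failed work so it can be retried later
        send_result(token, parse_output(proc.stdout))
        done += 1
    return done


def square_command(token):
    # Stand-in for the Matlab (MCR) call: squares the number in the token.
    return [sys.executable, "-c", f"print({int(token)} ** 2)"]


results = {}
n_done = run_pilot(["2", "3"], square_command,
                   lambda out: int(out.strip()), results.__setitem__)
print(n_done, results)  # 2 {'2': 4, '3': 9}
```

Passing the external command and the upload step in as functions keeps the loop itself independent of Matlab and of the result server.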

Application design

Design overview: DB, ORM, App, XML-RPC, Clients

Thread model (main):
• Flush loop (DB code): flush to the DB once every minute
• Listen loop (XML-RPC server): listen for incoming calls

Client: fetch token, do work, upload output
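The two-loop design (a listen loop that accepts calls and a flush loop that writes to the database once a minute) could be sketched as below. To stay self-contained, this example swaps in the standard-library sqlite3 module for SQLAlchemy and MySQL; the class, method and table names are assumptions made for illustration.

```python
import sqlite3
import threading
from xmlrpc.server import SimpleXMLRPCServer


class ResultBuffer:
    """Collects results in memory and writes them to the DB in batches."""

    def __init__(self, db_path):
        self._lock = threading.Lock()
        self._pending = []
        self._db = sqlite3.connect(db_path, check_same_thread=False)
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS results (token TEXT, value TEXT)")

    def submit(self, token, value):
        """XML-RPC entry point: cheap, only appends to the in-memory list."""
        with self._lock:
            self._pending.append((token, value))
        return True

    def flush(self):
        """Write all pending results in one transaction; return the row count."""
        with self._lock:
            batch, self._pending = self._pending, []
        with self._db:  # commits on success
            self._db.executemany("INSERT INTO results VALUES (?, ?)", batch)
        return len(batch)

    def count(self):
        return self._db.execute("SELECT COUNT(*) FROM results").fetchone()[0]


def serve(buffer, host="localhost", port=8000, flush_interval=60.0):
    """Listen loop in the calling thread, flush loop in a background thread."""
    stop = threading.Event()

    def flush_loop():
        while not stop.wait(flush_interval):
            buffer.flush()

    threading.Thread(target=flush_loop, daemon=True).start()
    server = SimpleXMLRPCServer((host, port), allow_none=True, logRequests=False)
    server.register_function(buffer.submit, "submit")
    try:
        server.serve_forever()
    finally:
        stop.set()
        buffer.flush()


# The buffer can be exercised without starting the network server:
buf = ResultBuffer(":memory:")
buf.submit("token-1", "0.42")
buf.submit("token-2", "0.17")
flushed = buf.flush()
print(flushed, buf.count())  # 2 2
```

Batching the inserts means a burst of incoming calls costs one transaction per minute rather than one per result.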

Implementation & the weakest link

• Implemented in Python

• Hosted on a P4 in a broom closet in our department

• On power failure: everything collapses (but that's not very likely, right?)

Getting ready to run: data replication

• Getting the data from one (remote) site is expensive
• Use data replication across all sites to minimize external traffic and divide the load over multiple SRMs
• Data replication can be done easily with the V-Browser
• Manual approach:

Register file:

lcg-cr -l lfn:///grid/lsgrid/jridder/MGtest/MG_Perm2_5_Datapack.zip MG_Perm2_5_Datapack.zip

Replicate file (in this example to nikhef):

lcg-rep --vo lsgrid -d tbn18.nikhef.nl srm://gb-se-tud.ewi.tudelft.nl/dpm/ewi.tudelft.nl/home/lsgrid/generated/2010-05-26/file006bff9b-49ef-46bd-80cd-5b8110171557

On a WN, retrieve a local copy:

DATAPACK=lfn:/grid/lsgrid/jridder/MGtest/MG_Perm2_5_Datapack.zip
echo $VO_LSGRID_DEFAULT_SE
TDATA=`lcg-lr --vo lsgrid $DATAPACK | grep $VO_LSGRID_DEFAULT_SE`
lcg-cp --verbose $TDATA $DATAPACK
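The replica selection step (list all replicas, keep the one on the site's default storage element) can also be expressed in Python. This is a sketch under assumptions: the function name is hypothetical, the replica paths are placeholders rather than real files, and it matches on the exact host name where the shell version uses a substring match with grep.

```python
from urllib.parse import urlparse


def pick_local_replica(replica_urls, default_se):
    """Return the first replica hosted on default_se, or None if there is none."""
    for url in replica_urls:
        if urlparse(url).hostname == default_se.lower():
            return url
    return None


# Hypothetical replica listing; the paths are placeholders.
replicas = [
    "srm://tbn18.nikhef.nl/placeholder/path/datapack.zip",
    "srm://gb-se-tud.ewi.tudelft.nl/placeholder/path/datapack.zip",
]

local = pick_local_replica(replicas, "gb-se-tud.ewi.tudelft.nl")
print(local)  # srm://gb-se-tud.ewi.tudelft.nl/placeholder/path/datapack.zip
```

Returning None when no local replica exists lets the caller fall back to a remote copy explicitly.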

Adding a desktop cluster

• Practical (student) PCs are not doing anything at night
• Use these computers to increase computation power
• Compute at night & in weekends
• Our scenario (using ToPoS and an external output server) is ideal for testing such a cluster
• Use Condor to manage the work

Desktop cluster locations

• Two locations:
  – Drebbelweg: 250 practical PCs
  – Mekelweg: 50 – 100 PCs distributed throughout the building
• Different locations mean different VLANs: use two Condor queues

Problems during run

• Many jobs seemed to quit prematurely while most of them ran fine
• Errors could be traced back to Deimos and Nikhef
• The middleware doesn't really provide statistics to the end-user
• Output files cannot always be retrieved

Gathering statistics

• Add run information (e.g. start & end times) to the job output
• Add an additional XML-RPC method to capture error information
• Uploading error info is easy:
  – Use return status of external program
  – Use Python's internal error handling capabilities
  – All error messages (of the entire job) are located in one text file
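One way to combine the return status of the external program with Python's own exception handling is sketched below. The function and the `report_error` hook are hypothetical; in the setup described here the hook would be the extra XML-RPC method, while the example simply appends to a list.

```python
import subprocess
import sys
import time
import traceback


def run_and_report(argv, report_error):
    """Run an external program, recording timing and any failure.

    report_error(message) is a hypothetical upload hook.
    Returns a dict with start time, end time and exit status.
    """
    info = {"start": time.time(), "status": None}
    try:
        proc = subprocess.run(argv, capture_output=True, text=True)
        info["status"] = proc.returncode
        if proc.returncode != 0:
            report_error(f"exit status {proc.returncode}: {proc.stderr.strip()}")
    except Exception:
        # The program could not even be started: ship the traceback as text.
        report_error(traceback.format_exc())
    finally:
        info["end"] = time.time()
    return info


errors = []
ok = run_and_report([sys.executable, "-c", "pass"], errors.append)
bad = run_and_report([sys.executable, "-c", "import sys; sys.exit(3)"],
                     errors.append)
missing = run_and_report(["/nonexistent/program"], errors.append)

print(ok["status"], bad["status"], missing["status"])  # 0 3 None
print(len(errors))                                     # 2
```

Because every path through the function records an end time, run-time statistics stay available even for jobs that fail.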

Job Running times (1)

Job running times (2)

One permutation run (10 permutations) takes:
• 415140369 seconds
• 115316.77 hours
• 4804.87 days
• 13.16 years

• Now, repeat 9 times (yes, that's a century)
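The unit conversions on this slide can be checked with a few lines of arithmetic. The only assumptions are a 365-day year and that "repeat 9 times" means ten runs in total.

```python
SECONDS_PER_RUN = 415_140_369  # one permutation run, figure from the slide

hours = SECONDS_PER_RUN / 3600
days = hours / 24
years = days / 365          # assuming a 365-day year
total_years = years * 10    # the first run plus nine repeats

print(f"{hours:.2f} hours, {days:.2f} days, {years:.2f} years")
# 115316.77 hours, 4804.87 days, 13.16 years
print(f"{total_years:.1f} years for ten runs")
# 131.6 years for ten runs
```

Ten runs therefore add up to roughly 130 years of compute time, which matches the "century" remark.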

Work done per site

Nikhef and Deimos mortality

Gathering error info

• Gathering error information on the grid is prone to error

• Again, work around the middleware:
  – Implement an additional XML-RPC call to gather error information

Error & Fix

• Jobs failed due to one error: "Could not access the MCR component cache"
• Fix:

export MCR_CACHE_ROOT=$( mktemp -d )

This basically tells the MCR to store all temporary information in a new tmpdir
• Will be included in the next POC environment
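When the MCR is launched from the Python script rather than from the shell, the same fix can be applied by handing the child process its own environment. This is a sketch: the helper name is hypothetical, and a small Python child process stands in for the MCR binary to show that the variable arrives.

```python
import os
import subprocess
import sys
import tempfile


def mcr_environment():
    """Return a copy of the environment with a fresh, private MCR cache dir."""
    env = dict(os.environ)
    env["MCR_CACHE_ROOT"] = tempfile.mkdtemp()
    return env


env = mcr_environment()

# Stand-in for the MCR binary: a child process that echoes the variable.
child = subprocess.run(
    [sys.executable, "-c", "import os; print(os.environ['MCR_CACHE_ROOT'])"],
    env=env, capture_output=True, text=True,
)
print(child.stdout.strip() == env["MCR_CACHE_ROOT"])  # True
```

Creating a new directory per job keeps concurrent jobs on the same worker node from sharing one cache location.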

Mortality after fix

Discussion

• We can schedule millions of jobs and capture their outputs on the grid; it just takes a custom solution
• Other fields (such as pattern recognition) can benefit from this solution
• Is there similar work being done?
• If not, can we design and implement a generic solution which does the same?

Thanks

Jeroen de Ridder, Roeland van Ochten, Marcel Reinders

Jeroen Engelberts, Pieter van Beek, Evert Lammerts

Jan Just Keijzer

Life Science Grid

Site       CPUs
SARA       2000
NIKHEF     5000
Philips    1500
RUG         160
Erasmus      32
Keygene      32
TU Delft     32
RUG          32
AMS          32
NKI          16
AMC          16
LUMC         16
WUR          16
UU           16
kun          16
Total      8900

Grid Middleware

• The glue (or spaghetti) that unifies job management across clusters

Middleware

Different sites with different job scheduling applications: gina: condor; keygene: pbs; TU Delft: lsf; ...; RUG: SGE

Heterogeneous compute resources

ToPoS: Token Pool Server

ToPoS: a pilot job framework.

Pilot job: one job / thread which keeps running until all the work has been done.

A 'token' represents one unit of work. Tokens can be locked to prevent other jobs from doing the same work twice.

Workflow: add work to ToPoS and submit jobs; each job then fetches work:
1. Get token
2. Do work: translate token, call function, upload output
3. Delete token

Why is ToPoS needed:
• Problems with grid middleware:
  – Inability to deal with large amounts of jobs
  – Failing jobs
  – Job accounting
  – Etc.
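The get, lock and delete cycle can be illustrated with a toy in-memory pool. This is a model for explanation only, not the ToPoS implementation or its API: the class, its methods and the idea that a lock expires after a timeout are assumptions made for the example.

```python
import time


class TokenPool:
    """Toy token pool: a token is handed out only while it is not locked."""

    def __init__(self, tokens, lock_timeout=60.0):
        self._expiry = dict.fromkeys(tokens, 0.0)  # token -> lock expiry time
        self._timeout = lock_timeout

    def next_token(self, now=None):
        """Lock and return the first unlocked token, or None if none is free."""
        now = time.time() if now is None else now
        for token, expiry in self._expiry.items():
            if expiry <= now:
                self._expiry[token] = now + self._timeout
                return token
        return None

    def delete(self, token):
        """Remove a token once its work has been completed."""
        self._expiry.pop(token, None)


pool = TokenPool(["a", "b"], lock_timeout=10)
first = pool.next_token(now=0)    # job 1 takes "a"
second = pool.next_token(now=1)   # job 2 takes "b"
third = pool.next_token(now=2)    # everything is locked
pool.delete(first)                # job 1 finishes
retry = pool.next_token(now=20)   # job 2 died; its lock on "b" has expired

print(first, second, third, retry)  # a b None b
```

In this model a job that dies without deleting its token does not lose the work: once the lock expires, another pilot job picks the token up again.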

ORM: Object-Relational Mapper

• Mapper for persistent storage of objects into a database

• Saves you from having to write any DB code yourself

• Examples:

– Python: SQLAlchemy, Storm

– Java: Hibernate, Cayenne

– Ruby: ActiveRecord

Why not Molgenis?

• Familiar with Python, which already has all the tools to make this

• Design – XML-ify – generate – roll-out: too cumbersome