Jan 08, 2016
November 14, 2008
1
Grid Usecase BioMed
How to get biologists to compute
Surfnet / Grid Tutorial
Jan Bot
November 14, 2008 2
Who am I
• Graduated March 2008
• Bioinformatics group TU Delft
• BioAssist programmer
• Happy grid user
• Working on the grid as part of the TU Delft – NKI collaboration
• Chris Klijn: human copy number variation
• Jeroen de Ridder: viral insertions in mice
November 14, 2008 3
DNA & Genes
November 14, 2008 4
Copy number variation & Viral insertions
• Pieces of DNA can be added, deleted, or moved
• Viruses can insert themselves into a genome
• This causes all kinds of problems, for example cancer:
• Multiple mutations needed before a tumor starts to develop
November 14, 2008 5
aCGH data
• Array comparative genomic hybridization
• Compare DNA of sample against a reference
November 14, 2008 6
KCSmart: Datasets
• Leukaemia & lymphoma cell-lines
• aCGH data (10k affy) from the Sanger Institute
• Same samples measured on 1.8M SNP6
• 105 cell-line samples
• About 350 MB of data
November 14, 2008 7
KCSmart: Overview
For each tumor we construct a pair-wise space by comparing each chromosome arm with each other chromosome arm. A point in this space is a pair of genomic loci.
November 14, 2008 8
KCSmart: Compute Co-occurrence Score
Using a 2d Gaussian kernel we want to look for local enrichment of high scores in the pairwise space.
Peaks in the convolved space allow us to identify pairs of genomic loci that are co-aberrated to a certain degree.
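As a rough illustration of the kernel smoothing described on this slide, here is a minimal sketch in plain Python. It is not the KCSmart implementation: the function names, the toy points, and the choice to evaluate the kernel directly at each grid location (instead of a true convolution) are all assumptions made for the example.

```python
import math

def smoothed_score(points, gx, gy, sigma):
    """Gaussian-weighted sum of pair scores around the grid location (gx, gy).

    points: iterable of (x, y, score); x and y are genomic positions in bp
    on the two chromosome arms being compared (hypothetical input layout).
    """
    total = 0.0
    for x, y, score in points:
        d2 = (x - gx) ** 2 + (y - gy) ** 2
        total += score * math.exp(-d2 / (2.0 * sigma ** 2))
    return total

def find_peak(points, grid_x, grid_y, sigma):
    """Return (score, gx, gy) for the grid location with the highest smoothed score."""
    return max(
        (smoothed_score(points, gx, gy, sigma), gx, gy)
        for gx in grid_x
        for gy in grid_y
    )

# Toy data: three pair points cluster near (100 kbp, 300 kbp), one lies far away.
points = [(100_000, 300_000, 1.0), (110_000, 290_000, 1.0),
          (95_000, 305_000, 1.0), (900_000, 50_000, 1.0)]
grid = range(0, 1_000_001, 50_000)
peak = find_peak(points, grid, grid, sigma=50_000)
```

In this toy example the peak lands on the grid location closest to the cluster, which is the kind of local enrichment the co-occurrence score looks for.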
November 14, 2008 9
KCSmart: Parameters (1)
Chromosome arms:
• Natural split at the centromere to better divide the workload
• Not all p-arms contain measurements (39 out of 44)
Resolution:
• 'Grid points' are fixed on the genome
• Location of the grid points, and thus the computational complexity, doesn't change when using different datasets
• Measurements are allocated to grid points
• Tried this for [20, 25, 35, 50] kbp
• Choice based on the best resolution which still fits in memory
[Figure labels: 10k data, Grid, 1.8M data]
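The idea of fixed grid points can be sketched as follows. The slide does not say how measurements are allocated, so the nearest-grid-point rule, the function name, and the probe values below are assumptions for illustration only.

```python
from collections import defaultdict

def allocate_to_grid(measurements, resolution):
    """Map each (position_bp, value) measurement to its nearest fixed grid point.

    Grid points sit at multiples of `resolution` (in bp), so their locations
    do not depend on which dataset the measurements come from.
    """
    grid = defaultdict(list)
    for position, value in measurements:
        index = (position + resolution // 2) // resolution  # nearest grid index
        grid[index * resolution].append(value)
    return dict(grid)

# Hypothetical probes: two fall near 20 kbp, one near 60 kbp.
probes = [(19_000, 0.4), (21_500, 0.6), (58_000, -1.2)]
allocated = allocate_to_grid(probes, resolution=20_000)
```

Because the grid depends only on the resolution, a denser dataset adds more values per grid point but does not add grid points.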
November 14, 2008 10
KCSmart: Parameters (2)
Scale:
• The kernel width in base pairs
• Capture changes on different scales: [0.2, 2, 10, 20] Mbp (6 sigma)
Amplification type:
• Either insertion or deletion
• All possible combinations for two chromosomes: [ins:ins, del:del, ins:del, del:ins] (ins = amplification, del = loss)
November 14, 2008 11
KCSmart: Getting the Parameters Right
• 10k data to estimate memory consumption and running times
• Find best resolution & scale that still fit in 2.3 GB of memory
• Final Parameters:
• chr = [1.0, 1.5, ..., 22.5]
• res = [20000]
• scale = [0.2, 2, 10, 20]
• amp = ['ins-ins', 'del-del', 'ins-del', 'del-ins']
• Roughly 10k jobs (without the jobs required for finding the correct parameter settings!)
• All parameters generated using a python script
• In a jdl it looks like:
Parameters={"19.5 15.5 2 1 20000", "2.5 4.0 2 1 20000"};
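The original generator script is not shown, so the following is only a sketch of how such a parameter list might be produced. The field order (arm, arm, scale, amplification index, resolution) is inferred from the example strings and is an assumption, as is numbering the amplification types from 1. The sketch enumerates every unordered pair of arms, so its job count is not meant to match the roughly 10k jobs of the real run.

```python
from itertools import combinations, product

arms = [c / 2 for c in range(2, 46)]       # 1.0, 1.5, ..., 22.5
resolutions = [20000]
scales = [0.2, 2, 10, 20]                  # kernel widths in Mbp
amp_types = ['ins-ins', 'del-del', 'ins-del', 'del-ins']

def build_parameters():
    """Return one parameter string per (arm pair, scale, amp type, resolution)."""
    amp_indices = range(1, len(amp_types) + 1)   # assumed 1-based encoding
    params = []
    for (arm_a, arm_b), scale, amp, res in product(
            combinations(arms, 2), scales, amp_indices, resolutions):
        params.append(f"{arm_a:.1f} {arm_b:.1f} {scale:g} {amp} {res}")
    return params

def jdl_parameters_line(params):
    """Format the parameter strings as a JDL Parameters attribute."""
    quoted = ", ".join(f'"{p}"' for p in params)
    return f"Parameters={{{quoted}}};"

params = build_parameters()
line = jdl_parameters_line(params[:2])
```

Splitting the full list into chunks before calling `jdl_parameters_line` keeps each jdl file to a manageable number of jobs.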
November 14, 2008 12
KCSmart: Output
• +/- 10k files
• 7.5 GB of 'peak-info'
• 1 TB of raw data
• Problems with the grid:
• once you have all the scripts in place to run jobs it's easy to create more output than a biologist can analyze
• once the biologist has some results he'll ask you to do it again (and again...)
November 14, 2008 13
KCSmart: Results 10k data
November 14, 2008 14
KCSmart: Results 1.8M data
November 14, 2008 15
KCSmart: Results 1.8M data
Found a known deletion pair (T-cell receptor): the method works.
November 14, 2008 16
KCSmart: Future work
• Higher resolution (once we have 64 bit WNs)
• Smaller scale
• Mutual exclusiveness tests
• Run on real tumor dataset
November 14, 2008 17
Matlab jobs
• Compile code using Matlab (on a UI), run using MCR
• Add ctf & executable to input sandbox:
InputSandbox={"kcsmart_topos.sh","kcsmart_large.bin","kcsmart_large_run.ctf","curl.gz"};
• Add 'require code' to jdl:
Requirements = Member("lsgmcr-7.5",other.GlueHostApplicationSoftwareRunTimeEnvironment);
• Load module on WN:
module load mcr
• Call executable
November 14, 2008 18
Job status tracking problems
• How do you check which jobs failed?
• Use output files as indicators:
lcg-ls lfn:///grid/lsgrid/jbot/chris_large/output/ > output.txt
cat output.txt | ~/code/chris/check_missing.pl > to_do.txt
• Copy subset of parameters to jdl file
• Submit job again
• This takes too long!
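The check for missing outputs boils down to a set difference between the parameters that were submitted and the output files that exist. The Perl script itself is not shown, so the sketch below is a hypothetical equivalent: in particular, the rule that maps a parameter string to an output file name (`name_for`) is invented for the example.

```python
def name_for(parameter):
    # Hypothetical naming rule: spaces become underscores, plus a ".mat" suffix.
    return parameter.replace(" ", "_") + ".mat"

def missing_parameters(parameters, listing, name_for):
    """Return the parameter strings whose expected output file is not listed.

    listing: lines of a directory listing, with or without leading paths.
    """
    present = {line.strip().rsplit("/", 1)[-1] for line in listing if line.strip()}
    return [p for p in parameters if name_for(p) not in present]

parameters = ["19.5 15.5 2 1 20000", "2.5 4.0 2 1 20000"]
listing = ["/grid/lsgrid/jbot/chris_large/output/19.5_15.5_2_1_20000.mat\n"]
to_do = missing_parameters(parameters, listing, name_for)
```

The resulting list is exactly the subset of parameters that has to be copied into a new jdl file for resubmission.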
November 14, 2008 19
The Annoyances: glite-wms-job-*
glite-wms-job-status
• It barely tells me anything (unless I specified error codes myself)
• I would rather know
• the number of failed / running jobs
• the error output or the parameters with which this job was run
• Use with grep & awk:
glite-wms-job-status `job-ids` > status.txt
cat status.txt | gawk '{prev=$7;getline;if($0~/Exit\ Code/){print prev;}}'
• Output: https://wms.grid.sara.nl:9000/ztINwkKvTJfKnUuZBTYs_g
Status info for the Job : https://wms.grid.sara.nl:9000/ztINwkKvTJfKnUuZBTYs_g
Current Status: Done (Exit Code !=0)
Exit code: 1
Status Reason: Warning: job exit code != 0
Destination: gb-ce-lumc.lumc.nl:2119/jobmanager-pbs-medium
Submitted: Sun Sep 7 21:24:56 2008 CEST
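The same extraction, plus the per-status counts wished for on this slide, can be sketched in Python. This is an illustration that only relies on the two line prefixes visible in the sample output ("Status info for the Job :" and "Current Status:"); the function name is made up, and other fields of the real output are ignored.

```python
from collections import Counter

def parse_status(text):
    """Return a list of (job_id, status) pairs from saved job-status output."""
    jobs, job_id = [], None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("Status info for the Job :"):
            job_id = line.split(" : ", 1)[1].strip()
        elif line.startswith("Current Status:") and job_id is not None:
            jobs.append((job_id, line.split(":", 1)[1].strip()))
            job_id = None
    return jobs

# The sample output from the slide, as it would appear in status.txt.
sample = """Status info for the Job : https://wms.grid.sara.nl:9000/ztINwkKvTJfKnUuZBTYs_g
Current Status: Done (Exit Code !=0)
Exit code: 1
Status Reason: Warning: job exit code != 0
Destination: gb-ce-lumc.lumc.nl:2119/jobmanager-pbs-medium
Submitted: Sun Sep 7 21:24:56 2008 CEST
"""

jobs = parse_status(sample)
counts = Counter(status for _, status in jobs)          # jobs per status
failed = [job for job, status in jobs if "Exit Code" in status]
```

Writing `failed` to a file, one job ID per line, gives a list in the same form as the gawk one-liner produces.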
November 14, 2008 20
The Annoyances: glite-wms-job-*
glite-wms-job-cancel
• Does not recursively cancel jobs stored in a file
• Fix:
glite-wms-job-status -i jobs.txt | grep 'http' | gawk '{print $7}' > to_cancel.txt
glite-wms-job-cancel -i to_cancel.txt
Status info for the Job : https://wms.grid.sara.nl:9000/ztINwkKvTJfKnUuZBTYs_g
Current Status: Done (Exit Code !=0)
Exit code: 1
Status Reason: Warning: job exit code != 0
Destination: gb-ce-lumc.lumc.nl:2119/jobmanager-pbs-medium
Submitted: Sun Sep 7 21:24:56 2008 CEST
November 14, 2008 21
The Annoyances: lcg-*
lcg-cr
• Getting files to and from the SEs:
• What, lcg-cr doesn't always work?
• On error: try again
• No error: good to go, right?
• Try copying the file back to the WN
lcg-cp
• Copying > 3000 files from a SE to the UI machine takes > 1 hour
• Copying the same files over ssh (scp) to my (remote) machine: ~2 minutes
• Security overhead?
• Work-around:
• lcg-rec-cp: slow
• custom script (do it in parallel): nasty
Neither lcg-cr nor lcg-cp works when the MCR is loaded
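The "try again" and "do it in parallel" work-arounds can be combined in a small helper. The custom script mentioned on the slide is not shown, so this is only a sketch: `copy_one` is a hypothetical callable that would, in real use, wrap a single copy command plus whatever verification is wanted and return True on success. Here a fake copy function stands in for it.

```python
from concurrent.futures import ThreadPoolExecutor

def with_retries(action, item, attempts=3):
    """Call action(item) until it returns True or the attempts run out."""
    for _ in range(attempts):
        try:
            if action(item):
                return True
        except Exception:
            pass  # treat an exception like a failed attempt
    return False

def copy_all(items, copy_one, workers=8, attempts=3):
    """Copy items concurrently; return the items that still failed."""
    items = list(items)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(lambda it: with_retries(copy_one, it, attempts), items))
    return [it for it, ok in zip(items, results) if not ok]

# Fake transfer for demonstration: succeeds on the second attempt,
# except for one file that never succeeds.
calls = {}

def flaky_copy(name):
    calls[name] = calls.get(name, 0) + 1
    if name == "broken.dat":
        return False
    return calls[name] >= 2

failed = copy_all(["a.dat", "b.dat", "broken.dat"], flaky_copy)
```

Threads are sufficient here because each transfer spends its time waiting on the network, not on computation.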
November 14, 2008 22
ToPoS
• Main developer: Pieter van Beek
• WebDav + Tokens + pilot job
• Instead of submitting one job at a time, claim a (bunch of) computer(s) until all jobs are done
November 14, 2008 23
ToPoS Overview
[Diagram components: ToPoS Server, User, The Grid]
(1) Job tokens
(2) Pilot Jobs
(3) Job Request
(4) Job Token
(5) Job Output
(6) All Output
November 14, 2008 24
Token renewal
[Flowchart boxes: Pilot job; Submit; Running pilot job; Get unused token; Pilot job with token; Execute token task; Affirm token use; Finished? (no / yes); Delete token]
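The pilot-job idea can be illustrated with a toy loop. This is not the ToPoS API: the `TokenPool` class is an in-memory stand-in for the token server, all method names are invented, and the periodic "affirm token use" step from the flowchart is left out. The retry limit is also an assumption, added so that a task that always fails cannot loop forever.

```python
import threading

class TokenPool:
    """In-memory stand-in for a token server (hypothetical, not the ToPoS API)."""

    def __init__(self, tokens):
        self._lock = threading.Lock()
        self._unused = list(tokens)
        self._in_use = set()

    def get_unused_token(self):
        """Claim and return the next unused token, or None if none are left."""
        with self._lock:
            if not self._unused:
                return None
            token = self._unused.pop(0)
            self._in_use.add(token)
            return token

    def delete_token(self, token):
        """Remove a token for good, e.g. once its task has finished."""
        with self._lock:
            self._in_use.discard(token)

    def release_token(self, token):
        """Make a claimed token available again, e.g. after its task failed."""
        with self._lock:
            if token in self._in_use:
                self._in_use.remove(token)
                self._unused.append(token)

def pilot_job(pool, run_task, max_attempts=3):
    """Claim tokens one at a time until none are left.

    Returns (done, gave_up): tokens whose task succeeded, and tokens dropped
    after max_attempts failures.
    """
    done, gave_up, attempts = [], [], {}
    while True:
        token = pool.get_unused_token()
        if token is None:
            return done, gave_up
        attempts[token] = attempts.get(token, 0) + 1
        if run_task(token):
            pool.delete_token(token)
            done.append(token)
        elif attempts[token] < max_attempts:
            pool.release_token(token)
        else:
            pool.delete_token(token)
            gave_up.append(token)

# Demo: the second task fails once and is picked up again later.
seen = set()

def run_task(token):
    if token == "2.5 4.0 2 1 20000" and token not in seen:
        seen.add(token)
        return False
    return True

done, gave_up = pilot_job(TokenPool(["19.5 15.5 2 1 20000", "2.5 4.0 2 1 20000"]), run_task)
```

Because a failed token simply returns to the pool, re-running failed work needs no separate bookkeeping on the user's side.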
November 14, 2008 25
ToPoS: Conclusion
• Advantages:
• Easy output handling using Curl with atomic operations
• Handles failed jobs
• Less overhead
• Able to dynamically add or remove nodes
• Easy to re-run jobs
• Easy access to output
• Disadvantages:
• Little / no security
• Some overhead at the end of a run (unless you're reserving tokens)
• Feature request: progress bar
November 14, 2008 26
Fixing the difficulties: LEARN BASH!
• diff is your friend:
• Useful to transfer missing files to and from SE
• grep
• Useful for querying status of jobs (use with the -c option)
• (g)awk
• Handy to cancel jobs
• Redirect output to file and push processes to background:
• lcg-ls is a typical example
November 14, 2008 27
Why not let the biologist do it?
• Resources needed to get this working on the grid:
• +/- 180 replies from grid support
• +/- 100 messages exchanged with the biologists
• Many hours of work, mostly finding out about the 'quirks' of the software
• Advantage of making a programmer submit the jobs:
• One person to handle support
• Re-use experience with other projects
November 14, 2008 28
Some other tricks
• Nikhef does not 'advertise' the installed software
• Do your own load balancing (once the job is in a queue, it doesn't get re-scheduled)
• Easy to do with the cancel-script shown previously
• Don't keep your stuff in $HOME when on WNs; change directory to $TMPDIR at the beginning of your script
• Keep in mind: once you have retrieved your job output, it's gone from the grid
• Use startGridSession
• When using ToPoS: make sure you land in the 'long' queue
November 14, 2008 29
Thanks!
• Sara Grid Support
• Jeroen Engelberts
• Pieter van Beek
• Machiel Jansen
• Nikhef
• Jan Just Keijser
• Collaborators
• Chris Klijn
• Jeroen de Ridder