R in Bioinformatics: parallelization and web
R, parallelization, massive data, and web applications: examples of the use of R in bioinformatics
Ramón Díaz-Uriarte
Dept. Bioquímica, Universidad Autónoma de Madrid, Madrid, Spain
[email protected] http://ligarto.org/rdiaz
Facultad de Informática, Universidad Complutense de Madrid, 9 May 2012

Outline:
Context
Parallelizing code
Web applications
Large data sets and parallelization
R, C, and compression on the fly
Conclusions et al.
What we are doing now
Arrays: a dot is a DNA fragment. Each array is a sample. Each array covers all chromosomes. (For analysis, location in the chromosome matters.)
Hupe & Barillot, 2005
Calling gains and losses: hypothesis testing
Inferring number of copy gains/losses: estimation
[Figure: aCGH profile; y-axis: Log2(Ratio)]
Data, data, data (in Gigabytes)
Expression arrays (mRNA): > 40,000 probes
Copy number with aCGH: > 400,000 common; some > 4 x 10^6
. . .
aCGH: example of data
(From O. Rueda’s PhD Thesis)
[Figure labels: probe, gene]
Computational issues et al.
We want to analyze, reanalyze, and combine.
Do it in a reasonably short time.
“Wet lab researchers” need user friendly access to methods that are both statistically rigorous and computationally efficient.
BioConductor paper: second most accessed paper in Genome Biology; yearly “Web server issue” of Nucleic Acids Research.
Multicores and computing clusters
Increases in CPU speed slowed down (< 20% per year since 2002).
Increase in the number of “cores”: 2, 4, 8. Next 10 years?
Inexpensive computing clusters with off-the-shelf components.
Must design our programs from the start: parallel programming.
Analyze data in a reasonably short time.
User friendly access to methods that are statistically rigorous.
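A minimal sketch of what "designing for parallelism from the start" can look like when each array (sample) can be analyzed independently. It is written in Python as an illustration only; the talk itself uses R and C. The function names, the median-centering rule, and the 0.3 threshold are all hypothetical stand-ins, not a real aCGH calling method.

```python
from concurrent.futures import ProcessPoolExecutor
from statistics import median


def call_sample(log2_ratios, threshold=0.3):
    """Toy per-array analysis (hypothetical rule, not a real aCGH method).

    Centers the log2 ratios on their median and labels each probe as
    gain (+1), loss (-1) or no change (0).
    """
    center = median(log2_ratios)
    calls = []
    for value in log2_ratios:
        shifted = value - center
        if shifted > threshold:
            calls.append(1)
        elif shifted < -threshold:
            calls.append(-1)
        else:
            calls.append(0)
    return calls


def analyze_serial(samples):
    # One array after another, on a single core.
    return [call_sample(s) for s in samples]


def analyze_parallel(samples, workers=2):
    # Same work, one array per task, spread over worker processes.
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(call_sample, samples))


if __name__ == "__main__":
    samples = [
        [0.0, 0.1, 0.9, 1.0, 0.0, -0.1],
        [0.0, -0.8, -0.9, 0.1, 0.0, 0.05],
        [0.2, 0.2, 0.2, 0.2, 0.9, 0.2],
    ]
    # The parallel version must give exactly the serial answer.
    assert analyze_parallel(samples) == analyze_serial(samples)
    print(analyze_parallel(samples))
```

Because the per-array function has no shared state, the serial and parallel versions differ only in how the map is executed, which is the property that makes this kind of workload easy to spread over cores or cluster nodes.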
Web-based applications
User-friendly interface.
No hardware/software hassles for end users.
Parallelization is transparent.
Method selection can be partially transferred (to us).
Short user wall time: use (hardware/software) resources rarely available to individual biomedical researchers.
Just type in a URL: http://www.some-application
Store and access (large) pre-computed results
HMM for aCGH data with Reversible Jump: Viterbi.
Common regions: “count” on the Viterbi paths.
Fitting HMM/common regions: distinct operations.
C: number-crunching.
R: wrapper and figures/tables.
C: creates large amounts of data.
In package RJaCGH (CRAN).
Fit HMM
  R, C (HMM)
  Store Viterbi as gzipped file
  Return filenames
Find common regions
  R, C (common regions)
  Pass filenames
  Read Viterbi data
  Return results
Figures, tables
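The two-stage flow above (store each Viterbi path compressed, hand back only filenames, then read the paths again to count common regions) can be sketched as follows. This is a Python illustration under stated assumptions, not the RJaCGH implementation: the file layout (one state per line), the state coding (+1 gain, 0 no change, -1 loss), and every function name are hypothetical, and the HMM fit itself is replaced by ready-made state sequences.

```python
import gzip
import os
import tempfile


def store_path_gz(states, directory):
    """Write one state sequence to a gzipped file and return its filename.

    Stand-in for the 'store Viterbi as gzipped file' step; the real
    number-crunching would happen before this call.
    """
    fd, filename = tempfile.mkstemp(suffix=".gz", dir=directory)
    os.close(fd)
    with gzip.open(filename, "wt") as fh:
        for state in states:
            fh.write(f"{state}\n")  # assumed layout: one state per line
    return filename


def count_gains(filenames):
    """For each probe, count how many stored paths call a gain (+1).

    Files are decompressed on the fly and read in lockstep, so only one
    line per file is held in memory at a time.
    """
    handles = [gzip.open(name, "rt") for name in filenames]
    try:
        return [sum(int(s) == 1 for s in states) for states in zip(*handles)]
    finally:
        for handle in handles:
            handle.close()


def common_regions(counts, min_arrays):
    """Return (first_probe, last_probe) runs where count >= min_arrays."""
    regions, start = [], None
    for i, count in enumerate(counts):
        if count >= min_arrays and start is None:
            start = i
        elif count < min_arrays and start is not None:
            regions.append((start, i - 1))
            start = None
    if start is not None:
        regions.append((start, len(counts) - 1))
    return regions


if __name__ == "__main__":
    paths = [[0, 1, 1, 0, -1], [0, 1, 1, 1, -1], [1, 1, 0, 0, 0]]
    with tempfile.TemporaryDirectory() as tmp:
        names = [store_path_gz(p, tmp) for p in paths]  # only names travel
        print(common_regions(count_gains(names), min_arrays=2))
```

Passing filenames rather than the data keeps the large intermediate results out of the calling session, which is the point the slide makes about fitting and region finding being distinct operations.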
Web-based: A few things we’ve learned
Configuration sucks (if you need to modify > 1 file).
Too many languages.
Adding test cases to the testing suites: web, R.
Documentation: in the code, web pages, LaTeX . . .
Too much R code to catch errors.
User interfaces: who designs them?
Too many languages
Impedance mismatch problem: “Building Web-based applications requires the mastering of a number of languages/technologies (e.g. HTML, CSS, CGI, ASP, PHP, XML, etc..). Such languages and technologies were created to address different aspects on a by-need evolutionary manner. The result is a plethora of tools that are fitted together in an ad hoc fashion.” El-Ansary, Grolaux, Van Roy, Rafea (2005) “Overcoming the Multiplicity of Languages and Technologies for Web-Based Development Using a Multi-paradigm Approach”.
R and C
HTML and Python: CGI, data entry, display
Python (and others): control and monitor MPI
JavaScript: AJAX and figures
Fault tolerance and communication
Manual check for errors (R ain’t Erlang).
Too much network traffic.
[Flowchart of the control and monitoring loop. Boxes: Boot (new) LAM/MPI; Start R: continue from last checkpoint; Sleep; Run out of time?; Are we done?; R crashed (coding errors)?; Rmpi crashed? LAM/MPI crashed? (includes node crashes); Verify servers (modify LAM defs); Halt MPI universe; Produce and return results pages. Decision branches are labeled Yes/No. Resources shown: MPI universe (servers 1 ... n), NFS shared temporary storage, NFS shared storage.]
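The "continue from last checkpoint" idea in the flowchart can be sketched as a small supervise-and-restart loop. This is an illustration in Python, not the authors' control software: it does not model LAM/MPI, Rmpi, NFS storage, sleeping, or time limits, and the exact branch order of the original flowchart is not recoverable from the transcript. The checkpoint format, the function names, and the simulated crash are all hypothetical.

```python
import json
import os
import tempfile


def load_checkpoint(path):
    """Return the saved state, or a fresh one if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as fh:
            return json.load(fh)
    return {"next_item": 0, "total": 0}


def save_checkpoint(path, state):
    """Write the state to a temporary file, then rename it into place.

    The rename means a crash never leaves a half-written checkpoint.
    """
    tmp = path + ".tmp"
    with open(tmp, "w") as fh:
        json.dump(state, fh)
    os.replace(tmp, path)


def run_job(items, path, crash_at=None):
    """Process items in order, checkpointing after each one."""
    state = load_checkpoint(path)
    for i in range(state["next_item"], len(items)):
        if crash_at is not None and i == crash_at:
            raise RuntimeError("simulated worker crash")
        state["total"] += items[i]
        state["next_item"] = i + 1
        save_checkpoint(path, state)
    return state["total"]


def supervise(items, path, crash_at=None, max_restarts=3):
    """Restart the job after a crash; it resumes from the last checkpoint.

    Returns (result, number_of_restarts). The crash is only injected on
    the first attempt so that the restart can succeed.
    """
    for attempt in range(max_restarts + 1):
        try:
            result = run_job(items, path, crash_at if attempt == 0 else None)
            return result, attempt
        except RuntimeError:
            continue
    raise RuntimeError("gave up after repeated crashes")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        checkpoint = os.path.join(tmp, "job.json")
        print(supervise([1, 2, 3, 4], checkpoint, crash_at=2))
```

The supervisor knows nothing about the computation itself; all it needs is that the job records its progress often enough for a restart to be cheap.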
Solutions?
Literate programming and org-mode.
Alternatives to MPI and/or use Erlang . . .
Keep things as they are (only a few painful events a year).
Rethinking web-based applications
Users can get into trouble.
Sure, but we can do a good job . . .
  Provide state-of-the-art statistical and computational approaches (R).
  Pedagogical examples and pipelines.
  Minimize the chance of users getting into trouble.
Web-based applications are here to stay.
. . . so . . .
Forget about them: just write your R/C/whatever code.
Go for it.
  We can use R + HPC.
  But other tools and work necessary.
Regardless of web-based applications . . .
Parallel computing can be used routinely.
  (library(parallel) in R ≥ 2.14.0)
Large data sets with ff + parallelization.
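The combination of disk-backed data and parallel workers can be sketched as follows: the data stay in a binary file, each task names a chunk by offsets, and each worker reads only its own chunk. This is a Python analogue offered as an assumption-laden illustration; it does not use or reproduce the ff package, and the file layout (raw 8-byte doubles), chunk size, and function names are hypothetical.

```python
import os
import tempfile
from array import array
from concurrent.futures import ProcessPoolExecutor

ITEM_SIZE = array("d").itemsize  # bytes per stored double


def write_doubles(filename, values):
    """Store the values on disk as raw doubles (assumed layout)."""
    with open(filename, "wb") as fh:
        array("d", values).tofile(fh)


def make_tasks(filename, n_values, chunk_size):
    """Describe each chunk as (filename, start, stop); no data is copied."""
    return [(filename, start, min(start + chunk_size, n_values))
            for start in range(0, n_values, chunk_size)]


def chunk_sum(task):
    """Read one chunk from disk and reduce it; runs inside a worker."""
    filename, start, stop = task
    chunk = array("d")
    with open(filename, "rb") as fh:
        fh.seek(start * ITEM_SIZE)
        chunk.fromfile(fh, stop - start)
    return sum(chunk)


def total_serial(tasks):
    return sum(chunk_sum(task) for task in tasks)


def total_parallel(tasks, workers=2):
    # Only the small task tuples and the partial sums cross process
    # boundaries; the full data set is never held in one process.
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(chunk_sum, tasks))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "log2ratios.bin")
        write_doubles(path, [float(i) for i in range(1000)])
        tasks = make_tasks(path, 1000, 128)
        assert total_parallel(tasks) == total_serial(tasks)
        print(total_parallel(tasks))
```

Sending offsets instead of data keeps inter-process traffic small, which matters for the same reason the fault-tolerance slide complains about too much network traffic.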
So far . . .
Most of what I mentioned refers to “traditional cluster setups”:
  Several nodes (e.g., > 10).
  A few CPUs/cores per node.
  Not too much RAM per node.
We’ve been using it for about 10 years.
But things change . . .
New hardware
Only a few nodes (2 in our case).
Many cores.
Lots of RAM available for a single process.
More reliable?
Little need for control and monitoring software?
Reconfiguration of MPI definition files.
Load balancing of web servers.
CHANGES
Changes in application software:
Rethink how we use MPI.
Start using OpenMP in C code.
  Need to be careful when called from R.
  Random number generation.
Use mclapply (forking) within R.
Rethink usage of ff: we can keep the whole object in RAM.
  Do not use the disk at all.
  Eliminate code.
Rethink I/O and storage.
Combine MPI/Rmpi with OpenMP and mclapply (forking).
Rethink usage of R (Julia? Python?)
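The slide flags random number generation as something to be careful about with OpenMP and forking, without saying which problem is meant. One common concern, shown here as a Python sketch and an assumption about what the slide refers to, is that workers which inherit the same generator state produce identical "random" draws. The remedy sketched is to give every task its own generator seeded from a base seed plus the task index; the names, the seed format, and the base seed are hypothetical, and this is not the scheme R's parallel package uses.

```python
import random


def forked_like_draws(n_workers, n=3):
    """Mimic workers that each inherit the parent's generator state.

    A forked child starts with a copy of the parent's state, so without
    reseeding every worker produces the same sequence.
    """
    parent_state = random.Random(42).getstate()
    draws = []
    for _ in range(n_workers):
        child = random.Random()
        child.setstate(parent_state)  # what a forked child effectively gets
        draws.append([child.random() for _ in range(n)])
    return draws


def simulate(task_id, base_seed=2012, n=5):
    """Draw from a generator that belongs to this task only.

    Seeding from (base_seed, task_id) makes each task reproducible and
    gives different tasks different streams. This is a simple scheme,
    not a guarantee of statistically independent streams.
    """
    rng = random.Random(f"{base_seed}:{task_id}")
    return [rng.gauss(0.0, 1.0) for _ in range(n)]


if __name__ == "__main__":
    shared = forked_like_draws(3)
    print("inherited state, all identical:", shared[0] == shared[1] == shared[2])
    print("per-task seeds, task 0 vs 1 differ:", simulate(0) != simulate(1))
```

Because `simulate` depends only on its arguments, it gives the same result whether tasks run serially, in forked workers, or on different nodes, so results do not change with the number of cores.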
Commercials (great deals)
I’d be glad to talk to anybody who wants to play with, and help configure, our machines and code.
Acknowledgements
O. M. Rueda, A. Alibés, A. Cañada, E. R. Morrissey, M. L. Neves, D. Rico.
Funding: Fundación de Investigación Médica Mutua Madrileña, Project TIC2003-09331-C02-02 of the Spanish MEC and BIO2009-12458 of the Spanish MICINN. Ramón y Cajal Programme of the Spanish Ministry of Education and Science.
CNIO (Spanish National Cancer Research Center).
The R users and developers for a vibrant statistical computing community and amazing platform.
Victoria López.