SDM Center
Parallel R (pR) for High Performance Statistical Computing
• Nagiza F. Samatova (ORNL)
• Srikanth Yoginath (ORNL)
• Guruprasad Kora (ORNL)
• David Bauer (GT)
• Chongle Pan (UTK/ORNL)
SDM AHM @ Salt Lake City, March 3-4, 2005
Contact: Nagiza Samatova
SDMCenter
Parallel R (pR) for High Performance Statistical Computing
About Parallel R
• Motivation
• About R and its parallelization efforts
• Task and data parallelism with Parallel R (pR)
• Extensibility of Parallel R
• Performance Benchmarks
Parallel R across Different Applications
• GIS data analysis with GRASS and Parallel R
• Clustered Climate Regimes using Parallel R
• Fusion scenario challenges Parallel R
• Quantitative Proteomics in Biology using Parallel R
Summary and Future Work
Tera-(Flop & Byte) Analyses Could Be Routine for Scientific Applications But…
Algorithmic Complexity:
• Calculate means: O(n)
• Calculate FFT: O(n log(n))
• Calculate PCA: O(r • c)
• Hierarchical clust.: O(n^2)
• Climate. Now: 20-40 TB per simulated year. 5 yrs: 100 TB/yr, 5-10 PB/yr
• Astrophysics. Now and 5 yrs: Can soak up anything!
• Fusion. Now: 100 Mbytes/15 min. 5 yrs: 1000 Mbytes/2 min
1 Tflop/sec
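A rough back-of-envelope check shows why the quadratic case is the one that hurts. The sketch below converts the complexities listed above into run times at a sustained 1 Tflop/sec. It assumes one floating-point operation per elementary step, constant factors of 1, and a hypothetical problem size of n = 10^9 elements; none of these assumptions come from the slides.

```r
# Back-of-envelope run times at a sustained 1 Tflop/sec.
# Assumptions (not from the slides): n = 1e9 data elements, one flop per
# elementary step, and a constant factor of 1 in each complexity.
peak_flops <- 1e12
n <- 1e9

ops <- c(
  means  = n,            # O(n)
  fft    = n * log2(n),  # O(n log n)
  hclust = n^2           # O(n^2)
)

seconds <- ops / peak_flops   # time in seconds for each analysis
days <- seconds / 86400       # same times expressed in days

print(signif(seconds, 3))
print(signif(days, 3))
```

Under these assumptions the linear and n log(n) analyses finish in well under a second, while the quadratic one needs more than eleven days on the same machine.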
Statistical Computing with R
About R (http://www.r-project.org/):
• R is an Open Source (GPL), most widely used programming environment for statistical analysis and graphics; similar to S.
• Provides good support for both users and developers.
• Highly extensible via dynamically loadable add-on packages.
• Originally developed by Robert Gentleman and Ross Ihaka.
• rpvm (Na Li and Tony Rossini): R interface to PVM; requires knowledge of parallel programming.
• snow (Luke Tierney): general API on top of message passing routines to provide high-level (parallel apply) commands; mostly demonstrated for embarrassingly parallel applications.
Ideal Programming Requirements:
• Be able to use existing high-level (i.e., R) code
• Require minimal extra effort for parallelizing
• Have an identical or similar (presumably easy-to-use) interface to R's
• Be able to test codes in sequential settings
• Provide efficient and scalable (in terms of problem size and number of processors) performance
R:
    ...
    fileList <- list.files(pattern="*.nc");
    for (i in 1:length(fileList)) {
      matrix[i] <- readNcFile(fileList[i]);
      pca[i] <- prcomp(matrix[i])
    }
    ...

pR:
    ...
    fileList <- list.files(pattern="*.nc");
    PE(
      for (i in 1:length(fileList)) {
        matrix[i] <- readNcFile(fileList[i]);
        pca[i] <- sla.prcomp(matrix[i])
      }
    )
    ...
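The pattern in the loop above is that every iteration is independent, so each one can be treated as a separate task. The sketch below illustrates that idea with base R's parallel package as a stand-in; it does not use or reproduce pR's PE() construct. Synthetic matrices replace the netCDF reads, and the worker count of 2 is an arbitrary choice.

```r
library(parallel)

# Stand-in for the per-file work: synthetic matrices replace the netCDF
# reads (readNcFile from the slide is not reproduced here).
set.seed(42)
inputs <- lapply(1:4, function(i) matrix(rnorm(200 * 5), nrow = 200, ncol = 5))

# Sequential version: one PCA per input, results kept in a list.
pca_seq <- lapply(inputs, prcomp)

# Task-parallel version: each PCA is an independent task handed to a
# forked worker. mclapply cannot fork on Windows, so fall back to 1 core.
cores <- if (.Platform$OS.type == "windows") 1L else 2L
pca_par <- mclapply(inputs, prcomp, mc.cores = cores)

# Both versions should agree on the component standard deviations.
same <- all(mapply(function(a, b) isTRUE(all.equal(a$sdev, b$sdev)),
                   pca_seq, pca_par))
print(same)
```

Because the sequential and parallel calls differ only in the apply function, the same code can be tested sequentially first, which matches one of the requirements listed earlier.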
Providing Task and Data Parallelism in pR
Task-parallel analyses:
• Likelihood Maximization.
• Re-sampling schemes: Bootstrap, Jackknife, etc.
• Animations
• Markov Chain Monte Carlo (MCMC).
  – Multiple chains.
  – Simulated Tempering: running parallel chains at different "temperatures" to improve mixing.
Data-parallel analyses:
• k-means clustering
• Principal Component Analysis (PCA)
• Hierarchical (model-based) clustering
• Distance matrix, histogram, etc. computations
Diagram: Task & Data Parallelism with pR. Data parallelism is provided by RScaLAPACK; task parallelism is provided by task-pR.
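To make the data-parallel case concrete, the sketch below splits the rows of one matrix into blocks, lets each worker reduce its block to small summaries, and combines those summaries into the covariance matrix that PCA needs. It is an illustration in base R with the parallel package, not RScaLAPACK's implementation; the block count, worker count, and the helper name `partial` are assumptions. The workers here share the matrix through forking, whereas a distributed setting would give each worker only its own block.

```r
library(parallel)

set.seed(1)
X <- matrix(rnorm(1000 * 4), nrow = 1000, ncol = 4)

# Split the row indices into 4 contiguous blocks, one per (hypothetical) worker.
blocks <- split(seq_len(nrow(X)), cut(seq_len(nrow(X)), 4, labels = FALSE))

# Each worker reduces its block to small summaries:
# row count, column sums, and the cross-product t(Xb) %*% Xb.
partial <- function(idx) {
  Xb <- X[idx, , drop = FALSE]
  list(n = nrow(Xb), s = colSums(Xb), xtx = crossprod(Xb))
}
cores <- if (.Platform$OS.type == "windows") 1L else 2L
parts <- mclapply(blocks, partial, mc.cores = cores)

# Combine step: sum the partial results, then form the covariance matrix
# as (X'X - n * mu mu') / (n - 1).
n   <- sum(sapply(parts, `[[`, "n"))
s   <- Reduce(`+`, lapply(parts, `[[`, "s"))
xtx <- Reduce(`+`, lapply(parts, `[[`, "xtx"))
mu  <- s / n
cov_par <- (xtx - n * tcrossprod(mu)) / (n - 1)

# Principal component standard deviations from the combined covariance.
sdev_par <- sqrt(eigen(cov_par, symmetric = TRUE)$values)
print(all.equal(sdev_par, prcomp(X)$sdev))
```

The point of the design is that only the small summaries (a count, a vector, and a 4 x 4 matrix per block) travel between workers, never the raw rows.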
Extensibility of Parallel R (pR)
Diagram: R Environment, Parallel Agent, and Third Party Parallel Codes (ScaLAPACK, Matrix, Parallel k-means, Alok's Data Mining); pR libraries shown: RScaLAPACK, pMatrix, pAlok. Other labels: C/Fortran MPI, R object.
• Define R function parameters & returns
• Map R functions to defined function interfaces
• Define the function interfaces
• Set parallel environment limits for your functions
• Define data distribution function (optional)
• Convert your MPI/PVM routine(s) into a set of functions.
• Create a shared library of your functions.
• Place it in a predefined location.
Scalability of Parallel R (pR)
Speedup for Parallel R’s sla.solve() over serial R’s solve().
Architecture: SGI Altix at CCS of ORNL with 256 Intel Itanium2 processors at 1.5 GHz; 8 GB of memory per processor (2 TB system memory); 64-bit Linux OS; 1.5 TeraFLOPs/s theoretical total peak performance.
Matrix size:
R> solve(A, B)
pR> sla.solve(A, B, NPROWS, NPCOLS, MB)
A and B are the input matrices; NPROWS and NPCOLS are process grid specs; MB is block size.
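The process grid and block size arguments correspond to ScaLAPACK's two-dimensional block-cyclic data layout. The sketch below shows which process would own each matrix element under that layout. It assumes square MB x MB blocks, a distribution that starts at process (0, 0), and a row-major rank used only for display; `owner` is an illustrative helper, not an RScaLAPACK function.

```r
# Which process (grid row, grid column) owns matrix element (i, j) under a
# 2D block-cyclic layout? Assumes square MB x MB blocks, distribution starting
# at process (0, 0), 1-based matrix indices, 0-based process coordinates.
# `owner` is an illustrative helper, not an RScaLAPACK function.
owner <- function(i, j, NPROWS, NPCOLS, MB) {
  c(prow = ((i - 1) %/% MB) %% NPROWS,
    pcol = ((j - 1) %/% MB) %% NPCOLS)
}

# Example: 8 x 8 matrix, 2 x 2 process grid, block size 2.
NPROWS <- 2; NPCOLS <- 2; MB <- 2
grid <- outer(1:8, 1:8, Vectorize(function(i, j) {
  o <- owner(i, j, NPROWS, NPCOLS, MB)
  o[["prow"]] * NPCOLS + o[["pcol"]]   # row-major rank, for display only
}))
print(grid)
```

In this example every process ends up with the same number of elements, which is the load-balancing property that makes the cyclic layout attractive for dense solvers.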
• pR enables both data and task parallelism (includes task-pR and RScaLAPACK) (2004/Q4)
• RScaLAPACK provides an R interface to ScaLAPACK and, through data parallelism, inherits its scalability in terms of problem size and number of processors (2004/Q2)
• task-pR achieves parallelism by performing out-of-order execution of tasks. With its intelligent scheduling mechanism, it attains a significant reduction in execution time (2004/Q3)
• pMatrix provides a platform for performing major matrix operations in parallel using ScaLAPACK and PBLAS Level II & III routines (2005/Q2)
Also: Available for download from R’s CRAN web site (www.R-Project.org) with 37 mirror sites in 20 countries
Geo-statistical and Spatial Data Analysis with GRASS and Parallel R
With: George Fann, John Drake, and Bhaduri Budhendra
About GRASS (http://grass.itc.it/):
• GRASS (Geographic Resources Analysis Support System) is a raster/vector GIS, image processing system, and graphics production system.
• GRASS contains over 350 programs and tools to render maps and images on monitor and paper; manipulate raster, vector, and sites data; process multispectral image data; create, manage, and store spatial data.
• It is Free (Libre) Software/Open Source released under GNU GPL.
Parallel R (pR) extension for GRASS:
• Leverages the work by Markus Neteler (http://grass.itc.it/statsgrass/grass_geostats.html).
• Offers a richer set of statistical analysis capabilities, including Basic Statistics, Exploratory Data Analysis, Linear Models, Multivariate Analysis, Time Series Analysis, etc.
• Provides a high-performance parallel computational platform for large datasets
Serial Version:
    for (i in 1:length(chroList)) {
      currResult[i] <- Rratio(filename=chroList[i]);
    }
    for (i in 1:length(chroList)) {
      cat(chroList[i], currResult$samSN, currResult$refSN, currResult$PPCSN,
          currResult$HR, currResult$PCA, currResult$PCASN,
          file="Pratio-Peptide.txt");
    }
    ...

Parallel Version:
    PE(
      for (i in 1:length(chroList)) {
        currResult[i] <- Pratio(filename=chroList[i]);
      }
    )
    for (i in 1:length(chroList)) {
      cat(chroList[i], currResult$samSN, currResult$refSN, currResult$PPCSN,
          currResult$HR, currResult$PCA, currResult$PCASN,
          file="Pratio-Peptide.txt");
    }
    ...
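The two loops follow a compute-then-write pattern: independent per-file computations first, then a sequential pass that writes the results. The sketch below is a runnable plain-R variant of that pattern, with two changes from the slide code that are choices of this sketch: results are stored one list entry per file and indexed when written, and lines are appended to the output file. `ratio_stub`, the file names, and the returned numbers are all placeholders, not the routine or data from the slides.

```r
# `ratio_stub` is a hypothetical stand-in for the per-chromatogram ratio
# routine on the slide; it returns placeholder numbers, not real results.
ratio_stub <- function(filename) {
  list(samSN = nchar(filename), refSN = 1, PCA = 0.5)
}

chroList <- c("chro_a.xml", "chro_bb.xml", "chro_ccc.xml")  # made-up names

# Step 1: independent per-file computations, stored one list entry per file.
currResult <- vector("list", length(chroList))
for (i in seq_along(chroList)) {
  currResult[[i]] <- ratio_stub(filename = chroList[i])
}

# Step 2: write one line per file; append = TRUE keeps earlier lines.
outFile <- tempfile(fileext = ".txt")
for (i in seq_along(chroList)) {
  r <- currResult[[i]]
  cat(chroList[i], r$samSN, r$refSN, r$PCA, "\n",
      file = outFile, append = TRUE)
}
print(readLines(outFile))
```

Only step 1 consists of independent iterations, which is consistent with the slide wrapping just the first loop in PE() and leaving the output loop sequential.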
Performance Results for Ratio Calculation
Summary and Future Work
Parallel R (pR) is an Open Source high performance library for statistical computing in R
It has been deployed in a number of applications including: climate, GIS, fusion, and biology
Future improvements in a few major directions:
• Demonstrate more application scenarios
• Add more libraries like RScaLAPACK, pMatrix (e.g. pAlok, pclust, pnetCDF)
• Improve the performance (reduce overhead, memory management) of Parallel Agent
• Enhance features of Parallel Agent:
  – Support outside of Master-Slave model
  – Better memory management strategies (one-sided put(), get(), release(), etc.)
  – Support of parallel I/O over netCDF and HDF files