Creating an Optimized Algorithm in R: Version 1, October 22, 2009
R: Background

Nobody owns it, yet R-related products have been created by:
• REvolution Computing (partnering with Microsoft/Intel): http://www.revolution-computing.com/industry/academic.php
• SAS (interface to SAS/IML): http://support.sas.com/rnd/app/studio/Rinterface2.html
• SPSS (interface to SPSS, including some use of Python): http://insideout.spss.com/2009/01/13/spss-statistics-and-r/
• Blue Reference Inc. (plugin for MS Office): http://inferenceforr.com/default.aspx
• Information Builders (R GUI for data mining): http://www.informationbuilders.com/products/webfocus/predictivemodeling.html
R Packages

CRAN: 1,783 packages in R 2.11; 1,977 packages in R 2.9
Cost: $0, but a lot of hours
Question: how many people in the world know all 1,977 R packages?
Some uses of R

Citation: http://blog.revolution-computing.com

library(maps)
map("state", interior = FALSE)
map("state", boundary = FALSE, col = "gray", add = TRUE)
GADM is a spatial database of the location of the world's administrative boundaries. Using the spplot function (from the sp package), we load the data for Switzerland and then plot each canton with a color denoting its primary language:

library(sp)
con <- url("http://gadm.org/data/rda/CHE_adm1.RData")
print(load(con))
close(con)
language <- c("german", "german", "german", "german", "german", "german",
              "french", "french", "german", "german", "french", "french",
              "german", "french", "german", "german", "german", "german",
              "german", "german", "german", "italian", "german", "french",
              "french", "german", "german")
gadm$language <- as.factor(language)
col <- rainbow(length(levels(gadm$language)))
spplot(gadm, "language", col.regions = col, main = "Swiss Language Regions")

Citation: AnthroSpace, "Download Global Administrative Areas as RData files"
Seven tips for "surviving" R

• Keep extensive written notes
• Find a way to search for R answers
• Learn to convert complex objects to canonical forms with unclass()
• Learn how to find and inspect classes and methods for objects
• Learn how to clear pesky attributes from objects
• Swallow your pride
• Learn and use R's many one-line idioms, rather than reinventing the wheel

Citation: John Mount, Win-Vector LLC
Writing a Function/Algorithm in R

Simply enough:

newRalgorithm <- function(x) OldAlgorithm(x)

Eg-

do_something <- function(x, y) {
  # Function code goes here ...
}

# Subset my data
orange_girls <- subset(crabs, sex == 'F' & sp == 'O')

# Call my function
do_something(orange_girls$CW, orange_girls$CL)

Citation:
http://cran.r-project.org/doc/manuals/R-exts.html#Top
http://www.bioinformaticszen.com/r_programming/data_analysis_using_r_functions_as_objects/
Writing a new stats algorithm (in R or another language)

Basic steps:
1. Journal review of the study area
2. Study of existing algorithms for gap analysis
3. Add creativity
4. Test and iterate within the community
5. Publish
Choosing Clustering as the Area of Interest

Clustering works with big data. It can work with many incomplete column variables when other techniques may not be suitable, and it works when data cannot be used for regression models.

Groups of clusters can be merged and combined to make new clusters, which makes a case for parallel processing.

Useful for product marketing, business, medicine, and finance.
K-Means Clustering using R

R> data("planets", package = "HSAUR")
R> library("scatterplot3d")
R> scatterplot3d(log(planets$mass), log(planets$period),
+    log(planets$eccen), type = "h", angle = 55,
+    pch = 16, y.ticklabs = seq(0, 10, by = 2),
+    y.margin.add = 0.1, scale.y = 0.7)
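The snippet above only visualises the planets data; a minimal sketch of actually running k-means on the same data might look like the following. The number of clusters (3), the seed, and the standardisation step are illustrative assumptions, not part of the original deck:

```r
# Sketch: k-means on the HSAUR planets data (assumes HSAUR is installed).
data("planets", package = "HSAUR")
planet_scaled <- scale(planets)        # standardise each variable first
set.seed(42)                           # k-means uses random starting centres
fit <- kmeans(planet_scaled, centers = 3, nstart = 25)
table(fit$cluster)                     # how many planets fall in each cluster
fit$centers                            # cluster centres on the standardised scale
```

Using nstart = 25 restarts the algorithm from several random configurations and keeps the best solution, which guards against poor local minima.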
Writing a Function/Algorithm in R, Part 2: Adding loops and multiple functions

Eg-

# Arrays of values for each type of species and sex
species <- unique(crabs$sp)
sexes <- unique(crabs$sex)

# Loop through species ...
for(i in 1:length(species)){
  # ... loop through sex ...
  for(j in 1:length(sexes)){
    # ... and finally call a function on each subset
    something_else(subset(crabs, sp == species[i] & sex == sexes[j]))
  }
}

Citation:
http://cran.r-project.org/doc/manuals/R-exts.html#Top
http://www.bioinformaticszen.com/r_programming/data_analysis_using_r_functions_as_objects/
More ways to write functions

each <- function(.column, .data, .lambda){
  # Find the column index from its name
  column_index <- which(names(.data) == .column)
  # Find the unique values in the column
  column_levels <- unique(.data[, column_index])
  # Loop over these values
  for(i in 1:length(column_levels)){
    # Subset the data and call the passed function on it
    .lambda(.data[.data[, column_index] == column_levels[i], ])
  }
}

The last argument, .lambda, is an R function; because R treats functions as objects, they can be passed as arguments to other functions.

# Another function as the last argument to this function
each("sp", crabs, something_else)

# Or create a new anonymous function ...
each("sp", crabs, function(x){
  # ... and run multiple lines of code here
  something_else(x)
  with(x, lm(CW ~ CL))
})
Additionally, create new functions using plyr.

From http://had.co.nz/plyr/: plyr is a set of tools that solves a common set of problems: you need to break a big problem down into manageable pieces, operate on each piece, and then put all the pieces back together. It is already possible to do this with split and the apply functions, but plyr makes it all a bit easier with:

• consistent names, arguments and outputs
• input from and output to data.frames, matrices and lists
• progress bars to keep track of long-running operations
• built-in error recovery

In short, a consistent and useful set of tools for solving the split-apply-combine problem.

library(plyr)
# Three arguments:
# 1. The dataframe
# 2. The names of the columns to subset by
# 3. The function to call on each subset
d_ply(crabs, .(sp, sex), something_else)
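As a sketch of the same split-apply-combine idea when you want the results back, ddply (also from plyr) collects the per-group values into a data frame. The summary statistic below is an illustrative choice, and crabs is assumed to come from the MASS package:

```r
library(MASS)   # for the crabs data set
library(plyr)

# Sketch: mean carapace width (CW) for each species/sex combination.
# The statistic chosen here is illustrative, not from the original deck.
ddply(crabs, .(sp, sex), summarise, mean_CW = mean(CW))
```

Whereas d_ply calls the function purely for its side effects, ddply returns one row per (sp, sex) group, ready for further analysis.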
Quick Recap

We have an algorithm in mind, or we create a new algorithm (the toughest part). (Eg. http://en.scientificcommons.org/42572415 — Genetic K-Means (GKM) or Genetic Regularized Mahalanobis (GARM) distances to compute the initial cluster parameters, with little difference in the final results. This innovation allows our algorithm to find optimal parameter estimates of complex hyperellipsoidal clusters. We develop and score the information complexity (ICOMP) criterion of Bozdogan (1994a,b, 2004) as our fitness function to choose the number of clusters present in the data sets.)

We created a function in R for it. We can also use this approach to rename package functions (like a SAS R package I created).

We now need to create a package so that all 2 million R users may have a chance to use it.
Creating a New Package

Citation: http://cran.r-project.org/doc/contrib/Leisch-CreatingPackages.pdf

1. Load all functions and data sets you want in the package into a clean R session, and run package.skeleton(). The objects are sorted into data and functions, skeleton help files are created for them using prompt(), and a DESCRIPTION file is created. The function then prints out a list of things for you to do next. This creates the package within the current working directory.

> package.skeleton(name="NAME_OF_PACKAGE", code_files="FILENAME.R")
Creating directories ...
Creating DESCRIPTION ...
Creating Read-and-delete-me ...
Copying code files ...
Making help files ...
Done.
Further steps are described in './linmod/Read-and-delete-me'.

Q: Where is my package?
A: getwd()
Q: What is the most important step in making software?
A: Documenting the HELP files
FINALLY

* Edit the help file skeletons in 'man', possibly combining help files for multiple functions.
* Put any C/C++/Fortran code in 'src'.
* If you have compiled code, add a .First.lib() function in 'R' to load the shared library.
* Run R CMD build to build the package tarball.
* Run R CMD check to check the package tarball.

Read "Writing R Extensions" for more information: http://cran.r-project.org/doc/manuals/R-exts.pdf. Also see the guidelines for CRAN submission.
Next Steps

We have new functions and a new package.
We now need to optimize the R package for performance, using:

1) Parallel computing
2) High-performance computing
3) Code optimization
Optimizing Code

Citation: Dirk Eddelbuettel, http://dirk.eddelbuettel.com/papers/useR2009hpcTutorial.pdf

R already provides the basic tools for performance analysis:
• the system.time function for simple measurements
• the Rprof function for profiling R code
• the Rprofmem function for profiling R memory usage

In addition, the profr and proftools packages on CRAN can be used to visualize Rprof data.

We use these tools to create visual images of how the algorithm loops, in case we don't know what the algorithm we created looks like visually, and to avoid multiple calls.
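As a sketch of how these tools fit together, the snippet below times and then profiles a function; the function being profiled and the output file name are illustrative, not taken from the cited tutorial:

```r
# Sketch: timing and profiling an illustrative function.
slow_mean <- function(n) {             # hypothetical function to profile
  s <- 0
  for (i in 1:n) s <- s + sqrt(i)
  s / n
}

system.time(slow_mean(1e6))            # simple wall-clock measurement

Rprof("profile.out")                   # start writing profiling samples to a file
invisible(slow_mean(1e6))
Rprof(NULL)                            # stop profiling
summaryRprof("profile.out")$by.self    # where the time was actually spent
```

The $by.self table attributes time to the functions doing the work themselves, which is usually where loop hotspots like the one above show up.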
Optimizing Code: Example

Citation: Dirk Eddelbuettel, http://dirk.eddelbuettel.com/papers/useR2009hpcTutorial.pdf

> sillysum <- function(N) { s <- 0; for (i in 1:N) s <- s + i; return(s) }
> system.time(print(sillysum(1e7)))
[1] 5e+13
   user  system elapsed
 13.617   0.020  13.701
>
> system.time(print(sum(as.numeric(seq(1, 1e7)))))
[1] 5e+13
   user  system elapsed
  0.224   0.092   0.315

Replacing the loop yielded a gain of a factor of more than 40.
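The same total can also be obtained without materialising any sequence at all; this closed-form rewrite is our illustrative addition, not part of the cited tutorial:

```r
# Sketch: the sum 1 + 2 + ... + N has a closed form, so neither a loop
# nor a 10-million-element vector is needed.
N <- 1e7
closed_form <- N * (N + 1) / 2
vectorised  <- sum(as.numeric(seq(1, N)))
identical(closed_form, vectorised)   # check the two approaches agree
```

This is the general pattern behind such speedups: prefer a mathematical identity over a vectorised scan, and a vectorised scan over an interpreted loop.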
Running R Parallel

We need a cluster (like Newton, with 1500 processors, run on the 2nd floor of SMC).

Several R packages can execute code in parallel:
• NWS
• Rmpi
• snow (using MPI, PVM, NWS or sockets)
• papply
• taskPR
• multicore
We also need an HPC cluster and queue time, i.e. how long we are allowed to run our job on the shared resource.
Using SNOW

A simple example:

cl <- makeCluster(4, "MPI")
print(clusterCall(cl, function() Sys.info()[c("nodename", "machine")]))
stopCluster(cl)

and

params <- c("A", "B", "C", "D", "E", "F", "G", "H")
cl <- makeCluster(8, "MPI")
res <- parSapply(cl, params, FUN=function(x) myNEWFunction(x))

This will 'unroll' the parameters in params, one each, over the function argument given, utilising the cluster cl. In other words, we will be running eight copies of myNEWFunction() at once.
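MPI requires extra infrastructure to be installed; as an illustrative alternative, snow can also build a cluster over plain sockets on a single machine. The worker count and the toy function below are assumptions made for this sketch:

```r
library(snow)

# Sketch: a socket-based cluster needs no MPI/PVM installation.
cl <- makeCluster(4, type = "SOCK")          # four local worker processes
res <- parSapply(cl, 1:8, function(x) x^2)   # same parSapply interface as with MPI
stopCluster(cl)
res
```

Because the parSapply call is identical, code can be developed against a local SOCK cluster and later pointed at an MPI cluster on a machine like Newton.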
Current Status

We are writing the algorithm we have selected, for optimized use on Newton.
We will create a package and release it, with a paper, once the project is over.