Creating an Optimized Algorithm in R: Version 1, October 22, 2009
R: Background

Nobody owns it, yet R-related products have been created by:
• REvolution Computing (partnering with Microsoft/Intel): http://www.revolution-computing.com/industry/academic.php
• SAS (interface to SAS/IML): http://support.sas.com/rnd/app/studio/Rinterface2.html
• SPSS (interface to SPSS, including some use of Python): http://insideout.spss.com/2009/01/13/spss-statistics-and-r/
• Blue Reference Inc. (plugin for MS Office): http://inferenceforr.com/default.aspx
• Information Builders (R GUI for data mining): http://www.informationbuilders.com/products/webfocus/predictivemodeling.html
R Packages

CRAN: 1,783 packages in R 2.11; 1,977 packages in R 2.9
Cost: $0, but a lot of hours
Question: how many people in the world know all 1,977 R packages?
Some uses of R

Citation: http://blog.revolution-computing.com

library(maps)
map("state", interior = FALSE)
map("state", boundary = FALSE, col = "gray", add = TRUE)
GADM is a spatial database of the location of the world's administrative boundaries. Using the spplot function (from the sp package), we load the data for Switzerland and then plot each canton with a color denoting its primary language:

library(sp)
con <- url("http://gadm.org/data/rda/CHE_adm1.RData")
print(load(con))
close(con)
language <- c("german", "german", "german", "german", "german", "german",
              "french", "french", "german", "german", "french", "french",
              "german", "french", "german", "german", "german", "german",
              "german", "german", "german", "italian", "german", "french",
              "french", "german", "german")
gadm$language <- as.factor(language)
col <- rainbow(length(levels(gadm$language)))
spplot(gadm, "language", col.regions = col, main = "Swiss Language Regions")

Citation: AnthroSpace, "Download Global Administrative Areas as RData files"
Seven tips for "surviving" R

• Keep extensive written notes
• Find a way to search for R answers
• Learn to convert complex objects to canonical forms with unclass()
• Learn how to find and inspect classes and methods for objects
• Learn how to clear pesky attributes from objects
• Swallow your pride
• Learn and use R's many one-line idioms, rather than reinventing the wheel

Citation: John Mount, Win-Vector LLC
Writing a Function/Algorithm in R

Simply enough:

newRalgorithm <- function(x) OldAlgorithm(x)

Eg-

do_something <- function(x, y) {
  # Function code goes here ...
}

# Subset my data
orange_girls <- subset(crabs, sex == 'F' & sp == 'O')

# Call my function
do_something(orange_girls$CW, orange_girls$CL)

Citation:
http://cran.r-project.org/doc/manuals/R-exts.html#Top
http://www.bioinformaticszen.com/r_programming/data_analysis_using_r_functions_as_objects/
Writing a new stats algorithm (in R or another language)

Basic steps:
1. Journal review of the study area
2. Study of existing algorithms for gap analysis
3. Add creativity
4. Test and iterate within the community
5. Publish
Choosing Clustering as the Area of Interest

Clustering works with big data. It can work with many incomplete column variables when other techniques may not be suitable, and it works when data cannot be used for regression models.

Groups of clusters can be merged and combined to make new clusters, which makes a case for parallel processing.

Useful for product marketing, business, medicine, and finance.
K-Means Clustering using R

R> data("planets", package = "HSAUR")
R> library("scatterplot3d")
R> scatterplot3d(log(planets$mass), log(planets$period),
+    log(planets$eccen), type = "h", angle = 55,
+    pch = 16, y.ticklabs = seq(0, 10, by = 2),
+    y.margin.add = 0.1, scale.y = 0.7)
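The snippet above only visualises the planets data; a minimal sketch of actually running k-means on the same data might look like the following. The number of clusters (3), the seed, and the standardisation step are illustrative assumptions, not part of the original deck:

```r
# Sketch: k-means on the HSAUR planets data (assumes HSAUR is installed).
data("planets", package = "HSAUR")
planet_scaled <- scale(planets)        # standardise each variable first
set.seed(42)                           # k-means uses random starting centres
fit <- kmeans(planet_scaled, centers = 3, nstart = 25)
table(fit$cluster)                     # how many planets fall in each cluster
fit$centers                            # cluster centres on the standardised scale
```

Using nstart = 25 restarts the algorithm from several random configurations and keeps the best solution, which guards against poor local minima.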
Writing a Function/Algorithm in R, Part 2: Adding loops and multiple functions

Eg-

# Arrays of values for each type of species and sex
species <- unique(crabs$sp)
sexes <- unique(crabs$sex)

# Loop through species ...
for(i in 1:length(species)){
  # ... loop through sex ...
  for(j in 1:length(sexes)){
    # ... and finally call a function on each subset
    something_else(subset(crabs, sp == species[i] & sex == sexes[j]))
  }
}

Citation:
http://cran.r-project.org/doc/manuals/R-exts.html#Top
http://www.bioinformaticszen.com/r_programming/data_analysis_using_r_functions_as_objects/
More ways to write functions

each <- function(.column, .data, .lambda){
  # Find the column index from its name
  column_index <- which(names(.data) == .column)
  # Find the unique values in the column
  column_levels <- unique(.data[, column_index])
  # Loop over these values
  for(i in 1:length(column_levels)){
    # Subset the data and call the passed function on it
    .lambda(.data[.data[, column_index] == column_levels[i], ])
  }
}

The last argument, .lambda, is an R function; because R treats functions as objects, they can be passed as arguments to other functions.

# Another function as the last argument to this function
each("sp", crabs, something_else)

# Or create a new anonymous function ...
each("sp", crabs, function(x){
  # ... and run multiple lines of code here
  something_else(x)
  with(x, lm(CW ~ CL))
})
Additionally, create new functions using plyr.

From http://had.co.nz/plyr/: plyr is a set of tools that solves a common set of problems: you need to break a big problem down into manageable pieces, operate on each piece, and then put all the pieces back together. It is already possible to do this with split and the apply functions, but plyr makes it all a bit easier with:

• consistent names, arguments and outputs
• input from and output to data.frames, matrices and lists
• progress bars to keep track of long-running operations
• built-in error recovery

In short, a consistent and useful set of tools for solving the split-apply-combine problem.

library(plyr)
# Three arguments:
# 1. The dataframe
# 2. The names of the columns to subset by
# 3. The function to call on each subset
d_ply(crabs, .(sp, sex), something_else)
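As a sketch of the same split-apply-combine idea when you want the results back, ddply (also from plyr) collects the per-group values into a data frame. The summary statistic below is an illustrative choice, and crabs is assumed to come from the MASS package:

```r
library(MASS)   # for the crabs data set
library(plyr)

# Sketch: mean carapace width (CW) for each species/sex combination.
# The statistic chosen here is illustrative, not from the original deck.
ddply(crabs, .(sp, sex), summarise, mean_CW = mean(CW))
```

Whereas d_ply calls the function purely for its side effects, ddply returns one row per (sp, sex) group, ready for further analysis.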
Quick Recap

We have an algorithm in mind, or we create a new algorithm (the toughest part). (Eg. http://en.scientificcommons.org/42572415 — Genetic K-Means (GKM) or Genetic Regularized Mahalanobis (GARM) distances to compute the initial cluster parameters, with little difference in the final results. This innovation allows our algorithm to find optimal parameter estimates of complex hyperellipsoidal clusters. We develop and score the information complexity (ICOMP) criterion of Bozdogan (1994a,b, 2004) as our fitness function to choose the number of clusters present in the data sets.)

We created a function in R for it. We can also use this approach to rename package functions (like a SAS R package I created).

We now need to create a package so that all 2 million R users may have a chance to use it.
Creating a New Package

Citation: http://cran.r-project.org/doc/contrib/Leisch-CreatingPackages.pdf

1. Load all functions and data sets you want in the package into a clean R session, and run package.skeleton(). The objects are sorted into data and functions, skeleton help files are created for them using prompt(), and a DESCRIPTION file is created. The function then prints out a list of things for you to do next. This creates the package within the current working directory.

> package.skeleton(name="NAME_OF_PACKAGE", code_files="FILENAME.R")
Creating directories ...
Creating DESCRIPTION ...
Creating Read-and-delete-me ...
Copying code files ...
Making help files ...
Done.
Further steps are described in './linmod/Read-and-delete-me'.

Q: Where is my package?
A: getwd()
Q: What is the most important step in making software?
A: Documenting the HELP files
FINALLY

* Edit the help file skeletons in 'man', possibly combining help files for multiple functions.
* Put any C/C++/Fortran code in 'src'.
* If you have compiled code, add a .First.lib() function in 'R' to load the shared library.
* Run R CMD build to build the package tarball.
* Run R CMD check to check the package tarball.

Read "Writing R Extensions" for more information: http://cran.r-project.org/doc/manuals/R-exts.pdf. Also see the guidelines for CRAN submission.
Next Steps

We have new functions and a new package.
We now need to optimize the R package for performance, using:

1) Parallel computing
2) High-performance computing
3) Code optimization
Optimizing Code

Citation: Dirk Eddelbuettel, http://dirk.eddelbuettel.com/papers/useR2009hpcTutorial.pdf

R already provides the basic tools for performance analysis:
• the system.time function for simple measurements
• the Rprof function for profiling R code
• the Rprofmem function for profiling R memory usage

In addition, the profr and proftools packages on CRAN can be used to visualize Rprof data.

We use these tools to create visual images of how the algorithm loops, in case we don't know what the algorithm we created looks like visually, and to avoid multiple calls.
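As a sketch of how these tools fit together, the snippet below times and then profiles a function; the function being profiled and the output file name are illustrative, not taken from the cited tutorial:

```r
# Sketch: timing and profiling an illustrative function.
slow_mean <- function(n) {             # hypothetical function to profile
  s <- 0
  for (i in 1:n) s <- s + sqrt(i)
  s / n
}

system.time(slow_mean(1e6))            # simple wall-clock measurement

Rprof("profile.out")                   # start writing profiling samples to a file
invisible(slow_mean(1e6))
Rprof(NULL)                            # stop profiling
summaryRprof("profile.out")$by.self    # where the time was actually spent
```

The $by.self table attributes time to the functions doing the work themselves, which is usually where loop hotspots like the one above show up.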
Optimizing Code: Example

Citation: Dirk Eddelbuettel, http://dirk.eddelbuettel.com/papers/useR2009hpcTutorial.pdf

> sillysum <- function(N) { s <- 0; for (i in 1:N) s <- s + i; return(s) }
> system.time(print(sillysum(1e7)))
[1] 5e+13
   user  system elapsed
 13.617   0.020  13.701
>
> system.time(print(sum(as.numeric(seq(1, 1e7)))))
[1] 5e+13
   user  system elapsed
  0.224   0.092   0.315

Replacing the loop yielded a gain of a factor of more than 40.
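The same total can also be obtained without materialising any sequence at all; this closed-form rewrite is our illustrative addition, not part of the cited tutorial:

```r
# Sketch: the sum 1 + 2 + ... + N has a closed form, so neither a loop
# nor a 10-million-element vector is needed.
N <- 1e7
closed_form <- N * (N + 1) / 2
vectorised  <- sum(as.numeric(seq(1, N)))
identical(closed_form, vectorised)   # check the two approaches agree
```

This is the general pattern behind such speedups: prefer a mathematical identity over a vectorised scan, and a vectorised scan over an interpreted loop.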
Running R Parallel

We need a cluster (like Newton, with 1500 processors, run on the 2nd floor of SMC).

Several R packages can execute code in parallel:
• NWS
• Rmpi
• snow (using MPI, PVM, NWS or sockets)
• papply
• taskPR
• multicore
We also need an HPC cluster and queue time, i.e. how long we are allowed to run our job on the shared resource.
Using SNOW

A simple example:

cl <- makeCluster(4, "MPI")
print(clusterCall(cl, function() Sys.info()[c("nodename", "machine")]))
stopCluster(cl)

and

params <- c("A", "B", "C", "D", "E", "F", "G", "H")
cl <- makeCluster(8, "MPI")
res <- parSapply(cl, params, FUN=function(x) myNEWFunction(x))

This will 'unroll' the parameters in params, one each, over the function argument given, utilising the cluster cl. In other words, we will be running eight copies of myNEWFunction() at once.
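MPI requires extra infrastructure to be installed; as an illustrative alternative, snow can also build a cluster over plain sockets on a single machine. The worker count and the toy function below are assumptions made for this sketch:

```r
library(snow)

# Sketch: a socket-based cluster needs no MPI/PVM installation.
cl <- makeCluster(4, type = "SOCK")          # four local worker processes
res <- parSapply(cl, 1:8, function(x) x^2)   # same parSapply interface as with MPI
stopCluster(cl)
res
```

Because the parSapply call is identical, code can be developed against a local SOCK cluster and later pointed at an MPI cluster on a machine like Newton.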
Current Status

We are writing the algorithm we have selected, for optimized use on Newton.
We will create a package and release it, with a paper, once the project is over.