Creating an Optimized Algorithm in R: Version 1 October 22, 2009
Easy R

Jun 23, 2015

Ajay Ohri

creating new stats algorithms easily in R
Page 1: Easy R

Creating an Optimized Algorithm in R: 

Version 1, October 22, 2009

Page 2: Easy R

R: Background

Nobody owns R, yet R-related products have been created by:

REvolution Computing (partnering with Microsoft/Intel): http://www.revolution-computing.com/industry/academic.php

SAS (interface to SAS/IML): http://support.sas.com/rnd/app/studio/Rinterface2.html

SPSS (interface to SPSS, including some use of Python): http://insideout.spss.com/2009/01/13/spss-statistics-and-r/

Blue Reference Inc. (plugin for MS Office): http://inferenceforr.com/default.aspx

Information Builders (R GUI for data mining): http://www.informationbuilders.com/products/webfocus/predictivemodeling.html

Page 3: Easy R

R Packages

CRAN: 1783 packages in R 2.11

1977 packages in R 2.9

Cost: $0, but a lot of hours.

Question: How many people in the world know all 1977 R packages?

Page 4: Easy R

Some uses of R

Citation: http://blog.revolution-computing.com

library(maps)
map("state", interior = FALSE)
map("state", boundary = FALSE, col = "gray", add = TRUE)

GADM is a spatial database of the location of the world's administrative boundaries. The example below uses the spplot function (from the sp package): load the data for Switzerland, and then plot each canton with a color denoting its primary language:

library(sp)
con <- url("http://gadm.org/data/rda/CHE_adm1.RData")
print(load(con))
close(con)
language <- c("german", "german", "german", "german", "german", "german",
              "french", "french", "german", "german", "french", "french",
              "german", "french", "german", "german", "german", "german",
              "german", "german", "german", "italian", "german", "french",
              "french", "german", "german")
gadm$language <- as.factor(language)
col = rainbow(length(levels(gadm$language)))
spplot(gadm, "language", col.regions = col, main = "Swiss Language Regions")

Source: AnthroSpace, "Download Global Administrative Areas as RData files"

Page 5: Easy R

Seven tips for "surviving" R 

• Keep extensive written notes
• Find a way to search for R answers
• Learn to convert complex objects to canonical forms with unclass()
• Learn how to find and inspect classes and methods for objects
• Learn how to clear pesky attributes from objects
• Swallow your pride
• Learn and use R's many one-line idioms, rather than reinventing the wheel

Citation: John Mount, Win-Vector LLC
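Several of these tips can be tried directly at the console. The following is only an illustrative sketch on the built-in cars data set; the object names fit, raw_fit and x are made up for the example.

```r
# Fit a small model so there is a classed object to inspect
fit <- lm(dist ~ speed, data = cars)

class(fit)               # the S3 class of the object
methods(class = "lm")    # methods defined for that class

# unclass() strips the class and leaves the plain list underneath
raw_fit <- unclass(fit)
names(raw_fit)

# Clearing attributes from an object
x <- structure(1:3, note = "temporary attribute")
attributes(x) <- NULL
x
```

Looking at names(raw_fit) is often the quickest way to see what a fitted object actually stores.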

  

Page 6: Easy R

Writing a Function/Algorithm in R

Simply enough,

newRalgorithm <- function(x) OldAlgorithm(x)

Eg-

library(MASS)  # provides the crabs data set

do_something <- function(x, y) {
  # Function code goes here ...
}

# Subset my data
orange_girls <- subset(crabs, sex == 'F' & sp == 'O')

# Call my function
do_something(orange_girls$CW, orange_girls$C

Citation- http://cran.r-project.org/doc/manuals/R-exts.html#Top

http://www.bioinformaticszen.com/r_programming/data_analysis_using_r_functions_as_objects/
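To make the pattern concrete, here is a minimal runnable sketch. The function width_length_ratio is a hypothetical example, not part of the original slides; it assumes the crabs data set from the MASS package and its CW (carapace width) and CL (carapace length) columns.

```r
library(MASS)  # the crabs data set ships with MASS

# Hypothetical example function: ratio of mean width to mean length
width_length_ratio <- function(width, length) {
  mean(width) / mean(length)
}

# Subset the data: orange female crabs
orange_girls <- subset(crabs, sex == "F" & sp == "O")

# Call the function on two columns of the subset
ratio <- width_length_ratio(orange_girls$CW, orange_girls$CL)
ratio
```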

Page 7: Easy R

Writing a new stats algorithm (in R or another language)

Steps (basic idea):

1) Journal review of the study area
2) Study of existing algorithms for gap analysis
3) Add creativity
4) Test and iterate within the community
5) Publish

Page 8: Easy R

Choosing Clustering as the area of interest

Clustering works with Big Data. It can work with many incomplete column variables when other techniques may not be suitable, and it works when the data cannot be used for regression models.

Groups of clusters can be merged and combined to make new clusters, which makes a case for parallel processing.

Useful for product marketing, business, medicine and finance.

Page 9: Easy R

K Means Clustering using R

R> data("planets", package = "HSAUR")
R> library("scatterplot3d")
R> scatterplot3d(log(planets$mass), log(planets$period),
+     log(planets$eccen), type = "h", angle = 55,
+     pch = 16, y.ticklabs = seq(0, 10, by = 2),
+     y.margin.add = 0.1, scale.y = 0.7)
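The code above only draws the 3D scatterplot; the clustering step itself is not shown on the slide. A minimal sketch of k-means with the base kmeans() function follows. It uses the built-in iris measurements as a stand-in for the planets data, and the choice of three clusters and ten random starts is an assumption made for illustration.

```r
set.seed(42)  # k-means starts from random centres, so fix the seed

# Built-in iris measurements stand in for the planets data here
features <- scale(iris[, 1:4])

# Three clusters, best of ten random starts (both values are assumptions)
fit <- kmeans(features, centers = 3, nstart = 10)

table(fit$cluster)   # cluster sizes
fit$centers          # cluster centres on the scaled variables
```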

Page 10: Easy R

Writing a Function/Algorithm in R 2

Adding loops and multiple functions

Eg-

# Arrays of values for each type of species and sex
species <- unique(crabs$sp)
sexes <- unique(crabs$sex)

# Loop through species ...
for (i in 1:length(species)) {
  # ... loop through sex ...
  for (j in 1:length(sexes)) {
    # ... and finally call a function on each subset
    something_else(subset(crabs, sp == species[i] & sex == sexes[j]))
  }
}

Citation- http://cran.r-project.org/doc/manuals/R-exts.html#Top

http://www.bioinformaticszen.com/r_programming/data_analysis_using_r_functions_as_objects/
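The nested loop can be run end to end once something_else() is defined. In the sketch below, something_else() is a hypothetical stand-in that returns the mean carapace width of a subset, and the results are collected in a named vector so they can be inspected afterwards.

```r
library(MASS)  # provides the crabs data set

# Hypothetical stand-in for something_else(): mean carapace width of a subset
something_else <- function(d) mean(d$CW)

species <- unique(crabs$sp)
sexes <- unique(crabs$sex)

results <- numeric(0)
for (i in seq_along(species)) {
  for (j in seq_along(sexes)) {
    part <- subset(crabs, sp == species[i] & sex == sexes[j])
    key <- paste(species[i], sexes[j], sep = ".")
    results[key] <- something_else(part)   # store one value per group
  }
}
results
```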

Page 12: Easy R

More ways to write functions

each <- function(.column, .data, .lambda) {
  # Find the column index from its name
  column_index <- which(names(.data) == .column)
  # Find the unique values in the column
  column_levels <- unique(.data[, column_index])

  # Loop over these values
  for (i in 1:length(column_levels)) {
    # Subset the data and call the passed function on it
    .lambda(.data[.data[, column_index] == column_levels[i], ])
  }
}

The last argument, .lambda, is an R function. Because R treats functions as objects, they can be passed as arguments to other functions.

# Another function as the last argument to this function
each("sp", crabs, something_else)

# Or create a new anonymous function ...
each("sp", crabs, function(x) {
  # ... and run multiple lines of code here
  something_else(x)
  with(x, lm(CW ~ CL))
})
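Note that each() as written discards whatever .lambda returns, because a for loop does not collect values. A possible variant that keeps the results is sketched below; the name each_result and the use of split() with lapply() are assumptions for illustration, not the original author's code.

```r
library(MASS)  # provides the crabs data set

# Variant of each() that collects what the passed function returns
each_result <- function(.column, .data, .lambda) {
  pieces <- split(.data, .data[[.column]])  # one data frame per level
  lapply(pieces, .lambda)                   # named list of results
}

# Pass an anonymous function that fits a model to each species
fits <- each_result("sp", crabs, function(x) lm(CW ~ CL, data = x))

# Slope of CW on CL for each species
sapply(fits, function(f) coef(f)[["CL"]])
```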

Page 13: Easy R

Additionally, create new functions using plyr

From http://had.co.nz/plyr/

plyr is a set of tools that solves a common set of problems: you need to break a big problem down into manageable pieces, operate on each piece and then put all the pieces back together. It's already possible to do this with split and the apply functions, but plyr just makes it all a bit easier with:

• consistent names, arguments and outputs
• input from and output to data.frames, matrices and lists
• progress bars to keep track of long running operations
• built-in error recovery

In short, a consistent and useful set of tools for solving the split-apply-combine problem.

library(plyr)

# Three arguments
# 1. The dataframe
# 2. The name of columns to subset by
# 3. The function to call on each subset
d_ply(crabs, .(sp, sex), something_else)
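For comparison, the same split-apply-combine idea can be written with base R alone, using split, lapply and rbind. This is a sketch under the assumption that a small per-group summary is wanted; the summary columns n and mean_CW are chosen arbitrarily.

```r
library(MASS)  # provides the crabs data set

# Split: one data frame per species/sex combination
pieces <- split(crabs, list(crabs$sp, crabs$sex))

# Apply: summarise each piece (the summary chosen here is arbitrary)
summaries <- lapply(pieces, function(d) {
  data.frame(sp = d$sp[1], sex = d$sex[1], n = nrow(d), mean_CW = mean(d$CW))
})

# Combine: stack the per-group rows back into one data frame
combined <- do.call(rbind, summaries)
combined
```

plyr wraps these three steps in a single call, which is where the convenience comes from.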

Page 14: Easy R

Quick Recap

We have an algorithm in mind, or we create a new algorithm (the toughest part). (Eg. http://en.scientificcommons.org/42572415: "Genetic K-Means (GKM) or Genetic Regularized Mahalanobis (GARM) distances to compute the initial cluster parameters, with little difference in the final results. This innovation allows our algorithm to find optimal parameter estimates of complex hyperellipsoidal clusters. We develop and score the information complexity (ICOMP) criterion of Bozdogan (1994a,b, 2004) as our fitness function to choose the number of clusters present in the data sets.")

We created a function in R for it. We can also use this approach to rename package functions (as in a SAS R package I created).

We now need to create a package so that all 2 million R users may have a chance to use it.

Page 15: Easy R

Creating a New Package

Citation- http://cran.r-project.org/doc/contrib/Leisch-CreatingPackages.pdf

1. Load all functions and data sets you want in the package into a clean R session, and run package.skeleton(). The objects are sorted into data and functions, skeleton help files are created for them using prompt(), and a DESCRIPTION file is created. The function then prints out a list of things for you to do next. This creates the package within the current working directory.

> package.skeleton(name="NAME_OF_PACKAGE",code_files="FILENAME.R")
Creating directories ...
Creating DESCRIPTION ...
Creating Read-and-delete-me ...
Copying code files ...
Making help files ...
Done.
Further steps are described in './NAME_OF_PACKAGE/Read-and-delete-me'.

Q: Where is my package?
A: In the current working directory; run getwd() to see it.
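To try package.skeleton() without cluttering the working directory, the skeleton can be written to a temporary folder. In this sketch the package name demoPkg and the function add_one are hypothetical; the list, environment and path arguments are standard arguments of package.skeleton().

```r
# Keep the demo function in its own environment instead of the global one
pkg_env <- new.env()
pkg_env$add_one <- function(x) x + 1   # hypothetical function to package

# Write the skeleton into a temporary directory
out_dir <- tempfile("pkgdemo")
dir.create(out_dir)

package.skeleton(name = "demoPkg", list = "add_one",
                 environment = pkg_env, path = out_dir)

# Inspect what was generated
list.files(file.path(out_dir, "demoPkg"), recursive = TRUE)
```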

Page 16: Easy R

Creating a New Package

Citation-http://cran.r-project.org/doc/contrib/Leisch-CreatingPackages.pdf

Q: What is the best step in making software?

A: Documenting HELP

FINALLY

* Edit the help file skeletons in 'man', possibly combining help files for multiple functions.
* Put any C/C++/Fortran code in 'src'.
* If you have compiled code, add a .First.lib() function in 'R' to load the shared library.
* Run R CMD build to build the package tarball.
* Run R CMD check to check the package tarball.

Read "Writing R Extensions" for more information: http://cran.r-project.org/doc/manuals/R-exts.pdf

Also see the guidelines for CRAN submission.

Page 17: Easy R

Next Steps

We have new functions and a new package.

We now need to optimize the R package for performance, using:

1) Parallel Computing
2) High Performance Computing
3) Code Optimization

Page 18: Easy R

Optimizing Code

Citation: Dirk Eddelbuettel, http://dirk.eddelbuettel.com/papers/useR2009hpcTutorial.pdf

R already provides the basic tools for performance analysis:

• the system.time function for simple measurements
• the Rprof function for profiling R code
• the Rprofmem function for profiling R memory usage

In addition, the profr and proftools packages on CRAN can be used to visualize Rprof data.

We use these tools to create visual images of how the algorithm loops, in case we don't know what the algorithm we created looks like in execution, and to avoid repeated calls.
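A minimal profiling session with Rprof() and system.time() might look like the sketch below. The function slow_col_means is a hypothetical, deliberately loop-heavy example, and the matrix size and repeat count are arbitrary; this also assumes an R build with profiling enabled, which is the usual case.

```r
# Deliberately loop-heavy function to give the profiler something to see
slow_col_means <- function(m) {
  out <- numeric(ncol(m))
  for (j in seq_len(ncol(m))) out[j] <- mean(m[, j])
  out
}

m <- matrix(rnorm(5e5), ncol = 500)
prof_file <- tempfile(fileext = ".out")

Rprof(prof_file)                          # start sampling the call stack
for (k in 1:50) res <- slow_col_means(m)
Rprof(NULL)                               # stop profiling

# Very fast runs may record no samples, so guard the summary call
prof <- tryCatch(summaryRprof(prof_file), error = function(e) NULL)
if (!is.null(prof)) print(head(prof$by.self))

system.time(slow_col_means(m))            # coarse timing of a single call
```

The by.self table lists the functions in which the most time was spent, which is usually the first place to look for loops worth replacing.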

Page 19: Easy R

Optimizing Code :Example

Citation: Dirk Eddelbuettel, http://dirk.eddelbuettel.com/papers/useR2009hpcTutorial.pdf

> sillysum <- function(N) { s <- 0; for (i in 1:N) s <- s + i; return(s) }
> system.time(print(sillysum(1e7)))
[1] 5e+13
   user  system elapsed
 13.617   0.020  13.701
> system.time(print(sum(as.numeric(seq(1,1e7)))))
[1] 5e+13
   user  system elapsed
  0.224   0.092   0.315

Replacing the loop yielded a gain of a factor of more than 40.
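The comparison is easy to reproduce. The sketch below uses a smaller N (an assumption, to keep the run short) and checks that both approaches give the same answer before comparing times. The size of the speed-up will vary with the R version and the machine.

```r
# Loop version from the slide
sillysum <- function(N) { s <- 0; for (i in 1:N) s <- s + i; s }

N <- 1e6  # smaller than the slide's 1e7 so the example runs quickly

loop_time <- system.time(loop_result <- sillysum(N))
vec_time  <- system.time(vec_result  <- sum(as.numeric(seq_len(N))))

# Both approaches must agree before the timings mean anything
stopifnot(isTRUE(all.equal(loop_result, vec_result)))

c(loop = loop_time[["elapsed"]], vectorised = vec_time[["elapsed"]])
```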

Page 20: Easy R

Running R Parallel

We need a cluster (like Newton, with 1500 processors, run on the 2nd floor of SMC).

Several R packages execute code in parallel:

• NWS
• Rmpi
• snow (using MPI, PVM, NWS or sockets)
• papply
• taskPR
• multicore

Page 21: Easy R

Running R Parallel

We need an HPC cluster and also queue time, in terms of how long we can run our query on the shared resource.

Using SNOW

A simple example:

library(snow)
cl <- makeCluster(4, "MPI")
print(clusterCall(cl, function() Sys.info()[c("nodename", "machine")]))
stopCluster(cl)

and

params <- c("A", "B", "C", "D", "E", "F", "G", "H")
cl <- makeCluster(8, "MPI")
res <- parSapply(cl, params, FUN = function(x) myNEWFunction(x))

will 'unroll' the parameters params one-each over the function argument given, utilising the cluster cl. In other words, we will be running eight copies of myNEWFunction() at once.
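Without an MPI installation, the same pattern can be tried on a single machine. The sketch below swaps in the parallel package that ships with current versions of R (it provides the same makeCluster() and parSapply() interface as snow) and a local socket cluster of two workers instead of MPI. myNEWFunction here is a hypothetical placeholder, not the algorithm from the slides.

```r
library(parallel)  # ships with R; offers snow-style cluster functions

# Hypothetical placeholder for the real algorithm
myNEWFunction <- function(x) paste0("processed-", x)

params <- c("A", "B", "C", "D")

cl <- makeCluster(2)                        # local socket cluster, 2 workers
res <- parSapply(cl, params, myNEWFunction) # one call per parameter
stopCluster(cl)                             # always release the workers

res
```

On a real HPC cluster, the cluster type and worker count would follow the site's MPI setup and queue allocation.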

Page 22: Easy R

Current Status

We are writing the algorithm we have selected for optimized use on Newton.

We will create a package and release it with a paper once the project is over.