Speeding up R with Rcpp

Stephen Cristiano
Department of Biostatistics
Johns Hopkins University

March 1, 2014

What is Rcpp?

- Rcpp: seamless integration between R and C++.
- Extremely simple to connect C++ with R.
- Maintained by Dirk Eddelbuettel and Romain François.

Simple examples

library('Rcpp')

cppFunction('int square(int x) { return x*x; }')
square(7L)
## [1] 49

cppFunction('
int add(int x, int y, int z) {
  int sum = x + y + z;
  return sum;
}')
add(1, 2, 3)
## [1] 6

Everything revolves around .Call

C++ Level:

SEXP foo(SEXP a, SEXP b, SEXP C, ...);

R Level:

res <- .Call("foo", a, b, C, ..., PACKAGE="mypkg")

Why C++?

- One of the most frequently used programming languages. Easy to find help.
- Speed.
- Good chance what you want is already implemented in C++.
- From Wikipedia: ’C++ is a statically typed, free-form, multi-paradigm, compiled, general-purpose, powerful programming language.’

Why not C++?

- More difficult to debug.
- More difficult to modify.
- The population of potential users who understand both R and C++ is smaller.

Why Rcpp?

- Easy to use (honest).
- Clean and approachable API that enables high-performance code.
- R-style vectorized code at the C++ level.
- Programmer time vs. computer time: much more efficient code that does not take much longer to write.
- Enables access to advanced data structures and algorithms implemented in C++ but not provided by R.
- Handles garbage collection, so the Rcpp programmer should never have to worry about memory allocation and deallocation.

C++ in 2 minutes

cppFunction('
double sumC(NumericVector x) {
  int n = x.size();
  double total = 0;
  for(int i = 0; i < n; ++i) {
    total += x[i];
    if(total > 100)
      break;
  }
  return total;
}')

- Every variable must be declared with its data type.
- for loops have the structure for(initialization; condition; increment).
- Conditionals are the same as in R.
- End every statement with a semicolon.
- Vectors and arrays are 0-indexed.
- size() is a member function of the vector class: x.size() returns the size of x.
- While C++ can be a very complex language, just knowing these will enable you to write faster R functions.
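For readers who want to try these points without an R session, here is a minimal sketch of the same loop in plain standard C++. It is an analog, not Rcpp code: std::vector<double> stands in for NumericVector, and the name sum_until_100 is made up for this illustration.

```cpp
#include <cstddef>
#include <vector>

// Plain C++ analog of sumC (hypothetical name, no Rcpp involved).
// Adds elements left to right and stops once the running total exceeds 100.
double sum_until_100(const std::vector<double>& x) {
    const std::size_t n = x.size();         // size() is a member function
    double total = 0.0;                     // every variable has a declared type
    for (std::size_t i = 0; i < n; ++i) {   // for(initialization; condition; increment)
        total += x[i];                      // indexing starts at 0
        if (total > 100.0) {
            break;                          // leave the loop early
        }
    }
    return total;
}
```

Calling sum_until_100 on the values 60, 50, 10 stops after the second element, because 110 already exceeds 100.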

Typical bottlenecks in R

- Loops that depend on previous iterations, e.g. MCMC methods.
- Function calls are slow in R but have very little overhead in C++. Recursive functions are very inefficient in R.
- Lack of access to advanced data structures and algorithms that are available in C++ but not in R.

When to use Rcpp

- Before writing C++ code, you should first ask if it’s necessary.
- Take advantage of vectorization when possible.
- Most base R functions already call C functions. Make sure there isn’t already an efficient implementation of what you are trying to do.

Data Structures

- All R objects are internally represented by a SEXP: a pointer to an S expression object.
- Any R object can be passed down to C++ code: vectors, matrices, lists. Even functions and environments.
- A large number of user-visible classes for R objects, which contain pointers to the SEXP object:
  - IntegerVector
  - NumericVector
  - LogicalVector
  - CharacterVector
  - NumericMatrix
  - and many more

library(inline, quietly=TRUE)
##
## Attaching package: ’inline’
##
## The following object is masked from ’package:Rcpp’:
##
##     registerPlugin

src <- '
  Rcpp::NumericVector invec(vx);
  Rcpp::NumericVector outvec(vx);
  for(int i=0; i<invec.size(); i++) {
    outvec[i] = log(invec[i]);
  }
  return outvec;
'
fun <- cxxfunction(signature(vx="numeric"), src, plugin="Rcpp")
x <- seq(1.0, 3.0, by=1)
cbind(x, fun(x))
##           x
## [1,] 0.0000 0.0000
## [2,] 0.6931 0.6931
## [3,] 1.0986 1.0986

Note: outvec and invec point to the same underlying R object, so writing to outvec also overwrites x. That is why both columns above show log(x).

Use clone to avoid modifying the original vector.

src <- '
  Rcpp::NumericVector invec(vx);
  Rcpp::NumericVector outvec = Rcpp::clone(vx);
  for(int i=0; i<invec.size(); i++) {
    outvec[i] = log(invec[i]);
  }
  return outvec;
'
fun <- cxxfunction(signature(vx="numeric"), src, plugin="Rcpp")
x <- seq(1.0, 3.0, by=1)
cbind(x, fun(x))
##      x
## [1,] 1 0.0000
## [2,] 2 0.6931
## [3,] 3 1.0986
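The difference between the two versions above can be mimicked in plain C++ with shared handles. This is only an analogy under stated assumptions: std::shared_ptr plays the role of an Rcpp vector wrapping a SEXP, and the names SharedVec, log_shared and log_cloned are invented for this sketch.

```cpp
#include <cmath>
#include <memory>
#include <vector>

// A handle to heap storage; copying the handle does not copy the data.
using SharedVec = std::shared_ptr<std::vector<double>>;

// Analog of `NumericVector outvec(vx)`: a second handle to the same storage.
// Taking logs through it also changes what the caller sees.
SharedVec log_shared(const SharedVec& in) {
    SharedVec out = in;                      // same underlying data
    for (double& v : *out) v = std::log(v);
    return out;
}

// Analog of `Rcpp::clone(vx)`: deep-copy first, then modify the copy.
SharedVec log_cloned(const SharedVec& in) {
    SharedVec out = std::make_shared<std::vector<double>>(*in);  // new storage
    for (double& v : *out) v = std::log(v);
    return out;
}
```

With log_cloned the input keeps its original values; with log_shared the input and the result are the same object, which mirrors the two cbind outputs shown earlier.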

Creating R packages

Inspecting the source code of any R package will reveal the following directories:

- R: R functions.
- vignettes: LaTeX papers weaving R code and indicating the intended workflow of an analysis.
- man: documentation for exported R functions.
- src: compiled code.

The file DESCRIPTION provides a brief description of the project, a version number, and any packages on which your package depends.

Creating R packages

- All compiled code goes in the package's src/ directory.
- Code in src/ will be automatically compiled and shared libraries created when building the package.
- Create a skeleton Rcpp package with Rcpp.package.skeleton().

Anatomy of a .Call function

Example: discrete approximation to a joint posterior distribution. A nested for loop is unavoidable.

.h file (src/grid.h):

#ifndef _RCPP_GBBS_H
#define _RCPP_GBBS_H

#include <Rcpp.h>

RcppExport SEXP rcpp_discrete(SEXP grid, SEXP meangrid,
                              SEXP precgrid, SEXP dtheta, SEXP dprec, SEXP y);

#endif

Anatomy of a .Call function

.cpp file (src/rcpp_discrete.cpp):

#include "grid.h"
#include "Rmath.h"

RcppExport SEXP rcpp_discrete(SEXP grid, SEXP meangrid, SEXP precgrid,
                              SEXP dtheta, SEXP dprec, SEXP y) {
  //Rcpp::RNGScope scope;
  using namespace Rcpp;
  RNGScope scope;

  // convert SEXP inputs to C++ vectors
  NumericMatrix xgrid(grid);
  NumericVector xmeangrid(meangrid);
  NumericVector xprecgrid(precgrid);
  NumericVector xdtheta(dtheta);
  NumericVector xdprec(dprec);
  NumericVector xy(y);

  int G = xmeangrid.size();
  int H = xprecgrid.size();
  int N = xy.size();
  NumericVector tmp(N);
  double lik; // likelihood

  //dprec = dgamma(xprecgrid, xnu0[0]/2, xnu0[0]/2*xsigma20[0]);
  //dtheta = dnorm(xmeangrid, xmu0, sqrt(xtau20[0]));

  for(int g = 0; g < G; g++) {
    for(int h = 0; h < H; h++) {
      tmp = dnorm(xy, xmeangrid[g], 1/(sqrt(xprecgrid[h])));
      lik = 1;
      for(int n = 0; n < N; n++) {
        lik *= tmp[n];
      }
      xgrid(g, h) = xdtheta[g] * xdprec[h] * lik;
    }
  }
  return xgrid;
}
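To make the arithmetic of the nested loop easier to follow outside of R, the sketch below repeats the same computation with the standard library only. It is an illustration under assumptions, not the package code: the names normal_pdf and posterior_grid are hypothetical, the normal density is written out by hand instead of calling dnorm, and the G x H grid is returned as a row-major std::vector rather than a NumericMatrix.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Normal density with mean mu and standard deviation sd (hand-written stand-in for dnorm).
double normal_pdf(double y, double mu, double sd) {
    const double kPi = 3.14159265358979323846;
    const double z = (y - mu) / sd;
    return std::exp(-0.5 * z * z) / (sd * std::sqrt(2.0 * kPi));
}

// Unnormalized posterior on a G x H grid, stored row-major:
// element (g, h) lives at index g * H + h.
std::vector<double> posterior_grid(const std::vector<double>& mean_grid,
                                   const std::vector<double>& prec_grid,
                                   const std::vector<double>& dtheta,
                                   const std::vector<double>& dprec,
                                   const std::vector<double>& y) {
    const std::size_t G = mean_grid.size();
    const std::size_t H = prec_grid.size();
    std::vector<double> grid(G * H);
    for (std::size_t g = 0; g < G; ++g) {
        for (std::size_t h = 0; h < H; ++h) {
            const double sd = 1.0 / std::sqrt(prec_grid[h]);  // precision -> sd
            double lik = 1.0;                                 // likelihood of all y
            for (double yi : y) {
                lik *= normal_pdf(yi, mean_grid[g], sd);
            }
            grid[g * H + h] = dtheta[g] * dprec[h] * lik;     // prior * likelihood
        }
    }
    return grid;
}
```

Each cell is the product of the two prior densities and the likelihood of the data at that (mean, precision) pair, exactly as in the loop body above.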

In R:

y <- c(1.64, 1.7, 1.72, 1.74, 1.82, 1.82, 1.82, 1.9, 2.08)
mu0 <- 1.9
tau20 <- 0.95^2
nu0 <- 1
sigma20 <- 1e-02
set.seed(1)
G <- H <- 500
mean.grid <- seq(1.5, 2, length = G)
prec.grid <- seq(1.75, 175, length = H)
post.grid <- matrix(NA, G, H)
dtheta <- dnorm(mean.grid, mu0, sqrt(tau20))
dprec <- dgamma(prec.grid, nu0/2, nu0/2 * sigma20)
for (g in seq_len(G)) {
  for (h in seq_len(H)) {
    post.grid[g, h] <- dtheta[g] *
      dprec[h] * prod(dnorm(y, mean.grid[g], 1/sqrt(prec.grid[h])))
  }
}

System time for R

system.time(for (g in seq_len(G)) {
  for (h in seq_len(H)) {
    post.grid[g, h] <- dtheta[g] * dprec[h] * prod(dnorm(y,
      mean.grid[g], 1/sqrt(prec.grid[h])))
  }
})
##    user  system elapsed
##   2.028   0.006   2.033

System time for .Call

library(CNPBayes, quietly=TRUE, verbose=FALSE)
## Loading required package: grid
## Loading required package: lattice
## Loading required package: survival
## Loading required package: splines
## Loading required package: Formula
##
## Attaching package: ’Hmisc’
##
## The following objects are masked from ’package:base’:
##
##     format.pval, round.POSIXt, trunc.POSIXt, units

post.grid2 <- matrix(0, G, H)
system.time(grid2 <- .Call("rcpp_discrete", post.grid2,
                           mean.grid, prec.grid, dtheta, dprec, y))
##    user  system elapsed
##   0.054   0.000   0.055

This is a speedup of roughly 37x (2.033 s versus 0.055 s elapsed).

Rcpp Sugar

- Rcpp sugar brings a higher level of abstraction to C++ code written in Rcpp.
- Avoid C++ loops with code that strongly resembles R.
- Takes advantage of operator overloading.
- Despite the similar syntax, sugar code runs much faster than the equivalent R code, though not quite as fast as manually optimized C++ code.

Example

pdistR <- function(x, ys) {
  (x - ys)^2
}

cppFunction('NumericVector pdistC2(double x, NumericVector ys) {
  return pow((x-ys), 2);
}')

pdistR(5.0, c(4.1,-9.3,0, 13.7))
## [1]   0.81 204.49  25.00  75.69

pdistC2(5.0, c(4.1,-9.3,0, 13.7))
## [1]   0.81 204.49  25.00  75.69

Logical Operators

// two numeric vectors of the same size
NumericVector x;
NumericVector y;

// expressions involving two vectors
LogicalVector res1 = x < y;
LogicalVector res2 = x != y;

// one vector, one single value
LogicalVector res3 = x < 2;

// two expressions
LogicalVector res4 = (x + y) == (x*x);

// functions producing single boolean result
all(x*x < 3);
any(x*x < 3);
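For comparison, this is roughly what the same elementwise comparison and the all/any reductions look like when written by hand against the standard library. It is a sketch to show what sugar saves you from typing, not how Rcpp implements it; the function names are hypothetical, and less_than assumes y is at least as long as x.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hand-written analog of the sugar expression `x < y`.
// Assumes y.size() >= x.size(); the result has one entry per element of x.
std::vector<bool> less_than(const std::vector<double>& x,
                            const std::vector<double>& y) {
    std::vector<bool> res(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) {
        res[i] = x[i] < y[i];
    }
    return res;
}

// Hand-written analog of `all(x*x < bound)`.
bool all_squares_below(const std::vector<double>& x, double bound) {
    return std::all_of(x.begin(), x.end(),
                       [bound](double v) { return v * v < bound; });
}

// Hand-written analog of `any(x*x < bound)`.
bool any_square_below(const std::vector<double>& x, double bound) {
    return std::any_of(x.begin(), x.end(),
                       [bound](double v) { return v * v < bound; });
}
```

The sugar versions express each of these in a single line, which is the point of the higher level of abstraction.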

Logical Operators

There are many functions similar to those that exist in R:

is_na(x);
seq_along(x);
sapply( seq_len(10), square<int>() );
ifelse( x < y, x, (x+y)*y );
pmin( x, x*x);
diff( xx );
intersect( xx, yy); // returns the intersection of two vectors
unique( xx ); // subset of unique values in input vector

// math functions
abs(x); exp(x); log(x); ceil(x);
sqrt(x); sin(x); gamma(x);
range(x);
mean(x); sd(x); var(x);
which_min(x); which_max(x);
// A bunch more

Density and random number generation functions

Rcpp has access to the same density, distribution, and RNG functions used by R itself. For example, you can draw from a gamma distribution with scale and shape parameters equal to 1 with:

cppFunction('NumericVector getRGamma() {
  RNGScope scope;
  NumericVector x = rgamma( 10, 1, 1 );
  return x;
}')

getRGamma()
## [1] 0.08369 0.83618 1.22254 1.15836 0.99002 0.30737 0.09462 0.15720
## [9] 0.31081 0.46873
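As a point of comparison, the standard C++ <random> header can also draw gamma variates without R. The sketch below is an assumption-laden analog, not a replacement: draw_gamma is a made-up name, and because it uses std::mt19937 rather than R's generator it will not reproduce the numbers above or respond to set.seed().

```cpp
#include <cstddef>
#include <random>
#include <vector>

// Draw n gamma variates with the given shape and scale using the standard library.
// The seed makes the sequence repeatable within one standard library implementation;
// it is unrelated to R's RNG state.
std::vector<double> draw_gamma(std::size_t n, double shape, double scale,
                               unsigned seed) {
    std::mt19937 gen(seed);
    std::gamma_distribution<double> dist(shape, scale);  // (alpha = shape, beta = scale)
    std::vector<double> out(n);
    for (double& v : out) {
        v = dist(gen);
    }
    return out;
}
```

Going through Rcpp's rgamma instead keeps the draws tied to R's own random number stream, which is usually what you want inside an R analysis.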

RcppArmadillo

- Armadillo is a high-level and easy-to-use C++ linear algebra library with syntax similar to Matlab.
- RcppArmadillo is an Rcpp interface allowing access to the Armadillo library.

Resources

- ’Seamless R and C++ Integration with Rcpp’ by Dirk Eddelbuettel. Excellent book for learning Rcpp. Available for free through the JHU library.
- Hadley Wickham’s Rcpp tutorial: http://adv-r.had.co.nz/Rcpp.html
- A huge number of examples at http://gallery.rcpp.org
- Stack Exchange.
- Rumen