Top Banner
Package ‘pensim’ February 20, 2015 Type Package Title Simulation of high-dimensional data and parallelized repeated penalized regression Version 1.2.9 Author L. Waldron, M. Pintilie, M.- S. Tsao, F. A. Shepherd, C. Huttenhower*, I. Jurisica* (*equal contribution) Maintainer Levi Waldron <[email protected]> Depends R (>= 2.10.0) Imports parallel, penalized, MASS Suggests survivalROC, survival Description Simulation of continuous, correlated high-dimensional data with time to event or bi- nary response, and parallelized functions for Lasso, Ridge, and Elastic Net penalized regres- sion with repeated starts and two-dimensional tuning of the Elastic Net. License GPL (>= 2) LazyLoad yes NeedsCompilation no Repository CRAN Date/Publication 2014-03-14 11:32:43 R topics documented: pensim-package ....................................... 2 beer.exprs .......................................... 3 beer.survival ......................................... 4 create.data .......................................... 4 opt.nested.crossval ..................................... 7 opt.splitval .......................................... 11 opt1D ............................................ 13 opt2D ............................................ 16 scan.l1l2 ........................................... 19 Index 23 1
23

Package ‘pensim’ - The Comprehensive R Archive … · Package ‘pensim ’ February 20, 2015 ... status a numeric vector os a numeric vector Source Beer DG, Kardia SL, Huang

Sep 04, 2018

Download

Documents

phamtram
Welcome message from author
This document is posted to help you gain knowledge. Please leave a comment to let me know what you think about it! Share it to your friends and learn new things together.
Transcript
Page 1: Package ‘pensim’ - The Comprehensive R Archive … · Package ‘pensim ’ February 20, 2015 ... status a numeric vector os a numeric vector Source Beer DG, Kardia SL, Huang

Package ‘pensim’February 20, 2015

Type Package

Title Simulation of high-dimensional data and parallelized repeatedpenalized regression

Version 1.2.9

Author L. Waldron, M. Pintilie, M.-S. Tsao, F. A. Shepherd, C. Huttenhower*, I. Jurisica* (*equal contribution)

Maintainer Levi Waldron <[email protected]>

Depends R (>= 2.10.0)

Imports parallel, penalized, MASS

Suggests survivalROC, survival

Description Simulation of continuous, correlated high-dimensional data with time to event or bi-nary response, and parallelized functions for Lasso, Ridge, and Elastic Net penalized regres-sion with repeated starts and two-dimensional tuning of the Elastic Net.

License GPL (>= 2)

LazyLoad yes

NeedsCompilation no

Repository CRAN

Date/Publication 2014-03-14 11:32:43

R topics documented:pensim-package . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2beer.exprs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3beer.survival . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4create.data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4opt.nested.crossval . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7opt.splitval . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11opt1D . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13opt2D . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16scan.l1l2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

Index 23

1

Page 2: Package ‘pensim’ - The Comprehensive R Archive … · Package ‘pensim ’ February 20, 2015 ... status a numeric vector os a numeric vector Source Beer DG, Kardia SL, Huang

2 pensim-package

pensim-package Functions and data for simulation of high-dimensional data and par-allelized repeated penalized regression

Description

Simulation of continuous, correlated high-dimensional data with time-to-event or binary response,and parallelized functions for Lasso, Ridge, and Elastic Net penalized regression model training andvalidation by split-sample or nested cross-validation. See the help page for opt.nested.crossval() forthe most extensive usage examples.

Details

Package: pensimType: PackageLicense: GPL (>=2)LazyLoad: yes

Model training and validation by Lasso, Ridge, and Elastic Net penalized regression. This packagealso contains a function for simulation of correlated high-dimensional data with binary or time-to-event response.

Author(s)

Levi Waldron

Maintainer: Levi Waldron <[email protected]>

References

Waldron L, Pintilie M, Tsao M-S, Shepherd FA, Huttenhower C*, Jurisica I*: Optimized applica-tion of penalized regression methods to diverse genomic data. Bioinformatics 2011, 27:3399-3406.(*equal contribution)

See Also

penalized-package

Examples

set.seed(9)##create some data, with one of a group of five correlated variables##having an association with the binary outcome:x <- create.data(nvars=c(10, 3), cors=c(0, 0.8),

associations=c(0, 2), firstonly=c(TRUE, TRUE), nsamples=50,response="binary", logisticintercept=0.5)

x$summary

Page 3: Package ‘pensim’ - The Comprehensive R Archive … · Package ‘pensim ’ February 20, 2015 ... status a numeric vector os a numeric vector Source Beer DG, Kardia SL, Huang

beer.exprs 3

##predictor data frame and binary response vectorpen.data <- x$data[, -match("outcome", colnames(x$data))]response <- x$data[, match("outcome", colnames(x$data))]## lasso regression. Note that epsilon=1e-2 is passed onto optL1, and## reduces the precision of the tuning compared to the default 1e-10.output <- opt1D(nsim=1, nprocessors=1, penalized=pen.data, response=response, epsilon=1e-2)cc <- output[which.max(output[, "cvl"]), -1:-3] ##non-zero b.* are true positives

beer.exprs Lung adenocarcinoma microarray expression data of Beer et al.(2002)

Description

Lung adenocarcinomas were profiled by Beer et al. (2002) using Affymetrix hu6800 microarrays.The data here were normalized from raw .CEL files by RMAExpress (v0.3). The expression matrixcontains expression data for 86 patients with 7,129 probe sets.

Usage

data(beer.exprs)

Format

A data frame with 7129 probe sets (rows) for 86 patients (columns)

Source

Beer DG, Kardia SL, Huang C, Giordano TJ, Levin AM, Misek DE, Lin L, Chen G, Gharib TG,Thomas DG, Lizyness ML, Kuick R, Hayasaka S, Taylor JM, Iannettoni MD, Orringer MB, HanashS: Gene-expression profiles predict survival of patients with lung adenocarcinoma. Nat Med 2002,8:816-824.

References

Irizarry, R.A., et al. (2003) Summaries of Affymetrix GeneChip probe level data, Nucl. Acids Res.,31, e15+-e15+.

Examples

data(beer.exprs)mysd <- apply(beer.exprs, 1, sd)

beer.subset <- as.matrix(beer.exprs[rank(-mysd) <= 100, ])heatmap(beer.subset)

Page 4: Package ‘pensim’ - The Comprehensive R Archive … · Package ‘pensim ’ February 20, 2015 ... status a numeric vector os a numeric vector Source Beer DG, Kardia SL, Huang

4 create.data

beer.survival Survival data for Beer et al. (2002) lung adenocarcinoma study

Description

Overall survival time for 86 lung adenocarcinoma patients, with 62 of the 86 events being censored.

Usage

data(beer.survival)

Format

A data frame with 86 observations on the following 2 variables.

status a numeric vector

os a numeric vector

Source

Beer DG, Kardia SL, Huang C, Giordano TJ, Levin AM, Misek DE, Lin L, Chen G, Gharib TG,Thomas DG, Lizyness ML, Kuick R, Hayasaka S, Taylor JM, Iannettoni MD, Orringer MB, HanashS: Gene-expression pr ofiles predict survival of patients with lung adenocarcinoma. Nat Med 2002,8:816-824.

Examples

data(beer.survival)library(survival)surv.obj <- with(beer.survival, Surv(os, status))surv.obj.rev <- with(beer.survival, Surv(os, 1-status))survfit(surv.obj.rev~1) #reverse KM estimate of follow-up time (months)(my.survfit <- survfit(surv.obj~1)) ##KM estimate of survivalplot(my.survfit, xlab="Time (months)",

ylab="KM estimate of overall survival")legend("bottomright", lty=c(1, 2), pch=-1,

legend=c("KM estimate", "95 percent confidence interval"))

create.data simulate correlated predictors with time-to-event or binary outcome

Description

This function creates multiple groups of predictor variables which may be correlated within eachgroup, and binary or survival time (without censoring) response according to specified weights ofthe predictors.

Page 5: Package ‘pensim’ - The Comprehensive R Archive … · Package ‘pensim ’ February 20, 2015 ... status a numeric vector os a numeric vector Source Beer DG, Kardia SL, Huang

create.data 5

Usage

create.data(nvars = c(100, 100, 100, 100, 600),cors = c(0.8, 0, 0.8, 0, 0),associations = c(0.5, 0.5, 0.3, 0.3, 0),firstonly = c(TRUE, FALSE, TRUE, FALSE, FALSE),nsamples = 100,censoring = "none",labelswapprob = 0,response = "timetoevent",basehaz = 0.2,logisticintercept = 0)

Arguments

nvars integer vector giving the number of variables of each variable type. The numberof variable types is equal to the length of this vector.

cors integer vector of the same length as nvars, giving the population pairwise Pear-son correlation within each group.

associations integer vector of the same length as nvars, giving the associations of each typewith outcome

firstonly logical vector of the same length as nvars, specifying whether only the first vari-able of each type is associated with outcome (TRUE) or all variables of that type(FALSE)

nsamples an integer giving the number of observations

censoring "none" for no censoring, or a vector of length two c(a,b) for uniform U(a,b)censoring.

labelswapprob This provides an option to add uncertainty to binary outcomes by randomlyswitching labels with probability labelswapprob. The probability of a label be-ing swapped is independent for each observation. The value is ignored if re-sponse is "timetoevent"

response either "timetoevent" or "binary"

basehaz baseline hazard, used for "timetoevent"

logisticintercept

intercept which is added to X%*%Beta for "binary"

Details

This function simulates "predictor" variables in one or more groups, which are standard normallydistributed. The user can specify the population correlation within each variable group, the associ-ation of each variable group to outcome, and whether the first or all variables of that type should beassociated with outcome. The simulated response variable can be time to event with an exponentialdistribution, or binary survival with a logistic distribution.

Page 6: Package ‘pensim’ - The Comprehensive R Archive … · Package ‘pensim ’ February 20, 2015 ... status a numeric vector os a numeric vector Source Beer DG, Kardia SL, Huang

6 create.data

Value

Returns a list with items:

summary a summary of the variable types produced

associations weights of each variable in computing the outcome

covariance covariance matrix used for generating potentially correlated random predictors

data dataframe containing the predictors and response. Response is the last columnfor binary outcome ("outcome"), and the last two columns for timetoevent out-come ("time" and "cens")

Note

Depends on the MASS package for correlated random number generation

Author(s)

Levi Waldron et al.

References

Waldron L., Pintilie M., Tsao M.-S., Shepherd F. A., Huttenhower C.*, and Jurisica I.* Optimizedapplication of penalized regression methods to diverse genomic data. (2010). Under review. (*equalcontribution)

Examples

##binary outcome exampleset.seed(9)x <- create.data(nvars=c(15, 5), cors=c(0, 0.8), associations=c(0, 2),

firstonly=c(TRUE, TRUE), nsamples=50, response="binary",logisticintercept=0.5)

summary(x)x$summarymodel <- glm(outcome ~ ., data=x$data, family=binomial)summary(model)dat <- t(as.matrix(x$data[, -match("outcome", colnames(x$data))]))heatmap(dat, ColSideColors=ifelse(x$data$outcome==0, "black", "white"))

##censored survival outcome example:set.seed(1)x <- create.data(nvars=c(15, 5),

cors=c(0, 0.8),associations=c(0, 2),firstonly=c(TRUE, TRUE),nsamples=50,censoring=c(2, 10),response="timetoevent")

sum(x$data$cens==0)/nrow(x$data) #34 percent censoring

library(survival)

Page 7: Package ‘pensim’ - The Comprehensive R Archive … · Package ‘pensim ’ February 20, 2015 ... status a numeric vector os a numeric vector Source Beer DG, Kardia SL, Huang

opt.nested.crossval 7

surv.obj <- Surv(x$data$time, x$data$cens)plot(survfit(surv.obj~1), ylab="Survival probability", xlab="time")

opt.nested.crossval Parallelized calculation of cross-validated risk score predictions fromL1/L2/Elastic Net penalized regression.

Description

calculates risk score predictions by a nested cross-validation, using the optL1 and optL2 functions ofthe penalized R package for regression. In the outer level of cross-validation, samples are split intotraining and test samples. Model parameters are tuned by cross-validation within training samplesonly.

By setting nprocessors > 1, the outer cross-validation is split between multiple processors.

The functions support z-score scaling of training data, and application of these scaling and shiftingcoefficients to the test data. It also supports repeated tuning of the penalty parameters and selectionof the model with greatest cross-validated likelihood.

Usage

opt.nested.crossval(outerfold=10, nprocessors=1, cl=NULL, ...)

Arguments

outerfold number of folds in outer cross-validation (the level used for validation)

nprocessors An integer number of processors to use. If specified in opt.nested.crossval, itera-tions of the outer cross-validation are sent to different processors. If specified inopt.splitval, repeated starts for the penalty tuning are sent to different processors.

cl Optional cluster object created with the makeCluster() function of the parallelpackage. If this is not set, pensim calls makeCluster(nprocessors, type="SOCK").Setting this parameter can enable parallelization in more diverse scenarios thanmulti-core desktops; see the documentation for the parallel package. Note that ifcl is user-defined, this function will not automatically run parallel::stopCluster()to shut down the cluster.

... optFUN (either "opt1D" or "opt2D"), scaling (TRUE to z-score training datathen apply the same shift and scale factors to test data, FALSE for no scaling)are passed onto the opt.splitval function. Additional arguments are required, tobe passed to the optL1 or optL2 function of the penalized R package. See thosehelp pages, and it may be desirable to test these arguments directly on optL1 oroptL2 before using this more CPU-consuming and complex function.

Page 8: Package ‘pensim’ - The Comprehensive R Archive … · Package ‘pensim ’ February 20, 2015 ... status a numeric vector os a numeric vector Source Beer DG, Kardia SL, Huang

8 opt.nested.crossval

Details

This function calculates cross-validated risk score predictions, tuning a penalized regression modelusing the optL1 or optL2 functions of the penalized R package, for each iteration of the cross-validation. Tuning is done by cross-validation in the training samples only. Test samples are scaledusing the shift and scale factors determined from the training samples. parameter. If nprocessors >1, it uses the SNOW package for parallelization, dividing the iterations of the outer cross-validationamong the specified number of processors.

Some arguments MUST be passed (through the ... arguments) but which are documented for thefunctions in which they are used. These include, from the opt.splitval function:

optFUN="opt1D" for Lasso or Ridge regression, or "opt2D" for Elastic Net. See the help pages foropt1D and opt2D for additional arguments associated with these functions.

scaling=TRUE to scale each feature (column) of the training sample to z-scores. These same scalingand shifting factors are applied to the test data. If FALSE, no scaling is done. Note that only data inthe penalized argument are scaled, not the optional unpenalized argument (see documentation foropt1D, opt2D, or cvl from the penalized package for descriptions of the penalized and unpenalizedarguments). Alternatively, the standardize=TRUE argument to the penalized package functions canbe used to do scaling internally.

nsim=50 this number specifies the number of times to repeat tuning of the penalty parameters ondifferent data foldings for the cross-validation.

setpen="L1" or "L2" : if optFUN="opt1D", this sets regression type to LASSO or Ridge, respec-tively. See ?opt1D.

L1range, L2range, dofirst, L1gridsize, L2gridsize: options for Elastic Net regression if optFUN="opt2D".See ?opt2D.

Value

Returns a vector of cross-validated continuous risk score predictions.

Note

Depends on the R packages: penalized, parallel

Author(s)

Levi Waldron et al.

References

Waldron L, Pintilie M, Tsao M-S, Shepherd FA, Huttenhower C*, Jurisica I*: Optimized applica-tion of penalized regression methods to diverse genomic data. Bioinformatics 2011, 27:3399-3406.(*equal contribution)

See Also

opt.splitval

Page 9: Package ‘pensim’ - The Comprehensive R Archive … · Package ‘pensim ’ February 20, 2015 ... status a numeric vector os a numeric vector Source Beer DG, Kardia SL, Huang

opt.nested.crossval 9

Examples

data(beer.exprs)data(beer.survival)

##select just 100 genes to speed computation:set.seed(1)beer.exprs.sample <- beer.exprs[sample(1:nrow(beer.exprs), 100), ]

gene.quant <- apply(beer.exprs.sample, 1, quantile, probs=0.75)dat.filt <- beer.exprs.sample[gene.quant>log2(100), ]gene.iqr <- apply(dat.filt, 1, IQR)dat.filt <- as.matrix(dat.filt[gene.iqr>0.5, ])dat.filt <- t(dat.filt)dat.filt <- data.frame(dat.filt)

library(survival)surv.obj <- Surv(beer.survival$os, beer.survival$status)

## First, test the regression arguments using functions from## the penalized package. I use maxlambda1=5 here to ensure at least## one non-zero coefficient.testfit <- penalized::optL1(response=surv.obj,

maxlambda1=5,penalized=dat.filt,fold=5,positive=FALSE,standardize=TRUE,trace=TRUE)

## Now pass these arguments to opt.nested.splitval() for cross-validated## calculation and assessment of risk scores, with the additional## arguments:## outerfold and nprocessors (?opt.nested.crossval)## optFUN and scaling (?opt.splitval)## setpen and nsim (?opt1D)

## Ideally nsim would be 50, and outerfold and fold would be 10, but the## values below speed computation 200x compared to these recommended## values. Note that here we are using the standardize=TRUE argument of## optL1 rather than the scaling=TRUE argument of opt.splitval. These## two approaches to scaling are roughly equivalent, but the scaling## approaches are not the same (scaling=TRUE does z-score,## standardize=TRUE scales to unit central L2 norm), and results will## not be identical. Also, using standardize=TRUE scales variables but## provides coeffients for the original scale, whereas using## scaling=TRUE scales variables in the training set then applies the## same scales to the test set.set.seed(1)## In this example I use two processors:preds <- opt.nested.crossval(outerfold=5, nprocessors=2, #opt.nested.crossval arguments

optFUN="opt1D", scaling=FALSE, #opt.splitval argumentssetpen="L1", nsim=1, #opt1D arguments

Page 10: Package ‘pensim’ - The Comprehensive R Archive … · Package ‘pensim ’ February 20, 2015 ... status a numeric vector os a numeric vector Source Beer DG, Kardia SL, Huang

10 opt.nested.crossval

response=surv.obj, #rest are penalized::optl1 argumentspenalized=dat.filt,fold=2,maxlambda1=5,positive=FALSE,standardize=TRUE,trace=FALSE)

## We probably also want the coefficients from the model fit on all the## data, for future use:beer.coefs <- opt1D(setpen="L1", nsim=1,

maxlambda1=5,response=surv.obj,penalized=dat.filt,fold=5,positive=FALSE,standardize=TRUE,trace=FALSE)

## We can also include unpenalized covariates, if desired.## Note that when keeping only one variable for a penalized or## unpenalized covariate, indexing a dataframe like [1] instead of doing## [,1] preserves the variable name. With [,1] the variable name gets## converted to "".

beer.coefs <- opt1D(setpen="L1", nsim=1,maxlambda1=5,response=surv.obj,penalized=dat.filt[-1], # This is equivalent to dat.filt[, -1]unpenalized=dat.filt[1],fold=5,positive=FALSE,standardize=TRUE,trace=FALSE)

## (note the non-zero first coefficient this time, due to it being unpenalized).

## Summarization and plotting.preds.dichot <- preds > median(preds)

coxfit.continuous <- coxph(surv.obj~preds)coxfit.dichot <- coxph(surv.obj~preds.dichot)summary(coxfit.continuous)summary(coxfit.dichot)

if(require(survivalROC)){nobs <- length(preds)cutoff <- 12preds.roc <- survivalROC(Stime=beer.survival$os, status=beer.survival$status,

marker=preds, predict.time=cutoff, span = 0.25*nobs^(-0.20))plot(preds.roc$FP, preds.roc$TP, type="l", xlim=c(0,1), ylim=c(0,1),

lty=2,xlab=paste( "FP", "\n", "AUC = ", round(preds.roc$AUC, 3)),ylab="TP", main="LASSO predictions\n ROC curve at 12 months")

Page 11: Package ‘pensim’ - The Comprehensive R Archive … · Package ‘pensim ’ February 20, 2015 ... status a numeric vector os a numeric vector Source Beer DG, Kardia SL, Huang

opt.splitval 11

abline(0, 1)}

## clean-uprm(nobs, cutoff, preds.roc, coxfit.dichot, coxfit.continuous, preds,

preds.dichot, dat.filt, gene.iqr, gene.quant)

opt.splitval Parallelized calculation of split training/test set predictions fromL1/L2/Elastic Net penalized regression.

Description

uses a single training/test split to train a penalized regression model in the training samples, thenuse the model to calculate values of the linear risk score in the test samples. This function is usedby opt.nested.crossval, but can also be used on its own.

This function support z-score scaling of training data, and application of these scaling and shiftingcoefficients to the test data. It also supports repeated tuning of the penalty parameters and selectionof the model with greatest cross-validated likelihood.

Usage

opt.splitval(optFUN="opt1D",testset="equal",scaling=TRUE,...)

Arguments

optFUN "opt1D" for Lasso or Ridge regression, "opt2D" for Elastic Net. See the helppages for these functions for additional arguments.

testset For the opt.splitval function ONLY. "equal" for randomly assigned equal trainingand test sets, or an integer vector defining the positions of the test samples in theresponse, penalized, and unpenalized arguments which are passed to the optL1,optL2, or cvl functions of the penalized R package.

scaling If TRUE, each feature (column) of the training samples (in matrix/dataframespecified by the penalized argument) are scaled to z-scores, then these scalingand shifting factors are applied to the test data. If FALSE, no scaling is done.

... Additional arguments are required, to be passed to the optL1 or optL2 functionof the penalized R package. See those help pages, and it may be desirable totest these arguments directly on optL1 or optL2 before using this more CPU-consuming and complex function.

Details

This function does split sample model training and testing for a single split of the data, using theoptL1 or optL2 functions of the penalized R package, for each iteration of the cross-validation.Scaling of the test samples is done independently, using scale factors determined from the train-ing samples. Repeated starts of model training can be parallelized as documented in the opt1Dand opt2D functions. This function is used for nested cross-validation by the opt.nested.crossvalfunction.

Page 12: Package ‘pensim’ - The Comprehensive R Archive … · Package ‘pensim ’ February 20, 2015 ... status a numeric vector os a numeric vector Source Beer DG, Kardia SL, Huang

12 opt.splitval

Value

Returns a vector of cross-validated continuous risk score predictions.

Note

Depends on the R packages: penalized, parallel, rlecuyer

Author(s)

Levi Waldron et al.

References

Waldron L, Pintilie M, Tsao M-S, Shepherd FA, Huttenhower C*, Jurisica I*: Optimized applica-tion of penalized regression methods to diverse genomic data. Bioinformatics 2011, 27:3399-3406.(*equal contribution)

See Also

opt1D, opt2D, opt.nested.crossval

Examples

data(beer.exprs)data(beer.survival)

##select just 250 genes to speed computation:set.seed(1)beer.exprs.sample <- beer.exprs[sample(1:nrow(beer.exprs), 250), ]

gene.quant <- apply(beer.exprs.sample,1,quantile,probs=0.75)dat.filt <- beer.exprs.sample[gene.quant>log2(100),]gene.iqr <- apply(dat.filt,1,IQR)dat.filt <- as.matrix(dat.filt[gene.iqr>0.5,])dat.filt <- t(dat.filt)

library(survival)surv.obj <- Surv(beer.survival$os,beer.survival$status)

##Single split training/test evaluation. Ideally nsim would be 50 and##fold=10, but this requires 100x more resources.set.seed(1)preds50 <- opt.splitval(optFUN="opt1D",scaling=TRUE,testset="equal",

setpen="L1",nsim=1,nprocessors=1,response=surv.obj,penalized=dat.filt,fold=5,positive=FALSE,standardize=FALSE,trace=FALSE)

preds50.dichot <- preds50 > median(preds50)

surv.obj.50 <- surv.obj[match(names(preds50),rownames(beer.survival))]coxfit50.continuous <- coxph(surv.obj.50~preds50)

Page 13: Package ‘pensim’ - The Comprehensive R Archive … · Package ‘pensim ’ February 20, 2015 ... status a numeric vector os a numeric vector Source Beer DG, Kardia SL, Huang

opt1D 13

coxfit50.dichot <- coxph(surv.obj.50~preds50.dichot)summary(coxfit50.continuous)summary(coxfit50.dichot)

##clean-uprm(coxfit50.dichot,coxfit50.continuous,preds50,preds50.dichot,surv.obj,

surv.obj.50,dat.filt,gene.iqr,gene.quant)

opt1D Parallelized repeated tuning of Lasso or Ridge penalty parameter

Description

This function is a wrapper to the optL1 and optL2 functions of the penalized R package, useful forparallelized repeated tuning of the penalty parameters.

Usage

opt1D(nsim = 50, nprocessors = 1, setpen = "L1", cl = NULL, ...)

Arguments

nsim Number of times to repeat the simulation (around 50 is suggested)nprocessors An integer number of processors to use.setpen Either "L1" (Lasso) or "L2" (Ridge) penaltycl Optional cluster object created with the makeCluster() function of the parallel

package. If this is not set, pensim calls makeCluster(nprocessors, type="SOCK").Setting this parameter can enable parallelization in more diverse scenarios thanmulti-core desktops; see the documentation for the parallel package. Note that ifcl is user-defined, this function will not automatically run parallel::stopCluster()to shut down the cluster.

... arguments passed on to optL1 or optL2 function of the penalized R package

Details

This function sets up a SNOW (Simple Network of Workstations) "sock" cluster to parallelize thetask of repeated tunings the L1 or L2 penalty parameter. Tuning of the penalty parameters is doneby the optL1 or optL2 functions of the penalized R package.

Value

Returns a matrix with the following columns:

L1 (or L2) optimized value of the penalty parametercvl optimized cross-validated likelihoodcoef_1, coef_2, ..., coef_n

argmax coefficients for the model with this value of the tuning parameter

The matrix contains one row for each repeat of the regression.

Page 14: Package ‘pensim’ - The Comprehensive R Archive … · Package ‘pensim ’ February 20, 2015 ... status a numeric vector os a numeric vector Source Beer DG, Kardia SL, Huang

14 opt1D

Note

Depends on the R packages: penalized, parallel, rlecuyer

Author(s)

Levi Waldron et al.

References

Waldron L, Pintilie M, Tsao M-S, Shepherd FA, Huttenhower C*, Jurisica I*: Optimized applica-tion of penalized regression methods to diverse genomic data. Bioinformatics 2011, 27:3399-3406.(*equal contribution)

See Also

optL1, optL2

Examples

data(beer.exprs)data(beer.survival)

##select just 100 genes to speed computation:set.seed(1)beer.exprs.sample <- beer.exprs[sample(1:nrow(beer.exprs), 100), ]

gene.quant <- apply(beer.exprs.sample, 1, quantile, probs=0.75)dat.filt <- beer.exprs.sample[gene.quant>log2(100), ]gene.iqr <- apply(dat.filt, 1, IQR)dat.filt <- as.matrix(dat.filt[gene.iqr>0.5, ])dat.filt <- t(dat.filt)

##define training and test setsset.seed(1)trainingset <- sample(rownames(dat.filt), round(nrow(dat.filt)/2))testset <- rownames(dat.filt)[!rownames(dat.filt) %in% trainingset]

dat.training <- data.frame(dat.filt[trainingset, ])pheno.training <- beer.survival[trainingset, ]

library(survival)surv.training <- Surv(pheno.training$os, pheno.training$status)

dat.test <- data.frame(dat.filt[testset, ])all.equal(colnames(dat.training), colnames(dat.test))pheno.test <- beer.survival[testset, ]surv.test <- Surv(pheno.test$os, pheno.test$status)

##ideally nsim should be on the order of 50, but this slows computation##50x without parallelization.set.seed(1)system.time(output <- opt1D(nsim=1, nprocessors=1, setpen="L2", response=surv.training,

Page 15: Package ‘pensim’ - The Comprehensive R Archive … · Package ‘pensim ’ February 20, 2015 ... status a numeric vector os a numeric vector Source Beer DG, Kardia SL, Huang

opt1D 15

penalized=dat.training, fold=10, positive=FALSE, standardize=TRUE,minlambda2=0.2, maxlambda2=100))

##Try again using two processors:system.time(output <- opt1D(nsim=2, nprocessors=2, setpen="L2", response=surv.training,penalized=dat.training, fold=10, positive=FALSE, standardize=TRUE,minlambda2=0.2, maxlambda2=100))

cc <- output[which.max(output[, "cvl"]), -(1:2)] #coefficientssum(abs(cc)>0) #count non-zero coefficients

preds.training <- as.matrix(dat.training) %*% ccpreds.training.median <- median(preds.training)preds.training.dichot <- ifelse(preds.training > preds.training.median, "high risk", "low risk")preds.training.dichot <- factor(preds.training.dichot[, 1], levels=c("low risk", "high risk"))preds.test <- as.matrix(dat.test) %*% ccpreds.test.dichot <- ifelse(preds.test > preds.training.median, "high risk", "low risk")preds.test.dichot <- factor(preds.test.dichot[, 1], levels=c("low risk", "high risk"))

coxphfit.training <- coxph(surv.training~preds.training.dichot)survfit.training <- survfit(surv.training~preds.training.dichot)summary(coxphfit.training)coxphfit.test <- coxph(surv.test~preds.test.dichot)survfit.test <- survfit(surv.test~preds.test.dichot)summary(coxphfit.test)

(p.training <- signif(summary(coxphfit.training)$logtest[3], 2)) #likelihood ratio test(hr.training <- signif(summary(coxphfit.training)$conf.int[1], 2))(hr.lower.training <- summary(coxphfit.training)$conf.int[3])(hr.upper.training <- summary(coxphfit.training)$conf.int[4])par(mfrow=c(1, 2))plot(survfit.training,

col=c("black", "red"),conf.int=FALSE,xlab="Months",main="TRAINING",ylab="Overall survival")

xmax <- par("usr")[2]text(x=xmax, y=0.9, lab=paste("HR=", hr.training), pos=2)text(x=xmax, y=0.8, lab=paste("p=", p.training, "", sep=""), pos=2)tmp <- summary(preds.training.dichot)text(x=c(xmax, xmax), y=c(0.7, 0.6), lab=paste(tmp, names(tmp)), col=1:2, pos=2)(p.test <- signif(summary(coxphfit.test)$logtest[3], 2)) #likelihood ratio test(hr.test <- signif(summary(coxphfit.test)$conf.int[1], 2))(hr.lower.test <- summary(coxphfit.test)$conf.int[3])(hr.upper.test <- summary(coxphfit.test)$conf.int[4])plot(survfit.test,

col=c("black", "red"),conf.int=FALSE,xlab="Months",main="TEST")

xmax <- par("usr")[2]text(x=xmax, y=0.55, lab=paste("HR=", hr.test), pos=2)

Page 16: Package ‘pensim’ - The Comprehensive R Archive … · Package ‘pensim ’ February 20, 2015 ... status a numeric vector os a numeric vector Source Beer DG, Kardia SL, Huang

16 opt2D

text(x=xmax, y=0.45, lab=paste("p=", p.test, "", sep=""), pos=2)tmp <- summary(preds.test.dichot)text(x=c(xmax, xmax), y=c(0.3, 0.2), lab=paste(tmp, names(tmp)), col=1:2, pos=2)

opt2D Parallelized, two-dimensional tuning of Elastic Net L1/L2 penalties

Description

This function implements parallelized two-dimensional optimization of Elastic Net penalty param-eters. This is accomplished by scanning a regular grid of L1/L2 penalties, then using the top fiveCVL penalty combinations from this grid as starting points for the convex optimization problem.

Usage

opt2D(nsim,L1range = c(0.001, 100),L2range = c(0.001, 100),dofirst = "both",nprocessors = 1,L1gridsize = 10, L2gridsize = 10,cl = NULL,...)

Arguments

nsim Number of times to repeat the simulation (around 50 is suggested)

L1range numeric vector of length two, giving minimum and maximum constraints on theL1 penalty

L2range numeric vector of length two, giving minimum and maximum constraints on theL2 penalty

dofirst "L1" to optimize L1 followed by L2, "L2" to optimize L2 followed by L1, or"both" to optimize both simultaneously in a two-dimensional optimization.

nprocessors An integer number of processors to use.

L1gridsize Number of values of the L1 penalty in the regular grid of L1/L2 penalties

L2gridsize Number of values of the L2 penalty in the regular grid of L1/L2 penalties

cl Optional cluster object created with the makeCluster() function of the parallelpackage. If this is not set, pensim calls makeCluster(nprocessors, type="SOCK").Setting this parameter can enable parallelization in more diverse scenarios thanmulti-core desktops; see the documentation for the parallel package. Note that ifcl is user-defined, this function will not automatically run parallel::stopCluster()to shut down the cluster.

... arguments passed on to optL1 and optL2 (dofirst="L1" or "L2"), or cvl (dofirst="both")functions of the penalized R package

Page 17: Package ‘pensim’ - The Comprehensive R Archive … · Package ‘pensim ’ February 20, 2015 ... status a numeric vector os a numeric vector Source Beer DG, Kardia SL, Huang

opt2D 17

Details

This function sets up a SNOW (Simple Network of Workstations) "sock" cluster to parallelize thetask of repeated tunings the Elastic Net penalty parameters. Three methods are implemented, as de-scribed by Waldron et al. (2011): lambda1 followed by lambda2 (lambda1-lambda2), lambda2 fol-lowed by lambda1 (lambda2-lambda1), and lambda1 with lambda2 simultaneously (lambda1+lambda2).Tuning of the penalty parameters is done by the optL1 or optL2 functions of the penalized R pack-age.

Value

Returns a matrix with the following columns:

L1 optimized value of the L1 penalty parameter

L2 optimized value of the L2 penalty parameter

cvl optimized cross-validated likelihood

convergence 0 if the optimization converged, non-zero otherwise (see stats:optim for details)

fncalls number of calls to cvl function during optimizationcoef_1, coef_2, ..., coef_n

argmax coefficients for the model with this value of the tuning parameter

The matrix contains one row for each repeat of the regression.

Note

Depends on the R packages: penalized, parallel, rlecuyer

Author(s)

Levi Waldron et al.

References

Waldron L, Pintilie M, Tsao M-S, Shepherd FA, Huttenhower C*, Jurisica I*: Optimized applica-tion of penalized regression methods to diverse genomic data. Bioinformatics 2011, 27:3399-3406.(*equal contribution)

See Also

optL1, optL2, cvl

Examples

data(beer.exprs)data(beer.survival)

##select just 100 genes to speed computation:set.seed(1)beer.exprs.sample <- beer.exprs[sample(1:nrow(beer.exprs), 100), ]

Page 18: Package ‘pensim’ - The Comprehensive R Archive … · Package ‘pensim ’ February 20, 2015 ... status a numeric vector os a numeric vector Source Beer DG, Kardia SL, Huang

18 opt2D

##apply an unreasonably strict gene filter here to speed computation##time for the Elastic Net example.gene.quant <- apply(beer.exprs.sample, 1, quantile, probs=0.75)dat.filt <- beer.exprs.sample[gene.quant>log2(150), ]gene.iqr <- apply(dat.filt, 1, IQR)dat.filt <- as.matrix(dat.filt[gene.iqr>1, ])dat.filt <- t(dat.filt)

##define training and test setsset.seed(9)trainingset <- sample(rownames(dat.filt), round(nrow(dat.filt)/2))testset <- rownames(dat.filt)[!rownames(dat.filt)%in%trainingset]

dat.training <- data.frame(dat.filt[trainingset, ])pheno.training <- beer.survival[trainingset, ]

library(survival)surv.training <- Surv(pheno.training$os, pheno.training$status)

dat.test <- data.frame(dat.filt[testset, ])all.equal(colnames(dat.training), colnames(dat.test))pheno.test <- beer.survival[testset, ]surv.test <- Surv(pheno.test$os, pheno.test$status)

set.seed(1)##ideally set nsim=50, fold=10, but this takes 100x longer.system.time(output <- opt2D(nsim=1, L1range=c(0.1, 1),L2range=c(20, 1000), dofirst="both", nprocessors=1,response=surv.training, penalized=dat.training, fold=5, positive=FALSE, standardize=TRUE))

cc <- output[which.max(output[, "cvl"]), -1:-5]output[which.max(output[, "cvl"]), 1:5] #small L1, large L2sum(abs(cc)>0) #number of non-zero coefficients

preds.training <- as.matrix(dat.training) %*% ccpreds.training.median <- median(preds.training)preds.training.dichot <- ifelse(preds.training > preds.training.median, "high risk", "low risk")preds.training.dichot <- factor(preds.training.dichot[, 1], levels=c("low risk", "high risk"))preds.test <- as.matrix(dat.test) %*% ccpreds.test.dichot <- ifelse(preds.test > preds.training.median, "high risk", "low risk")preds.test.dichot <- factor(preds.test.dichot[, 1], levels=c("low risk", "high risk"))

coxphfit.training <- coxph(surv.training~preds.training.dichot)survfit.training <- survfit(surv.training~preds.training.dichot)summary(coxphfit.training)coxphfit.test <- coxph(surv.test~preds.test.dichot)survfit.test <- survfit(surv.test~preds.test.dichot)summary(coxphfit.test)

(p.training <- signif(summary(coxphfit.training)$logtest[3], 2)) #likelihood ratio test(hr.training <- signif(summary(coxphfit.training)$conf.int[1], 2))(hr.lower.training <- summary(coxphfit.training)$conf.int[3])

Page 19: Package ‘pensim’ - The Comprehensive R Archive … · Package ‘pensim ’ February 20, 2015 ... status a numeric vector os a numeric vector Source Beer DG, Kardia SL, Huang

scan.l1l2 19

(hr.upper.training <- summary(coxphfit.training)$conf.int[4])par(mfrow=c(1, 2))plot(survfit.training,

col=c("black", "red"),conf.int=FALSE,xlab="Months",main="TRAINING",ylab="Overall survival")

xmax <- par("usr")[2]-5text(x=xmax, y=0.35, lab=paste("HR=", hr.training), pos=2)text(x=xmax, y=0.3, lab=paste("p=", p.training, "", sep=""), pos=2)tmp <- summary(preds.training.dichot)text(x=c(xmax, xmax), y=c(0.25, 0.2), lab=paste(tmp, names(tmp)), col=1:2, pos=2)## now the test set.## in the test set, HR=1.7 is not significant - not surprising with the## overly strict non-specific pre-filter (IQR>1, 75th percentile > log2(150)(p.test <- signif(summary(coxphfit.test)$logtest[3], 2)) #likelihood ratio test(hr.test <- signif(summary(coxphfit.test)$conf.int[1], 2))(hr.lower.test <- summary(coxphfit.test)$conf.int[3])(hr.upper.test <- summary(coxphfit.test)$conf.int[4])plot(survfit.test,

col=c("black", "red"),conf.int=FALSE,xlab="Months",main="TEST")

xmax <- par("usr")[2]text(x=xmax, y=0.35, lab=paste("HR=", hr.test), pos=2)text(x=xmax, y=0.3, lab=paste("p=", p.test, "", sep=""), pos=2)tmp <- summary(preds.test.dichot)text(x=c(xmax, xmax), y=c(0.25, 0.2), lab=paste(tmp, names(tmp)), col=1:2, pos=2)

scan.l1l2 Function calculate cross-validated likelihood on a regular grid ofL1/L2 penalties

Description

This function generates a grid of values of L1/L2 penalties, then calculated cross-validated likeli-hood at each point on the grid. The grid can be regular (linear progression of the penalty values), orpolynomial (finer grid for small penalty values, and coarser grid for larger penalty values).

Usage

scan.l1l2(L1range = c(0.1, 100.1),L2range = c(0.1, 100.1),L1.ngrid = 50,L2.ngrid = 50,nprocessors = 1,polydegree = 1,

Page 20: Package ‘pensim’ - The Comprehensive R Archive … · Package ‘pensim ’ February 20, 2015 ... status a numeric vector os a numeric vector Source Beer DG, Kardia SL, Huang

20 scan.l1l2

cl = NULL,...)

Arguments

L1range numeric vector of length two, giving minimum and maximum constraints on theL1 penalty

L2range numeric vector of length two, giving minimum and maximum constraints on theL2 penalty

L1.ngrid Number of values of the L1 penalty in the regular grid of L1/L2 penalties

L2.ngrid Number of values of the L2 penalty in the regular grid of L1/L2 penalties

nprocessors An integer number of processors to use.

polydegree power of the polynomial on which the L1/L2 penalty values are fit. ie if polyde-gree=2, penalty values could be y=x^2, x=1,2,3,..., so y=1,4,9,...

cl Optional cluster object created with the makeCluster() function of the parallelpackage. If this is not set, pensim calls makeCluster(nprocessors, type="SOCK").Setting this parameter can enable parallelization in more diverse scenarios thanmulti-core desktops; see the documentation for the parallel package. Note that ifcl is user-defined, this function will not automatically run parallel::stopCluster()to shut down the cluster.

... arguments passed on to cvl function of the penalized R package

Details

This function sets up a SNOW (Simple Network of Workstations) "sock" cluster to parallelize thetask of scanning a grid of penalty values to search for suitable starting values for two-dimensionaloptimization of the Elastic Net.

Value

cvl matrix of cvl values along the grid

L1range range of L1 penalties to scan

L2range range of L2 penalties to scan

xlab A text string indicating the range of L1 penalties

ylab A text string giving the range of L2 penalties

zlab A text string giving the range of cvl values

note A note to the user that rows of cvl correspond to values of lambda1, columns tolambda2

Note

Depends on the R packages: penalized, parallel, rlecuyer

Author(s)

Levi Waldron et al.

Page 21: Package ‘pensim’ - The Comprehensive R Archive … · Package ‘pensim ’ February 20, 2015 ... status a numeric vector os a numeric vector Source Beer DG, Kardia SL, Huang

scan.l1l2 21

References

Waldron L, Pintilie M, Tsao M-S, Shepherd FA, Huttenhower C*, Jurisica I*: Optimized applica-tion of penalized regression methods to diverse genomic data. Bioinformatics 2011, 27:3399-3406.(*equal contribution)

See Also

cvl

Examples

data(beer.exprs)data(beer.survival)

##select just 250 genes to speed computation:set.seed(1)beer.exprs.sample <- beer.exprs[sample(1:nrow(beer.exprs), 250), ]

gene.quant <- apply(beer.exprs.sample, 1, quantile, probs=0.75)dat.filt <- beer.exprs.sample[gene.quant>log2(150), ]gene.iqr <- apply(dat.filt, 1, IQR)dat.filt <- as.matrix(dat.filt[gene.iqr>1, ])dat.filt <- t(dat.filt)

##define training and test setsset.seed(9)trainingset <- sample(rownames(dat.filt), round(nrow(dat.filt)/2))testset <- rownames(dat.filt)[!rownames(dat.filt)%in%trainingset]

dat.training <- data.frame(dat.filt[trainingset, ])pheno.training <- beer.survival[trainingset, ]

library(survival)surv.training <- Surv(pheno.training$os, pheno.training$status)

dat.test <- data.frame(dat.filt[testset, ])all.equal(colnames(dat.training), colnames(dat.test))pheno.test <- beer.survival[testset, ]surv.test <- Surv(pheno.test$os, pheno.test$status)

set.seed(9)system.time(output <- scan.l1l2(L1range=c(0.2, 3.2), L2range=c(2, 30),L1.ngrid=10, L2.ngrid=10, polydegree=1, nprocessors=1,response=surv.training, penalized=dat.training, fold=4, positive=FALSE, standardize=TRUE))

##Note that the cvl surface is not smooth because a different folding of##the data was used for each cvl calculationimage(x=seq(output$L1range[1], output$L1range[2], length.out=nrow(output$cvl)),y=seq(output$L2range[1], output$L2range[2], length.out=ncol(output$cvl)),z=output$cvl,xlab="lambda1",ylab="lambda2",

Page 22: Package ‘pensim’ - The Comprehensive R Archive … · Package ‘pensim ’ February 20, 2015 ... status a numeric vector os a numeric vector Source Beer DG, Kardia SL, Huang

22 scan.l1l2

main="red is higher cross-validated likelihood")

Page 23: Package ‘pensim’ - The Comprehensive R Archive … · Package ‘pensim ’ February 20, 2015 ... status a numeric vector os a numeric vector Source Beer DG, Kardia SL, Huang

Index

∗Topic datagencreate.data, 4pensim-package, 2

∗Topic datasetsbeer.exprs, 3beer.survival, 4

∗Topic multivariatepensim-package, 2

∗Topic packagepensim-package, 2

∗Topic regressionopt.nested.crossval, 7opt.splitval, 11opt1D, 13opt2D, 16pensim-package, 2scan.l1l2, 19

∗Topic survivalcreate.data, 4opt.nested.crossval, 7opt.splitval, 11opt1D, 13opt2D, 16pensim-package, 2scan.l1l2, 19

beer.exprs, 3beer.survival, 4

create.data, 4

opt.nested.crossval, 7opt.splitval, 11opt1D, 13opt2D, 16

pensim (pensim-package), 2pensim-package, 2

scan.l1l2, 19

23