Package ‘pensim’
February 20, 2015

Type Package
Title Simulation of high-dimensional data and parallelized repeated penalized regression
Version 1.2.9
Author L. Waldron, M. Pintilie, M.-S. Tsao, F. A. Shepherd, C. Huttenhower*, I. Jurisica* (*equal contribution)
Maintainer Levi Waldron <[email protected]>
Depends R (>= 2.10.0)
Imports parallel, penalized, MASS
Suggests survivalROC, survival
Description Simulation of continuous, correlated high-dimensional data with time to event or binary response, and parallelized functions for Lasso, Ridge, and Elastic Net penalized regression with repeated starts and two-dimensional tuning of the Elastic Net.
License GPL (>= 2)
LazyLoad yes
NeedsCompilation no
Repository CRAN
Date/Publication 2014-03-14 11:32:43

R topics documented:
pensim-package
beer.exprs
beer.survival
create.data
opt.nested.crossval
opt.splitval
opt1D
opt2D
scan.l1l2
Index
pensim-package Functions and data for simulation of high-dimensional data and parallelized repeated penalized regression
Description
Simulation of continuous, correlated high-dimensional data with time-to-event or binary response, and parallelized functions for Lasso, Ridge, and Elastic Net penalized regression model training and validation by split-sample or nested cross-validation. See the help page for opt.nested.crossval() for the most extensive usage examples.
Details
Model training and validation by Lasso, Ridge, and Elastic Net penalized regression. This package also contains a function for simulation of correlated high-dimensional data with binary or time-to-event response.
References
Waldron L, Pintilie M, Tsao M-S, Shepherd FA, Huttenhower C*, Jurisica I*: Optimized application of penalized regression methods to diverse genomic data. Bioinformatics 2011, 27:3399-3406. (*equal contribution)
See Also
penalized-package
Examples
set.seed(9)
##create some data, with one of a group of five correlated variables
##having an association with the binary outcome:
x <- create.data(nvars=c(10, 3), cors=c(0, 0.8),

##predictor data frame and binary response vector
pen.data <- x$data[, -match("outcome", colnames(x$data))]
response <- x$data[, match("outcome", colnames(x$data))]
## lasso regression. Note that epsilon=1e-2 is passed onto optL1, and
## reduces the precision of the tuning compared to the default 1e-10.
output <- opt1D(nsim=1, nprocessors=1, penalized=pen.data, response=response, epsilon=1e-2)
cc <- output[which.max(output[, "cvl"]), -1:-3] ##non-zero b.* are true positives
beer.exprs Lung adenocarcinoma microarray expression data of Beer et al. (2002)
Description
Lung adenocarcinomas were profiled by Beer et al. (2002) using Affymetrix hu6800 microarrays. The data here were normalized from raw .CEL files by RMAExpress (v0.3). The expression matrix contains expression data for 86 patients with 7,129 probe sets.
Usage
data(beer.exprs)
Format
A data frame with 7129 probe sets (rows) for 86 patients (columns)
Source
Beer DG, Kardia SL, Huang C, Giordano TJ, Levin AM, Misek DE, Lin L, Chen G, Gharib TG, Thomas DG, Lizyness ML, Kuick R, Hayasaka S, Taylor JM, Iannettoni MD, Orringer MB, Hanash S: Gene-expression profiles predict survival of patients with lung adenocarcinoma. Nat Med 2002, 8:816-824.
References
Irizarry, R.A., et al. (2003) Summaries of Affymetrix GeneChip probe level data, Nucl. Acids Res., 31, e15.
create.data Simulate correlated predictors with time-to-event or binary outcome
Description
This function creates multiple groups of predictor variables which may be correlated within each group, and binary or survival time (without censoring) response according to specified weights of the predictors.
Arguments
nvars integer vector giving the number of variables of each variable type. The number of variable types is equal to the length of this vector.
cors numeric vector of the same length as nvars, giving the population pairwise Pearson correlation within each group.
associations integer vector of the same length as nvars, giving the associations of each type with outcome
firstonly logical vector of the same length as nvars, specifying whether only the first variable of each type is associated with outcome (TRUE) or all variables of that type (FALSE)
nsamples an integer giving the number of observations
censoring "none" for no censoring, or a vector of length two c(a,b) for uniform U(a,b) censoring.
labelswapprob This provides an option to add uncertainty to binary outcomes by randomly switching labels with probability labelswapprob. The probability of a label being swapped is independent for each observation. The value is ignored if response is "timetoevent".
response either "timetoevent" or "binary"
basehaz baseline hazard, used for "timetoevent"
logisticintercept intercept which is added to X%*%Beta for "binary"
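The labelswapprob option can be pictured with a minimal base R sketch. The helper swap_labels is hypothetical and not part of pensim; it only illustrates flipping each binary label independently with a fixed probability.

```r
set.seed(1)

## Hypothetical helper (not part of pensim): flip each binary label
## independently with probability `prob`.
swap_labels <- function(outcome, prob) {
  flip <- runif(length(outcome)) < prob
  ifelse(flip, 1 - outcome, outcome)
}

y <- rbinom(1000, size = 1, prob = 0.5)
y.noisy <- swap_labels(y, prob = 0.1)
mean(y != y.noisy)  # observed fraction of swapped labels
```

With prob = 0 no label changes, so the noisy outcome equals the original one.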
Details
This function simulates "predictor" variables in one or more groups, which are standard normally distributed. The user can specify the population correlation within each variable group, the association of each variable group to outcome, and whether the first or all variables of that type should be associated with outcome. The simulated response variable can be time to event with an exponential distribution, or binary survival with a logistic distribution.
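The simulation scheme described here can be sketched in base R. This is an illustration under stated assumptions, not the code of create.data: it draws correlated predictors through a Cholesky factor (the package itself relies on MASS), and the argument names are reused only as plain variables.

```r
set.seed(1)
nsamples <- 500
nvars <- c(10, 3)        # two variable groups
cors <- c(0, 0.8)        # within-group pairwise correlation
associations <- c(0, 2)  # weight of each group on the outcome

## Block-diagonal covariance: unit variances, constant correlation within a group.
p <- sum(nvars)
group <- rep(seq_along(nvars), times = nvars)
Sigma <- diag(p)
for (g in seq_along(nvars)) {
  idx <- which(group == g)
  Sigma[idx, idx] <- cors[g]
}
diag(Sigma) <- 1

## Standard normal predictors with covariance Sigma, via the Cholesky factor.
X <- matrix(rnorm(nsamples * p), nsamples, p) %*% chol(Sigma)

## Only the first variable of each group carries the association (firstonly = TRUE).
beta <- numeric(p)
beta[match(seq_along(nvars), group)] <- associations

## Binary outcome from a logistic model with an intercept.
logisticintercept <- 0.5
prob <- plogis(logisticintercept + X %*% beta)
outcome <- rbinom(nsamples, size = 1, prob = prob)
```

In this sketch columns 11 to 13 form the correlated group, and only column 11 influences the outcome.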
Value
Returns a list with items:
summary a summary of the variable types produced
associations weights of each variable in computing the outcome
covariance covariance matrix used for generating potentially correlated random predictors
data dataframe containing the predictors and response. Response is the last column for binary outcome ("outcome"), and the last two columns for timetoevent outcome ("time" and "cens")
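For the time-to-event case, the following base R sketch shows how exponential event times and uniform U(a,b) censoring could produce "time" and "cens" columns. It is an assumption-laden illustration, not the package code: the linear predictor is a random stand-in, and coding cens as 1 for an observed event is assumed here rather than taken from the documentation.

```r
set.seed(2)
n <- 200
basehaz <- 0.2          # baseline hazard
lp <- rnorm(n)          # stand-in for the linear predictor X %*% Beta
censoring <- c(2, 10)   # uniform U(a, b) censoring window

## Exponential event times whose rate depends on the linear predictor.
event.time <- rexp(n, rate = basehaz * exp(lp))
cens.time <- runif(n, min = censoring[1], max = censoring[2])

sim <- data.frame(
  time = pmin(event.time, cens.time),
  cens = as.integer(event.time <= cens.time)  # assumption: 1 = event observed
)
table(sim$cens)
```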
Note
Depends on the MASS package for correlated random number generation
Author(s)
Levi Waldron et al.
References
Waldron L., Pintilie M., Tsao M.-S., Shepherd F. A., Huttenhower C.*, and Jurisica I.* Optimized application of penalized regression methods to diverse genomic data. (2010). Under review. (*equal contribution)
opt.nested.crossval Parallelized calculation of cross-validated risk score predictions from L1/L2/Elastic Net penalized regression.
Description
Calculates risk score predictions by a nested cross-validation, using the optL1 and optL2 functions of the penalized R package for regression. In the outer level of cross-validation, samples are split into training and test samples. Model parameters are tuned by cross-validation within training samples only.
By setting nprocessors > 1, the outer cross-validation is split between multiple processors.
The function supports z-score scaling of training data, and application of these scaling and shifting coefficients to the test data. It also supports repeated tuning of the penalty parameters and selection of the model with greatest cross-validated likelihood.
Arguments
outerfold number of folds in outer cross-validation (the level used for validation)
nprocessors An integer number of processors to use. If specified in opt.nested.crossval, iterations of the outer cross-validation are sent to different processors. If specified in opt.splitval, repeated starts for the penalty tuning are sent to different processors.
cl Optional cluster object created with the makeCluster() function of the parallel package. If this is not set, pensim calls makeCluster(nprocessors, type="SOCK"). Setting this parameter can enable parallelization in more diverse scenarios than multi-core desktops; see the documentation for the parallel package. Note that if cl is user-defined, this function will not automatically run parallel::stopCluster() to shut down the cluster.
... optFUN (either "opt1D" or "opt2D") and scaling (TRUE to z-score training data then apply the same shift and scale factors to test data, FALSE for no scaling) are passed onto the opt.splitval function. Additional arguments are required, to be passed to the optL1 or optL2 function of the penalized R package. See those help pages, and it may be desirable to test these arguments directly on optL1 or optL2 before using this more CPU-consuming and complex function.
Details
This function calculates cross-validated risk score predictions, tuning a penalized regression model using the optL1 or optL2 functions of the penalized R package, for each iteration of the cross-validation. Tuning is done by cross-validation in the training samples only. Test samples are scaled using the shift and scale factors determined from the training samples. If nprocessors > 1, it uses the SNOW package for parallelization, dividing the iterations of the outer cross-validation among the specified number of processors.
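The structure of a nested cross-validation can be sketched in base R as follows. This is not the pensim implementation: a closed-form ridge regression stands in for optL1/optL2, squared prediction error stands in for cross-validated likelihood, and the helpers ridge_fit and tune_lambda are hypothetical.

```r
set.seed(3)
n <- 60; p <- 5
X <- matrix(rnorm(n * p), n, p)
y <- X[, 1] + rnorm(n)

## Stand-in learner (assumption): closed-form ridge regression, no intercept.
ridge_fit <- function(X, y, lambda) {
  solve(crossprod(X) + lambda * diag(ncol(X)), crossprod(X, y))
}

## Inner loop: choose lambda by cross-validation within the training data only.
tune_lambda <- function(X, y, lambdas, fold = 5) {
  id <- sample(rep(seq_len(fold), length.out = nrow(X)))
  cv.err <- sapply(lambdas, function(lam) {
    mean(sapply(seq_len(fold), function(k) {
      b <- ridge_fit(X[id != k, , drop = FALSE], y[id != k], lam)
      mean((y[id == k] - X[id == k, , drop = FALSE] %*% b)^2)
    }))
  })
  lambdas[which.min(cv.err)]
}

## Outer loop: every sample is predicted by a model that never saw it.
outerfold <- 5
outer.id <- sample(rep(seq_len(outerfold), length.out = n))
preds <- numeric(n)
for (k in seq_len(outerfold)) {
  test <- outer.id == k
  lam <- tune_lambda(X[!test, ], y[!test], lambdas = c(0.1, 1, 10))
  b <- ridge_fit(X[!test, ], y[!test], lam)
  preds[test] <- X[test, ] %*% b
}
```

The vector preds plays the role of the cross-validated risk scores: one prediction per sample, each made without that sample in the training data.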
Some arguments MUST be passed (through the ... argument) but are documented for the functions in which they are used. These include, from the opt.splitval function:
optFUN="opt1D" for Lasso or Ridge regression, or "opt2D" for Elastic Net. See the help pages for opt1D and opt2D for additional arguments associated with these functions.
scaling=TRUE to scale each feature (column) of the training sample to z-scores. These same scaling and shifting factors are applied to the test data. If FALSE, no scaling is done. Note that only data in the penalized argument are scaled, not the optional unpenalized argument (see documentation for opt1D, opt2D, or cvl from the penalized package for descriptions of the penalized and unpenalized arguments). Alternatively, the standardize=TRUE argument to the penalized package functions can be used to do scaling internally.
nsim=50 this number specifies the number of times to repeat tuning of the penalty parameters on different data foldings for the cross-validation.
setpen="L1" or "L2": if optFUN="opt1D", this sets regression type to LASSO or Ridge, respectively. See ?opt1D.
L1range, L2range, dofirst, L1gridsize, L2gridsize: options for Elastic Net regression if optFUN="opt2D". See ?opt2D.
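The effect of scaling=TRUE described above can be illustrated with base R alone. The sketch assumes simple column-wise z-scores, with shift and scale factors estimated from the training samples and reused unchanged on the test samples; the object names are illustrative.

```r
set.seed(4)
train <- matrix(rnorm(40 * 3, mean = 5, sd = 2), 40, 3)
test  <- matrix(rnorm(10 * 3, mean = 5, sd = 2), 10, 3)

## Shift and scale factors are estimated from the training samples only.
shift <- colMeans(train)
scl <- apply(train, 2, sd)

train.z <- scale(train, center = shift, scale = scl)
test.z  <- scale(test,  center = shift, scale = scl)

colMeans(train.z)  # zero up to rounding error, by construction
colMeans(test.z)   # close to zero, but not exactly
```

Because the test samples never contribute to shift or scl, no information leaks from the test set into model training.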
Value
Returns a vector of cross-validated continuous risk score predictions.
Note
Depends on the R packages: penalized, parallel
Author(s)
Levi Waldron et al.
References
Waldron L, Pintilie M, Tsao M-S, Shepherd FA, Huttenhower C*, Jurisica I*: Optimized application of penalized regression methods to diverse genomic data. Bioinformatics 2011, 27:3399-3406. (*equal contribution)
See Also
opt.splitval
Examples
data(beer.exprs)
data(beer.survival)

##select just 100 genes to speed computation:
set.seed(1)
beer.exprs.sample <- beer.exprs[sample(1:nrow(beer.exprs), 100), ]

## First, test the regression arguments using functions from
## the penalized package. I use maxlambda1=5 here to ensure at least
## one non-zero coefficient.
testfit <- penalized::optL1(response=surv.obj,

## Now pass these arguments to opt.nested.crossval() for cross-validated
## calculation and assessment of risk scores, with the additional
## arguments:
## outerfold and nprocessors (?opt.nested.crossval)
## optFUN and scaling (?opt.splitval)
## setpen and nsim (?opt1D)

## Ideally nsim would be 50, and outerfold and fold would be 10, but the
## values below speed computation 200x compared to these recommended
## values. Note that here we are using the standardize=TRUE argument of
## optL1 rather than the scaling=TRUE argument of opt.splitval. These
## two approaches to scaling are roughly equivalent, but the scaling
## approaches are not the same (scaling=TRUE does z-score,
## standardize=TRUE scales to unit central L2 norm), and results will
## not be identical. Also, using standardize=TRUE scales variables but
## provides coefficients for the original scale, whereas using
## scaling=TRUE scales variables in the training set then applies the
## same scales to the test set.
set.seed(1)
## In this example I use two processors:
preds <- opt.nested.crossval(outerfold=5, nprocessors=2, #opt.nested.crossval arguments

## We can also include unpenalized covariates, if desired.
## Note that when keeping only one variable for a penalized or
## unpenalized covariate, indexing a dataframe like [1] instead of doing
## [,1] preserves the variable name. With [,1] the variable name gets
## converted to "".

beer.coefs <- opt1D(setpen="L1", nsim=1,
                    maxlambda1=5,
                    response=surv.obj,
                    penalized=dat.filt[-1], # This is equivalent to dat.filt[, -1]
                    unpenalized=dat.filt[1],
                    fold=5,
                    positive=FALSE,
                    standardize=TRUE,
                    trace=FALSE)

## (note the non-zero first coefficient this time, due to it being unpenalized).

## Summarization and plotting.
preds.dichot <- preds > median(preds)
opt.splitval Parallelized calculation of split training/test set predictions from L1/L2/Elastic Net penalized regression.
Description
Uses a single training/test split to train a penalized regression model in the training samples, then uses the model to calculate values of the linear risk score in the test samples. This function is used by opt.nested.crossval, but can also be used on its own.
This function supports z-score scaling of training data, and application of these scaling and shifting coefficients to the test data. It also supports repeated tuning of the penalty parameters and selection of the model with greatest cross-validated likelihood.
Arguments
optFUN "opt1D" for Lasso or Ridge regression, "opt2D" for Elastic Net. See the help pages for these functions for additional arguments.
testset For the opt.splitval function ONLY. "equal" for randomly assigned equal training and test sets, or an integer vector defining the positions of the test samples in the response, penalized, and unpenalized arguments which are passed to the optL1, optL2, or cvl functions of the penalized R package.
scaling If TRUE, each feature (column) of the training samples (in the matrix/dataframe specified by the penalized argument) is scaled to z-scores, then these scaling and shifting factors are applied to the test data. If FALSE, no scaling is done.
... Additional arguments are required, to be passed to the optL1 or optL2 function of the penalized R package. See those help pages, and it may be desirable to test these arguments directly on optL1 or optL2 before using this more CPU-consuming and complex function.
Details
This function does split sample model training and testing for a single split of the data, using the optL1 or optL2 functions of the penalized R package, for each iteration of the cross-validation. Scaling of the test samples is done independently, using scale factors determined from the training samples. Repeated starts of model training can be parallelized as documented in the opt1D and opt2D functions. This function is used for nested cross-validation by the opt.nested.crossval function.
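The two forms of the testset argument can be illustrated with a small base R sketch. The helper resolve_testset is hypothetical and not part of pensim; in particular, rounding down for an odd number of samples is an assumption made here.

```r
set.seed(5)
n <- 11

## Hypothetical helper: turn a testset specification into row positions.
resolve_testset <- function(testset, n) {
  if (identical(testset, "equal")) {
    sort(sample(seq_len(n), size = floor(n / 2)))  # random half of the samples
  } else {
    as.integer(testset)                            # user-supplied positions
  }
}

test.idx <- resolve_testset("equal", n)
train.idx <- setdiff(seq_len(n), test.idx)
```

Every sample ends up in exactly one of train.idx and test.idx, which is what a single training/test split requires.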
Value
Returns a vector of cross-validated continuous risk score predictions.
Note
Depends on the R packages: penalized, parallel, rlecuyer
Author(s)
Levi Waldron et al.
References
Waldron L, Pintilie M, Tsao M-S, Shepherd FA, Huttenhower C*, Jurisica I*: Optimized application of penalized regression methods to diverse genomic data. Bioinformatics 2011, 27:3399-3406. (*equal contribution)
See Also
opt1D, opt2D, opt.nested.crossval
Examples
data(beer.exprs)
data(beer.survival)

##select just 250 genes to speed computation:
set.seed(1)
beer.exprs.sample <- beer.exprs[sample(1:nrow(beer.exprs), 250), ]

##Single split training/test evaluation. Ideally nsim would be 50 and
##fold=10, but this requires 100x more resources.
set.seed(1)
preds50 <- opt.splitval(optFUN="opt1D",scaling=TRUE,testset="equal",
opt1D Parallelized repeated tuning of Lasso or Ridge penalty parameter
Description
This function is a wrapper to the optL1 and optL2 functions of the penalized R package, useful for parallelized repeated tuning of the penalty parameters.
Arguments
nsim Number of times to repeat the simulation (around 50 is suggested)
nprocessors An integer number of processors to use.
setpen Either "L1" (Lasso) or "L2" (Ridge) penalty
cl Optional cluster object created with the makeCluster() function of the parallel package. If this is not set, pensim calls makeCluster(nprocessors, type="SOCK"). Setting this parameter can enable parallelization in more diverse scenarios than multi-core desktops; see the documentation for the parallel package. Note that if cl is user-defined, this function will not automatically run parallel::stopCluster() to shut down the cluster.
... arguments passed on to optL1 or optL2 function of the penalized R package
Details
This function sets up a SNOW (Simple Network of Workstations) "sock" cluster to parallelize the task of repeated tuning of the L1 or L2 penalty parameter. Tuning of the penalty parameters is done by the optL1 or optL2 functions of the penalized R package.
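The pattern of distributing repeated tunings over a socket cluster can be sketched with the parallel package alone. This is an illustration, not pensim's code: one_tuning is a hypothetical stand-in for a single optL1/optL2 run, and its return values are arbitrary.

```r
library(parallel)

nsim <- 4
nprocessors <- 2

## Stand-in for one tuning run (assumption): returns a penalty and a score.
one_tuning <- function(i) {
  lambda <- runif(1, 0.1, 10)
  c(L1 = lambda, cvl = -(log(lambda) - 1)^2)
}

cl <- makeCluster(nprocessors, type = "PSOCK")
clusterSetRNGStream(cl, iseed = 1)  # reproducible, independent streams per worker
runs <- parLapply(cl, seq_len(nsim), one_tuning)
stopCluster(cl)                     # a user-created cluster is shut down by the user

## One row per repeat, as in the matrix returned by the tuning functions.
output <- do.call(rbind, runs)
output
```

Creating the cluster outside the tuning function, as done here, corresponds to supplying the cl argument; in that case shutting the cluster down remains the caller's responsibility.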
Value
Returns a matrix with the following columns:
L1 (or L2) optimized value of the penalty parameter
cvl optimized cross-validated likelihood
coef_1, coef_2, ..., coef_n argmax coefficients for the model with this value of the tuning parameter
The matrix contains one row for each repeat of the regression.
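Selecting the best repeat from such a matrix needs only base R. The matrix below is a toy example with made-up values, and its column names follow the layout listed above; names and leading columns in a real result may differ.

```r
## Toy result with the documented column layout (values are made up).
output <- rbind(
  c(L1 = 2.0, cvl = -120.5, coef_1 = 0.0, coef_2 = 0.31, coef_3 = 0.0),
  c(L1 = 3.5, cvl = -118.2, coef_1 = 0.0, coef_2 = 0.27, coef_3 = -0.1),
  c(L1 = 1.2, cvl = -121.9, coef_1 = 0.4, coef_2 = 0.35, coef_3 = 0.0)
)

best <- which.max(output[, "cvl"])  # repeat with greatest cross-validated likelihood
cc <- output[best, -(1:2)]          # drop the penalty and cvl columns
sum(abs(cc) > 0)                    # number of non-zero coefficients
```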
Note
Depends on the R packages: penalized, parallel, rlecuyer
Author(s)
Levi Waldron et al.
References
Waldron L, Pintilie M, Tsao M-S, Shepherd FA, Huttenhower C*, Jurisica I*: Optimized application of penalized regression methods to diverse genomic data. Bioinformatics 2011, 27:3399-3406. (*equal contribution)
See Also
optL1, optL2
Examples
data(beer.exprs)
data(beer.survival)

##select just 100 genes to speed computation:
set.seed(1)
beer.exprs.sample <- beer.exprs[sample(1:nrow(beer.exprs), 100), ]

##define training and test sets
set.seed(1)
trainingset <- sample(rownames(dat.filt), round(nrow(dat.filt)/2))
testset <- rownames(dat.filt)[!rownames(dat.filt) %in% trainingset]

##ideally nsim should be on the order of 50, but this slows computation
##50x without parallelization.
set.seed(1)
system.time(output <- opt1D(nsim=1, nprocessors=1, setpen="L2", response=surv.training,

##Try again using two processors:
system.time(output <- opt1D(nsim=2, nprocessors=2, setpen="L2", response=surv.training,
                            penalized=dat.training, fold=10, positive=FALSE, standardize=TRUE,
                            minlambda2=0.2, maxlambda2=100))

cc <- output[which.max(output[, "cvl"]), -(1:2)] #coefficients
sum(abs(cc)>0) #count non-zero coefficients
opt2D Parallelized, two-dimensional tuning of Elastic Net L1/L2 penalties
Description
This function implements parallelized two-dimensional optimization of Elastic Net penalty parameters. This is accomplished by scanning a regular grid of L1/L2 penalties, then using the top five CVL penalty combinations from this grid as starting points for the convex optimization problem.
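The grid-then-optimize strategy can be sketched with base R. This is an illustration under stated assumptions rather than the package code: toy_cvl is a made-up smooth surface standing in for cross-validated likelihood, and the choice of the L-BFGS-B method of optim is an assumption.

```r
## Toy stand-in for cross-validated likelihood (assumption): a smooth surface
## with its maximum at lambda1 = 2, lambda2 = 50.
toy_cvl <- function(par) -((par[1] - 2)^2 + ((par[2] - 50) / 10)^2)

L1range <- c(0.1, 5)
L2range <- c(1, 100)
grid <- expand.grid(L1 = seq(L1range[1], L1range[2], length.out = 6),
                    L2 = seq(L2range[1], L2range[2], length.out = 6))
grid$cvl <- apply(grid[, c("L1", "L2")], 1, toy_cvl)

## Use the five best grid points as starting values for a bounded optimizer.
starts <- grid[order(grid$cvl, decreasing = TRUE)[1:5], ]
fits <- lapply(seq_len(nrow(starts)), function(i) {
  optim(par = unlist(starts[i, c("L1", "L2")]), fn = toy_cvl,
        method = "L-BFGS-B",
        lower = c(L1range[1], L2range[1]),
        upper = c(L1range[2], L2range[2]),
        control = list(fnscale = -1))  # fnscale = -1 makes optim maximize
})
best <- fits[[which.max(sapply(fits, `[[`, "value"))]]
best$par
```

Starting from several good grid points guards against an optimizer stalling in a poor region when the real likelihood surface is noisy.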
Arguments
nsim Number of times to repeat the simulation (around 50 is suggested)
L1range numeric vector of length two, giving minimum and maximum constraints on the L1 penalty
L2range numeric vector of length two, giving minimum and maximum constraints on the L2 penalty
dofirst "L1" to optimize L1 followed by L2, "L2" to optimize L2 followed by L1, or "both" to optimize both simultaneously in a two-dimensional optimization.
nprocessors An integer number of processors to use.
L1gridsize Number of values of the L1 penalty in the regular grid of L1/L2 penalties
L2gridsize Number of values of the L2 penalty in the regular grid of L1/L2 penalties
cl Optional cluster object created with the makeCluster() function of the parallel package. If this is not set, pensim calls makeCluster(nprocessors, type="SOCK"). Setting this parameter can enable parallelization in more diverse scenarios than multi-core desktops; see the documentation for the parallel package. Note that if cl is user-defined, this function will not automatically run parallel::stopCluster() to shut down the cluster.
... arguments passed on to optL1 and optL2 (dofirst="L1" or "L2"), or cvl (dofirst="both") functions of the penalized R package
Details
This function sets up a SNOW (Simple Network of Workstations) "sock" cluster to parallelize the task of repeated tuning of the Elastic Net penalty parameters. Three methods are implemented, as described by Waldron et al. (2011): lambda1 followed by lambda2 (lambda1-lambda2), lambda2 followed by lambda1 (lambda2-lambda1), and lambda1 with lambda2 simultaneously (lambda1+lambda2). Tuning of the penalty parameters is done by the optL1 or optL2 functions of the penalized R package.
Value
Returns a matrix with the following columns:
L1 optimized value of the L1 penalty parameter
L2 optimized value of the L2 penalty parameter
cvl optimized cross-validated likelihood
convergence 0 if the optimization converged, non-zero otherwise (see stats::optim for details)
fncalls number of calls to cvl function during optimization
coef_1, coef_2, ..., coef_n argmax coefficients for the model with this value of the tuning parameter
The matrix contains one row for each repeat of the regression.
Note
Depends on the R packages: penalized, parallel, rlecuyer
Author(s)
Levi Waldron et al.
References
Waldron L, Pintilie M, Tsao M-S, Shepherd FA, Huttenhower C*, Jurisica I*: Optimized application of penalized regression methods to diverse genomic data. Bioinformatics 2011, 27:3399-3406. (*equal contribution)
See Also
optL1, optL2, cvl
Examples
data(beer.exprs)
data(beer.survival)

##select just 100 genes to speed computation:
set.seed(1)
beer.exprs.sample <- beer.exprs[sample(1:nrow(beer.exprs), 100), ]
##apply an unreasonably strict gene filter here to speed computation
##time for the Elastic Net example.
gene.quant <- apply(beer.exprs.sample, 1, quantile, probs=0.75)
dat.filt <- beer.exprs.sample[gene.quant>log2(150), ]
gene.iqr <- apply(dat.filt, 1, IQR)
dat.filt <- as.matrix(dat.filt[gene.iqr>1, ])
dat.filt <- t(dat.filt)

##define training and test sets
set.seed(9)
trainingset <- sample(rownames(dat.filt), round(nrow(dat.filt)/2))
testset <- rownames(dat.filt)[!rownames(dat.filt)%in%trainingset]

set.seed(1)
##ideally set nsim=50, fold=10, but this takes 100x longer.
system.time(output <- opt2D(nsim=1, L1range=c(0.1, 1),
                            L2range=c(20, 1000), dofirst="both", nprocessors=1,
                            response=surv.training, penalized=dat.training,
                            fold=5, positive=FALSE, standardize=TRUE))

cc <- output[which.max(output[, "cvl"]), -1:-5]
output[which.max(output[, "cvl"]), 1:5] #small L1, large L2
sum(abs(cc)>0) #number of non-zero coefficients
scan.l1l2 Function to calculate cross-validated likelihood on a regular grid of L1/L2 penalties
Description
This function generates a grid of values of L1/L2 penalties, then calculates cross-validated likelihood at each point on the grid. The grid can be regular (linear progression of the penalty values), or polynomial (finer grid for small penalty values, and coarser grid for larger penalty values).
Arguments
L1range numeric vector of length two, giving minimum and maximum constraints on the L1 penalty
L2range numeric vector of length two, giving minimum and maximum constraints on the L2 penalty
L1.ngrid Number of values of the L1 penalty in the regular grid of L1/L2 penalties
L2.ngrid Number of values of the L2 penalty in the regular grid of L1/L2 penalties
nprocessors An integer number of processors to use.
polydegree power of the polynomial on which the L1/L2 penalty values are fit, i.e. if polydegree=2, penalty values could be y=x^2, x=1,2,3,..., so y=1,4,9,...
cl Optional cluster object created with the makeCluster() function of the parallel package. If this is not set, pensim calls makeCluster(nprocessors, type="SOCK"). Setting this parameter can enable parallelization in more diverse scenarios than multi-core desktops; see the documentation for the parallel package. Note that if cl is user-defined, this function will not automatically run parallel::stopCluster() to shut down the cluster.
... arguments passed on to cvl function of the penalized R package
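A polynomial grid of the kind described for polydegree can be sketched as follows. The helper poly_grid is hypothetical, and how pensim maps the range and grid size onto penalty values is assumed here: values are spaced equally on the scale x^(1/polydegree) and then raised back to the power polydegree.

```r
## Hypothetical helper (an assumption, not pensim's code): ngrid penalty values
## between rng[1] and rng[2], equally spaced on the scale x^(1/polydegree).
poly_grid <- function(rng, ngrid, polydegree = 1) {
  seq(rng[1]^(1 / polydegree), rng[2]^(1 / polydegree),
      length.out = ngrid)^polydegree
}

linear.grid <- poly_grid(c(1, 100), ngrid = 10, polydegree = 1)  # regular grid
square.grid <- poly_grid(c(1, 100), ngrid = 10, polydegree = 2)  # squares of 1..10
diff(square.grid)  # spacing grows with the penalty value
```

A larger polydegree concentrates grid points at small penalties, where the cross-validated likelihood typically changes fastest.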
Details
This function sets up a SNOW (Simple Network of Workstations) "sock" cluster to parallelize the task of scanning a grid of penalty values to search for suitable starting values for two-dimensional optimization of the Elastic Net.
Value
cvl matrix of cvl values along the grid
L1range range of L1 penalties to scan
L2range range of L2 penalties to scan
xlab A text string indicating the range of L1 penalties
ylab A text string giving the range of L2 penalties
zlab A text string giving the range of cvl values
note A note to the user that rows of cvl correspond to values of lambda1, columns to lambda2
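Reading the best grid point out of such a result can be sketched in base R. The list below is a toy object with made-up values that mimics the documented elements cvl, L1range and L2range, and a regular (linear) grid is assumed when recovering the penalty values.

```r
## Toy scan result with the documented list elements (values are made up).
output <- list(
  cvl = matrix(c(-130, -125, -128,
                 -127, -121, -124,
                 -129, -123, -126), nrow = 3, byrow = TRUE),
  L1range = c(0.1, 10),
  L2range = c(1, 100)
)

## Rows index lambda1 and columns index lambda2.
lambda1 <- seq(output$L1range[1], output$L1range[2], length.out = nrow(output$cvl))
lambda2 <- seq(output$L2range[1], output$L2range[2], length.out = ncol(output$cvl))

best <- which(output$cvl == max(output$cvl), arr.ind = TRUE)
start <- c(lambda1 = lambda1[best[1, "row"]], lambda2 = lambda2[best[1, "col"]])
start
```

The pair in start is the kind of value that could serve as a starting point for a subsequent two-dimensional optimization.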
Note
Depends on the R packages: penalized, parallel, rlecuyer
Author(s)
Levi Waldron et al.
References
Waldron L, Pintilie M, Tsao M-S, Shepherd FA, Huttenhower C*, Jurisica I*: Optimized application of penalized regression methods to diverse genomic data. Bioinformatics 2011, 27:3399-3406. (*equal contribution)
See Also
cvl
Examples
data(beer.exprs)
data(beer.survival)

##select just 250 genes to speed computation:
set.seed(1)
beer.exprs.sample <- beer.exprs[sample(1:nrow(beer.exprs), 250), ]

##define training and test sets
set.seed(9)
trainingset <- sample(rownames(dat.filt), round(nrow(dat.filt)/2))
testset <- rownames(dat.filt)[!rownames(dat.filt)%in%trainingset]

##Note that the cvl surface is not smooth because a different folding of
##the data was used for each cvl calculation
image(x=seq(output$L1range[1], output$L1range[2], length.out=nrow(output$cvl)),
      y=seq(output$L2range[1], output$L2range[2], length.out=ncol(output$cvl)),
      z=output$cvl,
      xlab="lambda1",
      ylab="lambda2",