CMA - A comprehensive Bioconductor package for supervised classification with high dimensional data

M. Slawski 1, M. Daumer 1 and A.-L. Boulesteix *1,2

1 Sylvia Lawry Centre for Multiple Sclerosis Research, Hohenlindenerstr. 1, D-81677 Munich, Germany
2 Department of Statistics, University of Munich, Ludwigstr. 33, D-80539 Munich, Germany

Email: M. Slawski - [email protected]; M. Daumer - [email protected]; A.-L. Boulesteix * - [email protected]

* Corresponding author

Abstract

Background: For the last eight years, microarray-based classification has been a major topic in statistics, bioinformatics and biomedicine research. Traditional methods often yield unsatisfactory results or may even be inapplicable in the so-called "p ≫ n" setting, where the number of predictors p by far exceeds the number of observations n, hence the term "ill-posed problem". Careful model selection and evaluation satisfying accepted good-practice standards is a very complex task for statisticians without experience in this area or for scientists with limited statistical background. The multiplicity of available methods for class prediction based on high-dimensional data is an additional practical challenge for inexperienced researchers.

Results: In this article, we introduce a new Bioconductor package called CMA (standing for "Classification for MicroArrays") for automatically performing variable selection, parameter tuning, classifier construction, and unbiased evaluation of the constructed classifiers using a large number of usual methods. Without much time and effort, users are provided with an overview of the unbiased accuracy of most top-performing classifiers. Furthermore, the standardized evaluation framework underlying CMA can also be beneficial in statistical research for comparison purposes, for instance if a new classifier has to be compared to existing approaches.
Conclusions: CMA is a user-friendly comprehensive package for classifier construction and evaluation implementing most usual approaches. It is freely available from the Bioconductor website at http://bioconductor.org/packages/2.3/bioc/html/CMA.html.
The idea of an R interface for the integration of microarray-based classification methods is not new. The
CMA package shows similarities to the Bioconductor package ’MLInterfaces’ standing for “An interface to
various machine learning methods” [12], see also the Bioconductor textbook [13] for a presentation of an
older version. The MLInterfaces package includes numerous facilities such as the unified MLearn interface,
the flexible learnerSchema design enabling the introduction of new procedures on the fly, and the
xvalSpec interface that allows arbitrary types of resampling and cross-validation to be employed. MLearn
also returns the native R object from the learner for further interrogation. The package architecture of
MLInterfaces is similar to the CMA structure in the sense that wrapper functions are used to call
classification methods from other packages.
However, CMA includes additional predefined features as far as variable selection, hyperparameter tuning,
classifier evaluation and comparison are concerned. While the method xval is flexible for experienced
users, it provides only cross-validation (including leave-one-out) as a predefined option. As the CMA package
also addresses inexperienced users, it includes the most common validation schemes in a standardized
manner. In the current version of MLInterfaces, variable selection can also be carried out separately for
each different learning set, but it does not seem to be a standard procedure. In the examples presented in
the Bioconductor textbook [13], variable selection is only performed once using the complete sample. In
contrast, CMA performs variable selection separately for each learning set by default. Further, CMA
includes additional features for hyperparameter tuning, thus allowing an objective comparison of different
class prediction methods. If tuning is ignored, simpler methods without (or with few) tuning parameters
tend to perform seemingly better than more complex algorithms. CMA also implements additional
measures of prediction accuracy and user-friendly visualization tools.
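The tuning logic described above can be made concrete with a small sketch. This is illustrative Python, not CMA code: a toy k-nearest-neighbour rule stands in for an arbitrary classifier, all data are simulated, and the candidate grid is made up. The point is that the hyperparameter is chosen by an inner cross-validation run only on each learning set, so the outer test folds never influence model selection.

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulated toy data: n = 60 observations, p = 5 predictors, two classes.
n, p = 60, 5
y = rng.integers(0, 2, size=n)
X = rng.normal(size=(n, p)) + y[:, None]

def knn_error(X_tr, y_tr, X_te, y_te, k):
    """Misclassification rate of a plain k-nearest-neighbour rule."""
    errors = 0
    for x, target in zip(X_te, y_te):
        nearest = np.argsort(np.linalg.norm(X_tr - x, axis=1))[:k]
        pred = np.bincount(y_tr[nearest]).argmax()
        errors += int(pred != target)
    return errors / len(y_te)

candidate_k = [1, 3, 5, 7]                            # hypothetical tuning grid
outer_folds = np.array_split(rng.permutation(n), 5)
outer_errors = []
for i, test_idx in enumerate(outer_folds):            # outer loop: evaluation only
    learn_idx = np.concatenate([f for j, f in enumerate(outer_folds) if j != i])
    inner_folds = np.array_split(rng.permutation(learn_idx), 3)

    def inner_cv_error(k):                            # inner loop: tuning only
        errs = []
        for l in range(3):
            tr = np.concatenate([g for m, g in enumerate(inner_folds) if m != l])
            errs.append(knn_error(X[tr], y[tr], X[inner_folds[l]], y[inner_folds[l]], k))
        return np.mean(errs)

    k_opt = min(candidate_k, key=inner_cv_error)      # tuned on the learning set only
    outer_errors.append(knn_error(X[learn_idx], y[learn_idx],
                                  X[test_idx], y[test_idx], k_opt))

print(round(float(np.mean(outer_errors)), 3))
```

Because each k_opt is selected without looking at the corresponding outer test fold, the averaged outer error remains an honest performance estimate even after tuning.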
The package 'MCRestimate' [14,15] emphasizes aspects very similar to those of CMA, focusing on the
estimation of misclassification rates and on cross-validation for model selection and evaluation. It is (to our
knowledge) the only Bioconductor package besides ours supporting hyperparameter tuning, and its workflow
is fully compatible with good-practice standards. The advantages of CMA over MCRestimate are
summarized below. CMA includes many more classifiers (21 in the current version), which allows a
comfortable extensive comparison without much effort. In particular, it provides an interface to recent
machine learning methods, including two highly competitive boosting methods (tree-based and
componentwise boosting). CMA also allows users to pass arguments to the classifier, which may be useful in
some cases, for instance to reduce the number of trees in a random forest for computational reasons.
Furthermore, all the methods included in CMA support multi-class response variables, even the methods
based on logistic regression (which can only be applied to binary response variables in MCRestimate).
A very wide range of variable selection methods is available in CMA, e.g. fast implementations of
important univariate test statistics, including typical multi-class approaches (Kruskal-Wallis test/F-test).
Moreover, CMA offers the possibility of constructing classifiers in a hybrid way: variable selection can be
performed via the lasso and the selected variables subsequently plugged into another algorithm. In addition to cross-validation,
evaluation can be performed based on several commonly used schemes such as the bootstrap (and the
associated '0.632' or '0.632+' estimators) or repeated subsampling. The definition of the learning sets can
also be customized, which may be an advantage when, e.g., one wants to evaluate a classifier based on a
single learning/test split, as is usual in the context of validation. CMA also includes additional accuracy
measures which are commonly used in medical research. User-friendly visualization tools are provided for
both statisticians and practitioners. When several classifiers are run, the compare function
produces ready-to-use tables listing different performance measures for several classifiers.
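As a pointer for the bootstrap-based schemes mentioned above: the '0.632' estimator combines the downward-biased resubstitution error with the upward-biased bootstrap out-of-bag error using fixed weights [42]. A minimal sketch (illustrative Python with made-up error values):

```python
def err_632(err_resubstitution, err_out_of_bag):
    """Efron's .632 estimator: a fixed-weight combination of the optimistic
    resubstitution error and the pessimistic bootstrap out-of-bag error."""
    return 0.368 * err_resubstitution + 0.632 * err_out_of_bag

# Hypothetical values: resubstitution error 5%, out-of-bag error 25%.
print(err_632(0.05, 0.25))   # about 0.1764
```

The '0.632+' variant [42] additionally adapts the weights to the degree of overfitting; its extra terms are omitted here.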
From the technical point of view, an additional advance is that CMA’s implementation is fully organized in
S4 classes, which bears advantages for both experienced users (who may easily incorporate their own
functions) and inexperienced users (who have access to convenient visualization tools without entering
much code). As a consequence, CMA has a clear inherent modular structure. The use of S4 classes is
highly beneficial when adding new features, because it requires at most changes in one ’building block’.
Furthermore, S4 classes offer the advantage of specifying the input data in manifold ways, depending on
the user’s needs. For example, the CMA users can enter their gene expression data as matrices, data
frames combined with formulae, or ExpressionSets.
Overview of class prediction with high-dimensional data and notation

Settings and Notation
The classification problem can be briefly outlined as follows. We have a predictor space X, here X ⊆ R^p
(for instance, the predictors may be gene expression levels, but the scope of CMA is not limited to this
case). The finite set of class labels is denoted as Y = {0, . . . , K − 1}, with K standing for the total number
of classes, and P(x, y) denotes the joint probability distribution on X × Y. We are given a finite sample
S = {(x_1, y_1), . . . , (x_n, y_n)} of n predictor-class pairs. The considered task is to construct a decision
function

f̂ : X → Y, x ↦ f̂(x)

such that the generalization error

R[f̂] = E_P[L(y, f̂(x))] = ∫_{X×Y} L(y, f̂(x)) dP(x, y)    (1)

is minimized, where L(·, ·) is a suitable loss function, usually taken to be the indicator loss (L(u, v) = 1 if
u ≠ v, L(u, v) = 0 otherwise). Other loss functions and performance measures are discussed extensively in
section 3.1.5. The symbol ˆ indicates that the function is estimated from the given sample S.
Estimation of the generalization error
As we are only equipped with a finite sample S and the underlying distribution is unknown,
approximations to Eq. (1) have to be found. The empirical counterpart to R[f̂],

R_emp[f̂] = n^{−1} Σ_{i=1}^{n} L(y_i, f̂(x_i)),    (2)

has a (usually large) negative bias, i.e. the prediction error is underestimated. Moreover, choosing the best
classifier based on Eq. (2) potentially leads to the selection of a classifier overfitting the sample S, which
may show poor performance on independent data. More details can be found in recent overview
articles [16–18]. A better strategy consists of splitting S into distinct subsets L (learning sample) and T
(test sample), with the intention to separate model selection and model evaluation. The classifier f̂(·) is
constructed using L only and evaluated using T only, as depicted in Figure 1 (top).
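This split scheme can be sketched in a few lines. The following is an illustrative Python version with simulated data; a toy nearest-centroid rule stands in for an arbitrary classifier f̂, and all sizes are made up.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated toy sample S: n = 40 observations, p = 5 predictors, two classes.
n, p = 40, 5
y = rng.integers(0, 2, size=n)
X = rng.normal(size=(n, p)) + y[:, None]   # class 1 is shifted upwards

def fit_nearest_centroid(X_l, y_l):
    """Construct a simple decision function f: x -> label of the nearest centroid."""
    centroids = {k: X_l[y_l == k].mean(axis=0) for k in np.unique(y_l)}
    return lambda x: min(centroids, key=lambda k: np.linalg.norm(x - centroids[k]))

# Split S into distinct learning and test samples.
perm = rng.permutation(n)
learn, test = perm[:30], perm[30:]

f_hat = fit_nearest_centroid(X[learn], y[learn])              # model selection: L only
holdout_error = np.mean([f_hat(X[i]) != y[i] for i in test])  # evaluation: T only
print(round(float(holdout_error), 2))
```

The essential discipline is in the last two lines: the decision function touches only the learning indices, and the indicator loss is averaged only over the held-out test indices.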
In microarray data, the sample size n is usually very small, leading to serious problems for both the
construction of the classifier and the estimation of its prediction accuracy. Increasing the size of the
learning set (nL → n) typically improves the constructed prediction rule f(·), but decreases the reliability
of its evaluation. Conversely, increasing the size of the test set (nT → n) improves the accuracy
estimation, but leads to poor classifiers, since these are based on fewer observations. While a compromise
can be found if the sample size is large enough, alternative designs are needed in the case of small sample sizes.
The CMA package implements several approaches which are all based on the following scheme.
1. Generate B learning sets L_b (b = 1, . . . , B) from S and define the corresponding test set as
T_b = S \ L_b.

2. Obtain f̂_b(·) from L_b, for b = 1, . . . , B.

3. The quantity

ε̂ = (1/B) Σ_{b=1}^{B} |T_b|^{−1} Σ_{i∈T_b} L(y_i, f̂_b(x_i))    (3)

is then used as an estimator of the error rate, where |·| stands for the cardinality of the considered set.
The underlying idea is to reduce the variance of the error estimator by averaging, in the spirit of the
bagging principle introduced by Breiman [19]. The function GenerateLearningsets from the package
CMA implements several methods for generating Lb and Tb in step 1, which are described below.
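Before turning to the individual schemes, the generic estimator of Eq. (3) can be sketched as follows. This is illustrative Python with simulated data; a toy nearest-centroid rule stands in for f̂_b, and random subsampling plays the role of step 1 (any of the schemes below could be substituted).

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated toy sample S: n = 40 observations, p = 5 predictors, two classes.
n, p = 40, 5
y = rng.integers(0, 2, size=n)
X = rng.normal(size=(n, p)) + y[:, None]

def fit_nearest_centroid(X_l, y_l):
    """Step 2: construct a classifier f_b from a learning set."""
    centroids = {k: X_l[y_l == k].mean(axis=0) for k in np.unique(y_l)}
    return lambda x: min(centroids, key=lambda k: np.linalg.norm(x - centroids[k]))

B, n_learn = 25, 30
test_errors = []
for b in range(B):
    # Step 1: draw a learning set L_b by random subsampling; T_b = S \ L_b.
    perm = rng.permutation(n)
    Lb, Tb = perm[:n_learn], perm[n_learn:]
    f_b = fit_nearest_centroid(X[Lb], y[Lb])
    # Inner sum of Eq. (3): mean indicator loss over the test set T_b.
    test_errors.append(np.mean([f_b(X[i]) != y[i] for i in Tb]))

epsilon_hat = float(np.mean(test_errors))  # Eq. (3): average over the B test sets
print(round(epsilon_hat, 3))
```

Averaging the B per-test-set errors is what stabilizes the estimator relative to a single split.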
LOOCV Leave-one-out cross-validation

For the b-th iteration, T_b consists of the b-th observation only. This is repeated for each observation,
so that B = n.

Neural Networks

Given a vector of covariates x, one forms projections a_r^T x, r = 1, . . . , R, that are then transformed
using an activation function h(·), usually sigmoidal, in order to obtain a hidden layer consisting of
units {z_r = h(a_r^T x)}, r = 1, . . . , R, that are subsequently used for prediction. Training of neural networks tends
to be rather complicated and unstable. For large p, CMA works in the space of "eigengenes", following
the suggestion of [34] by applying the singular value decomposition [35] to the predictor matrix.
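The hidden-layer computation just described can be sketched directly (illustrative Python; the projection directions a_r are drawn at random here, whereas in actual training they are fitted to the data):

```python
import numpy as np

def h(t):
    """Sigmoidal activation function."""
    return 1.0 / (1.0 + np.exp(-t))

rng = np.random.default_rng(3)
p, R = 5, 3                      # p predictors, R hidden units
x = rng.normal(size=p)           # one covariate vector
A = rng.normal(size=(R, p))      # rows a_r: projection directions (random here;
                                 # during training they are fitted to the data)

z = h(A @ x)                     # hidden layer {z_r = h(a_r^T x)}, r = 1, ..., R
print(z.shape)                   # a length-R vector used for the final prediction
```

Each hidden unit is thus a nonlinearly transformed linear projection of the input, and the vector z feeds the output layer.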
Probabilistic Neural Networks

Although termed "Neural Networks", probabilistic neural networks (classifier="pnnCMA") are
actually a Parzen-windows type classifier [36] related to the nearest neighbours approach. For x ∈ T
from the test set and each class k = 0, . . . , K − 1, one computes
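A class score of the Parzen-window type can be sketched as follows. This is a generic Gaussian-kernel version in illustrative Python; the kernel form, the bandwidth sigma, and the simulated data are all assumptions, and CMA's exact formula may differ.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, K = 30, 4, 3
# Ensure every class occurs at least once in the training data.
y = np.concatenate([np.arange(K), rng.integers(0, K, size=n - K)])
X = rng.normal(size=(n, p)) + y[:, None]

def parzen_class_scores(x, X_l, y_l, K, sigma=1.0):
    """Average Gaussian-kernel weight between x and each class's training points.

    A generic Parzen-window rule; kernel and bandwidth are illustrative choices.
    """
    scores = np.empty(K)
    for k in range(K):
        sq_dist = np.sum((X_l[y_l == k] - x) ** 2, axis=1)
        scores[k] = np.mean(np.exp(-sq_dist / (2.0 * sigma ** 2)))
    return scores

x_new = rng.normal(size=p) + 2.0            # a test observation
s = parzen_class_scores(x_new, X, y, K)
print(int(np.argmax(s)))                    # predicted class: highest score
```

The kinship to nearest neighbours is visible here: each training point votes for its class with a weight that decays smoothly with its distance to x.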
- Other requirements: Installation of the R software for statistical computing, release 2.7.0 or higher.
For full functionality, the add-on packages 'MASS', 'class', 'nnet', 'glmpath', 'e1071', 'randomForest',
'plsgenomics', 'gbm', 'mgcv', 'corpcor', 'limma' are also required.
- License: None for usage
- Any restrictions to use by non-academics: None
6 Authors' contributions
MS implemented the CMA package and wrote the manuscript. ALB had the initial idea and supervised the
project. ALB and MD contributed to the concept and to the manuscript.
7 Acknowledgements
We thank the four referees for their very constructive comments which helped us to improve this
manuscript. This work was partially supported by the Porticus Foundation in the context of the
International School for Technical Medicine and Clinical Bioinformatics.
References

1. Ihaka R, Gentleman R: R: A language for data analysis and graphics. Journal of Computational and Graphical Statistics 1996, 5:299–314.
2. Gentleman R, Carey J, Bates D, et al.: Bioconductor: Open software development for computational biology and bioinformatics. Genome Biology 2004, 5:R80.
3. Tibshirani R, Hastie T, Narasimhan B, Chu G: Diagnosis of multiple cancer types by shrunken centroids of gene expression. Proceedings of the National Academy of Sciences 2002, 99:6567–6572.
4. Breiman L: Random Forests. Machine Learning 2001, 45:5–32.
5. Boulesteix AL, Strimmer K: Partial Least Squares: A versatile tool for the analysis of high-dimensional genomic data. Briefings in Bioinformatics 2007, 8:32–44.
6. Dupuy A, Simon R: Critical Review of Published Microarray Studies for Cancer Outcome and Guidelines on Statistical Analysis and Reporting. Journal of the National Cancer Institute 2007, 99:147–157.
7. Ambroise C, McLachlan GJ: Selection bias in gene extraction in tumour classification on basis of microarray gene expression data. Proceedings of the National Academy of Sciences 2002, 99:6562–6566.
8. Berrar D, Bradbury I, Dubitzky W: Avoiding model selection bias in small-sample genomic datasets. BMC Bioinformatics 2006, 22:2245–2250.
9. Boulesteix AL: WilcoxCV: An R package for fast variable selection in cross-validation. Bioinformatics 2007, 23:1702–1704.
10. Statnikov A, Aliferis CF, Tsamardinos I, Hardin D, Levy S: A comprehensive evaluation of multicategory classification methods for microarray gene expression cancer diagnosis. Bioinformatics 2005, 21:631–643.
11. Varma S, Simon R: Bias in error estimation when using cross-validation for model selection. BMC Bioinformatics 2006, 7:91.
12. Mar J, Gentleman R, Carey V: MLInterfaces: Uniform interfaces to R machine learning procedures for data in Bioconductor containers 2007. [R package version 1.10.2].
13. Gentleman R, Carey V, Huber W, Irizarry R, Dudoit S: Bioinformatics and Computational Biology Solutions Using R and Bioconductor. New York: Springer 2005.
14. Ruschhaupt M, Mansmann U, Warnat P, Huber W, Benner A: MCRestimate: Misclassification error estimation with cross-validation 2007. [R package version 1.10.2].
15. Ruschhaupt M, Huber W, Poustka A, Mansmann U: A compendium to ensure computational reproducibility in high-dimensional classification tasks. Statistical Applications in Genetics and Molecular Biology 2004, 3:37.
16. Braga-Neto U, Dougherty ER: Is cross-validation valid for small-sample microarray classification? Bioinformatics 2004, 20:374–380.
17. Molinaro A, Simon R, Pfeiffer RM: Prediction error estimation: a comparison of resampling methods. Bioinformatics 2005, 21:3301–3307.
24. Ripley B: Pattern Recognition and Neural Networks. Cambridge University Press 1996.
25. Wood S: Generalized Additive Models: An Introduction with R. Chapman and Hall/CRC 2006.
26. Friedman J: Regularized discriminant analysis. Journal of the American Statistical Association 1989, 84(405):165–175.
27. Guo Y, Hastie T, Tibshirani R: Regularized Discriminant Analysis and its Application in Microarrays. Biostatistics 2007, 8:86–100.
28. Tibshirani R: Regression shrinkage and selection via the LASSO. Journal of the Royal Statistical Society B 1996, 58:267–288.
29. Zou H, Hastie T: Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society B 2005, 67:301–320.
30. Hastie T, Tibshirani R, Friedman JH: The elements of statistical learning. New York: Springer-Verlag 2001.
31. Breiman L, Friedman JH, Olshen RA, Stone JC: Classification and Regression Trees. Monterey, CA: Wadsworth 1984.
32. Freund Y, Schapire RE: A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting. Journal of Computer and System Sciences 1997, 55:119–139.
33. Friedman J: Greedy Function Approximation: A Gradient Boosting Machine. Annals of Statistics 2001, 29:1189–1232.
36. Parzen E: On estimation of a probability density function and mode. Annals of Mathematical Statistics 1962, 33:1065–1076.
37. Smyth G: Linear models and empirical Bayes methods for assessing differential expression in microarray experiments. Statistical Applications in Genetics and Molecular Biology 2004, 3:3.
38. Guyon I, Weston J, Barnhill S, Vapnik V: Gene Selection for Cancer Classification using support vector machines. Journal of Machine Learning Research 2002, 46:389–422.
39. Buhlmann P, Yu B: Boosting with the L2 loss: Regression and Classification. Journal of the American Statistical Association 2003, 98:324–339.
40. Golub T, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, Coller H, Loh ML, Downing J, Caligiuri MA, Bloomfield CD, Lander ES: Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 1999, 286:531–537.
41. Ioannidis JP: Microarrays and molecular research: noise discovery. The Lancet 2005, 365:488–492.
42. Efron B, Tibshirani R: Improvements on cross-validation: The .632+ bootstrap method. Journal of the American Statistical Association 1997, 92:548–560.
43. Khan J, Wei J, Ringner M, Saal L, Ladanyi M, Westermann F, Berthold F, Schwab M, Antonescu C, Peterson C, Meltzer P: Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nature Medicine 2001, 7:673–679.
44. Tibshirani R, Hastie T, Narasimhan B, Chu G: Class prediction by nearest shrunken centroids, with applications to DNA microarrays. Statistical Science 2002, 18:104–117.
45. Boulesteix AL: PLS dimension reduction for classification with microarray data. Statistical Applications in Genetics and Molecular Biology 2004, 3:33.
46. Binder H, Schumacher M: Allowing for mandatory covariates in boosting estimation of sparse high-dimensional survival models. BMC Bioinformatics 2008, 9:14.
47. Slawski M, Boulesteix AL: GeneSelector. Bioconductor 2008. [R package version 1.1.0: http://www.bioconductor.org/packages/devel/bioc/html/GeneSelector.html].
48. Davis C, Gerick F, Hintermair V, Friedel C, Fundel K, Kueffner R, Zimmer R: Reliable gene signatures for microarray classification: assessment of stability and performance. Bioinformatics 2006, 22:2356–2363.
49. Kanehisa M, Goto S: KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Research 2000, 28:27–30.
50. Subramanian A, Tamayo P, Mootha V, Mukherjee S, Ebert B, Gilette M, Paulovich A, Pomeroy S, Golub T, Lander E, Mesirov J: Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. Proceedings of the National Academy of Sciences 2005, 102:15545–15550.
51. Bovelstad HM, Nygard S, Storvold HL, Aldrin M, Borgan O, Frigessi A, Lingjaerde OC: Predicting survival from microarray data: a comparative study. Bioinformatics 2007, 23:2080–2087.
52. Schumacher M, Binder H, Gerds T: Assessment of survival prediction models based on microarray data. Bioinformatics 2007, 23:1768–1774.
53. Diaz-Uriarte R: SignS: a parallelized, open-source, freely available, web-based tool for gene selection and molecular signatures for survival and censored data. BMC Bioinformatics 2008, 9:30.
54. van Wieringen W, Kun D, Hampel R, Boulesteix AL: Survival prediction using gene expression data: a review and comparison. Computational Statistics & Data Analysis (accepted) 2008.
55. Daumer M, Held U, Ickstadt K, Heinz M, Schach S, Ebers G: Reducing the probability of false positive research findings by pre-publication validation: Experience with a large multiple sclerosis database. BMC Medical Research Methodology 2008, 8:18.
56. McLachlan G: Discriminant Analysis and Statistical Pattern Recognition. Wiley, New York 1992.
57. Young-Park M, Hastie T: L1-regularization path algorithm for generalized linear models. Journal of the Royal Statistical Society B 2007, 69:659–677.
58. Zhu J: Classification of gene expression microarrays by penalized logistic regression. Biostatistics 2004, 5:427–443.
Table 1: Overview of the classification methods in CMA. The first column gives the method name, whereas the name of the classifier in the CMA package is given in the second column. For each classifier, CMA uses either own code or code borrowed from another package, as specified in the third column.
Method    | Name in CMA    | Range           | Signification
gbmCMA    | n.trees        | 1, 2, ...       | number of base learners (decision trees)
LassoCMA  | norm.fraction  | [0;1]           | relative bound imposed on the ℓ1 norm on the weight vector
knnCMA    | k              | 1, 2, ..., |L|  | number of nearest neighbours
nnetCMA   | size           | 1, 2, ...       | number of units in the hidden layer
scdaCMA   | delta          | R+              | shrinkage towards zero applied to the centroids
svmCMA    | cost           | R+              | cost: controls the violations of the margin of the hyperplane
svmCMA    | gamma          | R+              | controls the width of the Gaussian kernel (if used)

Table 2: Overview of hyperparameter tuning in CMA. The first column gives the method name, whereas the name of the hyperparameter in the CMA package is given in the second column. The third column gives the range of the parameter and the fourth column its signification.
Variable selection methods
Method                | Running time per learning set
Multiclass F-test     | 3.1 s
Kruskal-Wallis test   | 3.5 s
Limma                 | 0.16 s
Random Forest†        | 4.1 s

Classification methods
Method                | # variables | Running time per 50 learning sets
DLDA                  | all (2308)  | 2.7 s
LDA                   | 10          | 1.4 s
QDA                   | 2           | 1.0 s
Partial Least Squares | all (2308)  | 4.2 s
Shrunken Centroids    | all (2308)  | 2.8 s
SVM                   | all (2308)  | 88 s

Table 3: Running times of the different variable selection and classification methods used in the real life example. †: 500 bootstrap trees per run.
Figures

Figure 1: Top: Splitting into learning and test data sets. The whole sample S is split into a learning set L and a test set T. The classifier f̂(·) is constructed using the learning set L and subsequently applied to the test set T. Bottom: Evaluation schemes. Schematic display of k-fold cross-validation (left), Monte-Carlo cross-validation with n = 5 and ntrain = 3 (middle), and bootstrap sampling (with replacement) with n = 5 and ntrain = 3 (right).
Figure 2: Hyperparameter tuning: Schematic display of nested cross-validation. In the procedure displayed above, k-fold cross-validation is used for evaluation purposes, whereas tuning is performed within each iteration using inner (l-fold) cross-validation.
Figure 3: Boxplots representing the misclassification rate (top), the Brier score (middle), and the average probability of correct classification (bottom) for Khan's SRBCT data, using seven classifiers: diagonal linear discriminant analysis, linear discriminant analysis, quadratic discriminant analysis, shrunken centroids discriminant analysis (PAM), PLS followed by linear discriminant analysis, SVM without tuning, and SVM with tuning.
Figure 3: Boxplots representing the misclassification rate (top), the Brier score (middle), and the averageprobability of correct classification (bottom) for Khan’s SRBCT data, using seven classifiers: diagonal lin-ear discriminant analysis, linear discriminant analysis, quadratic discriminant analysis, shrunken centroidsdiscriminant analysis (PAM), PLS followed by linear discriminant analysis, SVM without tuning, and SVMwith tuning.