Top Banner
Information-Preserving Techniques Improve Chemosensitivity Prediction of Tumours Based on Expression Profiles E. G. Christodoulou 1 , O. D. Røe 2 , A. Folarin 3 , I. Tsamardinos 1,4 1 Bioinformatics Laboratory, ICS-FORTH, Heraklion, Crete, Greece 2 Dept. of Cancer Research and Molecular Med., NTNU 3 University College London 4 Computer Science Department, University of Crete Abstract. Prior work has shown that the sensitivity of a tumour to a specific drug can be predicted from a molecular signature of gene ex- pressions. This is an important finding for improving drug efficacy and personalizing drug use. In this paper, we present an analysis strategy that, compared to prior work, maintains more information for analysis and leads to improved chemosensitivity prediction. Specifically we show (a) that prediction is improved when the GI50 value of a drug is esti- mated by all available measurements and fitting a sigmoid curve and (b) application of regression techniques often results in more accurate models compared to classification techniques. Thus it is suggested to consider both of the two families of techniques for analysis. In addition, we show that (c) modern variable selection techniques, such as MMPC result in better predictive performance than simple univariate filtering. We demonstrate the strategy on 59 tumor cell lines after treatment with 118 fully characterized drugs obtained by the National Cancer Institute (NCI 60 screening) and biologically comment on the identified molecular signatures of the best predicted drugs. 1 Introduction Prior work shows that the sensitivity of a tumour to a drug can be predicted better than chance based on the gene-expressions of the tumour [18], [5]. This is an important finding because it paves the way to personalized therapy models. Moreover the molecular signatures that are involved in the prediction provide insight into the drug-affecting mechanisms and pathways. However, the machine learning and statistical analysis employed in prior work processes the data in a way that reduces the available information with potential detrimental effects both on the models’ prediction performance as well as the identification of the molecular signatures. First the estimation of the response to a drug in prior work is sub-optimal. The response of a tumour depends of course, on the dosage. The National Can- cer Institute has treated a panel of 60 cancer cell lines with several thousands drugs and has created a dosage-response profile for each combination of drug
11

Information-preserving techniques improve chemosensitivity prediction of tumours based on expression profiles

Apr 27, 2023

Download

Documents

Welcome message from author
This document is posted to help you gain knowledge. Please leave a comment to let me know what you think about it! Share it to your friends and learn new things together.
Transcript
Page 1: Information-preserving techniques improve chemosensitivity prediction of tumours based on expression profiles

Information-Preserving Techniques ImproveChemosensitivity Prediction of Tumours Based

on Expression Profiles

E. G. Christodoulou1, O. D. Røe2, A. Folarin3, I. Tsamardinos1,4

1 Bioinformatics Laboratory, ICS-FORTH, Heraklion, Crete, Greece2 Dept. of Cancer Research and Molecular Med., NTNU

3 University College London4 Computer Science Department, University of Crete

Abstract. Prior work has shown that the sensitivity of a tumour to aspecific drug can be predicted from a molecular signature of gene ex-pressions. This is an important finding for improving drug efficacy andpersonalizing drug use. In this paper, we present an analysis strategythat, compared to prior work, maintains more information for analysisand leads to improved chemosensitivity prediction. Specifically we show(a) that prediction is improved when the GI50 value of a drug is esti-mated by all available measurements and fitting a sigmoid curve and(b) application of regression techniques often results in more accuratemodels compared to classification techniques. Thus it is suggested toconsider both of the two families of techniques for analysis. In addition,we show that (c) modern variable selection techniques, such as MMPCresult in better predictive performance than simple univariate filtering.We demonstrate the strategy on 59 tumor cell lines after treatment with118 fully characterized drugs obtained by the National Cancer Institute(NCI 60 screening) and biologically comment on the identified molecularsignatures of the best predicted drugs.

1 Introduction

Prior work shows that the sensitivity of a tumour to a drug can be predictedbetter than chance based on the gene-expressions of the tumour [18], [5]. This isan important finding because it paves the way to personalized therapy models.Moreover the molecular signatures that are involved in the prediction provideinsight into the drug-affecting mechanisms and pathways. However, the machinelearning and statistical analysis employed in prior work processes the data ina way that reduces the available information with potential detrimental effectsboth on the models’ prediction performance as well as the identification of themolecular signatures.

First the estimation of the response to a drug in prior work is sub-optimal.The response of a tumour depends of course, on the dosage. The National Can-cer Institute has treated a panel of 60 cancer cell lines with several thousandsdrugs and has created a dosage-response profile for each combination of drug

Page 2: Information-preserving techniques improve chemosensitivity prediction of tumours based on expression profiles

2 Christodoulou, Røe, Folarin, Tsamardinos

and tumour. Often, this profile is summarized with a single value such as thelog10GI50 where GI50 is the dosage (in µM) of the drug that reduces the nat-ural tumour growth to 50% within 48 hours. NCI, in the majority of cases,estimates log10GI50 by piece-wise linear interpolation which are then employedby all prior work (e.g., [18], [16], [26]). In this paper, we show that estimating thelog10GI50 values by fitting a sigmoid to the dosage-response profile preservesmore information about the effects of the drug that lead to statistically signifi-cantly improved predictive performance. We show this for the set of the NCI fullycharacterized 118 drugs in the CellMiner database [22].

Second, prior work typically quantizes the log10GI50 values to create classesof tumours, which again disregards information about the exact value of log10GI50.For example, [5], [18] quantize tumours relative to a drug as sensitive and resis-tant, while [26, 16] quantized tumours in three categories: sensitive, intermediate,and resistant. This type of quantization allows the application of machine learn-ing classification techniques, variable selection methods for classification tasks,and statistical hypothesis testing techniques for discrete outcomes. Our compu-tational experiments demonstrate that maintaining the exact log10GI50 valuesand employing regression analysis instead of classification is often preferable asit improves chemosensitivity prediction in approximately half of the cases.

Third, prior work often employs simple methods for identifying molecular sig-natures such as selecting the top k genes that are mostly differentially expressedbetween different classes of tumours. We show that more sophisticated methodssuch as the Max Min Parents and Children (MMPC) algorithm for multi-variatefeature selection [28] select more predictive signatures for the same parameter k.We biologically interpret these signatures for the 4 better predicted drugs.

2 Data and Problem Description

Data Description: Gene-expression profiles were obtained for the NCI-60 cell-line panel [1] (these actually contain expressions only for 59 of the 60 cell lines)representing nine types of cancers: 5 Breast, 6 Central Nervous System (CNS),7 Colon, 6 Leukemia, 10 Melanoma, 8 Lung, 7 Ovarian, 2 Prostate, 8 Renal.The expressions were measured on AffymtrixU133plus2 array containing 54,675probesets that correspond to about 47,000 transcript variants which in turn rep-resent more than 39,500 of the best characterized human genes. We denote withXi the vector of expressions for cell-line i, Xv

i the expression value for probesetv on cell-line i, and with X = {Xi} the matrix of expressions. The raw datahave been subjected to GCRMA normalization before analysis as implementedin the BioConductor platform [2]. The drug-response data for all 59 cell-lineswere obtained from the CellMiner database [22] for a panel of 118 drugs thatare fully characterized. Specifically, for each combination of drug and cell-line,the data contain several pairs of 〈d, r〉, where d is the log10 drug dosage and ris the percentage of tissue that survived at 48 hours after treatment. We denotewith Ri,j the set of such pairs for cell-line i and drug j.

Page 3: Information-preserving techniques improve chemosensitivity prediction of tumours based on expression profiles

Chemosensitivity Prediction of Tumours Based on Expression Profiles 3

Problem Definition: The analysis task we address is to predict the responseto a drug of a tissue with expression vector X. The response of cell-line i to adrug j is often characterized with a single number that we denote with GI50i,jand corresponds to log10GI50, where GI50 is the dosage that reduces tumourgrowth to 50% within 48 hours. GI50i,j is typically not available in the rawdata as it is unlikely that a pair 〈d, 50%〉 is contained in a Ri,j . Thus, the valueof GI50i,j is estimated from the data in Ri,j . Learning predictive models forGI50i,j given a vector X is a regression task. In addition to the above, weare interested in identifying minimal molecular signatures that are optimallypredictive of response. Such minimal yet predictive signatures could provideinsight into the molecular mechanisms of the drug.

3 Improving the estimation of GI50

The GI50i,j values in the publicly available NCI data are usually estimated asfollows : The mean response r(d) for each dosage d is calculated and a piece-wiselinear function is interpolated through these mean values. The estimated GI50value is the concentration that corresponds to r = 50% on this function, denotedas GI50PLIi,j . According to the official NCI-60 site [3], this is the methodologyfollowed for estimating the 55% of the GI50i,j values. The remaining 45% wereeither approximated (manually, we presume) or chosen to be the highest con-centration tested. The latter was chosen when none of the tested concentrationsproduced a 50% drop in the tumour growth rate.

We now present an estimation method that employs all available measure-ments in Ri,j . We assume the dosage-response curve to have a sigmoid shapewhere at 0 dosage (i.e., its logarithm approaches −∞) there is no reduction of thetumour (r=100%) and at infinity the tumour size is reduced to zero (r=-100%).

The equation of a sigmoid that ranges asymptotically between α and α + βand crosses the mid-range at γ is

r = α+β

1 + e(d−γ)δ(1)

where δ is a parameter controlling the slope of the function, r the responseand d the dosage (expressed by its logarithm). Considering that asymptoticallythe drug has no effect at small dosages we set α = −100%; equivalently, athigh dosages the tumour is eradicated completely which corresponds to −100%growth, and so we set β = 200. The remaining two parameters γ and δ wereestimated using least-squares numerical optimization. Specifically, we used thefunction nlinfit of Matlab with initial values γ = −5 and δ = 1. This functionperforms a number of steps towards the steepest descend direction for the pa-rameters γ and δ in order to converge to a good value. In the cases where theprocedure would not converge with these initial values, we repeated it 100 timeswith different initial values for the parameters γ and δ uniformly sampled within[-15 2] (the range of all concentrations in the data). Out of these 100 repetitionsthe parameter pair that led to the least mean squared error (MSE) was selected.

Page 4: Information-preserving techniques improve chemosensitivity prediction of tumours based on expression profiles

4 Christodoulou, Røe, Folarin, Tsamardinos

Fig. 1. (Left)The drug-response measurements for cell line CCRF-CEM (Leukemia) and Carmus-tine (BCNU). The Ri,j values are shown, as well as the fitted sigmoid curve (red color) and therespective piece-wise linear interpolation segments (green color), (Right) Histogram of differences of5-fold cross validated R2 when MMPC is used as the feature selection procedure minus the 5-foldcross validated R2 when univariate filtering is used as the feature selection procedure.

The estimated GI50i,j values are found by setting r=50% and solving Eq. 1. Incertain cases, fitting a sigmoid leads to extreme values. In order to detect theoutliers we applied the matlab function deletoutliers. This implements iterativelythe Grubbs Test that tests one value at a time. In any given iteration, the testedvalue is either the highest or the lowest GI50, and is the value that is furthestfrom the mean GI50j for all cell lines i. If outliers are found they are trimmedto ±2 ∗ σj , where σj is the standard deviation of all currently fitted values to

drug j. We denote the final estimates as GI50Sigi,j . Figure 1(a) shows a graph-ical depiction of Ri,j for the CCRF-CEM (Leukemia) cell-line and Carmustine(BCNU) with the fitted sigmoid superimposed. The corresponding piece-wiselinear interpolation segments are also shown in the figure. We now show thatthis method of estimation leads to improvements in chemosensitivity prediction.The analysis includes the following steps:

Feature Selection: We employed the Max Min Parents and Children algorithms(MMPC) [28] to select a minimal-size, optimally predictive set of probe-sets.MMPC is an algorithm that seeks to identify the neighbors of the variable-to-predict in the Bayesian Network capturing the data distribution. Under certainbroad conditions the identified features have a causal interpretation. MMPChas been shown quite effective in recent extensive experiments [4] against anarray of state-of-the art feature selection methods. In this work, the causal ex-plorer implementation of MMPC was used [25] with the default values for theparameters.

Regression: We employed SVM Regression to construct the predictive mod-els [9]. In our implementation we used the Radial Basis kernel and all otherparameters set to default.

Estimation of Performance: We used a 5-fold cross validation protocol. Foreach training set, the combination of MMPC and SVM regression produced apredictive model that was applied on the hold-out test fold. Metric of per-formance: The metric to measure prediction performance is the 5-fold cross-

Page 5: Information-preserving techniques improve chemosensitivity prediction of tumours based on expression profiles

Chemosensitivity Prediction of Tumours Based on Expression Profiles 5

Fig. 2. (Left) Histogram R2 for GI50Sig (GI50’s fitted by a sigmoid), GI50PLI (standard estima-tion), and their difference, from top to bottom respectively. (Right) Histograms of cross validatedclassification accuracies Aj , discretized regression accuracies Dj and the Aj −Dj differences, fromtop to bottom, respectively

validated R2 (coefficient of determination) [27]. Specifically, for a given drug j,let µ\i be the mean value of GI50 in the data excluding cell-lines in the fold i

(training data), GI50\i the predicted GI50 by the model constructed excludingcell-lines in the fold i, and GI50i the GI50 as estimated by the experiments inthe corresponding fold i for drug j. We define:

R2j ≡ 1−

∑i(GI50i − GI50\i)

2∑i(GI50i − µ\i)2

(2)

The interpretation of R2 is that it corresponds to variance explained (uncer-tainty) by the models, or the reduction of variance by the use of the gen-expressionmodels. The 5-fold cross-validated R2 is a metric that is conservative and doesnot require adjustment for the number of predictors used.

We have computed R2j for all 118 drugs both when the GI50 values are

estimated using piece-wise linear interpolation as well as when fitting a sigmoidfunction, as described above. We denote the corresponding values as RPLIj and

Page 6: Information-preserving techniques improve chemosensitivity prediction of tumours based on expression profiles

6 Christodoulou, Røe, Folarin, Tsamardinos

RSigj . The results are shown in Figure 2(a). The figure depicts the histograms of

RPLIj , RSigj , as well as their difference RSigj −RPLIj .The figure shows that GI50 values estimated by the sigmoid are better pre-

dicted using the protocol described above. Thus at least for the combination ofMMPC and SVM Regression GI50Sig values facilitate the induction of predic-tive models vs. using the GI50PLI . The 95% confidence interval for the medianRSigj − RPLIj as estimated by a Wilcoxon signed-rank test is [0.055, 0.133]. Ofcourse, one could argue that the results may not transfer to other feature se-lection or regression methods. The results however, corroborate our intuitionthat the sigmoid estimation better preserves information in the Ri,j measure-ments and given no evidence to the contrary, we would suggest this method ofestimation in future analyses.

A second observation on the results is that several drugs are well predictedby the models. The specific values of R2 that are considered important, unlessotherwise justified, are subjective. Nevertheless, at least for the Pearson correla-tion r between two quantities and the social sciences, Cohen gives the followinginterpretation guidelines: small effect size, r = 0.1 - 0.23; medium, r = 0.24 -0.36; large, r = 0.37 or larger. Interpreting R2 as r2 and translating the valueswe get approximately the intervals [0.01, 0.05), [0.05, 0.13), [0.13, 1]. Under thisinterpretation for 20 drugs out of the 118, the tumour expression profiles have alarge effect in predicting response; for 23 drugs they have a medium effect, andfor 9 drugs they have a small effect. On the other hand, 69 out of 118 drugshave a negative size effect, meaning that our prediction does not improve muchcompared to the prediction by the mean value.

4 Comparison of Regression versus Classification

In all related prior work, to the best of our knowledge, classification modelshave been constructed for predicting GI50 values [18, 26, 16]. Given that thelatter values are continuous, the authors of these works have quantized thembefore applying any classifiers. They typically construct two categories, sensitiveor resistant tumours, or three categories, sensitive, intermediate, and resistanttumours. Obviously, this quantization is an information-losing transformation.We now show that quantization is sometimes detrimental to performance andregression techniques have greater predictive power.

In this next set of computational experiments we pre-process the GI50 valuesof each drug to discretize them as described in [16]. Specifically, the class Ci,jof a cell-line i and drug j is computed as sensitive, intermediate, or resistant ifGI50Sigi,j falls within (−∞, µj−0.5σj ], (µj−0.5σj , µj+0.5σj ], and [µj−0.5σj ,∞)respectively, where µj is the average GI50 value over all cell lines for drug j andσj the standard deviation.

To evaluate classification, we employed the same overall protocol described inSection 3 with the following modifications: we used multi-class SVM classificationinstead of SVM Regression [7]. SVMs have been very popular and successfulclassifiers, particularly in bioinformatics [24]. We used the libsvm implementation

Page 7: Information-preserving techniques improve chemosensitivity prediction of tumours based on expression profiles

Chemosensitivity Prediction of Tumours Based on Expression Profiles 7

of SVMs with the Radial Basis kernel and all other parameters set to default.In addition, the metric of performance for classification is accuracy, i.e., thepercentage of samples whose class is correctly predicted. We denote with Aj the5-fold cross-validated accuracy of the method on drug j.

To compare regression vs. classification is not straightforward given that re-gression outputs a continuous prediction for GI50Sigi,j while classification outputsits class. To overcome this issue we discretize the output of the regression modelsto the three stated classes using the same intervals as above. This allows us tocompute the cross-validated accuracy of the regression for each drug j, denotedas Dj . In other words, Aj are computed by first discretizing the data, then usingclassification, and measuring the accuracy of the output, while Dj is computedby using regression, then discretizing the predictions, and computing accuracy.

Figure 2(b) shows the histograms of Aj , Dj , and their difference Dj − Aj .In some cases regression accuracy scores higher than classification accuracy andin other cases classification has more predictive results. More specifically, fordrug piperazinedione(NSC:135758) regression accuracy is 57,6271 while classifi-cation accuracy is 38,9394, for drug aminopterin(NSC:132483) regression accu-racy is 61,0169 while classification accuracy is 40,6061 and for drug morpholino-adriamycin(NSC:354646) regression accuracy is 52,5758 while classification ac-curacy is 35,5932. On the other hand, for drug pyrazofurin(NSC:143095) classi-fication accuracy is 52,4242 while classification accuracy is 40,6780.

5 Feature Selection

The most commonly applied feature (variable) selection method is to rank thegenes according to their association with the class (equivalently the p-value)and select the top k. We call this method univariate filtering. In this section, weshow that for this task it is beneficiary to use a state-of-the-art algorithm, thattakes into consideration multivariate associations. MMPC (already mentionedin the previous sections) is such a method. Specifically, we compute the cross-validated R2 for both methods on all drugs using the same protocol as before.The k parameter is set to the number of genes returned by MMPC, so thatboth methods return the signatures of the same sizes. Figure 1(b) presents ahistogram of the results. The figures shows that, on average, MMPC returnsmore informative signatures and thus, it should be preferred.

6 Biological Interpretation of the Molecular Signatures

We now consider the molecular signatures and their genes from a biologicalperspective. Due to space limitation we consider only the top five drugs with thehighest R2, i.e., the drugs for which we get the larger improvement in predictionwhen gene expressions are employed. Out of these, we omit a discussion ontriethylenemelamine as being obsolete. The rest four drugs and their signaturesare shown in Table 1.

Page 8: Information-preserving techniques improve chemosensitivity prediction of tumours based on expression profiles

8 Christodoulou, Røe, Folarin, Tsamardinos

Compound R2 Selected Probesets and Their Gene SymbolsCamptothecin,9-NH2 (RS) 0.4020 216268 s at (JAG1), 218499 at (MST4), 226743 at (FLJ34922)Topoisomerase I inhibitor 227877 at (LOC389289), 229892 at (EP400NL), 238175 at (ACAD8)

Geldanamycin 0.2992 1554479 a at (CARD8), 200673 at (LAPTM4A), 203959 s at (ZBTB40)HSP90 inhibitor 208868 s at (GABARAPL1), 222549 at (CLDN1), 228441 s at (No name)

234478 at (LOC440863), 244007 at (transcribed locus)Camptothecin,9-NH2 (S) 0.2755 1553190 s at (PARD3B), 206440 at (LIN7A), 211440 x at (CYP3A43)Topoisomerase I inhibitor 221191 at (STAG3L1), 226743 at (SLFN11), 236846 at (LOC284757)

Carmustine (BCNU) 0.2718 1554316 at (PGLS), 1558517 s at (LRRC8C), 200641 s at (YWHAZ)Alkylating agent 201488 x at (KHDRBS1), 203460 s at (PSEN1), 211979 at (GPR107)

215427 s at (ZCCHC14), 225864 at (FAM84B), 227346 at (EP400)230629 s at (IKZF1), 232870 at (HDAC10), 233400 at (CDNA FLJ13671)239413 at (CEP152), 240234 at (Trancribed locus)

Table 1. The 4 best predicted drugs (highest R2) along with their gene signatures.

The biological correlates of these findings are quite interesting. We will dis-cuss the target systems of camptothecin, geldanamycin and carmustine (BCNU)and what is already known about the relation of these genes, tumour resistanceand tumour response.

Camptothecin is a topoisomerase I inhibitor that confers DNA damage andconsequently apoptosis of the cancer cell. It is used for the treatment of colorec-tal, lung and breast cancer. Here, two derivatives of camptothecin were iden-tified with two non-overlapping signatures. In the first signature, only two ofthe six genes have some functional information related to cancer, the JAG1 andMST4. JAG1 has been shown to predict poor survival in breast cancer whenover-expressed and has also been proposed as a drug target for these cases [19].MST4 is a gene involved in cell proliferation and transformation, but little isknown about this gene in human tumours [15]. The second camptothecin sig-nature includes six genes, but only five have known function. The PARD3 isinvolved in polarity of cells, microtubulus function and has a co-function with atumour suppressor gene, VHL [20]. The LIN7A encodes a protein that has shownto promote migration and invasiveness in glioblastoma cell lines [17]. It also playsa role in establishing and maintaining the asymmetric distribution of channelsand receptors at the plasma membrane of polarized cells. CYP3A43 encodes aprotein involved in testosterone metabolism, and it has reduced expression inprostate cancer versus normal. Polymorphisms affecting CYP3A5 activity mayresult in increased prostate cancer risk and aggressiveness [13]. The STAG3L1 isrelated to multiple sclerosis, but yet no defined role in cancer. SLFN11 (Schlafengene family) is cell cycle regulator that mediate growth-inhibitory responses.Geldanamycin is a novel and promising experimental drug, possibly for severalcancer forms, an inhibitor of the heat shock protein HSP90. Five of eight genesin the signature have known function. CARD8 belongs to the CARD family pro-teins that are critical components of the conserved cell death machinery which,when dys-regulated, promotes oncogenesis and contributes prominently to tumorresistance to chemotherapy [10]. LAPTM4A is involved in multi-drug resistance,but the relation to geldanamycin is not explored [8]. ZBTB40 is related to bonemineral density and osteoporosis, no relation to cancer so far. GABARAPL1 isan estrogen related gene, and it was currently discovered that the level mRNAexpression in breast tumours is a good indicator of the risk of recurrence [6].

Page 9: Information-preserving techniques improve chemosensitivity prediction of tumours based on expression profiles

Chemosensitivity Prediction of Tumours Based on Expression Profiles 9

CLDN1 is important for the tight junctions between the cells but currently onefound that over-expression was inducing resistance to cell death by chemotherapyin nasopharyngeal carcinoma cells [12].

Carmustine (BCNU) is an alkylatin agent, old drug, but still in use for tu-mour of the central nervous system. Few of the 13 genes are related to cancer, butYWHAZ is an important anti-apoptotic gene where over-expression is relatedto anthracycline resistance in breast cancer [14]. KHDRBS1 is involved in RNAprocessing and its down-regulation was found to inhibit, proliferation and tu-mourigenicity of breast cancer cells [23]. PSEN1 is down-regulated in Alzheimersdisease and mice with deficient PSEN1 were more sensitive to brain carcinogensthan their normal counterparts [21]. FAM84B is related to cancer susceptibility.Knock-down of the EP400 gene confers chemosensitivity [11]. IKZF1 or IKAROSis related to prognosis of various cancers but also to chemotherapy response [29].HDAC10, is important for cancer progression, and a target for cancer therapy,but no relation to BCNU is noted.

7 Conclusion

Predicting chemosensitivity of tumours from gene expressions is important forselecting treatment, understanding the molecular mechanisms of drug response,and selecting molecular signatures. In this paper, we show that predictive per-formance can be improved by employing a new method for estimating the GI50(indication of response to drug), regression algorithms instead of classification,and state-of-the-art, multivariate feature selection. The signatures identified herehave several links to cancer progression and resistance to chemotherapy, but thedirect relations between the genes and the respective drugs are missing. Knowl-edge on these relations are still expanding and the methods used to identifythose signatures may be important tool for novel biological hypotheses.

Acknowledgements

We would like to thank Matina Fragoyanni for her code supply. Moreover, manythanks to Sofia Triantafyllou, Vincenzo Lagani and Angelos Armen for their helpand fruitful comments.

References

1. http://dtp.nci.nih.gov/index.html.

2. http://www.bioconductor.org.

3. http://dtp.nci.nih.gov/docs/compare/compare$\_$methodology.html.

4. C. F. Aliferis, A. Statnikov, I. Tsamardinos, S. Mani, and X. D. Koutsoukos. Localcausal and markov blanket induction for causal discovery and feature selection forclassification part i: Algorithms and empirical evaluation. Journal of MachineLearning Research, Special Topic on Causality, 11:171–234, 2010.

Page 10: Information-preserving techniques improve chemosensitivity prediction of tumours based on expression profiles

10 Christodoulou, Røe, Folarin, Tsamardinos

5. C. K. Augustin, A. Yoo, J. S. Potti, Y. Yoshimoto, P. A. Zipfel, H. S. Friedman,J. R. Nevins, F. Ali Osman, and D. S. Tyler. Genomic and molecular profilingpredicts response to temozolomide in melanoma. Clinical Cancer Res, 15(2), 2009.

6. A. Berthier, S. Seguin, A. J. Sasco, J. Y. Bobin, G. De Laroche, J. Datchary,S. Saez, C. Rodriguez-Lafrasse, F. Tolle, A. Fraichard, M. Boyer-Guittaut, M. Jou-venot, R. Delage-Mourroux, and F. Descotes. High expression of gabarapl1 is asso-ciated with a better outcome for patients with lymph node-positive breast cancer.Br J Cancer, 102:1024–1031, 2010.

7. B. Boser, I. Guyon, and V. Vapnik. An training algorithm for optimal marginclassifiers. In In Fifth Annual Workshop on Computational Learning Theory, pages144–152, 1992.

8. M. A. Cabrita, T. C. Hobman, D. L. Hogue, K. M. King, and C. E. Cass. Mousetransporter protein, a membrane protein that regulates cellular multidrug resis-tance, is localized to lysosomes. Cancer Res, 59:4890–4897, 1999.

9. C. C. Chang and C. J. Lin. LIBSVM: a library for support vector machines, 2001.Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.

10. J. S. Damiano and J. C. Reed. Card proteins as therapeutic targets in cancer.Curr Drug Targets, 5:367–374, 2004.

11. G. V. Helgason, J. O’Prey, and K. M. Ryan. Oncogene-induced sensitization tochemotherapy-induced death requires induction as well as deregulation of e2f1.Cancer Res, 70:4074–4080, 2010.

12. J. W. Lee, W. T. Hsiao, H. Y. Chen, L. P. Hsu, P. R. Chen, M. D. Lin, S. J. Chiu,W. L. Shih, and Y. C. Hsu. Upregulated claudin-1 expression confers resistance tocell death of nasopharyngeal carcinoma cells. Int J Cancer, 126:1353–1366, 2010.

13. S. Leskela, E. Honrado, C. Montero-Conde, I. Landa, A. Cascon, R. Leton, P. Ta-lavera, C. J. M., A. Concha, M. Robledo, and C. Rodriguez-Antona. Cytochromep450 3a5 is highly expressed in normal prostate cells but absent in prostate cancer.Endocr Relat Cancer, 14:645–654, 2007.

14. Y. Li, L. Zou, Q. Li, B. Haibe-Kains, R. Tian, L. Y., C. Desmedt, C. Sotiriou,Z. Szallasi, J. D. Iglehart, A. L. Richardson, and Z. C. Wang. Amplificationof laptm4b and ywhaz contributes to chemotherapy resistance and recurrence ofbreast cancer. Nat Med, 16:214–218, 2010.

15. J. L. Lin, H. C. Chen, H. I. Fang, D. Robinson, H. J. Kung, and H. M. Shih.Mst4, a new ste20-related kinase that mediates cell growth and transformation viamodulating erk pathway. Oncogene, 20:6559–6569, 2001.

16. Y. Ma, Z. Ding, Y. Qian, Y.-W. Wan, K. Tosun, X. Shi, V. Castranova, J. Harner,and N. I. Guo. An integrative genomic and proteomic approach to chemosensitivityprediction. Int. J. Oncol, 34(1):107–115, 2009.

17. C. Perego, C. Vanoni, S. Massari, A. Raimondi, S. Pola, M. G. Cattaneo, M. Fran-colini, L. M. Vicentini, and G. Pietrini. Invasive behaviour of glioblastoma cell linesis associated with altered organisation of the cadherin-catenin adhesion system. JCell Sci, 115:3331–3340, 2002.

18. H. K. Potti, A. Dressman, A. Bild, R. F. Riedel, G. Chan, R. Sayer, J. Cragun,H. Cottrill, M. J. Kelley, R. Petersen, D. Harpole, J. Marks, A. Berchunck, G. S.Ginsburg, P. Febbo, J. Lancaster, and J. R. Nevins. Genomic signatures to guidethe use of chemotherapeutics, 2006.

19. M. Reedijk, S. Odorcic, L. Chang, H. Zhang, N. Miller, D. R. McCready, G. Lock-wood, and S. E. Egan. High-level coexpression of jag1 and notch1 is observed inhuman breast cancer and is associated with poor overall survival. Cancer Res,65:8530–8537, 2005.

Page 11: Information-preserving techniques improve chemosensitivity prediction of tumours based on expression profiles

Chemosensitivity Prediction of Tumours Based on Expression Profiles 11

20. B. Schermer, C. Ghenoiu, M. Bartram, R. U. Muller, F. Kotsis, M. Hohne,W. Kuhn, M. Rapka, R. Nitschke, H. Zentgraf, M. Fliegauf, H. Omran, G. Walz,and T. Benzing. The von hippel-lindau tumor suppressor protein controls ciliogen-esis by orienting microtubule growth. J Cell Biol, 175:547–554, 2006.

21. J. Serrano, A. P. Fernandez, R. Martinez-Murillo, and A. Martinez. High sensitivityto carcinogens in the brain of a mouse model of alzheimer’s disease. Oncogene,29:2165–2171, 2010.

22. U. T. Shankavaram, S. Varma, D. Kane, M. Sunshine, K. K. Chary, W. C. Reinhold,Y. Pommier, and J. N. Weinstein. Cellminer: a relational database and query toolfor the nci-60 cancer cell lines. BMC Genomics, 10(277), 2009.

23. L. Song, L. Wang, Y. Li, H. Xiong, J. Wu, J. Li, and M. Li. Sam68 up-regulationcorrelates with, and its down-regulation inhibits, proliferation and tumourigenicityof breast cancer cells. J Pathol, 222:227–237, 2010.

24. A. Statnikov, C. F. Aliferis, I. Tsamardinos, D. Hardin, and S. Levy. A com-prehensive evaluation of multicategory classification methods for microarray geneexpression cancer diagnosis. Bioinformatics, 21(5):631–643, 2005.

25. A. Statnikov, T. Tsamardinos, L. E. Brown, and C. F. Aliferis. Causal explorer:A matlab library of algorithms for causal discovery and variable selection for clas-sification. Challenges in Causality, 1, 2009.

26. J. E. Staunton, D. K. Slonim, H. A. Coller, P. Tamayo, J. Angelo Michael, J. Park,U. Scherf, J. K. Lee, W. O. Reinhold, J. N. Weinstein, J. P. Mesirov, E. S. Lander,and T. R. Golub. Chemosensitivity prediction by transcriptional profiling. Proc.Natl. Acad. Sci., 98(19):1078710792, 2001.

27. R. G. D. Steel and J. H. Torrie. Principles and Procedures of Statistics. New York:McGraw-Hill, 1960.

28. I. Tsamardinos, L. E. Brown, and C. F. Aliferis. The max-min hill-climbingbayesian network structure learning algorithm. Journal of Machine Learning,65:31–78, 2006.

29. Z. Zuo, D. Jones, H. Yao, D. A. Thomas, S. O’Brien, F. Ravandi, H. M. Kantarjian,L. V. Abruzzo, L. Medeiros, S. S. Chen, and R. Luthra. A pathway-based genesignature correlates with therapeutic response in adult patients with philadelphiachromosome-positive acute lymphoblastic leukemia. Mod Pathol, 23:1524–1534,2010.