Bayesian Weibull tree models for survival analysis of clinico-genomic data

ARTICLE IN PRESS

Statistical Methodology ( ) – www.elsevier.com/locate/stamet


Jennifer Clarke a,∗, Mike West b

a Department of Epidemiology and Public Health, Leonard M. Miller School of Medicine, University of Miami, Miami, FL 33136, USA

b Department of Statistical Science, Duke University, Durham, NC 27705, USA

Received 30 November 2006; received in revised form 9 September 2007; accepted 16 September 2007

Abstract

An important goal of research involving gene expression data for outcome prediction is to establish the ability of genomic data to define clinically relevant risk factors. Recent studies have demonstrated that microarray data can successfully cluster patients into low- and high-risk categories. However, the need exists for models which examine how genomic predictors interact with existing clinical factors and provide personalized outcome predictions. We have developed clinico-genomic tree models for survival outcomes which use recursive partitioning to subdivide the current data set into homogeneous subgroups of patients, each with a specific Weibull survival distribution. These trees can provide personalized predictive distributions of the probability of survival for individuals of interest. Our strategy is to fit multiple models; within each model we adopt a prior on the Weibull scale parameter and update this prior via Empirical Bayes whenever the sample is split at a given node. The decision to split is based on a Bayes factor criterion. The resulting trees are weighted according to their relative likelihood values and predictions are made by averaging over models. In a pilot study of survival in advanced stage ovarian cancer we demonstrate that clinical and genomic data are complementary sources of information relevant to survival, and we use the exploratory nature of the trees to identify potential genomic biomarkers worthy of further study.
© 2007 Elsevier B.V. All rights reserved.

Keywords: Survival analysis; Weibull; Recursive partitioning; Gene expression; Bayes factor; Variable selection; Ovarian cancer; Clustering

∗ Corresponding author. Tel.: +1 604 628 9831; fax: +1 604 628 9831.
E-mail address: [email protected] (J. Clarke).

1572-3127/$ - see front matter © 2007 Elsevier B.V. All rights reserved.
doi:10.1016/j.stamet.2007.09.003

Please cite this article in press as: J. Clarke, M. West, Bayesian Weibull tree models for survival analysis of clinico-genomic data, Statistical Methodology (2007), doi:10.1016/j.stamet.2007.09.003



1. Introduction

Genomic information, in the form of microarray or gene expression signatures, has an established capacity to define clinically relevant risk factors in disease prognosis. Recent studies have generated such signatures related to disease recurrence and survival in ovarian cancer [62,2] as well as in numerous other disease contexts [72,58,48]. Analyses involving gene expression signatures have focused on clustering or classification to associate such signatures or patterns with ‘low-risk’ versus ‘high-risk’ survival prognoses. The clustering of tumors based on expression levels into multiple subgroups has been performed using various methods including support vector machines [72], k-NN models [62], PLS [45] and hierarchical clustering [47]. A more formal discussion and comparison of various tumor discrimination methods with gene expression data can be found in [19].

The application of gene expression signatures to the prediction of disease outcome is a research area distinct from clustering applications. Less attention has been focused on prediction to date, although single genes or gene signatures have been studied for the prediction of tumor classification related to ‘good’ versus ‘poor’ survival prognoses [70,21]. However, the ‘signature’ approaches to prediction of cancer outcome with microarrays have been shown to be highly unstable and strongly dependent on the selection of patients in the training sets [43]. This has been attributed to inadequate validation leading to overoptimistic results, but also reflects the heterogeneity of complex disease. From the perspective of the individual patient a sharper, more specialized approach to prediction is needed.

Bayesian regression tree models, described in Section 2, form the basis of one such approach. The conventional binary regression tree associated with CART [9] has been used successfully for prediction in various modeling contexts [5,63], as have the Bayesian versions of CART [12,16], other Bayesian binary trees [49], and the relative risk trees of Ishwaran, Blackstone, Pothier, and Lauer [32]. A review of tree-based methods for survival can be found in [73]. Research has shown that the prediction accuracy of such models can be improved through Bayesian model averaging [25], bagging [5], boosting [60], and related methods [13]. This holds true for trees whether the model search is stochastic [67,7,35] or deterministic [10,44]. Our approach is a reflection of this finding and of the recent emphasis in the literature on ensemble methods for prediction [14,26,27]. A key aspect of our approach is the averaging of predictions over multiple candidate models, which we discuss in Section 3. Note that Bayesian model averaging not only improves predictive performance but also yields posterior parameter estimates and standard deviations that directly incorporate model uncertainty [51].

In this article we discuss the development of Bayesian tree models that allow the use of clinical, histopathological, and genomic data in the prediction of disease-related survival outcomes. These regression tree models have the ability to discover and evaluate interactions of multiple predictor variables, and define flexible, non-linear predictive tools [49]. Specifically, our method allows the direct evaluation of the relative importance of clinical and genomic predictors. Our approach is demonstrated in the context of prediction of survival after surgery for ovarian cancer patients. We stress the utility of such tree models in the exploration of genomic data, and the resulting identification of genes plausibly associated with clinical endpoints, as well as for prediction.

2. Regression trees

Our focus is the development of regression trees that recursively generate binary partitions of the covariate space, based upon specific clinical and genomic variables, and within each partition accurately model a continuous survival time response variable. One key advantage of such trees is their interpretability: the entire feature space can be explained by a single tree and the prediction for any given individual can be interpreted as a conjunction of simple logical expressions [17,24]. Regression tree models serve as tools for prediction as well as for exploratory data analysis by discovering simple combinations of covariates that correlate with a particular outcome. In the case of genomic data these combinations can then serve as a basis for further biological study. Recent additions to the survival tree modeling literature, including [26,27] and [33], reflect the importance of survival trees as an analytic technique for data sets with complex structure.

In the remainder of this section we discuss model construction and model inference. We begin with a brief overview of recursive partitioning models (Section 2.1) and the use of the Exponential and Weibull distributions to model the conditional distribution of the response variable (Section 2.3). Then we discuss the splitting criterion based on Bayes factors and inference via Empirical Bayes methods (Section 2.2) and posterior distributions and predictive distributions (Section 2.4). The generation of predictive distributions by model averaging is discussed in Section 3. Although our models can be applied to censored data (under the assumption of non-informative censoring) [15,48], we confine our discussion to the fully observed case.

2.1. Recursive partitioning

We assume a continuous survival time response variable $Y$ and a $p$-dimensional vector of covariates $X$. Each covariate $X_j$, $j = 1, \ldots, p$, may be categorical or continuous. We assume that the distribution of $Y \mid X$ can be expressed as $Y \mid g(X)$ where $g$ is a recursive binary partitioning or splitting of the covariate space into disjoint subspaces. Each binary split is defined by a rule which assigns an observation in the current partition to one of two partition subspaces based upon a predictor $X_j$ and a threshold value $\tau$. The choice of the pair $(X_j, \tau)$ is made by finding the pair which reorganizes the data in the current partition into two subgroups whose survival distributions are most different, as assessed by a splitting criterion (see Section 2.2.1). A split is performed if the value of this criterion exceeds a specified threshold of significance. This splitting process continues in a recursive fashion until the existing model cannot be improved. The result is a tree model $M(Y, X)$ in which the terminal nodes or leaves represent a partition of the covariate space in which the distribution of $Y$ is distinct.

For a given node and predictor it is possible that any of several threshold values would yield a significant split. The ability to generate multiple trees at a node may be advantageous. In problems with many predictors, this naturally leads to the generation of many trees, often with small changes from one to the next, and the consequent need to develop inference and prediction in the context of multiple trees generated this way. The use of ‘forests of trees’ and similar ensemble methods has been urged by Breiman [8] as well as others [26] and our perspective endorses this. The involvement of multiple trees in our analyses is supported by the viewpoint that the splitting of nodes is based on the selection of (predictor, threshold) pairs which we view as parameters of the overall tree model. Any single tree is formed by selecting specific values for these parameters and the uncertainty in these parameters is reflected in the variability among trees. The resulting models generate predictions via model averaging. This process is discussed in more detail in the following section and in Section 3.

2.2. Tree generation

We employ a forward-selection process to generate tree models. If the data in a node of a single tree are a candidate for splitting, we find the (predictor, threshold) pair that maximizes the splitting criterion (see below) for a split at the given node. The node is split if the value of this criterion is sufficiently large. Given a current tree the splitting process continues until either the existing model cannot be improved, i.e., the splitting criterion is not sufficiently large for any choice of (predictor, threshold) at any node, or until all of the remaining candidate terminal nodes have very few observations (usually fewer than 5 observed survival times). Our strategy is unlike other tree-growing methods (including CART), which purposely overgrow a tree and then prune back, due primarily to our focus on prediction in settings of low signal to noise. We want to limit adaptivity and avoid overfitting, at the possible cost of missing an association of moderate significance.
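The stopping rules above can be sketched as a greedy recursive routine. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `grow` and `log_mean_ratio` are hypothetical, and the scoring function is a simple stand-in (the absolute log ratio of subgroup mean survival times) used only so that the example runs; the criterion actually used in the paper is the Bayes factor of Section 2.2.1. No pruning step appears, mirroring the forward-only strategy.

```python
from math import log

def log_mean_ratio(left, right):
    """Stand-in split score (NOT the paper's Bayes factor): |log ratio of mean times|."""
    return abs(log(sum(left) / len(left)) - log(sum(right) / len(right)))

def grow(X, y, score=log_mean_ratio, min_score=1.0, min_size=5):
    """Grow a tree forward; stop when a node is small or no split scores high enough."""
    leaf = {"leaf": True, "n": len(y), "total_time": sum(y)}
    if len(y) < min_size:
        return leaf
    best = None
    for j in range(len(X[0])):                          # every predictor
        for tau in sorted({row[j] for row in X})[:-1]:  # every observed cut point
            left = [yi for row, yi in zip(X, y) if row[j] <= tau]
            right = [yi for row, yi in zip(X, y) if row[j] > tau]
            s = score(left, right)
            if best is None or s > best[0]:
                best = (s, j, tau)
    if best is None or best[0] < min_score:
        return leaf                                     # no sufficiently strong split
    _, j, tau = best
    lo = [i for i, row in enumerate(X) if row[j] <= tau]
    hi = [i for i, row in enumerate(X) if row[j] > tau]
    return {"leaf": False, "predictor": j, "threshold": tau,
            "left": grow([X[i] for i in lo], [y[i] for i in lo], score, min_score, min_size),
            "right": grow([X[i] for i in hi], [y[i] for i in hi], score, min_score, min_size)}

# Toy data: survival times jump when predictor 0 exceeds 0.45; predictor 1 is noise.
x0 = [0.05, 0.1, 0.2, 0.3, 0.4, 0.45, 0.55, 0.6, 0.7, 0.8, 0.9, 0.95]
x1 = [0.1, 0.9] * 6
X = [[a, b] for a, b in zip(x0, x1)]
y = [1.0, 1.2, 0.9, 1.1, 1.0, 0.8, 10.0, 12.0, 9.0, 11.0, 10.0, 8.0]
tree = grow(X, y)
```

Passing the score as an argument keeps the stopping logic separate from the choice of criterion, so a Bayes factor function could be substituted without changing the recursion.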

2.2.1. Bayes factors

The choice of splitting criterion is based on the association between the outcome variable $Y$ (survival time) and the covariates $X$ in subsamples. Splitting variables and splitting thresholds are selected based on their ability to strengthen this association. With data $y_1, \ldots, y_n$ in a given node and a specified threshold $\tau$ on a given predictor $x_j$, our test of association is based on assessing whether the data are more consistent with a single exponential distribution (with exponential parameter $\mu_\tau$) or with two separate exponential distributions (with parameters $\mu_{0,\tau}$ and $\mu_{1,\tau}$) defined by the specified partition.

In our Bayesian approach we adopt the standard conjugate Gamma prior model on the Exponential parameter [61]; the prior is Gamma$(a, b)$ where $b = a/m$ and $m$ is the mean of the Gamma prior. We specify a fixed global prior mean but treat the shape parameter $a$ as uncertain and node specific; $a$ is estimated via Empirical Bayes (EB). In brief, suppose a node has $r_z$ individuals with observed survival times and $Y_z$ is the sum of all survival times (here $z = 0, 1$ identifies the node as one of the two children nodes of a parent node). Assuming $\mu_{0,\tau} \neq \mu_{1,\tau}$ we take $\mu_{0,\tau}$ and $\mu_{1,\tau}$ to be independent with common prior Gamma$(a_\tau, b_\tau)$ with mean $a_\tau / b_\tau$. Under the null hypothesis $\mu_{0,\tau} = \mu_{1,\tau}$ the common value has the same Gamma prior. Let the parameters of the current prior Gamma$(a_\tau, b_\tau)$ be expressed as $a_\tau = c_\tau$ and $b_\tau = c_\tau / m$ where $m$ is the prior mean. The empirical Bayes approximation to $\mu_{z,\tau} \mid (r_z, Y_z, \mu_{|1-z|,\tau})$ is Gamma$(c_\tau + r_z, (c_\tau / \hat{\mu}_{z,\tau}) + Y_z)$ where $\hat{\mu}_{z,\tau}$ is the marginal maximum likelihood estimate (MLE), found iteratively. The updated prior will serve as the prior on $\mu_{z,\tau}$ and the EB estimate of $a$ will be used in the splitting criterion. This has two key aspects: first, it permits ‘borrowing strength’ across the two subgroups or children nodes to estimate this key parameter; second, it allows for differing prior Gamma shape parameters at different nodes in each tree, and is thus flexible in responding to varying degrees of uncertainty as we move down the tree.
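The conjugate step underlying this paragraph can be illustrated briefly. The sketch below assumes a Gamma prior parameterized by shape $a$ and rate $b = a/m$, and fully observed Exponential survival times; the function name `gamma_posterior` and the fixed value of `a` are hypothetical. The Empirical Bayes estimation of $a$ described above is not reproduced here.

```python
def gamma_posterior(a, m, times):
    """Conjugate update for an Exponential rate under a Gamma(a, b) prior, b = a / m.

    a     : prior shape (held fixed here; the paper estimates it by Empirical Bayes)
    m     : prior mean of the rate parameter
    times : fully observed survival times in the node
    """
    b = a / m                 # prior rate implied by the prior mean
    r = len(times)            # number of observed survival times
    total = sum(times)        # sum of survival times
    return a + r, b + total   # posterior is Gamma(a + r, b + total)

a_post, b_post = gamma_posterior(a=2.0, m=0.5, times=[1.0, 3.0, 2.0])
posterior_mean_rate = a_post / b_post
```

With the inputs shown, the prior rate is 4, so three observations summing to 6 give a Gamma(5, 10) posterior with mean rate 0.5.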

A candidate split of a given node will organize the data as follows:

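$$
\begin{array}{l|ccc}
 & n_{\mathrm{obs}} & \sum_i Y_i & \\ \hline
x_j \le \tau & r_0 & Y_0 & n_0 \\
x_j > \tau & r_1 & Y_1 & n_1 \\ \hline
 & r & Y &
\end{array}
$$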

where $n_z$ is the total number of survival times in subgroup $z$ (in the uncensored case $r_z = n_z$). The splitting criterion or test of association is based on assessing the Bayes factor $B_\tau$ [37] comparing the null hypothesis $H_0 : \mu_{0,\tau} = \mu_{1,\tau}$ (with common value $\mu_\tau$) with the alternative $H_1 : \mu_{0,\tau} \neq \mu_{1,\tau}$. The Bayes factor $B_\tau$ in favor of the alternative over the null hypothesis is simply

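$$
B_\tau = \frac{\Gamma(a_\tau + r_0)\,\Gamma(a_\tau + r_1)}{\Gamma(a_\tau)\,\Gamma(a_\tau + r)} \cdot \frac{b_\tau^{a_\tau}\,(b_\tau + y)^{a_\tau + r}}{(b_\tau + y_0)^{a_\tau + r_0}\,(b_\tau + y_1)^{a_\tau + r_1}}.
$$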
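For numerical work the Bayes factor is more safely evaluated on the log scale, since the Gamma functions overflow quickly. The following is a minimal sketch of the formula above using only the Python standard library; the function name `log_bayes_factor` and the example inputs are illustrative assumptions, and the prior parameters are passed in directly rather than estimated by Empirical Bayes.

```python
from math import lgamma, log

def log_bayes_factor(a, b, r0, y0, r1, y1):
    """log B_tau for H1 (two Exponential rates) versus H0 (one common rate).

    a, b   : shape and rate of the common Gamma prior
    r0, r1 : numbers of observed survival times in the two subgroups
    y0, y1 : sums of survival times in the two subgroups
    """
    r = r0 + r1
    y = y0 + y1
    log_gammas = lgamma(a + r0) + lgamma(a + r1) - lgamma(a) - lgamma(a + r)
    log_powers = (a * log(b) + (a + r) * log(b + y)
                  - (a + r0) * log(b + y0) - (a + r1) * log(b + y1))
    return log_gammas + log_powers

# Two subgroups with very different total survival time: evidence for a split.
separated = log_bayes_factor(1.0, 1.0, 10, 1.0, 10, 100.0)
# Two subgroups with identical summaries: evidence against a split.
similar = log_bayes_factor(1.0, 1.0, 10, 10.0, 10, 10.0)
```

When one subgroup is empty the expression reduces to zero, i.e. $B_\tau = 1$, which is a convenient sanity check on any implementation.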



The Bayes factor is calibrated to the likelihood-ratio scale. However, it will provide more conservative estimates of significance than both likelihood-based approaches and more traditional significance tests [57]. The Bayes factor will naturally choose smaller models over more complex ones if the quality of fit is comparable and hence provide a control on the size of our trees [3].

In comparing predictors the Bayes factor can be evaluated for each predictor across a range of predictor-specific thresholds. For a given predictor this generates values of $B_\tau$ as a function of $\tau$, which may suggest promising threshold values.
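A profile of the Bayes factor over thresholds for one predictor might be computed as follows. This is a sketch under assumptions: candidate thresholds are taken to be the observed predictor values (the paper does not say how the threshold grid is chosen), the prior parameters are fixed at illustrative values, and the names `scan_thresholds` and `log_bf` are hypothetical.

```python
from math import lgamma, log

def log_bf(a, b, r0, y0, r1, y1):
    """Log Bayes factor for splitting one Exponential sample into two (Section 2.2.1)."""
    r, y = r0 + r1, y0 + y1
    return (lgamma(a + r0) + lgamma(a + r1) - lgamma(a) - lgamma(a + r)
            + a * log(b) + (a + r) * log(b + y)
            - (a + r0) * log(b + y0) - (a + r1) * log(b + y1))

def scan_thresholds(x, y, a=1.0, b=1.0):
    """Return (threshold, log Bayes factor) for each candidate cut on one predictor."""
    profile = []
    for tau in sorted(set(x))[:-1]:        # the largest value would leave one side empty
        left = [yi for xi, yi in zip(x, y) if xi <= tau]
        right = [yi for xi, yi in zip(x, y) if xi > tau]
        profile.append((tau, log_bf(a, b, len(left), sum(left), len(right), sum(right))))
    return profile

# Toy data on the Exponential scale: times are short below x = 0.5, long above it.
x = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
y = [0.2, 0.1, 0.3, 0.2, 5.0, 7.0, 6.0, 4.0]
profile = scan_thresholds(x, y)
best_tau, best_log_bf = max(profile, key=lambda p: p[1])
```

Inspecting the whole profile rather than only its maximum is what allows several near-equivalent thresholds to seed different trees.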

2.3. Weibull transformation

Suppose that a node of a given tree is to be split on a predictor $x_j$ at the (threshold) value $\tau$. Let $y_{zi}$ and $r_z$ be as defined in Section 2.2.1 where $i$ denotes the $i$th individual in subgroup $z$, $i = 1, \ldots, n_z$, and $y_{zi} \sim \mathrm{Exp}(\mu_{z,\tau})$. The data density is

$$
p(y_z \mid \mu_{z,\tau}, r_z) = \mu_{z,\tau}^{r_z} \exp\Big(-\mu_{z,\tau} \sum_{i=1}^{n_z} y_{zi}\Big).
$$

A careful examination of data from earlier studies of survival and cancer [2,15] revealed that the survival distribution could be more accurately represented by a Weibull distribution. The Weibull is perhaps the most widely used parametric survival model and, through its shape parameter, can be viewed as a generalization of the Exponential [30]. We subsequently denote the survival time as $t_z$, where $t_z$ has a Weibull distribution with parameters $\mu_{z,\tau}$ and $\alpha$. If $t_{zi}$ is the survival time for individual $i$ from subgroup $z$ then

$$
p(t_{zi} \mid \mu_{z,\tau}, r_z, \alpha) = \alpha\, \mu_{z,\tau}\, t_{zi}^{\alpha - 1}\, e^{-\mu_{z,\tau} t_{zi}^{\alpha}}
$$

for $i = 1, \ldots, n_z$, $z = 0, 1$. Note that the Weibull distribution is a power transformation of the Exponential distribution ($y_{zi} = t_{zi}^{\alpha}$). For a specified, global Weibull shape parameter $\alpha$, we can transform the data to Exponential, analyze the data and build trees with Exponential survival distributions, and then transform back to the original scale for predictions of new cases.
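The power transformation and its inverse are simple enough to state in a few lines. The sketch below assumes a known global shape parameter; the function names are hypothetical, and the survival function is the one implied by the Weibull density given above.

```python
from math import exp

def to_exponential_scale(times, alpha):
    """Map Weibull (shape alpha) survival times t to y = t**alpha."""
    return [t ** alpha for t in times]

def to_time_scale(values, alpha):
    """Inverse map back to the original time scale, t = y**(1 / alpha)."""
    return [v ** (1.0 / alpha) for v in values]

def weibull_survival(t, mu, alpha):
    """S(t) = exp(-mu * t**alpha), the survival function of the density above."""
    return exp(-mu * t ** alpha)

times = [0.5, 2.0, 7.5]
alpha = 1.2
y = to_exponential_scale(times, alpha)   # analyzed as Exponential data
recovered = to_time_scale(y, alpha)      # back on the original scale
```

Because the map is monotone, an event $\{T > t\}$ on the original scale corresponds to $\{Y > t^{\alpha}\}$ on the Exponential scale, which is what makes the back-transformation of predictions straightforward.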

In our parameterization of the Weibull the scale parameter has been incorporated into the definition of $\mu_{z,\tau}$. As the value of $\mu_{z,\tau}$ varies across different nodes of a tree so does the scale parameter. Since the splitting criterion for the trees is based on a significance test of the value of $\mu_{z,\tau}$, the scale parameter is implicitly, although not directly, incorporated into the splitting criterion and hence used for growing the tree. The current model could be reparameterized to address the scale parameter directly; however, this would require an entirely different Bayesian analysis as the interpretation of $\mu_{z,\tau}$ is essential to the current conjugate analysis (see [61]).

2.4. Inference and prediction

Inference and prediction at a terminal node or leaf of a given tree involve the calculation of branch probabilities and the posterior predictive distributions which underlie the predictive probabilities for new cases. To calculate the branch probabilities for a leaf we must follow the path or sequence of nodes of the tree that connect the root node with the specified leaf.

We consider the $k$th node of the tree and suppose that it is split on the pair $(x_{jk}, \tau_{jk})$, where the notation of Section 2.2 has been extended to include the node index. The data in node $k$ can be divided into two groups based on the values of $(x_j, \tau_j)$, where the sums of all of the survival times in the $x_j \le \tau_j$ and $x_j > \tau_j$ groups are $Y_{0k}$ and $Y_{1k}$, respectively. The implied conditional probabilities $P(Y_{zki} > t \mid Z = z)$, $i = 1, \ldots, n_{zk}$, for some time $t$ are the branch probabilities defined by this split (the dependence of these probabilities on the tree and the data is suppressed for clarity). From Sections 2.2 and 2.3 we know that these probabilities are based on Exponential distributions for $y_{zki}$ with parameter $\mu_{z,\tau_{jk}}$ for $z = 0, 1$ and specified Gamma priors which we index by the parent node, i.e., Gamma$(a_{\tau_{jk}}, b_{\tau_{jk}})$. The use of EB to estimate $a_{\tau_{jk}}$ has been described in Section 2.2 and will not be discussed here. Assuming that node $k$ is split, the resulting conditional posterior branch probability parameters would be independent with posterior Gamma distributions:

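$$
\mu_{0,\tau_{jk}} \sim \mathrm{Gamma}(a_{\tau_{jk}} + r_{0j},\; b_{\tau_{jk}} + y_{0j}) \quad \text{and} \quad \mu_{1,\tau_{jk}} \sim \mathrm{Gamma}(a_{\tau_{jk}} + r_{1j},\; b_{\tau_{jk}} + y_{1j}).
$$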

These distributions allow inference on the branch probabilities and play an essential role in predictive calculations, as we now describe.

Let $x^*$ be an observed vector of covariates for a new case and consider predicting the response $P(y^* > t \mid x^*)$ for a given time $t$. The current tree will define a single path for this observation from the root node to a terminal node or leaf. Prediction requires that we follow $x^*$ along its path down the tree to the implied leaf and construct the relevant posterior defined by the $(x, \tau)$ pairs at the splits that we encounter along the path. For example, suppose that our new case $x^*$ has an implied path through nodes 1 and 2 terminating at node 5 (a leaf), where each tree split defines exactly 2 children nodes (node numbers increase from left to right within levels starting with the root node as node 1). This path is based on (predictor, threshold) pairs $(x_1, \tau_1)$ and $(x_2, \tau_2)$ and is a result of predictor values $(x^*_1 \le \tau_1)$ and $(x^*_2 > \tau_2)$. After the root split the parameter of the Exponential distribution of the survival times in node 2 has a posterior Gamma distribution, i.e., $\mu_{0,\tau_1,1} \sim \mathrm{Gamma}(a_{\tau_1,1} + r_{01},\; b_{\tau_1,1} + y_{01})$.

The prior parameters $a_{\tau_1,1}$ and $b_{\tau_1,1}$ are updated via empirical Bayes using $r_{01}$ and $y_{01}$, resulting in $a_{\tau_2,2}$ and $b_{\tau_2,2}$. The split of node 2 would lead to a posterior Gamma distribution for the parameter of the Exponential distribution of the survival times in node 5, i.e., $\mu_{1,\tau_2,2} \sim \mathrm{Gamma}(a_{\tau_2,2} + r_{12},\; b_{\tau_2,2} + y_{12})$. For notational simplicity, let $\mu_{1,\tau_2,2}$, $a_{\tau_2,2} + r_{12}$, and $b_{\tau_2,2} + y_{12}$ be denoted by $\mu_5$, $a_5$ and $b_5$, respectively. The prediction of the response $P(y^* > t \mid x^*)$ involves the posterior predictive distribution of future survival times for cases in node 5, i.e.,

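$$
P(y^* > t \mid x^*) = \frac{b_5^{a_5}}{\Gamma(a_5)} \int_0^\infty \mu_5^{(a_5 - 1)} e^{-\mu_5 (t + b_5)}\, d\mu_5 = \frac{b_5^{a_5}}{\Gamma(a_5)} \cdot \frac{\Gamma(a_5)}{(b_5 + t)^{a_5}} = \left( \frac{b_5}{b_5 + t} \right)^{a_5}
$$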

which is a Gamma mixture of exponentials, or a Pareto distribution of the second kind ($P(II)(0, b_5, a_5)$) [34].
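The closed form above translates directly into code. The sketch below is illustrative: the function name `predictive_survival` is hypothetical, and the optional `alpha` argument reflects an assumption about how the back-transformation of Section 2.3 would be applied, namely that a time $t$ on the original scale corresponds to $t^{\alpha}$ on the Exponential scale.

```python
def predictive_survival(t, a, b, alpha=1.0):
    """Posterior predictive P(T > t) at a leaf whose rate has a Gamma(a, b) posterior.

    With alpha = 1 this is the Pareto (second kind) survival function (b / (b + t))**a.
    For alpha != 1, t is first mapped to the Exponential scale as t**alpha.
    """
    y = t ** alpha
    return (b / (b + y)) ** a

# Survival curve at a few time points for a leaf with a = 5, b = 10.
curve = [predictive_survival(t, a=5.0, b=10.0) for t in (0.0, 1.0, 2.0, 5.0)]
```

The curve starts at 1 and decreases monotonically, with heavier tails than any single Exponential because uncertainty in the rate has been integrated out.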

Prediction follows by estimating $P(y^* > t \mid x^*)$ based on the sequence of conditionally-independent posterior distributions for the branch probabilities that define it. Simply plugging in the posterior conditional means of each $\mu_{z,\tau,j}$ will lead to a plug-in estimate of $P(y^* > t \mid x^*)$. Since each Exponential parameter follows a Gamma posterior, it is possible to draw Monte Carlo samples of the $\mu_{z,\tau,j}$ and compute the corresponding values of $P(y^* > t \mid x^*)$ to generate a posterior sample for summarization. In this way we can examine the simulation-based posterior means and uncertainty intervals for $P(y^* > t \mid x^*)$ which represent predictions of the survival probabilities for the new case.
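A simulation of this kind for a single leaf might look as follows. This is a sketch, not the authors' procedure: it draws the Exponential parameter from an assumed Gamma(a, b) posterior (shape a, rate b) and evaluates $e^{-\mu t}$ for each draw; the function name, the number of draws and the seed are arbitrary choices. Note that Python's `random.gammavariate` takes a shape and a scale, so the rate must be inverted.

```python
import random
from math import exp

def sample_survival_probs(t, a, b, n_draws=10000, seed=0):
    """Posterior draws of P(Y > t | mu) = exp(-mu * t), with mu ~ Gamma(shape a, rate b)."""
    rng = random.Random(seed)
    # gammavariate(shape, scale): the scale is the reciprocal of the rate b.
    return [exp(-rng.gammavariate(a, 1.0 / b) * t) for _ in range(n_draws)]

draws = sorted(sample_survival_probs(t=2.0, a=5.0, b=10.0))
mc_mean = sum(draws) / len(draws)
# Approximate central 95% interval from the sorted draws.
interval = (draws[int(0.025 * len(draws))], draws[int(0.975 * len(draws)) - 1])
# The Monte Carlo mean should approach the closed-form predictive probability.
closed_form = (10.0 / (10.0 + 2.0)) ** 5.0
```

Comparing the simulated mean with the closed form is a useful check that the Gamma parameterization has been handled correctly.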

3. Generation and weighing of multiple trees

The use of forests of trees and similar ensemble methods has been urged by Breiman [8] as well as others [44,26] as previously noted. In our analyses the (predictor, threshold) pairs are viewed as parameters of the overall tree model. Statistical learning about relevant trees requires the examination of aspects of the posterior distributions of these parameters (and of the branch probabilities). Our Bayesian approach to survival tree modeling allows us to properly address model uncertainty, as has been done in similar contexts by others [10,16,12].

Trees are known as unstable classifiers [9]; however, predictions may be improved by selecting a group of models instead of a single model and generating predictions by model averaging, as in [10,25]. Copies of the ‘current’ tree are made and the current node is split on a different significant (predictor, threshold) choice for each copy. Once a number of trees have been generated we can involve all or some of them in inference and prediction by weighting the contribution of each tree by its relative likelihood value. As a result of the current framework of forward generation of trees the likelihood values are easy to compute. For any single tree the overall marginal likelihood can be calculated by identifying the nodes which have been split and taking the product of the component marginal likelihoods defined by each split node. In other words (using the notation of Section 2.4) the marginal likelihood component defined by node $k$ is

$$
m_k = \int_0^\infty \prod_{z=0,1} p(y_{zk} \mid \mu_{z,\tau_{jk}}, r_{zk}, \tau_{jk})\, p(\mu_{z,\tau_{jk}})\, d\mu_{z,\tau_{jk}}
$$

where $p(\mu_{z,\tau_{jk}})$ is the Gamma$(a_{\tau_{jk}}, b_{\tau_{jk}})$ prior for each $z = 0, 1$. We simplify this to

$$
m_k = \prod_{z=0,1} \frac{b_{\tau_{jk}}^{a_{\tau_{jk}}}\, \Gamma(a_{\tau_{jk}} + r_{zk})}{(b_{\tau_{jk}} + y_{zk})^{(a_{\tau_{jk}} + r_{zk})}\, \Gamma(a_{\tau_{jk}})}.
$$
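The component marginal likelihood is a product of two terms of the same form, so it is convenient to accumulate it on the log scale. The sketch below is a hypothetical helper, with the prior parameters supplied directly and a tree represented simply as a list of its split nodes; neither the function names nor this representation come from the paper.

```python
from math import lgamma, log

def log_node_marginal(a, b, children):
    """log m_k for one split node; children = [(r_0, y_0), (r_1, y_1)].

    a, b : shape and rate of the Gamma prior attached to the node
    r_z  : number of observed survival times in child z
    y_z  : sum of (Exponential-scale) survival times in child z
    """
    total = 0.0
    for r, y in children:
        total += a * log(b) + lgamma(a + r) - (a + r) * log(b + y) - lgamma(a)
    return total

def log_tree_marginal(split_nodes):
    """Sum of log m_k over split nodes; split_nodes = [(a, b, children), ...]."""
    return sum(log_node_marginal(a, b, children) for a, b, children in split_nodes)

# A tree with two split nodes, each carrying its own prior parameters.
example_tree = [(1.0, 1.0, [(4, 0.8), (4, 22.0)]),
                (2.0, 1.5, [(2, 0.3), (2, 0.5)])]
log_m = log_tree_marginal(example_tree)
```

A quick check: with $a = b = 1$ and a single observation $y$ in one child and none in the other, the formula reduces to $1/(1+y)^2$.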

The product of the component marginal likelihood values over all such split nodes $k$ is the overall marginal likelihood value for the tree. This value is relative to the overall marginal likelihood values of all of the trees generated, which can be normalized to provide relative posterior probabilities for the trees based on an assumed uniform prior. These probabilities are valuable for both tree assessment and as relative weights in calculating average predictions for future observations. To represent predictions across many candidate trees, we use simulation: sample a tree model according to the posterior probabilities, i.e., the normalized relative likelihoods, then sample the implied unique Pareto distribution for a candidate future sample, based on the predictor profile of that case, in the chosen tree. Repeating this leads to a Monte Carlo sample from the predictive distribution that represents both within-tree uncertainties and, potentially critically, uncertainty across tree models. These samples can be summarized to produce point and interval estimates of survival probabilities at any chosen set of time points, and profiles of the full predicted survival distributions.
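The two-stage simulation can be sketched as follows, under several assumptions: each tree is summarized only by the Gamma posterior parameters $(a, b)$ of the leaf into which the new case falls, tree weights come from normalizing log marginal likelihoods under a uniform prior over trees, and Pareto draws are generated by inverting the survival function $(b/(b+y))^a$. All names and numerical inputs are illustrative, and draws are on the Exponential scale (a Weibull analysis would map them back via $y^{1/\alpha}$).

```python
import random
from math import exp, log

def normalize(log_marginals):
    """Turn per-tree log marginal likelihoods into weights that sum to one."""
    top = max(log_marginals)
    unnorm = [exp(v - top) for v in log_marginals]   # subtract the max for stability
    total = sum(unnorm)
    return [u / total for u in unnorm]

def sample_predictive(leaves, log_marginals, n_draws=5000, seed=1):
    """Draw future survival times by sampling a tree, then its leaf's Pareto distribution.

    leaves : one (a, b) pair per tree, for the leaf containing the new case
    """
    rng = random.Random(seed)
    weights = normalize(log_marginals)
    draws = []
    for _ in range(n_draws):
        a, b = rng.choices(leaves, weights=weights, k=1)[0]
        u = 1.0 - rng.random()                       # uniform on (0, 1]
        # Invert S(y) = (b / (b + y))**a = u  =>  y = b * (u**(-1/a) - 1)
        draws.append(b * (u ** (-1.0 / a) - 1.0))
    return draws

leaves = [(5.0, 10.0), (3.0, 4.0)]
log_marginals = [0.0, -log(2.0)]                     # second tree half as likely
draws = sample_predictive(leaves, log_marginals, n_draws=20000)
frac_beyond_2 = sum(d > 2.0 for d in draws) / len(draws)
```

The empirical fraction of draws exceeding a time point estimates the model-averaged survival probability there, which in this toy case can be compared with the weighted average of the two closed-form leaf probabilities.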

4. Sensitivity and performance on simulated data

Like any method for statistical inference our modeling approach and results will depend on various assumptions. These include the choice of prior and the data likelihood. In this section we consider the sensitivity of our method to the assumed value of the Weibull shape parameter $\alpha$ (see Section 2.3) in a predictive context using simulated data. To aid in determining whether our method behaves as expected, we employ two other modeling approaches for comparison. Our hope is that this assessment, although limited, will provide useful information concerning the strengths and weaknesses of our approach.

4.1. Setup

Our setup is similar to that of Hothorn et al. [28]. Five independent predictors $X_1, X_2, \ldots, X_5$ were generated from a uniform distribution on [0,1]. Survival times were generated from an Exponential distribution with conditional survival function $S(y \mid x) = \exp(-y \mu_x)$ under three models with logarithms of the hazards (A) $\log(\mu_x) = 0$, (B) $\log(\mu_x) = 3 I(X_1 \le 0.5 \cap X_2 > 0.5)$, or (C) $\log(\mu_x) = 3 X_1 + X_2$. These times were then transformed to follow a Weibull distribution with shape parameter $\alpha$; values of $\alpha$ were [0.5, 0.8, 1.0, 1.2, 1.5].

Fig. 1. Median MIE values (with 95% error bars) from model A simulation with Weibull shape parameter = 0.8. The two y-axes indicate median MIE value (left) and percent improvement over Kaplan–Meier (right).
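The data-generating setup described above can be sketched with the standard library alone. The function name `simulate` and the seed are hypothetical, and the Weibull step assumes the power transformation of Section 2.3, i.e. that a Weibull time is obtained from an Exponential time as $t = y^{1/\alpha}$; the paper does not spell out how the transformation was carried out.

```python
import random
from math import exp

def simulate(model, alpha, n, seed=0):
    """Generate n (predictors, survival time) pairs under model 'A', 'B' or 'C'."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = [rng.random() for _ in range(5)]           # X1..X5 ~ Uniform[0, 1]
        if model == "A":
            log_mu = 0.0                               # null model: constant hazard
        elif model == "B":
            log_mu = 3.0 if (x[0] <= 0.5 and x[1] > 0.5) else 0.0
        elif model == "C":
            log_mu = 3.0 * x[0] + x[1]
        else:
            raise ValueError("model must be 'A', 'B' or 'C'")
        y = rng.expovariate(exp(log_mu))               # Exponential with rate mu_x
        t = y ** (1.0 / alpha)                         # assumed Weibull transform, y = t**alpha
        data.append((x, t))
    return data

learning = simulate("B", alpha=1.2, n=200, seed=0)
evaluation = simulate("B", alpha=1.2, n=100, seed=1)
```

Using different seeds for the learning and evaluation samples keeps them independent, matching the 200/100 design described in the text that follows.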

The behavior of our models was compared to both a simple Kaplan–Meier curve and survival trees as implemented in the rpart package [66] in the R system for statistical computing [50]. Comparison to proportional hazards has been presented elsewhere [15]. The parameters for the rpart routine were set as in [28]. For our tree models the maximum number of trees allowed was 30, the minimum Bayes factor value required for a split was 2.5, and only nodes containing at least 3 observations were candidates for splitting. Numerous parameter combinations were tried with minimal impact on the results, if any. Trees with normalized likelihood values below 5% were removed from consideration. The mean integrated squared error (MIE) was employed as a measure of the quality of the model predictions (computed by numerical integration). The learning sample contained 200 observations and the value of the predictions was evaluated on an independent sample of 100 observations.

4.2. Results

We have selected three representative runs to discuss: model A at $\alpha = 0.8$, model B at $\alpha = 1.2$, and model C at $\alpha = 1.5$. Within each run the value of $\alpha$ assumed by the Weibull tree models (which we will refer to as $\alpha_{fix}$) takes each value from the set [0.5, 0.8, 1.0, 1.2, 1.5]. The median MIE result and the 95% confidence interval for the median MIE calculated for 100 replications of the learning and evaluation sets for these runs are displayed in Figs. 1–3. For model A the Weibull tree performs best at values of $\alpha_{fix}$ near the true $\alpha$. As expected, the error increases as $\alpha_{fix}$ moves away from $\alpha$ but the performance of the trees does not degrade notably until $\alpha_{fix} \gg 1$. Note that $\alpha = 1$ is the transition point for the monotonicity of the hazard function (the hazard is decreasing for $\alpha < 1$ and increasing for $\alpha > 1$). The Weibull method was able to capture the correct (null) model in all runs across $\alpha_{fix}$ values; however, the number of trees selected increased as $\alpha_{fix} \gg 1$ (from an average of 1 tree selected to an average of 1.25 trees selected). For comparison the rpart method selected the correct (null) model in 91.8% of runs.

Fig. 2. Median MIE values (with 95% error bars) from model B simulation with Weibull shape parameter = 1.2. The two y-axes indicate median MIE value (left) and percent improvement over Kaplan–Meier (right).

In Fig. 2 the Weibull model performs well for $\alpha_{fix} > 1$ with $\alpha = 1.2$. For small values of $\alpha_{fix}$ the error and variability of the results increase; this parallels an increase in the number of trees selected and a decrease from 97% to 71% in the number of runs where the correct model was selected (the rpart method selected the correct model in 86.6% of runs). Similarly to the results in Fig. 1 we see a loss of performance for values of $\alpha_{fix} < 1$ where the monotonicity of the assumed hazard function is opposite to the monotonicity of the true hazard function. This pattern is also reflected in Fig. 3 but with relatively larger error bars. The increase in variability in Fig. 3 is the result of averaging over more trees (an average of 1.58 trees) in an attempt to capture the linear equation in the log hazard function. Overall the results demonstrate that the Weibull trees are sensitive to the monotonicity of the assumed hazard function, as reflected in the value of $\alpha_{fix}$, and its correspondence to the monotonicity of the true hazard function.

5. Analysis in ovarian cancer research

Ovarian cancer is the deadliest of the gynecologic cancers and the fifth leading cause of cancer deaths among women today [1]. When making ovarian cancer diagnoses and prognoses clinicians rely on subjective interpretations of both clinical and histopathological information, which can be incomplete or unreliable [62]. Recent studies in ovarian cancer have demonstrated the potential of genomic data to improve our ability to predict patient survival and treatment response [62,2].

We chose to utilize Weibull trees to explore pilot data collected from 119 advanced stage ovarian cancer patients treated at either Duke University Medical Center or H. Lee Moffitt Cancer Center & Research Institute. The primary purpose of this analysis was to determine whether genomic data could demonstrate ability to predict survival that was not reflected in available clinical data such as disease-free interval (time between primary chemotherapy/disease relapse and disease recurrence) and, if so, to explore which genes may demonstrate such ability and whether a larger study would be of interest. Tissue samples were collected at the time of initial cytoreductive surgery and all patients received primary chemotherapy with a platinum-based regimen (usually including taxane) subsequent to surgery. Detailed clinical records of traditional risk factors (age, stage, grade, debulking status) and measurement of disease-free interval were available for 55 of the 119 patients and have been summarized in Table 1. Gene expression data were generated for each patient at the institution of sample origin from RNA extracted from banked tissue derived from primary tumor biopsies. This RNA was hybridized to Affymetrix Human U133A GeneChips according to standard Affymetrix protocol. The results were expression levels from over 22,000 genes and expressed sequence tags (ESTs) for each individual. The pre-processing of the gene expression data (normalization and screening) and the use of dimension reduction techniques to build composite genomic predictors prior to analysis are discussed in Sections 5.1 and 5.2. The overall survival time (time from diagnosis to patient death) was selected as the response variable.

Fig. 3. Median MIE values (with 95% error bars) from model C simulation with Weibull shape parameter = 1.5. The two y-axes indicate median MIE value (left) and percent improvement over Kaplan–Meier (right).

The clinical characteristics of the Duke and Moffitt samples were not comparable (see Table 1) and hence we could not use one for training the model and one for validation. We excluded the possibility of using leave-one-out cross-validation due to its instability in model selection [6] and decided instead to divide the combined data set of 55 samples into a training set (60% of samples) and a test set (40% of samples). Although this may introduce bias in internal validation [52], the primary interest in terms of a possible future study is in external validation. Training and test sets were balanced for age, array location (Duke or Moffitt), debulking status, and response to platinum therapy. In order to account for possible assignment bias due to unknown factors we performed 10 runs; in each run the samples were split into different training and test sets and all steps of the analysis, including expression data pre-processing, were repeated.

5.1. Pre-processing of expression data

The ovarian cancer data contained expression levels from over 22,000 genes and expressed sequence tags (ESTs) for each individual. We chose to use GeneChip RMA (GCRMA) as our measure of expression since it has been shown to balance accuracy and precision [31]. Our expression data were initially screened to exclude genes showing minimal variation across samples. We evaluated the remaining genes for consistency across both sets using integrative



Table 1
Ovarian cancer clinical data

                              Duke                 Moffitt
                              N       %            N       %
Age
  <45                         2       5.7          2       10.0
  45–55                       8       22.9         5       25.0
  55–65                       11      31.4         7       35.0
  65–75                       11      31.4         4       20.0
  >=75                        3       8.6          2       10.0
  Mean/Min/Max                60/33/79             59/33/76
Stage
  III                         31      88.6         14      70.0
  IV                          4       11.4         6       30.0
Grade
  I                           1       2.9          2       10.0
  II                          15      42.9         3       15.0
  III                         19      54.3         15      75.0
DFS interval (mo.)
  <12                         24      68.6         12      60.0
  >=12                        11      31.4         8       40.0
  Mean/Min/Max                20.0/0.0/156.0       12.1/0.0/44.0
Surgical debulking
  Suboptimal (>1 cm)          21      60.0         4       20.0
  Optimal (<1 cm)             14      40.0         16      80.0
Platinum response
  Yes                         22      68.9         4       20.0
  Partial                     8       22.9         16      80.0
  No/Stable disease           5       14.3         0       0.0
Survival time (mo.)
  Observed                    30      85.7         6       30.0
  Censored                    5       14.3         14      70.0
  Mean/Min/Max                55.3/6.0/185.0       39.0/11.0/101.0

DFS = disease-free survival.

correlations as described in [46]. Across different runs an average of 6400 genes passed all screens (sd = 53.42 genes). Although individual genes could be used as predictors, we chose to create predictors from clusters of similar genes both to reduce dimension and to identify multiple underlying patterns of variation across samples.

5.2. Clustering and metagene selection

The evaluation and summarization of large-scale gene expression data in terms of lower dimensional factors of some form are being increasingly utilized both to reduce dimension and to characterize the diversity of expression patterns evidenced in the full sample [39,23]. The idea is to extract multiple patterns as candidate predictors while reducing dimension and multiplicities and smoothing out gene-specific noise. Discussion of various factor model approaches appears in [71]. Considering the number of genes in our data set and the heterogeneity of the sample patients we first applied k-means correlation-based clustering to the genes and selected the dominant principal component (or metagene [29]) to represent each cluster. These



Fig. 4. Permutation null distribution (blue) and distribution of gene silhouette values from run 1 (red). (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

metagene predictors are input to the tree model analysis, along with the clinical predictors, as a re-expression of the genomic information contained in the original microarray data. Although k-means was chosen for its ease of use and wide availability our approach is amenable to other clustering techniques.
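To make the metagene construction concrete, the sketch below computes the dominant principal component of one gene cluster by power iteration. It is an illustration under stated assumptions rather than the authors' code: the cluster is a small synthetic genes-by-samples block, genes are centered but not rescaled, and the function name metagene is hypothetical.

```python
import math

def metagene(cluster, n_iter=200):
    """Unit-norm dominant principal component scores (one value per sample)
    of a genes-by-samples expression block."""
    n_genes, n_samples = len(cluster), len(cluster[0])
    # Center each gene (row) across samples.
    centered = []
    for row in cluster:
        mean = sum(row) / n_samples
        centered.append([x - mean for x in row])
    # Power iteration on X^T X, where X is the centered block.
    v = [1.0 + 0.01 * j for j in range(n_samples)]
    for _ in range(n_iter):
        u = [sum(row[j] * v[j] for j in range(n_samples)) for row in centered]
        w = [sum(centered[i][j] * u[i] for i in range(n_genes))
             for j in range(n_samples)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

# Three synthetic genes that share one pattern across five samples.
pattern = [-2.0, -1.0, 0.0, 1.0, 2.0]
cluster = [[a * p + c for p in pattern]
           for a, c in [(1.0, 0.0), (2.0, 5.0), (0.5, -1.0)]]
scores = metagene(cluster)  # proportional to the shared pattern
```

The sign of a principal component is arbitrary, so a metagene and its negation carry the same information for a threshold split.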

The k-means clustering algorithm was applied to the training data in each run, generating an average of 490 gene clusters (sd = 2.76 clusters). As the true number of clusters is unknown it was possible that some clusters did not represent subsets of related genes but were simply an artifact of the clustering algorithm. We identified such clusters by assessing the silhouette widths [38] of genes within clusters and removing clusters containing genes whose widths were not significant. This approach is similar to that of Dudoit and Fridlyand [18]. The significance of a width was determined by comparison to a permutation-based null distribution generated by randomly permuting the entries of each row of the observed gene expression matrix, clustering this permuted matrix using k-means as above, and calculating the silhouette values for the permuted genes. Only clusters whose genes had significant silhouette values (p < 0.05) were retained, leaving an average of 310 metagenes for analysis (sd = 20.87 clusters). The permutation null distribution and the gene silhouette values from the initial training/test run are displayed in Fig. 4. The silhouette values by cluster size are displayed in Fig. 5.
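The silhouette screen can be illustrated with a short sketch. Several pieces are assumptions made for the example: the distance is taken to be one minus the Pearson correlation, the empirical p-value uses the common (1 + count)/(1 + n) convention, the data and null widths are synthetic, and the k-means re-clustering of the permuted matrix is not shown.

```python
import math
import random

def corr_dist(x, y):
    """1 - Pearson correlation between two expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return 1.0 - sxy / math.sqrt(sxx * syy)

def silhouette_widths(profiles, labels, dist=corr_dist):
    """Silhouette width s(i) = (b - a) / max(a, b) for every gene i."""
    widths = []
    for i, (xi, li) in enumerate(zip(profiles, labels)):
        per_cluster = {}
        for j, (xj, lj) in enumerate(zip(profiles, labels)):
            if i != j:
                per_cluster.setdefault(lj, []).append(dist(xi, xj))
        own = per_cluster.pop(li, None)
        if own is None or not per_cluster:
            widths.append(0.0)  # singleton cluster or a single cluster overall
            continue
        a = sum(own) / len(own)                                  # within-cluster
        b = min(sum(d) / len(d) for d in per_cluster.values())   # nearest other
        widths.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return widths

def permute_rows(profiles, seed=0):
    """Independently permute the entries of each gene's row."""
    rng = random.Random(seed)
    return [rng.sample(row, len(row)) for row in profiles]

def empirical_p(observed, null_values):
    """Share of null widths at least as large as the observed width."""
    return (1 + sum(v >= observed for v in null_values)) / (1 + len(null_values))

# Synthetic example: two rising genes and two falling genes.
profiles = [[1, 2, 3, 4], [2, 4, 6, 9], [4, 3, 2, 1], [9, 6, 4, 2]]
labels = [0, 0, 1, 1]
widths = silhouette_widths(profiles, labels)
permuted = permute_rows(profiles)  # would be re-clustered to build the null
p_value = empirical_p(0.9, [0.1, 0.2, 0.3])  # hypothetical null widths
```

A cluster would be kept only if the p-values of its genes fall below the chosen level (0.05 in the text).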

5.3. Predictive results

Using the training data as a learning set we generated multiple trees under a variety of parameter settings using clinical predictors only, metagenes only, and both metagenes and clinical predictors. The parameter settings were as follows: Bayes factor thresholds of 2.0, 2.5, or 3 on the log base 2 scale, Weibull shape parameter values of 0.8, 1.0, or 1.2, Gamma prior parameters of α = 2 and β = 1/60 or 1/120, up to 20 splits (i.e., 20 new trees) at the root node and up to 3 at each second level node. The choice of Bayes factor threshold was based on frequentist properties: a Bayes factor of 3 is approximately equivalent to a p-value of 0.05. The Gamma prior parameters were chosen to roughly match the mean of the training data,



Fig. 5. Gene silhouette values by cluster size from run 1.

i.e., αβ = µ. The Weibull shape parameter is unknown but values were selected based on the histogram of the training data. Any tree whose relative likelihood value exceeded 1% contributed to the generation of predictions via model averaging. The combination of parameter settings which produced the trees with the most accurate fitted values was retained and used to generate predictions for the validation set. A fitted value at time t for an individual was ‘accurate’ if the fitted probability of surviving for at least time t was greater than a specified cutoff when the recorded survival time for the individual was greater than time t, and vice versa. The specified cutoff was based on an ROC curve to balance specificity and sensitivity. The predictive accuracy of a fitted model was assessed by calculating the predicted auROC estimates at 3-, 4-, and 5-year survival endpoints [11].
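The averaging and scoring steps described above can be sketched in a few lines. This is an illustration only: the tree log-likelihoods and predictions are invented, renormalizing the weights after the 1% cut is an assumption, and the auROC is computed through its rank (Mann-Whitney) form rather than with the software cited in [11].

```python
import math

def average_over_trees(log_liks, tree_preds, min_weight=0.01):
    """Weight each tree by its relative likelihood, drop trees at or below
    min_weight, and average the per-individual survival probabilities."""
    top = max(log_liks)
    rel = [math.exp(l - top) for l in log_liks]
    total = sum(rel)
    weights = [r / total for r in rel]
    kept = [(w, p) for w, p in zip(weights, tree_preds) if w > min_weight]
    norm = sum(w for w, _ in kept)
    n = len(tree_preds[0])
    return [sum(w * p[i] for w, p in kept) / norm for i in range(n)]

def au_roc(scores, long_term):
    """Probability that a long-term survivor receives a higher predicted
    survival probability than a short-term survivor (ties count one half)."""
    pos = [s for s, y in zip(scores, long_term) if y]
    neg = [s for s, y in zip(scores, long_term) if not y]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical: two equally likely trees and one with negligible likelihood.
log_liks = [-50.0, -50.0, -150.0]
tree_preds = [[0.2, 0.8], [0.4, 0.6], [1.0, 0.0]]
averaged = average_over_trees(log_liks, tree_preds)

# Hypothetical predicted 3-year survival and observed long-term status.
auc = au_roc([0.9, 0.2, 0.8, 0.1], [True, True, False, False])
```

Working with log-likelihoods and subtracting the maximum before exponentiating avoids numerical underflow when likelihood values are very small.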

As can be seen in Table 2 the predictive results varied across the runs with a validation auROC for the median predictions of the clinical only (C), genomic only (G), and clinico-genomic (CG) tree models of 78.96%, 81.27%, and 84.28% at 3-year survival; 79.94%, 81.19%, and 83.55% at 4-year survival; and 76.93%, 77.92%, and 81.11% at 5-year survival. For C models an average of 4 trees had appreciable relative likelihood and contributed to the predictions in any given run. For the G and CG models the average number of contributing trees was 35 and 36, respectively, although only an average of 4.2 and 2.4 trees, respectively, had relative likelihoods above 5%. Note that in several runs the genomic predictors did not improve upon the predictive ability of the clinical data, and in one run (run #8) none of the models demonstrated the ability to predict,



Fig. 6. A high likelihood clinico-genomic survival tree; each node contains the (predictor, threshold) which created the node split, the number of sample individuals in the node, and the posterior predictive probability of 3-year survival for that subpopulation.

but the additional predictive ability provided by the genomic variables is evident when looking across all runs.

A high likelihood tree from run 1 is shown in Fig. 6; each node contains the (predictor, threshold) which created the node split, the number of sample individuals in the node, and the posterior predictive probability of 3-year survival for the node subpopulation. Several clinical variables, particularly disease-free interval but also age, grade, and debulking status, appear in top trees along with a group of metagene predictors. The specific metagene predictors vary with each run but by comparing the key metagenes across runs we do find genes which appear frequently and for which potentially very relevant biological connections can be made; see Section 5.4 for a discussion of several such connections. A summary of the predictors which appear in the top trees from run 4 is presented in a tree matrix plot in Fig. 7. For each predictor the sum of the probabilities of the trees in which the predictor appears is shown on the horizontal axis; this serves as a simple numeric assessment of the relative importance of these variables in the prediction of survival.
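The importance weight described above is a simple sum and can be written directly. In this sketch the tree weights and the split variables assigned to each tree are invented for illustration, and the function name importance is hypothetical.

```python
def importance(trees):
    """Sum, for each split variable, the probabilities of the trees in which
    it appears. `trees` is a list of (tree_probability, split_variables)."""
    scores = {}
    for weight, variables in trees:
        for v in set(variables):  # count a variable once per tree
            scores[v] = scores.get(v, 0.0) + weight
    return dict(sorted(scores.items(), key=lambda kv: -kv[1]))

# Hypothetical top trees and their normalized probabilities.
trees = [
    (0.5, ["DFS interval", "Mg 127"]),
    (0.3, ["DFS interval", "Mg 178", "age"]),
    (0.2, ["Mg 127", "age"]),
]
weights = importance(trees)
```

A variable that appears in every high-probability tree approaches an importance of 1, while one that appears only in low-probability trees stays near 0.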

Fig. 8 shows a snapshot of predictions of the probability of 3-year survival from run 6. In this example many of the uncertainty intervals are large which reflects the small sample size and heterogeneity of the sample population.

A posterior sample of predictions for each individual can be generated via Monte Carlo sampling of the µz,τ j and computing the corresponding values of P(y* > t | x*). This provides simulation-based posterior means and uncertainty intervals which are critical in determining the importance of a prediction in clinical decision making. To illustrate this, we selected three individuals from the data set and displayed their predicted survival curves from the CG models in the panels of Fig. 9. These curves extend over several years and include uncertainty intervals at certain time points. Cases 2424 and 1451 are examples where the confidence in prediction, either of short-term (#2424) or long-term survival (#1451), is quite high, as



Table 2
Predictive auROC values of clinical only (C), genomic only (G), and clinico-genomic (CG) tree models at 3-, 4-, and 5-year survival endpoints

Validation accuracy in each of 10 training/test runs

3 years
Run    C                      G                      CG
1      52.2 (45.6, 53.3)      86.7 (70.0, 86.7)      96.7 (72.2, 96.7)
2      89.0 (69.0, 94.0)      86.0 (77.0, 91.0)      86.0 (81.0, 86.0)
3      98.8 (93.4, 98.8)      82.3 (76.5, 82.3)      82.3 (82.7, 85.2)
4      65.5 (56.4, 68.2)      76.4 (67.3, 76.4)      80.0 (66.4, 80.0)
5      82.2 (82.2, 88.9)      70.0 (64.4, 70.0)      78.9 (68.9, 82.2)
6      87.5 (85.0, 92.5)      97.5 (86.3, 100.0)     98.8 (86.3, 100.0)
7      78.9 (74.4, 78.9)      88.9 (71.1, 88.9)      85.6 (67.8, 85.6)
8      54.4 (52.2, 54.4)      47.8 (41.1, 65.6)      48.9 (36.7, 65.6)
9      81.1 (78.9, 81.1)      96.7 (84.4, 96.7)      96.7 (82.2, 96.7)
10     100.0 (82.2, 100.0)    80.0 (60.0, 84.4)      88.9 (66.7, 88.9)
Avg    78.96                  81.27                  84.28

4 years
Run    C                      G                      CG
1      53.8 (48.8, 53.8)      86.3 (70.0, 86.3)      96.3 (75.0, 96.3)
2      86.9 (78.8, 89.9)      91.9 (77.8, 97.0)      90.9 (81.8, 90.9)
3      98.8 (93.8, 98.8)      82.7 (76.5, 82.7)      77.8 (59.3, 77.8)
4      68.8 (57.3, 70.8)      76.1 (64.6, 76.1)      77.1 (62.5, 77.1)
5      87.0 (85.1, 95.5)      70.0 (64.4, 70.0)      78.9 (68.9, 82.2)
6      91.4 (88.9, 93.8)      92.6 (84.0, 96.3)      92.6 (81.5, 92.6)
7      78.9 (74.4, 78.9)      88.9 (71.1, 88.9)      85.6 (67.8, 85.6)
8      55.0 (52.5, 55.0)      48.8 (45.0, 63.8)      50.0 (40.0, 63.8)
9      78.8 (76.3, 78.8)      96.3 (86.3, 96.3)      97.5 (87.5, 97.5)
10     100.0 (82.5, 100.0)    78.8 (60.0, 85.0)      88.8 (66.3, 88.8)
Avg    79.94                  81.19                  83.55

5 years
Run    C                      G                      CG
1      54.5 (50.7, 59.7)      80.5 (72.7, 80.5)      94.8 (75.3, 94.8)
2      85.2 (76.1, 88.6)      90.9 (75.0, 96.6)      89.8 (79.6, 89.8)
3      93.5 (85.7, 93.5)      76.6 (68.8, 80.5)      74.1 (58.4, 74.0)
4      68.8 (57.3, 70.8)      76.0 (64.6, 76.1)      77.1 (62.5, 77.1)
5      77.3 (77.3, 84.1)      65.9 (59.9, 65.9)      78.4 (64.8, 78.4)
6      85.0 (83.8, 85.0)      82.5 (73.8, 86.3)      82.5 (71.3, 82.5)
7      77.5 (72.5, 77.5)      87.5 (70.0, 87.5)      83.8 (67.5, 83.8)
8      56.1 (53.0, 56.0)      47.0 (37.9, 57.6)      47.0 (31.8, 57.6)
9      71.4 (66.2, 71.4)      93.5 (84.4, 93.5)      94.8 (80.5, 94.8)
10     100.0 (82.5, 100.0)    78.8 (60.0, 85.0)      88.8 (66.3, 88.8)
Avg    76.93                  77.92                  81.11

auROC at average predictions and range of auROC over predictions are provided.



Fig. 7. Summary of split variables and corresponding split levels for top trees in run 4. Vertical axis shows tree indices and tree weights; horizontal axis shows each split variable and sum of probabilities of trees in which variable occurs (importance weight).

Fig. 8. Predictions of 3-year survival for validation samples generated by averaging over trees.



evidenced by the narrow uncertainty bars. Case #2424 was an older patient whose tumor was suboptimally debulked and whose disease remained stable after platinum chemotherapy; she had no disease-free interval and survived for 16 months. Case #1451 was also an older patient whose tumor was suboptimally debulked but whose disease responded to platinum chemotherapy; her disease-free interval was 28 months and her overall survival time was 132 months. In contrast, the predictive survival curves for other cases are highly uncertain. These are cases where the number of patients with similar characteristics is very small or there is conflict among the clinico-genomic predictors and hence disagreement among tree outcomes. Case #1774 was an older patient who was optimally debulked and whose disease responded to platinum therapy, which would classify her as clinically low risk. Her disease-free interval was more than 10 months and her overall survival time was 22 months. Upon closer examination it was revealed that she had values on key metagenes that conflicted with her low-risk clinical assessment. The short-term predictions for case #1774 were the result of her value for metagene 357; on further inspection we discovered that this metagene contained a probe for the gene TNK2 (alias ACK1) for which this patient had an extremely high value (see Section 5.4 for further discussion of this finding). It seems evident that the metagene predictors are capturing information in the genomic predictors which may or may not be reflected in the clinical predictors. In such cases it is important that the overall prediction summary recognizes and reflects this uncertainty and that models be open to investigation so that such results can be explored.
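The Monte Carlo construction of posterior means and uncertainty intervals for a survival probability can be sketched for a single leaf. The parametrization is an assumption and may differ from the one used in the paper: survival is taken as S(t) = exp(-mu * t**shape) with known shape and a Gamma(alpha, beta) prior on mu with beta a rate, which gives a Gamma posterior in closed form. The data, prior values, and function names are hypothetical.

```python
import math
import random

def posterior_params(times, events, shape, alpha, beta):
    """Gamma posterior (shape, rate) for mu under S(t) = exp(-mu * t**shape).
    events[i] is 1 for an observed death and 0 for a censored time."""
    d = sum(events)
    total = sum(t ** shape for t in times)
    return alpha + d, beta + total

def sample_survival(t, shape, a_post, b_post, n_draws=20000, seed=1):
    """Monte Carlo posterior mean and 95% interval for P(y > t)."""
    rng = random.Random(seed)
    draws = sorted(
        math.exp(-rng.gammavariate(a_post, 1.0 / b_post) * t ** shape)
        for _ in range(n_draws)
    )
    mean = sum(draws) / n_draws
    lo = draws[int(0.025 * n_draws)]
    hi = draws[int(0.975 * n_draws) - 1]
    return mean, (lo, hi)

def exact_survival(t, shape, a_post, b_post):
    """Closed-form posterior predictive P(y > t), used as a check."""
    return (b_post / (b_post + t ** shape)) ** a_post

# Hypothetical leaf: survival times in months with censoring indicators.
times = [6, 16, 22, 40, 55, 101, 132]
events = [1, 1, 1, 1, 0, 0, 1]
a_post, b_post = posterior_params(times, events, shape=1.0, alpha=2.0, beta=120.0)
mean, interval = sample_survival(36.0, 1.0, a_post, b_post)
```

In a full analysis the draws from each contributing tree would be combined according to the tree weights, so the resulting interval reflects disagreement among trees as well as within-leaf uncertainty.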

In some cases the results using clinical data alone are better than those using both clinical and genomic data (see, for example, Run 3 in Table 2). We suppose that this is due to heterogeneity in the patient subsamples, as no specific gene or metagene was found to be relevant to all samples. It is possible that the clinico-genomic trees could be improved further in these cases by altering the hyperparameter values, such as the Bayes factor threshold, but given the limited amount of data available we chose not to vary the parameter settings across different runs. Given more data the specific tuning of model parameters can be explored in more depth.

5.4. Biological relevance

As mentioned in Section 5.3 the metagene predictors vary with each run but we did identify genes which appear in the key metagenes of several runs and for which potentially very relevant biological connections can be made. This demonstrates the power of our approach for exploratory data analysis as well as prediction. We mention a few examples here; a more complete list of metagenes which appeared in predictive trees and their component genes is given in Table 3 [65].

First, the tree in Fig. 6 includes metagene (Mg) 254 as a split variable. This metagene includes multiple probes for gene CYP1B1; the enzyme encoded by this gene is involved in androgen metabolism and the metabolism of various procarcinogens. CYP1B1 has been associated with risk of endometrial cancer [55] and breast and ovarian cancer as a downstream target of BRCA1 expression during xenobiotic stress [36]. Second, the key variables which appear in the CG trees from run 4 appear in Fig. 7; we will focus on Mg 127 and Mg 178. Mg 127 contains Krit1 as a component gene; Krit1 has been shown to interact with a proposed tumor suppressor and may act as an antagonist of the oncogene Ras. The Krit1 cDNA has been mapped to a chromosomal location frequently deleted or amplified in multiple forms of cancer [59]. Mg 178 contains NR2F2 (COUP-TFII), a gene which encodes for a transcription factor shown to be critical for normal female reproduction in mice [64] and menstrual cycling in



Fig. 9. Predicted disease-free survival curves with uncertainty intervals at chosen time points for three individuals. The ROC cutoff (dashed line) for classification as short-term or long-term survivor and the prediction of survival at 3 years (blue number on the y-axis) are identified. The actual survival time is marked with an arrow on the x-axis. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

human ovaries [56]. This cluster also includes TCF7L2 (TCF-4) which plays a role in the beta-catenin-Wnt signaling pathway, a pathway considered one of the key developmental and growth regulatory mechanisms of the cell [22]. In particular the regulation of Cyclin D1 by Rac1 is beta-catenin/TCF-dependent [20]; Cyclin D1 is a Wnt target gene which alters cell cycle progression and in which mutations, amplification and overexpression are observed frequently in a variety of tumors and may contribute to tumorigenesis. Finally, NF-YB is also found in this cluster; NF-YB encodes a transcription factor necessary for the negative regulation of Chk2 expression by p53 [42]. This process is critical for the control of cell cycle progression in response to DNA damage. NF-YB also interacts with the oncogenes c-Myc, pRb, and p53 to control expression of the PDGF beta-receptor which is tightly regulated during a normal cell cycle [68].

The examination of subjects whose predictions improve significantly upon the inclusion of genomic data can also yield potentially informative genes. The ACK1 gene mentioned in Section 5.3 was discovered by this strategy. The amplification of the ACK1 gene in primary tumors has been shown to correlate with poor prognosis and the overexpression of ACK1 in cancer cell lines can increase the invasive phenotype of these cells both in vitro and in vivo [69]. In our data set the expression of ACK1 was found to be negatively correlated (not significantly) with survival; however, in the complete data set of 119 individuals this correlation was significant (see Fig. 10). These findings support the theory of Bernards and Weinberg [4] that genetic alterations acquired early in the process of tumor development may drive primary tumor growth and determine metastatic potential.

The identification of multiple genes with predictive ability and potential biological relevance to tumor development, reflective of the heterogeneity of the patient sample and the complexity of the underlying disease, is a key finding and suggestive of plausible directions for biological investigation.



Fig. 10. Relationship of ACK1 expression and survival; all 119 patients.

6. Discussion

We have presented a Bayesian approach to tree analysis in the specific context of a survival time response and both clinical and genomic predictors. Survival times are assumed to follow a Weibull distribution and tree construction is based on forward selection where a split on a (predictor, threshold) pair is performed if the evidence for or against a difference in survival distributions between the resulting subgroups is significant, as assessed by the associated Bayes factor. By averaging predictions across trees with the relative likelihood values as weights we will tend to improve predictions by respecting, and properly accounting for, tree model uncertainty [25].
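A Bayes factor split test of this kind can be sketched as below. The sketch rests on assumptions that may not match the paper's exact formulation: survival is parametrized as S(t) = exp(-mu * t**shape) with known shape, mu has a Gamma(alpha, beta) prior with beta a rate, and both children reuse the parent's prior (the Empirical Bayes update of the prior at each split is omitted). Data, prior values, and function names are hypothetical.

```python
import math

def log_marginal(times, events, shape, alpha, beta):
    """Log marginal likelihood of one node, omitting the factor
    prod (shape * t**(shape - 1))**event, which is identical for the split
    and no-split models and cancels in the Bayes factor."""
    d = sum(events)
    total = sum(t ** shape for t in times)
    return (math.lgamma(alpha + d) - math.lgamma(alpha)
            + alpha * math.log(beta) - (alpha + d) * math.log(beta + total))

def log2_bayes_factor(times, events, x, threshold, shape, alpha, beta):
    """log2 Bayes factor for splitting on x <= threshold versus no split."""
    left = [i for i, v in enumerate(x) if v <= threshold]
    right = [i for i, v in enumerate(x) if v > threshold]

    def node(idx):
        return log_marginal([times[i] for i in idx],
                            [events[i] for i in idx], shape, alpha, beta)

    whole = log_marginal(times, events, shape, alpha, beta)
    return (node(left) + node(right) - whole) / math.log(2)

# Hypothetical data: the predictor separates short from long survival times.
times = [5, 6, 7, 8, 50, 60, 70, 80]
events = [1] * 8
informative = [0, 0, 0, 0, 1, 1, 1, 1]
uninformative = [0, 1, 0, 1, 0, 1, 0, 1]
bf_good = log2_bayes_factor(times, events, informative, 0.5, 1.0, 2.0, 60.0)
bf_poor = log2_bayes_factor(times, events, uninformative, 0.5, 1.0, 2.0, 60.0)
```

A split would be accepted when the log base 2 Bayes factor exceeds the chosen threshold; in this toy example the informative predictor gives a positive value and the uninformative one a negative value.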

We note that although averaging predictions across trees does improve model performance, it also decreases the interpretability of the model. This is an important trade-off: predictive ability versus model interpretability. We advocate model averaging because of its improved predictions and because in high-dimensional data settings model uncertainty can be substantial. By building multiple tree models we can explore the covariate space and attempt to address model uncertainty.

We understand that in the interpretation of tree models (in terms of prediction accuracy as well as variable selection) it is important that the parameter estimates be unbiased. This has been stressed in the recent tree literature, e.g., [41,40,27]. Our models are not unbiased in the sense that variables with more splitting values are more likely to be selected in model building. To address this bias we have chosen a metric for model accuracy based on predictive accuracy. This metric will help us to identify and remove from consideration ‘fluke’ models which fit the data well but have poor predictive performance. We concede that this approach is not computationally efficient but it does allow for model exploration which is critical at this point of our analysis. Of course as more data is collected we suspect that computational expense will increase but model uncertainty will decrease, at which point we may focus on averaging over fewer models or employing an alternate method which places more emphasis on unbiasedness and model estimation.

We implemented our survival tree modeling in the analysis of pilot data from a study of advanced stage ovarian cancer. Multiple, related patterns of gene expression in combination with clinical data provided strong and predictively valid associations with survival. The models delivered predictive survival assessments together with measures of uncertainty about the predictions. As a result of tree spawning and model averaging these measures of uncertainty reflected within-tree variability as well as the variability resulting from the sensitivity of the Bayes factor to specific predictor choices and small changes in threshold values. An examination of genes which demonstrate predictive ability across various training and test sets revealed several genes with biologically plausible relevance to carcinogenesis, warranting further investigation.

We chose to use a conjugate Gamma prior in our analysis although a non-informative prior, such as a Jeffreys’ prior, could have been employed. The Jeffreys’ prior is a Gamma(a, b) where a = b = 0 [53]. This prior would put relatively more weight on extreme survival values; we felt it was more appropriate to choose values of a and b based on the observed survival times. However, for large sample sizes there should be little difference in the results under the conjugate versus the Jeffreys’ prior. Thus we expect little difference in the results from each prior at the root and upper level nodes. In small sample sizes, e.g., lower level nodes, we may see some differences in the models but the prior parameters are being updated by previous tree splits which will mitigate any differences. These suppositions were confirmed when we repeated a subset of the simulations from Section 4 using the Jeffreys’ prior. The MIE values increased under the Jeffreys’ prior relative to the results under the conjugate prior, and the ability to capture the correct model decreased, but qualitatively the results did not change.
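The large-sample agreement between the two priors can be seen from a short worked derivation. It is written under an assumed parametrization that this section does not restate: survival function S(t) = exp(-µ t^κ) with known shape κ, a Gamma(a, b) prior on µ with b a rate, d observed deaths among n survival times t_1, ..., t_n.

```latex
\begin{align*}
p(\mu) &\propto \mu^{a-1} e^{-b\mu}, \\
L(\mu) &\propto \mu^{d} \exp\Bigl(-\mu \sum_{i=1}^{n} t_i^{\kappa}\Bigr), \\
p(\mu \mid \text{data}) &= \mathrm{Gamma}\Bigl(a + d,\; b + \sum_{i=1}^{n} t_i^{\kappa}\Bigr).
\end{align*}
```

Setting a = b = 0 gives the posterior Gamma(d, Σ t_i^κ), which is proper whenever d ≥ 1. Under this assumed form the two posteriors differ only by the additive constants a and b, so their difference becomes negligible once d and Σ t_i^κ are large, as at the root and upper level nodes.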

In anticipation of future studies we intend to perform further comparisons with existing methods [27,33] and further simulations to examine the impact of tuning parameters and prior assumptions on model performance. Our current approach to missing values is to perform imputation prior to modeling; however, we are considering adjusting our method to deal with missing values as these are common in realistic data analysis contexts. In this study our models were built on 6400 genes and 310 metagenes; it is possible that information from normal tissue samples could be employed to perform further variable selection. Finally, although some progress has been made in developing stochastic simulation methods for Bayesian trees [54] the topic remains a very challenging research area, both conceptually and computationally, particularly in the context of more than a few predictors. We believe that in problems where the number of predictors is very large, properly addressing the issue of stochastic search will involve the development of a formal, conceptual foundation before making them practicable. The development of such ideas is a focus of our current research.

Acknowledgements

J.C. was supported by NCI grant 5K25CA111636. The authors wish to thank the following persons for their invaluable contributions: Jeffrey R. Marks, Department of Experimental Surgery, Duke University Medical Center; John Lancaster, Divisions of Gynecologic Surgical Oncology and Cancer Prevention and Control, H. Lee Moffitt Cancer Center and Research Institute; Holly Dressman and Joe Nevins, Institute for Genome Sciences and Policy, Duke University Medical Center; Bertrand Clarke, Department of Statistics, University of British Columbia; Torsten Hothorn, Institut für Medizininformatik, Biometrie und Epidemiologie, Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany; Ed Iversen, Institute of Statistics and Decision Sciences, Duke University.



Appendix

Table 3
Genes which appear in Weibull tree models in at least 3 of the 10 training runs

Key genes

Runs          Cluster               Affy probe ID   NCBI entrezgene   Gene ontology
1, 4, 5, 9    468, 185, 25, 422     213158_at       Unknown gene
1, 5, 8       468, 25, 473          205383_s_at     ZBTB20            Physiological process
1, 4, 10      378, 148, 295         207571_x_at     C1orf38
1, 4, 10      378, 148, 295         210785_s_at     C1orf38
1, 4, 10      195, 441, 135         201486_at       RCN2
1, 4, 10      195, 441, 135         209085_x_at     RFC1              DNA-dependent DNA replication/DNA metabolism; Physiological process
4, 6, 10      441, 210, 241         213838_at       NOL7
3, 7, 10      448, 451, 241         200958_s_at     SDCBP             Substrate-bound cell migration/cell extension; Physiological process
2, 4, 5, 7    161, 309, 475, 303    213705_at       MAT2A             Physiological process
2, 5, 9       161, 475, 213         219437_s_at     ANKRD11
2, 4, 7       161, 309, 303         202028_s_at     RPL38             Physiological process
2, 4, 7       161, 309, 303         208120_x_at     Unknown gene
2, 4, 7       161, 309, 303         210686_x_at     SLC25A16          Physiological process
2, 4, 7       161, 309, 303         211454_x_at     Unknown gene
2, 4, 7       161, 309, 303         212044_s_at     RPL27A            Protein biosynthesis/macromolecule biosynthesis; Physiological process
2, 4, 7       161, 309, 303         213736_at       COX5B             Physiological process
2, 4, 7       161, 309, 303         214001_x_at     RPS10             Physiological process
2, 4, 7       161, 309, 303         214041_x_at     RPL37A            Protein biosynthesis/macromolecule biosynthesis; Physiological process
2, 4, 7       161, 309, 303         221943_x_at     RPL38             Physiological process
2, 4, 7       161, 309, 303         218808_at       DALRD3            Protein biosynthesis/macromolecule biosynthesis; arginyl-tRNA aminoacylation; Physiological process
4, 7, 9       309, 487, 335         208141_s_at     HLRC1
4, 7, 9       309, 487, 335         216180_s_at     SYNJ2             Physiological process
5, 9, 10      404, 335, 313         208868_s_at     GABARAPL1
2, 4, 9       105, 185, 422         218962_s_at     FLJ13576
2, 4, 8       105, 185, 403         216713_at       KRIT1
2, 4, 8       105, 226, 403         34041_i_at      Unknown gene
2, 4, 8       105, 226, 403         221596_s_at     DKFZP564O0523
2, 3, 7       446, 23, 265          203277_at       DFFA              DNA fragmentation during apoptosis; disassembly of cell structures during apoptosis; Apoptotic nuclear changes; DNA catabolism/DNA metabolism; Physiological process
2, 3, 9       446, 23, 357          201155_s_at     MFN2
1, 2, 9       378, 446, 357         221269_s_at     SH3BGRL3
4, 5, 9       409, 25, 100          209120_at       NR2F2             Physiological process
4, 5, 9       409, 25, 100          209121_x_at     NR2F2             Physiological process
4, 5, 9       409, 25, 100          215073_s_at     NR2F2             Physiological process
4, 5, 7       409, 25, 449          212761_at       TCF7L2            Physiological process
1, 3, 10      373, 82, 285          202435_at       Unknown gene
1, 3, 10      373, 82, 285          202436_at       Unknown gene
1, 3, 10      373, 82, 285          202437_at       Unknown gene
1, 3, 10      282, 82, 282          209146_at       SC4MOL            Physiological process
4, 6, 10      226, 49, 473          202375_at       SEC24D            Physiological process
4, 6, 10      226, 99, 473          209501_at       CDR2
4, 6, 9       51, 466, 53           212205_at       H2AFV             DNA metabolism; Physiological process
4, 6, 9       409, 466, 299         218127_at       NF-YB             Physiological process
4, 6, 7       312, 490, 434         213246_at       C14orf109
3, 5, 10      121, 55, 486          208070_s_at     REV3L             DNA-dependent DNA replication/DNA metabolism; Physiological process
1, 5, 9       154, 97, 227          209170_s_at     GPM6B

References

[1] American Cancer Society Cancer Facts and 2006, American Cancer Society. 2006.[2] A. Berchuck, E. Iversen, J. Lancaster, J. Pittman, J. Luo, P. Lee, S. Murphy, H. Dressman, P. Febbo, M. West, J.

Nevins, J. Marks, Patterns of gene expression that characterize long-term survival in advanced stage serous ovariancancers, Clinical Cancer Research 11 (2005) 3686–3696.

[3] J. Berger, Statistical Decision Theory and Bayesian Analysis, 2nd ed., Springer Verlag Inc., 1993.[4] R. Bernards, R. Weinberg, Metastasis genes: A progression puzzle, Nature 418 (2002) 823.[5] L. Breiman, Bagging predictors, Machine Learning 24 (1996) 123–140.[6] L. Breiman, Heuristics of instability and stabilization in model selection, Annals of Statistics 24 (1996b)

2350–2383.[7] L. Breiman, Random forests, Machine Learning 45 (2001) 5–32.[8] L. Breiman, Statistical modeling: The two cultures, Statistical Science 16 (2001) 199–225.[9] L. Breiman, J. Friedman, R. Olshen, C. Stone, Classification and Regression Trees, Chapman & Hall/CRC Press,

1984.[10] W. Buntine, Learning classification trees, Statistics and Computing 2 (1992) 63–73.[11] G. Cawley, Miscellaneous MATLAB software, data, tricks and demonstrations, 2004. Online:

http://theoval.sys.uea.ac.uk/matlab/default.html.


http://theoval.sys.uea.ac.uk/matlab/default.html


[12] H. Chipman, E. George, R. McCulloch, Bayesian CART model search (with discussion), Journal of the AmericanStatistical Association 93 (1998) 935–960.

[13] H. Chipman, E. George, R. McCulloch, Managing multiple models, in: T. Jaakkola, T. Richardson (Eds.),Proceedings of the Eighth International Workshop on Artificial Intelligence and Statistics, 2001, pp. 11–18.

[14] H. Chipman, E. George, R. McCulloch, Bayesian treed models, Machine Learning 48 (2002) 299–320.[15] J. Clarke, C.-F. Horng, M.-H. Tsou, A. Huang, J. Nevins, M. West, S. Cheng, Modeling of clinical information in

breast cancer for personalized prediction of disease outcomes, Technical Report, Department of Biostatistics andBioinformatics, Duke University, Durham, 2006.

[16] D. Denison, B. Mallick, A. Smith, A Bayesian CART algorithm, Biometrika 85 (1998) 363–377.[17] R. Duda, P. Hart, D. Stork, Pattern Classification, 2nd ed., Wiley, 2001.[18] S. Dudoit, J. Fridlyand, A prediction-based resampling method for estimating the number of clusters in a dataset,

Genome Biology 3 (2002) research0036.1–0036.21.[19] S. Dudoit, J. Fridlyand, T. Speed, Comparison of discrimination methods for the classification of tumors using gene

expression data, Journal of the American Statistical Association 97 (2002) 77–87.[20] S. Esufali, B. Bapat, Cross-talk between Rac1 GTPase and dysregulated Wnt signaling pathway leads to cellular

redistribution of beta-catenin and TCF/LEF-mediated transcriptional activation, Oncogene 23 (2004) 8260–8271.[21] G. Glinsky, T. Higashiyama, A. Glinskii, Classification of human breast cancer using gene expression profiling as

a component of the survival predictor algorithm, Clinical Cancer Research 10 (2004) 2272–2283.[22] S. Grant, G. Thorleifsson, I. Reynisdottir, R. Benediktsson, A. Manolescu, J. Sainz, A. Helgason, H. Stefansson,

V. Emilsson, A. Helgadottir, U. Styrkarsdottir, K. Magnusson, G. Walters, E. Palsdottir, T. Jonsdottir, T.Gudmundsdottir, A. Gylfason, J. Saemundsdottir, R. Wilensky, M. Reilly, D. Rader, Y. Bagger, C. Christiansen, V.Gudnason, G. Sigurdsson, U. Thorsteinsdottir, J. Gulcher, A. Kong, K. Stefansson, Variant of transcription factor7-like 2 (TCF7L2) gene confers risk of type 2 diabetes, Nature Genetics 38 (2006) 320–323.

[23] T. Hastie, R. Tibshirani, Efficient quadratic regularization for expression arrays, Biostatistics 5 (2004) 329–340.

[24] T. Hastie, R. Tibshirani, J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer-Verlag Inc., 2001.

[25] J. Hoeting, D. Madigan, A. Raftery, C. Volinsky, Bayesian model averaging: A tutorial (with discussion), Statistical Science 14 (1999) 382–401.

[26] T. Hothorn, P. Buhlmann, S. Dudoit, A. Molinaro, M. van der Laan, Survival ensembles, Biostatistics 7 (2006) 355–373.

[27] T. Hothorn, K. Hornik, A. Zeileis, Unbiased recursive partitioning: A conditional inference framework, Journal of Computational and Graphical Statistics 15 (2006) 651–674.

[28] T. Hothorn, B. Lausen, A. Benner, Radespiel-Troger, Bagging survival trees, Statistics in Medicine 23 (2004) 77–91.

[29] E. Huang, S. Cheng, H. Dressman, J. Pittman, M.-H. Tsou, C.-F. Horng, A. Bild, E. Iversen, M. Liao, C.-M. Chen, M. West, J. Nevins, A. Huang, Gene expression predictors of breast cancer outcomes, The Lancet 361 (2003) 1590–1596.

[30] J. Ibrahim, M.-H. Chen, D. Sinha, Bayesian Survival Analysis, Springer-Verlag Inc., 2001.

[31] R. Irizarry, Z. Wu, H. Jaffee, Comparison of affymetrix genechip expression measures, Bioinformatics 1 (2005) 1–7.

[32] H. Ishwaran, E. Blackstone, C. Pothier, M. Lauer, Relative risk forests for exercise heart rate recovery as a predictor of mortality, Journal of the American Statistical Association 99 (2004) 591–600.

[33] H. Ishwaran, U. Kogalur, Random survival forests, Rnews 7/2 (2006) 25–31.

[34] N. Johnson, S. Kotz, N. Balakrishnan, Continuous Univariate Distributions, 2nd ed., Wiley, 1994.

[35] M. Jordan, R. Jacobs, Hierarchical mixtures of experts and the EM algorithm, Neural Computation 6 (1994) 181–214.

[36] H. Kang, H. Kim, S. Kim, R. Barouki, C. Cho, K. Khanna, E. Rosen, I. Bae, BRCA1 modulates xenobiotic stress-inducible gene expression by interacting with arnt in human breast cancer cells, Journal of Biological Chemistry (2006) (epub ahead of print March 27).

[37] R. Kass, A. Raftery, Bayes factors and model uncertainty, Journal of the American Statistical Association 90 (1993) 773–795.

[38] L. Kaufman, P. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis, Wiley, 1990.

[39] L. Li, H. Li, Dimension reduction methods for microarrays with application to censored survival data, Bioinformatics 20 (2004) 3406–3412.

[40] W.-Y. Loh, Regression trees with unbiased variable selection and interaction detection, Statistica Sinica 12 (2002) 361–386.

[41] W.-Y. Loh, Y.-H. Shih, Split selection methods for classification trees, Statistica Sinica 7 (1997) 815–840.

[42] T. Matsui, Y. Katsuno, T. Inoue, F. Fujita, T. Joh, H. Niida, H. Murakami, M. Itoh, M. Nakanishi, Negative regulation of Chk2 expression by p53 is dependent on the CCAAT-binding transcription factor NF-Y, Journal of Biological Chemistry 279 (2004) 25093–25100.

[43] S. Michiels, S. Koscielny, C. Hill, Prediction of cancer outcome with microarrays: A multiple random validation strategy, The Lancet 365 (2005) 488–492.

[44] J. Oliver, D. Hand, On pruning and averaging decision trees, in: Proceedings of the Twelfth International Conference on Machine Learning, Morgan Kaufmann, 1995, pp. 430–437.

[45] P. Park, L. Tian, I. Kohane, Linking gene expression data with patient survival times using partial least squares, Bioinformatics 18 (2002) S120–S127.

[46] G. Parmigiani, E. Garrett-Mayer, R. Anbazhagan, E. Gabrielson, A cross-study comparison of gene expression studies for the molecular classification of lung cancer, Clinical Cancer Research 10 (2005) 2922–2927.

[47] C. Perou, T. Sørlie, M. Eisen, M. van de Rijn, S. Jeffrey, C. Rees, J. Pollack, D. Ross, H. Johnsen, L. Akslen, O. Fluge, A. Pergamenschikov, C. Williams, S. Zhu, P. Lønning, A.-L. Børresen Dale, P. Brown, D. Botstein, Molecular portraits of human breast tumours, Nature 406 (2000) 747–752.

[48] J. Pittman, E. Huang, H. Dressman, C.-F. Horng, S. Cheng, M.-H. Tsou, C.-M. Chen, A. Bild, E. Iversen, A. Huang, J. Nevins, M. West, Clinico-genomic models for personalized prediction of disease outcomes, Proceedings of the National Academy of Sciences 101 (2004) 8431–8436.

[49] J. Pittman, E. Huang, J. Nevins, M. West, Bayesian analysis of binary prediction tree models for retrospectively sampled outcomes, Biostatistics 5 (2004) 587–601.

[50] R Development Core Team, R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing, Vienna, Austria, ISBN: 3-900051-07-0, 2007.

[51] A. Raftery, D. Madigan, J. Hoeting, Bayesian model averaging for linear regression models, Journal of the American Statistical Association 92 (1997) 179–191.

[52] D. Ransohoff, Bias as a threat to the validity of cancer molecular-marker research, Nature Reviews Cancer 5 (2005) 142–149.

[53] C. Ren, D. Sun, D. Dey, Bayesian and frequentist estimation and prediction for exponential distributions, Journal of Statistical Planning and Inference 136 (2006) 2873–2897.

[54] F. Rigat, Parallel hierarchical sampling: a practical multiple-chains sampler, Bayesian Analysis (2007) (submitted for publication).

[55] M. Sasaki, Y. Tanaka, M. Kaneuchi, N. Sakuragi, R. Dahiya, CYP1B1 gene polymorphisms have higher risk for endometrial cancer and positive correlations with estrogen receptor alpha and estrogen receptor beta expressions, Cancer Research 63 (2003) 3913–3918.

[56] Y. Sato, T. Suzuki, K. Hidaka, H. Sato, K. Ito, S. Ito, H. Sasano, Immunolocalization of nuclear transcription factors, DAX-1 and COUP-TF II, in the normal human ovary: correlation with adrenal 4 binding protein/steroidogenic factor-1 immunolocalization during the menstrual cycle, Journal of Clinical Endocrinology and Metabolism 88 (2003) 3415–3420.

[57] T. Sellke, M. Bayarri, J. Berger, Calibration of p-values for testing precise null hypotheses, The American Statistician 55 (2001) 62–71.

[58] D. Seo, H. Dressman, E. Hergerick, E. Iversen, C. Dong, K. Vata, C. Milano, F. Rigat, J. Pittman, J. Nevins, M. West, P. Goldschmidt-Clermont, Gene expression phenotypes of atherosclerosis, Atherosclerosis, Thrombosis, and Vascular Biology 24 (2003) 1922–1927.

[59] I. Serebriiskii, J. Estojak, G. Sonoda, J. Testa, E. Golemis, Association of Krev-1/rap1a with Krit1, a novel ankyrin repeat-containing protein encoded by a gene mapping to 7q21-22, Oncogene 15 (1997) 1043–1049.

[60] R. Schapire, Y. Freund, P. Bartlett, W. Lee, Boosting the margin: A new explanation for the effectiveness of voting methods, The Annals of Statistics 26 (1998) 1651–1686.

[61] R. Soland, Bayesian analysis of the Weibull process with unknown scale parameter and its application to acceptance sampling, IEEE Transactions on Reliability R-17 (1968) 84–90.

[62] D. Spentzos, D. Levine, M. Ramoni, M. Joseph, X. Gu, J. Boyd, T. Libermann, S. Cannistra, Gene expression signature with independent prognostic significance in epithelial ovarian cancer, Journal of Clinical Oncology 22 (2004) 4648–4658.

[63] X. Sun, Pitch accent prediction using ensemble machine learning, in: Proceedings of ICSLP, 2002, pp. 953–956.

[64] N. Takamoto, I. Kurihara, K. Lee, F. Demayo, M. Tsai, S. Tsai, Haploinsufficiency of chicken ovalbumin upstream promoter transcription factor II in female reproduction, Molecular Endocrinology 9 (2005) 2299–2308.

[65] The Gene Ontology Consortium, Gene Ontology: Tool for the unification of biology, Nature Genetics 25 (2000) 25–29.

[66] T. Therneau, E. Atkinson, An introduction to recursive partitioning using the rpart routine, Technical Report, 61, Section of Biostatistics, Mayo Clinic, Rochester, 1997.

[67] R. Tibshirani, K. Knight, Model search by bootstrap ‘bumping’, Journal of Computational and Graphical Statistics 8 (1995) 671–686.

[68] H. Uramoto, A. Hackzell, D. Wetterskog, A. Ballagi, H. Izumi, K. Funa, pRb, Myc and p53 are critically involved in SV40 large t antigen repression of PDGF beta-receptor transcription, Journal of Cell Science 117 (2004) 3855–3865.

[69] E. van der Horst, Y. Degenhardt, A. Strelow, A. Slavin, L. Chinn, J. Orf, M. Rong, S. Li, L. See, K. Nguyen, T. Hoey, H. Wesche, S. Powers, Metastatic properties and genomic amplification of the tyrosine kinase gene ACK1, Proceedings of the National Academy of Sciences USA 102 (2005) 15901–15906.

[70] L. van ’t Veer, H. Dai, M. van de Vijver, Y. He, A. Hart, M. Mao, H. Peterse, K. van der Kooy, M. Marton, A. Witteveen, G. Schreiber, R. Kerkhoven, C. Roberts, P. Linsley, R. Bernards, S. Friend, Gene expression profiling predicts clinical outcome of breast cancer, Nature 415 (2002) 530–536.

[71] M. West, Bayesian factor regression models in the ‘large p, small n’ paradigm, in: Bayesian Statistics 7, Oxford University Press, 2003, pp. 723–732.

[72] E.-J. Yeoh, M. Ross, S. Shurtleff, W. Williams, D. Patel, R. Mahfouz, F. Behm, S. Raimondi, M. Relling, A. Patel, C. Cheng, D. Campana, D. Wilkins, X. Zhou, J. Li, H. Liu, C.-H. Pui, W. Evans, C. Naeve, L. Wong, J. Downing, Classification, subtype discovery, and prediction of outcome in pediatric acute lymphoblastic leukemia by gene expression profiling, Cancer Cell 1 (2002) 133–143.

[73] H. Zhang, B. Singer, Recursive Partitioning in the Health Sciences, in: Statistics for Biology and Health, vol. 12, Springer Verlag Inc., 1999.

