
STATISTICS IN MEDICINE, VOL. 3, 143-152 (1984)

REGRESSION MODELLING STRATEGIES FOR IMPROVED PROGNOSTIC PREDICTION

FRANK E. HARRELL, JR. AND KERRY L. LEE
Division of Biometry, Department of Community and Family Medicine, Duke University Medical Center, Box 3337, Durham NC 27710, U.S.A.

ROBERT M. CALIFF, DAVID B. PRYOR AND ROBERT A. ROSATI
Division of Cardiology, Department of Medicine, Duke University Medical Center, Box 3337, Durham NC 27710, U.S.A.

SUMMARY Regression models such as the Cox proportional hazards model have had increasing use in modelling and estimating the prognosis of patients with a variety of diseases. Many applications involve a large number of variables to be modelled using a relatively small patient sample. Problems of overfitting and of identifying important covariates are exacerbated in analysing prognosis because the accuracy of a model is more a function of the number of events than of the sample size.

We used a general index of predictive discrimination to measure the ability of a model developed on training samples of varying sizes to predict survival in an independent test sample of patients suspected of having coronary artery disease. We compared three methods of model fitting: (1) standard ‘step-up’ variable selection, (2) incomplete principal components regression, and (3) Cox model regression after developing clinical indices from variable clusters. We found regression using principal components to offer superior predictions in the test sample, whereas regression using indices offers easily interpretable models nearly as good as the principal components models. Standard variable selection has a number of deficiencies.

KEY WORDS: Prediction; Validation; Survival analysis; Cox model; Regression

INTRODUCTION

The identification of characteristics associated with an outcome has importance in comparing alternative treatments, designing clinical trials, and counselling individual patients. A large number of characteristics reflecting historical symptoms, physical signs, and test results complicates the identification of prognostic factors. The inclusion of a large number of factors in a regression model may appear to improve predictions in the sample on which the model was developed. A test of the fitted model on an independent sample, however, often demonstrates a deterioration of its predictive accuracy, reflecting overfitting of the model to the training data.

The value of a set of prognostic factors depends upon reproducibility in a new patient sample. Without considering the stability of the model in an independent sample, investigators may remain unaware that some factors represent spurious associations with the outcome because of ‘noise’ in the data or multiple comparisons. Furthermore, minor changes in the data set may result in selection of different characteristics. This leaves the clinician in a quandary as to which factors actually have prognostic importance. When statistical significance is the sole criterion for including a prognostic factor in the set, the number of variables selected is a function of the sample size.

0277-6715/84/020143-10$01.00 © 1984 by John Wiley & Sons, Ltd.

Received 7 April 1983
Revised 29 September 1983


There are alternative regression modelling strategies that have use in improving the stability of a model. This investigation compares two such strategies with a standard variable selection method. The first modification of the usual stepwise regression strategy is incomplete principal components (IPC) regression, in which one improves the stability of a model by placing linear restrictions on the parameters estimated. The second modification is a combination of statistical and clinical variable clustering, in which one places different kinds of restrictions on the regression model. We compared the three methods by developing models on the same training set and then assessing the predictive accuracy in the same test set. For all three approaches, we also examined the effect of varying the training set sample size.

Predictive accuracy consists of two aspects. First, a prediction should be reliable. For example, if one predicts that a patient has a probability of 0.65 of surviving two years, is the probability really 0.65? We can assess reliability by dividing an independent sample into groups according to predicted probabilities (e.g. by deciles, to make the groups large enough) and then comparing the proportion surviving in each group (the Kaplan-Meier [1] estimate) with the group's mean predicted survival probability. We shall not discuss predictive reliability further.
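As an aside for readers working through this check today, the following is a minimal sketch in Python, assuming a pandas DataFrame df with columns pred2yr (model-predicted two-year survival probability), time (follow-up time in years) and event (1 = cardiovascular death, 0 = censored), and using the lifelines library for the Kaplan-Meier estimate; none of these names come from the paper.

```python
import pandas as pd
from lifelines import KaplanMeierFitter

def reliability_by_decile(df):
    """Compare mean predicted 2-year survival with the Kaplan-Meier estimate
    within deciles of the predicted probability."""
    df = df.assign(decile=pd.qcut(df["pred2yr"], 10, labels=False, duplicates="drop"))
    rows = []
    for g, grp in df.groupby("decile"):
        kmf = KaplanMeierFitter()
        kmf.fit(grp["time"], event_observed=grp["event"])
        rows.append({"decile": g,
                     "n": len(grp),
                     "mean_predicted": grp["pred2yr"].mean(),
                     "kaplan_meier_2yr": kmf.predict(2.0)})
    return pd.DataFrame(rows)
```

Plotting mean_predicted against kaplan_meier_2yr gives the usual calibration picture.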

The second aspect of predictive accuracy, and perhaps the more important, is discrimination. The discrimination of a prognostic model is the ability to separate patients with good and poor outcomes. In this setting, we wish to quantify the degree to which predicted survival probabilities are lower for patients dying earlier.

The reason we argue for first priority to discrimination in judging a model's relative performance is that if discrimination deteriorates, no adjustment or calibration can correct the model. On the other hand, with good discrimination, one can calibrate the predictor to attain reliability without sacrificing discrimination. In addition, indices of discrimination require no decision regarding the grouping of predicted probabilities.

DATA

The patient sample consisted of 4226 consecutive medically treated patients with possible angina undergoing cardiac catheterization at Duke University Medical Center since 1969. Excluded were patients having previous cardiac surgery. All patients were suspected of having coronary artery disease; the catheterization confirmed this diagnosis in two-thirds of the patients. Follow-up of patients was 99 per cent complete. Cardiovascular death occurred in 416 patients.

Although the cardiac catheterization defines prognostically important features such as coronary anatomy and ventricular function, the models developed used only non-invasively determined patient characteristics so that one can estimate the prognosis for new patients prior to catheterization. Table I lists the variables from the patient's history, physical examination, chest X-ray, and resting electrocardiogram (ECG) used to develop the models to predict time until cardiovascular death.

A previous publication provides a detailed description of the variable definitions and the methods of data collection [2]. We censored patients dying of non-cardiovascular causes (withdrawn alive) at their times of death. In what follows, 'death' refers to cardiovascular death.

We divided the patients at random into two groups, constraining the groups to have equal numbers of deaths. Both the training and test samples consisted of 2113 patients, of whom 208 died. To study the effect of size of the training sample on predictive discrimination, we used successively smaller subsets of the original training sample. We constructed these subsets by taking at random half of the original training sample, then taking at random a quarter of the original, and then taking at random an eighth of the original training sample. There is overlap among these training samples. We validated all models developed using a given training sample on the same test sample of 2113 patients.
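A minimal sketch of such a death-balanced random split, assuming NumPy and a 0/1 event array; this is an illustration, not the authors' original procedure:

```python
import numpy as np

def split_equal_deaths(event, seed=0):
    """Randomly split patients into two halves constrained to have
    equal numbers of deaths (event == 1)."""
    event = np.asarray(event)
    rng = np.random.default_rng(seed)
    idx = np.arange(len(event))
    deaths, survivors = idx[event == 1], idx[event == 0]
    rng.shuffle(deaths)
    rng.shuffle(survivors)
    train = np.concatenate([deaths[: len(deaths) // 2],
                            survivors[: len(survivors) // 2]])
    test = np.setdiff1d(idx, train)
    return train, test
```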


Table I. Non-invasive patient characteristics used to develop models

Variable name  Description
Age            Age in years
Sex            Sex (0 = male, 1 = female)
TypeAng        Type of angina (1 = typical, 0 = not typical)
CadDur         Logarithm of 1 plus number of months from onset of anginal symptoms
ProgPain       Angina is progressing in last 2 weeks
NoctPain       Nocturnal chest pain
Preinfar       Preinfarctional angina: prolonged chest pain at rest
PainFreq       Average daily frequency of anginal attacks (coded as 10 if > 10)
PainSev        Severity of angina (N.Y.H.A. class 1-4)
Prinz          Prinzmetal's variant angina
HxMI           History of myocardial infarction
Qwave          Abnormal Q waves on resting ECG
HxPVD          History of peripheral vascular disease
HxCVD          History of cerebral vascular disease
Pbruit         Peripheral bruit
Cbruit         Carotid bruit
Smoke          Has ever smoked at least 1 pack/day
HxHT           History of hypertension or systolic b.p. > 140 currently
HxCH           History of hyperlipidemia or current serum cholesterol > 250 mg%
DM             Diabetes mellitus
FHx            Family history of ischemic heart disease
CHF            Congestive heart failure (coded 0 if none, 1 if N.Y.H.A. class I-III, 3 if N.Y.H.A. class IV; coding based on previous analyses)
CMG            Cardiomegaly by chest X-ray
S3             S3 (ventricular) gallop present
STTW           Abnormal ST-T waves on resting ECG
PVC            Premature ventricular contractions on resting ECG
IVCD           Intraventricular conduction defect on resting ECG
LBBB           Left bundle branch block on resting ECG
RBBB           Right bundle branch block on resting ECG
LAD            Left axis deviation on resting ECG

* For variables indicating presence or absence of a condition, presence is coded 1 and absence is coded 0.

AN INDEX OF DISCRIMINATION OF A PROGNOSTIC MODEL

With prediction of a response measured on a continuous scale with no censoring, a natural quantification of predictive accuracy is the closeness of predicted to true values. For example, one might compute a mean square error to quantify the accuracy of a given model, or count the number of patients for whom one model’s prediction was closer to the observed value than was the prediction of another model. With a censored variable such as the time until an event, we need an easily interpreted index of predictive ability. Rank correlations are natural choices for deriving an index of discrimination with censored or non-continuous response variables.

Motivated by rank tests based on Kendall's tau developed by Brown, Hollander and Korwar [3], Harrell et al. [4] derived an index of concordance they called c. To calculate c, take all possible pairings of patients. For a given pair, we say that the predictions are concordant with the outcomes if the patient having the higher probability estimate lived longer. If both patients are still alive, or if only one has died and the follow-up of the other is less than the survival time of the first, we do not count that pair. The c index is the proportion of all pairs of patients for which we could determine the ordering of survival times such that the predictions are concordant. This index is a linear transformation of a Kendall-Goodman-Kruskal type correlation between a set of predicted probabilities and the true outcomes. The index is easy to interpret since it estimates the probability that, for a randomly chosen pair of patients, the one having the higher predicted survival is the one who survives longer.

In measuring discrimination of a model for a binary outcome such as presence of a disease, c reduces to the fraction of pairs of patients, one with and one without the disease, such that the one with the disease has the higher predicted disease probability. In this case, c is the area under a 'receiver operating characteristic' curve [5].
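To make the pairwise definition concrete, here is a minimal sketch of the computation of c in Python; the array names are assumptions, and the half-credit treatment of tied predictions is a common convention rather than something specified in the paper.

```python
import numpy as np

def c_index(pred_surv, time, event):
    """Concordance index c for censored survival data.

    pred_surv: predicted survival probabilities (higher = better prognosis)
    time:      follow-up time (death time if event == 1, censoring time otherwise)
    event:     1 if the patient died, 0 if censored
    """
    concordant, usable = 0.0, 0
    n = len(time)
    for i in range(n):
        for j in range(i + 1, n):
            if time[i] == time[j]:
                continue  # ordering of tied times unknown; skip (a conservative choice)
            first, second = (i, j) if time[i] < time[j] else (j, i)
            if not event[first]:
                continue  # earlier patient censored: who survived longer is unknown
            usable += 1   # the patient with the earlier time died, so ordering is known
            if pred_surv[second] > pred_surv[first]:
                concordant += 1.0
            elif pred_surv[second] == pred_surv[first]:
                concordant += 0.5  # tied predictions counted as half
    return concordant / usable
```

Applied to a binary outcome with no censoring, the same pair-counting reduces to the ROC area described above.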

STEPWISE VARIABLE SELECTION STRATEGY

We used Breslow's [6] formulation of the Cox [7] proportional hazards model to predict time until death with a given set of variables. To select 'significant' variables, we employed a strategy using Rao's efficient score statistic proposed by Bartolucci and Fraser [8] as applied to the Cox model by Harrell [9]. Briefly, we selected variables from the 30 listed in Table I until no remaining candidate variable was significant at the 0.05 level.
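As an illustration of this kind of procedure, the sketch below performs forward selection for a Cox model in Python with the lifelines library. It substitutes a likelihood-ratio chi-square for the Rao efficient score statistic used in the paper (the two tests are asymptotically equivalent); the DataFrame layout and column names are assumptions.

```python
from lifelines import CoxPHFitter
from scipy.stats import chi2

def forward_select(df, candidates, time_col="time", event_col="event", alpha=0.05):
    """Forward Cox-model selection, stopping when the best remaining
    candidate is no longer significant at `alpha` (likelihood-ratio test)."""
    selected = []
    ll_current = None  # log partial likelihood of the current model (None = null model)
    while len(selected) < len(candidates):
        best_var, best_ll, best_fit = None, -float("inf"), None
        for var in candidates:
            if var in selected:
                continue
            cph = CoxPHFitter().fit(
                df[selected + [var, time_col, event_col]], time_col, event_col)
            if cph.log_likelihood_ > best_ll:
                best_var, best_ll, best_fit = var, cph.log_likelihood_, cph
        if ll_current is None:
            # recover the null-model log likelihood from lifelines' built-in LR test
            lr = best_fit.log_likelihood_ratio_test()
            ll_current = best_ll - lr.test_statistic / 2.0
        # 1-d.f. likelihood-ratio chi-square for adding the best candidate
        if chi2.sf(2.0 * (best_ll - ll_current), df=1) > alpha:
            break
        selected.append(best_var)
        ll_current = best_ll
    return selected
```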

Table II displays the discrimination of models developed with this standard stepwise approach. For calculating discrimination, we estimated the probability of surviving two years for each patient using the method of Kalbfleisch and Prentice [10] with the Cox model. The apparent quality of a model, as measured by the c index for the single training sample, improves as the size of the training sample decreases. However, the true ability of the model, as assessed in the test sample, deteriorates as the training sample size decreases. The gap between apparent and true discrimination widens even more quickly.

An inherent problem with stepwise variable selection is that the variables selected, especially those in later steps, may represent noise and cause prediction ability actually to worsen in a new sample. Strategies that use only a significance level for entering variables ignore the number of multiple comparisons actually made.

Table II. Results of validation study: predictive discrimination

                                        Training sample used
                      n = 2113, d = 208   n = 1065, d = 101   n = 518, d = 55   n = 284, d = 30
Method used           p    c1    c2       p    c1    c2       p    c1    c2     p    c1    c2
Stepwise variable
  selection           15   0.85  0.82     11   0.84  0.80     10   0.85  0.76   6    0.88  0.76
Incomplete principal
  components          3    0.84  0.84     2    0.83  0.83     1    0.84  0.84   1    0.86  0.84
Variable
  clusters/indices    8    0.85  0.83     6    0.83  0.83     5    0.85  0.82   3    0.86  0.80

n = number of patients; d = number of deaths; p = number of factors or principal components selected in training sample; c1 = c index in training sample; c2 = c index in test sample. In all cases the test sample consisted of n = 2113, d = 208.


Table III. Residual chi-square and predictive discrimination in 1/4 of original training sample

Step  Variable added  Added chi-square  Residual chi-square  d.f.  P        c1    c2
1     S3              75                116                  29    <0.0001  0.62  0.59
2     TypeAng         27                 83                  28    <0.0001  0.75  0.72
3     NoctPain        19                 65                  27    0.0001   0.80  0.73
4     CHF             14                 54                  26    0.001    0.82  0.75
5     Sex             12                 40                  25    0.03     0.84  0.76
6     IVCD            10                 31                  24    0.14     0.84  0.76
7     HxPVD            8                 23                  23    0.47     0.85  0.76
8     HxHT             5                 19                  22    0.64     0.85  0.74
9     STTW             5                 15                  21    0.84     0.86  0.75
10    Cbruit           5                 10                  20    0.97     0.85  0.76

c1 = c index in training sample; c2 = c index in test sample

One way to take into account the number of actual tests made at each step is to compute the 'residual chi-square' for all candidate variables at each step [9]. This statistic has degrees of freedom equal to the number of candidates. A reasonable stopping rule is to add variables until the residual chi-square is not significant at, say, the 0.05 level.

Another stopping rule is based on Akaike's information criterion (AIC), equivalent to Mallows' Cp in the normal linear regression case [11]. The AIC works as follows: suppose one has a model with p parameters and considers a more complex model having p + q parameters. The more complex model is preferred if it increases the log likelihood by q, i.e. if it increases the overall log likelihood chi-square for the model by 2q. This is equivalent to adding q variables to the model if the residual chi-square for the candidates exceeds twice its degrees of freedom. If the residual chi-square does exceed 2q, one can more safely add the single most important of the q variables to the model.
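A toy illustration of the 2q rule (the numbers in the usage lines are invented):

```python
def prefer_expansion(residual_chi_square, q):
    """AIC-style rule: expand the model only if the residual chi-square for
    the q candidate variables exceeds twice its degrees of freedom (2q)."""
    return residual_chi_square > 2 * q

print(prefer_expansion(15.0, 5))  # True: 15 > 10, so add the best of the 5 candidates
print(prefer_expansion(8.0, 5))   # False: 8 < 10, so stop
```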

For the quarter of the original training sample with 55 deaths, Table III shows the residual chi-square with the addition of each variable. According to the c index in the test sample, one needs no more than 5 variables. The stopping rule based on significance of the residual chi-square selects 6 variables. In other examples, predictions in test samples have worsened after adding variables beyond the point prescribed by these stopping rules.

Interpretation of the results of a stepwise variable selection strategy can be difficult.

Table IV. Variables selected by step-up strategy on random fourths of training sample, listed in order of selection

Original 1/4          Second 1/4          Third 1/4
training sample       training sample     training sample
S3                    CMG                 CMG
TypeAng               HxMI                Cbruit
NoctPain              Cbruit              TypeAng
CHF                   LBBB                NoctPain
Sex                   IVCD                HxMI
IVCD                  HxPVD               Prinz
HxPVD                 Sex                 Sex
HxHT                  ProgPain
STTW                  CHF
Cbruit

The set of variables selected changes with minor shifts in the sample used to develop the model, and one sometimes formulates equivalent models that appear to convey different information to a clinical researcher. As a demonstration, Table IV shows the variables selected, in order of statistical significance, for the random quarter of the training set used previously along with two more random quarters of the training set generated with replacement.

There is great variation in the factors chosen, even in the choice of the top one or two factors. In fact, only two variables are common to all three models, although the factors chosen do seem to reflect primarily myocardial damage, vascular disease, and severity of pain. Such variation in factors selected may result from chance fluctuations from sample to sample or from changes in the distribution of the independent variables. If, for example, a sample has a narrow age distribution or a low prevalence of an ECG abnormality, the chance lessens that one will declare these variables significant prognostic factors.

INCOMPLETE PRINCIPAL COMPONENTS COX REGRESSION

The previous example demonstrates that the usual stepwise variable selection strategy does not result in reproducible models when one must consider many covariates relative to the number of uncensored observations. Identification of which covariates to include is more arbitrary than the p-values make it appear. Marquardt and Snee [12] noted that 'it is better to use a little bit of all the variables than all of some variables and none of the remaining ones'. If one estimates parameters for all covariates, one must place restrictions on the parameter estimates to control their mean square error. Ridge regression is one solution to the problem, but it suffers from being arbitrary (one must choose a shrinkage constant) and difficult to apply to problems outside the realm of ordinary linear regression. Incomplete principal components regression is another method which places restrictions on the parameters. It does so by reducing the effective number of parameters to estimate. One can apply IPC regression [12] to any regression model that is linear in the covariates.

The first principal component [13] (pc) derived from a set of variables X1, X2, ..., Xp is the linear combination of the Xs having maximum variance across patients, subject to a normalizing constraint. The second pc is the linear combination of the Xs having the largest variance of all linear combinations that are uncorrelated with the first pc. In general the ith pc is the linear combination of the Xs having the ith largest variance of all linear combinations uncorrelated with the first i - 1 pcs. If there are no linear constraints among the Xs there will be p pcs in all. For many problems, one can summarize a large fraction of the total variation in the Xs across patients using fewer than p pcs. In other words, one can reduce the statistical information derived from p patient characteristics to q (< p) uncorrelated components. One does not separately analyse redundant measurements or measurements that do not vary much from patient to patient; the resulting data reduction enhances the stability of an analysis.

One performs IPC regression as follows. For a set of p covariates, first compute the p principal components of the covariates, in order of amount of variation explained. Take a subset, say q, of the pcs, explaining most of the variation in the covariates across patients. Next use the q pcs as candidate variables in a variable selection program and force the regression program to select the pcs in order until the r pcs not in the model at a given step are jointly not significant at the 0.05 level using the residual chi-square statistic having r degrees of freedom. Imposition of the order of selecting pcs in accordance with the order of variance explained introduces stability and reduces noise in the model. Use of the residual chi-square as a stopping rule allows one to select an adequate set of pcs for describing the response while avoiding the problem that arises when, say, the second pc is not significant but the third is; in that case one will probably select the first three pcs. If the number of candidate pcs is small (< 5, say), one may choose not to force the order of selection of components.


However, in other situations, the reduction of effective degrees of freedom achieved by forcing the order will increase model stability. In the majority of datasets examined, we found that the first pc was by far the most significant factor in predicting survival time. A computational sketch follows.
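Here is a minimal Python sketch of the IPC fit, assuming a pandas DataFrame and the lifelines Cox implementation. For simplicity it keeps a fixed number q of leading components rather than implementing the forced-order entry and residual-chi-square stopping rule; all names are assumptions.

```python
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

def fit_ipc_cox(train, covariates, q, time_col="time", event_col="event"):
    """Incomplete principal components Cox regression (simplified sketch)."""
    X = train[covariates].to_numpy(dtype=float)
    mean, sd = X.mean(axis=0), X.std(axis=0)
    Z = (X - mean) / sd                      # standardize training covariates
    # Principal components: eigenvectors of the correlation matrix,
    # ordered by variance explained (descending eigenvalues).
    eigval, eigvec = np.linalg.eigh(np.corrcoef(Z, rowvar=False))
    order = np.argsort(eigval)[::-1]
    loadings = eigvec[:, order[:q]]          # keep the first q components
    pcs = pd.DataFrame(Z @ loadings,
                       columns=[f"pc{i + 1}" for i in range(q)],
                       index=train.index)
    pcs[[time_col, event_col]] = train[[time_col, event_col]]
    cph = CoxPHFitter().fit(pcs, time_col, event_col)
    return cph, (mean, sd, loadings)         # keep the transform to score new patients
```

Test-sample patients are scored with the training means, standard deviations and loadings, matching the paper's point that the pcs were computed from the training data alone.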

In the present example, p = 30, and we chose q = 10 even though use of eigenvalues to examine the amount of variation explained would lead us to select fewer than 10 components. Also, we computed pcs using only the training sample data to avoid bias when assessing performance of IPC models. In general, however, no serious overoptimism will result from including the test data in computing pcs, since computation of pcs does not use outcome information. We found the c indices computed for predictions on the test sample to be larger by 0.01 on the average when test data were also used to compute pcs. Results for the IPC models for varying size of the training sample also appear in Table II. One sees that these models validate extremely well and that their performance does not depend on the sample size, at least for the range of sample sizes chosen. As is often the case, the first pc, which here represents an overall 'sickness score', is by far the most important one. With use of even the smallest training sample, a model employing only the first pc performed as well in the test sample as any other model.

The IPC method has two disadvantages. First, one always needs a regression coefficient for each variable; one never excludes variables. Secondly, the models are difficult to interpret, especially beyond the first pc. One can gain insight by examining correlations between the pcs and all of the original variables. However, one cannot delineate the prognostic factors with sufficient clarity for a thorough understanding by the researcher or patient.

VARIABLE CLUSTERING AND DERIVING INDICES

Although the IPC models predict well, the researcher has difficulty in their interpretation. In addition, one often desires to simplify a model by omitting a group of variables that do not provide independent prognostic information. If one could group the 30 variables and derive simple indices, these indices would provide a parsimonious approach to many prediction problems for patients with coronary disease. We used the technique of 'variable clustering' [14], similar to the method developed by D'Agostino and Pozen [15], to divide variables into groups according to intercorrelations of variables. Variable clustering creates groups so that within a group the first pc explains most of the variation, the second pc explains a minimum amount of variation, variables within a group are highly correlated, and first pcs between groups are as uncorrelated as possible. When a certain variable is not highly correlated with any other variable, one places that variable in a group by itself. Variables within a group usually represent the same clinical phenomenon. The variable clustering algorithm resulted in the following clusters (a stand-in computational sketch follows the list):

(1) CHF, CMG, PVC, IVCD, S3
(2) ProgPain, PainFreq, PainSev, NoctPain
(3) HxCVD, HxPVD, Cbruit, Pbruit
(4) HxMI, Qwave, STTW
(5) Age, CadDur
(6) HxCH, Prinz, Preinfar
(7) LAD, RBBB
(8) TypeAng, Sex, Smoke
(9) DM, FHx, HxHT

(10) LBBB
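The paper used a SAS variable clustering procedure [14]. As a stand-in, the sketch below groups variables by average-linkage hierarchical clustering on the distance 1 - |r|, which captures the same intent (highly intercorrelated variables fall in one cluster, weakly correlated variables form singletons) without reproducing the pc-based criterion; the threshold is an arbitrary assumption.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def cluster_variables(X, names, threshold=0.7):
    """Group columns of X so that highly correlated variables share a cluster."""
    r = np.corrcoef(X, rowvar=False)
    dist = 1.0 - np.abs(r)                   # distance: 1 - |correlation|
    np.fill_diagonal(dist, 0.0)
    Z = linkage(squareform(dist, checks=False), method="average")
    labels = fcluster(Z, t=threshold, criterion="distance")
    clusters = {}
    for name, lab in zip(names, labels):
        clusters.setdefault(lab, []).append(name)
    # variables not close to anything end up as singleton clusters
    return list(clusters.values())
```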

Although this grouping would result in more interpretable models, we felt that we could form more clinically relevant groups. Two cardiologists (R.M.C. and D.B.P.) further grouped the variables, and we formed a clinical index for each group that we shall now discuss.

One can form a single index for each group by computing the first pc for the group. However, the pc is sometimes not clinically interpretable. In addition, the first pc will give large weights to features having low prevalence, since pcs weight a variable proportional to the inverse of its standard deviation. The strategy we chose for deriving scores for a group of variables was to assign weights to each variable based on clinical intuition, making small modifications when there was statistical evidence that the weights were inadequate. Often, abnormalities within a group were given equal weight.

We used a proposed index, calculated by summing the weighted variables in the group, to model time until death. Then we tested the individual variables making up the index to see if they provided independent information not contained in the index, i.e. to see if the internal weights for the variables used in the index were adequate. We found only three exceptions to the weights proposed by the cardiologists: (1) Preinfar, initially assigned a weight of 1 point, was found statistically to require a larger weight, (2) PainFreq, assigned a weight of 1 point per episode of pain per day, was found to need a weight of half a point per episode, and (3) RBBB, thought to need half the weight of LBBB, was found to require a quarter of the weight of LBBB. At this point we omitted PainSev from the analysis because we decided that its measurement error was too great. Age, Sex, CadDur, TypeAng, and Prinz remained as individual factors. We isolated TypeAng because it should receive far more weight than other pain descriptors when studying other endpoints such as severity of coronary stenosis. We separated Prinz from other pain descriptors because it is sometimes determined during cardiac catheterization by ergonovine test. The final groupings and weights, transcribed into code after the list, are:

Pain index
ProgPain + NoctPain + 3 x Preinfar + PainFreq/2

Myocardial damage index
CHF + STTW + S3 + CMG + PVC + 2 x (HxMI or Qwave)

Vascular disease index
1 x (HxPVD or Pbruit) + 1 x (HxCVD or Cbruit)

Risk factor index
2 x Smoke + HxCH + HxHT + FHx + DM

Conduction defect index
4 x LBBB + RBBB + 2 x LAD + 4 x IVCD

Variables standing alone
Age, Sex, CadDur, TypeAng, Prinz
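A direct transcription of these weights into code, assuming a pandas DataFrame df holding the Table I variables with the stated codings, and reading 'or' as the logical or of the two 0/1 indicators:

```python
def add_clinical_indices(df):
    """Compute the five clinical indices from the final groupings and weights."""
    out = df.copy()
    out["pain_index"] = (df["ProgPain"] + df["NoctPain"]
                         + 3 * df["Preinfar"] + df["PainFreq"] / 2)
    # (HxMI or Qwave) on 0/1 indicators is their elementwise maximum
    out["myocardial_damage_index"] = (df["CHF"] + df["STTW"] + df["S3"]
                                      + df["CMG"] + df["PVC"]
                                      + 2 * df[["HxMI", "Qwave"]].max(axis=1))
    out["vascular_disease_index"] = (df[["HxPVD", "Pbruit"]].max(axis=1)
                                     + df[["HxCVD", "Cbruit"]].max(axis=1))
    out["risk_factor_index"] = (2 * df["Smoke"] + df["HxCH"] + df["HxHT"]
                                + df["FHx"] + df["DM"])
    out["conduction_defect_index"] = (4 * df["LBBB"] + df["RBBB"]
                                      + 2 * df["LAD"] + 4 * df["IVCD"])
    # Age, Sex, CadDur, TypeAng and Prinz remain as individual variables.
    return out
```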

We have reduced the 30 original variables to 10 entities. We used the indices and separate variables as 10 candidate variables in the variable selection method discussed earlier. Predictive discriminations appear in Table II. The models using indices compare favourably with the IPC models and are superior to the models using selection of individual variables for the three smaller training samples. Some drop-off in performance occurs for the smallest training sample, and 10 factors appear to be too many to use for variable selection with this small sample size.


The myocardial damage index emerged as the most important factor in every case. The pain index was also very important. We could readily interpret each model developed. For example: 'the worse the myocardial damage, the worse the chest pain, the worse the peripheral vascular disease, and the older the patient, the worse the prognosis'.

When one does not have time to develop clinical indices for each variable group, one would expect a model using the first pc of each group to perform well. When no continuous predictor variables are in the set, one can use the scoring method proposed by D'Agostino and Pozen [15]. For that method, one assumes each variable in a group is binary, and the score for that group is simply the number of positive characteristics (e.g. for a group of symptoms, the score is the number of symptoms present); a sketch follows.
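When every variable in a group is binary, that scoring is a one-line count per group; a sketch, assuming the cluster lists from above:

```python
def count_scores(df, clusters):
    """D'Agostino-Pozen style scoring: each group's score is the number of
    positive (coded 1) characteristics in that group."""
    return {"group_%d" % (i + 1): df[group].sum(axis=1)
            for i, group in enumerate(clusters)}
```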

CONCLUSION

One can use a simple index of discrimination to study the overall performance of a series of survival models. We used this c index to measure predictive discrimination in a single test sample for three different strategies. Standard stepwise variable selection validated poorly, especially with small to moderate size training samples. This strategy also suffers from lack of easy interpretation of the models and lack of uniqueness in what one calls a 'significant' prognostic factor. Incomplete principal component Cox regression models validated exceptionally well because fewer regression coefficients were tested and estimated. Such models, however, are difficult to interpret. The technique of variable clustering provides a useful starting point for grouping variables. With derivation of clinical indices (without directly using outcome data), one can formulate regression models that are immediately interpretable and concise, sacrificing little predictive discrimination relative to the IPC method unless the training sample size is very small. We recommend the variable clustering method for clinical prediction problems when the number of potential predictor variables is large. Of course, the more clinical insight one injects into the analysis at any point, the better the end result.

We have found that the clinical indices unify many different analyses. In new problems for selected patient subsets or for different endpoints in which the effective sample size is too small to form a model 'from scratch', we can rely upon the indices to generate stable prognostic models.

ACKNOWLEDGEMENTS

The authors wish to thank the Editor and the referees for their comments, which significantly improved the quality and clarity of this paper. We also wish to acknowledge helpful discussions with Drs. Thomas A. Reichert and Phillip J. Harris. This work was supported by Research Grant HL-17670 from the National Heart, Lung, and Blood Institute, Research Grants HS-03834 and HS-04873 from the National Center for Health Services Research, OASH, Training Grant LM-07003 and Grants LM-03373 and LM-00042 from the National Library of Medicine, and grants from the Prudential Insurance Company of America, the Kaiser Family Foundation, and the Andrew W. Mellon Foundation.

REFERENCES

1. Kaplan, E. L. and Meier, P. 'Nonparametric estimation from incomplete observations', Journal of the American Statistical Association, 53, 457-481 (1958).
2. Harris, P. J., Harrell, F. E., Lee, K. L. et al. 'Survival in medically treated coronary artery disease', Circulation, 60, 1259-1269 (1979).
3. Brown, B. W., Hollander, M. and Korwar, R. M. 'Nonparametric tests of independence of censored data, with applications to heart transplant studies', in Proschan, F. and Serfling, R. J. (eds), Reliability and Biometry, SIAM, Philadelphia, 1974, pp. 327-354.
4. Harrell, F. E., Califf, R. M., Pryor, D. B., Lee, K. L. and Rosati, R. A. 'Evaluating the yield of medical tests', Journal of the American Medical Association, 247, 2543-2546 (1982).
5. Hanley, J. A. and McNeil, B. J. 'The meaning and use of the area under a receiver operating characteristic (ROC) curve', Radiology, 143, 29-36 (1982).
6. Breslow, N. E. 'Covariance analysis of censored survival data', Biometrics, 30, 89-99 (1974).
7. Cox, D. R. 'Regression models and life tables (with discussion)', Journal of the Royal Statistical Society, Series B, 34, 187-220 (1972).
8. Bartolucci, A. A. and Fraser, M. D. 'Comparative step-up and composite tests for selecting prognostic indicators associated with survival', Biometrical Journal, 19, 437-448 (1977).
9. Harrell, F. E. 'The PHGLM procedure', in Joyner, S. P. (ed.), SUGI Supplemental Library User's Guide, 1983 Edition, SAS Institute, Cary NC.
10. Kalbfleisch, J. D. and Prentice, R. L. The Statistical Analysis of Failure Time Data, Wiley, New York, 1980, p. 85.
11. Atkinson, A. C. 'A note on the generalized information criterion for choice of a model', Biometrika, 67, 413-418 (1980).
12. Marquardt, D. W. and Snee, R. D. 'Ridge regression in practice', American Statistician, 29, 3 (1975).
13. Morrison, D. F. Multivariate Statistical Methods, McGraw-Hill, New York, 1976, pp. 266-299.
14. Ray, A. A. (ed.) SAS User's Guide: Statistics, SAS Institute, Cary NC, 1982, pp. 461-473.
15. D'Agostino, R. B. and Pozen, M. W. 'The logistic function as an aid in the detection of acute coronary disease in emergency patients (a case study)', Statistics in Medicine, 1, 41-48 (1982).