Survival Analysis using R
Bruce L. Jones
Department of Statistical and Actuarial Sciences, The University of Western Ontario
March 24, 2010
Outline
• What is R?
• Why use R?
• A bit about R
• What is Survival Analysis?
• The survival package in R
• Example
What is R?
• R is a free software environment for statistical computing and graphics.
• It compiles and runs on a wide variety of UNIX platforms, Windows and MacOS.
• R is very popular among researchers in statistics.
• R is similar in appearance to S.
• R was initially written by Ross Ihaka and Robert Gentleman.
Why use R?
• It contains advanced statistical routines not yet available in other packages.
• It provides an unparalleled platform for programming new statistical methods in an easy and straightforward manner.
• It has state-of-the-art graphics capabilities.
• It’s free. Just go to http://www.r-project.org
Assignment, Vectors and Arrays
> 1+2*3
[1] 7
> x=3
> y<-2
> x+y
[1] 5
> z=c(2,3,4,5)
> z
[1] 2 3 4 5
> 2*z
[1] 4 6 8 10
>
Assignment, Vectors and Arrays
> z=2:5
> z
[1] 2 3 4 5
> z=seq(2,5,1)
> z
[1] 2 3 4 5
> zz=seq(10,300,3)
> zz
[1] 10 13 16 19 22 25 28 31 34 37 40 43 46 49 52 55 58 61 64
[20] 67 70 73 76 79 82 85 88 91 94 97 100 103 106 109 112 115 118 121
[39] 124 127 130 133 136 139 142 145 148 151 154 157 160 163 166 169 172 175 178
[58] 181 184 187 190 193 196 199 202 205 208 211 214 217 220 223 226 229 232 235
[77] 238 241 244 247 250 253 256 259 262 265 268 271 274 277 280 283 286 289 292
[96] 295 298
>
Assignment, Vectors and Arrays
> mat=array(1:12,c(3,4))
> mat
     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    2    5    8   11
[3,]    3    6    9   12
> mat=matrix(1:12,3,4)
> mat
     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    2    5    8   11
[3,]    3    6    9   12
>
Functions
> plus=function(a,b) a+b
> plus(3,4)
[1] 7
> plus(3)
Error in plus(3) : element 2 is empty;
the part of the args list of ’+’ being evaluated was:
(a, b)
> plus=function(a,b=0) a+b
> plus(3,4)
[1] 7
> plus(3)
[1] 3
> plus(1:3,4:5)
[1] 5 7 7
Warning message:
In a + b : longer object length is not a multiple of shorter object length
>
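The warning in the last call comes from R's recycling rule: when two vectors of different lengths are combined, the shorter one is repeated until it matches the longer one. The following minimal sketch makes the recycling explicit with the base function rep_len; the variable names are illustrative only.

```r
a <- 1:3
b <- 4:5

# Recycle b to the length of a, giving 4 5 4
b_recycled <- rep_len(b, length(a))

# Same values as plus(1:3, 4:5), but with no warning because the lengths match
result <- a + b_recycled
result
```

The third element is 3 + 4 = 7 because the first element of b is reused.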
What is Survival Analysis?
Survival Analysis is the study of lifetimes and their distributions. It usually involves one or more of the following objectives:
• to explore the behaviour of the distribution of a lifetime.
• to model the distribution of a lifetime.
• to test for differences between the distributions of two or more lifetimes.
• to model the impact of one or more explanatory variables on a lifetime distribution.
The Nature of Lifetime Data
• It’s almost always incomplete.
– It often involves right-censoring.
– It sometimes involves left-truncation.
• The methods of survival analysis allow for this incompleteness.
The survival Package in R
> install.packages("survival") # first time only
--- Please select a CRAN mirror for use in this session ---
trying URL ’http://probability.ca/cran/bin/windows/contrib/2.10/survival_2.35-8.zip’
Content type ’application/zip’ length 2445387 bytes (2.3 Mb)
opened URL
downloaded 2.3 Mb
package ’survival’ successfully unpacked and MD5 sums checked
The downloaded packages are in
C:\Documents and Settings\jones\Local Settings\Temp\RtmpEQ5ZaF\downloaded_packages
> library(survival)
Loading required package: splines
>
Creating a Survival Object
Example 1. Complete data lifetimes: 26, 42, 71, 85, 92.
> ex1.times=c(26,42,71,85,92)
> ex1.surv=Surv(ex1.times)
> ex1.surv
[1] 26 42 71 85 92
> class(ex1.surv)
[1] "Surv"
> class(ex1.times)
[1] "numeric"
>
Creating a Survival Object
Example 2. Right-censored lifetimes: 26, 42, 71, 80+, 80+.
> ex2.times=c(26,42,71,80,80)
> ex2.events=c(1,1,1,0,0)
> ex2.surv=Surv(ex2.times,ex2.events)
> ex2.surv
[1] 26 42 71 80+ 80+
>
Creating a Survival Object
Example 3. Left-truncated and right-censored lifetimes:
Left-truncation time is 40 for all individuals;
Event/right-censoring times are 42, 71, 80+, 80+.
> ex3.lttimes=rep(40,4)
> ex3.times=c(42,71,80,80)
> ex3.events=c(1,1,0,0)
> ex3.surv=Surv(ex3.lttimes,ex3.times,ex3.events)
> ex3.surv
[1] (40,42 ] (40,71 ] (40,80+] (40,80+]
>
Real Data Example
Lifetimes: Times until death of 26 psychiatric patients
Number of deaths: 14
Number of censored observations: 12
Covariates: patient age and sex (15 females, 11 males)
Real Data Example
The Data
patient sex age time death    patient sex age time death
      1   2  51    1     1         14   2  30   37     0
      2   2  58    1     1         15   2  33   35     0
      3   2  55    2     1         16   1  36   25     1
      4   2  28   22     1         17   1  30   31     0
      5   1  21   30     0         18   1  41   22     1
      6   1  19   28     1         19   2  43   26     1
      7   2  25   32     1         20   2  45   24     1
      8   2  48   11     1         21   2  35   35     0
      9   2  47   14     1         22   1  29   34     0
     10   2  25   36     0         23   1  35   30     0
     11   2  31   31     0         24   1  32   35     1
     12   1  24   33     0         25   2  36   40     1
     13   1  25   33     0         26   1  32   39     0
Real Data Example
Questions
• Does the lifetime distribution behave the way we expect?
• Are the lifetimes different for females and males?
• Do the lifetimes depend on age?
Estimating the Survival Function
We can explore the lifetime distribution by examining nonparametric estimates of the survival function.
The R function survfit allows us to do this.
> library(KMsurv) # get the data
> data(psych)
> attach(psych)
> names(psych)
[1] "sex" "age" "time" "death"
> psych.surv=Surv(age,age+time,death) # create a survival object
> psych.fit1=survfit(psych.surv~1) # obtain the estimates
> plot(psych.fit1,xlim=c(40,80),xlab="age",ylab="probability",
+ main="Survival Function Estimates") # plot the estimates
>
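To make the nonparametric estimate concrete, here is a minimal base R sketch of the product-limit (Kaplan-Meier) calculation, applied to the small Example 2 and Example 3 data sets from the earlier slides. The helper km_by_hand is hypothetical (it is not part of the survival package), and treating a subject as at risk at time t when entry < t <= exit is an assumption about how left-truncation is handled.

```r
# Product-limit estimate computed by hand.
# km_by_hand is a hypothetical helper written for illustration only.
km_by_hand <- function(exit, event, entry = rep(0, length(exit))) {
  death_times <- sort(unique(exit[event == 1]))
  # at risk at time t: entered before t and not yet dead or censored
  n_risk <- sapply(death_times, function(t) sum(entry < t & exit >= t))
  # number of deaths observed at time t
  n_event <- sapply(death_times, function(t) sum(exit == t & event == 1))
  data.frame(time = death_times, n.risk = n_risk, n.event = n_event,
             surv = cumprod(1 - n_event / n_risk))
}

# Example 2: right-censored lifetimes 26, 42, 71, 80+, 80+
ex2 <- km_by_hand(exit = c(26, 42, 71, 80, 80), event = c(1, 1, 1, 0, 0))
ex2

# Example 3: the same idea with left-truncation at 40
ex3 <- km_by_hand(exit = c(42, 71, 80, 80), event = c(1, 1, 0, 0),
                  entry = rep(40, 4))
ex3
```

For Example 2 the estimate steps down to 4/5, then 3/5, then 2/5 at times 26, 42 and 71; for Example 3 it steps down to 3/4 and then 1/2 at times 42 and 71.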
Estimating the Survival Function
[Figure: plot titled "Survival Function Estimates"; x-axis: age (40 to 80); y-axis: probability (0.0 to 1.0)]
Estimating the Survival Function
Now let’s consider females and males separately.
> psych.fit2=survfit(psych.surv~sex) # separate by sex
> plot(psych.fit2,xlim=c(40,80),xlab="age",ylab="probability",
+ main="Survival Function Estimates for Males (red) and Females",
+ col=c("red","blue"))
> plot(psych.fit2,xlim=c(40,80),xlab="age",ylab="probability",
+ main="Survival Function Estimates for Males (red) and Females",
+ col=c("red","blue"), conf.int=T)
>
Estimating the Survival Function
[Figures (two slides): plots titled "Survival Function Estimates for Females (blue) and Males"; x-axis: age (40 to 80); y-axis: probability (0.0 to 1.0)]
Testing for Differences
The R function survdiff allows us to test for differences between lifetime distributions.
> survdiff(psych.surv~sex)
Error in survdiff(psych.surv ~ sex) : Right censored data only
> psych.surv2=Surv(time,death) # create new survival object
> survdiff(psych.surv2~sex)
Call:
survdiff(formula = psych.surv2 ~ sex)
       N Observed Expected (O-E)^2/E (O-E)^2/V
sex=1 11        4     6.24     0.807      1.61
sex=2 15       10     7.76     0.650      1.61
Chisq= 1.6 on 1 degrees of freedom, p= 0.205
>
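The two-group test reported by survdiff can be reproduced by hand. The sketch below is a base R illustration of the standard log-rank calculation (observed minus expected deaths in one group, with a hypergeometric variance); it is not the package's own code. The data vectors are transcribed from the slide "The Data", and the coding 1 = male, 2 = female is inferred from the group sizes of 11 and 15.

```r
# Data transcribed from the slide "The Data", patients 1 to 26 in order
sex   <- c(2,2,2,2,1,1,2,2,2,2,2,1,1, 2,2,1,1,1,2,2,2,1,1,1,2,1)
time  <- c(1,1,2,22,30,28,32,11,14,36,31,33,33,
           37,35,25,31,22,26,24,35,34,30,35,40,39)
death <- c(1,1,1,1,0,1,1,1,1,0,0,0,0, 0,0,1,0,1,1,1,0,0,0,1,1,0)

death_times <- sort(unique(time[death == 1]))
obs1 <- 0; exp1 <- 0; var1 <- 0
for (t in death_times) {
  n  <- sum(time >= t)                          # at risk, both groups
  n1 <- sum(time >= t & sex == 1)               # at risk, sex = 1
  d  <- sum(time == t & death == 1)             # deaths at t, both groups
  d1 <- sum(time == t & death == 1 & sex == 1)  # deaths at t, sex = 1
  obs1 <- obs1 + d1
  exp1 <- exp1 + d * n1 / n
  # hypergeometric variance; no contribution when one subject is at risk
  if (n > 1) var1 <- var1 + d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
}
chisq <- (obs1 - exp1)^2 / var1
p_value <- pchisq(chisq, df = 1, lower.tail = FALSE)
round(c(observed = obs1, expected = exp1, chisq = chisq, p = p_value), 3)
```

With the data as transcribed, this gives 4 observed and about 6.24 expected deaths for sex=1 and a chi-square statistic of about 1.61, in line with the survdiff output above.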
Fitting a Proportional Hazards Model
The model: $h(t \mid x_1, \ldots, x_p) = h_0(t) \exp(\beta_1 x_1 + \cdots + \beta_p x_p)$
• The PH model is often used when we are interested in the impact of the covariates, $x_1, \ldots, x_p$, but not the lifetime distributions themselves.
• We can estimate and make inferences about $\beta_1, \ldots, \beta_p$ without estimating $h_0$.
• The R function coxph allows us to do this.
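The reason $h_0$ need not be estimated can be seen from the partial likelihood, sketched here under the assumption of no tied death times. The notation beyond $h_0$, $\beta$ and $x$ is introduced for illustration: $t_{(1)} < \cdots < t_{(k)}$ are the ordered death times, $x_{(i)}$ is the covariate vector of the individual who dies at $t_{(i)}$, and $R(t)$ is the set of individuals at risk at time $t$.

```latex
L(\beta)
  = \prod_{i=1}^{k}
    \frac{h_0(t_{(i)}) \exp(\beta^\top x_{(i)})}
         {\sum_{j \in R(t_{(i)})} h_0(t_{(i)}) \exp(\beta^\top x_j)}
  = \prod_{i=1}^{k}
    \frac{\exp(\beta^\top x_{(i)})}
         {\sum_{j \in R(t_{(i)})} \exp(\beta^\top x_j)}
```

The baseline hazard appears in both numerator and denominator of each factor and cancels, so the product depends on $\beta$ only.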
Fitting a Proportional Hazards Model
> psych.coxph1=coxph(psych.surv~sex)
> summary(psych.coxph1)
Call:
coxph(formula = psych.surv ~ sex)
n= 26
      coef exp(coef) se(coef)     z Pr(>|z|)
sex 0.3900    1.4770   0.6102 0.639    0.523
    exp(coef) exp(-coef) lower .95 upper .95
sex     1.477      0.677    0.4466     4.884
Rsquare= 0.016 (max possible= 0.926 )
Likelihood ratio test= 0.43 on 1 df, p=0.5141
Wald test = 0.41 on 1 df, p=0.5227
Score (logrank) test = 0.41 on 1 df, p=0.5203
Fitting a Proportional Hazards Model
Next we use our survival object psych.surv2, which does not involve left-truncation.
> psych.coxph2=coxph(psych.surv2~sex)
> summary(psych.coxph2)
Call:
coxph(formula = psych.surv2 ~ sex)
n= 26
      coef exp(coef) se(coef)     z Pr(>|z|)
sex 0.7511    2.1194   0.6055 1.241    0.215
    exp(coef) exp(-coef) lower .95 upper .95
sex     2.119     0.4718    0.6469     6.944
Rsquare= 0.062 (max possible= 0.945 )
Likelihood ratio test= 1.66 on 1 df, p=0.1981
Wald test = 1.54 on 1 df, p=0.2148
Score (logrank) test = 1.61 on 1 df, p=0.2046
Note that the last test is exactly that performed using survdiff.
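The exp(coef) and confidence-limit columns in this output follow directly from coef and se(coef). As a rough check, the base R sketch below recomputes them from the rounded values printed above, assuming the usual normal-approximation (Wald) interval; the variable names are illustrative.

```r
coef_sex <- 0.7511   # coef reported by summary(psych.coxph2)
se_sex   <- 0.6055   # se(coef) reported by summary(psych.coxph2)

# Hazard ratio for a one-unit increase in sex (sex = 2 relative to sex = 1)
hazard_ratio <- exp(coef_sex)

# 95% Wald interval, computed on the log scale and then exponentiated
z  <- qnorm(0.975)
ci <- exp(coef_sex + c(-1, 1) * z * se_sex)

# Wald z statistic and two-sided p-value
wald_z  <- coef_sex / se_sex
p_value <- 2 * pnorm(-abs(wald_z))

round(c(HR = hazard_ratio, lower = ci[1], upper = ci[2],
        z = wald_z, p = p_value), 4)
```

The interval runs from about 0.65 to about 6.94 and so contains 1, which is why the estimated hazard ratio of about 2.12 is not statistically significant.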
Fitting a Proportional Hazards Model
Finally, consider
> psych.coxph3=coxph(psych.surv2~age+sex)
> summary(psych.coxph3)
Call:
coxph(formula = psych.surv2 ~ age + sex)
n= 26
        coef exp(coef) se(coef)      z Pr(>|z|)
age  0.20753   1.23063  0.05828  3.561  0.00037 ***
sex -0.52374   0.59230  0.73753 -0.710  0.47762
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    exp(coef) exp(-coef) lower .95 upper .95
age    1.2306     0.8126    1.0978     1.380
sex    0.5923     1.6883    0.1396     2.514
Rsquare= 0.553 (max possible= 0.945 )
Likelihood ratio test= 20.91 on 2 df, p=2.879e-05
Wald test = 14.3 on 2 df, p=0.0007866
Score (logrank) test = 21.27 on 2 df, p=2.409e-05
Conclusions about this Example
• There is great uncertainty due to the small number of observations.
• Times until death depend on age at first admission to the hospital.
• We cannot conclude that the lifetimes are different for females and males.
Fitting an Accelerated Failure Time Model
• This is a popular fully parametric model for which the lifetime distribution is the same for different covariate values, except that the time scale is multiplied by a different constant.
• The R function survreg can be used to fit an AFT model.
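The time-scale statement above can be written out as follows. This is a sketch in notation that is not on the slides: $S_0$ is an assumed baseline survival function, $T$ is the lifetime, and $\mu$, $\sigma$ and the error term $W$ are the location, scale and error of the usual log-linear form. The coefficients $\beta_1, \ldots, \beta_p$ here are AFT coefficients and are not the same quantities as in the proportional hazards model.

```latex
S(t \mid x_1, \ldots, x_p)
  = S_0\!\left( t \, \exp\!\left( -(\beta_1 x_1 + \cdots + \beta_p x_p) \right) \right)
\quad\Longleftrightarrow\quad
\log T = \mu + \beta_1 x_1 + \cdots + \beta_p x_p + \sigma W
```

In this form a positive $\beta_k$ stretches the time scale, so larger values of $x_k$ correspond to longer lifetimes.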
Summary
• R is a flexible and free software environment for statistical computing and graphics.
• The survival package contains functions for survival analysis.
– Surv creates a survival object.
– survfit estimates (nonparametrically) the survival function.
– survdiff performs tests for differences in lifetime distributions.
– coxph fits the proportional hazards model.
– survreg fits the accelerated failure time model.
These slides are here:
http://www.stats.uwo.ca/faculty/jones/survival_talk.pdf