Jul 17, 2016
Statistical Computing Seminars
Survival Analysis with Stata

Background for Survival Analysis
The UIS data
Exploring the data: Univariate Analyses
Model Building
Interactions
Proportionality Assumption
Graphing Survival Functions from stcox command
Goodness of Fit of the Final Model

The Stata program on which the seminar is based.
The UIS_small data file for the seminar.
Background for Survival Analysis

The goal of this seminar is to give a brief introduction to the topic of survival analysis. We will be using a smaller and slightly modified version of the UIS data set from the book "Applied Survival Analysis" by Hosmer and Lemeshow. We strongly encourage everyone who is interested in learning survival analysis to read this text, as it is a very good and thorough introduction to the topic.
Survival analysis is just another name for time to event analysis. The term survival analysis is predominantly used in the biomedical sciences, where the interest is in observing time to death either of patients or of laboratory animals. Time to event analysis has also been used widely in the social sciences, where the interest is in analyzing time to events such as job changes, marriage, birth of children and so forth. The engineering sciences have also contributed to the development of survival analysis, which is called "reliability analysis" or "failure time analysis" in that field, since the main focus is on modeling the time it takes for machines or electronic components to break down. The developments from these diverse fields have for the most part been consolidated into the field of "survival analysis". For more background please refer to the excellent discussion in Chapter 1 of Event History Analysis by Paul Allison.
There are certain aspects of survival analysis data, such as censoring and non-normality, that generate great difficulty when trying to analyze the data using traditional statistical models such as multiple linear regression. The non-normality of the data violates the normality assumption of the most commonly used statistical models, such as regression or ANOVA. A censored observation is defined as an observation with incomplete information. There are four different types of censoring possible: right truncation, left truncation, right censoring and left censoring. We will focus exclusively on right censoring for a number of reasons. Most data used in analyses have only right censoring. Furthermore, right censoring is the most easily understood of the four types, and if a researcher can understand the concept of right censoring thoroughly it becomes much easier to understand the other three types.

When an observation is right censored it means that the information is incomplete because the subject did not have an event during the time that the subject was part of the study. The point of survival analysis is to follow subjects over time and observe at which point in time they experience the event of interest. It often happens that the study does not span enough time to observe the event for all the subjects in the study. This could be due to a number of reasons. Perhaps subjects drop out of the study for reasons unrelated to the study (e.g., patients moving to another area and leaving no forwarding address). The common feature of all of these examples is that if the subject had been able to stay in the study, it would eventually have been possible to observe the time of the event.
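To make the idea of right censoring concrete, here is a minimal sketch in Stata. The three-subject data set and the variable name event are hypothetical and are not part of the UIS data; the only assumption is that event equals 1 when the event was observed and 0 when the subject was right censored.

```stata
* Hypothetical toy data, not taken from the UIS study.
* event = 1: event observed at time; event = 0: right censored at time.
clear
input id time event
1  5 1
2  8 0
3 12 1
end

* Declare the data to be survival-time data.
stset time, failure(event)

* stset creates _t (analysis time) and _d (failure indicator).
list id _t _d
```

Subject 2 was followed for 8 time units without an event, so all we know is that this subject's event time is greater than 8.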
It is important to understand the difference between calendar time and time in the study. It is very common for subjects to enter the study continuously throughout the length of the study. This situation is reflected in the first graph, where we can see the staggered entry of four subjects. Red dots denote intervals in which the event occurred, whereas intervals without red dots signify censoring. It would appear that subject 4 dropped out after only a short time (hit by a bus, very tragic) and that subject 3 did not experience an event by the time the study ended, but if the study had gone on longer (had more funding) we would have known the time when this subject would have experienced an event.
clear
input subj tp censored str11 datestr
1 1 0 "1 jan 1990"
1 2 0 "1 mar 1991"
2 1 1 "1 feb 1990"
2 2 1 "1 feb 1991"
3 1 1 "1 jun 1990"
3 2 1 "31 dec 1991"
4 1 0 "1 sep 1990"
4 2 0 "1 apr 1991"
end

gen date = date(datestr, "DMY")
format date %dmy
twoway (line subj date, connect(L)) (scatter subj date if censored==0), ///
    ylabel(1 2 3 4) legend(order(2 "censored"))

gen time = 0 if tp==1
replace time = (date-date[_n-1])/30.5 if tp==2
twoway (line subj time, connect(L)) (scatter subj time if censored==0), ///
    ylabel(1 2 3 4) legend(order(2 "censored")) xlabel(0 8 12 14 19 24)
The other important concept in survival analysis is the hazard rate. From looking at data with discrete time (time measured in large intervals such as months, years or even decades) we can get an intuitive idea of the hazard rate. For discrete time the hazard rate is the probability that an individual will experience an event at time t, given that the individual is still at risk of having an event at that time. Thus, the hazard rate is really just the unobserved rate at which events occur. If the hazard rate is constant over time and equal to 1.5, for example, one would expect 1.5 events to occur in a time interval that is one unit long. Furthermore, if one person had a hazard rate of 1.2 at time t and a second person had a hazard rate of 2.4 at time t, then it would be correct to say that the second person's risk of an event is twice as great at time t. It is important to realize that the hazard rate is an unobserved variable, yet it controls both the occurrence and the timing of the events. It is the fundamental dependent variable in survival analysis.
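The verbal definitions above can be written compactly. The notation here is an assumption, since the seminar does not introduce symbols: T denotes the event time and h(t) the hazard. Note that in discrete time the hazard is a probability and cannot exceed 1, so values such as 1.5 or 2.4 are rates on the continuous-time scale.

```latex
% Discrete time: conditional probability of an event at t, given still at risk
h(t) = \Pr(T = t \mid T \ge t)

% Continuous time: instantaneous event rate at t, given still at risk
h(t) = \lim_{\Delta t \to 0}
       \frac{\Pr(t \le T < t + \Delta t \mid T \ge t)}{\Delta t}

% Comparing the two people in the example at time t
\frac{h_2(t)}{h_1(t)} = \frac{2.4}{1.2} = 2
```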
Another important aspect of the hazard function is how its shape influences the other quantities of interest, such as the survival function. The first graph below illustrates a hazard function with a "bathtub shape". This graph depicts the hazard function for the survival of organ transplant patients. At time zero they are having the transplant, and since this is a very dangerous operation they have a very high hazard (a great chance of dying). The first 10 days after the operation are also very dangerous, with a high chance of the patient dying, but the danger is less than during the actual operation and hence the hazard is decreasing during this period. If the patient has survived past day 10, then they are in very good shape and have very little chance of dying in the following 6 months. After 6 months the patients begin to experience deterioration and the chances of dying increase again, so the hazard function starts to increase. After one year almost all patients are dead, hence the very high hazard, which will continue to increase.
The hazard function may not seem like an exciting variable to model, but other indicators of interest, such as the survival function, are derived from the hazard rate. Once we have modeled the hazard rate we can easily obtain these other functions of interest. To summarize, it is important to understand the concept of the hazard function and to understand the shape of the hazard function.
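The way the survival function follows from the hazard can be sketched with the standard continuous-time identities. The symbols H(t) for the cumulative hazard, S(t) for the survival function and \lambda for a constant hazard are notation assumed here, not taken from the seminar.

```latex
% Cumulative hazard: the hazard accumulated up to time t
H(t) = \int_0^t h(u)\,du

% Survival function: probability of no event up to time t
S(t) = \Pr(T > t) = \exp\{-H(t)\}

% Special case of a constant hazard h(t) = \lambda
S(t) = e^{-\lambda t}, \qquad
\text{e.g. } \lambda = 1.5 \;\Rightarrow\; S(1) = e^{-1.5} \approx 0.223
```

So with a constant hazard of 1.5 per unit of time, roughly 22% of subjects would still be event-free after one unit of time.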
An example of a hazard function for heart transplant patients.
We are generally unable to generate the hazard function directly; instead we usually look at the cumulative hazard curve.
use http://www.ats.ucla.edu/stat/data/uis.dta, clear
gen id = ID
drop ID
stset time, failure(censor)
sts graph, na
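As a hedged aside, more recent Stata releases document the option cumhaz for the Nelson-Aalen cumulative hazard plot, and sts graph can also draw a kernel-smoothed estimate of the hazard itself. The sketch below assumes the data have already been declared with the stset command above; which options are accepted depends on the Stata version in use.

```stata
* Sketch only: assumes the UIS data are loaded and stset as above.

* Nelson-Aalen cumulative hazard (documented option name in newer releases).
sts graph, cumhaz

* Kernel-smoothed estimate of the hazard function.
sts graph, hazard
```

The smoothed hazard is an estimate whose shape depends on the kernel and bandwidth, which is one reason the cumulative hazard curve is often examined first.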
The UIS data

The goal of the UIS data is to model time until return to drug use for patients enrolled in two different residential treatment programs that differed in length (treat=0 is the short program and treat=1 is the long program). The patients were randomly assigned to two different sites (site=0 is site A and site=1 is site B). The variable age indicates age at enrollment, herco indicates heroin or cocaine use in the past three months (herco=1 indicates heroin and cocaine use, herco=2 indicates either heroin or cocaine use and herco=3 indicates neither heroin nor cocaine use) and ndrugtx indicates the number of previous drug treatments. The variable time contains the time until return to drug use and the censor variable indicates whether the subject returned to drug use (censor=1 indicates return to drug use and censor=0 otherwise).

Let's look at the first 10 observations of the UIS data set. Note that subject 5 is censored and did not experience an event while in the study. Also note that the coding for censor is rather counter-intuitive, since the value 1 indicates an event and 0 indicates censoring. It would perhaps be more appropriate to call this variable "event".
list id time censor age ndrugtx treat site herco in 1/10, nodisplay
      id   time   censor   age   ndrugtx   treat   site   herco
 1.    1    188        1    39         1       1      0       3
 2.    2     26        1    33         8       1      0       3
 3.    3    207        1    33         3       1      0       2
 4.    4    144        1    32         1       0      0       3
 5.    5    551        0    24         5       1      0       2
 6.    6     32        1    30         1       1      0       1
 7.    7    459        1    39        34       1      0       3
 8.    8     22        1    27         2       1      0       3
 9.    9    210        1    40