Event-history analysis: discrete- & continuous-time methods
Society for Longitudinal & Life-course Studies
Summer School, University of Amsterdam, 25-29 August 2014
Prof. dr. K. Neels,
Sociology Department, University of Antwerp
QASS-Programme, KULeuven
Outline
1. Introduction
2. Descriptive discrete-time methods
3. Discrete-time models
4. Descriptive continuous-time methods
5. Continuous-time models
6. Advanced topics
Introduction
1. Applied longitudinal data analysis: example
2. Modeling event-occurrence: methodological features
3. Event-history data
4. Censoring
Marital dissolution (South, 2001)
• Observation plan:
- 23-year study
- data on 3523 couples in different generations who married in different years
• Effect of wives’ employment on marital dissolution: 2 hypotheses
- Effect might diminish over time:
more women enter the labor force and working becomes normative
- Effect might increase:
changing mores weaken the link between marriage and parenthood
Introduction
Singer et al, 2003, v.
Marital dissolution (South, 2001)
• Results:
- Effect of wives’ employment increases over time:
risk differential is higher in 1990s than in 1970s
- Effect of wives’ employment also increases with marital duration
• Conclusions:
- research based on cross-sectional data has too often assumed that the effects of
predictors like wives’ employment are constant over calendar time, i.e. it ignores the
‘EMPLOYMENT*TIME’ interaction
- research has also too often assumed that the effect of predictors like wives’ employment is
constant over marital duration: predictors of divorce among newlyweds are
likely to differ from those among couples who have been married for years, i.e.
it ignores the ‘EMPLOYMENT*DURATION’ interaction
Singer et al, 2003, v.
Introduction
1. Applied longitudinal data analysis: example
2. Modeling event-occurrence: methodological features
Research questions involving events
• 3 methodological features:
- Target event whose occurrence is being studied:
e.g. whether and when marital dissolution occurs
- Beginning of time:
an initial starting point when no one under study has yet experienced the target
event, e.g. date of marriage
- Metric for clocking time:
a meaningful scale on which event occurrence is recorded, e.g. marital duration
Singer et al, 2003, 305-324.
Introduction
Lexis Chart: Marriage & Divorce
Hinde, 1998, 12.
[Figure: Lexis chart. Vertical axis: age in years (20-26); horizontal axis: calendar time in years (2000-2006). The life line of Person A runs from marriage to divorce and represents marital duration.]
Introduction
Lexis Chart: General Form
Hinde, 1998, 12.
[Figure: Lexis chart, general form. Vertical axis: age in years (20-26); horizontal axis: calendar time in years (2000-2009). Labels on the life line of Person A: entry into risk set, exposure or duration, event occurrence, censoring.]
Introduction
Research questions involving events
• Event
sharp disjunction between what precedes and what follows, transition from one
state to another state
• State Space
- States can be physical (migration, becoming home owner), psychological (healthy
or depressed) or social (married or divorced)
- Requirement for survival analysis: states are mutually exclusive (non-overlapping)
and exhaustive (i.e. the state space includes all possible states).
• Number of states
- Most applications focus on two states. Two-state methods are expanded later
using models for competing risks.
• Repeatable versus non-repeatable events
- Some states are non-repeatable (first job, death, …): once entered, a person can
never re-enter. Other states are repeatable (changing jobs, having children,…):
different episodes or spells can be analyzed for every person.
Singer et al, 2003, 305-324.
Introduction
The beginning of time…
• Beginning of time:
moment when everyone in the population is in one, and only one, of the possible
states (i.e. origin state); moment when persons enter risk set for transition to
destination state.
• Timing of the transition (event time):
distance from the beginning of time until event occurrence; distance between entry
into risk set and event occurrence.
• Starting points for metric:
- Birth: all studies using age as the metric of time (e.g. occurrence of death,
depression, suicidal thoughts,…)
- Precipitating event: e.g. graduation for studying transition to first employment,
marriage for study of divorce, first birth for studying occurrence and timing of
second births,…
- Arbitrary start time: e.g. age 14 or 16 for analysis of occurrence of first birth
Singer et al, 2003, 305-324.
Introduction
Metric of time
• Metric
unit in which passing of time is recorded (e.g. seconds, minutes, hours, days,
months, years, decades,…)
• Continuous versus Discrete measurement
Occurrence of events can be recorded in precise units (continuous time
measurement) or coarse intervals (discrete-time measurements).
• Reasons for discrete time data:
- events can only occur at discrete time points (graduating, enrolling for higher
education, …)
- events can theoretically occur at any point in time but tend to occur at certain
intervals (e.g. leaving a job at the end of the contract, although it is theoretically
possible to quit in the middle of a contract)
- coarse measurement, e.g. retrospectively collected duration data necessitate
collapsing duration into intervals (memory problems, rounding errors).
Singer et al, 2003, 305-324.
Introduction
Overview: discrete-time versus continuous-time methods
• Continuous-time methods
- methods assuming that time is measured exactly or in small enough discrete units
to avoid large numbers of ties
- predominant method of event history analysis in social sciences, engineering and
biostatistics
• Discrete-time methods
- methods assuming coarse measurement of time – e.g. months, years or decades –
resulting in large numbers of ties
- robust alternative to continuous-time methods
- computationally intensive
• Discrete-time versus continuous-time
Continuous-time and discrete-time data have implications for methodological aspects
of survival analysis: parameter definition, model construction, estimation and testing
Allison, 1984, 9-14; Allison, 2004, 369-385.
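The link between coarse measurement and ties can be illustrated with a minimal Python sketch. The ten durations in months are hypothetical values chosen for illustration, not data from the course; `count_tied` is an illustrative helper name.

```python
from collections import Counter

# Hypothetical durations in months (illustrative only, not course data)
months = [3, 7, 14, 15, 20, 26, 27, 31, 35, 40]

def count_tied(durations):
    """Number of observations that share their recorded duration with another one."""
    counts = Counter(durations)
    return sum(n for n in counts.values() if n > 1)

# Coarse measurement: record completed years instead of months
years = [m // 12 for m in months]

tied_months = count_tied(months)
tied_years = count_tied(years)
print("tied observations, months:", tied_months)
print("tied observations, years: ", tied_years)
```

Measured in months every duration is unique; measured in completed years nearly all observations are tied, which is the situation discrete-time methods are designed for.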
Introduction
1. Applied longitudinal data analysis: example
2. Modeling event-occurrence: methodological features
3. Event-history data
Event history data
• Event History:
- longitudinal record of all the changes in qualitative variables and their timing
- continuous observation (i.e. independent of waves,…)
- if studying causes of events, histories should include data on explanatory variables
- explanatory variables may be constant in time (e.g. race, sex,…)
- explanatory variables may change in time (e.g. income, treatment,…)
- N=1773 women who had 1st child between 1968-1990
- follow-up from 1st birth until 2nd birth/end 1990
• Parallel histories:
Time-varying covariates, lagged 1 year
Second birth: FFS1991
Hinde, 1998, 12.
. tab time birth2
| RECODE of RKIDBY2
time | 0 1 | Total
-----------+----------------------+----------
0 | 104 26 | 130
1 | 102 163 | 265
2 | 53 407 | 460
3 | 42 244 | 286
4 | 33 141 | 174
5 | 31 67 | 98
6 | 35 38 | 73
7 | 23 28 | 51
8 | 23 11 | 34
9 | 31 12 | 43
10 | 17 7 | 24
11 | 18 5 | 23
12 | 18 5 | 23
13 | 27 2 | 29
14 | 18 2 | 20
15 | 8 0 | 8
16 | 8 0 | 8
17 | 8 2 | 10
18 | 8 0 | 8
19 | 5 0 | 5
21 | 1 0 | 1
-----------+----------------------+----------
Total | 613 1,160 | 1,773
Descriptive discrete-time methods
• Life Table
Tracks the event histories (“lives”) of a sample of individuals from the beginning of time (when no one has yet experienced the target event) through the end of data collection.
• Calendar time vs Duration
- Event histories recorded in calendar time are re-arranged as a function of duration since entry into the initial state
• Time intervals
- Duration of time is divided into a number of substantively meaningful time intervals
- each interval includes its initial time, denoted “[”, and excludes its concluding time, denoted “)”, e.g. [0,1)
• For each time interval:
- risk set, i.e. number who entered the interval
- number who made transition/experienced event
- number censored during the interval
Life Table
Singer et al., 2003, 326-356; Lesthaeghe, 1996.
Descriptive discrete-time methods
• Definition
- the number of people who enter each successive period
- the number eligible to experience an event in the interval
• Irreversible
- once an individual experiences an event or is censored the individual drops out of the risk set of future intervals
- everyone is retained in risk set until last moment of eligibility
• Assumption of non-informative censoring
- if censoring is non-informative, the risk set can be assumed to represent all individuals who would have been at risk of event occurrence if everyone could have been followed that long.
- under the assumption of non-informative censoring the experience of each interval’s risk set can be generalized back to the entire population
Risk Set
Singer et al., 2003, 326-356; Lesthaeghe, 1996.
Descriptive discrete-time methods
• Discrete random variable T
The value Ti indicates the time period j in which individual i experiences the target event: e.g. Ti = 1 if individual i experiences the event in period 1
• Probability density function
Probability that individual i will experience the event in time period j: Pr[Ti = j]
• Discrete-time hazard h(tij)
Conditional probability that individual i will experience the event in time period j, given that he or she did not experience the event in an earlier time period: h(tij) = Pr[Ti = j | Ti ≥ j]
• Hazard function
The set of h(tij) is the hazard function for individual i. If individuals are not distinguished on the basis of predictors, the hazard function for a random member of the population is denoted as h(t).
n eventsj represents the number of individuals who experience the target event in time period j and n at riskj represents the number of individuals at risk during time period j:
• maximum likelihood estimate of the discrete-time hazard function
• discrete limit of the Kaplan-Meier estimate of the hazard for continuous-time data
• As the discrete-time hazard is a conditional probability, it is bounded by 0 and 1:
Estimating discrete-time hazards
Singer et al., 2003, 326-356.
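Estimated discrete-time hazard:
\[ \hat{h}(t_j) = \frac{n\ \text{events}_j}{n\ \text{at risk}_j} \]
Bounds of the hazard as a conditional probability:
\[ 0 \le \hat{h}(t_{ij}) \le 1 \]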
Descriptive discrete-time methods
Calculation:
• h(tj): fraction of the risk set in period j experiencing target event in that period
• standard error of h(tj) obtained as standard error for a proportion
(e.g. square root of pq/N):
Rules of Thumb:
• the closer the hazard is to 0.50 the less precise the estimate; the closer to 0 or 1 the
more precise: usually hazards are small and thus estimated precisely.
• the larger the risk set, the more precise the estimate of the hazard; the smaller the
risk set, the less precise.
• Estimated standard errors are larger when fewer people are at risk. As the risk set
declines over time, hazards estimated at later durations tend to be less precise than
those estimated at earlier durations.
Standard Error of Estimated Hazard Probabilities
Singer et al., 2003, 325-356.
\[ se(\hat{h}(t_j)) = \sqrt{\frac{\hat{h}(t_j)\,(1 - \hat{h}(t_j))}{n\ \text{at risk}_j}} \]
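The rules of thumb can be checked numerically with a short Python sketch of the standard error of a proportion. The counts (407 events among 1378 at risk at duration [2,3), 2 events among 24 at risk at duration [17,18)) are taken from the second-birth life table; the function name `hazard_se` is an illustrative choice.

```python
import math

def hazard_se(n_events, n_at_risk):
    """Estimated hazard and its standard error, treating the hazard as a proportion."""
    h = n_events / n_at_risk
    se = math.sqrt(h * (1.0 - h) / n_at_risk)
    return h, se

# Large risk set early in the process
h_early, se_early = hazard_se(407, 1378)
# Small risk set late in the process
h_late, se_late = hazard_se(2, 24)

print(f"[2,3):   h = {h_early:.4f}, se = {se_early:.4f}")
print(f"[17,18): h = {h_late:.4f}, se = {se_late:.4f}")
```

Although the late hazard is much closer to 0, its standard error is several times larger because only 24 women remain at risk.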
Descriptive discrete-time methods
Second birth: life table
Hinde, 1998, 12.
Interval   Risk set   Events        Censoring     Discrete-time
           at start   in interval   in interval   hazard
[0,1)      1773       26            104           0.0147
[1,2)      1643       163           102           0.0992
[2,3)      1378       407           53            0.2954
[3,4)      918        244           42            0.2658
[4,5)      632        141           33            0.2231
[5,6)      458        67            31            0.1463
[6,7)      360        38            35            0.1056
[7,8)      287        28            23            0.0976
[8,9)      236        11            23            0.0466
[9,10)     202        12            31            0.0594
[10,11)    159        7             17            0.0440
[11,12)    135        5             18            0.0370
[12,13)    112        5             18            0.0446
[13,14)    89         2             27            0.0225
[14,15)    60         2             18            0.0333
[15,16)    40         0             8             0.0000
[16,17)    32         0             8             0.0000
[17,18)    24         2             8             0.0833
[18,19)    14         0             8             0.0000
[19,20)    6          0             5             0.0000
[20,21)    1          0             0             0.0000
[21,22)    1          0             1             0.0000
Total      8560       1160          613
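The construction of this life table can be sketched in Python. The event and censoring counts are copied from the table above; the function name `life_table` and the tuple layout are illustrative choices, not part of the course material.

```python
# Events and censored cases per yearly interval [j, j+1), j = 0..21,
# copied from the second-birth life table.
events = [26, 163, 407, 244, 141, 67, 38, 28, 11, 12, 7,
          5, 5, 2, 2, 0, 0, 2, 0, 0, 0, 0]
censored = [104, 102, 53, 42, 33, 31, 35, 23, 23, 31, 17,
            18, 18, 27, 18, 8, 8, 8, 8, 5, 0, 1]

def life_table(events, censored):
    """Return rows (interval start, risk set, events, censored, hazard)."""
    at_risk = sum(events) + sum(censored)  # everyone is at risk at duration 0
    rows = []
    for j, (d, c) in enumerate(zip(events, censored)):
        hazard = d / at_risk if at_risk else 0.0
        rows.append((j, at_risk, d, c, hazard))
        at_risk -= d + c  # events and censored cases leave the risk set
    return rows

table = life_table(events, censored)
for j, n, d, c, h in table[:5]:
    print(f"[{j},{j + 1})  risk set={n:5d}  events={d:4d}  "
          f"censored={c:4d}  hazard={h:.4f}")
print("sum of risk sets:", sum(row[1] for row in table))
```

Each risk set equals the previous risk set minus the events and censored cases of the previous interval, and the sum of the risk sets reproduces the total of 8560 person-years.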
Descriptive discrete-time methods
• Plot of the discrete-time hazard
- Plot the discrete-time hazard probabilities over exposure/duration as a series of points joined together by line segments.
- identify periods of high risk
- characterize the shape of the hazard function,
i.e. modeling the discrete-time function
Graphical representation of the hazard function
Singer et al., 2003, 326-356.
The largest number of events does not
necessarily occur in a period where
the hazard is high. The number of people affected by the hazard in each
The survival probability S(tij) is the probability that individual i will survive past time period j, i.e. the probability that individual i does not experience the event in time period j or any earlier time period
• Discrete variable T:
• Survivor function
The set of S(tij) for an individual is the individual’s survivor function. When individuals are not distinguished on the basis of predictors, the survivor function for a random member of the population is denoted as S(t).
Life-table functions: survivor function, S(tij)
Singer et al., 2003, 326-356.
\[ S(t_{ij}) = \Pr[T_i > j] \]
Descriptive discrete-time methods
• S(t0)=1
In the beginning of time when no one has yet experienced the event S(t)=1
• Behaviour of the survivor function over time:
- S(t) declines over time toward its lower bound 0: S(t) never increases!
- S(t) declines quickly if the discrete-time hazard is high
- S(t) declines slowly if the discrete-time hazard is low
- S(t) remains constant if the discrete-time hazard = 0
- S(t) does not necessarily reach 0 by the end of the observation period
• S(t) cumulates risk over the observation period to estimate the fraction of the initial population surviving up to each successive time period:
• S(t) indicates the proportion of people exposed to each period’s hazard:
- h(t) is high when S(t) is high = many people affected
- h(t) is high when S(t) is low = few people affected
Characteristics of S(tij)
Singer et al., 2003, 326-356.
Descriptive discrete-time methods
• Direct method
- only applicable for intervals preceding first instance of censoring
- number of individuals surviving is not affected by censoring
• Indirect Method
- equally applicable in the presence of censoring
- 1 − h(tj) denotes the probability of surviving interval j
Maximum Likelihood Estimators of S(t)
Singer et al., 2003, 326-356.
Direct method:
\[ \hat{S}(t_j) = \frac{n \text{ who have not experienced the event by the end of time period } j}{n \text{ in the data set}} \]
The estimated survivor function provides maximum likelihood estimates of the probability that an individual randomly selected from the population will survive through each successive time period
• Extrapolation of sample experience
- observed risk sets decline as a result of censoring and event occurrence
- the estimated survivor function declines only as a result of event occurrence
- assuming independent censoring makes it possible to estimate what would have happened to the initial population had there been no censoring
Maximum Likelihood Estimators of S(t)
Singer et al., 2003, 326-356.
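A minimal Python sketch of the indirect method, assuming it multiplies the interval-specific survival probabilities (one minus the estimated hazard) over successive intervals. The counts are those of the second-birth example; `survivor_function` is an illustrative name.

```python
# Events and censored cases per yearly interval, second-birth example
events = [26, 163, 407, 244, 141, 67, 38, 28, 11, 12, 7,
          5, 5, 2, 2, 0, 0, 2, 0, 0, 0, 0]
censored = [104, 102, 53, 42, 33, 31, 35, 23, 23, 31, 17,
            18, 18, 27, 18, 8, 8, 8, 8, 5, 0, 1]

def survivor_function(events, censored):
    """Estimated probability of surviving through the end of each interval."""
    at_risk = sum(events) + sum(censored)
    surv, s = [], 1.0
    for d, c in zip(events, censored):
        hazard = d / at_risk if at_risk else 0.0
        s *= 1.0 - hazard          # survive this interval given survival so far
        surv.append(s)
        at_risk -= d + c           # censoring shrinks the risk set, not S(t)
    return surv

S = survivor_function(events, censored)
for j, s in enumerate(S[:5]):
    print(f"S at end of [{j},{j + 1}) = {s:.4f}")
```

Censored cases lower the denominator of later hazards but never lower the survivor function directly, which is why the estimate can be read as what would have happened without censoring.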
Descriptive discrete-time methods
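• More complex than the standard error of the estimated hazard, se(h(tj)):
The survival probability in period j is estimated as the product of the conditional survival probabilities 1 − h(t) in period j and all
previous periods
• Greenwood’s approximation:
\[ se(\hat{S}(t_j)) = \hat{S}(t_j)\sqrt{\frac{\hat{h}(t_1)}{n_1(1-\hat{h}(t_1))} + \frac{\hat{h}(t_2)}{n_2(1-\hat{h}(t_2))} + \dots + \frac{\hat{h}(t_j)}{n_j(1-\hat{h}(t_j))}} = \hat{S}(t_j)\sqrt{\sum_{i=1}^{j}\frac{\hat{h}(t_i)}{n_i(1-\hat{h}(t_i))}} \]
where n_i is the number at risk in period i
• Approximation unreliable for periods where risk set drops below N=20
Standard Error of Estimated Survival Probabilities
Singer et al., 2003, 326-356.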
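Greenwood's approximation can be sketched in Python as a running sum under the square root. The events and risk sets are those of the first five intervals of the second-birth life table; the function name `greenwood` is an illustrative choice, and the sketch assumes every hazard is below 1.

```python
import math

def greenwood(events, at_risk):
    """Return [(S_hat, se_S_hat), ...] for successive periods.

    Assumes each estimated hazard is strictly below 1."""
    results = []
    surv, cum = 1.0, 0.0
    for d, n in zip(events, at_risk):
        h = d / n
        surv *= 1.0 - h                 # product of (1 - hazard)
        cum += h / (n * (1.0 - h))      # running Greenwood sum
        results.append((surv, surv * math.sqrt(cum)))
    return results

# First five intervals of the second-birth life table
events = [26, 163, 407, 244, 141]
at_risk = [1773, 1643, 1378, 918, 632]

estimates = greenwood(events, at_risk)
for j, (s, se) in enumerate(estimates):
    print(f"end of [{j},{j + 1}): S = {s:.4f}, se = {se:.4f}")
```

In the first period the result coincides with the standard error of a proportion, since the survivor function there is simply one minus the hazard.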
Descriptive discrete-time methods
Second birth: survivor function
Singer et al., 2003, 326-356.
- As a result of censoring, the sample mean is inappropriate to identify the center of the distribution of T
- the estimated median lifetime identifies the value of T for which the estimated survivor function equals 0.50: half of the population leaves the initial state before the estimated median lifetime, the other half leaves after the median lifetime or never leaves at all
- a precise estimate is obtained through linear interpolation:
where m represents the time interval in which the sample survivor function is just above 0.50, S(tm) is the value of the sample survivor function in that interval and S(tm+1) is
the value of the sample survivor function in the following interval
Life-table functions: Median Lifetime
Singer et al., 2003, 326-356.
\[ \text{Estimated median lifetime} = m + \left[\frac{\hat{S}(t_m) - 0.50}{\hat{S}(t_m) - \hat{S}(t_{m+1})}\right]\,((m+1) - m) \]
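The interpolation can be sketched in Python. The survivor values are the first six entries of the survivor function column in the second-birth life table, read here as survival up to exact duration m; that alignment is an assumption of this sketch, and `median_lifetime` is an illustrative name.

```python
def median_lifetime(surv):
    """Linear interpolation of the duration at which S(t) crosses 0.50.

    surv[m] is the estimated survivor function at duration m (surv[0] = 1.0).
    Returns None if the survivor function never drops below 0.50."""
    for m in range(len(surv) - 1):
        if surv[m] >= 0.50 > surv[m + 1]:
            share = (surv[m] - 0.50) / (surv[m] - surv[m + 1])
            return m + share * ((m + 1) - m)
    return None

# Survivor function column, second-birth example (durations 0 to 5)
surv = [1.0000, 0.9853, 0.8876, 0.6254, 0.4592, 0.3567]

median = median_lifetime(surv)
print(f"estimated median lifetime: {median:.2f} years")
```

Under this reading, the survivor function crosses 0.50 between durations 3 and 4, about three quarters of the way through that interval.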
Descriptive discrete-time methods
• 3 important limitations
- the estimated median lifetime merely identifies an “average” lifetime: it says little about the distribution of event times and is relatively insensitive to extreme values
- the median lifetime does not necessarily reflect the moment when h(tj) is particularly elevated, it merely indicates the point where S(tj) declines below 0.50
- the median lifetime says little about the distribution of the hazard: identical median lifetimes can result from dramatically different hazard functions
Life-table functions: Median Lifetime
Singer et al., 2003, 326-356.
Descriptive discrete-time methods
Second birth: FFS1991
Hinde, 1998, 12.
Interval   Risk set   Events        Censoring     Discrete-time   Probability   Survivor
           at start   in interval   in interval   hazard          of survival   function
[0,1)      1773       26            104           0.0147          0.9853        1.0000
[1,2)      1643       163           102           0.0992          0.9008        0.9853
[2,3)      1378       407           53            0.2954          0.7046        0.8876
[3,4)      918        244           42            0.2658          0.7342        0.6254
[4,5)      632        141           33            0.2231          0.7769        0.4592
[5,6)      458        67            31            0.1463          0.8537        0.3567
[6,7)      360        38            35            0.1056          0.8944        0.3046
[7,8)      287        28            23            0.0976          0.9024        0.2724
[8,9)      236        11            23            0.0466          0.9534        0.2458
[9,10)     202        12            31            0.0594          0.9406        0.2344
[10,11)    159        7             17            0.0440          0.9560        0.2205
[11,12)    135        5             18            0.0370          0.9630        0.2107
[12,13)    112        5             18            0.0446          0.9554        0.2029
[13,14)    89         2             27            0.0225          0.9775        0.1939
[14,15)    60         2             18            0.0333          0.9667        0.1895
[15,16)    40         0             8             0.0000          1.0000        0.1832
[16,17)    32         0             8             0.0000          1.0000        0.1832
[17,18)    24         2             8             0.0833          0.9167        0.1832
[18,19)    14         0             8             0.0000          1.0000        0.1679
[19,20)    6          0             5             0.0000          1.0000        0.1679
[20,21)    1          0             0             0.0000          1.0000        0.1679
[21,22)    1          0             1             0.0000          1.0000        0.1679
Total      8560       1160          613
Descriptive discrete-time methods
• Hazard h(t)
Unlike the probability density function, the conditional character of the hazard h(t) provides an unbiased description of how the process is evolving at each point in time
• Survivor function S(t)
- the survivor function provides the proportion surviving in the initial state at each point t
- S(t) has a ‘memory’: the proportion surviving at t largely depends on the magnitude of the hazards prior to t and only marginally on the hazard at t
- the number of events occurring at each point in time is affected by the magnitude of h(t) and the proportion of the population still at risk of experiencing the event:
small values of h(t) at the beginning of exposure may affect much larger numbers of people than large h(t) by the end of the exposure (e.g. prostate cancer).
• Median Lifetime
- intuitive description of time-to-event data
- very different types of hazard functions may have the same median lifetime
- complement with other descriptives: percentiles, S(t) at fixed durations,…
Life table functions: conclusions
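The point that the number of events depends on both the hazard and the size of the risk set can be illustrated with two rows of the second-birth life table. A minimal Python sketch; the dictionary layout is an illustrative choice.

```python
# (risk set at start, events) for two intervals of the second-birth life table
rows = {"[1,2)": (1643, 163), "[17,18)": (24, 2)}

summary = {}
for interval, (at_risk, n_events) in rows.items():
    hazard = n_events / at_risk
    summary[interval] = (hazard, n_events)
    print(f"{interval:8s} hazard = {hazard:.4f}  risk set = {at_risk:4d}  "
          f"events = {n_events}")
```

The two hazards are of similar magnitude, yet the early interval produces 163 events and the late one only 2, because far fewer women are still at risk.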
Descriptive discrete-time methods
1. Single-decrement life table
2. Person-period files & descriptive statistics
N and Risk Sets in construction of discrete-time life-table:
• h(tj) relates events in period j to risk set of period j
• N individuals are included in the risk set of each period in which they are at risk
• Individuals contribute several person-periods to discrete-time life-table
• Let N represent the number of individuals:
sum of risk sets in life-table = N*exposure
Discrete-time Event History Analysis in Practice:
• Data are rearranged from Person-format to Person-Period Format:
- Person File: N records represent data on N individuals
- Person-Period File: ‘N*Exposure’-records represent data on N individuals
Discrete-time Life-table: Person-Period File
Descriptive discrete-time methods
Principles of the Person-Period File (PPF):
a) for each unit of time that an individual is known to be at risk, a separate
observational record is created, i.e. a person-period
b) For each person-period:
- time variable indicating time-period j of the record in question
- event indicator is coded 0 for all time-periods except the last. In the last period the
event indicator is coded 0 for censored individuals and 1 for individuals
experiencing target event
- explanatory variables are assigned the values they take on in each person-year:
same value in each person period for time-constant variables; current or lagged
values for time-varying covariates
c) Person periods are pooled in a single sample. N of records in person-period file
(PPF) is equal to sum of risk sets in life table throughout observation period.
Discrete-time Life-table: Person-Period File
Singer et al., 2003, 326-356.
Descriptive discrete-time methods
Person File:
ID   BIRTHY  EDUC  KID1  KID2  T  EVENT
1    1962    12    1988  -     2  0
3    1954    11    1978  1979  1  1
21   1951    3     1973  1979  6  1
Person-Period File*:
ID   BIRTHY  EDUC  KID1  KID2  YEAR  T  EVENT
1    1962    12    1988  -     1988  0  0
1    1962    12    1988  -     1989  1  0
1    1962    12    1988  -     1990  2  0
3    1954    11    1978  1979  1978  0  0
3    1954    11    1978  1979  1979  1  1
21   1951    3     1973  1979  1973  0  0
21   1951    3     1973  1979  1974  1  0
21   1951    3     1973  1979  1975  2  0
21   1951    3     1973  1979  1976  3  0
21   1951    3     1973  1979  1977  4  0
21   1951    3     1973  1979  1978  5  0
21   1951    3     1973  1979  1979  6  1
* Observation ends in 1990 as 1991 is only partially observed
Second birth: Construction of person-period file
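The expansion from person file to person-period file can be sketched in Python with the three example records. The dictionary keys mirror the column names of the example; `to_person_period` is an illustrative helper, not a routine from the course.

```python
# Person file: one record per woman (values from the example)
persons = [
    {"id": 1, "birthy": 1962, "educ": 12, "kid1": 1988, "kid2": None, "t": 2, "event": 0},
    {"id": 3, "birthy": 1954, "educ": 11, "kid1": 1978, "kid2": 1979, "t": 1, "event": 1},
    {"id": 21, "birthy": 1951, "educ": 3, "kid1": 1973, "kid2": 1979, "t": 6, "event": 1},
]

def to_person_period(persons):
    """Create one record for every year a woman is at risk of a second birth."""
    rows = []
    for p in persons:
        for t in range(p["t"] + 1):
            row = {k: p[k] for k in ("id", "birthy", "educ", "kid1", "kid2")}
            row["year"] = p["kid1"] + t   # calendar year of this person-year
            row["t"] = t                  # duration since first birth
            # event is 0 in all records except the last one
            row["event"] = p["event"] if t == p["t"] else 0
            rows.append(row)
    return rows

ppf = to_person_period(persons)
for row in ppf:
    print(row["id"], row["year"], row["t"], row["event"])
```

The three women contribute 3, 2 and 7 person-years respectively, so the person-period file has 12 records.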
Descriptive discrete-time methods
• Discrete-time life-table using Person-Period File:
- person-period file & standard cross-tabulation routines:
generate J x 2 table (i.e. period*event)
- each row reflects risk set in period j
- row percentages correspond to estimated hazard in each period j
• Person-period file:
- PPF is also used for fitting discrete-time event history models
- conceptual basis for continuous-time models
Person-Period File: Estimation of Discrete-time Life-table
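The cross-tabulation approach can be sketched in Python. Person-level records are rebuilt from the counts in the Stata tabulation of time by birth2, expanded into person-periods, and tabulated by period and event; the variable names are illustrative.

```python
from collections import Counter

# time: (censored, events), copied from the Stata tabulation of time by birth2
tab = {0: (104, 26), 1: (102, 163), 2: (53, 407), 3: (42, 244), 4: (33, 141),
       5: (31, 67), 6: (35, 38), 7: (23, 28), 8: (23, 11), 9: (31, 12),
       10: (17, 7), 11: (18, 5), 12: (18, 5), 13: (27, 2), 14: (18, 2),
       15: (8, 0), 16: (8, 0), 17: (8, 2), 18: (8, 0), 19: (5, 0), 21: (1, 0)}

# Person file: one (time, event) pair per woman
persons = []
for time, (n_cens, n_event) in tab.items():
    persons += [(time, 0)] * n_cens + [(time, 1)] * n_event

# Person-period file: one (period, event) pair per year at risk
person_periods = [(t, event if t == time else 0)
                  for time, event in persons
                  for t in range(time + 1)]

# J x 2 table: rows are periods, row proportions are the hazards
risk_set = Counter(t for t, _ in person_periods)
n_events = Counter(t for t, e in person_periods if e == 1)
hazard = {t: n_events[t] / risk_set[t] for t in sorted(risk_set)}

print("persons:", len(persons), " person-periods:", len(person_periods))
for t in range(5):
    print(t, risk_set[t], n_events[t], round(hazard[t], 4))
```

The number of person-period records equals the sum of the risk sets in the life table, and the row proportions reproduce the life-table hazards.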
Descriptive discrete-time methods
Setting up a person-period file
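* SETTING UP A PERSON-PERIOD FILE
* one record per year at risk: durations 0 up to and including time
generate npyears = time+1
expand npyears
* duration clock within each woman (0, 1, ..., time)
bysort id: generate time_tv = _n - 1
* event indicator: 0 in every record except the last, which takes the observed status
generate birth2_tv = 0
replace birth2_tv = birth2 if time_tv == time
* calendar year and age of the woman in each person-year
generate year_tv = yrbrnkid1 + time_tv
generate age_tv = year_tv - yrbrn
* ESTIMATE OF h(t) USING CROSSTABULATION
tab time_tv birth2_tv, row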
* CALCULATING OBSERVED VALUE OF Pr(T = t | T >= t) AT EACH TIME t
* CAUSE-SPECIFIC HAZARD FUNCTIONS: OTHER DISEASES
stset MONTH_CONT, fail(CAUSE==4)
stcox i.EDUCATION
Allison, 1982; Allison, 1984, 42-50.
Advanced Topics
Continuous-time Models for Competing Risks
• Similarity of covariate effects
null hypothesis that the set of coefficients associated with covariate effects is identical
across event types is tested by comparing goodness-of-fit statistics for the separate
models to that of the global model that does not distinguish between event types:
which is chi-square distributed with p(k-1) degrees of freedom, where p is the number
of predictors and k is the number of competing event-types.
Singer et al., 2003, 586-595.
\[ \left(-2LL_{\text{global model}}\right) - \sum_{\text{event types}} \left(-2LL_{\text{event-specific model}}\right) \]
Advanced Topics
Continuous-time Models for Competing Risks
Overall mortality (global model): -2LL = 33664.643
Cause-specific mortality (sum over event-specific models): -2LL = 33643.214
  Cancer: -2LL = 12176.997
  Circulatory: -2LL = 11287.542
  Respiratory: -2LL = 3902.120
  Other: -2LL = 6276.555
Difference in -2LL: 21.429 (df=12)
The effect of EDUCATION differs significantly across causes of mortality at the 5% level: 21.429 exceeds the critical chi-square value of 21.026 for df=12
Allison, 1982; Allison, 1984, 42-50.
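The test can be reproduced with a short Python sketch. The -2LL values are those reported above; p = 4 is inferred from df = 12 with k = 4 causes and is an assumption of this sketch, and `chi2_sf_even_df` is an illustrative helper that uses the closed-form upper tail of the chi-square distribution, valid for even degrees of freedom only.

```python
import math

def chi2_sf_even_df(x, df):
    """Upper-tail probability of a chi-square variate (even df only)."""
    if df <= 0 or df % 2:
        raise ValueError("closed form requires positive, even df")
    half = x / 2.0
    terms = (half ** i / math.factorial(i) for i in range(df // 2))
    return math.exp(-half) * sum(terms)

global_m2ll = 33664.643
# Cause-specific -2LL values in the order listed on the slide
cause_m2ll = [12176.997, 11287.542, 3902.120, 6276.555]

statistic = global_m2ll - sum(cause_m2ll)
p, k = 4, len(cause_m2ll)   # assumption: 4 EDUCATION coefficients per model
df = p * (k - 1)
p_value = chi2_sf_even_df(statistic, df)

print(f"chi-square = {statistic:.3f}, df = {df}, p = {p_value:.3f}")
```

The p-value falls just below 0.05, so the null hypothesis of identical EDUCATION effects across causes is rejected at the 5% level but not at the 1% level.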
Advanced Topics
Competing Risks: alternative approach
• Treating competing risks as censoring may yield biased estimates of cause-specific hazard functions: it assumes that individuals would have had the same risk of experiencing the event had they not experienced the competing event (e.g. entry into marriage vs unmarried cohabitation)
• Alternative strategy: nested model
- fit a discrete-time or continuous-time hazard model for partnership formation (event occurrence)
- estimate a binary or multinomial logit model of partnership type for individuals experiencing the event