
ST495: Survival Analysis: Maximum Likelihood

Eric B. Laber

Department of Statistics, North Carolina State University

February 11, 2014

Everything is deception: seeking the minimum of illusion, keeping within the ordinary limitations, seeking the maximum. In the first case one cheats the Good, by trying to make it too easy for oneself to get it, and the Evil by imposing all too unfavorable conditions of warfare on it. In the second case one cheats the Good by keeping as aloof from it as possible, and the Evil by hoping to make it powerless through intensifying it to the utmost. —Franz Kafka

Last time

- Introduced parametric models commonly used in survival analysis; discussed their densities, hazards, survivor functions, and CDFs; and showed how to draw from these distributions using R
  - Exponential
  - Weibull
  - Gamma
  - Extreme value
  - Log-normal
  - Log-logistic

- We also discussed location-scale models and how to use the location-scale framework to incorporate covariate information

- We also discussed flexible models for the hazard function, including piecewise constant hazards and basis expansions

Last time: burning questions

- How do we choose a distribution?

- How do we estimate the parameters indexing a chosen distribution?

- How can we accommodate different types of censoring?

- How can I use R to do the foregoing estimation steps?

Warm-up

- Explain to your stat buddy:
  1. The hazard function
  2. How a bathtub hazard might arise
  3. How an increasing hazard might arise
  4. How a decreasing hazard might arise

- True or false:
  - (T/F) The minimum of independent Weibull r.v.'s is Weibull
  - (T/F) Measles increases fecundity in goats
  - (T/F) An exponential distribution would be a good model for mortality in humans

- Who is generally credited with discovering maximum likelihood estimation?

Warm-up cont’d

- Some concepts and notation for today:
  - Let $a_1, \ldots, a_n$ be a sequence of constants; then

    $$\prod_{i=1}^n a_i = a_1 \times a_2 \times \cdots \times a_n$$

  - Let $f(x)$ denote a function from $\mathbb{R}^p$ into $\mathbb{R}$; then

    $$x^* = \arg\max_x f(x)$$

    satisfies $f(x^*) \ge f(x)$ for all $x \in \mathbb{R}^p$.

  - We use $1_{\mathrm{Statement}}$ to denote the function that equals one if Statement is true and zero otherwise. Thus, $1_{t \le 1}$ equals one if $t \le 1$ and zero otherwise.

Observation schemes

- Why is survival analysis its own sub-field of statistics?
  - Abundance of important applications
  - Fundamental contributions to statistical theory (esp. in semi-parametrics)
  - Dealing with partial information due to censoring

- Recall that when we only observe partial information about a failure time we say that it is censored
  - $T \ge C$ (right censored)
  - $T \le L$ (left censored)
  - $V \le T < U$ (interval censored)

Likelihood

- If the generative model is indexed by a parameter $\theta$, then the likelihood is

  $$L(\theta) \propto P(\text{Data}; \theta),$$

  which is viewed as a function of $\theta$ with the data held fixed

- The maximum likelihood estimator is

  $$\hat{\theta}_n = \arg\max_\theta L(\theta)$$

- Warm-up: Let $X_1, \ldots, X_n \sim N(\mu, \sigma^2)$, so that $\theta = (\mu, \sigma^2)$; derive $L(\theta)$ and $\hat{\theta}_n$. Check your answer with your stat buddy (and against the sketch below).
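A minimal R check of the warm-up, assuming the familiar closed forms ($\hat{\mu}$ is the sample mean and $\hat{\sigma}^2$ the average squared deviation with divisor $n$, not $n-1$); the numerical maximizer optim is used only to corroborate:

# Closed-form normal MLE vs. numerical maximization (sketch)
set.seed(1)
n = 100
x = rnorm(n, mean = 2, sd = 3)

muHat = mean(x)                  # MLE for mu
sigma2Hat = mean((x - muHat)^2)  # MLE for sigma^2 (divisor n, not n - 1)

# Numerical check: maximize the log-LH over (mu, log sd)
negLogLik = function(par) -sum(dnorm(x, mean = par[1], sd = exp(par[2]), log = TRUE))
fit = optim(c(0, 0), negLogLik)
c(muHat, sigma2Hat)                 # closed form
c(fit$par[1], exp(2 * fit$par[2]))  # numerical (mu, sigma^2); should agree

Optimizing over the log standard deviation keeps the variance positive without constrained optimization.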

Likelihood cont’d

- Ex. Let $T_1, \ldots, T_n$ be an iid draw from a distribution with density $f(t; \theta)$ indexed by $\theta$; then

  $$L(\theta) \propto P(\text{Data}; \theta) = \prod_{i=1}^n P(T_i = t_i; \theta) = \prod_{i=1}^n f(t_i; \theta)$$

- Let $T$ denote a generic observation distributed according to $f(t; \theta)$. How can we use $L(\theta)$ to estimate (see the plug-in sketch below):
  - The mean of $T$?
  - The CDF of $T$?
  - The hazard of $T$?
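One answer is the plug-in principle: maximize $L(\theta)$ to get $\hat{\theta}_n$, then evaluate each functional at $\hat{\theta}_n$. A sketch under an assumed exponential working model, where all three functionals have closed forms:

# Plug-in estimation under an exponential working model (sketch)
set.seed(2)
t = rexp(200, rate = 0.5)             # simulated failure times

lambdaHat = 1 / mean(t)               # MLE of the exponential rate

meanHat = 1 / lambdaHat                           # estimated E[T]
FHat = function(s) 1 - exp(-lambdaHat * s)        # estimated CDF
hazardHat = function(s) rep(lambdaHat, length(s)) # estimated (constant) hazard

c(meanHat, FHat(2), hazardHat(2))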

Nonparametric maximum likelihood

- Ex. Let $T_1, \ldots, T_n$ be an iid draw from a distribution with density $f(t)$. Suppose now, however, that we don't put any restrictions on $f(t)$ (other than it being a density). In this case $f$ is our 'parameter':

  $$L(f) \propto P(\text{Data}) = \prod_{i=1}^n f(t_i),$$

  how can we maximize this over densities $f$?

- Claim: Our estimated $f$, say $\hat{f}$, should only put positive mass on $t_1, \ldots, t_n$. (Why?)

- If $\hat{f}$ puts mass only on $t_1, \ldots, t_n$, then maximizing the likelihood is equivalent to solving

  $$\max_{\alpha_1, \ldots, \alpha_n \ge 0} \prod_{i=1}^n \alpha_i \quad \text{subj. to} \quad \sum_{i=1}^n \alpha_i = 1$$

- Some painful calculus (e.g., a Lagrange-multiplier argument, which forces the $\alpha_i$ to be equal at the maximum) shows $\hat{f}$ is the pmf with $\hat{f}(t_i) = 1/n$, $i = 1, \ldots, n$

Nonparametric maximum likelihood cont’d

- Thus $\hat{f}(t)$ is a discrete distribution with $\hat{f}(t_i) = 1/n$. The estimated CDF is given by

  $$\hat{F}(t) = \frac{1}{n} \sum_{i=1}^n 1_{t_i \le t},$$

  and is called the empirical cumulative distribution function (ECDF)

Computing ECDF in R

n = 50
x = rnorm(n)            # n draws from a standard normal
FHat = ecdf(x)          # the ECDF as a step function
plot(FHat, xlab = 'x')  # step plot of the ECDF

[Figure: step plot of the ECDF, Fn(x) versus x, for the 50 simulated normal draws]

Observation schemes

Truncated estimation

- Suppose $t_1, \ldots, t_n$ compose a random sample from subjects with lifetimes less than or equal to one year. The LH takes the form:

  $$\prod_{i=1}^n f(t_i \mid T_i \le 1) = \prod_{i=1}^n \left\{ \frac{f(t_i)}{F(1)} \right\},$$

  Why?

- Suppose we posited a parametric model $f(t; \theta)$ for $T$; how can we estimate $\theta$ using truncated data like these? (See the sketch below.)

- Deceptively difficult stat question: Suppose $T_1, \ldots, T_n$ are drawn independently from an $\exp(\theta)$ distribution but are truncated at one. When does the maximum likelihood estimator exist?
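A numerical sketch for the $\exp(\theta)$ case truncated at one: each observation contributes $f(t; \theta)/F(1; \theta)$, and the rate is optimized on the log scale to keep it positive (the simulation setup is illustrative):

# ML for exp(theta) under truncation at one: each observation
# contributes f(t; theta) / F(1; theta) to the likelihood (sketch)
set.seed(3)
theta = 2
raw = rexp(5000, rate = theta)
t = raw[raw <= 1]                 # observe only lifetimes <= 1

negLogLik = function(logTheta) {
  th = exp(logTheta)              # optimize on the log scale so theta > 0
  -sum(dexp(t, rate = th, log = TRUE) - pexp(1, rate = th, log.p = TRUE))
}
fit = optimize(negLogLik, interval = c(-5, 5))
exp(fit$minimum)                  # should be near theta = 2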

Right-censoring

- Recall that a failure time is right censored at $C$ if we only observe that $T > C$

- Our goal in the next few slides is to derive the LH under different right-censoring mechanisms

- Notation: we observe $\{(T_i, \delta_i)\}_{i=1}^n$, where $T_i$ is the observation time and $\delta_i$ is the censoring indicator

  $$\delta_i = \begin{cases} 1 & \text{failure time observed} \\ 0 & \text{right censored} \end{cases}$$

- Big result of the day: Under a variety of right-censoring mechanisms:

  $$\text{LH} \propto \prod_{i=1}^n f(t_i)^{\delta_i} S(t_i+)^{1-\delta_i}$$

Type I censoring

- In Type I censoring each individual has a fixed (non-random) censoring time $C > 0$
  - If $T \le C$ then the failure time is observed
  - If $T > C$ then it is right-censored

- Ex. Odense Malignant Melanoma Data: $n = 205$ subjects enrolled between 1962 and 1972 at the Odense Dept. of Plastic Surgery had tumors and surrounding tissue removed. Patients were followed until death or the study's conclusion in 1977. Note: these data are contained in the boot package in R (see below).
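The data can be pulled up directly; the variable coding in the comments follows the boot package documentation:

# Load the Odense melanoma data from the boot package
library(boot)
data(melanoma)
str(melanoma)            # time, status, sex, age, year, thickness, ulcer
table(melanoma$status)   # 1 = died of melanoma, 2 = alive in 1977, 3 = other cause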

Type I censoring cont’d

- Using the book's notation: define $t_i = \min(T_i, C_i)$ and $\delta_i = 1_{T_i \le C_i}$; then:
  - If $\delta_i = 1$, $t_i$ is the failure time, so the information $(t_i, \delta_i)$ 'contributes' $f(t_i)$ to the LH
  - If $\delta_i = 0$, $t_i$ is the censoring time, so the information $(t_i, \delta_i)$ 'contributes' $S(t_i+)$ to the LH

- Thus, the LH is

  $$\prod_{i=1}^n f(t_i)^{\delta_i} S(t_i+)^{1-\delta_i}$$

Type I censoring cont’d

- In class: Suppose $T_1, \ldots, T_n$ are iid $\exp(\theta)$ but subject to Type I censoring, and let $\delta_1, \ldots, \delta_n$ denote the censoring indicators. Derive the MLE for $\theta$. (A simulation check follows below.)
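As a check on your derivation: the exercise works out to $\hat{\theta}_n = \sum_i \delta_i / \sum_i t_i$ (the inference example later in the deck derives this form in detail), which a quick simulation corroborates:

# Simulation check: exp(theta) lifetimes with Type I censoring at C
set.seed(4)
theta = 0.5; n = 1000; C = 2          # C is a fixed censoring time
T = rexp(n, rate = theta)
t = pmin(T, C)                        # observation times
delta = as.numeric(T <= C)            # 1 = failure observed, 0 = censored

thetaHat = sum(delta) / sum(t)        # MLE
thetaHat                              # should be near 0.5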

Independent random censoring

- Assume the lifetime $T$ and censoring time $C$ are random variables
  - Often more realistic
  - Ex. Random study enrollment times
  - Ex. Subjects moving out of town
  - ...

- Let $G(t)$ and $g(t)$ denote the survivor and density functions for $C$, respectively, and define $t_i = \min(T_i, C_i)$, $\delta_i = 1_{T_i \le C_i}$; then

  $$f(t_i, \delta_i) = [f(t_i) G(t_i+)]^{\delta_i} [S(t_i+) g(t_i)]^{1-\delta_i},$$

  Why?

Independent random censoring cont’d

- The LH for $n$ iid observations is

  $$\left( \prod_{i=1}^n f(t_i)^{\delta_i} S(t_i+)^{1-\delta_i} \right) \left( \prod_{i=1}^n g(t_i)^{1-\delta_i} G(t_i+)^{\delta_i} \right);$$

  note that if $g(t)$ and $G(t)$ do not contain information about $f(t)$ or $S(t)$, then the LH is proportional to

  $$\prod_{i=1}^n f(t_i)^{\delta_i} S(t_i+)^{1-\delta_i}$$

  It's the same LH as before!!! (A fitting sketch follows below.)
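Because the censoring terms drop out, any routine that maximizes $\prod_i f(t_i)^{\delta_i} S(t_i+)^{1-\delta_i}$ applies unchanged under independent random censoring. A sketch using survreg from the survival package; for the exponential, survreg's intercept is the log of the mean, so the rate is the exponential of its negative:

# Fit an exponential model under independent random censoring (sketch)
library(survival)
set.seed(5)
n = 500
T = rexp(n, rate = 1)              # lifetimes
C = rexp(n, rate = 0.5)            # independent censoring times
t = pmin(T, C); delta = as.numeric(T <= C)

fit = survreg(Surv(t, delta) ~ 1, dist = "exponential")
exp(-coef(fit))                    # estimated rate; should be near 1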

Type II censoring

- Observe individuals until the $r$th failure is observed, so that we observe the $r$ smallest lifetimes $t_{(1)} \le \cdots \le t_{(r)}$
  - All $n$ units start at the same time
  - Follow-up stops at the time of the $r$th failure
  - Follow-up time is random

- Using properties of order statistics, the LH is

  $$\frac{n!}{(n-r)!} \left( \prod_{i=1}^r f(t_{(i)}) \right) S(t_{(r)}+)^{n-r} \propto \prod_{i=1}^n f(t_i)^{\delta_i} S(t_i+)^{1-\delta_i}$$
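For $\exp(\theta)$ under Type II censoring the general form reduces to $r$ divided by the "total time on test" (observed lifetimes plus $n - r$ units censored at $t_{(r)}$); a simulation sketch:

# Type II censoring: stop at the r-th failure (sketch)
set.seed(6)
theta = 1; n = 100; r = 60
T = sort(rexp(n, rate = theta))      # order statistics t(1) <= ... <= t(n)
tObs = T[1:r]                        # the r observed (smallest) lifetimes

# Total time on test: observed lifetimes plus (n - r) units censored at t(r)
thetaHat = r / (sum(tObs) + (n - r) * tObs[r])
thetaHat                             # should be near 1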

Code break

Go over mle.R in RStudio

For those who are adventurous

Counting process notation

- Goal: Show that the form of the LH for right-censoring applies in very general settings

- For clarity we assume discrete time $t = 0, 1, \ldots$

- Let $h_i(t)$ and $S_i(t)$ denote the hazard and survivor functions for the $i$th subject, respectively; further define

  $$Y_i(t) \triangleq 1_{T_i \ge t,\ i\text{th subj. not censored}} = \begin{cases} 1 & i\text{th subj. hasn't failed or been censored at } t \\ 0 & \text{otherwise} \end{cases}$$

  if $Y_i(t) = 1$ then we say the $i$th subject is at risk at time $t$

Counting process notation cont’d

- Define

  $$dN_i(t) \triangleq Y_i(t) 1_{T_i = t} = \begin{cases} 1 & \text{if at risk and fails at } t \\ 0 & \text{otherwise} \end{cases}$$

  $$dC_i(t) \triangleq Y_i(t) 1_{i\text{th subj. censored at } t} = \begin{cases} 1 & \text{if at risk and censored at } t \\ 0 & \text{otherwise} \end{cases}$$

- Claim: $\{dN_i(t), dC_i(t),\ t \ge 0\}$ has a single 1 and the rest zeros

Counting process notation cont’d

- Even more definitions:

  $$d\mathbf{N}(t) \triangleq (dN_1(t), \ldots, dN_n(t))$$

  $$d\mathbf{C}(t) \triangleq (dC_1(t), \ldots, dC_n(t))$$

  $$H(t) \triangleq \{(d\mathbf{N}(s), d\mathbf{C}(s)),\ s = 0, 1, \ldots, t-1\}$$

  we say $H(t)$ is the history of the survival process up to time $t$

Counting process notation: the likelihood

- Note that $\lim_{t \to \infty} H(t)$ contains all the information in the collected data (why?), thus

  $$\begin{aligned} P(\text{Data}) &= P(d\mathbf{N}(0))\, P(d\mathbf{C}(0) \mid d\mathbf{N}(0)) \\ &\quad \times P(d\mathbf{N}(1) \mid H(1))\, P(d\mathbf{C}(1) \mid d\mathbf{N}(1), H(1)) \times \cdots \\ &= \prod_{t=0}^\infty P(d\mathbf{N}(t) \mid H(t))\, P(d\mathbf{C}(t) \mid d\mathbf{N}(t), H(t)) \end{aligned}$$

Counting process notation: the likelihood cont’d

- To make this horrible expression tractable we'll assume conditional independence across subjects given $H(t)$ and

  $$P(dN_i(t) = 1 \mid H(t)) = Y_i(t) h_i(t);$$

  explain this expression to your stat buddy

- We will also assume that the terms inside $P(d\mathbf{C}(t) \mid d\mathbf{N}(t), H(t))$ are not informative for the parameters in $h_i(t)$

- When the above assumptions hold we say the censoring is non-informative

Counting process notation: the likelihood cont’d

- Under the foregoing assumptions the LH is given by

  $$\prod_{i=1}^n \prod_{t=0}^\infty h_i(t)^{dN_i(t)} (1 - h_i(t))^{Y_i(t)(1 - dN_i(t))}$$

- To see this, we'll consider two cases:
  - Case 1: the $i$th subject's failure time is observed at $t_i$; then they're at risk at $t = 0, 1, \ldots, t_i$, and $dN_i(t) = 1_{t = t_i}$, thus

    $$\prod_{t=0}^\infty h_i(t)^{dN_i(t)} (1 - h_i(t))^{Y_i(t)(1 - dN_i(t))} = h_i(t_i) \prod_{s=0}^{t_i - 1} (1 - h_i(s)) = f_i(t_i)$$

  - Case 2: the $i$th subject is censored at time $t_i$; then they're at risk at times $t = 0, 1, \ldots, t_i$, and $dC_i(t) = 1_{t = t_i}$, thus

    $$\prod_{t=0}^\infty h_i(t)^{dN_i(t)} (1 - h_i(t))^{Y_i(t)(1 - dN_i(t))} = \prod_{t=0}^{t_i} (1 - h_i(t)) = S_i(t_i + 1)$$

Counting process notation: the likelihood cont’d

- Putting it all together shows the LH is proportional to

  $$\prod_{i=1}^n f_i(t_i)^{\delta_i} S_i(t_i + 1)^{1-\delta_i}$$

- Limiting arguments show the LH is the same (with $S_i(t_i + 1)$ replaced by $S_i(t_i+)$) in the continuous case

- Note: we'll see later, via the framework of partial likelihood, that maximizing the above LH is appropriate in even more general settings. (A discrete-time fitting sketch follows below.)
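The discrete-time likelihood above is a product of Bernoulli terms, one per subject per period at risk, so it can be maximized with logistic regression on "person-period" data. A sketch with a constant hazard and noninformative censoring (the geometric lifetimes and the person-period expansion are illustrative choices, not from the slides):

# Discrete-time hazard via logistic regression on person-period data (sketch)
set.seed(7)
n = 200; h = 0.2; maxT = 10
T = rgeom(n, prob = h) + 1            # discrete lifetimes, hazard h each period
C = sample(1:maxT, n, replace = TRUE) # noninformative censoring times
t = pmin(T, C); delta = as.numeric(T <= C)

# One row per subject per period at risk; y = 1 only in a failure period
pp = do.call(rbind, lapply(1:n, function(i) {
  data.frame(time = 1:t[i], y = c(rep(0, t[i] - 1), delta[i]))
}))

fit = glm(y ~ 1, family = binomial, data = pp)
plogis(coef(fit))                     # estimated hazard; should be near 0.2

Replacing ~ 1 with ~ factor(time) would estimate a separate hazard for each period.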

LH-based inference

- For parametric models the LH provides an efficient framework for estimation and inference

- Let $\theta \in \Theta \subseteq \mathbb{R}^p$ index the survival distribution of interest, and define:
  - $L(\theta)$, the LH
  - $\ell(\theta) = \log L(\theta)$, the log-LH
  - $u(\theta) = \frac{d}{d\theta} \ell(\theta)$, the score function
  - $I(\theta) = -\frac{d^2}{d\theta\, d\theta^\intercal} \ell(\theta)$, the Fisher information

LH-based inference cont’d

- Recall that the maximum LH estimator

  $$\hat{\theta}_n = \arg\max_{\theta \in \Theta} L(\theta)$$

  solves $u(\theta) = 0$

- Under mild regularity conditions

  $$\sqrt{n}(\hat{\theta}_n - \theta^*) \rightsquigarrow N(0, I^{-1}(\theta^*))$$

LH-based inference example

- Assume $T_1, \ldots, T_n$ are iid $\exp(\lambda)$ and subject to noninformative censoring. Let $\delta_1, \ldots, \delta_n$ denote the censoring indicators. Find the MLE for $\lambda$, the Fisher information, and a 95% confidence interval for $\lambda$.

- The LH, $L(\lambda)$, is

  $$\prod_{i=1}^n f(t_i; \lambda)^{\delta_i} S(t_i; \lambda)^{1-\delta_i} = \prod_{i=1}^n \lambda^{\delta_i} \exp\{-\lambda t_i \delta_i\} \exp\{-\lambda t_i (1 - \delta_i)\} = \lambda^{\sum_{i=1}^n \delta_i} \exp\left\{-\lambda \sum_{i=1}^n t_i\right\}$$

- The log-LH, $\ell(\lambda)$, is

  $$\left( \sum_{i=1}^n \delta_i \right) \log \lambda - \lambda \sum_{i=1}^n t_i$$

LH-based inference example cont’d

- The score function $u(\lambda)$ is given by

  $$u(\lambda) = \frac{\sum_{i=1}^n \delta_i}{\lambda} - \sum_{i=1}^n t_i;$$

  setting this to zero and solving yields

  $$\hat{\lambda}_n = \frac{\sum_{i=1}^n \delta_i}{\sum_{i=1}^n t_i}$$

LH-based inference example cont’d

- We take the negative derivative of $u(\lambda)$ to get

  $$I(\lambda) = \frac{\sum_{i=1}^n \delta_i}{\lambda^2}$$

- How do we get a 95% confidence interval for $\lambda$? (See below.)
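One standard answer: invert the information. Since $I(\hat{\lambda}_n)^{-1} = \hat{\lambda}_n^2 / \sum_i \delta_i$, a Wald interval is $\hat{\lambda}_n \pm 1.96 \cdot \hat{\lambda}_n / \sqrt{\sum_i \delta_i}$. A simulation sketch (the censoring setup is illustrative):

# Wald 95% CI for the exponential rate under noninformative censoring (sketch)
set.seed(8)
n = 300; lambda = 2
T = rexp(n, rate = lambda)
C = rexp(n, rate = 1)                # noninformative random censoring
t = pmin(T, C); delta = as.numeric(T <= C)

lambdaHat = sum(delta) / sum(t)      # MLE from the previous slide
se = lambdaHat / sqrt(sum(delta))    # plug-in standard error from I(lambdaHat)
lambdaHat + c(-1, 1) * qnorm(0.975) * se   # should cover lambda = 2 about 95% of the time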
