Version 2006-Dec. 23
Implementation of Threshold Regression:
Programs for SAS, R-code and STATA
This version of the threshold regression program is implemented by Qing Hu, Department of Mathematical Sciences – Applied Statistics, Worcester Polytechnic Institute, Worcester, MA

Introduction and Acknowledgements

Threshold regression refers to regression structures in first hitting time (FHT) models. In this technical report, the next section gives a brief overview of the theoretical foundations of threshold regression. The subsequent sections then describe simple programs that may be used to implement this type of regression analysis in SAS, R and Stata.

This document draws heavily on the published work of Mei-Ling Ting Lee and G. A. Whitmore on the topic of threshold regression. In particular, the Stata program presented here was modeled on a version provided by Lee and Whitmore and used by them in earlier research. For an overview of threshold regression, the reader is referred to

Lee M-LT, Whitmore GA (2007). Threshold regression for survival analysis: Modeling event times by a stochastic process, Statistical Science (in press).
The Basics of Threshold Regression

An FHT model has two basic components: (1) a parent stochastic process $\{X(t),\ t \in \mathcal{T},\ x \in \mathcal{X}\}$ with initial value $X(0) = x_0$, where $\mathcal{T}$ is the time space and $\mathcal{X}$ is the state space of the process, and (2) a boundary set $B$, where $B \subset \mathcal{X}$. The initial value $x_0$ is assumed to lie outside of set $B$. The word “threshold” refers to the fact that the FHT is triggered by the parent stochastic process reaching a threshold state within the boundary set for the first time. In other words, the first hitting time of $B$ is the random variable $S$ defined as follows:

$$S = \inf\{t : X(t) \in B\}$$
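To make the definition of S concrete, the following Python sketch simulates discretized sample paths and records the first time each one enters the boundary set. It is only an illustration under stated assumptions: the parent process is taken to be a Wiener process with drift (the choice made later in this report), the boundary set is B = (-inf, 0], and time is advanced in Euler steps of length dt. The function name first_hitting_time and all numerical values are hypothetical.

```python
import random

def first_hitting_time(x0, mu, sigma, dt=0.01, t_max=100.0, rng=None):
    """Return the first time a discretized Wiener path started at x0
    enters B = (-inf, 0], or None if it has not done so by t_max."""
    rng = rng or random.Random()
    x, t = x0, 0.0
    while t < t_max:
        # Euler step: the increment over dt is Normal(mu*dt, sigma^2*dt)
        x += mu * dt + sigma * (dt ** 0.5) * rng.gauss(0.0, 1.0)
        t += dt
        if x <= 0.0:
            return t
    return None

# Hypothetical illustration: 200 paths starting at x0 = 2 with drift -1
rng = random.Random(2006)
times = [first_hitting_time(x0=2.0, mu=-1.0, sigma=1.0, rng=rng)
         for _ in range(200)]
hits = [s for s in times if s is not None]
mean_fht = sum(hits) / len(hits)
```

Because the path is only observed on a grid, the simulated hitting times slightly overstate the continuous-time ones; a smaller dt reduces that bias at the cost of run time.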
The parent stochastic processes may take many forms. In this technical report, we assume the parent stochastic process is a Wiener process $\{X(t),\ t \ge 0\}$ with $\mu$ as its mean parameter, $\sigma^2$ as its variance parameter, and initial value $X(0) = x_0 > 0$. The boundary set is taken as the zero level of the process. Then the first hitting time $S$ of the boundary $B$ has an inverse Gaussian distribution if the process mean parameter $\mu$ is zero or if it is negative so the process tends to drift toward the zero level (the boundary). The inverse Gaussian distribution depends on the mean and variance parameters of the underlying Wiener process ($\mu$ and $\sigma^2$) and the initial value $x_0$. Let $f(t \mid \mu, \sigma^2, x_0)$ and $F(t \mid \mu, \sigma^2, x_0)$ denote the probability density function (p.d.f.) and cumulative distribution function (c.d.f.) of the FHT distribution. These two functions can be written as
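$$f(t \mid \mu, \sigma^2, x_0) = \frac{x_0}{\sqrt{2\pi\sigma^2 t^3}} \exp\left[-\frac{(x_0 + \mu t)^2}{2\sigma^2 t}\right], \quad \text{for } -\infty < \mu < \infty,\ \sigma^2 > 0,\ x_0 > 0,$$

and

$$F(t \mid \mu, \sigma^2, x_0) = \Phi\left[-\frac{(x_0 + \mu t)}{\sqrt{\sigma^2 t}}\right] + \exp\left(-\frac{2 x_0 \mu}{\sigma^2}\right) \Phi\left[\frac{\mu t - x_0}{\sqrt{\sigma^2 t}}\right],$$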
respectively, where $\Phi(\cdot)$ is the c.d.f. of the standard normal distribution. If $\mu > 0$, the FHT is not certain to occur and the p.d.f. is improper. In this case,

$$P(S = \infty) = 1 - \exp\left(-\frac{2 x_0 \mu}{\sigma^2}\right).$$

When the parent process is latent (unobserved), one parameter may be fixed. For instance, the variance parameter $\sigma^2$ may be set to unity, which we choose to do in this technical report.
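As a numerical check on the FHT density and distribution function described above, here is a minimal Python sketch using only the standard library. The function names fht_pdf, fht_cdf and prob_never_hits are hypothetical, and sigma2 defaults to 1 to mirror the convention adopted in this report. The formulas coded are the standard inverse Gaussian first hitting time results for a Wiener process started at x0 > 0 with absorbing level zero.

```python
import math
from statistics import NormalDist

PHI = NormalDist().cdf  # standard normal c.d.f.

def fht_pdf(t, mu, x0, sigma2=1.0):
    """Density of the first hitting time of zero at time t > 0."""
    return (x0 / math.sqrt(2.0 * math.pi * sigma2 * t ** 3)
            * math.exp(-(x0 + mu * t) ** 2 / (2.0 * sigma2 * t)))

def fht_cdf(t, mu, x0, sigma2=1.0):
    """Probability that the process has hit zero by time t > 0."""
    s = math.sqrt(sigma2 * t)
    return (PHI(-(x0 + mu * t) / s)
            + math.exp(-2.0 * x0 * mu / sigma2) * PHI((mu * t - x0) / s))

def prob_never_hits(mu, x0, sigma2=1.0):
    """P(S = infinity); positive only when the drift is away from zero."""
    return 1.0 - math.exp(-2.0 * x0 * mu / sigma2) if mu > 0 else 0.0

# Hypothetical values: drift toward the boundary, starting level 2
density_at_2 = fht_pdf(2.0, mu=-0.5, x0=2.0)
prob_by_10 = fht_cdf(10.0, mu=-0.5, x0=2.0)
```

With a positive drift, fht_cdf levels off below one as t grows, and the shortfall matches prob_never_hits, which is one way to see why the density is improper in that case.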
Now we introduce a regression structure by expressing parameters $\mu$ and $x_0$ as regression functions of covariates. Assume that there are $k$ covariates $z_1, \ldots, z_k$ that might be related to $\mu$ and $x_0$. Let an identity function link parameter $\mu$ to the covariates as follows:

$$\mu = \mathbf{z}\boldsymbol{\beta} = \beta_0 + \beta_1 z_1 + \cdots + \beta_k z_k.$$

Similarly, let a logarithmic function link parameter $x_0$ to the covariates as follows:

$$\ln(x_0) = \mathbf{z}\boldsymbol{\gamma} = \gamma_0 + \gamma_1 z_1 + \cdots + \gamma_k z_k.$$

Here $\mathbf{z} = (1, z_1, \ldots, z_k)$, $\boldsymbol{\beta} = (\beta_0, \beta_1, \ldots, \beta_k)'$, and $\boldsymbol{\gamma} = (\gamma_0, \gamma_1, \ldots, \gamma_k)'$. The unit element in the covariate vector allows for a regression intercept term. Other link functions can be chosen. The general criterion is that the link function should map the parameter into the whole real line.
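The two link functions can be sketched in a few lines of Python. This is an illustration only: the helper names linear_predictor and fht_parameters, the coefficient values, and the two covariates (an age and a gender indicator) are all hypothetical.

```python
import math

def linear_predictor(coefs, covariates):
    """Inner product of a coefficient vector with z = (1, z_1, ..., z_k)."""
    z = [1.0] + list(covariates)
    if len(coefs) != len(z):
        raise ValueError("need one coefficient per covariate plus an intercept")
    return sum(c * zj for c, zj in zip(coefs, z))

def fht_parameters(beta, gamma, covariates):
    """Identity link for mu, log link for x0."""
    mu = linear_predictor(beta, covariates)
    x0 = math.exp(linear_predictor(gamma, covariates))
    return mu, x0

# Hypothetical coefficients: intercept first, then one per covariate
beta = [-1.0, 0.01, 0.2]
gamma = [1.0, 0.0, -0.5]
mu, x0 = fht_parameters(beta, gamma, covariates=[60.0, 1.0])
```

The log link guarantees x0 > 0 for any real value of the linear predictor, while the identity link leaves mu free to take either sign.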
Suppose survival data of the form $(t_i, \mathbf{z}_i, \delta_i)$, $i = 1, \ldots, n$, are collected. Here $\delta_i$ is an indicator variable ($\delta_i = 0$ if the $i$th item is censored; $\delta_i = 1$ if the $i$th item failed), and $t_i$ is the corresponding failure or censoring time. As before, $\mathbf{z}_i$ is a vector of covariate values, in this case the covariate vector for item $i$. Likewise, we let $\mu_i$ and $x_{0i}$ denote the values of these parameters for item $i$. When we apply threshold regression in survival data analysis, the state of the underlying process represents the strength of an item, and the item fails when the process reaches an adverse threshold state for the first time (assumed to be zero in this report). Thus the sample log-likelihood function can be written as

$$\ln L(\boldsymbol{\beta}, \boldsymbol{\gamma}) = \sum_{i=1}^{n} \left[ \delta_i \ln f(t_i \mid \mu_i, x_{0i}) + (1 - \delta_i) \ln \bar{F}(t_i \mid \mu_i, x_{0i}) \right],$$

where $\bar{F} = 1 - F$ is the survival function, so that a censored item contributes the probability of not having failed by time $t_i$.
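The sample log-likelihood can be sketched in Python as follows. This is not the report's SAS, R or Stata program; it is a minimal illustration that assumes the variance parameter is fixed at one, an identity link for mu, a log link for x0, and that a censored item contributes the log of the survival function 1 - F. The function names and the two data records are hypothetical.

```python
import math
from statistics import NormalDist

PHI = NormalDist().cdf  # standard normal c.d.f.

def log_pdf(t, mu, x0):
    """ln f(t | mu, x0) with the variance parameter fixed at 1."""
    return (math.log(x0) - 0.5 * math.log(2.0 * math.pi * t ** 3)
            - (x0 + mu * t) ** 2 / (2.0 * t))

def survival(t, mu, x0):
    """1 - F(t | mu, x0) with the variance parameter fixed at 1."""
    s = math.sqrt(t)
    cdf = (PHI(-(x0 + mu * t) / s)
           + math.exp(-2.0 * x0 * mu) * PHI((mu * t - x0) / s))
    return 1.0 - cdf

def log_likelihood(beta, gamma, data):
    """data holds (t_i, covariates_i, delta_i); delta_i = 1 if failed, 0 if censored."""
    total = 0.0
    for t, covariates, delta in data:
        z = [1.0] + list(covariates)
        mu = sum(b * zj for b, zj in zip(beta, z))
        x0 = math.exp(sum(g * zj for g, zj in zip(gamma, z)))
        if delta == 1:
            total += log_pdf(t, mu, x0)
        else:
            total += math.log(survival(t, mu, x0))
    return total

# Two hypothetical records: (time, [age, gender], fail)
data = [(1.5, [60.0, 1.0], 1), (3.0, [55.0, 0.0], 0)]
beta = [-0.5, 0.0, 0.0]
gamma = [0.7, 0.0, 0.0]
ll = log_likelihood(beta, gamma, data)
```

A general-purpose optimizer would then be asked to maximize log_likelihood (or minimize its negative) over the stacked coefficient vector.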
Gradient algorithms, such as the Newton-Raphson algorithm, are efficient numerical methods for maximizing the log-likelihood function to find the estimates for regression parameter vectors $\boldsymbol{\beta}$ and $\boldsymbol{\gamma}$.
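To show what a Newton-Raphson iteration looks like in this setting, here is a deliberately reduced Python sketch with a single unknown. It assumes uncensored failure times, a variance parameter of one, a known drift mu, and no covariates, so that only the intercept gamma0 = ln(x0) is estimated. Under those assumptions the score is n - x0^2 * sum(1/t_i) - x0 * mu * n and the second derivative is -2 * x0^2 * sum(1/t_i) - x0 * mu * n. The function name, the starting value and the data are hypothetical; the full problem in this report has several coefficients and would use vector and matrix versions of the same update.

```python
import math

def newton_raphson_gamma0(times, mu, gamma0=1.0, tol=1e-10, max_iter=50):
    """Maximize the uncensored log-likelihood over gamma0 = ln(x0)."""
    n = len(times)
    a = sum(1.0 / t for t in times)
    for _ in range(max_iter):
        x0 = math.exp(gamma0)
        score = n - x0 * x0 * a - x0 * mu * n            # first derivative
        hessian = -2.0 * x0 * x0 * a - x0 * mu * n       # second derivative
        step = score / hessian
        gamma0 -= step                                   # Newton-Raphson update
        if abs(step) < tol:
            break
    return gamma0

# Hypothetical failure times and a known drift toward the boundary
times = [1.0, 2.0, 4.0]
mu = -1.0
gamma0_hat = newton_raphson_gamma0(times, mu)
x0_hat = math.exp(gamma0_hat)
```

In this reduced problem the maximizer also has a closed form (the positive root of a quadratic in x0), which makes it a convenient check that the iteration has converged to the right place.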
Implementing Threshold Regression in SAS

Here is sample code for implementing threshold regression with SAS:

**********************************************************************;
* MANUAL DATA INPUT *;
* We can input data manually or by pasting the data from another data source
(such as a text file). Consider the following hypothetical case illustration.
A study has 49 patients diagnosed with myeloma. They are administered a drug
at one of three dose levels, 0, 1 and 2, with the dose level being randomly
assigned. Zero indicates placebo. The time from the point of randomization
to either death or censoring has been tracked. The first line of the following
input code gives the data set name, 'myeloma'. The second line is an input
statement. The variables are listed after ‘input’ in the order they appear
in each line of the data record. The data follows the command 'datalines'
and the data set ends with a semicolon like other statements. Variable ‘id’
refers to the patient’s identification number. The 'time' variable gives
the survival or censoring time in years. Variable ‘age’ is the patient’s
age in years at enrolment into the study. Variable 'gender' is an indicator
variable that is coded 1 for male patients and 0 for female patients. Variable
'treat' indicates the assigned treatment dose. Variable 'fail' has a value
of 1 for patients who died and 0 for those who were censored.
> # reading data into R from an external tex[...]
> myelomatosis<-read.table("C:/my file/myelo[...]
>
> # a[...]
> time<-myelomatosis[,2]
> ag[...]
> gender<-myelomatosis[,4]
> treat<-myelomatosis[,5]
> fail<-myelomatosis[,6]
>
> # defining lnx0 and mu
> # the first four components of vector 'par' (parameter) correspond to the 4
> # coefficients for lnx0, respectively, and the last four correspond to those for mu
>
> lnx0<-function(par) {par[1]*age+par[2]*gender+par[3[...]
> mu[...]
>
> # transformation into functions 'd' and 'v'
>
> d<-function(par) {-mu(par)/exp(lnx0(par))}
> v<-function(par) {exp(-2*lnx0(par))}
>
> # defi[...]
> # carries[...]
>
> log[...]
+ (d(par)*time-1)^2/(v(par)*time))))- su[...]
+ /sq[...]
>
> # using 'nlm' (nonlinear minimization) to obtain the estimates 'par'
> # the second argument 'c(0, 0, 0, 4, 0, 0, 0, 1)' is the initial vector of par
> # [...]ul that for our myelomatosis example, 'nlm' (Newton-type algorithm) is very
> # [...] the initial values.
> # 'iterlim' specifies the maximum number of iterations
[...]: NA/Inf replaced by maximum positive value
[...]replaced by maximum positive value
[...]: NA/Inf replaced by maximum positive value
[...]replaced by maximum positive value
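The R fragments above define d = -mu/x0 and v = exp(-2*lnx0) = 1/x0^2, and the surviving piece of the log-likelihood contains the term (d*time-1)^2/(v*time). Assuming the variance parameter is fixed at one, that term is what the FHT density looks like after this change of variables. The Python sketch below checks the equivalence numerically; it is an inference from the visible fragments, not a reconstruction of the missing R lines, and the function names are hypothetical.

```python
import math

def pdf_mu_x0(t, mu, x0):
    """FHT density in the (mu, x0) parameterization, variance fixed at 1."""
    return (x0 / math.sqrt(2.0 * math.pi * t ** 3)
            * math.exp(-(x0 + mu * t) ** 2 / (2.0 * t)))

def pdf_d_v(t, d, v):
    """The same density written in terms of d = -mu/x0 and v = 1/x0^2."""
    return (1.0 / math.sqrt(2.0 * math.pi * v * t ** 3)
            * math.exp(-(d * t - 1.0) ** 2 / (2.0 * v * t)))

# Hypothetical parameter values
mu, x0 = -0.5, 2.0
d, v = -mu / x0, 1.0 / x0 ** 2
pairs = [(pdf_mu_x0(t, mu, x0), pdf_d_v(t, d, v)) for t in (0.5, 1.0, 3.0, 8.0)]
```

Each pair in pairs agrees to rounding error, which is why a likelihood coded in d and v estimates the same model as one coded in mu and x0.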
Implementing Threshold Regression in Stata

Implementation of threshold regression in Stata requires two files, in addition to the data file. One file is a ‘do’ file that controls the main execution. The ‘do’ file calls in the data set and, a little later, an ‘ado’ file. The ‘ado’ file contains the computational subroutine. We illustrate the Stata implementation using the myeloma case illustration. Refer to the opening paragraph of the preceding section on SAS implementation for a description of the case illustration and data set.

Do file
*The routine begins by clearing the data set and setting memory requirements.
*The Stata version is set to version 7.0.
clear
set mem 200m
set mat 200
set more off
version 7.0
*The data is entered from a text file called melanoma.txt
infile id time age gender treat fail using "melanoma.txt"
*The failure indicator and time variables are set to variable names used
*in the subroutine
global f "fail"
global t "time"
*Statement ‘ml model’ is a Stata maximum likelihood routine. The model fitting
*method (lf) and subroutine (treg.ado) are specified. The covariates are then
*listed for each parameter as shown. The parameters are ‘lnx0’ and ‘m’. The
*covariates are ‘age’, ‘gender’ and ‘treat’. Method 'lf' is a Stata gradient
*(hill climbing) method. File treg.ado contains the likelihood subroutine
*function. Statement ‘ml init’ allows initial values to be specified. Initial
*values must include those for the regression coefficients (0 is chosen in
*each case here) and one value for each intercept (1 and -1 are chosen for
*lnx0 and m here). Statement ‘ml maximize’ starts the optimization.
#delimit ;
ml model lf treg
(lnx0: age gender treat)
(m: age gender treat);
ml init
0 0 0 1
0 0 0 -1, copy;
#delimit cr
ml maximize
*In addition to tabular output of regression results, the following optional
*commands output the vector of regression coefficient estimates and their
*variance-covariance matrix and correlation matrix.
matrix list e(b)
matrix list e(V)
matrix C=corr(e(V))
matrix list C
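The last two commands of the do file turn the variance-covariance matrix of the estimates into a correlation matrix. The arithmetic behind that conversion is sketched below in Python for readers who want to see it spelled out: each covariance is divided by the product of the two corresponding standard deviations. The helper name cov_to_corr and the 2-by-2 matrix are hypothetical and are not output from the myeloma example.

```python
import math

def cov_to_corr(V):
    """Convert a variance-covariance matrix (list of lists) to correlations."""
    k = len(V)
    sd = [math.sqrt(V[i][i]) for i in range(k)]  # standard deviations
    return [[V[i][j] / (sd[i] * sd[j]) for j in range(k)] for i in range(k)]

# Hypothetical variance-covariance matrix for two estimates
V = [[4.0, 1.0],
     [1.0, 9.0]]
C = cov_to_corr(V)
```

Large off-diagonal correlations between an intercept and a coefficient in the same equation are common and are one reason the choice of starting values can matter for convergence.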
Ado file
*The following sequence of commands defines the contribution of each data