Introduction Proposed method Simulation studies Real example Conclusion References K-Adaptive Partitioning for Survival Data with an Application to SEER Soo-Heang Eo with Sungwan Bang, Seung-Mo Hong, HyungJun Cho Department of Statistics Korea University April 25, 2014
K-Adaptive Partitioning for Survival Data
with an Application to SEER
Soo-Heang Eo
with Sungwan Bang, Seung-Mo Hong, HyungJun Cho
Department of Statistics
Korea University
April 25, 2014
Table of Contents
Introduction
Motivation example
Previous studies
Proposed method
Finding the best split set
Finding optimal number of subgroups
Simulation studies
Simulation design
Simulation results
Real example
Conclusion
References
Cancer staging
A cancer staging system describes the extent of cancer in a
patient’s body. [2]
TNM staging system
The TNM Classification of Malignant Tumors (TNM) is a cancer
staging system that describes the extent of cancer in a patient’s
body. [2]
T describes the size of the tumor and whether it has invaded
nearby tissue.
N describes the regional lymph nodes that are involved.
M describes distant metastasis (spread of cancer from one body
part to another).
Previous studies
• Hilsenbeck and Clark (1996) conducted simulation studies to
examine the effects of the number of cutpoints and the true marker
prognostic effect size [3].
• Contal and O’Quigley (1999) derived the asymptotic distribution
of a re-scaled rank statistic [1].
• Lausen et al. (1994) and Hong et al. (2007) employed decision
tree methodology [8, 4].
• Hothorn and Lausen (2003) used maximally selected rank
statistics [6] to find the best cutpoints; the approach was extended to CTree [5].
• Ishwaran et al. (2009) adopted the idea of random survival forests
[7].
Most approaches to finding cutpoints have been based on
binary splitting.
Problem is ...
[Figure: binary survival tree split on the number of metastatic lymph nodes (meta), with survival curves for its seven terminal nodes. x-axis: survival months (0 to 192); y-axis: survival probability (0.0 to 1.0).]
Internal nodes (split rule on meta):
Node 1 (<=5): n = 65186, med = 77
Node 2 (<=1): n = 55799, med = 95
Node 3 (<=11): n = 9387, med = 19
Node 4 (<=0): n = 43417, med = 110
Node 5 (<=3): n = 12382, med = 47
Node 10 (<=2): n = 7981, med = 53
Terminal nodes:
Node 8 (# of meta = 0): n = 36086, med = 115
Node 9 (# of meta = 1): n = 7331, med = 80
Node 20 (# of meta = 2): n = 4693, med = 57
Node 21 (# of meta = 3): n = 3288, med = 47
Node 11 (# of meta = 4, 5): n = 4401, med = 38
Node 6 (# of meta = 6, 7, ..., 11): n = 6183, med = 23
Node 7 (# of meta = 12, ..., 80): n = 3204, med = 13
Problem is ...
[Figure: two panels. (a) $TS_K = 148$, p < 0.0001; (b) $TS_K = 132$, p < 0.0001.]
Our aim is to
• Divide the whole data $D$ into $K$ heterogeneous subgroups
$D_1, \ldots, D_K$ based on the information in $X$.
• Overcome the limitation that some subgroups differ
substantially in survival while others differ only slightly or
insignificantly.
• Evaluate multi-way split points simultaneously and find an
optimal set of cutpoints.
• Implement the proposed algorithm in an R package, kaps.
Finding the best split set
Let $T_i$ be a survival time, $C_i$ a censoring time, and $X_i$ an
ordered covariate for the $i$th observation. We observe the triples
$(Y_i, \delta_i, X_i)$, where
$$Y_i = \min(T_i, C_i) \quad \text{and} \quad \delta_i = I(T_i \le C_i)$$
represent the observed response variable and the censoring
indicator, respectively.
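As a small illustration of these definitions, the sketch below builds $(Y_i, \delta_i)$ from latent survival and censoring times. The function name `observe` and the four patients' values are hypothetical and chosen only to show one tie ($T_i = C_i$, which counts as an event under $\delta_i = I(T_i \le C_i)$).

```python
def observe(event_times, censor_times):
    """Return (Y, delta) with Y_i = min(T_i, C_i) and delta_i = I(T_i <= C_i)."""
    ys = [min(t, c) for t, c in zip(event_times, censor_times)]
    deltas = [1 if t <= c else 0 for t, c in zip(event_times, censor_times)]
    return ys, deltas


# Hypothetical latent times for four patients (illustrative values only).
T = [5.0, 12.0, 7.5, 20.0]
C = [8.0, 10.0, 7.5, 15.0]

Y, delta = observe(T, C)
print(Y)      # [5.0, 10.0, 7.5, 15.0]
print(delta)  # [1, 0, 1, 0]
```

In practice only $(Y_i, \delta_i, X_i)$ is available; the latent $T_i$ and $C_i$ appear here only to make the construction explicit.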
Let $\chi^2_1$ be the $\chi^2$ statistic with one degree of freedom (df) for
comparing the $g$th and $h$th of $K$ subgroups created by a split set
$s_K$ when $K$ is given. For a split set $s_K$ of $D$ into $D_1, D_2, \ldots, D_K$,
the test statistic for a measure of deviance can be defined as
$$T_1(s_K) = \min_{1 \le g < h \le K} \chi^2_1 \quad \text{for } s_K \in S_K, \tag{1}$$
where $S_K$ is a collection of split sets $s_K$ generating $K$ disjoint
subgroups.
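A minimal sketch of equation (1), assuming the pairwise $\chi^2_1$ statistic is the two-sample log-rank statistic (the slides do not name the test at this point, so this choice is an assumption). The function names and the three small data sets are hypothetical.

```python
from itertools import combinations


def logrank_chisq(y1, d1, y2, d2):
    """Two-sample log-rank chi-squared statistic (1 df); assumed choice of test."""
    pooled = [(y, d, 0) for y, d in zip(y1, d1)] + [(y, d, 1) for y, d in zip(y2, d2)]
    event_times = sorted({y for y, d, _ in pooled if d == 1})
    obs_minus_exp, var = 0.0, 0.0
    for t in event_times:
        n = sum(1 for y, _, _ in pooled if y >= t)                # at risk, both groups
        n1 = sum(1 for y, _, g in pooled if y >= t and g == 0)    # at risk, first group
        dj = sum(1 for y, d, _ in pooled if y == t and d == 1)    # events at time t
        d1j = sum(1 for y, d, g in pooled if y == t and d == 1 and g == 0)
        obs_minus_exp += d1j - dj * n1 / n
        if n > 1:
            var += dj * (n1 / n) * (1 - n1 / n) * (n - dj) / (n - 1)
    return obs_minus_exp ** 2 / var if var > 0 else 0.0


def min_pairwise_stat(groups):
    """T_1(s_K): the smallest chi-squared over all pairs g < h of the K subgroups."""
    return min(
        logrank_chisq(*groups[g], *groups[h])
        for g, h in combinations(range(len(groups)), 2)
    )


# Hypothetical (Y, delta) data for K = 3 subgroups; values are illustrative only.
groups = [
    ([30, 42, 55, 60, 61], [1, 0, 1, 0, 0]),
    ([12, 20, 25, 33, 47], [1, 1, 1, 0, 1]),
    ([3, 5, 8, 9, 14], [1, 1, 1, 1, 0]),
]
t1 = min_pairwise_stat(groups)
print(round(t1, 3))
```

Taking the minimum means a split set scores well only if even its two most similar subgroups are clearly separated.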
Then, take $s_K^*$ as the best split set such that
$$T_1^*(s_K^*) = \max_{s_K \in S_K} T_1(s_K). \tag{2}$$
The best split $s_K^*$ is a set of $(K - 1)$ cutpoints that
separates the data $D$ into $K$ disjoint subsets:
$D_1, D_2, \ldots, D_K$.
Finding the best split set - Illustration
• We assume that the patients are divided into 3 heterogeneous
groups by the number of lymph nodes ($X = \{0, 1, \ldots, 6\}$).
• The groups are denoted $D_1$, $D_2$, and $D_3$.
• There are 3 pairs of groups ($D_1$ vs. $D_2$, $D_2$ vs. $D_3$, and $D_1$ vs. $D_3$).
• Suppose there are 3 candidate split sets, each composed of
two cutpoints: {0, 1}, {0, 2}, and {1, 2}.
• For example, the candidate {0, 1} means that
$D_1 = \{X = 0\}$, $D_2 = \{0 < X \le 1\}$, and $D_3 = \{X > 1\}$.
• For each candidate, the smallest of the 3 pairwise test statistics is
selected as the representative statistic for that candidate.
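The mapping from a candidate split set to subgroups can be sketched as follows. The helper name `assign_group` is hypothetical; the convention that each cutpoint belongs to the group on its left follows the {0, 1} example in the list above.

```python
from bisect import bisect_left


def assign_group(x, cutpoints):
    """Return the 1-based subgroup index of x for ascending cutpoints c_1 < ... < c_{K-1}.

    Group k holds c_{k-1} < x <= c_k, so each cutpoint falls in the group on its left.
    """
    return bisect_left(cutpoints, x) + 1


X_VALUES = range(7)                    # X = {0, 1, ..., 6}
CANDIDATES = [(0, 1), (0, 2), (1, 2)]  # the three candidate split sets in the illustration

partitions = {}
for cuts in CANDIDATES:
    groups = {1: [], 2: [], 3: []}
    for x in X_VALUES:
        groups[assign_group(x, cuts)].append(x)
    partitions[cuts] = groups
    print(cuts, groups)

# (0, 1) {1: [0], 2: [1], 3: [2, 3, 4, 5, 6]}
# (0, 2) {1: [0], 2: [1, 2], 3: [3, 4, 5, 6]}
# (1, 2) {1: [0, 1], 2: [2], 3: [3, 4, 5, 6]}
```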
Algorithm 1. Finding the best split set for given K
Step 1: Compute the chi-squared test statistics $\chi^2_1$ for all possible pairs, $g$ and $h$, of the $K$
subgroups created by $s_K$, where $1 \le g < h \le K$ and $s_K$ is a split set of $(K - 1)$
cutpoints generating $K$ disjoint subgroups.
Step 2: Obtain the minimum pairwise statistic $T_1(s_K)$ by minimizing $\chi^2_1$ over all
possible pairs, i.e., $T_1(s_K) = \min_{1 \le g < h \le K} \chi^2_1$ for $s_K \in S_K$, where $S_K$ is a
collection of split sets $s_K$ generating $K$ disjoint subsets of the data.
Step 3: Repeat Steps 1 and 2 for all possible split sets in $S_K$.
Step 4: Take the best split set $s_K^*$ such that $T_1^*(s_K^*) = \max_{s_K \in S_K} T_1(s_K)$. When two
or more split sets attain the maximum $T_1^*$ of the minimum pairwise statistics,
choose the split set with the largest overall statistic $T_{K-1}^*$.
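The max-min search of Algorithm 1 can be sketched as an exhaustive loop over split sets. This is an illustration under stated assumptions, not the kaps implementation: the pairwise statistic is passed in as a function, the demo uses a toy mean-difference statistic on uncensored values instead of a survival $\chi^2_1$ test, and the tie-break by the overall statistic in Step 4 is omitted (the first maximum is kept). All names and data are hypothetical.

```python
from bisect import bisect_left
from itertools import combinations


def best_split(xs, ys, k, pair_stat):
    """Return (cutpoints, T1) maximizing the minimum pairwise statistic over S_K."""
    values = sorted(set(xs))
    candidates = values[:-1]  # cutting at the largest value would leave D_K empty
    best_cuts, best_t1 = None, float("-inf")
    for cuts in combinations(candidates, k - 1):  # Step 3: every split set in S_K
        groups = [[] for _ in range(k)]
        for x, y in zip(xs, ys):
            groups[bisect_left(cuts, x)].append(y)
        # Steps 1-2: minimum statistic over all pairs g < h
        t1 = min(pair_stat(groups[g], groups[h]) for g, h in combinations(range(k), 2))
        if t1 > best_t1:  # Step 4, without the tie-break on the overall statistic
            best_cuts, best_t1 = cuts, t1
    return best_cuts, best_t1


def toy_stat(a, b):
    """Stand-in two-sample statistic (NOT a survival test): scaled squared mean difference."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    return (mean_a - mean_b) ** 2 / (1 / len(a) + 1 / len(b))


# Hypothetical data with three plateaus: x in {0,1}, {2,3}, {4,5}.
xs = [0, 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5]
ys = [10, 10, 10, 10, 5, 5, 5, 5, 1, 1, 1, 1]

cuts, t1 = best_split(xs, ys, k=3, pair_stat=toy_stat)
print(cuts, t1)  # (1, 3) 32.0
```

With $m$ distinct covariate values there are $\binom{m-1}{K-1}$ split sets to evaluate, so the exhaustive search is practical for ordered covariates with a moderate number of distinct values.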
Finding optimal subgroups
• An important issue is determining a reasonable
number of subgroups, i.e., selecting an optimal $K$.
• For a given number of subgroups, we find an optimal multi-way
split all at once.
• We need to choose only one among the possible numbers of
subgroups.
• For a data-driven, objective choice, we suggest a
statistical procedure to choose an optimal number of
subgroups.
The permutation p-value is computed from the minimum pairwise statistics of the permuted data:
$$p_K = \sum_{r=1}^{R} I\left(T_1^{(r)}(s_K^*) \ge T_1^*(s_K^*)\right) / R, \quad K = 2, 3, \ldots,$$
where $T_1^{(r)}(s_K^*)$ is the $r$th repeated minimum pairwise statistic for
the permuted data.
• The permuted data are constructed by permuting the labels of $X$
while retaining the labels of $(Y, \delta)$.
• When the permuted data are allocated into subgroups by
$s_K^*$, there should be no significant differences in survival
among the subgroups.
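The permutation p-value can be sketched as below, assuming a caller-supplied function that returns the minimum pairwise statistic for fixed cutpoints $s_K^*$. The names `permutation_p_value` and `toy_t1`, the default R = 200, the seed, and the data are all hypothetical; the toy statistic is a stand-in for the survival statistic, and in the actual setting each outcome would be a pair $(Y_i, \delta_i)$.

```python
import random


def permutation_p_value(xs, outcomes, t1_stat, n_perm=200, seed=0):
    """p_K = #{r : T1^(r) >= T1 on the raw data} / R, shuffling X and keeping outcomes fixed."""
    rng = random.Random(seed)
    observed = t1_stat(xs, outcomes)
    hits = 0
    for _ in range(n_perm):
        shuffled = list(xs)
        rng.shuffle(shuffled)  # permute the labels of X only
        if t1_stat(shuffled, outcomes) >= observed:
            hits += 1
    return hits / n_perm


def toy_t1(xs, ys, cut=0):
    """Stand-in for T_1 with K = 2 and a fixed cutpoint (NOT a survival statistic)."""
    low = [y for x, y in zip(xs, ys) if x <= cut]
    high = [y for x, y in zip(xs, ys) if x > cut]
    diff = sum(low) / len(low) - sum(high) / len(high)
    return diff ** 2 / (1 / len(low) + 1 / len(high))


# Hypothetical data in which X separates the outcomes almost perfectly.
xs = [0] * 10 + [1] * 10
ys = [9, 10, 11, 10, 9, 11, 10, 10, 9, 11, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2]

p_k = permutation_p_value(xs, ys, toy_t1)
print(p_k)
```

Because the cutpoints stay fixed while X is shuffled, the permuted statistics describe what the minimum pairwise statistic looks like when the covariate carries no information about survival.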
We choose the largest number of subgroups, so as to discover as many significantly
different subgroups as possible, given that the corrected p-values
are smaller than or equal to a pre-determined significance level $\alpha$:
$$\hat{K} = \max\{K \mid p_K^c \le \alpha,\ K = 2, 3, \ldots\},$$
where $p_K^c$ is the p-value corrected for multiple comparisons, because
there are $(K - 1)$ comparisons between two adjacent subgroups.
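The selection rule can be sketched as follows. The correction method is not specified in this part of the slides, so the Bonferroni-style factor $(K - 1)$ used here is an assumption, as are the function name `select_k` and the input p-values (illustrative numbers only, with 0.0001 standing in for a value reported as "< 0.0001").

```python
def select_k(p_values, alpha=0.05):
    """Return (K_hat, corrected), where K_hat is the largest K with corrected p <= alpha.

    p_values maps K to its raw permutation p-value p_K. The correction
    min(1, (K - 1) * p_K) is an assumed Bonferroni-style adjustment for the
    (K - 1) comparisons between adjacent subgroups.
    """
    corrected = {k: min(1.0, (k - 1) * p) for k, p in p_values.items()}
    eligible = [k for k, pc in corrected.items() if pc <= alpha]
    return (max(eligible) if eligible else None), corrected


# Illustrative raw p-values for K = 2, 3, 4 (hypothetical inputs).
raw = {2: 0.0001, 3: 0.0073, 4: 0.1686}

k_hat, corrected = select_k(raw, alpha=0.05)
print(k_hat)  # 3
```

If no $K$ passes the threshold, the sketch returns `None`, which corresponds to finding no significant partition of the data.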
Finding optimal subgroups - Illustration
[Figure: survival curves (survival probability vs. survival months, 0 to 192) for K = 2, 3, and 4 subgroups.]
(a) K = 2 (G1, G2): $TS_K$ = 26.37, $TS_1$ = 26.37, $p_K$ < 0.0001, $p_1$ < 0.0001
(b) K = 3 (G1, G2, G3): $TS_K$ = 36.80, $TS_1$ = 7.2, $p_K$ < 0.0001, $p_1$ = 0.0073
(c) K = 4 (G1, G2, G3, G4): $TS_K$ = 38.00, $TS_1$ = 1.895, $p_K$ < 0.0001, $p_1$ = 0.1686
[Figure: four panels. Survival months vs. meta, with events and censored observations marked; survival curves (survival probability vs. survival months, 0 to 192) for G1, G2, and G3; p-values of X_k against K = 2, 3, 4; p-values of X_1 against K = 2, 3, 4.]
Algorithm 2. Selecting the optimal number of subgroups
Step 1: Find $s_K^*$ and $T_1(s_K^*)$ with the raw data for each $K$ using Algorithm 1.
Step 2: Construct the permuted data by permuting the labels of $X$ whilst retaining
the labels of $(Y, \delta)$.
Step 3: Allocate the permuted data into each subgroup by $s_K^*$.
Step 4: Obtain the minimum pairwise statistic $T_1^{(r)}(s_K^*)$ for the permuted data.
Step 5: Repeat Steps 2 to 4 $R$ times, and then obtain
$T_1^{(1)}(s_K^*), T_1^{(2)}(s_K^*), \ldots, T_1^{(R)}(s_K^*)$.
Step 6: Compute the permutation p-value $p_K$ for each $K$, i.e.,
$p_K = \sum_{r=1}^{R} I\left(T_1^{(r)}(s_K^*) \ge T_1(s_K^*)\right) / R$, $K = 2, 3, \ldots$.
Step 7: Correct the permutation p-value $p_K$ by correcting for multiple comparisons,