APPROXIMATE LIKELIHOOD INFERENCE FOR HAPLOTYPE RISKS IN CASE-CONTROL STUDIES OF A RARE DISEASE
by Zhijian Chen
B.Sc. in Statistics, Peking University, 2003.
A project submitted in partial fulfillment of the requirements for the degree of Master of Science in the Department of Statistics and Actuarial Science
© Zhijian Chen 2006
SIMON FRASER UNIVERSITY
Fall 2006
All rights reserved. This work may not be reproduced in whole or in part, by photocopy or other means, without the permission of the author.
variance 1. For each subject, the binary disease status D was simulated according to the
penetrance model in equation (3.1). Once the population was simulated, a subset of cases
and controls of specified size was randomly selected, and the single-locus genotypes and
environmental covariates of the selected subjects were collected and recorded as data. The
data set was then analyzed using PML, EE and HYBRID, respectively. We also wished to
compare the finite-sample bias in the regression parameter and standard error estimators
of PML, EE and HYBRID to the finite-sample bias of maximum likelihood when there
is no missing haplotype phase; i.e. logistic regression. The finite-sample bias of logistic
regression analysis of the phase-known data provides a baseline against which to judge the
bias of methods that analyze the data with phase ambiguity. Hence we also recorded the
haplogenotypes of all sampled subjects and obtained another data set which was analyzed
by logistic regression using the glm function of the R programming environment. We use the
notation GLM to denote maximum likelihood applied to the complete (i.e. phase-known)
data.
CHAPTER 3. SIMULATION STUDY 27
Let “h0” denote the baseline haplotype in the sample. The model fit to the data is
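\[
\operatorname{logit}\{\operatorname{pr}(D = 1 \mid H, X)\} = \beta_0 + \beta_X X + \sum_{h_j \neq h_0} \beta_{h_j} N_{h_j}(H) + \sum_{h_j \neq h_0} \beta_{h_j X} N_{h_j}(H) X, \tag{3.2}
\]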
where βhj and βhjX are regression coefficients associated with main effects of haplotype hj
and effects of interaction between hj and X, and where Nhj(H) counts the number of copies
of haplotype hj contained in haplogenotype H. In terms of the model used to simulate data,
given in equation (3.1), we have that βH = βh1 and βHX = βh1X ; to simplify notation we
use βH and βHX throughout. The estimates of βX , βH and βHX from GLM, PML, EE and
HYBRID, as well as their associated standard errors, were recorded.
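The right-hand side of model (3.2) can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions: the function names, the haplotype labels and all coefficient values are hypothetical and do not come from the study, and a haplogenotype is represented simply as a pair of haplotype labels.

```python
import math


def haplotype_count(haplogenotype, haplotype):
    """N_hj(H): number of copies of `haplotype` in the pair `haplogenotype`."""
    return sum(1 for h in haplogenotype if h == haplotype)


def linear_predictor(haplogenotype, x, beta0, beta_x, beta_h, beta_hx):
    """Right-hand side of model (3.2) for one subject.

    beta_h and beta_hx map each NON-baseline haplotype label to its
    main-effect and interaction coefficient (hypothetical values).
    The baseline haplotype h0 is absent from both dicts and so
    contributes nothing beyond the intercept.
    """
    eta = beta0 + beta_x * x
    for h in beta_h:
        n_copies = haplotype_count(haplogenotype, h)
        eta += beta_h[h] * n_copies + beta_hx[h] * n_copies * x
    return eta


def penetrance(eta):
    """Inverse logit: pr(D = 1 | H, X) for linear predictor eta."""
    return 1.0 / (1.0 + math.exp(-eta))


# Hypothetical example: one copy of risk haplotype "h1", covariate x = 2.
eta = linear_predictor(("h1", "h0"), 2.0, beta0=-3.0, beta_x=0.1,
                       beta_h={"h1": 0.2}, beta_hx={"h1": 0.3})
risk = penetrance(eta)
```

Because the model is multiplicative on the odds scale, a subject with two copies of h1 receives twice the main-effect and interaction contributions of a subject with one copy.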
For each value of βHX and each simulation scenario, statistical properties of the approaches
were estimated from 10000 simulation replicates. In up to about half of the
simulated data sets, either Hplus estimated h1 to be the most frequent haplotype, or one
or more of the approaches failed to converge while fitting the risk model. Such data sets
were discarded, and simulation continued until the desired number of 10000 replicates was obtained.
3.2 Statistical Properties
We computed four commonly used measures to evaluate the performance of GLM, PML,
EE and HYBRID. Let b be the true value of a regression parameter associated with an effect
(e.g. βHX for the interaction effect). Let b̂ be an estimator (e.g. the PML estimator) of b
and b̂r be its realization in the rth simulation replicate. The first measure is the estimated
bias of b̂:
\[
\operatorname{Bias}(\hat{b}) = \bar{\hat{b}} - b = \frac{1}{R} \sum_{r=1}^{R} (\hat{b}_r - b),
\]
in which the summation is over all R replicates. In our current simulation study, R =
10000. The estimated bias is compared to the corresponding simulation error (described in
Appendix A of Ghadessi (2005)), and the estimator is said to be unbiased if the estimated
bias is within simulation error of zero. Let SE be the standard error associated with b̂,
and SEr be the standard error of b̂r. Let
\[
\mathrm{SD} = \sqrt{\frac{1}{R} \sum_{r=1}^{R} (\hat{b}_r - \bar{\hat{b}})^2}
\]
denote the empirical standard deviation of b̂. In our simulation study, SD is considered to be the nominal (true)
value of the standard error of b̂. The second measure is the estimated bias of SE:
\[
\operatorname{Bias}(\mathrm{SE}) = \frac{1}{R} \sum_{r=1}^{R} (\mathrm{SE}_r - \mathrm{SD}),
\]
which quantifies the accuracy of the standard error estimator of b̂. The standard error
estimator is said to be unbiased if the estimated bias of SE is within the corresponding
simulation error of zero. The third measure is the estimated coverage probability of the
100(1 − α)% confidence interval (b̂ − Zα/2 SE, b̂ + Zα/2 SE), that is, the proportion of
replicates in which the interval includes the true value b:
\[
\mathrm{CP} = \frac{1}{R} \sum_{r=1}^{R} \delta\{|\hat{b}_r - b| < Z_{\alpha/2}\,\mathrm{SE}_r\},
\]
where δ is the indicator function. In our simulation study, α = 0.05 and Zα/2 is approximately 2.
An acceptable estimated coverage probability of the 95% confidence interval
should be within simulation error of the nominal 95%. The fourth measure is the estimated
power of the approach to detect the presence of the effect, quantified as the probability of
rejecting the null hypothesis b = 0:
\[
P = \frac{1}{R} \sum_{r=1}^{R} \delta\{|\hat{b}_r| > Z_{\alpha/2}\,\mathrm{SE}_r\}.
\]
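The four measures of this section can be computed from the per-replicate estimates and standard errors as in the following minimal Python sketch. The function name and argument names are hypothetical, and the default critical value z = 1.96 (the standard normal quantile for α = 0.05) is an assumption standing in for the "approximately 2" used in the text; the simulation-error calculation of Ghadessi (2005) is not reproduced here.

```python
import math


def simulation_summary(estimates, std_errors, true_value, z=1.96):
    """Bias, SD, bias of SE, coverage and power over R replicates.

    estimates[r] is the estimate from replicate r, std_errors[r] is SE_r,
    and true_value is b. The critical value z is an assumed default.
    """
    R = len(estimates)
    mean_est = sum(estimates) / R
    bias = mean_est - true_value
    # Empirical standard deviation with divisor R, as in the text.
    sd = math.sqrt(sum((b - mean_est) ** 2 for b in estimates) / R)
    bias_se = sum(se - sd for se in std_errors) / R
    pairs = list(zip(estimates, std_errors))
    # Proportion of intervals that contain the true value b.
    coverage = sum(abs(b - true_value) < z * se for b, se in pairs) / R
    # Proportion of replicates rejecting the null hypothesis b = 0.
    power = sum(abs(b) > z * se for b, se in pairs) / R
    return {"bias": bias, "sd": sd, "bias_se": bias_se,
            "coverage": coverage, "power": power}
```

A call such as `simulation_summary(est, se, 0.3)` would be made once per method (GLM, PML, EE, HYBRID) and per value of βHX.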
3.3 Simulation Results
We first present an overview of conclusions from the simulation study, with particular
attention to results that address methodological issues raised in Chapter 2 and to results
that confirm previous studies. More detailed simulation results related to estimation of
βHX are presented next, which support the general conclusions of the simulation study.
Full simulation results for βH , βX and βHX appear in Tables C.1 - C.4 of Appendix C.
3.3.1 Overview of Simulation Conclusions
Overall, HYBRID performed the best of the three approximate score methods, with approx-
imately correct inference and good power to detect the interaction effect in all simulation
configurations.
We next discuss the bias and variance of risk parameter estimators, and the bias in the
standard error estimators.
Our simulations, under more moderate interaction effects than those of Spinka et al.,
show that, when haplotype ambiguity is moderate, bias in all estimators of the regression
parameters, including PML, is comparable to the finite-sample bias of logistic regression
with known haplotypes (GLM). However, when haplotype ambiguity is extreme, the PML
and EE estimators are biased relative to HYBRID. The bias of PML is likely due to its in-
correct approximation of the RML weights in these simulations. EE and HYBRID (MPSE),
on the other hand, use the correct weights because the data are simulated under popula-
tion HWP and independence of H and X. The bias of EE is likely due to the estimating
equations for the haplotype frequencies which differ from those for MPSE.
In contrast, under extreme haplotype ambiguity, the EE regression estimators were less
variable than those of HYBRID (MPSE). Recall that, unlike MPSE, the EE estimating
equations for the haplotype frequencies involve the controls only and do not depend on the
regression parameters. If regression parameter estimators are imprecisely determined, the
EE estimator of haplotype frequencies might be less variable than the MPSE estimator,
even though the MPSE estimator uses data from both cases and controls. Such decreased
variability in estimators of haplotype frequency might then translate into decreased variance
for the regression parameter estimators of EE, relative to those of MPSE.
The most striking simulation results regarding standard errors were those for EE. The
conservative standard errors for EE are almost certainly due to an error in the EE variance
calculation noted previously. Figure 3.9 shows the bias in standard errors, after excluding
those of EE, for the remaining methods. The HYBRID standard errors perform the best of
the three approximate score methods, even though the variance calculation from hapassoc
is incorrect.
3.3.2 Results for Simulation Scenarios i) and ii)
Figure 3.1 summarizes the results for the first simulation scenario, in which the first set of
haplotype frequencies was used. Based on the estimated bias of βHX , we make the following
key observations. First, the bias of estimators from PML, EE and HYBRID was upward
(anti-conservative) in general. Second, the biases increased as βHX increased. Third, EE
and HYBRID performed slightly better than PML, as the biases of the EE and HYBRID
estimators were within simulation error of zero when βHX = 0.1 and 0.3. The boxplots in
Figure 3.3 show that the variability in βHX estimates appeared to be smaller for EE and
HYBRID than for PML.
EE performed poorly in calculating the standard errors of the βHX estimates, as the
standard error of βHX was upward biased (Figure 3.1). The standard errors from PML
and HYBRID showed a slightly downward bias (anti-conservative) in general and exceeded
simulation errors (Figure 3.9). However, the magnitudes of the biases were small compared
to the bias of the EE estimator. Figure 3.3 also shows that standard errors from EE are
more spread out than those from PML and HYBRID.
The estimated coverage probabilities of the 95% confidence intervals of PML and HY-
BRID were approximately 95% and within simulation errors of the nominal 95%. By con-
trast, the estimated coverage probabilities of EE were larger than 99% in general and
exceeded simulation errors. The inflation of the standard errors from EE resulted in low
power to detect the interaction effects. The estimated power to detect weak interactions was
low for all three approaches but improved for PML and HYBRID as the level of interaction
increased.
Figures 3.2, 3.4 and 3.9 show the simulation results for the second simulation scenario,
which used the same haplotype frequency configuration as the first simulation scenario
but with a smaller sample size. Similar patterns of bias in the estimation of βHX and its
standard errors were observed for PML, EE and HYBRID. The magnitude of the bias was larger and
the variability in the estimates was greater than in the first simulation scenario.
3.3.3 Results for Simulation Scenarios iii) and iv)
The second set of haplotype frequencies was used in the third and fourth simulation sce-
narios. The simulation results for βHX are summarized in Figures 3.5 and 3.6, in the same
formats as those for the first two simulation scenarios.
Both EE and HYBRID showed downward bias in estimating weak interaction effects
and upward bias in estimating moderate and strong interaction effects, while PML showed
upward bias in estimating weak and moderate interactions and downward bias in estimating
strong interactions. The estimated bias was within simulation error for PML when βHX =
0.1 and 0.5, for EE when βHX = 0.1 and 0.3, and for HYBRID when βHX = 0.1, 0.3 and 0.5.
The variability in the estimates appeared to be smaller for EE than for PML and HYBRID,
as shown in Figures 3.7 and 3.8.
The results for the standard error showed that the estimated bias of standard errors was
downward for PML and upward for EE and HYBRID in general, with all estimated biases
exceeding simulation errors. Figure 3.7 shows similar inflation and spread of standard errors
from EE as observed in the first two simulation scenarios, and standard errors from PML
and HYBRID that are more concentrated than EE.
The 95% confidence interval from EE gave estimated coverage probabilities of around
99%, due to the highly conservative standard errors. The coverage probabilities were slightly
below 95% for PML and slightly above 95% for HYBRID. As expected, the estimated
power to detect interactions was much lower for EE than for PML and HYBRID. The power
for all approaches improved as βHX or sample size increased. Recall that the ability of the
single-locus genotypes to predict the number of copies of risk haplotype h1 is lower and
phase ambiguity is higher in the second set of haplotype frequencies than in the first set.
Thus, it is not surprising that the power for PML and HYBRID was much lower than for
GLM using phase-known data, even for high levels of interaction and large sample sizes.
Figure 3.1: Results for βHX from simulation scenario i). [Four panels plotted against βHX (0.1, 0.3, 0.5, 0.7): Bias of βHX, Bias of Standard Error, Coverage Probability, and Power. Legend: GLM, PML, EE, HYBRID, unbiased.]
Figure 3.2: Results for βHX from simulation scenario ii). [Four panels plotted against βHX (0.1, 0.3, 0.5, 0.7): Bias of βHX, Bias of Standard Error, Coverage Probability, and Power. Legend: GLM, PML, EE, HYBRID, unbiased.]
Figure 3.3: Boxplots of bias in estimation of βHX (upper plot) and bias in estimation of associated standard error (lower plot) from simulation scenario i). [Upper plot: Boxplots of Bias in Estimation of βHX; lower plot: Boxplots of Bias in Estimation of SE. In both, the x-axis is βHX (0.1, 0.3, 0.5, 0.7), the y-axis is Bias, and results are shown for GLM, PML, EE and HYBRID.]
Figure 3.4: Boxplots of bias in estimation of βHX (upper plot) and bias in estimation of associated standard error (lower plot) from simulation scenario ii). [Upper plot: Boxplots of Bias in Estimation of βHX; lower plot: Boxplots of Bias in Estimation of SE. In both, the x-axis is βHX (0.1, 0.3, 0.5, 0.7), the y-axis is Bias, and results are shown for GLM, PML, EE and HYBRID.]
Figure 3.5: Results for βHX from simulation scenario iii). [Four panels plotted against βHX (0.1, 0.3, 0.5, 0.7): Bias of βHX, Bias of Standard Error, Coverage Probability, and Power. Legend: GLM, PML, EE, HYBRID, unbiased.]
Figure 3.6: Results for βHX from simulation scenario iv). [Four panels plotted against βHX (0.1, 0.3, 0.5, 0.7): Bias of βHX, Bias of Standard Error, Coverage Probability, and Power. Legend: GLM, PML, EE, HYBRID, unbiased.]
Figure 3.7: Boxplots of bias in estimation of βHX (upper plot) and bias in estimation of associated standard error (lower plot) from simulation scenario iii). [Upper plot: Boxplots of Bias in Estimation of βHX; lower plot: Boxplots of Bias in Estimation of SE. In both, the x-axis is βHX (0.1, 0.3, 0.5, 0.7), the y-axis is Bias, and results are shown for GLM, PML, EE and HYBRID.]
Figure 3.8: Boxplots of bias in estimation of βHX (upper plot) and bias in estimation of associated standard error (lower plot) from simulation scenario iv). [Upper plot: Boxplots of Bias in Estimation of βHX; lower plot: Boxplots of Bias in Estimation of SE. In both, the x-axis is βHX (0.1, 0.3, 0.5, 0.7), the y-axis is Bias, and results are shown for GLM, PML, EE and HYBRID.]
Figure 3.9: Estimated bias of standard error for GLM, PML and HYBRID after excluding EE. [Four panels, one each for Scenarios i), ii), iii) and iv), showing Bias of Standard Error against βHX (0.1, 0.3, 0.5, 0.7). Legend: GLM, PML, HYBRID, unbiased.]
Chapter 4
Conclusions and Future Work
The development of genotyping technologies makes the identification of most of the nu-
cleotide variations in the human genome possible and provides abundant data for disease
gene mapping. However, current widely used PCR-based genotyping techniques only al-
low experimenters to observe genotypes at one specific locus at a time. Therefore, when
haplotypes are among the risk factors in association studies of a disease, missing data on
genetic factors could arise due to phase ambiguity. A variety of statistical methods have
been developed within the GLM framework to relate haplotypes and non-genetic covariates
to the disease phenotype based on observed data for single-locus genotypes.
We have considered haplotype risk inference in case-control studies of a rare disease
in the presence of haplotype phase ambiguity and information on non-genetic risk factors.
We reviewed RML and compared three approximate score methods (PML, EE and MPSE)
that use approximate weights in the weighted RML score equations. We also proposed a
hybrid approach, which uses the MPSE parameter estimator and a PML variance estimator.
Our simulations adopted the two haplotype frequency configurations described in Ghadessi
(2005), which yield moderate and extreme levels of haplotype ambiguity, respectively. We
varied the sample size and considered four relatively modest levels of statistical interaction
between haplotype and non-genetic risk factors. Our simulation results were in general
agreement with those of Spinka et al., in that we showed PML is more biased than EE
or HYBRID in estimating the interaction effect. Such bias is likely due to the incorrect
PML weights, derived assuming HWP and independence of haplotypes and non-genetic
factors in the pooled case-control sample. However, under moderate haplotype ambiguity,
the PML bias was comparable to the bias of logistic regression analysis of phase-known
data. A drawback of the EE approach is its conservative variance estimator, which leads to
conservative coverage of confidence intervals and low power to detect statistical interaction.
Overall, the hybrid approach performed best of the three approximate score methods in our
simulations.
There are several areas for future work. First, the hybrid approach showed promise in
our simulations, but its variance estimator lacks justification. Implementing the correct
variance estimator would result in MPSE for rare diseases, which we could compare to HY-
BRID. Second, in simulations so far, haplotypes and the non-genetic factor were simulated
independently. Such independence is assumed in deriving the MPSE and EE weights, and
so these weights were correct for the simulated data. However, theoretical and empirical
results to date do not address the statistical properties of MPSE or EE when haplotypes
and non-genetic factors are dependent. We have begun simulations under such dependence,
but more work is required. Finally, our simulations only considered SNPs at three loci
and two sets of haplotype frequencies, and data were simulated and analyzed under a mul-
tiplicative disease risk model only. There are many other simulation configurations and
disease risk models (e.g. dominant or recessive models) that could be used to compare
these approaches.
Appendix A
Variable Probability Sampling
The sampling scheme given in Spinka et al. (2005) is called variable probability sampling
(VPS), motivated by nested case-control sampling, where a case-control sample is drawn
from a cohort of subjects. As the sampled cohort gets large, the cohort becomes a good
approximation to the general population, so that the nested case-control study becomes a
good approximation to the population-based case-control study of interest.
Here we give an overview of VPS based on the description of Lawless et al. (1999), and
show that the conditional probability of H given X is the same under VPS and VSS.
A.1 Overview of VPS
Variable probability sampling is done in three stages:
stage 1: Sample a cohort of size nc from a population. Index subjects in the cohort by
j = 1, . . . , nc. Measure disease status D on all nc subjects in the cohort. Let M1
be the number of cases in the cohort and M0 be the number of controls. Then
Mi ∼ binomial(nc,pr(D = i)).
stage 2: Examine all nc of the subjects in the cohort and decide whether they will be
included in the case-control sample, conditional on their disease status. Subject j
with disease status i is included in the case-control sample with probability µi =
(ni/nc)/pr(D = i). Let Rj be an indicator variable with value 1 if subject j is
included in the sample and 0 otherwise. Then prvps(R = 1 | D = i) = µi, where prvps
denotes probability under VPS. In VPS, inclusion status R depends only on D and
not on covariates (H, X), so that prvps(R = 1 | H, X,D) = prvps(R = 1 | D). Hence
R and (H, X) are conditionally independent given D.
stage 3: Measure covariates (H, X) on those in the case-control sample; that is, measure
(Hj , Xj) on subjects with Rj = 1.
Let N1 be the number of cases included in the case-control sample and N0 be the number
of controls. Then Ni | Mi ∼ binomial(Mi, µi) and hence
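\[
\begin{aligned}
E_{vps}(N_i) &= E_{vps}\{E_{vps}(N_i \mid M_i)\} \\
&= E_{vps}(M_i \mu_i) = \mu_i E_{vps}(M_i) = \frac{n_i/n_c}{\operatorname{pr}(D = i)}\, n_c \operatorname{pr}(D = i) = n_i.
\end{aligned}
\]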
Under VPS the Ni are random and so is their sum N = N0 + N1. By contrast, under VSS
the size of the case-control sample is fixed at n.
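The three stages and the result Evps(Ni) = ni can be checked empirically with a small simulation. The Python sketch below is illustrative only: the function name, the cohort size, the target sample sizes and the disease probability are hypothetical values chosen for the example, and covariates (H, X) are omitted because only the counts N0 and N1 are of interest here.

```python
import random


def simulate_vps_counts(nc, n0, n1, p1, rng):
    """One realization of (N0, N1) under variable probability sampling.

    nc: cohort size; n0, n1: target numbers of controls and cases;
    p1: pr(D = 1) in the population; rng: a random.Random instance.
    """
    p = {0: 1.0 - p1, 1: p1}
    n = {0: n0, 1: n1}
    # Stage 2 inclusion probabilities mu_i = (n_i / nc) / pr(D = i).
    mu = {i: (n[i] / nc) / p[i] for i in (0, 1)}
    if any(m > 1.0 for m in mu.values()):
        raise ValueError("cohort too small: an inclusion probability exceeds 1")
    counts = {0: 0, 1: 0}
    for _ in range(nc):
        d = 1 if rng.random() < p1 else 0   # stage 1: disease status
        if rng.random() < mu[d]:            # stage 2: inclusion given D only
            counts[d] += 1
    return counts[0], counts[1]


# Hypothetical settings: cohort of 2000, targets of 50 controls and 50 cases.
rng = random.Random(2006)
draws = [simulate_vps_counts(2000, 50, 50, 0.05, rng) for _ in range(300)]
mean_n0 = sum(d[0] for d in draws) / len(draws)
mean_n1 = sum(d[1] for d in draws) / len(draws)
```

Across realizations the totals N0 + N1 vary, while the averages `mean_n0` and `mean_n1` should lie close to the targets n0 and n1.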
The observed variables on the cohort members can be written as (Dj , Rj , RjHj , RjXj); j =
1, . . . , nc, to reflect the fact that (H, X) are not observed on those with R = 0. For those
of the cohort in the case-control sample, we observe (D, R = 1, RH = H, RX = X) and for
others in the cohort not in the case-control sample (D, R = 0, RH = 0, RX = 0).
Under VPS, the cohort is an iid sample from prvps(D, R,RH,RX). We then focus on D
and (RH, RX) in the subsample for which R = 1, giving an iid sample from prvps(D, RH, RX |R = 1) = prvps(D, H,X | R = 1).
A.2 Equivalence of Probabilities Under VPS and VSS
We first show that the joint distributions prvps(D, H,X | R = 1) and prv(D, H,X) are
equal. Recall that the conditional distribution of H and X given disease status D under
VSS is the same as under the true case-control sampling, which implies that prv(D =
i,H, X) = pr(H, X | D = i) ni/n. We now establish that
prvps(D = i,H, X | R = 1) = pr(H, X | D = i) ni/n, (A.1)
by showing
1. prvps(H, X | D = i, R = 1) = pr(H, X | D = i) and
2. prvps(D = i | R = 1) = ni/n.
Since the right-hand side of equation (A.1) is prv(D = i,H,X), it follows that prvps(D, H,X |R = 1) = prv(D, H,X) as desired.
Showing that prvps(H, X | D = i, R = 1) = pr(H, X | D = i):
First recall that, in VPS, (H, X) and R are conditionally independent given D. Hence
prvps(H, X | D, R = 1) = prvps(H, X | D) and all that remains to show is prvps(H, X |D) = pr(H, X | D). Now prvps(H, X | D = 0) and prvps(H, X | D = 1) describe the
covariate distributions in controls and in cases of the sampled cohort, respectively. Since
this cohort is drawn randomly from the population, prvps(H, X | D = 0) = pr(H, X | D = 0)
and prvps(H, X | D = 1) = pr(H, X | D = 1). In summary, prvps(H, X | D) = pr(H, X | D)
and hence prvps(H, X | D, R = 1) = pr(H, X | D).
Showing that prvps(D = i | R = 1) = ni/n:
We have
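\[
\operatorname{pr}_{vps}(D = i, R = 1) = \operatorname{pr}_{vps}(R = 1 \mid D = i)\, \operatorname{pr}_{vps}(D = i) = \frac{n_i/n_c}{\operatorname{pr}(D = i)}\, \operatorname{pr}(D = i) = \frac{n_i}{n_c}.
\]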
Hence prvps(R = 1) = n0/nc + n1/nc = n/nc, so that
prvps(D = i | R = 1) = prvps(D = i, R = 1)/prvps(R = 1) = ni/n.
The same reasoning can be used to show that prvps(D = i | R = 1) = ni/n whenever
prvps(R = 1 | D = i) = k ni/pr(D = i) for some constant k. However, only the choice
k = 1/nc leads to Evps(Ni) = ni.
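The role of the constant k can be verified with exact arithmetic. The following Python sketch is an illustration with hypothetical numbers (the function name and the values of n0, n1, nc and pr(D = 1) are assumptions, not values from the study); it evaluates prvps(D = i | R = 1) and Evps(Ni) for a general inclusion probability k ni/pr(D = i).

```python
from fractions import Fraction


def vps_properties(n0, n1, nc, p1, k):
    """Exact pr_vps(D = i | R = 1) and E_vps(N_i) for inclusion constant k.

    Inclusion probabilities are pr_vps(R = 1 | D = i) = k * n_i / pr(D = i).
    """
    p = {0: 1 - p1, 1: p1}
    n = {0: n0, 1: n1}
    incl = {i: k * n[i] / p[i] for i in (0, 1)}
    joint = {i: incl[i] * p[i] for i in (0, 1)}        # pr_vps(D = i, R = 1)
    pr_r1 = joint[0] + joint[1]                        # pr_vps(R = 1)
    cond = {i: joint[i] / pr_r1 for i in (0, 1)}       # pr_vps(D = i | R = 1)
    expected = {i: nc * p[i] * incl[i] for i in (0, 1)}  # E(M_i) * mu_i
    return cond, expected


# Hypothetical values: n0 = 60, n1 = 40, nc = 2000, pr(D = 1) = 1/20.
p1 = Fraction(1, 20)
cond_a, exp_a = vps_properties(60, 40, 2000, p1, Fraction(1, 2000))  # k = 1/nc
cond_b, exp_b = vps_properties(60, 40, 2000, p1, Fraction(1, 4000))  # k = 1/(2 nc)
```

In both calls the conditional probabilities equal ni/n, but only k = 1/nc gives expected counts equal to n0 and n1; halving k halves the expected counts.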
Appendix B
Derivation of prv(H | X ; γv)
Recall that the conditional joint distribution of H and X given disease status is the same
under the true case-control sampling and under the variant sampling scheme:
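\[
\operatorname{pr}(H, X \mid D; \vartheta) = \operatorname{pr}_v(H, X \mid D; \vartheta_v).
\]
Therefore, the joint distribution of H and X in the hypothetical population is given by
\[
\begin{aligned}
\operatorname{pr}_v(H, X; \gamma_v)
&= \sum_{i=0}^{1} \operatorname{pr}_v(H, X \mid D = i; \vartheta_v)\, \operatorname{pr}_v(D = i) \\
&= \sum_{i=0}^{1} \operatorname{pr}(H, X \mid D = i; \vartheta)\, \operatorname{pr}_v(D = i) \\
&= \sum_{i=0}^{1} \frac{\operatorname{pr}(D = i \mid H, X; \beta_0, \beta_1)\, \operatorname{pr}(H, X; \gamma)}{\operatorname{pr}(D = i)}\, \operatorname{pr}_v(D = i) \\
&= \operatorname{pr}(H, X; \gamma) \sum_{i=0}^{1} \frac{\operatorname{pr}(D = i \mid H, X; \beta_0, \beta_1)}{\operatorname{pr}(D = i)}\, \operatorname{pr}_v(D = i) \\
&= \operatorname{pr}(H, X; \gamma) \sum_{i=0}^{1} \left\{ \frac{\exp\{i(\beta_0 + z(H, X)\beta_1)\}}{1 + \exp\{\beta_0 + z(H, X)\beta_1\}}\, \frac{\operatorname{pr}_v(D = i)}{\operatorname{pr}(D = i)} \right\} \\
&= \operatorname{pr}(H, X; \gamma)\, \frac{\operatorname{pr}_v(D = 0)}{\operatorname{pr}(D = 0)} \left\{ \frac{1}{1 + \exp\{\beta_0 + z(H, X)\beta_1\}} + \frac{\exp\{\beta_0 + z(H, X)\beta_1\}}{1 + \exp\{\beta_0 + z(H, X)\beta_1\}}\, \frac{\operatorname{pr}_v(D = 1)\, \operatorname{pr}(D = 0)}{\operatorname{pr}(D = 1)\, \operatorname{pr}_v(D = 0)} \right\}
\end{aligned}
\tag{B.1}
\]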
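The algebra leading to (B.1) can be checked numerically on a toy example. The Python sketch below is only a sanity check under simplifying assumptions: (H, X) is collapsed to a few discrete cells with hypothetical probabilities, z(H, X) is taken to be a scalar, and all parameter values and names are invented for illustration. It compares the first line of the derivation with the final closed form.

```python
import math


def check_b1(cell_probs, z_values, beta0, beta1, prv_d1):
    """Compare the mixture form of pr_v(H, X) with the closed form (B.1).

    cell_probs: pr(H, X) for each discrete cell (sums to 1);
    z_values: scalar z(H, X) per cell; prv_d1: pr_v(D = 1).
    Returns two lists: the direct mixture and the closed form, per cell.
    """
    pen = [1.0 / (1.0 + math.exp(-(beta0 + z * beta1))) for z in z_values]
    pr_d1 = sum(g * p for g, p in zip(cell_probs, pen))   # pr(D = 1)
    pr_d = {0: 1.0 - pr_d1, 1: pr_d1}
    prv_d = {0: 1.0 - prv_d1, 1: prv_d1}
    ratio = prv_d[1] * pr_d[0] / (pr_d[1] * prv_d[0])
    direct, closed = [], []
    for g, p, z in zip(cell_probs, pen, z_values):
        # Bayes' rule: pr(H, X | D = i) = pr(D = i | H, X) pr(H, X) / pr(D = i).
        cond = {0: (1.0 - p) * g / pr_d[0], 1: p * g / pr_d[1]}
        direct.append(cond[0] * prv_d[0] + cond[1] * prv_d[1])
        e = math.exp(beta0 + z * beta1)
        closed.append(g * prv_d[0] / pr_d[0]
                      * (1.0 / (1.0 + e) + e / (1.0 + e) * ratio))
    return direct, closed


# Hypothetical toy values: three cells, equal numbers of cases and controls.
direct, closed = check_b1([0.5, 0.3, 0.2], [0.0, 1.0, 2.0], -3.0, 0.5, 0.5)
```

The two lists agree to floating-point precision, and each sums to one, as expected for a probability distribution over the cells.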
The intercept terms βv0 and β0 are related through βv0 = β0 + log[prv(D = 1) pr(D = 0) / {pr(D = 1) prv(D = 0)}] (Spinka et al. 2005). Thus