
APPROXIMATE LIKELIHOOD INFERENCE FOR HAPLOTYPE

RISKS IN CASE-CONTROL STUDIES OF A RARE DISEASE

by

Zhijian Chen

B.Sc. in Statistics, Peking University, 2003.

a project submitted in partial fulfillment

of the requirements for the degree of

Master of Science

in the Department

of

Statistics and Actuarial Science

© Zhijian Chen 2006

SIMON FRASER UNIVERSITY

Fall 2006

All rights reserved. This work may not be

reproduced in whole or in part, by photocopy

or other means, without the permission of the author.

APPROVAL

Name: Zhijian Chen

Degree: Master of Science

Title of project: Approximate Likelihood Inference for Haplotype Risks in Case-

Control Studies of a Rare Disease

Examining Committee: Dr. Gary Parker

Chair

Dr. Brad McNeney, Senior Supervisor, Simon Fraser University

Dr. Jinko Graham, Senior Supervisor, Simon Fraser University

Dr. Xiaoqiong Joan Hu, External Examiner, Simon Fraser University

Date Approved:


Abstract

The case-control design is the standard design for studying risk factors for rare diseases. Genetic association case-control studies often include haplotypes as risk factors. Haplotypes are not always observed, though observable single-locus genotypes contain partial haplotype information. Missing haplotypes lead to the analysis of data with missing covariates. Maximum likelihood (ML) inference is then based on solving a set of weighted score equations. However, the weights cannot be calculated exactly. We describe three methods that approximate ML by approximating the weights: i) naive application of prospective ML (PML), which ignores the case-control sampling design; ii) an estimating equations (EE) approach; and iii) a hybrid approach, which is based on PML but uses improved weights suggested by EE. We investigate the statistical properties of the three methods by simulation. In our simulations, the hybrid approach gave more accurate estimates of statistical interactions than PML and more accurate standard errors than EE.


Acknowledgements

I am greatly indebted to my co-supervisors Dr. Jinko Graham and Dr. Brad McNeney

for their support and guidance throughout my 2 years at SFU, and for their influence and

inspiration to me. I feel very fortunate to have been part of the statistical genetics group.

I also want to thank them, as well as Dr. Richard Lockhart, Dr. Joan Hu and Dr. Boxin

Tang, for their encouragement in my studies. I would like to express my gratitude to the

faculty and staff of the Department of Statistics and Actuarial Science at Simon Fraser

University for their devotion to the graduate programs. Many thanks to my friends and

fellow students: Lihui, Li, Celes, Linda, Cindy, Lucy, Yunfeng, Tony, Pritam, Matthew,

David, Kelly, Linnea, Dean, Gurbakhshash, Ryan, Wendell, Saman, Crystal, Eric, John,

Mark, Darcy, Simon, Darby, Jason and many more. Special thanks to Chunfang and

Wilson for helping me settle down, and to Ji-Hyung for her patience with my questions

in the completion of my thesis. I also wish to acknowledge Dr. Keith Walley and the

iCAPTURE Center at St. Paul’s Hospital for offering me the opportunity of joining them

for a summer term. Finally, I would like to express my deep gratitude to my parents and

my sister for their love, understanding and support all the way.


Contents

Approval

Abstract

Acknowledgements

Contents

List of Figures

List of Tables

1 Introduction

1.1 Genetic Background

1.2 Genetic Association Studies

1.3 Overview of the Thesis

2 Methods

2.1 PML for Cohort Studies

2.2 RML for Case-Control Data

2.2.1 A Variant Sampling Scheme

2.2.2 MLEs via EM algorithm

2.2.3 MLEs by direct solution of score equations

2.3 Approximate Score Methods

2.3.1 PML as an Approximate Score Method

2.3.2 MPSE as an Approximate Score Method

2.3.3 EE as an Approximate Score Method

2.3.4 A PML/MPSE Hybrid Approach

3 Simulation Study

3.1 Design of Simulations

3.2 Statistical Properties

3.3 Simulation Results

3.3.1 Overview of Simulation Conclusions

3.3.2 Results for Simulation Scenarios i) and ii)

3.3.3 Results for Simulation Scenarios iii) and iv)

4 Conclusions and Future Work

Appendices

A Variable Probability Sampling

A.1 Overview of VPS

A.2 Equivalence of Probabilities Under VPS and VSS

B Derivation of $\mathrm{pr}_v(H \mid X; \gamma_v)$

C Simulation Results

Bibliography


List of Figures

3.1 Results for $\beta_{HX}$ from simulation scenario i)

3.2 Results for $\beta_{HX}$ from simulation scenario ii)

3.3 Boxplots of bias in estimation of $\beta_{HX}$ (upper plot) and bias in estimation of associated standard error (lower plot) from simulation scenario i)

3.4 Boxplots of bias in estimation of $\beta_{HX}$ (upper plot) and bias in estimation of associated standard error (lower plot) from simulation scenario ii)

3.5 Results for $\beta_{HX}$ from simulation scenario iii)

3.6 Results for $\beta_{HX}$ from simulation scenario iv)

3.7 Boxplots of bias in estimation of $\beta_{HX}$ (upper plot) and bias in estimation of associated standard error (lower plot) from simulation scenario iii)

3.8 Boxplots of bias in estimation of $\beta_{HX}$ (upper plot) and bias in estimation of associated standard error (lower plot) from simulation scenario iv)

3.9 Estimated bias of standard error for GLM, PML and HYBRID after excluding EE


List of Tables

3.1 Haplotype frequencies used in the simulations

3.2 The four simulation scenarios

3.3 Adjusted intercept $\beta_0$

C.1 Simulation results for scenario i)

C.2 Simulation results for scenario ii)

C.3 Simulation results for scenario iii)

C.4 Simulation results for scenario iv)


Chapter 1

Introduction

1.1 Genetic Background

Genetic material is stored on chromosomes in the nucleus of every human cell. Each

chromosome contains a single molecule of duplex DNA along its length, tightly and complexly

coiled with protein. The units of each DNA strand are nucleotides, each of which

contains one of the four chemical bases: Adenine (A), Guanine (G), Thymine (T) and

Cytosine (C). In the duplex DNA, A and T are paired and C and G are paired. A gene

is a sequence of nucleotides along a DNA molecule that influences one or more hereditary

traits (or phenotypes). The physical position of a gene in a chromosome is called its locus

(plural loci). Homologous chromosomes are a pair of chromosomes inherited separately

from parents that contain the same genetic loci in the same order. Every normal person

has 22 pairs of non-sex chromosomes (autosomes) and a pair of sex chromosomes, with

XX for females and XY for males.

Genetic variations exist in most natural populations of organisms, and are the source

of diversity in the population. We call such genetic differences between individuals at iden-

tifiable loci DNA markers. The variant forms at a locus are called alleles. When the

alleles an individual carries at a locus are different from each other, the genotype of this

individual at this locus is said to be heterozygous; otherwise, it is said to be

homozygous. In a large random-mating population with no selection, mutation, or migration, the

allele frequencies and the genotype frequencies are constant from generation to generation;


furthermore, there is a simple relationship between allele frequencies and genotype frequen-

cies. This principle, known as Hardy-Weinberg Equilibrium (HWE), is as follows: if

the allele frequencies are p for the common allele (often denoted by 0) and q for the rare

allele (often denoted by 1) among the parent population, then the genotype frequencies for

0/0, 0/1 (or 1/0) and 1/1 among the next generation are $p^2$, $2pq$ and $q^2$, respectively, where

“/” separates alleles on the maternally inherited chromosome from those on the paternally

inherited chromosome.
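The HWE relationship between allele and genotype frequencies can be checked numerically. The following is a minimal sketch; the function name and the allele frequency of 0.2 are illustrative assumptions and do not come from the thesis.

```python
def hwe_genotype_frequencies(q):
    """Return HWE genotype frequencies for a diallelic locus.

    q is the frequency of the rare allele (coded 1); p = 1 - q is the
    frequency of the common allele (coded 0).
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("allele frequency must lie in [0, 1]")
    p = 1.0 - q
    # Heterozygotes 0/1 and 1/0 are pooled, which gives the factor 2.
    return {"0/0": p * p, "0/1": 2.0 * p * q, "1/1": q * q}


# Hypothetical rare-allele frequency, chosen only for illustration.
freqs = hwe_genotype_frequencies(0.2)
```

The three frequencies sum to one because $p^2 + 2pq + q^2 = (p + q)^2$ and $p + q = 1$.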

The most common type of DNA marker is the single-nucleotide polymorphism

(SNP), which involves two variants at a single base pair, each of which is observed in the

general population at a frequency greater than 1%. SNPs are abundant and are distributed

approximately uniformly throughout the genome. The terms “allele” and “locus” can also

refer to the variant forms and physical position of a SNP. Since the two DNA strands

are complementary, only the nucleotide in a single strand is read during SNP genotyping,

and single letters are used to record the alleles. At a diallelic SNP site, for example,

the genotype of an individual who has a T-A base pair in one chromosome and a C-G base

pair in the homologous chromosome will be recorded as heterozygote T/C (or C/T), while

the genotype of an individual who has a T-A base pair at the corresponding site in both

homologous chromosomes will be homozygote T/T. The genetic constitution of an individual

chromosome is called a haplotype. A haplotype can also refer to the combination of alleles

over a sequence of loci on the same chromosome and can be treated as a “super allele”.

Meiosis is a special type of cell division that produces gametes (sperm and egg cells).

During meiosis, homologous chromosomes may exchange genetic material in a process called

crossing-over. When crossing-over events lead to gametes with alleles of different parental

origin at two loci, the gametes are said to be recombinant at these loci. The ratio of the

expected number of recombination events between two loci to the total number of gametes is

the recombination rate between these two loci. When two loci are so close on a chromosome

that recombination is almost impossible, the recombination rate between these two loci is

0. When two loci are far from each other on a chromosome, the expectation is that half

of the gametes will be recombinant, and the recombination rate between these two loci is

0.5. The tendency of nearby loci to co-segregate to the next generation leads to correlation


between alleles at nearby loci, a phenomenon called linkage disequilibrium (LD).

1.2 Genetic Association Studies

Many complex traits, such as height, weight, and susceptibility to disease, are strongly in-

fluenced by both environmental factors and genetic factors. Genetic association studies are

procedures to detect correlations between genetic variants and disease phenotypes on a pop-

ulation scale, relying on LD between genotyped markers and unknown disease loci. Due to

their high density throughout the genome and the development of genotyping technologies,

SNPs have become popular tools in mapping complex disease genes. Single SNP-based

methods are powerful approaches in detecting disease associations with genetic variants,

provided that LD between the genotyped markers and an unknown disease locus is strong.

When LD decreases, however, the power of single marker association tests may suffer be-

cause the information contained in flanking markers is ignored. There is evidence that in

some cases the combination of closely linked SNP markers on the same chromosome will be

in stronger LD with a disease pre-disposing locus, in which case haplotype based methods

may be better able to capture information on disease associations than single SNP-based

tests (Akey et al. 2001). Therefore, there has been great interest in recent years in using

haplotypes as risk factors to identify the genetic basis of complex diseases.

The determination of the pair of haplotypes (or haplogenotype) in a subject is called

haplotype phasing. Although it is possible to infer haplogenotypes through molecular meth-

ods or through genotyping additional family members, such methods are too costly and

laborious to be practical in large-scale population studies. The polymerase chain reaction

method, the current standard genotyping technology, only allows experimenters to deter-

mine the two alleles at a single locus for a subject, without specifying which chromosome

each allele is from. Therefore, for a subject who is heterozygous at more than one locus,

there is more than one possible haplotype pair consistent with the observed single-locus

genotypes, and haplotype phase for this subject is said to be ambiguous. For example, if

the observed single-locus genotypes of a subject at three SNP loci are 0/0, 0/1 and 0/1,

then there are two possible haplogenotypes this subject may carry: 000/011 or 010/001.
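Phase ambiguity of this kind can be enumerated mechanically. The sketch below is an illustration only (the function name and the tuple encoding of genotypes are assumptions, not part of the thesis): it lists every unordered haplotype pair consistent with a set of single-locus genotypes.

```python
from itertools import product


def consistent_haplogenotypes(genotypes):
    """List the unordered haplotype pairs consistent with single-locus genotypes.

    genotypes is a list of (allele, allele) tuples, one per locus, with no
    information about which chromosome carries which allele.
    """
    pairs = set()
    # At each locus, choose which of the two alleles sits on the first chromosome.
    for orientation in product((0, 1), repeat=len(genotypes)):
        h1 = "".join(str(g[o]) for g, o in zip(genotypes, orientation))
        h2 = "".join(str(g[1 - o]) for g, o in zip(genotypes, orientation))
        # Sorting makes the pair unordered, so h1/h2 and h2/h1 count once.
        pairs.add(tuple(sorted((h1, h2))))
    return sorted(pairs)


# The three-SNP example from the text: genotypes 0/0, 0/1 and 0/1.
example = consistent_haplogenotypes([(0, 0), (0, 1), (0, 1)])
```

For the example in the text this returns the two pairs 000/011 and 001/010. A subject who is heterozygous at m loci has 2^(m-1) consistent haplogenotypes, so ambiguity grows quickly with the number of heterozygous loci.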


Recently, statistical methods have been developed to infer haplogenotypes of subjects and

estimate haplotype frequencies from single-locus genotype data on a population sample (e.g.

Excoffier and Slatkin 1995, Stephens et al. 2001). These algorithms reconstruct haplotypes

by estimating posterior probabilities for the haplogenotype of each subject, conditional on

observed single-locus genotypes. Though not strictly correct, investigators often evaluate

differences in haplotype frequencies between diseased and disease-free subjects by estimating

haplotype frequencies and performing standard chi-squared tests of association. This prac-

tice incorrectly assumes that diseased and disease-free subjects are randomly sampled from

separate randomly-mating populations and ignores the variation in the estimated haplotype

frequencies due to missing phase. Epstein and Satten (2003) developed maximum-likelihood

methods that overcome these difficulties. However, they do not consider the effects of other

risk factors (besides haplotypes) in their models of disease risk. For complex diseases with

more than one risk factor and possible interactions between risk factors, statistically sound

methods are required for more detailed models of risk.

There are three commonly used observational study designs in epidemiology: cohort

studies, cross-sectional studies and case-control studies. A cohort study is a study in

which each subject presently has certain exposures (covariates) or receives a particular treat-

ment, and is followed up over time for the outcome of interest. In cross-sectional studies,

covariates and the outcome are measured at the same point in time. Cohort and cross-

sectional studies are often referred to as prospective designs. An important requirement for

prospective studies is that the disease outcome of interest should be common; otherwise,

the number of outcomes observed will be too small for reliable statistical inference. For a

rare disease, a widely accepted design is the case-control study, which looks backwards in

time to measure exposures on subjects of known disease status. This retrospective study

design is often inexpensive and convenient.

1.3 Overview of the Thesis

In this thesis we consider the problem of inference of haplotype risks in case-control studies

of a rare disease when haplotype phase is not always observed. Missing haplotype phase


leads to logistic regression analysis of haplotype risks with missing covariate data. Maximum

likelihood (ML) inference is then based on the solution to a set of weighted complete-data

score equations. However, as discussed later in the thesis, the weights cannot be calculated

exactly. We therefore describe methods that approximate ML by approximating the weights.

Ghadessi (2005) discussed the application of prospective maximum likelihood (PML; e.g.

Burkett 2002) to case-control data, and how PML approximates the correct retrospective

maximum likelihood (RML) approach. She also compared PML to an estimating equation

(EE) approach (Zhao et al. 2003) developed specifically for case-control data. However, her

simulation study involved disease risk models with haplotype effects only, and no nongenetic

effects or interactions. We extend her work in two ways. First, we discuss new methodology

(Spinka et al. 2005) published in the interim. Second, we conduct a simulation study in

which the risk model includes nongenetic covariates and haplotype-nongenetic interactions.

Our simulations extend those of Spinka et al. (2005) to consider less extreme interaction

effects and rare disease probabilities, which we consider to be more plausible.

An outline of the thesis is as follows. In Chapter 2 we discuss RML and approximate

RML methods. We include a review of the material from Ghadessi that describes PML for

cohort or cross-sectional data with missing haplotype phase, RML for case-control studies

with missing haplotype phase, and the justification of PML as an approximation to RML

that uses the correct complete-data scores but approximate weights. We then describe the

modified prospective score equation (MPSE) approach of Spinka et al. and show that it is

also an approximate RML method, with a different approximation to the weights than that

of PML. We next connect MPSE with the earlier EE approach of Zhao et al. Finally, we

show how a simple modification of PML yields the MPSE estimator of regression parameters.

However, since this modification to PML does not change the way the variance is estimated,

our variance estimator is different from that suggested by Spinka et al. We therefore call our

modified PML/MPSE approach the hybrid approach. In Chapter 3 we present simulations

to compare statistical properties of PML, EE and the hybrid approach in case-control studies

of a rare disease. We could not obtain reliable software implementing MPSE and hence this

approach could not be included in our simulation study. Chapter 4 summarizes the main

conclusions and discusses directions for future research.

Chapter 2

Methods

A variety of statistical methods have been proposed for haplotype-disease association studies

using unphased genotype data on unrelated subjects. These methods are often developed

within the generalized linear model (GLM) framework, in which the distribution of the

dependent variable Y , given covariate vector Z, is in the exponential family:

$$f(Y \mid Z; \eta, \phi) = \exp\left\{\frac{Y\eta - b(\eta)}{\phi} + c(Y, \phi)\right\},$$

where b(·) and c(·) are known functions, and η and φ are the canonical parameter and

the dispersion parameter, respectively. A link function g(µ) = η, which is monotonic and

differentiable, relates µ ≡ E[Y | Z] to the linear predictor η = β0+Zβ1, where β1 is a vector

of coefficients associated with the effects of Z. For a binary disease response, the logit link

$g(\mu) = \log\{\mu/(1-\mu)\}$ is the canonical link, and the dispersion parameter φ is 1. Let D denote the

status of a disease with

$$D = \begin{cases} 1 & \text{presence of disease} \\ 0 & \text{absence of disease.} \end{cases}$$

Let H and X respectively denote the haplogenotype factor and non-genetic (environmental)

factor. The risk associated with haplogenotype H is modeled in terms of the joint effect of

the pair of haplotypes constituting H. Thus in genetic association studies of haplotype and


environmental risk factors, the standard disease risk model is the logistic regression model:

$$\log \frac{\mathrm{pr}(D = 1 \mid H, X)}{\mathrm{pr}(D = 0 \mid H, X)} = \beta_0 + z(H, X)\beta_1, \tag{2.1}$$

where z(H, X) is a row covariate vector that codes the effects of X and H and their possible

interaction, and where β1 is a vector of log odds-ratios. For example, if only a single

risk haplotype hr and a continuous non-genetic covariate X are considered, and hr has a

multiplicative effect on the disease risk, then

$$z(H, X)\beta_1 = \beta_X X + \beta_H N_{h_r}(H) + \beta_{HX} N_{h_r}(H) X, \tag{2.2}$$

where $N_{h_r}(H)$ denotes the number of copies of the risk haplotype $h_r$ in H, and where $\beta_1 = (\beta_X, \beta_H, \beta_{HX})^T$ is the vector of regression parameters associated with the main effect of X, the main effect of $h_r$ and the interaction effect of X with $h_r$, respectively.
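To make the risk model in equations (2.1) and (2.2) concrete, the following minimal sketch evaluates the disease probability for a given number of copies of the risk haplotype and a covariate value. The function name and all parameter values are hypothetical and are chosen only for illustration.

```python
import math


def disease_risk(n_copies, x, beta0, beta_x, beta_h, beta_hx):
    """Disease probability under the logistic model (2.1) with predictor (2.2).

    n_copies is the number of copies (0, 1 or 2) of the risk haplotype and
    x is the value of the continuous non-genetic covariate.
    """
    eta = beta0 + beta_x * x + beta_h * n_copies + beta_hx * n_copies * x
    return 1.0 / (1.0 + math.exp(-eta))


def odds(p):
    """Convert a probability to odds."""
    return p / (1.0 - p)


# Hypothetical parameter values for a rare disease.
params = dict(beta0=-3.0, beta_x=0.1, beta_h=0.5, beta_hx=0.2)
risk_no_copy = disease_risk(0, 0.0, **params)
risk_one_copy = disease_risk(1, 0.0, **params)
odds_ratio = odds(risk_one_copy) / odds(risk_no_copy)
```

At x = 0 the odds ratio for one copy versus no copies equals exp(beta_h), which illustrates why the components of $\beta_1$ are interpreted as log odds-ratios.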

When single-locus genotypes, instead of haplogenotypes, are observed on subjects, miss-

ing haplotype phase leads to logistic regression analysis with incomplete data. Burkett

(2002) developed a PML method for inference of trait associations with SNP-haplotypes

and non-genetic covariates, using unphased genotype data of unrelated subjects, for cohort

or cross-sectional studies. Independently, Lake et al. (2003) proposed a similar prospective

method that built on the score test approach of Schaid et al. (2002), with calculation of

standard errors slightly different from that of Burkett (2002). The prospective methods of

Burkett (2002) and Lake et al. (2003) are respectively implemented in the hapassoc package

and the haplo.stats package for the R programming environment. In this chapter, we review

material from Ghadessi (2005) which describes PML and RML, and justifies PML as an

approximation to RML that uses the correct complete-data scores but approximate weights.

We then describe the MPSE approach of Spinka et al. (2005) as another approximate RML

method, which uses a different approximation to the RML weights than PML. Next, we

connect MPSE with the earlier EE approach of Zhao et al. (2003). Finally, we describe our

modified PML/MPSE hybrid approach.


2.1 PML for Cohort Studies

Burkett (2002), Lake et al. (2003) and Burkett et al. (2004) maximize the prospective

likelihood for cohort or cross-sectional data by an EM algorithm. In this section we give a

summary of the notation and details of the EM algorithm in this context.

Index subjects in a cohort sample of size n by j = 1, . . . , n. Note that indexing co-

hort subjects by j differs from Ghadessi (2005), who used i, but is more consistent with

the indexing of case-control subjects by j in Section 2.2. Throughout, in a slight abuse of

notation, we use capital letters to denote random variables or observed values, as appropri-

ate. Let Dj , Gj , Hj and Xj respectively denote the disease status, single-locus genotypes,

haplogenotype and non-genetic covariates for the jth subject. For population-based studies

with independent subjects, the log-likelihood is the summation of individual log-likelihood

contributions from each subject. Let β = (β0, β1) denote the logistic regression parameters,

and let γ parameterize the joint distribution of H and X. Then θ = (β, γ) is a vector of

parameters that completely describes the complete-data log-likelihood

$$l_c(\theta; \mathcal{D}, \mathcal{H}, \mathcal{X}) = \sum_{j=1}^{n} l_{cj}(\theta; D_j, H_j, X_j) = \sum_{j=1}^{n} \log \mathrm{pr}(D_j, H_j, X_j; \theta),$$

where $\mathcal{D}$ is the disease status data on all subjects, $\mathcal{H}$ is the haplogenotype data on all subjects and $\mathcal{X}$ is the non-genetic covariate data on all subjects. When haplotype phase is not directly observable, the likelihood is based on the observed data $(\mathcal{D}, \mathcal{G}, \mathcal{X})$, where $\mathcal{G}$ is the single-locus genotype data on all subjects. The missing haplotype phase can be accounted for by the use of the EM algorithm, in which the ML estimates $\hat{\theta}$ are obtained by iteratively maximizing the conditional expectation of the complete-data log-likelihood, given the observed data $(\mathcal{D}, \mathcal{G}, \mathcal{X})$ and parameter estimates $\theta^{(t)}$ from the previous

step. Ibrahim (1990) showed that, in the GLM framework, estimation of the parameters

with missing information on categorical covariates can be reduced to iterative weighted

regression, in which the posterior weights of all consistent complete covariate values given

the observed data and θ(t) are calculated in the E-step of the EM algorithm. This method

is sometimes referred to as the EM algorithm by the method of weights. In the context

of haplotype risk inference with missing haplotype phase, the method of weights involves


extending each subject in the original sample into “pseudo-individuals” that have the same

disease status and non-genetic covariates, but different haplogenotypes, all of which are

consistent with the subject’s observed single-locus genotypes. In such an extended sample,

complete covariate information is observed on all pseudo-individuals. By properly weighting

all pseudo-individuals, the estimates of θ can be obtained iteratively, as described in detail

in Burkett (2002).

A complicating feature of logistic regression inference of haplotype risks with missing

haplotype phase is that the distribution of haplogenotypes cannot be identified from data on single-locus genotypes (e.g. Epstein and Satten 2003). The problem arises from the fact that certain haplogenotypes are never observed on their own unambiguously. PML approaches developed to date (Burkett 2002, Burkett et al. 2004, Lake et al. 2003) solve this identifiability problem by imposing Hardy-Weinberg proportions (HWP) for haplogenotypes.

Under independence of H and X it can be shown (e.g. Burkett 2002) that the marginal

distribution of X need not be estimated. HWP of haplogenotype frequencies means that

the probability a subject carries haplogenotype H = (hk, hk′) is given by

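$$\mathrm{pr}(H = (h_k, h_{k'})) = \begin{cases} \gamma_{h_k}^2 & \text{if } h_k = h_{k'} \\ 2\gamma_{h_k}\gamma_{h_{k'}} & \text{if } h_k \neq h_{k'}, \end{cases} \tag{2.3}$$

where $\gamma_{h_k}$ is the frequency of haplotype $h_k$. Haplotype frequencies can be estimated from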

data on single-locus genotypes, based on information provided by haplotypes within hap-

logenotypes that are observed unambiguously.
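Equation (2.3) translates directly into code. The sketch below is illustrative; the haplotype labels and frequencies are hypothetical. It computes the HWP probability of a haplogenotype and confirms that the probabilities of all unordered pairs sum to one.

```python
from itertools import combinations_with_replacement


def hwp_probability(h1, h2, hap_freq):
    """Probability of the unordered haplogenotype (h1, h2) under HWP, equation (2.3)."""
    if h1 == h2:
        return hap_freq[h1] ** 2
    # Two orderings of a heterozygous pair give the factor 2.
    return 2.0 * hap_freq[h1] * hap_freq[h2]


# Hypothetical haplotype frequencies that sum to one.
hap_freq = {"000": 0.5, "011": 0.3, "101": 0.2}

# Sum over all unordered pairs of haplotypes, including homozygous pairs.
total = sum(
    hwp_probability(h1, h2, hap_freq)
    for h1, h2 in combinations_with_replacement(sorted(hap_freq), 2)
)
```

Under HWP the haplogenotype distribution is determined by the haplotype frequencies alone, which is the reduction in parameters that makes the model identifiable from single-locus genotype data.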

Under the assumptions of H and X independence and HWP, the parameter θ is redefined

to be θ = (β, γh) where γh is the vector of haplotype frequencies. For the jth subject,

$j = 1, \ldots, n$, let $H_{G_j} = \{H_j^k ;\ k = 1, \ldots, K_j\}$ denote the set of haplogenotypes consistent

with the single-locus genotypes Gj . Briefly, the E-step of the algorithm involves calculating

the conditional expectation of the complete-data log-likelihood,

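$$Q(\theta \mid \theta^{(t)}) = \sum_{j=1}^{n} E_{\theta^{(t)}}[\, l_{cj}(\theta; D_j, H_j, X_j) \mid D_j, G_j, X_j \,] = \sum_{j=1}^{n} \sum_{k=1}^{K_j} w_{jk}(\theta^{(t)}) \log \mathrm{pr}(D_j, H_j^k, X_j; \theta),$$

where the $w_{jk}(\theta^{(t)})$ are the weights for each of the $K_j$ pseudo-individuals:

$$w_{jk}(\theta^{(t)}) = \mathrm{pr}(H_j^k \mid D_j, G_j, X_j; \theta^{(t)}) = \frac{\mathrm{pr}(D_j \mid H_j^k, X_j; \beta^{(t)})\, \mathrm{pr}(H_j^k; \gamma_h^{(t)})}{\sum_{k'=1}^{K_j} \mathrm{pr}(D_j \mid H_j^{k'}, X_j; \beta^{(t)})\, \mathrm{pr}(H_j^{k'}; \gamma_h^{(t)})}.$$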
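The weight calculation in the E-step can be sketched for a single subject as follows. This is a minimal illustration and not the hapassoc implementation: the function names, the single-risk-haplotype coding of equation (2.2), and all numerical values are assumptions made for the example.

```python
import math


def hwp_probability(h1, h2, hap_freq):
    """Haplogenotype probability under Hardy-Weinberg proportions."""
    if h1 == h2:
        return hap_freq[h1] ** 2
    return 2.0 * hap_freq[h1] * hap_freq[h2]


def e_step_weights(d, x, candidates, beta, hap_freq, risk_hap):
    """Posterior weights of the pseudo-individuals of one subject.

    d is the disease status, x the non-genetic covariate, candidates the
    haplogenotypes consistent with the subject's single-locus genotypes,
    and beta = (beta0, beta_x, beta_h, beta_hx) the current estimates.
    """
    beta0, beta_x, beta_h, beta_hx = beta
    unnormalized = []
    for h1, h2 in candidates:
        n_copies = (h1 == risk_hap) + (h2 == risk_hap)
        eta = beta0 + beta_x * x + beta_h * n_copies + beta_hx * n_copies * x
        p_case = 1.0 / (1.0 + math.exp(-eta))
        p_d = p_case if d == 1 else 1.0 - p_case
        # Numerator of the weight: pr(D | H, X; beta) * pr(H; gamma_h).
        unnormalized.append(p_d * hwp_probability(h1, h2, hap_freq))
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


# Hypothetical subject: a case with two consistent haplogenotypes.
candidates = [("000", "011"), ("001", "010")]
hap_freq = {"000": 0.4, "011": 0.3, "001": 0.2, "010": 0.1}
weights = e_step_weights(1, 0.0, candidates, (-3.0, 0.0, 1.0, 0.0), hap_freq, "011")
null_weights = e_step_weights(1, 0.0, candidates, (-3.0, 0.0, 0.0, 0.0), hap_freq, "011")
```

When the haplotype has no effect on risk, the disease-model factor cancels and the weights reduce to the normalized HWP probabilities. For a case, a positive haplotype effect shifts weight toward the haplogenotype that carries the risk haplotype.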

The M-step is to maximize the conditional expectation of the complete-data log-likelihood, $Q(\theta \mid \theta^{(t)})$, which is, up to a constant term,

$$Q(\theta \mid \theta^{(t)}) = \sum_{j=1}^{n} \sum_{k=1}^{K_j} w_{jk}(\theta^{(t)}) \log \mathrm{pr}(D_j \mid H_j^k, X_j; \beta) + \sum_{j=1}^{n} \sum_{k=1}^{K_j} w_{jk}(\theta^{(t)}) \log \mathrm{pr}(H_j^k; \gamma_h). \tag{2.4}$$

Standard errors of $\hat{\theta}$ are based on the inverse of the observed information evaluated at $\hat{\theta}$.

The observed information can be calculated by Louis’ formula (Louis, 1982), which expresses

the observed information as the conditional expectation of the complete-data information

minus the conditional variance of the complete-data score given the observed data. In our

context, the conditional expectations are weighted sums with the same weights as those

available from the final iteration of the EM algorithm.
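Written out, the verbal description of Louis' formula takes the following form. The symbols $S_{jk}$ and $B_{jk}$ are notation introduced here for illustration only: they stand for the complete-data score and information contributions of the kth pseudo-individual of subject j. The second term uses the independence of subjects, so that the conditional variance of the total score is a sum of per-subject conditional variances.

```latex
% S_{jk} and B_{jk} are illustrative notation, not taken from the thesis.
\begin{aligned}
S_{jk}(\theta) &= \frac{\partial}{\partial\theta}
    \log \mathrm{pr}(D_j, H_j^k, X_j; \theta), \qquad
B_{jk}(\theta) = -\frac{\partial^2}{\partial\theta\,\partial\theta^{T}}
    \log \mathrm{pr}(D_j, H_j^k, X_j; \theta), \\
I_{\mathrm{obs}}(\hat\theta)
  &= \sum_{j=1}^{n}\sum_{k=1}^{K_j} w_{jk}(\hat\theta)\, B_{jk}(\hat\theta) \\
  &\quad - \sum_{j=1}^{n}\left\{
      \sum_{k=1}^{K_j} w_{jk}(\hat\theta)\, S_{jk}(\hat\theta) S_{jk}(\hat\theta)^{T}
      - \Big(\sum_{k=1}^{K_j} w_{jk}(\hat\theta)\, S_{jk}(\hat\theta)\Big)
        \Big(\sum_{k=1}^{K_j} w_{jk}(\hat\theta)\, S_{jk}(\hat\theta)\Big)^{T}
    \right\}.
\end{aligned}
```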

2.2 RML for Case-Control Data

In a case-control study, information on covariates is collected retrospectively given disease

status. Therefore, a retrospective sampling model, rather than a prospective sampling

model, describes the data. Let (H0, X0) be baseline values of haplogenotypes H and en-

vironmental covariates X, and let D = i denote the disease status within the ith disease

group; i = 0, 1. The conditional probability of (H, X) given disease status D can be shown

to be

$$\mathrm{pr}(H, X \mid D = i) = c_i(\xi, \beta_1) \exp\{\xi(H, X) + i\,z(H, X)\beta_1\},$$


where ξ(H, X) = log{pr(H, X | D = 0)/pr(H0, X0 | D = 0)} is a nuisance function of H

and X, and ci(ξ, β1) is a normalizing constant (Prentice and Pyke 1979, Shin et al. 2006).

Let ϑ = (β1, ξ) parametrize the retrospective likelihood initially. It can be shown that

fitting a standard logistic regression model of disease risk is equivalent to RML analysis

for inference of log odds-ratio parameters β1, provided that covariates are fully observed

and the distribution of covariates is treated completely non-parametrically (Prentice and

Pyke 1979). However, when haplotypes are among the risk factors of interest and haplotype

phase is ambiguous, the result from Prentice and Pyke (1979) is not applicable.

In order to tackle the problem of missing haplotype phase, we first consider a variant

sampling scheme (Prentice and Pyke 1979) which is asymptotically equivalent to case-

control sampling. This scheme leads to a parametrization of the complete-data likelihood

that is useful for our problem. We next note that maximum-likelihood estimates (MLEs) for

the regression parameters may be obtained indirectly through the EM algorithm or directly

as the solutions to a set of score equations. This is important because, in later sections,

we will motivate the EM algorithm for PML as an approximation to the EM algorithm for

RML. We will also motivate the MPSE and EE estimating equations as approximations to

the RML score equations.

Slight changes in the notation are required for RML inference on case-control data with

missing haplotype phase. Let Dij , Gij , Hij and Xij respectively denote the disease status,

single-locus genotype data, haplogenotype and non-genetic covariate data for the jth subject

within the ith disease group; i = 0, 1 and j = 1, ..., ni, where ni is the number of subjects

in the ith disease group, and n = n0 + n1 is the total number of subjects in the sample.

As before, D, G and X represent disease status, single-locus genotype data and non-genetic

covariate data, respectively, of all subjects; and H is the latent haplogenotype data of all

subjects.

2.2.1 A Variant Sampling Scheme

The variant sampling scheme (VSS) is a two-stage hypothetical sampling design:

stage 1: Independently sample n disease status variates from a large hypothetical population. In each of the n binomial trials, the probability of sampling a case is n1/n and the probability of sampling a control is n0/n. Let N1 be the number of cases sampled and N0 = n − N1 be the number of controls sampled. In a study with n subjects, the expected numbers of cases and controls sampled are n1 and n0, respectively.

stage 2: Sample N0 covariate vectors from the conditional distribution of (H, X) given D = 0 and N1 covariate vectors from the conditional distribution of (H, X) given D = 1.

Since cases and controls respectively represent the diseased class and the disease-free class

in the underlying population, the conditional distribution of (H, X) in the second stage of

VSS is the same conditional distribution as in the true case-control sampling given disease

status. Under VSS, disease status and covariates are both random. The VSS hypothetical

population has a relative frequency n1/n of cases and n0/n of controls.
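The two stages of VSS can be mimicked in a short simulation. The sketch below is illustrative only: the function names are assumptions, and the toy conditional distributions (the number of copies of a risk haplotype given disease status) are hypothetical stand-ins for the true conditional distributions of (H, X) given D.

```python
import random


def variant_sampling(n0, n1, draw_control_covariates, draw_case_covariates, rng):
    """Draw one sample of size n = n0 + n1 under the variant sampling scheme."""
    n = n0 + n1
    # Stage 1: n independent trials, each yielding a case with probability n1 / n.
    statuses = [1 if rng.random() < n1 / n else 0 for _ in range(n)]
    # Stage 2: draw covariates from the conditional distribution given status.
    sample = []
    for d in statuses:
        draw = draw_case_covariates if d == 1 else draw_control_covariates
        sample.append((d, draw(rng)))
    return sample


def toy_controls(rng):
    """Hypothetical distribution of risk-haplotype copies among controls."""
    return rng.choices([0, 1, 2], weights=[0.81, 0.18, 0.01])[0]


def toy_cases(rng):
    """Hypothetical distribution of risk-haplotype copies among cases."""
    return rng.choices([0, 1, 2], weights=[0.64, 0.32, 0.04])[0]


rng = random.Random(2006)
sample = variant_sampling(6000, 4000, toy_controls, toy_cases, rng)
N1 = sum(d for d, _ in sample)  # random number of cases
N0 = len(sample) - N1           # random number of controls
```

Unlike true case-control sampling, the numbers of cases and controls are random under VSS; only their expectations equal n1 and n0.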

Let “prv” and “pr” denote probability density functions or mass functions, as appropri-

ate, under VSS and under true case-control sampling, respectively. From the description

above, the conditional distribution of (H, X) given disease status D under true case-control

sampling is the same as under VSS; that is pr(H, X | D ;ϑ) = prv(H, X | D ;ϑ). As shown

in Ghadessi (2005; page 18), prv(H, X | D ;ϑ) can be reparametrized as

\[
\mathrm{pr}_v(H, X \mid D = i; \vartheta_v) = \frac{\mathrm{pr}_v(D = i \mid H, X; \beta_{v0}, \beta_1)\,\mathrm{pr}_v(H, X; \gamma_v)}{\mathrm{pr}_v(D = i)}, \tag{2.5}
\]

where prv(D = i | H, X ;βv0, β1) is a logistic regression model with the same log odds-

ratio parameters β1 as the logistic model for a population sample, but with a different

intercept βv0 appropriate to the VSS hypothetical population; γv parameterizes the joint

distribution of H and X under VSS; and prv(D = i) is the disease risk under VSS, which

is ni/n by definition. The new parameter ϑv ≡ (β1, γv) reparametrizes ϑ = (β1, ξ), and βv0

is a function of β1 and γv. The complete-data likelihood would be a product of terms in

equation (2.5).

2.2.2 MLEs via EM algorithm

Write $l_c(\vartheta_v; H, X) = \log \mathrm{pr}_v(H, X \mid D; \vartheta_v)$ for the complete-data log-likelihood. For convenience, let $\beta_v = (\beta_{v0}, \beta_1^T)^T$.


The Expectation Step

The conditional expectation of the complete-data log-likelihood given the observed data,

disease status and the estimates of ϑv in the tth iteration is now given by

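\[
\begin{aligned}
Q(\vartheta_v \mid \vartheta_v^{(t)}) &= E_{\vartheta_v^{(t)}}\left[\, l_c(\vartheta_v; H, X) \mid D, G, X \,\right] \\
&= \sum_{i=0}^{1}\sum_{j=1}^{n_i} E_{\vartheta_v^{(t)}}\left[\, l_{cij}(\vartheta_v; H_{ij}, X_{ij}) \mid G_{ij}, X_{ij}, D_{ij} \,\right],
\end{aligned}
\]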

where $l_{cij}(\vartheta_v; H_{ij}, X_{ij}) = \log \mathrm{pr}_v(H_{ij}, X_{ij} \mid D_{ij}; \vartheta_v)$ is the complete-data log-likelihood for the jth subject within the ith disease group. Let $H_{G_{ij}} = \{H_{ij}^{k};\ k = 1, \ldots, K_{ij}\}$ be the set of haplogenotypes that are consistent with the observed single-locus genotype data Gij. Then

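\[
Q(\vartheta_v \mid \vartheta_v^{(t)}) = \sum_{i=0}^{1}\sum_{j=1}^{n_i}\sum_{k=1}^{K_{ij}} w_{ijk}(\vartheta_v^{(t)}) \log \mathrm{pr}_v(H_{ij}^{k}, X_{ij} \mid D_{ij}; \vartheta_v)
\]

where

\[
w_{ijk}(\vartheta_v^{(t)}) = \mathrm{pr}_v(H_{ij}^{k}, X_{ij} \mid G_{ij}, X_{ij}, D_{ij}; \vartheta_v^{(t)}) = \mathrm{pr}_v(H_{ij}^{k} \mid G_{ij}, X_{ij}, D_{ij}; \vartheta_v^{(t)})
\]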

is the posterior weight of the pseudo-individual with complete covariate vector (Hkij , Xij),

given the observed data (Gij , Xij), disease status Dij and parameter estimates ϑ(t)v . Since

H implies G, one can write the RML weights in terms of the conditional probability of

covariates given disease status as:

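\[
\begin{aligned}
w_{ijk}(\vartheta_v^{(t)}) &= \frac{\mathrm{pr}_v(H_{ij}^{k}, G_{ij}, X_{ij} \mid D_{ij}; \vartheta_v^{(t)})}{\mathrm{pr}_v(G_{ij}, X_{ij} \mid D_{ij}; \vartheta_v^{(t)})} \\
&= \frac{\mathrm{pr}_v(H_{ij}^{k}, X_{ij} \mid D_{ij}; \vartheta_v^{(t)})}{\mathrm{pr}_v(G_{ij}, X_{ij} \mid D_{ij}; \vartheta_v^{(t)})} \\
&= \frac{\mathrm{pr}_v(H_{ij}^{k}, X_{ij} \mid D_{ij}; \vartheta_v^{(t)})}{\sum_{k'=1}^{K_{ij}} \mathrm{pr}_v(H_{ij}^{k'}, X_{ij} \mid D_{ij}; \vartheta_v^{(t)})} \\
&\overset{(2.5)}{=} \frac{\mathrm{pr}_v(D_{ij} \mid H_{ij}^{k}, X_{ij}; \beta_v^{(t)})\,\mathrm{pr}_v(H_{ij}^{k}, X_{ij}; \gamma_v^{(t)})}{\sum_{k'=1}^{K_{ij}} \mathrm{pr}_v(D_{ij} \mid H_{ij}^{k'}, X_{ij}; \beta_v^{(t)})\,\mathrm{pr}_v(H_{ij}^{k'}, X_{ij}; \gamma_v^{(t)})}.
\end{aligned}
\]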


Now write prv(H, X; γv) = prv(H | X; γv) prv(X; γvx), where γvx parameterizes the mar-

ginal distribution of X in the VSS hypothetical population. Replacing the joint distribution

of H and X in the weights by this product yields

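\[
w_{ijk}(\vartheta_v^{(t)}) = \frac{\mathrm{pr}_v(D_{ij} \mid H_{ij}^{k}, X_{ij}; \beta_v^{(t)})\,\mathrm{pr}_v(H_{ij}^{k} \mid X_{ij}; \gamma_v^{(t)})}{\sum_{k'=1}^{K_{ij}} \mathrm{pr}_v(D_{ij} \mid H_{ij}^{k'}, X_{ij}; \beta_v^{(t)})\,\mathrm{pr}_v(H_{ij}^{k'} \mid X_{ij}; \gamma_v^{(t)})}, \tag{2.6}
\]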

in which the marginal probability of X cancels. Calculation of $\mathrm{pr}_v(D_{ij} \mid H_{ij}^{k}, X_{ij}; \beta_v^{(t)})$ is straightforward using the disease risk model and current estimates of βv. However, as is the case with population (prospective) sampling, the conditional distribution of H given X under VSS cannot be estimated from the observed data without further assumptions because certain haplogenotypes are never observed unambiguously.
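As an illustration of equation (2.6), the following minimal Python sketch computes the posterior weights for one subject. It assumes that the conditional probabilities of each compatible haplogenotype given X are supplied by the caller (in practice they require the further assumptions discussed above), and that the disease model is logistic with linear predictor βv0 + z(H, X)β1. The function and argument names are hypothetical and are not taken from hapassoc or any other package.

```python
import math

def expit(u):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-u))

def rml_weights(d, z_rows, hap_probs, beta_v0, beta1):
    """Posterior weights in the form of equation (2.6) for a single subject.

    d         : observed disease status (0 or 1)
    z_rows    : one design row z(H^k, X) per compatible haplogenotype H^k
    hap_probs : assumed values of pr_v(H^k | X), one per compatible haplogenotype
    """
    terms = []
    for z, p_h in zip(z_rows, hap_probs):
        eta = beta_v0 + sum(zc * bc for zc, bc in zip(z, beta1))
        p_case = expit(eta)
        p_d = p_case if d == 1 else 1.0 - p_case  # pr_v(D = d | H^k, X)
        terms.append(p_d * p_h)
    total = sum(terms)  # denominator: sum over compatible haplogenotypes
    return [t / total for t in terms]

# Two compatible haplogenotypes carrying 0 and 1 copies of a risk haplotype.
case_w = rml_weights(1, [[0.0], [1.0]], [0.5, 0.5], 0.0, [math.log(3.0)])
control_w = rml_weights(0, [[0.0], [1.0]], [0.5, 0.5], 0.0, [math.log(3.0)])
```

In this toy example the case is weighted towards the haplogenotype carrying the risk haplotype, and the control is weighted away from it, which is the behaviour the E-step relies on.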

The Maximization Step

The weighted retrospective log-likelihood, with given RML weights wijk(ϑ(t)v ), is

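\[
Q(\vartheta_v \mid \vartheta_v^{(t)}) = \sum_{i=0}^{1}\sum_{j=1}^{n_i}\sum_{k=1}^{K_{ij}} w_{ijk}(\vartheta_v^{(t)}) \log \mathrm{pr}_v(H_{ij}^{k}, X_{ij} \mid D_{ij}; \vartheta_v),
\]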

where ϑv = (β1, γv). By equation (2.5), Q(ϑv | ϑ(t)v ) is, up to a constant term, equal to

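\[
\begin{aligned}
Q(\vartheta_v \mid \vartheta_v^{(t)}) ={}& \sum_{i=0}^{1}\sum_{j=1}^{n_i}\sum_{k=1}^{K_{ij}} w_{ijk}(\vartheta_v^{(t)}) \log \mathrm{pr}_v(D_{ij} \mid H_{ij}^{k}, X_{ij}; \beta_v) \\
&+ \sum_{i=0}^{1}\sum_{j=1}^{n_i}\sum_{k=1}^{K_{ij}} w_{ijk}(\vartheta_v^{(t)}) \log \mathrm{pr}_v(H_{ij}^{k}, X_{ij}; \gamma_v),
\end{aligned} \tag{2.7}
\]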

where βv = (βv0, β1). Maximization of Q would appear to be completely analogous to the

prospective case (see equation 2.4). However, recall that βv0 is a function of both β1 and

γv and so is not a free parameter. By contrast, in the prospective likelihood the intercept

term is a free parameter. Nevertheless, as shown in Appendix C of Ghadessi (2005), Q

may be maximized by treating βv0 as a free parameter. Hence, the M-step for RML is the

same as the M-step for PML.


2.2.3 MLEs by direct solution of score equations

The observed-data score function can be expressed as a weighted sum of complete-data

score functions over complete-data configurations consistent with the observed data, where

the weights are the posterior probabilities of the complete-data configurations given the

observed data (Louis 1982). The M-step of the EM algorithm solves the same weighted

sum as the score equations, but with the weights fixed at values determined by parameter

estimates from the previous iteration of the algorithm. At convergence, however, the weights

used in the final M-step are evaluated at the MLEs. Therefore, at convergence, the estimating

equations for the M-step will have the same solution as the score equations. The weighted

estimating equations in the M-step are obtained by differentiating (2.7) for fixed weights

with respect to βv to obtain

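\[
\sum_{i=0}^{1}\sum_{j=1}^{n_i}\sum_{k=1}^{K_{ij}} w_{ijk}(\vartheta_v^{(t)})\, \frac{\partial}{\partial \beta_v} \log \mathrm{pr}_v(D_{ij} \mid H_{ij}^{k}, X_{ij}; \beta_v).
\]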

At convergence, the M-step involves solving (for βv)

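\[
0 = \sum_{i=0}^{1}\sum_{j=1}^{n_i}\sum_{k=1}^{K_{ij}} w_{ijk}(\hat\vartheta_v)\, \frac{\partial}{\partial \beta_v} \log \mathrm{pr}_v(D_{ij} \mid H_{ij}^{k}, X_{ij}; \beta_v).
\]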

The RML score equations for the regression parameters must therefore be

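\[
0 = \sum_{i=0}^{1}\sum_{j=1}^{n_i}\sum_{k=1}^{K_{ij}} w_{ijk}(\vartheta_v)\, \frac{\partial}{\partial \beta_v} \log \mathrm{pr}_v(D_{ij} \mid H_{ij}^{k}, X_{ij}; \beta_v). \tag{2.8}
\]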

In other words, the weights in the E-step of the EM algorithm are of the same functional

form as the weights in the score equations.

2.3 Approximate Score Methods

As with the prospective likelihood, the retrospective likelihood is a function of parameters

that can not be identified from the observed data, unless assumptions are made about the

distribution of haplogenotypes. Unfortunately, approaches which formulate the likelihood


after making these assumptions appear to lack robustness (Spinka et al. 2005). Another

strategy is to derive the score equations for the regression parameters of primary interest

without making assumptions, and then, where necessary, make approximations that depend

only on identifiable parameters. We show below that PML applied to case-control data,

MPSE and EE are examples of this approximate score approach, which make different

approximations to the weights in the RML score equations (2.8). Our motivation of MPSE

is novel; Spinka et al. present it only as an alternate set of estimating equations to an

assumption-based RML approach they develop.

2.3.1 PML as an Approximate Score Method

PML is derived for prospective data assuming a random sample and assuming HWP and

independence of H and X in the population from which the random sample is drawn. As

argued previously, the case-control sample can be viewed as a random sample from the VSS

hypothetical population. Therefore, applying PML to case-control data has the effect of

approximating prv(H | X ; γv) by prv(H ; γvh), the haplogenotype frequencies in the VSS

hypothetical population that would obtain if HWP held. Here γvh is a vector of haplotype

frequencies in the VSS hypothetical population. The PML weights approximate the correct

RML weights in equation (2.6) by

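\[
w_{ijk}(\vartheta_v) = \frac{\mathrm{pr}_v(D_{ij} \mid H_{ij}^{k}, X_{ij}; \beta_v)\,\mathrm{pr}_v(H_{ij}^{k}; \gamma_{vh})}{\sum_{k'=1}^{K_{ij}} \mathrm{pr}_v(D_{ij} \mid H_{ij}^{k'}, X_{ij}; \beta_v)\,\mathrm{pr}_v(H_{ij}^{k'}; \gamma_{vh})}. \tag{2.9}
\]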

As previously noted, the M-steps of PML and RML are the same.

2.3.2 MPSE as an Approximate Score Method

MPSE was developed for diseases of any frequency in the population, with emphasis on more

common diseases. In general, the estimating equations are parametrized in terms of β0, the

population log-odds of disease in the baseline group. For rare diseases, this parametrization

would appear to be problematic, since β0 can not be estimated from case-control data unless

the population probability of disease, pr(D = 1), is known (Chatterjee and Carroll 2005).

However, as shown below, for rare diseases it turns out that MPSE does not depend on β0.


Spinka et al. (2005) describe MPSE in terms of another approximation to case-control

sampling called variable probability sampling (VPS). VPS is motivated by nested case-

control sampling. A detailed description is given in Appendix B, where we also show that

the joint probability distribution of complete data D, H and X is the same under VPS and

VSS. Therefore, we use prv to denote probability density functions or mass functions under

VPS as well as VSS.

We discuss the MPSE estimating equations for the regression parameters first, and for

population haplotype frequencies second. We also review the form of the variance-covariance

matrix of the resulting parameter estimators.

Estimation of Regression Parameters

The estimating equations for the regression parameters βv = (βv0, β1) are of the same form

as the RML score equation (2.8)

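\[
0 = \sum_{i=0}^{1}\sum_{j=1}^{n_i}\sum_{k=1}^{K_{ij}} \tilde w_{ijk}(\beta_0, \beta_v, \gamma_h)\, \frac{\partial}{\partial \beta_v} \log \mathrm{pr}_v(D_{ij} \mid H_{ij}^{k}, X_{ij}; \beta_v), \tag{2.10}
\]

but with approximate weights

\[
\tilde w_{ijk}(\beta_0, \beta_v, \gamma_h) = \frac{\mathrm{pr}_v(D_{ij} \mid H_{ij}^{k}, X_{ij}; \beta_v)\, r_{\beta_0,\beta_v}(H_{ij}^{k}, X_{ij})\,\mathrm{pr}(H_{ij}^{k}; \gamma_h)}{\sum_{k'=1}^{K_{ij}} \mathrm{pr}_v(D_{ij} \mid H_{ij}^{k'}, X_{ij}; \beta_v)\, r_{\beta_0,\beta_v}(H_{ij}^{k'}, X_{ij})\,\mathrm{pr}(H_{ij}^{k'}; \gamma_h)}, \tag{2.11}
\]

where

\[
r_{\beta_0,\beta_v}(H, X) = \frac{1 + \exp\{\beta_{v0} + z(H, X)\beta_1\}}{1 + \exp\{\beta_0 + z(H, X)\beta_1\}}.
\]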

For the RML weights in the equation (2.6), prv(H | X; γv) is approximated assuming HWP

and independence of H and X in the general population, as described in Appendix B. In-

serting the approximation to prv(H | X; γv) into the RML weights in equation (2.6) gives

the MPSE weights in equation (2.11). The MPSE weights are likely to be a better approxi-

mation than the PML weights since they can be derived under assumptions (independence

of H and X and HWP in the population) which seem more reasonable than those needed

to justify PML applied to case-control data (independence of H and X and HWP in the

pooled case-control sample).


The estimating equations depend on β0 only through the weights, and the weights

$\tilde w_{ijk}(\beta_0, \beta_v, \gamma_h)$ depend on β0 only through $r_{\beta_0,\beta_v}$. For a rare disease with probability of disease that is small for all covariate values likely to be observed, exp{β0 + z(H, X)β1} << 1, so that

\[
r_{\beta_0,\beta_v}(H, X) \approx 1 + \exp\{\beta_{v0} + z(H, X)\beta_1\} \equiv r_{\beta_v}(H, X).
\]

Thus, under a rare disease assumption, an estimate of β0 is not needed to estimate r, and

therefore is not needed to estimate the weights or solve the estimating equations. The

MPSE weights simplify to

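\[
\tilde w_{ijk}(\beta_1, \gamma_h) = \frac{\exp\{D_{ij}\, z(H_{ij}^{k}, X_{ij})\beta_1\}\,\mathrm{pr}(H_{ij}^{k}; \gamma_h)}{\sum_{l=1}^{K_{ij}} \exp\{D_{ij}\, z(H_{ij}^{l}, X_{ij})\beta_1\}\,\mathrm{pr}(H_{ij}^{l}; \gamma_h)}. \tag{2.12}
\]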

In particular, the weights for the pseudo-individuals in the control group (i.e. Dij = 0)

depend on haplotype frequencies only, so that

\[
\tilde w_{0jk}(\beta_1, \gamma_h) \equiv \tilde w_{0jk}(\gamma_h). \tag{2.13}
\]
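The simplification to the rare-disease weights follows because, for a logistic model, the product of pr_v(D | H, X; βv) and 1 + exp{βv0 + z(H, X)β1} equals exp{D(βv0 + z(H, X)β1)}, and the factor exp{D βv0} is common to numerator and denominator. The Python sketch below evaluates weights of the form (2.12) for a double heterozygote at two SNPs, with haplogenotype probabilities computed under HWP. The function names, the haplotype indexing and the numerical values are illustrative assumptions only.

```python
import math

def hwp_prob(pair, gamma):
    """Haplogenotype probability under HWP for an unordered pair of haplotype indices."""
    k, l = pair
    return gamma[k] ** 2 if k == l else 2.0 * gamma[k] * gamma[l]

def mpse_rare_weights(d, pairs, z_rows, gamma, beta1):
    """Rare-disease weights in the form of equation (2.12) for one subject.

    d      : disease status (0 or 1)
    pairs  : compatible haplogenotypes as pairs of haplotype indices
    z_rows : design row z(H, X) for each compatible haplogenotype
    gamma  : population haplotype frequencies
    """
    terms = []
    for pair, z in zip(pairs, z_rows):
        lin = sum(zc * bc for zc, bc in zip(z, beta1))
        terms.append(math.exp(d * lin) * hwp_prob(pair, gamma))
    total = sum(terms)
    return [t / total for t in terms]

gamma = [0.4, 0.1, 0.1, 0.4]   # illustrative frequencies of haplotypes 00, 01, 10, 11
pairs = [(0, 3), (1, 2)]       # the two phasings of a double heterozygote
z_rows = [[1.0], [0.0]]        # copies of haplotype 00 (taken as the risk haplotype)

control_a = mpse_rare_weights(0, pairs, z_rows, gamma, [math.log(2.0)])
control_b = mpse_rare_weights(0, pairs, z_rows, gamma, [5.0])
case_w = mpse_rare_weights(1, pairs, z_rows, gamma, [math.log(2.0)])
```

The two control calls return identical weights regardless of the value supplied for β1, in line with equation (2.13), whereas the case weights shift towards the phasing that carries the risk haplotype.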

Estimation of Haplotype Frequencies

In general, the weights $\tilde w_{ijk}(\beta_0, \beta_v, \gamma_h)$ also depend on the marginal distribution of haplogenotypes in the population. Under the assumption of population HWP, the distribution of haplogenotypes is specified by haplotype frequencies $\gamma_h = (\gamma_{h_1}, \ldots, \gamma_{h_K})$. The estimating equations for the haplotype frequencies are

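\[
\begin{aligned}
0 ={}& \sum_{i=0}^{1}\sum_{j=1}^{n_i}\sum_{k'=1}^{K_{ij}} \tilde w_{ijk'}(\beta_0, \beta_v, \gamma_h)\, N_{h_k}(H_{ij}^{k'}) \\
&- \gamma_{h_k} \sum_{i=0}^{1}\sum_{j=1}^{n_i} \frac{\sum_{l=1}^{K} 2\gamma_{h_l}\, r_{\beta_0,\beta_v}(H = (h_k, h_l), X_{ij})}{\sum_{H' \in \mathcal{H}} \mathrm{pr}(H'; \gamma_h)\, r_{\beta_0,\beta_v}(H', X_{ij})}, \qquad k = 1, \ldots, K,
\end{aligned} \tag{2.14}
\]

where $\mathcal{H}$ is the set of all haplogenotypes. The first term of equation (2.14) is the expected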

count for haplotype hk in the pooled sample. These estimating equations are the score

equations for γh that result from the retrospective likelihood under the assumptions of

HWP and independence of H and X in the population (Spinka et al. 2005). Furthermore,


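assuming a rare disease, we can invoke equations (2.12) and (2.13) to obtain

\[
\begin{aligned}
0 ={}& \sum_{i=0}^{1}\sum_{j=1}^{n_i}\sum_{k'=1}^{K_{ij}} \tilde w_{ijk'}(\beta_1, \gamma_h)\, N_{h_k}(H_{ij}^{k'}) \\
&- \gamma_{h_k} \sum_{i=0}^{1}\sum_{j=1}^{n_i} \frac{\sum_{l=1}^{K} 2\gamma_{h_l}\, r_{\beta_v}(H = (h_k, h_l), X_{ij})}{\sum_{H' \in \mathcal{H}} \mathrm{pr}(H'; \gamma_h)\, r_{\beta_v}(H', X_{ij})} \\
={}& \sum_{j=1}^{n_0}\sum_{k'=1}^{K_{0j}} \tilde w_{0jk'}(\gamma_h)\, N_{h_k}(H_{0j}^{k'}) + \sum_{j=1}^{n_1}\sum_{k'=1}^{K_{1j}} \tilde w_{1jk'}(\beta_1, \gamma_h)\, N_{h_k}(H_{1j}^{k'}) \\
&- \gamma_{h_k} \sum_{i=0}^{1}\sum_{j=1}^{n_i} \frac{\sum_{l=1}^{K} 2\gamma_{h_l}\, r_{\beta_v}(H = (h_k, h_l), X_{ij})}{\sum_{H' \in \mathcal{H}} \mathrm{pr}(H'; \gamma_h)\, r_{\beta_v}(H', X_{ij})}, \qquad k = 1, \ldots, K.
\end{aligned} \tag{2.15}
\]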

Clearly, these estimating equations for the haplotype frequencies depend on the regression

parameters. We shall revisit this issue later in comparing MPSE to EE.

Variance Estimator for MPSE

Since the MPSE estimating equations are not score equations, the estimator of the variance

of the resulting parameter estimators is not simply the inverse of the matrix of derivatives

of the estimating equations. Following standard estimating equations theory, Spinka et al.

(2005) present a “sandwich” variance estimator that correctly accounts for the case-control

sampling. Let U(βv, γh) denote the collection of MPSE estimating equations. For a sample

of n independent individuals, U(βv, γh) is a sum over contributions from each subject. Index

subjects by the subscript j within disease state i. Then $U(\beta_v, \gamma_h) = \sum_{i=0}^{1}\sum_{j=1}^{n_i} U_{ij}(\beta_v, \gamma_h)$. The estimator is the solution $(\hat\beta_v, \hat\gamma_h)$ to U(βv, γh) = 0. Let $\dot U$ denote the matrix of partial derivatives of U with respect to (βv, γh). Then the variance estimator from the MPSE approach is

\[
\dot U(\hat\beta_v, \hat\gamma_h)^{-1}\, \widehat{\mathrm{Var}}[U(\hat\beta_v, \hat\gamma_h)]\, \dot U(\hat\beta_v, \hat\gamma_h)^{-1}, \tag{2.16}
\]

where the middle term is an estimate of the variance of the estimating equations evaluated at $(\hat\beta_v, \hat\gamma_h)$:

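\[
\widehat{\mathrm{Var}}[U(\hat\beta_v, \hat\gamma_h)] = \sum_{i=0}^{1}\sum_{j=1}^{n_i} U_{ij}(\hat\beta_v, \hat\gamma_h)\, U_{ij}(\hat\beta_v, \hat\gamma_h)^T - \sum_{i=0}^{1} n_i\, \bar U_i(\hat\beta_v, \hat\gamma_h)\, \bar U_i(\hat\beta_v, \hat\gamma_h)^T,
\]

with $\bar U_i(\hat\beta_v, \hat\gamma_h) = \frac{1}{n_i}\sum_{j=1}^{n_i} U_{ij}(\hat\beta_v, \hat\gamma_h)$.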

A case-control sample consists of two independent and identically distributed (iid) samples, one of cases and one of controls, and so the pooled sample is not identically distributed. For future reference, we note here that the variance estimator in equation (2.16) for case-control data differs from its counterpart for data arising from an iid sample. The expression for $\widehat{\mathrm{Var}}[U(\hat\beta_v, \hat\gamma_h)]$ involves a weighted sum of mean-score terms $\sum_{i=0}^{1} n_i\, \bar U_i(\hat\beta_v, \hat\gamma_h)\, \bar U_i(\hat\beta_v, \hat\gamma_h)^T$. In contrast, the analogous expression for an iid sample would involve $n\, \bar U(\hat\beta_v, \hat\gamma_h)\, \bar U(\hat\beta_v, \hat\gamma_h)^T$, where $\bar U(\hat\beta_v, \hat\gamma_h)$ is the mean score. One of the regularity conditions assumed in deriving asymptotic distributions for estimators that solve estimating equations is that the estimating equation tends in probability to zero as the sample size tends to infinity (Zhao et al. 2003); that is, $n\, \bar U(\hat\beta_v, \hat\gamma_h) \xrightarrow{P} 0$. The EE approach ignores the mean-score term $\sum_{i=0}^{1} n_i\, \bar U_i(\hat\beta_v, \hat\gamma_h)\, \bar U_i(\hat\beta_v, \hat\gamma_h)^T$ when constructing variance estimators. Ignoring this term would be justified if the sample were iid. However, since case-control samples are not iid, the EE variance estimators have the potential to be asymptotically conservative.
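The effect of dropping the group mean-score term can be seen in a small numerical sketch. The Python code below computes the middle term of the sandwich estimator in two ways: with the group-wise mean-score correction, as in the expression following (2.16), and without it. The per-subject score contributions are made-up numbers chosen so that the pooled scores sum to zero, and the function names are hypothetical.

```python
def outer(u, v):
    """Outer product of two vectors, returned as a list of rows."""
    return [[a * b for b in v] for a in u]

def middle_terms(scores_by_group):
    """Return (corrected, uncorrected) versions of the sandwich middle term.

    scores_by_group : one list per disease group (controls, cases), each holding
                      the per-subject score vectors U_ij at the solution.
    corrected   = sum_ij U_ij U_ij^T - sum_i n_i Ubar_i Ubar_i^T
    uncorrected = sum_ij U_ij U_ij^T   (mean-score term ignored)
    """
    p = len(scores_by_group[0][0])
    raw = [[0.0] * p for _ in range(p)]
    correction = [[0.0] * p for _ in range(p)]
    for scores in scores_by_group:
        n_i = len(scores)
        mean_i = [sum(u[c] for u in scores) / n_i for c in range(p)]
        for u in scores:
            uu = outer(u, u)
            for a in range(p):
                for b in range(p):
                    raw[a][b] += uu[a][b]
        mm = outer(mean_i, mean_i)
        for a in range(p):
            for b in range(p):
                correction[a][b] += n_i * mm[a][b]
    corrected = [[raw[a][b] - correction[a][b] for b in range(p)] for a in range(p)]
    return corrected, raw

# Scalar scores: controls sum to +6 and cases to -6, so the pooled score is zero.
scores = [[[1.0], [2.0], [3.0]], [[-2.0], [-2.0], [-2.0]]]
corrected, uncorrected = middle_terms(scores)
```

Here the group means are far from zero even though the pooled score is zero, so the uncorrected middle term (26) is much larger than the corrected one (2). This is the mechanism by which an iid-style calculation can overstate the variance for case-control data.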

2.3.3 EE as an Approximate Score Method

Under the rare disease assumption that exp{β0 + z(H, X)β1} ≈ 0, it is straightforward

to show that the EE estimating equations for the regression parameters are the same as

those of MPSE. However, we emphasize that even though the estimating equations for the

regression parameters are the same, the estimating equations for the haplotype frequencies

are different. Since the estimators of regression parameters and haplotype frequencies jointly

solve the full set of estimating equations, the regression parameter estimators for the two

approaches will be different. Consequently, the methods are not equivalent, contrary to the

assertions of Spinka et al. (2005).

In deriving their estimating equations for the haplotype frequencies, Zhao et al. (2003)

pointed out that, under the rare disease assumption, the controls in the sample could

be treated as representative of the general population and approximated the population

haplotype frequencies needed in their weights by haplotype frequencies estimated from the


control data only. Their estimating equations for haplotype frequencies can be written as

\[
0 = \gamma_{h_k} - \frac{1}{2 n_0} \sum_{j=1}^{n_0}\sum_{k'=1}^{K_{0j}} \tilde w_{0jk'}(\gamma_h)\, N_{h_k}(H_{0j}^{k'}), \qquad k = 1, \ldots, K, \tag{2.17}
\]

which do not depend on the regression parameters β1, unlike the corresponding estimating

equations (2.15) of Spinka et al. (2005).
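Because equation (2.17) involves the controls only, it can be solved without reference to the regression parameters. One simple way to do so, sketched below in Python, is fixed-point iteration: compute the control weights under HWP from the current frequencies, then reset each frequency to the weighted haplotype count divided by 2n0. This is an illustrative sketch under those assumptions, not the algorithm used by Hplus; the function names and the toy data are hypothetical.

```python
def hwp_prob(pair, gamma):
    """Haplogenotype probability under HWP for an unordered pair of haplotype indices."""
    k, l = pair
    return gamma[k] ** 2 if k == l else 2.0 * gamma[k] * gamma[l]

def control_haplotype_frequencies(controls, n_haps, n_iter=200):
    """Fixed-point iteration for equation (2.17) using control data only.

    controls : one entry per control, each a list of haplotype-index pairs
               compatible with that control's single-locus genotypes
    Assumes every control keeps at least one compatible pair with positive
    probability throughout the iteration.
    """
    gamma = [1.0 / n_haps] * n_haps
    n0 = len(controls)
    for _ in range(n_iter):
        counts = [0.0] * n_haps
        for pairs in controls:
            probs = [hwp_prob(p, gamma) for p in pairs]
            total = sum(probs)
            for (k, l), pr in zip(pairs, probs):
                w = pr / total   # control weight, a function of gamma only
                counts[k] += w   # a homozygote contributes two copies
                counts[l] += w
        gamma = [c / (2.0 * n0) for c in counts]
    return gamma

# One unambiguous 00/00 homozygote and one double heterozygote (00/11 or 01/10).
controls = [[(0, 0)], [(0, 3), (1, 2)]]
gamma_hat = control_haplotype_frequencies(controls, 4)
```

In this toy data set the unambiguous homozygote pulls the ambiguous subject towards the 00/11 phasing, and the iteration settles at frequencies of 0.75 and 0.25 for haplotypes 00 and 11.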

Finally, as already noted, the sandwich variance matrix is computed in slightly different

ways in the two approaches. Zhao et al. (2003) appear to be incorrectly using a form of

variance estimator that would only be appropriate for an iid sample. Spinka et al. (2005)

did not mention this difference between MPSE and EE. The error in the calculation of the

EE variance estimator suggests that the EE standard errors will be biased upward.

2.3.4 A PML/MPSE Hybrid Approach

The MPSE estimating equations for regression parameters are of the same form as those

for PML except that the weights are different. Therefore we can implement the MPSE

estimating equations for regression parameters by modifying a PML approach, such as

the one implemented in the R package hapassoc. At the E-step, the MPSE weights are

calculated according to formula (2.12), assuming a rare disease so that the intercept β0 is not

required. At the M-step, regression parameter estimates are updated by solving the weighted

complete-data estimating equations for PML with given weights. However, the MPSE

weights require estimates of population haplotype frequencies rather than of haplotype

frequencies in the pooled sample, and so we must also implement the MPSE estimating

equations for population haplotype frequencies in equation (2.15). Since we have changed the

estimating equations for the haplotype frequencies from those implemented in the original

hapassoc, the variance calculation from hapassoc for all the parameter estimates will be

incorrect in general. However, the variance estimates for the regression parameters may be

approximately correct, provided that the observed information matrix for the observed-data

likelihood is approximately block-diagonal (i.e. βv and γh are approximately independent).

To see this, partition the information matrix of the observed data $I^o$ into four blocks $I^o_{ij}$; i = 1, 2, j = 1, 2, with $I^o_{11}$ corresponding to the regression parameters and information matrix returned by hapassoc, $I^o_{22}$ corresponding to the haplotype frequencies and $I^o_{12} = I^{oT}_{21}$ corresponding to the cross-terms. To say the information matrix is approximately block diagonal is to say that $I^o_{12} \approx 0$. In general, the variance estimator of the regression parameter estimator is $\left((I^o)^{-1}\right)_{11}$. In the special case of $I^o_{12} \approx 0$, we obtain $\left((I^o)^{-1}\right)_{11} \approx (I^o_{11})^{-1}$. The variance estimator for the regression parameters should then be correct if hapassoc returns a valid estimator of $I^o_{11}$.
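The step from the full inverse to the inverse of the leading block can be made explicit with the standard partitioned-inverse (Schur complement) identity. The display below is an added sketch, assuming that the block for the haplotype frequencies is invertible:

```latex
\[
\left((I^o)^{-1}\right)_{11}
  = \left( I^o_{11} - I^o_{12}\,(I^o_{22})^{-1}\, I^o_{21} \right)^{-1},
\qquad\text{so that}\qquad
\left((I^o)^{-1}\right)_{11} \approx (I^o_{11})^{-1}
\quad\text{when } I^o_{12} \approx 0 .
\]
```

The correction term is quadratic in the cross-term block, so small cross-terms have only a second-order effect on the variance of the regression parameter estimators.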

The observed information for the regression parameters returned by hapassoc should

approximate Io11. From Louis’ equations,

\[
I^o_{11} = E(I^c_{11} \mid D, G, X) - V(U^c_1 \mid D, G, X),
\]

where $I^c_{11}$ is the complete-data information for the regression parameters, $U^c_1$ is the complete-data score vector for the regression parameters and the expectation and variance are conditional on the observed data (D, G, X) for all subjects. The hybrid version of hapassoc uses

the correct complete-data scores and information and uses approximately correct weights

for calculating conditional expectations and variances. Therefore, the observed information

for the regression parameters returned by hapassoc should approximate Io11. However, the

calculation of Io22 in hapassoc is incorrect, as it does not reflect the estimating equations

for the haplotype frequencies that are used in the modified code. Therefore, the variance

estimates for the haplotype frequencies, which are approximately $(I^o_{22})^{-1}$ if the information

is approximately block-diagonal, will be incorrect.

Chapter 3

Simulation Study

There have been previous simulation studies to investigate the performance of PML in

haplotype-disease association analysis of cross-sectional data (Burkett et al. 2004) and

of case-control data (Lake et al. 2003, Ghadessi 2005). Ghadessi (2005) compared PML

to EE in case-control studies of a rare disease (with pr(D = 1) = 0.0009), through two

sets of simulations with underlying disease-risk models having haplotype effects only. The

EE estimators and standard errors were more biased than those from PML in both sets of

simulations. Spinka et al. (2005) conducted simulation studies to compare PML to MPSE in

case-control studies of a relatively common disease (with pr(D=1)=0.107). Their underlying

disease-risk model included a single risk haplotype, an “environmental” factor, and large

statistical interaction between the risk haplotype and environmental factor. They showed

that, when haplotypes and the environmental factor are independent in the population,

MPSE was unbiased for the log odds-ratio parameters in the logistic model while PML was

biased.

Of all previous simulation studies, those of Spinka et al. (2005) and Ghadessi (2005)

are most relevant to the focus of this thesis. However, there are limitations of each of

these two studies. For example, the simulations of Spinka et al., involving a relatively

common disease, are less relevant than if they involved a rare disease, since the case-control

design is primarily for rare diseases. In addition, Spinka et al. considered only one value

of the statistical interaction effect, and this value was very large relative to the sizes of the

main effects of the environmental factor and risk haplotype. By contrast Ghadessi did not


include interactions at all in her disease risk model. Spinka et al. presumably selected a

large interaction effect for their study to illustrate the bias of PML applied to case-control

data. It can be shown that interaction between haplotypes H and non-genetic factors X

creates departures from independence of H and X in the cases (Shin et al. 2006), and hence

departures from independence in the case-control sample. When interaction effects are large,

so is the dependence of H and X in the case-control sample, making the PML weights poor

approximations to the true RML weights. However, for more modest interaction effects

likely to be encountered in practice, we would expect less bias in the risk estimates.

In this project we present simulations to compare the statistical properties of PML,

EE and the hybrid approach in case-control studies of a rare disease, for risk models that

include haplotype effects, non-genetic effects and statistical interactions between the two.

We could not obtain reliable software implementing MPSE and so could not assess the

performance of this approach in our simulation study. The goals of our simulations are to

compare the bias and precision of the estimators of log odds-ratios, the bias of estimators of

the standard errors, and the power for detecting haplotype-environment interactions, for the

three approaches. The PML approach and the EE approach are respectively implemented

in the package hapassoc for the R programming environment and in the software package

Hplus, both available for public use. The hybrid approach, which we denote HYBRID, was

implemented by modifying the hapassoc code. Due to limitations of Hplus, we were only

able to compare the approaches under a multiplicative risk model, in which the odds of

being affected is increased by a multiplicative factor for each copy of the risk haplotype

that replaces a baseline haplotype.

Large samples (e.g. 1000 cases and 1000 controls) are always desired in case-control

studies of rare diseases, particularly when gene-environment interactions are of primary

interest, because logistic regression has low power to detect interactions (Smith and Day

1984). Also, it is reasonable to conduct the simulations using large samples in order to

investigate the asymptotic properties of the three approaches. However, large sample sizes

can not always be achieved, due to cost and time considerations. Therefore, we are also

interested in investigating the performance of PML, EE and HYBRID for data sets of more

realistic size (e.g. 500 cases and 500 controls).


3.1 Design of Simulations

We conducted our simulations in a setting similar to that of Ghadessi (2005), in which hap-

lotypes were comprised of three SNPs and two sets of haplotype frequencies were considered

(Table 3.1). The ability of single-locus genotypes to predict haplotypes, as measured by $R_h^2$

(Stram et al. 2003), is 78.95% for the first set of frequencies and 59.78% for the second set.

We refer to these two levels of haplotype ambiguity as moderate and extreme, respectively.

Haplotype h1, which consisted of allele 0 at all three loci, was chosen to be a risk haplotype

that was positively associated with the disease. It is necessary to choose the frequency of h1

to be smaller than at least one other haplotype, because Hplus automatically chooses the

most frequent haplotype in the sample to be the baseline haplotype in fitting the logistic

regression model. We performed our simulation studies with varying sample sizes of 1000

cases and 1000 controls, and 500 cases and 500 controls. The resulting four simulation

scenarios are summarized in Table 3.2. The data were simulated from the logistic disease

model

\[
\mathrm{logit}\{\mathrm{pr}(D = 1 \mid H, X)\} = \beta_0 + \beta_X X + \beta_H N_{h_1}(H) + \beta_{HX} N_{h_1}(H) X, \tag{3.1}
\]

where βX , βH and βHX were the regression coefficients associated with a main effect of an

“environmental factor” X, described below, a main effect of h1 and an interaction effect,

and where Nh1(H) counts the number of copies of haplotype h1 contained in haplogenotype

H. Throughout our simulation studies, βX and βH were fixed at 0.1 and 0.7, respectively,

but four values of βHX were considered: 0.1, 0.3, 0.5 and 0.7, representing different levels of

interactions. The intercept term β0 was adjusted for different βHX and different haplotype

frequencies (Table 3.3), so that the probability of the disease in the general population

would be around 0.0009. This disease probability is consistent with a two-year incidence

study of type 1 diabetes in Scandinavians.

Table 3.1: Haplotype frequencies used in the simulations

Haplotype     Frequency
              Set 1     Set 2
h1 = 000      .23       .07
h2 = 001      .27       .93/7
h3 = 010      .15       .93/7
h4 = 011      .10       .93/7
h5 = 100      .10       .93/7
h6 = 101      .05       .93/7
h7 = 110      .05       .93/7
h8 = 111      .05       .93/7

Table 3.2: The four simulation scenarios

Scenario      Haplotype frequencies     Sample size (cases/controls)
i             Set 1                     1000/1000
ii            Set 1                     500/500
iii           Set 2                     1000/1000
iv            Set 2                     500/500

Table 3.3: Adjusted intercept β0

βHX           Haplotype frequencies
              Set 1     Set 2
0.1           -7.45     -7.16
0.3           -7.50     -7.18
0.5           -7.65     -7.22
0.7           -7.80     -7.26

In a single simulation replicate, we first generated haplogenotypes for an underlying population, under the assumption of HWP. The population size was 1500000 for the first and the third simulation scenarios and 750000 for the second and the fourth ones. We then generated a continuous environmental covariate X for each subject in the population, independent of the subject’s genetic status, using a normal distribution with mean 0 and variance 1. For each subject, the binary disease status D was simulated according to the

penetrance model in equation (3.1). Once the population was simulated, a subset of cases

and controls of specified size was randomly selected, and the single-locus genotypes and

environmental covariates of the selected subjects were collected and recorded as data. The

data set was then analyzed using PML, EE and HYBRID, respectively. We also wished to

compare the finite-sample bias in the regression parameter and standard error estimators

of PML, EE and HYBRID to the finite-sample bias of maximum likelihood when there

is no missing haplotype phase; i.e. logistic regression. The finite-sample bias of logistic

regression analysis of the phase-known data provides a baseline against which to judge the

bias of methods that analyze the data with phase ambiguity. Hence we also recorded the

haplogenotypes of all sampled subjects and obtained another data set which was analyzed

by logistic regression using the glm function of the R programming environment. We use the

notation GLM to denote maximum likelihood applied to the complete (i.e. phase-known)

data.
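The data-generation steps just described can be sketched in a few lines of Python. The sketch is scaled down (a population of 150000 and 40 subjects per group rather than the sizes used in the study) so that it runs quickly, and it uses the Set 1 frequencies from Table 3.1 with the intercept for βHX = 0.1 from Table 3.3. The function names, the seed and the data layout are illustrative assumptions, not the code used for the simulations.

```python
import math
import random

FREQ_SET1 = [0.23, 0.27, 0.15, 0.10, 0.10, 0.05, 0.05, 0.05]  # Table 3.1, Set 1

def hap_bits(k):
    """Alleles at the three SNPs for haplotype index k (index 0 is h1 = 000)."""
    return [(k >> 2) & 1, (k >> 1) & 1, k & 1]

def simulate_case_control(pop_size, n_cases, n_controls, beta0, beta_x,
                          beta_h, beta_hx, freqs, rng):
    """Simulate a population under model (3.1), then sample cases and controls.

    Raises ValueError (from rng.sample) if the population holds too few cases.
    """
    cases, controls = [], []
    for _ in range(pop_size):
        # HWP: the two haplotypes are independent draws from the frequencies.
        h = rng.choices(range(len(freqs)), weights=freqs, k=2)
        x = rng.gauss(0.0, 1.0)   # environmental covariate, independent of H
        n_h1 = h.count(0)         # copies of the risk haplotype h1
        eta = beta0 + beta_x * x + beta_h * n_h1 + beta_hx * n_h1 * x
        d = 1 if rng.random() < 1.0 / (1.0 + math.exp(-eta)) else 0
        # Observed data: allele counts at each locus (phase is lost).
        genotype = tuple(a + b for a, b in zip(hap_bits(h[0]), hap_bits(h[1])))
        (cases if d else controls).append((genotype, x, tuple(h)))
    prevalence = len(cases) / pop_size
    return rng.sample(cases, n_cases), rng.sample(controls, n_controls), prevalence

rng = random.Random(2006)
cases, controls, prevalence = simulate_case_control(
    150000, 40, 40, -7.45, 0.1, 0.7, 0.1, FREQ_SET1, rng)
```

Each sampled record keeps the true haplogenotype alongside the genotype, mirroring the paired phase-unknown and phase-known data sets described above; an analysis with phase ambiguity would use only the first two fields.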


Let “h0” denote the baseline haplotype in the sample. The model fit to the data is

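\[
\mathrm{logit}\{\mathrm{pr}(D = 1 \mid H, X)\} = \beta_0 + \beta_X X + \sum_{h_j \neq h_0} \beta_{h_j} N_{h_j}(H) + \sum_{h_j \neq h_0} \beta_{h_j X} N_{h_j}(H) X, \tag{3.2}
\]

where $\beta_{h_j}$ and $\beta_{h_j X}$ are regression coefficients associated with main effects of haplotype hj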

and effects of interaction between hj and X, and where Nhj(H) counts the number of copies

of haplotype hj contained in haplogenotype H. In terms of the model used to simulate data,

given in equation (3.1), we have that βH = βh1 and βHX = βh1X ; to simplify notation we

use βH and βHX throughout. The estimates of βX , βH and βHX from GLM, PML, EE and

HYBRID, as well as their associated standard errors, were recorded.

For each value of βHX and each simulation scenario, statistical properties of the approaches were estimated from 10000 simulation replicates. In up to about half of the

simulated data sets, either Hplus estimated h1 to be the most frequent haplotype, or one

or more of the approaches failed to converge while fitting the risk model. Such data sets

were discarded until the desired number of 10000 replicates was obtained.

3.2 Statistical Properties

We computed four commonly used measures to evaluate the performance of GLM, PML,

EE and HYBRID. Let b be the true value of a regression parameter associated with an effect

(e.g. βHX for the interaction effect). Let $\hat b$ be an estimator (e.g. the PML estimator) of b and $\hat b_r$ be its realization in the rth simulation replicate. The first measure is the estimated bias of $\hat b$:

\[
\mathrm{Bias}(\hat b) = \bar{\hat b} - b = \frac{1}{R}\sum_{r=1}^{R} (\hat b_r - b),
\]

in which the summation is over all R replicates. In our current simulation study, R =

10000. The estimated bias is compared to the corresponding simulation error (described in

Appendix A of Ghadessi (2005)), and the estimator is said to be unbiased if the estimated

bias is within simulation error of zero. Let SE be the standard error associated with $\hat b$, and $SE_r$ be the standard error of $\hat b_r$. Let $SD = \sqrt{\frac{1}{R}\sum_{r=1}^{R} (\hat b_r - \bar{\hat b})^2}$ denote the empirical standard deviation of $\hat b$. In our simulation study, SD is considered to be the nominal (true) value of the standard error of $\hat b$. The second measure is the estimated bias of SE:

\[
\mathrm{Bias}(SE) = \frac{1}{R}\sum_{r=1}^{R} (SE_r - SD),
\]

which quantifies the accuracy of the standard error estimator of $\hat b$. The standard error estimator is said to be unbiased if the estimated bias of SE is within the corresponding simulation error of zero. The third measure is the estimated coverage probability, that is, the proportion of replicates in which the confidence interval $(\hat b_r - Z_{\alpha/2} SE_r,\ \hat b_r + Z_{\alpha/2} SE_r)$ at significance level α includes the true value b:

\[
CP = \frac{1}{R}\sum_{r=1}^{R} \delta\{|\hat b_r - b| < Z_{\alpha/2} SE_r\},
\]

where δ is the indicator function. In our simulation study, α = 0.05 and Zα/2 is approx-

imately 2. An acceptable estimated coverage probability of the 95% confidence interval

should be within simulation error of the nominal 95%. The fourth measure is the estimated

power of the approach to detect the presence of the effect, quantified as the probability of

rejecting the null hypothesis b = 0:

\[
P = \frac{1}{R}\sum_{r=1}^{R} \delta\{|\hat b_r| > Z_{\alpha/2} SE_r\}.
\]
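The four measures can be computed together from the replicate-level estimates and standard errors. The Python sketch below follows the formulas above, with the divisor R in the empirical standard deviation; the default critical value 1.96 is an assumption (the text takes it to be approximately 2), and the function name and toy inputs are hypothetical.

```python
import math

def simulation_measures(estimates, std_errors, true_b, z=1.96):
    """Estimated bias, bias of SE, coverage probability and power.

    estimates  : realizations of the estimator, one per replicate
    std_errors : estimated standard errors, one per replicate
    true_b     : true parameter value used to simulate the data
    z          : critical value Z_{alpha/2} (1.96 assumed for alpha = 0.05)
    """
    R = len(estimates)
    mean_b = sum(estimates) / R
    bias = sum(b - true_b for b in estimates) / R
    sd = math.sqrt(sum((b - mean_b) ** 2 for b in estimates) / R)
    bias_se = sum(se - sd for se in std_errors) / R
    coverage = sum(abs(b - true_b) < z * se
                   for b, se in zip(estimates, std_errors)) / R
    power = sum(abs(b) > z * se
                for b, se in zip(estimates, std_errors)) / R
    return {"bias": bias, "sd": sd, "bias_se": bias_se,
            "coverage": coverage, "power": power}

# Four toy replicates with true value 0.5 and a constant standard error of 0.1.
m = simulation_measures([0.1, 0.9, 0.5, 0.5], [0.1, 0.1, 0.1, 0.1], 0.5)
```

In this toy example the standard errors are too small relative to the empirical spread, so the bias of SE is negative and the coverage falls well below the nominal level.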

3.3 Simulation Results

We first present an overview of conclusions from the simulation study, with particular

attention to results that address methodological issues raised in Chapter 2 and to results

that confirm previous studies. More detailed simulation results related to estimation of

βHX are presented next, which support the general conclusions of the simulation study.

Full simulation results for βH , βX and βHX appear in Tables C.1 - C.4 of Appendix C.

3.3.1 Overview of Simulation Conclusions

Overall, HYBRID performed the best of the three approximate score methods, with approx-

imately correct inference and good power to detect the interaction effect in all simulation

configurations.


We next discuss the bias and variance of risk parameter estimators, and the bias in the

standard error estimators.

Our simulations, under more moderate interaction effects than those of Spinka et al.,

show that, when haplotype ambiguity is moderate, bias in all estimators of the regression

parameters, including PML, is comparable to the finite-sample bias of logistic regression

with known haplotypes (GLM). However, when haplotype ambiguity is extreme, the PML

and EE estimators are biased relative to HYBRID. The bias of PML is likely due to its in-

correct approximation of the RML weights in these simulations. EE and HYBRID (MPSE),

on the other hand, use the correct weights because the data are simulated under popula-

tion HWP and independence of H and X. The bias of EE is likely due to the estimating

equations for the haplotype frequencies which differ from those for MPSE.

In contrast, under extreme haplotype ambiguity, the EE regression estimators were less

variable than those of HYBRID (MPSE). Recall that, unlike MPSE, the EE estimating

equations for the haplotype frequencies involve the controls only and do not depend on the

regression parameters. If regression parameter estimators are imprecisely determined, the

EE estimator of haplotype frequencies might be less variable than the MPSE estimator,

even though the MPSE estimator uses data from both cases and controls. Such decreased

variability in estimators of haplotype frequency might then translate into decreased variance

for the regression parameter estimators of EE, relative to those of MPSE.

The most striking simulation results regarding standard errors were those for EE. The

conservative standard errors for EE are almost certainly due to an error in the EE variance

calculation noted previously. Figure 3.9 shows the bias in standard errors, after excluding

those of EE, for the remaining methods. The HYBRID standard errors perform the best of

the three approximate score methods, even though the variance calculation from hapassoc

is incorrect.

3.3.2 Results for Simulation Scenarios i) and ii)

Figure 3.1 summarizes the results for the first simulation scenario, in which the first set of

haplotype frequencies was used. Based on the estimated bias of βHX , we make the following

key observations. First, the bias of estimators from PML, EE and HYBRID was upward


(anti-conservative) in general. Second, the biases increased as βHX increased. Third, EE

and HYBRID performed slightly better than PML, as the biases of the EE and HYBRID

estimators were within simulation error of zero when βHX = 0.1 and 0.3. The boxplots in

Figure 3.3 show that the variability in βHX estimates appeared to be smaller for EE and

HYBRID than for PML.

EE performed poorly in estimating the standard errors of the βHX estimates: its standard errors were biased upward (Figure 3.1). The standard errors from PML and HYBRID were in general biased slightly downward (anti-conservative), by more than simulation error (Figure 3.9). However, these biases were small compared to the bias of the EE standard errors. Figure 3.3 also shows that standard errors from EE are more spread out than those from PML and HYBRID.

The estimated coverage probabilities of the 95% confidence intervals of PML and HY-

BRID were approximately 95% and within simulation errors of the nominal 95%. By con-

trast, the estimated coverage probabilities of EE were larger than 99% in general and

exceeded simulation errors. The inflation of the standard errors from EE resulted in low

power to detect the interaction effects. The estimated power to detect weak interactions was

low for all three approaches but improved for PML and HYBRID as the level of interaction

increased.

Figures 3.2, 3.4 and 3.9 show the simulation results for the second simulation scenario, which used the same haplotype frequency configuration as the first simulation scenario but with a smaller sample size. Similar patterns of bias in the estimates of βHX and in the standard errors from PML, EE and HYBRID were observed. The magnitude of the bias was larger and the variability in the estimates was greater than in the first simulation scenario.

3.3.3 Results for Simulation Scenarios iii) and iv)

The second set of haplotype frequencies was used in the third and fourth simulation sce-

narios. The simulation results for βHX are summarized in Figures 3.5 and 3.6, in the same

formats as those for the first two simulation scenarios.

Both EE and HYBRID showed downward bias in estimating weak interaction effects

and upward bias in estimating moderate and strong interaction effects, while PML showed


upward bias in estimating weak and moderate interactions and downward bias in estimating

strong interactions. The estimated bias was within simulation error for PML when βHX =

0.1 and 0.5, for EE when βHX = 0.1 and 0.3, and for HYBRID when βHX = 0.1, 0.3 and 0.5.

The variability in the estimates appeared to be smaller for EE than for PML and HYBRID,

as shown in Figures 3.7 and 3.8.

The results for the standard error showed that the estimated bias of standard errors was downward for PML and upward for EE and HYBRID in general, with all estimated biases exceeding simulation errors. Figure 3.7 shows inflation and spread of the standard errors from EE similar to those observed in the first two simulation scenarios, and standard errors from PML and HYBRID that are more concentrated than those from EE.

The 95% confidence interval from EE gave estimated coverage probabilities of around 99%, due to the highly conservative standard errors. The coverage probabilities were slightly below 95% for PML and slightly above 95% for HYBRID. As expected, the estimated power to detect interactions was much lower for EE than for PML and HYBRID. The power for all approaches improved as βHX or sample size increased. Recall that the ability of the single-locus genotypes to predict the number of copies of risk haplotype h1 is lower and phase ambiguity is higher in the second set of haplotype frequencies than in the first set. Thus, it is not surprising that the power for PML and HYBRID was much lower than for the GLM using phase-known data, even for high levels of interaction and large sample sizes.


Figure 3.1: Results for βHX from simulation scenario i)

[Figure omitted. Four panels, each plotted against beta_{HX} (0.1, 0.3, 0.5, 0.7): Bias of beta_{HX}, Bias of Standard Error, Coverage Probability, and Power. Legend: GLM, PML, EE, HYBRID, unbiased.]


Figure 3.2: Results for βHX from simulation scenario ii)

[Figure omitted. Four panels, each plotted against beta_{HX} (0.1, 0.3, 0.5, 0.7). Surviving panel titles: Bias of beta_{HX}, Power. Legend: GLM, PML, EE, HYBRID, unbiased.]


Figure 3.3: Boxplots of bias in estimation of βHX (upper plot) and bias in estimation of associated standard error (lower plot) from simulation scenario i)

[Figure omitted. Upper plot: Boxplots of Bias in Estimation of Beta_{HX}. Lower plot: Boxplots of Bias in Estimation of SE. Vertical axis: Bias. Horizontal axis: Beta_{HX} (0.1, 0.3, 0.5, 0.7). One set of boxplots per method: GLM, PML, EE, HYBRID.]


Figure 3.4: Boxplots of bias in estimation of βHX (upper plot) and bias in estimation of associated standard error (lower plot) from simulation scenario ii)

[Figure omitted. Vertical axis: Bias. Horizontal axis: Beta_{HX} (0.1, 0.3, 0.5, 0.7). One set of boxplots per method: GLM, PML, EE, HYBRID.]


Figure 3.5: Results for βHX from simulation scenario iii)

[Figure omitted. Four panels, each plotted against beta_{HX} (0.1, 0.3, 0.5, 0.7). Surviving panel titles: Bias of beta_{HX}, Power. Legend: GLM, PML, EE, HYBRID, unbiased.]


Figure 3.6: Results for βHX from simulation scenario iv)

[Figure omitted. Four panels, each plotted against beta_{HX} (0.1, 0.3, 0.5, 0.7). Surviving panel titles: Bias of beta_{HX}, Power. Legend: GLM, PML, EE, HYBRID, unbiased.]


Figure 3.7: Boxplots of bias in estimation of βHX (upper plot) and bias in estimation of associated standard error (lower plot) from simulation scenario iii)

[Figure omitted. Vertical axis: Bias. Horizontal axis: Beta_{HX} (0.1, 0.3, 0.5, 0.7). One set of boxplots per method: GLM, PML, EE, HYBRID.]


Figure 3.8: Boxplots of bias in estimation of βHX (upper plot) and bias in estimation of associated standard error (lower plot) from simulation scenario iv)

[Figure omitted. Vertical axis: Bias. Horizontal axis: Beta_{HX} (0.1, 0.3, 0.5, 0.7). One set of boxplots per method: GLM, PML, EE, HYBRID.]


Figure 3.9: Estimated bias of standard error for GLM, PML and HYBRID after excluding EE

[Figure omitted. Four panels, one per simulation scenario (Scenario i), Scenario ii), Scenario iii), Scenario iv)), each plotted against beta_{HX} (0.1, 0.3, 0.5, 0.7). Legend: GLM, PML, HYBRID, unbiased.]

Chapter 4

Conclusions and Future Work

The development of genotyping technologies makes the identification of most of the nu-

cleotide variations in the human genome possible and provides abundant data for disease

gene mapping. However, current widely used PCR-based genotyping techniques only al-

low experimenters to observe genotypes at one specific locus at a time. Therefore, when

haplotypes are among the risk factors in association studies of a disease, missing data on

genetic factors could arise due to phase ambiguity. A variety of statistical methods have

been developed within the GLM framework to relate haplotypes and non-genetic covariates

to the disease phenotype based on observed data for single-locus genotypes.

We have considered haplotype risk inference in case-control studies of a rare disease

in the presence of haplotype phase ambiguity and information on non-genetic risk factors.

We reviewed RML and compared three approximate score methods (PML, EE and MPSE)

that use approximate weights in the weighted RML score equations. We also proposed a

hybrid approach, which uses the MPSE parameter estimator and a PML variance estimator.

Our simulations adopted the two haplotype frequency configurations described in Ghadessi

(2005), which yield moderate and extreme levels of haplotype ambiguity, respectively. We

varied the sample size and considered four relatively modest levels of statistical interaction

between haplotype and non-genetic risk factors. Our simulation results were in general

agreement with those of Spinka et al., in that we showed PML is more biased than EE

or HYBRID in estimating the interaction effect. Such bias is likely due to the incorrect

PML weights, derived assuming HWP and independence of haplotypes and non-genetic

factors in the pooled case-control sample. However, under moderate haplotype ambiguity,

the PML bias was comparable to the bias of logistic regression analysis of phase-known

data. A drawback of the EE approach is its conservative variance estimator, which leads to

conservative coverage of confidence intervals and low power to detect statistical interaction.

Overall, the hybrid approach performed best of the three approximate score methods in our

simulations.

There are several areas for future work. First, the hybrid approach showed promise in

our simulations, but its variance estimator lacks justification. Implementing the correct

variance estimator would result in MPSE for rare diseases, which we could compare to HY-

BRID. Second, in simulations so far, haplotypes and the non-genetic factor were simulated

independently. Such independence is assumed in deriving the MPSE and EE weights, and

so these weights were correct for the simulated data. However, theoretical and empirical

results to date do not address the statistical properties of MPSE or EE when haplotypes

and non-genetic factors are dependent. We have begun simulations under such dependence,

but more work is required. Finally, our simulations only considered SNPs at three loci

and two sets of haplotype frequencies, and data were simulated and analyzed under a mul-

tiplicative disease risk model only. There are many other simulation configurations and

disease risk models (e.g. dominant or recessive models) that could be used to compare

these approaches.

Appendix A

Variable Probability Sampling

The sampling scheme given in Spinka et al. (2005) is called variable probability sampling

(VPS), motivated by nested case-control sampling, where a case-control sample is drawn

from a cohort of subjects. As the sampled cohort gets large, the cohort becomes a good approximation to the general population, so that the nested case-control study becomes a good approximation to the population-based case-control study of interest.

Here we give an overview of VPS based on the description of Lawless et al. (1999), and

show that the conditional probability of H given X is the same under VPS and VSS.

A.1 Overview of VPS

Variable probability sampling is done in three stages:

stage 1: Sample a cohort of size nc from a population. Index subjects in the cohort by

j = 1, . . . , nc. Measure disease status D on all nc subjects in the cohort. Let M1

be the number of cases in the cohort and M0 be the number of controls. Then

Mi ∼ binomial(nc,pr(D = i)).

stage 2: Examine all nc of the subjects in the cohort and decide whether they will be

included in the case-control sample, conditional on their disease status. Subject j

with disease status i is included in the case-control sample with probability µi =

(ni/nc)/pr(D = i). Let Rj be an indicator variable with value 1 if subject j is

included in the sample and 0 otherwise. Then prvps(R = 1 | D = i) = µi, where prvps

denotes probability under VPS. In VPS, inclusion status R depends only on D and

not on covariates (H, X), so that prvps(R = 1 | H, X,D) = prvps(R = 1 | D). Hence

R and (H, X) are conditionally independent given D.

stage 3: Measure covariates (H, X) on those in the case-control sample; that is, measure

(Hj , Xj) on subjects with Rj = 1.

Let N1 be the number of cases included in the case-control sample and N0 be the number

of controls. Then Ni | Mi ∼ binomial(Mi, µi) and hence

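\begin{align*}
E_{vps}(N_i) &= E_{vps}\big(E_{vps}(N_i \mid M_i)\big) \\
&= E_{vps}(M_i \mu_i) = \mu_i E_{vps}(M_i) = \frac{n_i/n_c}{\mathrm{pr}(D = i)}\, n_c\, \mathrm{pr}(D = i) = n_i.
\end{align*}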

Under VPS the Ni are random and so is their sum N = N0 + N1. By contrast, under VSS

the size of the case-control sample is fixed at n.
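The three stages above can be checked numerically with a small simulation. The following is a minimal sketch, not the procedure used in this project: the function name `simulate_vps` and its arguments are hypothetical, covariates (H, X) are omitted because stage 2 does not depend on them, and the cohort size and disease probability in any call are illustrative choices.

```python
import numpy as np

def simulate_vps(n_cohort, p_disease, n_cases, n_controls, rng):
    """Return one draw of (N0, N1) under variable probability sampling."""
    # Stage 1: disease status D for every member of the cohort.
    d = rng.random(n_cohort) < p_disease
    # Stage 2: inclusion probabilities mu_i = (n_i / n_c) / pr(D = i).
    mu1 = (n_cases / n_cohort) / p_disease
    mu0 = (n_controls / n_cohort) / (1.0 - p_disease)
    if not (0.0 <= mu0 <= 1.0 and 0.0 <= mu1 <= 1.0):
        raise ValueError("cohort too small for the requested n0 and n1")
    # Inclusion indicator R depends on D only, not on covariates.
    r = rng.random(n_cohort) < np.where(d, mu1, mu0)
    # Stage 3 would measure (H, X) on subjects with R = 1; here we only count them.
    n0 = int(np.sum(r & ~d))
    n1 = int(np.sum(r & d))
    return n0, n1
```

Averaging the returned counts over many calls, for example with `rng = np.random.default_rng(0)`, should give values close to the target sizes n0 and n1, while each individual draw varies around them, which is the contrast with VSS noted above.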

The observed variables on the cohort members can be written as (Dj , Rj , RjHj , RjXj); j =

1, . . . , nc, to reflect the fact that (H, X) are not observed on those with R = 0. For those

of the cohort in the case-control sample, we observe (D, R = 1, RH = H, RX = X) and for

others in the cohort not in the case-control sample (D, R = 0, RH = 0, RX = 0).

Under VPS, the cohort is an iid sample from prvps(D, R,RH,RX). We then focus on D

and (RH, RX) in the subsample for which R = 1, giving an iid sample from prvps(D, RH, RX |R = 1) = prvps(D, H,X | R = 1).

A.2 Equivalence of Probabilities Under VPS and VSS

We first show that the joint distributions prvps(D, H,X | R = 1) and prv(D, H,X) are

equal. Recall that the conditional distribution of H and X given disease status D under

VSS is the same as under the true case-control sampling, which implies that prv(D =

i,H, X) = pr(H, X | D = i) ni/n. We now establish that

prvps(D = i,H, X | R = 1) = pr(H, X | D = i) ni/n, (A.1)

by showing

1. prvps(H, X | D = i, R = 1) = pr(H, X | D = i) and

2. prvps(D = i | R = 1) = ni/n.

Since the right-hand side of equation (A.1) is prv(D = i,H,X), it follows that prvps(D, H,X |R = 1) = prv(D, H,X) as desired.

Showing that prvps(H, X | D = i, R = 1) = pr(H, X | D = i):

First recall that, in VPS, (H, X) and R are conditionally independent given D. Hence

prvps(H, X | D, R = 1) = prvps(H, X | D) and all that remains to show is prvps(H, X |D) = pr(H, X | D). Now prvps(H, X | D = 0) and prvps(H, X | D = 1) describe the

covariate distributions in controls and in cases of the sampled cohort, respectively. Since

this cohort is drawn randomly from the population, prvps(H, X | D = 0) = pr(H, X | D = 0)

and prvps(H, X | D = 1) = pr(H, X | D = 1). In summary, prvps(H, X | D) = pr(H, X | D)

and hence prvps(H, X | D, R = 1) = pr(H, X | D).

Showing that prvps(D = i | R = 1) = ni/n:

We have

\[
\mathrm{pr}_{vps}(D = i, R = 1) = \mathrm{pr}_{vps}(R = 1 \mid D = i)\,\mathrm{pr}_{vps}(D = i) = \frac{n_i/n_c}{\mathrm{pr}(D = i)}\,\mathrm{pr}(D = i) = \frac{n_i}{n_c}.
\]

Hence prvps(R = 1) = n0/nc + n1/nc = n/nc, so that

prvps(D = i | R = 1) = prvps(D = i, R = 1)/prvps(R = 1) = ni/n.

The same reasoning can be used to show that prvps(D = i | R = 1) = ni/n whenever

prvps(R = 1 | D = i) = k ni/pr(D = i) for some constant k. However, only the choice

k = 1/nc leads to Evps(Ni) = ni.
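The remark about the constant k can be verified with exact arithmetic. The sketch below is illustrative only; the function name `vps_inclusion_check` and the numerical inputs in the usage note are hypothetical. It evaluates pr_vps(D = i | R = 1) and E_vps(N_i) when the inclusion probability is k n_i / pr(D = i).

```python
from fractions import Fraction

def vps_inclusion_check(k, n_cohort, n_target, pr_disease):
    """Exact pr_vps(D = i | R = 1) and E_vps(N_i) for inclusion
    probabilities pr_vps(R = 1 | D = i) = k * n_i / pr(D = i)."""
    pr_d = {0: 1 - pr_disease, 1: pr_disease}
    joint = {}
    for i in (0, 1):
        mu_i = k * n_target[i] / pr_d[i]
        if not 0 <= mu_i <= 1:
            raise ValueError("inclusion probability outside [0, 1]")
        # pr_vps(D = i, R = 1) = mu_i * pr(D = i) = k * n_i
        joint[i] = mu_i * pr_d[i]
    pr_r = joint[0] + joint[1]
    conditional = {i: joint[i] / pr_r for i in (0, 1)}
    # Each cohort member lands in the sample with status i with probability joint[i].
    expected = {i: n_cohort * joint[i] for i in (0, 1)}
    return conditional, expected
```

For instance, with a cohort of 10000, targets n0 = 300 and n1 = 200, and pr(D = 1) = 1/20, the choices k = 1/10000 and k = 1/20000 give the same conditional probabilities 3/5 and 2/5, but only the first gives expected counts equal to the targets.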

Appendix B

Derivation of prv(H | X ; γv)

Recall that the conditional joint distribution of H and X given disease status is the same

under the true case-control sampling and under the variant sampling scheme:

pr(H, X | D;ϑ) = prv(H, X | D;ϑv).

Therefore, the joint distribution of H and X in the hypothetical population is given by

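\begin{align*}
\mathrm{pr}_v(H, X; \gamma_v)
&= \sum_{i=0}^{1} \mathrm{pr}_v(H, X \mid D = i; \vartheta_v)\,\mathrm{pr}_v(D = i) \\
&= \sum_{i=0}^{1} \mathrm{pr}(H, X \mid D = i; \vartheta)\,\mathrm{pr}_v(D = i) \\
&= \sum_{i=0}^{1} \frac{\mathrm{pr}(D = i \mid H, X; \beta_0, \beta_1)\,\mathrm{pr}(H, X; \gamma)}{\mathrm{pr}(D = i)}\,\mathrm{pr}_v(D = i) \\
&= \mathrm{pr}(H, X; \gamma) \sum_{i=0}^{1} \frac{\mathrm{pr}(D = i \mid H, X; \beta_0, \beta_1)}{\mathrm{pr}(D = i)}\,\mathrm{pr}_v(D = i) \\
&= \mathrm{pr}(H, X; \gamma) \sum_{i=0}^{1} \left\{ \frac{\exp\{i(\beta_0 + z(H, X)\beta_1)\}}{1 + \exp\{\beta_0 + z(H, X)\beta_1\}}\, \frac{\mathrm{pr}_v(D = i)}{\mathrm{pr}(D = i)} \right\} \\
&= \mathrm{pr}(H, X; \gamma)\, \frac{\mathrm{pr}_v(D = 0)}{\mathrm{pr}(D = 0)} \left\{ \frac{1}{1 + \exp\{\beta_0 + z(H, X)\beta_1\}} + \frac{\exp\{\beta_0 + z(H, X)\beta_1\}}{1 + \exp\{\beta_0 + z(H, X)\beta_1\}}\, \frac{\mathrm{pr}_v(D = 1)}{\mathrm{pr}(D = 1)}\, \frac{\mathrm{pr}(D = 0)}{\mathrm{pr}_v(D = 0)} \right\} \tag{B.1}
\end{align*}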

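The intercept terms $\beta_{v0}$ and $\beta_0$ are related through $\beta_{v0} = \beta_0 + \log\{\mathrm{pr}_v(D = 1)\,\mathrm{pr}(D = 0)/\{\mathrm{pr}(D = 1)\,\mathrm{pr}_v(D = 0)\}\}$ (Spinka et al. 2005). Thus

\begin{align*}
\mathrm{pr}_v(H, X; \gamma_v)
&= \mathrm{pr}(H, X; \gamma)\, \frac{\mathrm{pr}_v(D = 0)}{\mathrm{pr}(D = 0)} \left\{ \frac{1}{1 + \exp\{\beta_0 + z(H, X)\beta_1\}} + \frac{\exp\{\beta_0 + z(H, X)\beta_1\}}{1 + \exp\{\beta_0 + z(H, X)\beta_1\}} \exp(\beta_{v0} - \beta_0) \right\} \\
&= \frac{1 + \exp\{\beta_{v0} + z(H, X)\beta_1\}}{1 + \exp\{\beta_0 + z(H, X)\beta_1\}}\, \frac{\mathrm{pr}_v(D = 0)}{\mathrm{pr}(D = 0)}\, \mathrm{pr}(H, X; \gamma), \tag{B.2}
\end{align*}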

which is dependent on the disease probability and the joint distribution of H and X in the

general population. The marginal distribution of X in the hypothetical population is the

sum of prv(H, X; γv) over all haplogenotypes, hence

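\begin{equation*}
\mathrm{pr}_v(H \mid X; \gamma_v) = \frac{\mathrm{pr}_v(H, X; \gamma_v)}{\mathrm{pr}_v(X; \gamma_{vx})}
= \frac{r_{\beta_0, \beta_{v0}, \beta_1}(H, X)\,\mathrm{pr}(H, X; \gamma)}{\sum_{H' \in \mathcal{H}} r_{\beta_0, \beta_{v0}, \beta_1}(H', X)\,\mathrm{pr}(H', X; \gamma)}, \tag{B.3}
\end{equation*}

where

\[
r_{\beta_0, \beta_{v0}, \beta_1}(H, X) = \frac{1 + \exp\{\beta_{v0} + z(H, X)\beta_1\}}{1 + \exp\{\beta_0 + z(H, X)\beta_1\}}.
\]

Under the assumption of gene-environment independence in the population, we have

\begin{equation*}
\mathrm{pr}_v(H \mid X; \gamma_v) = \frac{r_{\beta_0, \beta_{v0}, \beta_1}(H, X)\,\mathrm{pr}(H; \gamma_h)}{\sum_{H' \in \mathcal{H}} r_{\beta_0, \beta_{v0}, \beta_1}(H', X)\,\mathrm{pr}(H'; \gamma_h)}. \tag{B.4}
\end{equation*}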

Substituting equation (B.4) into the RML weights in equation (2.6) gives the MPSE weights

in equation (2.11).
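As a minimal numerical sketch of equation (B.4), the conditional probabilities can be computed for a fixed value of X over a supplied set of candidate haplogenotypes. The function name `prv_h_given_x`, its arguments, and the layout of the design rows z(H, X) are hypothetical choices made for illustration; the code evaluates (B.4) only and does not reproduce the weights of equations (2.6) or (2.11).

```python
import numpy as np

def prv_h_given_x(beta0, beta0_v, beta1, z_rows, hap_probs):
    """Evaluate equation (B.4) for one value of X.

    beta0, beta0_v : intercepts in the general and hypothetical populations
    beta1          : regression coefficients (length p)
    z_rows         : array of shape (m, p); row j is z(H_j, X) for candidate H_j
    hap_probs      : length-m array of pr(H_j; gamma_h)
    """
    z_rows = np.asarray(z_rows, dtype=float)
    hap_probs = np.asarray(hap_probs, dtype=float)
    eta = z_rows @ np.asarray(beta1, dtype=float)
    # r(H, X) = (1 + exp(beta_v0 + z beta1)) / (1 + exp(beta0 + z beta1))
    r = (1.0 + np.exp(beta0_v + eta)) / (1.0 + np.exp(beta0 + eta))
    weighted = r * hap_probs
    # Normalize over the candidate haplogenotypes H'.
    return weighted / weighted.sum()
```

When `beta0_v` equals `beta0` the factor r is identically 1 and the function returns the population probabilities unchanged. When `beta0_v` exceeds `beta0`, as it does when cases are oversampled for a rare disease, probability shifts toward haplogenotypes with larger linear predictor.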

Appendix C

Simulation Results

The full simulation results for βX , βH and βHX from the four scenarios are summarized in

four tables. The “Bias” item in the Property column indicates the estimated bias of an esti-

mator of a log odds-ratio b from an approach, and the “Bias(SE)” item is the estimated bias

of the associated standard error. “CP” and “Power” are the estimated coverage probability

of the 95% confidence interval and the estimated power for detecting an effect, respectively.

Only the model with main effects and interactions, specified in equation (3.2), was fit to the data, so significance tests of main effects would be carried out in the presence of an interaction. Since such tests are not typically meaningful, we present estimated power for tests of interaction (βHX) only. The simulation

error for each estimate is given in brackets after the estimate. Estimates within simulation

error of the nominal value are indicated with asterisks.

Table C.1: Simulation results for scenario i)

βHX  Risk  Property   GLM                  PML                  EE                   HYBRID
0.1  βX    Bias       0.00033 (0.00251)∗   0.00333 (0.00304)    0.00193 (0.00303)∗   0.00183 (0.00303)∗
           Bias(SE)  -0.00118 (0.00009)   -0.00402 (0.00016)    0.08309 (0.00162)   -0.00532 (0.00015)
           CP         0.94820 (0.00443)∗   0.94850 (0.00442)∗   0.99080 (0.00191)    0.94640 (0.00450)∗
     βH    Bias       0.00790 (0.00179)    0.01066 (0.00225)    0.00938 (0.00224)    0.00925 (0.00223)
           Bias(SE)   0.00077 (0.00003)    0.00212 (0.00007)    0.01030 (0.00033)    0.00236 (0.00007)
           CP         0.95220 (0.00427)∗   0.95820 (0.00400)    0.96500 (0.00368)    0.95750 (0.00403)
     βHX   Bias       0.00141 (0.00183)∗   0.00081 (0.00239)∗  -0.00002 (0.00235)∗   0.00021 (0.00236)∗
           Bias(SE)  -0.00062 (0.00005)   -0.00359 (0.00011)    0.06816 (0.00130)   -0.00324 (0.00011)
           CP         0.95090 (0.00432)∗   0.94630 (0.00451)∗   0.99050 (0.00194)    0.94670 (0.00449)∗
           Power      0.19840 (0.00798)    0.14250 (0.00699)    0.02850 (0.00333)    0.14490 (0.00704)
0.3  βX    Bias      -0.00012 (0.00255)∗   0.00353 (0.00308)    0.00152 (0.00306)∗   0.00132 (0.00306)∗
           Bias(SE)  -0.00061 (0.00010)   -0.00250 (0.00017)    0.11068 (0.00187)   -0.00374 (0.00016)
           CP         0.94870 (0.00441)∗   0.94840 (0.00442)∗   0.99500 (0.00141)    0.94660 (0.00450)∗
     βH    Bias       0.00397 (0.00184)    0.00522 (0.00228)    0.00579 (0.00227)    0.00578 (0.00226)
           Bias(SE)   0.00091 (0.00003)    0.00370 (0.00007)    0.01377 (0.00035)    0.00371 (0.00007)
           CP         0.95480 (0.00415)    0.95770 (0.00403)    0.96840 (0.00350)    0.95810 (0.00401)
     βHX   Bias       0.00329 (0.00188)    0.00412 (0.00243)    0.00144 (0.00239)∗   0.00187 (0.00239)∗
           Bias(SE)  -0.00022 (0.00006)   -0.00204 (0.00012)    0.08979 (0.00150)   -0.00140 (0.00012)
           CP         0.94960 (0.00438)∗   0.94860 (0.00442)∗   0.99390 (0.00156)    0.94950 (0.00438)∗
           Power      0.89750 (0.00607)    0.72730 (0.00891)    0.28670 (0.00904)    0.72820 (0.00890)
0.5  βX    Bias      -0.00441 (0.00264)    0.00017 (0.00321)∗  -0.00123 (0.00315)∗  -0.00153 (0.00314)∗
           Bias(SE)  -0.00076 (0.00010)   -0.00268 (0.00019)    0.15368 (0.00229)   -0.00309 (0.00018)
           CP         0.95110 (0.00431)∗   0.95080 (0.00433)∗   0.99610 (0.00125)    0.94930 (0.00439)∗
     βH    Bias       0.00458 (0.00187)   -0.00035 (0.00236)∗   0.00555 (0.00234)    0.00586 (0.00233)
           Bias(SE)   0.00309 (0.00004)    0.00492 (0.00008)    0.01917 (0.00039)    0.00494 (0.00008)
           CP         0.95810 (0.00401)    0.95850 (0.00399)    0.97540 (0.00310)    0.95880 (0.00398)
     βHX   Bias       0.00805 (0.00198)    0.00924 (0.00257)    0.00487 (0.00250)    0.00546 (0.00249)
           Bias(SE)  -0.00086 (0.00007)   -0.00309 (0.00014)    0.12125 (0.00183)   -0.00114 (0.00015)
           CP         0.94850 (0.00442)∗   0.94560 (0.00454)∗   0.99620 (0.00123)    0.94950 (0.00438)∗
           Power      0.99910 (0.00060)    0.98140 (0.00270)    0.61360 (0.00974)    0.98210 (0.00265)
0.7  βX    Bias      -0.00483 (0.00280)   -0.00398 (0.00342)   -0.00286 (0.00328)∗  -0.00307 (0.00325)∗
           Bias(SE)  -0.00055 (0.00012)   -0.00253 (0.00022)    0.20046 (0.00258)   -0.00058 (0.00019)
           CP         0.95220 (0.00427)∗   0.95230 (0.00426)∗   0.99870 (0.00072)    0.95340 (0.00422)∗
     βH    Bias       0.00352 (0.00201)   -0.01720 (0.00249)    0.00059 (0.00248)∗   0.00122 (0.00248)∗
           Bias(SE)   0.00326 (0.00004)    0.00600 (0.00009)    0.02586 (0.00048)    0.00569 (0.00009)
           CP         0.95790 (0.00402)    0.96020 (0.00391)    0.98050 (0.00277)    0.96470 (0.00369)
     βHX   Bias       0.00985 (0.00213)    0.01377 (0.00275)    0.00853 (0.00261)    0.00897 (0.00260)
           Bias(SE)  -0.00106 (0.00008)   -0.00297 (0.00017)    0.15517 (0.00204)    0.00127 (0.00014)
           CP         0.94740 (0.00446)∗   0.94700 (0.00448)∗   0.99800 (0.00089)    0.95250 (0.00425)∗
           Power      1.00000 (0.00000)    0.99960 (0.00040)    0.76000 (0.00854)    0.99940 (0.00049)

∗ Bias, Bias(SE) or CP is within simulation error of the nominal value


Table C.2: Simulation results for scenario ii)

βHX  Risk  Property   GLM                  PML                  EE                   HYBRID
0.1  βX    Bias       0.00005 (0.00363)∗   0.00039 (0.00445)∗  -0.00009 (0.00437)∗  -0.00042 (0.00438)∗
           Bias(SE)  -0.00396 (0.00019)   -0.01036 (0.00033)    0.14237 (0.00331)   -0.00991 (0.00031)
           CP         0.94730 (0.00447)∗   0.94530 (0.00455)    0.98910 (0.00208)    0.94610 (0.00452)∗
     βH    Bias       0.01997 (0.00248)    0.02868 (0.00307)    0.02660 (0.00304)    0.02632 (0.00304)
           Bias(SE)   0.00485 (0.00006)    0.01041 (0.00014)    0.03773 (0.00098)    0.01082 (0.00014)
           CP         0.95640 (0.00408)    0.96490 (0.00368)    0.97680 (0.00301)    0.96670 (0.00359)
     βHX   Bias       0.00178 (0.00265)∗   0.00399 (0.00346)    0.00186 (0.00336)∗   0.00237 (0.00337)∗
           Bias(SE)  -0.00200 (0.00011)   -0.00690 (0.00023)    0.11630 (0.00263)   -0.00408 (0.00022)
           CP         0.94650 (0.00450)∗   0.94630 (0.00451)∗   0.98960 (0.00203)    0.94740 (0.00446)∗
           Power      0.11960 (0.00649)    0.09430 (0.00584)    0.01860 (0.00270)    0.09160 (0.00577)
0.3  βX    Bias      -0.00031 (0.00368)∗   0.00205 (0.00448)∗   0.00317 (0.00440)∗   0.00276 (0.00440)∗
           Bias(SE)  -0.00279 (0.00020)   -0.00741 (0.00034)    0.17497 (0.00387)   -0.00732 (0.00033)
           CP         0.94980 (0.00437)∗   0.94900 (0.00440)∗   0.99140 (0.00185)    0.95020 (0.00435)∗
     βH    Bias       0.01891 (0.00254)    0.02414 (0.00314)    0.02411 (0.00311)    0.02390 (0.00311)
           Bias(SE)   0.00490 (0.00007)    0.01117 (0.00015)    0.04094 (0.00107)    0.01120 (0.00015)
           CP         0.95610 (0.00410)    0.96410 (0.00372)    0.97720 (0.00299)    0.96460 (0.00370)
     βHX   Bias       0.00651 (0.00274)    0.00949 (0.00357)    0.00286 (0.00346)∗   0.00363 (0.00347)
           Bias(SE)  -0.00247 (0.00012)   -0.00717 (0.00025)    0.13899 (0.00308)   -0.00423 (0.00024)
           CP         0.95100 (0.00432)∗   0.94540 (0.00454)    0.99190 (0.00179)    0.94600 (0.00452)∗
           Power      0.62610 (0.00968)    0.44170 (0.00993)    0.12480 (0.00661)    0.44280 (0.00993)
0.5  βX    Bias      -0.00390 (0.00377)   -0.00202 (0.00462)∗   0.00337 (0.00449)∗   0.00258 (0.00447)∗
           Bias(SE)  -0.00067 (0.00021)   -0.00532 (0.00037)    0.22053 (0.00427)   -0.00396 (0.00038)
           CP         0.95040 (0.00434)∗   0.95110 (0.00431)∗   0.99440 (0.00149)    0.95130 (0.00430)∗
     βH    Bias       0.01519 (0.00270)    0.01215 (0.00337)    0.01779 (0.00332)    0.01791 (0.00332)
           Bias(SE)   0.00308 (0.00007)    0.00716 (0.00017)    0.04270 (0.00119)    0.00757 (0.00016)
           CP         0.95480 (0.00415)    0.96160 (0.00384)    0.97690 (0.00300)    0.96130 (0.00386)
     βHX   Bias       0.01251 (0.00284)    0.01818 (0.00373)    0.00604 (0.00358)    0.00722 (0.00358)
           Bias(SE)  -0.00105 (0.00013)   -0.00686 (0.00028)    0.17203 (0.00340)   -0.00227 (0.00032)
           CP         0.94990 (0.00436)∗   0.94420 (0.00459)    0.99160 (0.00183)    0.94810 (0.00444)∗
           Power      0.95720 (0.00405)    0.83200 (0.00748)    0.33520 (0.00944)    0.82820 (0.00754)
0.7  βX    Bias      -0.01069 (0.00398)   -0.01409 (0.00500)   -0.00232 (0.00480)∗  -0.00324 (0.00476)∗
           Bias(SE)   0.00003 (0.00024)∗  -0.00946 (0.00043)    0.27108 (0.00473)   -0.00640 (0.00040)
           CP         0.95160 (0.00429)∗   0.94650 (0.00450)∗   0.99470 (0.00145)    0.94940 (0.00438)∗
     βH    Bias       0.01052 (0.00285)   -0.00764 (0.00353)    0.01232 (0.00350)    0.01295 (0.00349)
           Bias(SE)   0.00545 (0.00009)    0.00943 (0.00019)    0.05259 (0.00122)    0.00993 (0.00019)
           CP         0.95870 (0.00398)    0.96370 (0.00374)    0.98460 (0.00246)    0.96320 (0.00377)
     βHX   Bias       0.02092 (0.00308)    0.03136 (0.00410)    0.01379 (0.00386)    0.01479 (0.00385)
           Bias(SE)  -0.00301 (0.00016)   -0.01221 (0.00033)    0.20727 (0.00372)   -0.00431 (0.00033)
           CP         0.94930 (0.00439)∗   0.93750 (0.00484)    0.99150 (0.00184)    0.95030 (0.00435)∗



Table C.3: Simulation results for scenario iii)

βHX  Risk  Property   GLM                  PML                  EE                   HYBRID
0.1  βX    Bias       0.00266 (0.00353)∗   0.01217 (0.00497)    0.00532 (0.00474)    0.00673 (0.00497)
           Bias(SE)  -0.00128 (0.00016)   -0.01021 (0.00038)    0.06581 (0.00293)   -0.01373 (0.00036)
           CP         0.94990 (0.00436)∗   0.94720 (0.00447)∗   0.95040 (0.00434)∗   0.94110 (0.00471)
     βH    Bias       0.00252 (0.00283)∗   0.00233 (0.00430)∗   0.10597 (0.00358)    0.00001 (0.00427)∗
           Bias(SE)  -0.00161 (0.00006)    0.00238 (0.00020)   -0.00429 (0.00059)    0.00339 (0.00020)
           CP         0.94710 (0.00448)∗   0.95380 (0.00420)∗   0.89020 (0.00625)    0.95410 (0.00419)∗
     βHX   Bias      -0.00040 (0.00286)∗   0.00130 (0.00463)∗  -0.00276 (0.00386)∗  -0.00221 (0.00442)∗
           Bias(SE)  -0.00203 (0.00011)   -0.01062 (0.00031)    0.10739 (0.00293)   -0.00038 (0.00033)
           CP         0.94860 (0.00442)∗   0.94360 (0.00461)    0.98490 (0.00244)    0.95250 (0.00425)∗
           Power      0.10520 (0.00614)    0.07030 (0.00511)    0.02430 (0.00308)    0.06330 (0.00487)
0.3  βX    Bias       0.00060 (0.00352)∗   0.02051 (0.00486)   -0.00037 (0.00471)∗   0.00743 (0.00487)
           Bias(SE)  -0.00014 (0.00016)∗  -0.00413 (0.00038)    0.10020 (0.00336)   -0.00799 (0.00037)
           CP         0.95180 (0.00428)∗   0.95330 (0.00422)∗   0.96240 (0.00380)    0.94860 (0.00442)∗
     βH    Bias       0.00282 (0.00280)   -0.00543 (0.00425)    0.10520 (0.00359)   -0.00176 (0.00423)∗
           Bias(SE)   0.00122 (0.00006)    0.00633 (0.00020)    0.00868 (0.00072)    0.00734 (0.00020)
           CP         0.95210 (0.00427)∗   0.96010 (0.00391)    0.90670 (0.00582)    0.95990 (0.00392)
     βHX   Bias       0.00524 (0.00285)    0.00922 (0.00448)    0.00302 (0.00389)∗   0.00204 (0.00433)∗
           Bias(SE)   0.00053 (0.00011)   -0.00203 (0.00032)    0.13068 (0.00336)    0.00697 (0.00036)
           CP         0.95340 (0.00422)∗   0.95110 (0.00431)∗   0.98760 (0.00221)    0.95770 (0.00403)
           Power      0.56800 (0.00991)    0.27820 (0.00896)    0.13530 (0.00684)    0.26000 (0.00877)
0.5  βX    Bias       0.00441 (0.00353)    0.04084 (0.00492)   -0.00999 (0.00473)    0.01493 (0.00490)
           Bias(SE)   0.00126 (0.00016)   -0.00419 (0.00039)    0.15706 (0.00393)   -0.00820 (0.00037)
           CP         0.95230 (0.00426)∗   0.95080 (0.00433)∗   0.97540 (0.00310)    0.94570 (0.00453)∗
     βH    Bias      -0.00156 (0.00287)∗  -0.02588 (0.00431)    0.09173 (0.00364)   -0.01256 (0.00434)
           Bias(SE)   0.00068 (0.00007)    0.00514 (0.00020)    0.03227 (0.00093)    0.00553 (0.00020)
           CP         0.95390 (0.00419)∗   0.95650 (0.00408)    0.94700 (0.00448)∗   0.95710 (0.00405)
     βHX   Bias       0.00655 (0.00296)    0.00247 (0.00457)∗   0.01035 (0.00394)    0.00279 (0.00446)∗
           Bias(SE)  -0.00160 (0.00013)   -0.00468 (0.00034)    0.16268 (0.00372)    0.00459 (0.00039)
           CP         0.94890 (0.00440)∗   0.94590 (0.00452)∗   0.99080 (0.00191)    0.95200 (0.00428)∗
           Power      0.93910 (0.00478)    0.62040 (0.00971)    0.35470 (0.00957)    0.61840 (0.00972)
0.7  βX    Bias       0.00478 (0.00371)    0.05732 (0.00512)   -0.03260 (0.00471)    0.01857 (0.00502)
           Bias(SE)  -0.00491 (0.00017)   -0.00848 (0.00041)    0.22289 (0.00435)   -0.01140 (0.00039)
           CP         0.94470 (0.00457)    0.94490 (0.00456)    0.98320 (0.00257)    0.94740 (0.00446)∗
     βH    Bias      -0.00333 (0.00297)   -0.04869 (0.00445)    0.08578 (0.00381)   -0.01693 (0.00451)
           Bias(SE)  -0.00060 (0.00007)   -0.00003 (0.00021)∗   0.05940 (0.00114)    0.00223 (0.00021)
           CP         0.95000 (0.00436)∗   0.94480 (0.00457)    0.96140 (0.00385)    0.95180 (0.00428)∗
     βHX   Bias       0.01060 (0.00313)   -0.00782 (0.00467)    0.02696 (0.00396)    0.00990 (0.00459)
           Bias(SE)  -0.00495 (0.00014)   -0.00620 (0.00038)    0.19813 (0.00419)    0.00514 (0.00044)
           CP         0.94390 (0.00460)    0.94660 (0.00450)∗   0.99210 (0.00177)    0.95500 (0.00415)
           Power      0.99850 (0.00077)    0.87320 (0.00665)    0.55620 (0.00994)    0.87890 (0.00652)

∗ Bias, Bias(SE) or CP is within simulation error of the nominal value


Table C.4: Simulation results for scenario iv)

βHX  Risk  Property  GLM  PML  EE  HYBRID
0.1  βX  Bias  0.00266 (0.00514)∗  0.00954 (0.00737)  -0.00113 (0.00679)∗  0.00605 (0.00714)∗
          Bias(SE)  -0.00588 (0.00033)  -0.02287 (0.00079)  0.13934 (0.00495)  -0.01869 (0.00074)
          CP  0.94800 (0.00444)∗  0.94690 (0.00448)∗  0.96380 (0.00374)  0.94550 (0.00454)∗
     βH  Bias  0.01093 (0.00399)  0.01364 (0.00622)  0.15348 (0.00515)  0.00754 (0.00613)
          Bias(SE)  0.00001 (0.00013)∗  0.00262 (0.00041)  0.01857 (0.00149)  0.00622 (0.00042)
          CP  0.95380 (0.00420)∗  0.95420 (0.00418)  0.91140 (0.00568)  0.95750 (0.00403)
     βHX  Bias  0.00317 (0.00419)∗  0.01122 (0.00702)  0.00408 (0.00566)∗  0.00336 (0.00645)∗
          Bias(SE)  -0.00694 (0.00022)  -0.02952 (0.00065)  0.17586 (0.00484)  -0.00123 (0.00073)
          CP  0.94410 (0.00459)  0.93920 (0.00478)  0.98530 (0.00241)  0.95600 (0.00410)
          Power  0.07780 (0.00536)  0.05830 (0.00469)  0.01590 (0.00250)  0.04340 (0.00408)
0.3  βX  Bias  0.00290 (0.00513)∗  0.01840 (0.00739)  -0.00329 (0.00686)∗  0.01248 (0.00712)
          Bias(SE)  -0.00353 (0.00033)  -0.02180 (0.00081)  0.16555 (0.00531)  -0.01577 (0.00075)
          CP  0.95180 (0.00428)∗  0.94840 (0.00442)∗  0.96330 (0.00376)  0.94760 (0.00446)∗
     βH  Bias  0.00454 (0.00401)  -0.00332 (0.00617)∗  0.14524 (0.00514)  -0.00043 (0.00613)∗
          Bias(SE)  0.00113 (0.00013)  0.00653 (0.00041)  0.03476 (0.00164)  0.00863 (0.00041)
          CP  0.95220 (0.00427)∗  0.95790 (0.00402)  0.92640 (0.00522)  0.95890 (0.00397)
     βHX  Bias  0.01088 (0.00422)  0.02668 (0.00690)  0.01062 (0.00567)  0.00874 (0.00641)
          Bias(SE)  -0.00527 (0.00024)  -0.02263 (0.00066)  0.19426 (0.00529)  0.00444 (0.00080)
          CP  0.94990 (0.00436)∗  0.94230 (0.00466)  0.98630 (0.00232)  0.95660 (0.00408)
          Power  0.32410 (0.00936)  0.16950 (0.00750)  0.06450 (0.00491)  0.13860 (0.00691)
0.5  βX  Bias  0.00587 (0.00524)  0.03295 (0.00759)  -0.01407 (0.00687)  0.02095 (0.00727)
          Bias(SE)  -0.00690 (0.00034)  -0.02757 (0.00082)  0.21657 (0.00603)  -0.02158 (0.00081)
          CP  0.95130 (0.00430)∗  0.94720 (0.00447)∗  0.97270 (0.00326)  0.94630 (0.00451)∗
     βH  Bias  0.00366 (0.00407)∗  -0.02054 (0.00638)  0.13758 (0.00531)  -0.00539 (0.00635)∗
          Bias(SE)  0.00203 (0.00014)  -0.00097 (0.00042)  0.05635 (0.00201)  0.00262 (0.00046)
          CP  0.95350 (0.00421)∗  0.95350 (0.00421)∗  0.94360 (0.00461)  0.95480 (0.00415)
     βHX  Bias  0.01367 (0.00434)  0.03298 (0.00705)  0.02134 (0.00583)  0.01455 (0.00663)
          Bias(SE)  -0.00666 (0.00026)  -0.02696 (0.00070)  0.21995 (0.00595)  0.00136 (0.00131)
          CP  0.94480 (0.00457)  0.93730 (0.00485)  0.98540 (0.00240)  0.95270 (0.00425)∗
          Power  0.69010 (0.00925)  0.37840 (0.00970)  0.18060 (0.00769)  0.33450 (0.00944)
0.7  βX  Bias  0.00719 (0.00534)  0.05515 (0.00780)  -0.03772 (0.00692)  0.03493 (0.00744)
          Bias(SE)  -0.00777 (0.00035)  -0.03033 (0.00087)  0.26294 (0.00604)  -0.02584 (0.00119)
          CP  0.94870 (0.00441)∗  0.94600 (0.00452)∗  0.97570 (0.00308)  0.94330 (0.00463)
     βH  Bias  -0.00436 (0.00419)  -0.05427 (0.00640)  0.11930 (0.00540)  -0.01986 (0.00642)
          Bias(SE)  0.00149 (0.00014)  -0.00045 (0.00042)  0.09126 (0.00232)  0.00497 (0.00045)
          CP  0.95430 (0.00418)  0.95120 (0.00431)∗  0.95870 (0.00398)  0.95570 (0.00412)
     βHX  Bias  0.01775 (0.00450)  0.01673 (0.00723)  0.03438 (0.00597)  0.01418 (0.00698)
          Bias(SE)  -0.00772 (0.00029)  -0.03328 (0.00078)  0.24170 (0.00602)  -0.00733 (0.00182)
          CP  0.94670 (0.00449)  0.93360 (0.00498)  0.98680 (0.00228)  0.94800 (0.00444)∗


Bibliography

Akey J, Jin L, Xiong M (2001). Haplotypes vs single marker linkage disequilibrium tests: what do we gain? European Journal of Human Genetics, 9:291–300.

Burkett K (2002). Logistic Regression with Missing Haplotypes. Master’s thesis, Simon Fraser University, Burnaby, BC.

Burkett K, Graham J, McNeney B (2006). hapassoc: Software for Likelihood Inference of Trait Associations with SNP Haplotypes and Other Attributes. Journal of Statistical Software, 16(2).

Burkett K, McNeney B, Graham J (2004). A note on inference of trait associations with SNP haplotypes and other attributes in generalized linear models. Human Heredity, 57:200–206.

Chatterjee N, Carroll RJ (2005). Semiparametric maximum-likelihood estimation exploiting gene-environment independence in case-control studies. Biometrika, 92:399–418.

Epstein MP, Satten GA (2003). Inference on haplotype effects in case-control studies using unphased genotype data. American Journal of Human Genetics, 73:1316–1329.

Excoffier L, Slatkin M (1995). Maximum-likelihood estimation of molecular haplotype frequencies in a diploid population. Molecular Biology and Evolution, 12:921–927.

Ghadessi M (2005). A comparison of two logistic regression approaches for case-control data with missing haplotypes. Master’s thesis, Simon Fraser University, Burnaby, BC.

Ibrahim JG (1990). Incomplete data in generalized linear models. Journal of the American Statistical Association, 85:765–769.


Lake SL, Lyon H, Tantisira K, Silverman EK, Weiss ST, Laird NM, Schaid DJ (2003). Estimation and tests of haplotype-environment interaction when linkage phase is ambiguous. Human Heredity, 55:56–65.

Lawless JF, Kalbfleisch JD, Wild CJ (1999). Semiparametric methods for response-selective and missing data problems in regression. Journal of the Royal Statistical Society, B, 61:413–438.

Louis TA (1982). Finding the observed information matrix when using the EM algorithm. Journal of the Royal Statistical Society, B, 44:226–233.

Prentice RL, Pyke R (1979). Logistic disease incidence models and case-control studies. Biometrika, 66:403–411.

Schaid DJ, Rowland CM, Tines DE, Jacobson RM, Poland GA (2002). Score tests for association between traits and haplotypes when linkage phase is ambiguous. American Journal of Human Genetics, 70:425–434.

Shin J, McNeney B, Graham J (2006). Likelihood inference in case-control studies of a rare disease under independence of genetic and continuous non-genetic covariates. Submitted.

Smith PG, Day NE (1984). The design of case-control studies: the influence of confounding and interaction effects. International Journal of Epidemiology, 13:356–365.

Spinka C, Carroll RJ, Chatterjee N (2005). Analysis of case-control studies of genetic and environmental factors with missing genetic information and haplotype-phase ambiguity. Genetic Epidemiology, 29:108–127.

Stephens M, Smith NJ, Donnelly P (2001). A new statistical method for haplotype reconstruction from population data. American Journal of Human Genetics, 68:978–989.

Stram DO, Haiman CA, Hirschhorn JN, Altshuler D, Kolonel LN, Henderson BE, Pike MC (2003). Choosing haplotype-tagging SNPS based on unphased genotype data using a preliminary sample of unrelated subjects with an example from the Multiethnic Cohort Study. Human Heredity, 55:27–36.


Zhao LP, Li SS, Khalid N (2003). A method for the assessment of disease associations with single-nucleotide polymorphism haplotypes and environmental variables in case-control studies. American Journal of Human Genetics, 72:1231–1250.
