
Sequential imputation and multipoint linkage analysis

Mar 06, 2023


Sequential Imputation and Linkage Analysis

DISSERTATION

Presented in Partial Fulfillment of the Requirements for

the Degree Doctor of Philosophy in the

Graduate School of The Ohio State University

By

Zachary Skrivanek, B.S., M.S.

* * * * *

The Ohio State University

2002

Dissertation Committee:

Shili Lin, Adviser

Mark Irwin

Steven MacEachern

Approved by

Adviser

Department of Statistics


©

Copyright by

Zachary Skrivanek

2002


ABSTRACT

Multilocus calculations using all available information on all pedigree members are important for linkage analysis. Exact calculation methods in linkage analysis are limited in either the number of loci or the number of pedigree members they can handle. In this thesis, we propose a Monte Carlo method for linkage analysis based on sequential imputation. Unlike exact methods, sequential imputation can handle both a moderate number of loci and a large number of pedigree members. Sequential imputation does not suffer from the slow mixing that Markov chain Monte Carlo methods encounter because of high correlation between samples from pedigree data. This Monte Carlo method is an application of importance sampling in which we sequentially impute ordered genotypes locus by locus and then impute inheritance vectors conditioned on these genotypes. The resulting inheritance vectors together with the importance sampling weights are used to derive a consistent estimator of any linkage statistic of interest. The linkage statistic can be parametric or nonparametric; we focus on nonparametric linkage statistics. We showed that sequential imputation can produce accurate estimates within reasonable computing time. Then we performed a simulation study to illustrate the potential gain in power using our method for multilocus linkage analysis with large pedigrees. We also showed how sequential imputation can be used in haplotype reconstruction, an important step in genetic mapping. In all of the applications of sequential imputation we can incorporate interference, which often is ignored in linkage analysis due to computational problems. We demonstrated the effect of interference on haplotyping and linkage analysis. We have implemented sequential imputation for multilocus linkage analysis in a user-friendly software package called SIMPLE (Sequential Imputation for Multi-Point Linkage Estimation). SIMPLE currently can estimate LOD scores, IBD sharing statistics and haplotype configuration probabilities for both simple and complex pedigrees with or without interference.


This is dedicated to my father, Kenneth Skrivanek, for his unwavering support.


ACKNOWLEDGMENTS

I thank my advisors Mark Irwin and Shili Lin for their enormous dedication and contribution to my research at Ohio State University.

The Collaborative Study on the Genetics of Alcoholism (COGA) (H. Begleiter, SUNY HSCB, Principal Investigator; T. Reich, Washington University, Co-Principal Investigator) includes nine different centers where data collection, analysis, and/or storage takes place. The nine sites and Principal Investigators and Co-Investigators are: Indiana University (T.-K. Li, J. Nurnberger Jr., P.M. Conneally, H. Edenberg); University of Iowa (R. Crowe, S. Kuperman); University of California at San Diego (M. Schuckit); University of Connecticut (V. Hesselbrock); State University of New York, Health Sciences Center at Brooklyn (B. Porjesz, H. Begleiter); Washington University in St. Louis (T. Reich, C.R. Cloninger, J. Rice, A. Goate); Howard University (R. Taylor); Rutgers University (J. Tischfield); and Southwest Foundation (L. Almasy). This national collaborative study is supported by the NIH Grant U10AA08403 from the National Institute on Alcohol Abuse and Alcoholism (NIAAA). GAW11 was supported by NIH grant GM31575.


VITA

March 17, 1970 ..... Born - New York, USA

1992 ............... B.S. Industrial & Labour Relations, Cornell University

1992-1994 .......... Tour Guide, Costa Rica

1997 ............... M.S. Statistics, Ohio State University

1997-1998 .......... Market Analyst, Nationwide

1998-2000 .......... Biostatistician, Abbott Laboratories

1994-present ....... Graduate Teaching Associate, Ohio State University

PUBLICATIONS

Research Publications

Z. Skrivanek, S. Lin, M. Irwin, “Linkage Analysis with Sequential Imputation”. Department of Statistics, Ohio State University, Technical Report No. 689. August, 2002.

FIELDS OF STUDY

Major Field: Statistics

Studies in Linkage Analysis: Prof. Shili Lin and Mark Irwin


TABLE OF CONTENTS

Abstract
Dedication
Acknowledgments
Vita
List of Tables
List of Figures

Chapters:

1. Introduction
   1.1 Genetics Background
       1.1.1 Genotypes and Phenotypes
       1.1.2 Meiosis and Distance
       1.1.3 Pedigree Data & Inheritance Vectors
   1.2 Linkage Analysis
   1.3 NPL Statistics
       1.3.1 Scoring Functions
       1.3.2 The Statistic

2. Sequential Imputation For NPL Analysis
   2.1 The Algorithm
   2.2 The Null Distribution
   2.3 The Software Package
   2.4 Computational Requirements
   2.5 Accuracy of Estimates
   2.6 To Reweight or Not?

3. Power Study
   3.1 Results
   3.2 Type I error
   3.3 Discussion

4. Haplotyping: An Application
   4.1 Methodology
   4.2 Results
   4.3 Discussion

5. Interference Study
   5.1 Study Design for Simulation
   5.2 Methodology
   5.3 Results
       5.3.1 Power
       5.3.2 Precision/Accuracy
   5.4 Discussion

6. Discussion
   6.1 Efficiency
   6.2 Interference
   6.3 Quantitative Trait Statistics

Appendices:

A. SIMPLE documentation

Bibliography

LIST OF TABLES

1.1 Penetrances for the ABO locus.
2.1 Time & memory requirements for 1,000 imputations.
2.2 Gametes of children in last generation.
3.1 Power estimates for a single pedigree.
3.2 Sample size estimates for models I & II.
3.3 Sample size estimates for model III.
3.4 Type I error rates.
4.1 Haplotype configuration probabilities with N=100,000 imputations.
4.2 Haplotype configuration probabilities.
4.3 Haplotypes for 1001.
5.1 Power estimates.
5.2 Difference of LOD scores.
5.3 Maximum LOD score locations.

LIST OF FIGURES

1.1 Crossover.
1.2 Pedigree with a loop.
1.3 Pedigree with genotypes.
2.1 Pedigree used for the validation study.
2.2 Scores from the validation study.
2.3 Scores from the validation study.
2.4 Pedigree used in reweighting study.
2.5 Scores with and without reweighting.
2.6 Scores with reweighting for sample sizes N=1K, 5K, 25K, 125K.
3.1 Pedigree structure for the power study.
3.2 Power curves for thresholds 2.33 and 3.09.
3.3 Power curves for thresholds 3.72 and 4.27.
4.1 Pedigree used in haplotyping study.
5.1 Pedigree used in interference study.
5.2 An example of LOD scores from data with interference.

CHAPTER 1

INTRODUCTION

1.1 Genetics Background

This section will provide the necessary concepts and terminology in genetics to understand this thesis. For a more elaborate discussion of these concepts the reader may refer to Lange (1997).

1.1.1 Genotypes and Phenotypes

Most human cells contain 1 pair of sex chromosomes and 22 homologous pairs of autosomal chromosomes. In each pair, one chromosome was inherited from the mother and the other from the father. A locus (plural loci) is a site on a chromosome. The different polymorphisms (or forms) that a locus can take on are called alleles and their corresponding population frequencies are known as allele frequencies. The pair of alleles at an autosomal locus (one from the paternal chromosome and the other from the maternal chromosome) determines a genotype. If the alleles are the same, the genotype is homozygous; otherwise it is heterozygous. A genotype combined with the parental origin of each allele is called an ordered genotype. A set of alleles on a single chromosome (i.e. they are in phase) is called a haplotype.


Markers are well characterized loci whose location, number of alleles and allele frequencies are known. The markers which we will consider can be genotyped, i.e. their genotypes can be determined through laboratory techniques on blood or tissue samples. They are used in mapping genes in linkage analysis, among other applications. Genes are a type of locus with a function or a gene product. Genes may influence the phenotype, i.e. an observed trait of a person. These traits may be quantitative or qualitative. Quantitative traits include height, blood pressure and bone mineral density. Examples of qualitative traits are affection status of a disease, eye color and hair color. An allele for a disease gene which is associated with affection status is called a mutant allele.

A penetrance for a qualitative trait is the probability of a phenotype given a genotype. For example, a well known qualitative trait is the blood type in humans governed by the ABO locus, which resides on the long arm of chromosome 9 at band q34. This gene determines detectable antigens on the surface of red blood cells. There are 3 alleles at this locus: A, B and O. The order of the alleles within a genotype does not have an effect on the phenotype. So a paternal A allele and maternal B allele have the same effect as a paternal B allele and maternal A allele. There are 6 possible genotypes and 4 possible phenotypes: A, B, AB and O. Table 1.1 details the relationship between the genotypes and the phenotypes for the ABO locus. In this example, the trait, blood type, is fully penetrant (i.e. the penetrances are 0 or 1). Penetrances may not always be 0 or 1, but may lie somewhere between the two extremes. For example, the penetrance for a single copy of the mutant BRCA1 allele at age 70 has been estimated to be approximately 60% (Warner et al., 1999).


                       Phenotypes
Genotypes   Type A   Type B   Type AB   Type O
AA            1        0        0         0
AB            0        0        1         0
AO            1        0        0         0
BB            0        1        0         0
BO            0        1        0         0
OO            0        0        0         1

Table 1.1: Penetrances for the ABO locus.
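
The penetrances in Table 1.1 can be written down as a small lookup, which may help make the definition concrete. This is only an illustrative Python sketch; the names `ABO_PHENOTYPE` and `penetrance` are hypothetical and do not come from the thesis or from SIMPLE.

```python
# Illustrative encoding of Table 1.1: P(phenotype | genotype) for the ABO locus.
# Genotypes are unordered, so each is stored as a sorted two-letter string.
ABO_PHENOTYPE = {
    "AA": "A", "AO": "A",
    "BB": "B", "BO": "B",
    "AB": "AB",
    "OO": "O",
}


def penetrance(phenotype, paternal_allele, maternal_allele):
    """Return P(phenotype | genotype) for the fully penetrant ABO trait."""
    # Sorting removes the parental origin, since allele order has no effect here.
    genotype = "".join(sorted(paternal_allele + maternal_allele))
    return 1.0 if ABO_PHENOTYPE[genotype] == phenotype else 0.0
```

Because the trait is fully penetrant, every call returns 0 or 1; a trait with incomplete penetrance would need a table of probabilities instead of a single phenotype per genotype.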

1.1.2 Meiosis and Distance

Meiosis is the process of forming gametes. During meiosis the chromosomes in each homologous pair self-replicate, each giving rise to two sister chromatids that are connected to each other at the centromere. These homologous chromosomes (consisting of 2 chromatids each) align together to form a bundle of 4 chromatids. The homologs are bound by bands of protein known as the synaptonemal complex (Cummings, 1997). As homologs begin to separate from each other, one or more areas between non-sister chromatids remain in contact at locations known as chiasmata (singular: chiasma). It is believed that at these chiasmata exchange of genetic material between non-sister chromatids occurs through a process known as crossing over. Each chromatid in the pair of sister chromatids is assumed to have a 1/2 chance of participating in each crossover. It is further assumed that the probability of a chromatid participating in one crossover is independent of previous crossovers for that same chromatid, i.e. there is no chromatid interference. This assumption is largely supported by empirical data (Zhao et al., 1995a).


There is, however, considerable evidence that the occurrence of a chiasma suppresses the occurrence of another chiasma nearby (Weeks et al., 1993). This phenomenon is known as (positive) chiasma interference or (positive) crossover interference or simply interference.

Figure 1.1: Simplified depiction of meiosis with crossing over.

Not all crossovers may be observed due to the discrete nature of genetic data. We only observe the phenotypes at loci on the chromosome, not on an actual interval of the chromosome where the crossovers would occur. In fact, for two adjacent loci it is only possible to observe a recombination of genetic material resulting from an odd number of crossovers (the exact number is not known). Consider the simplified depiction of crossover during meiosis in Figure 1.1. The black chromatids are from one parent and the white chromatids are from the other parent. Before separating, the two non-sister chromatids exchange genetic material at the 2 points indicated in the picture (I in Figure 1.1). After the homologous pairs separate (II in Figure 1.1) the sister chromatids eventually divide into 4 separate chromatids (III in Figure 1.1) which are allocated to 4 gametes. If the second or third chromatid is passed on to the offspring we may observe a recombination between the loci indicated by A and B in the picture, since there was an odd number of crossovers between A and B involving these two chromatids. In contrast, if we just had information on A and C there would be no recombination to observe on any of the chromatids, since there was an even number of crossovers between these two loci on all of the chromatids.

The probability of a recombination between two loci is called the recombination fraction. The recombination fraction, $\theta$, is bounded between 0 and $\frac{1}{2}$ if there is no chromatid interference. This can be easily seen with Mather's formula. Let $N_{[a,b]}$ be the random number of chiasmata between two loci $a$ and $b$. Mather's formula is (Lange, 1997, p. 207):

$$\theta = \frac{1}{2} P(N_{[a,b]} > 0).$$

The proof follows from the definition of recombination and the aforementioned assumptions. When two loci are linked (i.e. they are on the same chromosome) the recombination fraction is less than $\frac{1}{2}$. Otherwise they are unlinked and the recombination fraction is $\frac{1}{2}$.

The recombination fraction between two loci reflects the distance between them in the genome. The closer two loci are, the smaller the recombination fraction between them. The recombination fraction is not additive, however. To get this desired property we use the genetic distance.

The genetic distance, $d$, between two loci is defined as the expected number of crossovers between the two loci on a chromatid, i.e. $d = \frac{1}{2} E[N_{[a,b]}]$. This metric has the advantage that it is additive. The genetic distance, $d$, is measured in units called Morgans (or centiMorgans, cM, obtained by multiplying the distance in Morgans by 100).


Map functions have been derived which map a recombination fraction to a genetic distance. These include the Haldane map function which assumes no interference (Haldane, 1919) and map functions derived from count-location models (Karlin and Liberman, 1978).
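
As a worked illustration of a map function, the Haldane case can be derived in a few lines. The sketch below writes $N$ for the number of chiasmata between the two loci and assumes, as is standard for the no-interference setting, that $N$ is Poisson with mean $2d$ for loci $d$ Morgans apart; the numerical values are rounded.

```latex
% Assumption: no interference, so N ~ Poisson(2d) for two loci d Morgans apart.
\begin{align*}
P(N = 0) &= e^{-2d}, \\
\theta &= \tfrac{1}{2}\, P(N > 0) = \tfrac{1}{2}\left(1 - e^{-2d}\right), \\
d &= -\tfrac{1}{2} \ln\left(1 - 2\theta\right).
\end{align*}
% Non-additivity of theta: d = 0.1 gives theta \approx 0.0906,
% while d = 0.2 gives theta \approx 0.1648, which is less than 2 \times 0.0906.
```

The last comment shows why distances rather than recombination fractions are added along a chromosome: doubling the distance does not double the recombination fraction.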

An alternative to map functions is to model the chiasma process along the chromatid bundle directly by a stationary renewal process in the genetic distance metric. For example, we may model the distances between adjacent chiasmata as i.i.d. (scaled) $\chi^2_{2(m+1)}$ random variables. The parameter $m$ can be considered an intensity parameter of interference. When $m = 0$, the point process corresponds to the Poisson process and there is no interference. As $m$ increases so does the level of interference.

The recombination model for this point process was derived by a number of authors (Zhao et al., 1995b). Given a set of ordered loci and their corresponding (genetic) distances one can compute recombination probabilities under the $\chi^2_{2(m+1)}$ model for any intensity parameter $m$ (Zhao et al., 1995b). Zhao et al. (1995b) showed that this model fits a wide variety of recombination data well. Lin and Speed (1996) showed that the $\chi^2$ model with intensity parameter $m = 4$ fits human pedigree data at least as well as, if not better than, competing map functions.
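
The effect of the interference parameter on the recombination fraction can be explored with a small simulation. The Python sketch below is not the method of Zhao et al. (1995b), who compute these probabilities analytically. It assumes the usual parameterization of the chi-square model, in which distances between adjacent chiasmata on the bundle are i.i.d. gamma with shape $m+1$ and rate $2(m+1)$ (mean 0.5 Morgans), approximates stationarity with a burn-in stretch, and then applies Mather's formula. The function names and the burn-in length are illustrative choices.

```python
import math
import random


def haldane_theta(d):
    """Recombination fraction under no interference (Haldane map function)."""
    return 0.5 * (1.0 - math.exp(-2.0 * d))


def has_chiasma(d, m, rng, burn_in=5.0):
    """Simulate the renewal process and report whether a chiasma falls in [0, d].

    Assumption: inter-chiasma distances are gamma with shape m + 1 and
    rate 2 * (m + 1); starting at -burn_in approximates a stationary process.
    """
    shape = m + 1
    scale = 1.0 / (2.0 * (m + 1))  # random.gammavariate takes a scale, not a rate
    position = -burn_in
    while position < 0.0:
        position += rng.gammavariate(shape, scale)
    # position is now the first chiasma at or beyond 0
    return position <= d


def estimate_theta(d, m, n_samples=50_000, seed=1):
    """Monte Carlo estimate of theta = (1/2) P(N > 0) (Mather's formula)."""
    rng = random.Random(seed)
    hits = sum(has_chiasma(d, m, rng) for _ in range(n_samples))
    return 0.5 * hits / n_samples
```

With `m = 0` the estimate should agree with `haldane_theta(d)` up to Monte Carlo error, and with `m = 4` it should be larger for short distances, because positive interference makes multiple chiasmata in a short interval unlikely.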

1.1.3 Pedigree Data & Inheritance Vectors

The data that we analyze in linkage analysis consists of pedigrees and information on the individuals in the pedigrees. A pedigree contains members who are related either through marriage or kinship. Founders are members whose parents are not included in the pedigree and nonfounders are the rest of the members. By convention, nonfounders have both of their parents in the pedigree. A pedigree has a “loop” if there is an individual in the pedigree such that you can trace a path from that individual to connecting members and eventually come back to the same member by a different path. For example, consider the looped pedigree in Figure 1.2. Starting with member 7 you can trace a path from 7 to 4 to 5 to 8 and then back to 7 again. Pedigrees with at least one loop will be referred to as “complex”; those without one will be called “simple”.

Figure 1.2: Pedigree with a loop.

The data on the individuals in a pedigree, $Y$, consist of their disease status, covariates (e.g. age, weight, etc.) and marker data. In linkage analysis, pedigrees are ascertained or included in the study based on a certain criterion such as the number of affecteds. We partition the data into $Y = (Y_M, \Phi_T)$: the marker information, $Y_M$, and the information on the trait of interest, $\Phi_T$.


By convention males are symbolized by squares and females are symbolized by circles.

The pedigree in Figure 1.3 has 7 members genotyped at 3 markers. Members 1, 2 and 4 are

founders, the rest are nonfounders.

Figure 1.3: Pedigree with genotypes.

The inheritance information in a pedigree can be completely described by a set of inheritance vectors. The inheritance vector $v = (p_1, m_1, \ldots, p_n, m_n)$ is a binary representation of the inheritance information at a location in the genome for each of the $n$ nonfounders. The $i$th nonfounder is assigned 2 bits, $p_i$ and $m_i$, corresponding to the genetic information inherited from the father and mother. Each bit is either 1 or 0 depending on whether the allele was inherited from the grandmother or grandfather, respectively. The inheritance distribution, $P(v \mid Y_M)$, is the distribution of the inheritance vectors conditioned on the observed marker data, $Y_M$. The inheritance vector at the first locus for nonfounders 3, 5, 6 and 7 in Figure 1.3 is (?,?, ?,0, ?,1, ?,0). The ‘?’ indicates that the inheritance bit cannot be determined. Only the ancestral origins of the maternal alleles (the maternal inheritance bits) for the children in the third generation (5, 6 and 7) can be determined. The genotypes of the parents and grandparents of an individual are required (but not necessarily sufficient) to determine his/her inheritance bits.

We can further determine that the alleles that persons 5 and 7 received from their mother at the first locus are copies of the same maternal allele. In this case they share the grandpaternal allele from their mother. We say that this allele is shared Identical By Descent (IBD). On the other hand, although persons 5 and 7 both inherited the same type of allele from their father, it is not clear whether this allele is shared IBD, since one could be grandmaternal and the other could be grandpaternal. There was an observed recombination between the first and second loci in member 7’s maternal gamete, indicated by an ‘x’. So the maternal gamete for individual 7 was a combination of the two chromosomes in her mother. The first locus was grandpaternal and the next two loci were grandmaternal. As a result, 5 and 7 do not share any other maternal alleles IBD.

The concepts of recombination and IBD play an important role in linkage analysis.

1.2 Linkage Analysis

Linkage analysis assesses whether a locus of interest is linked to a set of markers. That is, it tests whether the locus is on the same chromosome as a set of markers. The hypotheses being tested are:

$H_0$: disease gene not linked

$H_a$: disease gene linked

The LOD score is a popular parametric statistic in linkage analysis. It is the logarithm base 10 of the likelihood of the disease gene at a specific location linked to the markers (an alternative hypothesis scenario) divided by the likelihood of the disease gene not linked to the markers (the null case). Traditionally, a LOD score above 3 is used as a criterion for linkage.
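
To make the definition of the LOD score concrete, the Python sketch below evaluates it in the simplest textbook setting, which is an assumption made here for illustration and not the multipoint calculation used in this thesis: a single marker, phase-known and fully informative meioses, of which a known number are recombinant. The function name `two_point_lod` is hypothetical.

```python
import math


def two_point_lod(recombinants, meioses, theta):
    """LOD score at recombination fraction theta.

    Simplifying assumption: every meiosis is phase-known and fully
    informative, so the likelihood is theta^k * (1 - theta)^(n - k),
    and the null (no linkage) likelihood is 0.5^n.
    """
    if not 0.0 < theta <= 0.5:
        raise ValueError("theta must lie in (0, 0.5]")
    k, n = recombinants, meioses
    log_linked = k * math.log10(theta) + (n - k) * math.log10(1.0 - theta)
    log_unlinked = n * math.log10(0.5)
    return log_linked - log_unlinked
```

At `theta = 0.5` the two likelihoods coincide and the LOD score is 0; values above the traditional threshold of 3 correspond to a likelihood ratio above 1000 in favor of linkage.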

Linkage analysis extracts inheritance information from pedigree data to evaluate the cosegregation of marker and trait alleles. Thus it is important to utilize available information on multiple markers and all pedigree members. Unfortunately, algorithms for exact analysis are computationally limited in either the number of markers or the number of pedigree members they can handle. Peeling and the Hidden Markov Model (HMM) approaches are two such exact methods that are most frequently used.

Peeling (Elston and Stewart, 1971; Cannings et al., 1978) is a computational algorithm that successively aggregates inheritance information from pedigree members. The algorithm scales linearly with the number of pedigree members, but exponentially with the number of loci. Genotype elimination (Lange and Goradia, 1987; O’Connell and Weeks, 1999) and set-recoding (O’Connell and Weeks, 1995) have been proposed to reduce the computational requirements so that data from more loci can be processed jointly. Despite these improvements, peeling is still limited in the number of loci that it can handle.


The HMM methods model the underlying inheritance pattern as an inhomogeneous Markov chain with each entry of the transition matrix being a function of the recombination fraction between adjacent loci (Lander and Green, 1987). The key to the algorithm is the assumption of no genetic interference. In contrast to peeling, the HMM method scales linearly with the number of loci, but exponentially with the number of pedigree members. Many improvements have been made to reduce computational requirements so that more pedigree members can be analyzed. Properties of the transition matrix (Kruglyak et al., 1995) and symmetries in founder phases (Kruglyak et al., 1996) were exploited to reduce the amount of calculation. Fast Fourier transformations (Kruglyak and Lander, 1998) further speed up calculations. Using observed genotypes to reduce the inheritance space (Markianos et al., 2001a) and to form equivalence classes (Markianos et al., 2001b) allows for potentially more pedigree members. Idury and Elston (1997) describe a ‘divide and conquer’ algorithm which speeds up some of the calculations and allows for sex-specific recombination without any computational penalty. This ‘divide and conquer’ method was incorporated into the software package Merlin (Abecasis et al., 2002), which also uses an approximation method to expand the size of the pedigree it can handle in some cases. Other algorithmic improvements such as efficient tree traversal were made to the HMM algorithm and incorporated into Allegro (Gudbjartsson et al., 2000). However, even with these improvements, the HMM formulation inevitably scales exponentially with the number of pedigree members.

Monte Carlo methods have been proposed to overcome these computational limitations. Two major Monte Carlo approaches to linkage analysis are Markov chain Monte Carlo (MCMC) and sequential imputation. MCMC algorithms can be designed such that they scale linearly in both the number of loci and the number of pedigree members (Thompson, 2000). Thus, MCMC is an extremely powerful estimation method that can deal with practically any number of loci and with pedigrees of arbitrary size and complexity (Luo et al., 2001). However, due to strong dependencies among realizations of the Markov chain, convergence can be slow (Thompson, 2000).

Sequential imputation is another Monte Carlo method that has been successfully applied to a variety of areas (Blake et al., 2001; Bergman, 2001). Irwin et al. (1994) illustrated how to use sequential imputation in linkage analysis to calculate the likelihood (and hence LOD scores), utilizing the peeling algorithm for a single locus, which results in an algorithm that also scales linearly in both the number of loci and the number of pedigree members. For pedigrees that are not very complex (i.e., single-locus peelable), sequential imputation is expected to be more efficient computationally than MCMC methods in most circumstances. However, it should be noted that sequential imputation is not meant to be a replacement for MCMC, as it cannot handle very complex pedigrees, such as the 1544-member Hutterite pedigree successfully dealt with using MCMC methods (Luo et al., 2001).

1.3 NPL Statistics

In this dissertation we will extend sequential imputation to nonparametric linkage analysis. This is an important step forward in making sequential imputation a viable alternative for linkage analysis, as nonparametric linkage analysis is often better suited for analyzing complex traits whose underlying genetic model is unknown or unclear. We will now describe nonparametric linkage (NPL) statistics.

NPL statistics measure IBD sharing among affecteds at a locus and compare the observed sharing to what would be expected if the locus were not linked to a disease locus. NPL statistics make no explicit assumptions about the trait model, hence they are nonparametric. If the sharing significantly exceeds the expected value under the null model then there is evidence of linkage.

The NPL statistics are based on a scoring function which scores the amount of IBD sharing among affecteds. The scoring function is designed to give higher scores under linkage than under no linkage.

1.3.1 Scoring Functions

A scoring function, $S = S(v, y_D)$ for inheritance vector $v$ and observed disease phenotypes $y_D$, measures the amount of IBD sharing among the affecteds. Whittemore and Halpern (1994) presented two scoring functions, $S_{pairs}$ and $S_{all}$, which are popular today in linkage analysis. $S_{pairs}$ assigns 1, 1/2 or 0 to each pair of affecteds that share 2, 1 or 0 alleles IBD, respectively, and then takes the average of the scores from all possible pairs in a group of affecteds to score IBD sharing in the entire pedigree. For example, suppose two sibs are affected in a pedigree and have the following inheritance vector (1,0, 1,0), which implies that they both inherited the grandmaternal allele from their father and the grandpaternal allele from their mother. Therefore they share two alleles IBD and would contribute 1 to the numerator of the score for the pedigree. The scores for all pairs of affecteds are added together and the sum is divided by $\binom{a}{2}$, where $a$ is the number of affecteds in the pedigree. $S_{pairs}$ gives increasing scores as the number of alleles shared IBD between a pair of affecteds increases.
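The pairwise scoring described above can be sketched in a few lines of Python. The function name `s_pairs`, the encoding of each affected as a pair of integer founder-allele labels, and the per-pair values 1, 1/2 and 0 are assumptions made for this illustration; this is not the code used in SIMPLE.

```python
from collections import Counter
from itertools import combinations


def s_pairs(affected_alleles):
    """Average pairwise IBD-sharing score over all pairs of affecteds.

    affected_alleles: list of (paternal_label, maternal_label) tuples,
    one per affected; labels identify founder alleles (hypothetical encoding).
    """
    pairs = list(combinations(affected_alleles, 2))
    total = 0.0
    for first, second in pairs:
        # Multiset intersection counts alleles the pair shares IBD (0, 1 or 2).
        shared = sum((Counter(first) & Counter(second)).values())
        total += shared / 2.0  # assumed per-pair score: 1, 1/2 or 0
    return total / len(pairs)  # divide by the number of pairs, "a choose 2"


# Two affected sibs who received the same allele from each parent.
print(s_pairs([(2, 3), (2, 3)]))
```

With more than two affecteds the same call averages over every pair, so a pedigree in which only some pairs share alleles receives an intermediate score.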

In contrast, $S_{all}$ gives increasing scores as the number of affecteds sharing an allele IBD increases. It is defined as (Kruglyak et al., 1996):

\[ S_{all} = 2^{-a} \sum_{h} \left[ \prod_{i=1}^{2f} b_i(h)! \right], \]

where $h$ is a collection of alleles obtained by choosing one allele from each of these affected individuals, and $b_i(h)$ denotes the number of times that the $i$th founder allele appears in $h$ (for $i = 1, \dots, 2f$), where $f$ is the number of founders. The sum is taken over the $2^a$ possible ways to choose $h$, where $a$ is the number of affecteds.
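A direct enumeration makes the definition of the all-affecteds score concrete. The sketch below assumes the same hypothetical encoding as before (each affected is a pair of founder-allele labels) and simply loops over the $2^a$ ways of picking one allele per affected; it is meant for small examples only and is not the SIMPLE implementation.

```python
from collections import Counter
from itertools import product
from math import factorial, prod


def s_all(affected_alleles):
    """Whittemore-Halpern 'all' score by brute-force enumeration."""
    a = len(affected_alleles)
    total = 0
    # Each h picks one of the two alleles carried by every affected.
    for h in product(*affected_alleles):
        counts = Counter(h)
        # Founder alleles absent from h contribute 0! = 1, so only the
        # labels that actually occur need to enter the product.
        total += prod(factorial(c) for c in counts.values())
    return total / 2 ** a


print(s_all([(1, 4), (1, 4)]))  # two affecteds sharing both alleles IBD
print(s_all([(1, 2), (3, 4)]))  # two affecteds sharing nothing IBD
```

Because the factorial grows quickly, the score rewards configurations in which many affecteds carry the same founder allele, which is the behaviour described in the text.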

The (raw) score, $S$, from a pedigree is then standardized by the mean and standard deviation under no linkage, $\mu_0$ and $\sigma_0$, respectively, to form the standardized score $Z$:

\[ Z = \frac{S - \mu_0}{\sigma_0}. \tag{1.1} \]

1.3.2 The Statistic

Rarely can the inheritance vector $v$ be determined completely given the data in a pedigree, $Y$. Instead, we derive the expected value of the score conditioned on $Y_L$, defined previously as the marker data (Kruglyak et al., 1996):

\[ E[S(v, y_D) \mid Y_L] = \sum_{v} S(v, y_D)\, P(v \mid Y_L). \tag{1.2} \]

We note that, if we add genetic parameters for the disease model to the score function,

the statistic in the form (1.2) becomes a parametric statistic. In fact, the familiar LOD score is included in this class (Kruglyak et al., 1996). For ease of notation we will let $\bar{S} = E[S(v, y_D) \mid Y_L]$. This should not be confused with the sample mean, however, since this mean is derived with respect to the inheritance distribution, $P(v \mid Y_L)$. Following Kruglyak et al. (1996) we standardize the expected raw score by the same mean and standard deviation used in equation (1.1):

\[ \bar{Z} = \frac{\bar{S} - \mu_0}{\sigma_0}. \]

We note that this is the correct null mean since

\[ \mu_0 = E[S] = E[E[S \mid Y_L]], \]

but the null variance is actually conservative since

\[ \sigma_0^2 = \mathrm{Var}[S] = \mathrm{Var}[E[S \mid Y_L]] + E[\mathrm{Var}[S \mid Y_L]] \ge \mathrm{Var}[E[S \mid Y_L]] \]

and strict inequality will always hold unless $Y_L$ determines $v$ (and hence determines $S$), in which case $\bar{S} = E[S(v, y_D) \mid Y_L] = S(v, y_D) = S$.

Using the null variance of $S(v, y_D)$ as a substitute for the null variance of $E[S \mid Y_L]$ was suggested by Kruglyak et al. (1996) as the “perfect data approximation”. The variance of $E[S \mid Y_L]$ is difficult to calculate analytically. It could be estimated via simulation, but we will not pursue that here.


Suppose a data set contains M pedigrees with scores $\bar{S}_1, \dots, \bar{S}_M$ and null means and variances $\mu_{0,1}, \dots, \mu_{0,M}$ and $\sigma^2_{0,1}, \dots, \sigma^2_{0,M}$, respectively. We can standardize the sum of the raw scores by

\[ \frac{\sum_{i=1}^{M} \bar{S}_i - \sum_{i=1}^{M} \mu_{0,i}}{\sqrt{\sum_{i=1}^{M} \sigma^2_{0,i}}}. \]

This statistic has a null mean of 0 and variance $\le 1$ and is asymptotically normal. This standardization was suggested by McPeek (1999) and considered “optimal” (under certain conditions and a criterion of power). We will use this standardization throughout this paper.
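The pooled standardization can be written as a short helper. The function name `combined_z` and the argument layout (three parallel lists, one entry per pedigree) are assumptions for this sketch; the numbers in the usage line are made up.

```python
from math import sqrt


def combined_z(scores, null_means, null_vars):
    """Standardize the sum of per-pedigree scores by the pooled null moments."""
    numerator = sum(scores) - sum(null_means)
    return numerator / sqrt(sum(null_vars))


# Two hypothetical pedigrees: (5 - 2) / sqrt(1 + 3) = 1.5
print(combined_z([2.0, 3.0], [1.0, 1.0], [1.0, 3.0]))
```

Note that the pooling is over variances, so pedigrees with a larger null variance carry less weight per unit of raw score.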


CHAPTER 2

SEQUENTIAL IMPUTATION FOR NPL ANALYSIS

The idea is to estimate the linkage statistic in equation (1.2) via sequential imputation

instead of calculating it exactly. Sequential imputation is an application of importance

sampling. We first impute ordered genotypes sequentially locus by locus via single-locus

peels. We then simulate inheritance vectors conditioned on these multilocus genotypes.

The inheritance vectors along with the importance sampling weights can be used to estimate

any linkage statistic of the form given in equation (1.2).

2.1 The Algorithm

We decompose the marker data, $Y_L$, further into the information we have on the $L$ markers, $Y_L = \{y_1, \dots, y_L\}$. We denote the ordered genotypes at the $L$ markers, $\{g_1, \dots, g_L\}$, as $G$. We peel the first locus and impute the ordered genotypes at this locus (step 1). We sequentially impute the ordered genotypes of the rest of the loci locus by locus conditioned on the previously imputed genotypes and then form the importance sampling weight (steps 2 and 3). Then we simulate the inheritance vector $v$ at a particular location given the simulated ordered genotypes at the $L$ markers (step 4). Finally, we calculate the score using $v$ (step 5).


Step 1. Calculate $P(y_1)$ and simulate $g_1$ from $P(g_1 \mid y_1)$.

Step 2. For $t = 2, \dots, L$ we carry out the following steps:

(a) Calculate $P(y_t \mid y_1, g_1, \dots, y_{t-1}, g_{t-1})$.

(b) Derive $P(g_t \mid y_1, g_1, \dots, y_{t-1}, g_{t-1}, y_t)$.

(c) Simulate $g_t$ from $P(g_t \mid y_1, g_1, \dots, y_{t-1}, g_{t-1}, y_t)$.

Step 3. Form $w(G) = P(y_1) \prod_{t=2}^{L} P(y_t \mid y_1, g_1, \dots, y_{t-1}, g_{t-1})$.

Step 4. Simulate $v$ at a location of interest according to $P(v \mid G)$, where $G$ are the ordered genotypes simulated in steps 1-3. Note that $P(v \mid G) = P(v \mid G, Y_L)$.

Step 5. Calculate the score $S(v, y_D)$.

Steps 1 to 5 are carried out $N$ times to form $w(G_1), \dots, w(G_N)$ and $S(v_1, y_D), \dots, S(v_N, y_D)$. The probability calculations and the simulations in steps 1 through 2 are done by means of single locus peeling and sampling using reverse peeling (Ploughman and Boehnke, 1989; Ott, 1989).
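To show how steps 1 to 3 build the importance sampling weight, the following sketch applies sequential imputation to a deliberately simple stand-in for a pedigree: a single hidden binary "genotype" per locus that is Markov along the chromosome. The model, its probabilities and all names (`INIT`, `TRANS`, `EMIT`, `impute_once`, `exact_likelihood`) are hypothetical; in particular, no peeling is performed here, since each conditional distribution can be written down directly in this toy setting.

```python
import random

# Hypothetical toy model standing in for a pedigree (all numbers made up).
INIT = [0.5, 0.5]                  # P(g_1)
TRANS = [[0.9, 0.1], [0.1, 0.9]]   # P(g_t | g_{t-1})
EMIT = [[0.8, 0.2], [0.3, 0.7]]    # P(y_t | g_t)


def impute_once(y, rng):
    """Steps 1-3 for one imputation: return (imputed genotypes, weight)."""
    # Step 1: P(y_1), then draw g_1 from P(g_1 | y_1).
    joint = [INIT[k] * EMIT[k][y[0]] for k in (0, 1)]
    weight = sum(joint)
    g = rng.choices((0, 1), weights=joint)
    # Step 2: locus by locus, conditioning on what has been imputed so far.
    for obs in y[1:]:
        joint = [TRANS[g[-1]][k] * EMIT[k][obs] for k in (0, 1)]
        weight *= sum(joint)       # Step 3: multiply in P(y_t | past)
        g += rng.choices((0, 1), weights=joint)
    return g, weight


def exact_likelihood(y):
    """Forward recursion for P(y_1, ..., y_L), used only as a check."""
    alpha = [INIT[k] * EMIT[k][y[0]] for k in (0, 1)]
    for obs in y[1:]:
        alpha = [sum(alpha[j] * TRANS[j][k] for j in (0, 1)) * EMIT[k][obs]
                 for k in (0, 1)]
    return sum(alpha)


rng = random.Random(1)
y = [0, 1, 1, 0]
weights = [impute_once(y, rng)[1] for _ in range(20000)]
estimate = sum(weights) / len(weights)   # the average weight estimates P(y)
print(estimate, exact_likelihood(y))
```

In this toy model the average weight can be compared with the exact likelihood from the forward recursion, which is the same unbiasedness property that the pedigree algorithm relies on.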

Irwin et al. (1994) show that the sampling distribution of the ordered genotypes, $P^*(G \mid Y_L)$, satisfies:

\[ P^*(G \mid Y_L) = \frac{P(G \mid Y_L)\, P(Y_L)}{w(G)}. \tag{2.1} \]


From this equality it follows:

\begin{align}
E_{v,G}[S(v, y_D)\, w(G) \mid Y_L]
 &= \sum_{v} \sum_{G} S(v, y_D)\, w(G)\, P(v \mid G, Y_L)\, P^*(G \mid Y_L) \notag \\
 &= \sum_{v} S(v, y_D) \sum_{G} P(v \mid G, Y_L)\, \frac{P(G \mid Y_L)\, P(Y_L)}{w(G)}\, w(G) \notag \\
 &= \sum_{v} S(v, y_D) \sum_{G} P(v, G \mid Y_L)\, P(Y_L) \notag \\
 &= \sum_{v} S(v, y_D)\, P(v \mid Y_L)\, P(Y_L) \notag \\
 &= P(Y_L)\, E_{v}[S(v, y_D) \mid Y_L]. \notag
\end{align}

This result and the fact that the average of the weights is an unbiased estimator of $P(Y_L)$ (Irwin et al., 1994) give us a consistent estimator for the linkage statistic in (1.2):

\[ \hat{E}_{v}[S(v, y_D)] = \sum_{i=1}^{N} S(v_i, y_D)\, \frac{w(G_i)}{w(\cdot)}, \tag{2.2} \]

where $w(\cdot) = \sum_{i=1}^{N} w(G_i)$. So the estimate is a weighted average of the scores.
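The weighted average in the estimator above is straightforward to compute once the scores and weights from the imputations are available. The helper name and the example numbers below are hypothetical.

```python
def weighted_score_estimate(scores, weights):
    """Weighted average of scores, with weights normalized by their sum."""
    total_weight = sum(weights)
    return sum(s * w for s, w in zip(scores, weights)) / total_weight


# Hypothetical scores and importance weights from two imputations.
print(weighted_score_estimate([1.0, 3.0], [1.0, 3.0]))
```

Because only the ratio of each weight to the total enters, the weights can be stored on any common scale, which is convenient when the raw likelihood contributions are very small.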

The only disease data that we use to calculate the nonparametric IBD scores (step 5 in

the algorithm) is the affectation status. To calculate the score, S 4'C{9<;A=@? , we first assign each

of the founders two unique labels, known as IBD states. We pass the IBD states down the

pedigree using the simulated inheritance vector. We then measure the amount of IBD states

in common amongst the affecteds via the IBD statistics.
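The labelling and passing down of IBD states can be sketched as follows. The data layout is an assumption for this illustration: nonfounders are listed parents-first as `(child, father, mother)` triples, the inheritance vector holds a paternal bit and then a maternal bit per nonfounder, and a bit of 0 or 1 selects the parent's first (grandpaternal) or second (grandmaternal) allele, matching the (1,0, 1,0) sib-pair example given earlier.

```python
def drop_ibd_states(founders, nonfounders, inheritance_vector):
    """Pass unique founder labels down a pedigree along an inheritance vector."""
    alleles = {}
    for k, name in enumerate(founders):
        alleles[name] = (2 * k + 1, 2 * k + 2)  # two unique labels per founder
    bits = iter(inheritance_vector)
    for child, father, mother in nonfounders:   # parents must come first
        paternal_bit, maternal_bit = next(bits), next(bits)
        alleles[child] = (alleles[father][paternal_bit],
                          alleles[mother][maternal_bit])
    return alleles


states = drop_ibd_states(
    ["father", "mother"],
    [("sib1", "father", "mother"), ("sib2", "father", "mother")],
    (1, 0, 1, 0),
)
print(states["sib1"], states["sib2"])
```

The resulting label pairs for the affecteds are exactly the input that a sharing statistic needs, so the score calculation never has to look at the marker genotypes again.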

2.2 The Null Distribution

The IBD statistic measures the amount of IBD sharing. If the amount of sharing among the affecteds is significantly more than what would be expected under random segregation and independent assortment, then there is evidence of linkage. Therefore it is necessary to measure the mean and variance of the scores under random segregation and independent assortment, the null case. To estimate the null mean and variance we simply pass the IBD states through the pedigree with 50% probability of a particular state being passed on to an offspring and calculate the score. We repeat this process many times to get a sample of the scores from the null distribution. The mean and variance of this sample give unbiased estimates of the null mean and variance. We then standardize the estimated score by the null mean and null standard deviation to form the standardized statistic $\hat{\bar{Z}}$. Furthermore, the simulated scores under the null distribution are used to estimate the exact p-value. We note that this leads to conservative estimates of the standardized statistic and p-value, as pointed out by Kruglyak et al. (1996).
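The null simulation can be illustrated on an affected sib pair, where the answer is known: the sibs share 2, 1 or 0 alleles IBD with probabilities 1/4, 1/2 and 1/4. The sketch below assumes the pairwise score takes the values 1, 1/2 and 0, and uses hypothetical names and a hypothetical parents-first pedigree layout; it is not the SIMPLE code.

```python
import random
from collections import Counter
from itertools import combinations
from statistics import mean, variance


def drop_random(founders, nonfounders, rng):
    """Gene-drop unique founder labels with 50% transmission probabilities."""
    alleles = {f: (2 * k + 1, 2 * k + 2) for k, f in enumerate(founders)}
    for child, father, mother in nonfounders:   # parents listed first
        alleles[child] = (rng.choice(alleles[father]),
                          rng.choice(alleles[mother]))
    return alleles


def s_pairs(genotypes):
    """Pairwise sharing score (assumed values 1, 1/2, 0 per pair)."""
    pairs = list(combinations(genotypes, 2))
    shared = (sum((Counter(a) & Counter(b)).values()) / 2.0 for a, b in pairs)
    return sum(shared) / len(pairs)


def null_scores(founders, nonfounders, affecteds, n_rep, seed=0):
    """Sample of scores under random segregation (the null case)."""
    rng = random.Random(seed)
    out = []
    for _ in range(n_rep):
        alleles = drop_random(founders, nonfounders, rng)
        out.append(s_pairs([alleles[name] for name in affecteds]))
    return out


sample = null_scores(
    ["father", "mother"],
    [("sib1", "father", "mother"), ("sib2", "father", "mother")],
    ["sib1", "sib2"],
    n_rep=20000,
)
print(mean(sample), variance(sample))  # theory for a sib pair: 0.5 and 0.125
```

An empirical p-value for an observed score can then be taken as the proportion of entries in `sample` that are at least as large as the observed value.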

2.3 The Software Package

We have implemented sequential imputation for linkage analysis in a software package

called SIMPLE (Sequential Imputation for MultiPoint Linkage Estimation). The nonparametric IBD statistics currently available in SIMPLE include the score functions $S_{all}$ and $S_{pairs}$, among others. Furthermore, SIMPLE can calculate LOD scores. SIMPLE

takes input files with the same format as those used in GENEHUNTER, enabling the user

to easily switch to SIMPLE if the pedigree is too large to be handled by GENEHUNTER

in its entirety. The software is freely available from Ohio State University’s Statistical Genetics web site. The URL and documentation for the software are provided in appendix A.


2.4 Computational Requirements

Producing the weights and ordered genotypes (steps 1-3) takes the majority of the com-

puting time. To complete a single imputation we need to do a single locus peel for each

marker and then do reverse peeling (Ploughman and Boehnke, 1989; Ott, 1989) to simulate

the ordered genotypes. So the complexity and memory requirements are the same as those

required to do F single locus peels. The key difference in computational cost between this

algorithm and a standard peeling algorithm for linkage analysis such as that implemented

in LINKAGE (Lathrop et al., 1984) is that we are only doing a single locus peel at a time,

so the calculations are linear in the number of markers. Efficiencies in peeling algorithms

can be applied to the peeling step here to improve the overall efficiency. Currently some

genotype elimination has been implemented in SIMPLE to achieve such efficiencies. As in

peeling, this stage is sensitive to missing data.

In step 4 in the algorithm, we simulate the inheritance vector at a location of interest,

conditioned on the simulated ordered genotypes. For one imputation this involves simulat-

ing inheritance bits for two times the number of nonfounders, resulting in the calculations

being linear in the number of pedigree members. The computational time required for cal-

culating the score (step 5 in the algorithm) depends on its complexity; see Markianos et

al. (2001a) for a detailed discussion. Missing data has no effect on either of these last two

steps since they are conditioned on complete ordered genotypes.

The memory is most influenced by the number of loci being analyzed. This is because

we store the joint recombination probabilities across all loci, leading to the storage being exponential in the number of loci being analyzed. In steps 1 through 3 we store the recombination probabilities for just the markers, whereas in steps 4 and 5 we store the recombination probabilities for the markers plus a location of interest. These probabilities

are stored for all locations where the statistics are to be estimated.

We now present a summary of results for time and memory requirements in analyz-

ing a small, medium and large pedigree, respectively. We chose the first three pedigrees

(pedigrees 1, 2 and 3) that were presented in a simulated data set from Genetics Analy-

sis Workshop 12. The small, medium and large pedigrees have 52, 86 and 100 members,

respectively. They have 15, 17 and 34 members with missing data. Eight markers, with

6-8 alleles each and an average heterozygosity of .77, were analyzed. We ran SIMPLE for 1,000 imputations and estimated $S_{pairs}$. GENEHUNTER was not capable of analyzing any of these pedigrees without seriously reducing the number of pedigree members.

GENEHUNTER would have had to drop 24 (46%), 50 (58%) and 58 (58%) members in

the small, medium and large pedigrees, respectively, to be able to analyze them. We used

version 2.1.3 of GENEHUNTER here and throughout this paper.

We conducted the study on a Sun Blade 100 with an UltraSPARC IIe 500 MHz processor.

This study can be used as a rough guideline to the time and memory requirements for using

SIMPLE. The results are shown in Table 2.1. In this table we show the time and memory

requirements to process all 8 markers for 1,000 imputations in steps 1-3. Since the number

of points where linkage statistics are estimated depends on the user, we report the time and

memory requirements per point in steps 4 and 5. Because the computational time grows linearly with the number of imputations, the time for analyzing these pedigrees with 2,000 imputations, for example, would be approximately twice the reported times.


On the other hand, the memory is not affected by the number of imputations. For steps 4

and 5 the computational time and memory grow linearly with the number of points to be

analyzed. For example, to estimate the time and memory to analyze these 8 markers with

5 points between each pair of adjacent markers (43 points in total), multiply the reported

time and memory by 43.
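The scaling rule can be checked with a few lines of arithmetic. The helper name is hypothetical; the per-point time of .57 seconds is the value reported for the small pedigree in Table 2.1.

```python
def n_points(n_markers, per_interval):
    """Markers plus a fixed number of points inside each adjacent interval."""
    return n_markers + (n_markers - 1) * per_interval


points = n_points(8, 5)          # 8 markers, 5 points per interval
small_seconds = 0.57 * points    # steps 4 & 5, small pedigree, 1,000 imputations
print(points, small_seconds)
```

The same multiplication applies to the per-point memory figure, while the cost of steps 1-3 is unaffected by the number of points.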

             Steps 1-3                      Steps 4 & 5
Ped size     Time (hr:min)   Memory (MB)    Time (sec)   Memory (MB)
Small        1:37            4.3            .57          .42
Medium       1:41            4.1            1.33         .42
Large        3:47            7.5            1.61         .42

Table 2.1: We report the time and memory requirements to complete 1,000 imputations of steps 1-3 and 4 & 5 of the algorithm (including the calculation of the estimate) for eight markers in each of three pedigrees of sizes small (52 members), medium (86 members) and large (100 members). Results are reported per disease location for steps 4 & 5. Note that the time units are different for steps 1-3 and steps 4 & 5.

The time and memory requirements to produce the weights and ordered genotypes

(steps 1-3) for the small and medium pedigrees were similar. Though the medium pedi-

gree was substantially larger than the small pedigree, they both had a comparable amount

of missing data. This would explain why they took a similar amount of time and memory

to be analyzed. On the other hand, the large pedigree had twice as much missing data and

therefore took more than twice as long and almost twice as much memory as the other two

pedigrees to be analyzed. The memory requirements to simulate the inheritance vectors

(step 4), calculate the scores (step 5) and form the weighted estimates were the same for

all three pedigrees. This is expected since the number of loci (8 markers and 1 point) being


analyzed was the same for all three pedigrees. On the other hand, the time increased as

the size of the pedigree increased since the number of inheritance vectors to be simulated

increased accordingly.

2.5 Accuracy of Estimates

We did a number of validation studies of SIMPLE using GENEHUNTER to verify that

the scores were being estimated accurately within reasonable computing time. The scores

were always quite close to the true scores produced by GENEHUNTER. Of course the ac-

curacy is a function of the number of imputations. To estimate the necessary sample size

to reach a certain desired accuracy one may run SIMPLE for a small number of imputa-

tions (say 100) to estimate the sampling variability (which is automatically calculated in

SIMPLE). From this estimate one can calculate the necessary number of imputations.
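One way to turn a pilot run into a target number of imputations is sketched below. It assumes the usual Monte Carlo behaviour in which the standard error shrinks in proportion to one over the square root of the number of imputations; the function name and the example numbers are hypothetical and are not SIMPLE options.

```python
from math import ceil


def required_imputations(pilot_n, pilot_se, target_se):
    """Imputations needed to reach target_se, assuming SE ~ 1 / sqrt(N)."""
    return ceil(pilot_n * (pilot_se / target_se) ** 2)


# A pilot of 100 imputations with standard error 0.5, aiming for 0.125.
print(required_imputations(100, 0.5, 0.125))
```

Halving the standard error therefore costs roughly four times as many imputations, which is worth keeping in mind when choosing the target accuracy.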

To illustrate the accuracy of SIMPLE, we analyzed pedigree 76 of the COGA (Collab-

orative Studies on the Genetics of Alcoholism) data set from Genetics Analysis Workshop

11. We removed three members so GENEHUNTER could analyze it. The pedigree is

shown in Figure 2.1. Note that it has a marriage loop. There are fourteen members in

the (reduced) pedigree with four founders. Eight markers are used from chromosome one:

D1S1613, D1S550, D1S532, D1S1588, D1S1631, D1S1675, D1S534, D1S1595. They

have nine to twelve alleles with an average heterozygosity of .75. The markers are spaced

11.2, 8.4, 18.1, 12.5, 11.9, 9.0 and 9.8 cM apart. Two founders (14%) are missing all of

their marker data. In addition, seven other members (50%) are missing data on D1S1631,

two members (14%) are missing data on D1S534 and three members (21%) are missing

data on other markers.

Figure 2.1: Pedigree used for validation study. The individuals marked with a slash mark have no marker data nor information on disease phenotypes.

The linkage statistics $S_{pairs}$ and $S_{all}$ were estimated at five locations between each adjacent pair of markers, using both GENEHUNTER and SIMPLE with 5,000 imputations.

As can be seen from the plots in Figure 2.2 the estimated standardized scores produced by

SIMPLE were quite close to the true scores produced by GENEHUNTER. The scores plus

and minus 3 standard errors are plotted in Figure 2.3.

[Figure 2.2 plot: panels "Pairs" (top) and "All" (bottom); x-axis "Location on Chrom 1 (cM)" with markers at 0.0, 11.2, 19.6, 37.7, 50.2, 62.1, 71.1 and 80.9; y-axis from 0 to 3.]

Figure 2.2: Scores produced by GENEHUNTER and SIMPLE are plotted by the line and circles, respectively. $S_{pairs}$ are plotted in the top frame and $S_{all}$ are plotted in the bottom frame. The markers are indicated by the extended tick marks and the locations in cM are indicated on the x-axis of the bottom plot.

[Figure 2.3 plot: panels "Pairs" (top) and "All" (bottom); x-axis "Location on Chrom 1 (cM)" with markers at 0.0, 11.2, 19.6, 37.7, 50.2, 62.1, 71.1 and 80.9; y-axis from 0 to 3.]

Figure 2.3: Scores produced by GENEHUNTER and SIMPLE are plotted by the line and dots (with vertical bars indicating 3 standard errors above and below estimated scores), respectively. $S_{pairs}$ are plotted in the top frame and $S_{all}$ are plotted in the bottom frame. The markers are indicated by the extended tick marks and the locations in cM are indicated on the x-axis of the bottom plot.

2.6 To Reweight or Not?

In the methods described above, we simulate the inheritance vectors (step 4) at ev-

ery location of interest (usually the entire chromosome in which the markers reside) and

then estimate the statistic using the simulated inheritance vectors. Alternatively, we could

simulate inheritance vectors at only a few locations of the chromosome and estimate the

linkage statistics at neighboring locations by reweighting, another importance sampling

idea exploited by Irwin et al. (1994). For instance, suppose that inheritance vectors were

simulated at position $\theta_0$. We can estimate the statistic at a nearby location, say $\theta_1$, by:

\[ \hat{E}_{\theta_1, v}[S(v, y_D)] = \frac{1}{w(\cdot)} \sum_{i=1}^{N} S(v_i, y_D)\, w_{\theta_0,\theta_1}(v_i, G_i), \tag{2.3} \]

where

\begin{align}
w_{\theta_0,\theta_1}(v, G)
 &= \frac{P_{\theta_1}(v \mid G)}{P_{\theta_0}(v \mid G)}\, w(G) \tag{2.4} \\
 &= \frac{P_{\theta_1}(v \mid G, Y_L)}{P_{\theta_0}(v \mid G, Y_L)}\, w(G) \tag{2.5} \\
 &= \frac{P_{\theta_1}(v \mid G, Y_L)}{P_{\theta_0}(v \mid G, Y_L)}\, \frac{P(G \mid Y_L)}{P^*(G \mid Y_L)}\, P(Y_L) \notag \\
 &= \frac{P_{\theta_1}(v, G \mid Y_L)}{P^*_{\theta_0}(v, G \mid Y_L)}\, P(Y_L). \tag{2.6}
\end{align}

As pointed out previously, conditioned on the ordered genotypes, the inheritance vectors are independent of the observed marker data, $Y_L$ (equation (2.5)). For ease of notation we will drop the subscripts $\theta_0, \theta_1$ in the notation for this new weight. The reweighted statistic in equation (2.3) is a consistent estimator of the linkage statistic at $\theta_1$. To see why this is a consistent estimator we note:

\begin{align}
E_{\theta_0; v, G}[S(v, y_D)\, w(v, G)]
 &= \sum_{v} \sum_{G} S(v, y_D)\, w(v, G)\, P^*_{\theta_0}(v, G \mid Y_L) \notag \\
 &= \sum_{v} \sum_{G} S(v, y_D)\, P(Y_L)\, \frac{P_{\theta_1}(v, G \mid Y_L)}{P^*_{\theta_0}(v, G \mid Y_L)}\, P^*_{\theta_0}(v, G \mid Y_L) \notag \\
 &= P(Y_L) \sum_{v} S(v, y_D) \sum_{G} P_{\theta_1}(v, G \mid Y_L) \notag \\
 &= P(Y_L) \sum_{v} S(v, y_D)\, P_{\theta_1}(v \mid Y_L) \notag \\
 &= P(Y_L)\, E_{\theta_1, v}[S(v, y_D) \mid Y_L]. \notag
\end{align}

By the fact that the mean of the weights is an unbiased estimator of the probability of the data (Irwin et al., 1994) and the above results, the importance sampling estimate in equation (2.3) is consistent.
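The reweighted estimate differs from the plain weighted average only in that each weight is multiplied by a probability ratio for the simulated inheritance vector. A minimal sketch follows; the function name and arguments are hypothetical, and the conditional probabilities of each simulated inheritance vector at the new and the original locations are assumed to be supplied by the caller.

```python
def reweighted_estimate(scores, weights, p_new, p_old):
    """Reweighted score estimate.

    p_new[i] / p_old[i] is the ratio of the conditional probability of the
    i-th simulated inheritance vector at the new location to that at the
    location where it was simulated. Normalization uses the original weights.
    """
    total_weight = sum(weights)
    numerator = sum(s * w * pn / po
                    for s, w, pn, po in zip(scores, weights, p_new, p_old))
    return numerator / total_weight


# Hypothetical inputs: the first vector is half as likely at the new location.
print(reweighted_estimate([1.0, 3.0], [1.0, 3.0], [0.2, 0.4], [0.4, 0.4]))
```

When a few probability ratios are very large the estimate is dominated by a handful of imputations, which is the variance problem discussed next.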

The main issue with importance sampling is not bias but rather variance (Irwin et al.,

1994). To illustrate the potential problems with reweighting we will use a pedigree with 3

generations, 7 siblings in the last generation as shown in Figure 2.4. Two of the sibs are

labeled as affecteds. There are 4 markers spread evenly over 30 cM. The inheritance bits for

the gametes at the 4 markers can be determined in all 7 children in the last generation. (Here

“gamete” is used loosely to refer to the inherited chromosome upon which the markers are

located.) The gametes are listed in Table 2.2. Children 9 through 12 had gametes a and b. Child 13 had gametes b and d. Child 7 had gametes d and c and child 8 had gametes c and a. There were 10 observed recombinations between the second and third markers and no observed recombinations elsewhere.

Figure 2.4: Pedigree used to illustrate reweighting. The two affecteds are shaded.

We analyzed this pedigree with $S_{pairs}$ using SIMPLE with a sample size of $N = 1{,}000$ at 21 interior points with reweighting and without reweighting. The reweighting was done by imputing the inheritance vector in the middle of each interval and estimating scores in

the interior of the interval using equation (2.3). The scores on the markers were estimated

by imputing the inheritance vector at the marker and estimating the score. They were also

estimated by reweighting from the scores simulated in the middle of each flanking interval.

So there were 2 to 3 estimates (including the reweighted estimate(s) from the adjacent

interval(s) and the non-reweighted estimate at the marker) for the scores at each marker

depending on whether there were 1 or 2 flanking intervals, respectively. The scores without

reweighting were estimated by simulating the inheritance vector at each point of interest

and estimating the score according to equation (2.2). We estimated the scores 10 times

with each method to capture the variability of the scores. The scores are plotted for both

methods in Figure 2.5. The curve corresponding to the true scores (which was calculated with GENEHUNTER) is also included. The scores without reweighting are tightly packed around the true scores. We see that the reweighted estimates do well in the first and last intervals but do poorly in the middle interval as the scores approach the flanking markers. In the middle interval the true scores form a diagonal between the scores at the two markers but the 10 estimated curves grossly diverge from the diagonal as they approach the two markers.

Label   Freq   Gamete
a       5      0 0 1 1
b       5      1 1 0 0
c       2      1 1 1 1
d       2      0 0 0 0

Table 2.2: Gametes of children in last generation.

The performance of the reweighted scores at interior points in the middle interval close

to the flanking markers was much poorer than anywhere in the other two intervals because

the variability was much higher at these points. In all 3 intervals the variability increases as

the location of the reweighted scores approaches a flanking marker. But 10 gametes have

observed recombinants in the middle interval whereas there are no observed recombinants

in any other interval. As a result a recombination is guaranteed between $\theta_0$ and one of the

two adjacent markers in all of the simulations for these recombinant gametes. On the other

hand a recombination is much less probable between the middle of the first or last intervals

and the adjacent markers. Assuming no interference the probability of a recombination between $\theta_0$ and each of the two adjacent markers in an outside interval was approximately .0025 for each of the recombinant gametes.

[Figure 2.5 plot: panels "Reweight" (left) and "No Reweight" (right); x-axis "Location (cM)" from 0 to 30; y-axis "Pairs" from 0 to 6.]

Figure 2.5: The left plot is of the scores estimated with reweighting (10 estimates) and the right plot is of the scores estimated without reweighting (10 estimates). The curves without reweighting are very close to the truth.
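The order of magnitude of the .0025 figure can be checked under stated assumptions. The sketch assumes the Haldane map function (which corresponds to no interference), markers 10 cM apart (4 markers spread evenly over 30 cM) and an imputation point 5 cM from each flanking marker; whether the original calculation conditioned on the flanking markers being non-recombinant is not stated, so both versions are shown.

```python
from math import exp


def haldane(distance_morgans):
    """Recombination fraction under the Haldane (no interference) map."""
    return 0.5 * (1.0 - exp(-2.0 * distance_morgans))


theta_half = haldane(0.05)            # 5 cM from the midpoint to a marker
both = theta_half ** 2                # recombination on both sides
neither = (1.0 - theta_half) ** 2     # recombination on neither side
conditional = both / (both + neither) # given no net recombination across the interval
print(theta_half, both, conditional)
```

Either way the event is rare in an outside interval, in contrast to the middle interval, where a recombination next to the imputation point is certain for the recombinant gametes.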

The variability of the reweighted scores will decrease as the sample size increases. For

the same pedigree data, we took 10 independent estimates of the scores for 1000, 5000,

25000 and 125000 imputations. The curves for the reweighted scores are plotted in Figure

2.6. The variability between the curves decreases as the number of imputations increases,

as expected, and the estimated scores converge to the truth.

[Figure 2.6 plot: four panels labeled 1K, 5K, 25K and 125K; x-axis "Location (cM)" from 0 to 30; y-axis "Pairs" from 0 to 6.]

Figure 2.6: 10 scores using reweighting are plotted for sample sizes N=1K, 5K, 25K and 125K.

The sequential imputation proposed by Irwin et al. (1994) for estimating the likelihood

saved a lot of computational time since the implementation translated directly to fewer single

locus peels for the disease gene. In this case, however, there is no such clear advantage of

using reweighting, and hence the practice is not adopted here.


CHAPTER 3

POWER STUDY

To illustrate the potential benefit to multipoint linkage analysis of processing all pedigree members of a large pedigree, we performed a simulation study. We used the $S_{pairs}$ statistic to analyze the full pedigree shown in Figure 3.1 with SIMPLE and then the same

pedigree was analyzed using GENEHUNTER, which needed to discard some members of

the pedigree. The pedigree had 37 members, 11 of whom were founders and 5 members

had missing marker and disease data. The ascertainment criterion was that at least one sib in

each of the seven sibships in the last generation had to be affected.

We used 6 markers with equally frequent alleles for each marker. The markers were

spaced 15 cM apart. We simulated the marker and disease data under three disease models.

In all three cases the disease data was simulated at a locus in the middle of the marker map

at 37.5 cM. In model I, the penetrances for genotypes aa, Aa and AA were 0, .9 and .95

with a disease allele frequency P(A) �£H � . In model II, the penetrances were .05, .4 and .6

with a disease allele frequency .05. In the third model the penetrances were .05, .5 and .7

with a disease allele frequency of .3.

Figure 3.1: Pedigree structure for the power study. The individuals marked with a slash will have no marker nor disease data.

Five hundred pedigrees were simulated under all three models. GENEHUNTER had to

drop between 14 (38%) and 20 (54%) members in order to process the pedigrees. To estimate power for a single pedigree, we calculated the proportion of pedigrees that had a maximum score exceeding a certain threshold. Four threshold levels were considered: 2.33, 3.09,

3.72 and 4.27, as suggested by Kruglyak et al. (1996). These thresholds correspond to

asymptotic significance levels .01, .001, .0001 and .00001, respectively.
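The correspondence between the thresholds and the nominal one-sided significance levels can be checked against the standard normal distribution. This sketch uses only Python's standard library; the variable names are arbitrary.

```python
from statistics import NormalDist

levels = [0.01, 0.001, 0.0001, 0.00001]
thresholds = [2.33, 3.09, 3.72, 4.27]
std_normal = NormalDist()  # mean 0, standard deviation 1

for level, threshold in zip(levels, thresholds):
    quantile = std_normal.inv_cdf(1.0 - level)   # exact upper quantile
    tail = 1.0 - std_normal.cdf(threshold)       # upper tail area at the threshold
    print(level, threshold, round(quantile, 3), tail)
```

Each quoted threshold lies within about 0.01 of the corresponding normal quantile, so the thresholds are rounded versions of those quantiles.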

From the initially simulated pedigrees, we re-sampled, with replacement, 500 data sets

of size ¤ , with ¤ ranging from 2 to 50 pedigrees for each of the three models. We estimated

powers by the proportion of data sets with standardized scores that exceeded the threshold

values.
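The resampling scheme can be sketched as follows. The data layout is hypothetical: each simulated pedigree is represented by its estimated scores at a set of locations together with its null mean and variance. Taking the maximum of the pooled standardized statistic over locations is an assumption carried over from the single-pedigree definition of power.

```python
import random
from math import sqrt


def resampled_power(pedigrees, n_peds, threshold, n_sets=500, seed=0):
    """Estimate power by resampling pedigrees with replacement.

    pedigrees: list of (scores_by_location, null_mean, null_var) tuples.
    """
    rng = random.Random(seed)
    n_loc = len(pedigrees[0][0])
    hits = 0
    for _ in range(n_sets):
        sample = rng.choices(pedigrees, k=n_peds)  # sampling with replacement
        mean_sum = sum(mu for _, mu, _ in sample)
        sd = sqrt(sum(var for _, _, var in sample))
        # Standardized sum of scores at each location; keep the maximum.
        best = max((sum(s[k] for s, _, _ in sample) - mean_sum) / sd
                   for k in range(n_loc))
        if best > threshold:
            hits += 1
    return hits / n_sets


# Degenerate check: identical pedigrees give the same statistic every time.
peds = [([2.0, 1.0], 1.0, 1.0)] * 3
print(resampled_power(peds, 4, 1.5), resampled_power(peds, 4, 2.5))
```

Running the function over a range of `n_peds` values and thresholds gives the points from which power curves can be drawn.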


3.1 Results

The results for a single pedigree are summarized in Table 3.1. The power estimates are

all low since the data set only consists of a single pedigree. The power is consistently higher

under SIMPLE versus GENEHUNTER under all three models and all threshold levels.

            Model I          Model II         Model III
Level       SIMPLE   GH      SIMPLE   GH      SIMPLE   GH
.01         44%      40%     38%      26%     21%      19%
.001        26       24      23       12      11       7
.0001       15       10      15       5       5        3
.00001      8        3       10       3       2        1

Table 3.1: Power estimates for a single pedigree. Power was defined as the percentage of pedigrees that exceeded certain thresholds. The thresholds used for asymptotic significance levels of .01, .001, .0001 and .00001 were 2.33, 3.09, 3.72 and 4.27, respectively.

The power estimates under all three models for the data sets with different pedigree

sizes are plotted in Figure 3.2 for thresholds 2.33 and 3.09 and Figure 3.3 for thresholds

3.72 and 4.27. We added a spline smooth curve to each of the plots. The power esti-

mates increase as the sample sizes increase, as expected. The power under SIMPLE is also

consistently above the power under GENEHUNTER.

For the first two models, we calculated the minimal sample sizes needed, based on

the spline smooth curve to reach 50%, 65% and 80% power for each of the threshold

levels: 2.33; 3.09; 3.72 and 4.27. The results are summarized in Table 3.2. Since the

power was much weaker for the third model we reported the corresponding results for

37

Page 51: Sequential imputation and multipoint linkage analysis

O

OOOOOOOOOOOOOOOOOOOOOOO

0.0

0.2

0.4

0.6

0.8

1.0

5 15 25

Pow

er

Model I

+

++

++++

+++++++++++++++++

OO

OOOOO

OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO

10 25 40

Model II

+++++++

+++++

+++++

++++++++++++++++++++++

threshold = 2.33 ( nominal level .01 )

OOOOOOOO

OOOOOO

OOOOOOOO

OOOOOOO

OOOOOO

OOOOOOOOOO

OOO

10 30 50

Model III

++++++++

+++++

++++++++++++

+++++++++++++++

++++++++

OOO

O

OOO

OOOOOOOOOOOOOOOOO

0.0

0.2

0.4

0.6

0.8

1.0

5 15 25

Pow

er

Model I

No. Peds

+++

+

+++

+++

++++++++++++++

O

OOO

OOOOO

OOOOO

OOOOOOOOOOOOOOOOOOOOOOOOO

No. Peds

10 25 40

Model II

++++

+++++

++++

++++

++++++

++++++++++

++++++

threshold = 3.09 ( nominal level .001 )

OOOOOOOO

OOOOOO

OOOOOOOO

OOOOOOO

OOOOOOO

OOOOOOOOOOOO

No. Peds

10 30 50

Model III

++++++++++++++

++++++++

+++++++

+++++++

++++++++++

++

Figure 3.2: Power curves for SIMPLE (solid line and ‘ ¥ ’) and GENEHUNTER (dashedline and ‘+’) based on a thresholds of 2.33 and 3.09 for all three genetic models.

38

Page 52: Sequential imputation and multipoint linkage analysis

Figure 3.3: Power curves for SIMPLE (solid line and ‘O’) and GENEHUNTER (dashed line and ‘+’) based on thresholds of 3.72 (nominal level .0001) and 4.27 (nominal level .00001) for all three genetic models. Panels: Model I, Model II, Model III; horizontal axis: No. Peds; vertical axis: Power.


powers 40%, 50% and 65% at thresholds 2.33 and 3.09 for this latter model. The results

for this latter model are summarized in Table 3.3. For model I, SIMPLE requires slightly fewer pedigrees to achieve the same power as GENEHUNTER. For model II, SIMPLE requires approximately half as many pedigrees as GENEHUNTER. In the

third model, GENEHUNTER needs approximately 50% more pedigrees than SIMPLE to

achieve the same power. In all three models, the reduction in the number of pedigrees

necessary to achieve the given powers using SIMPLE versus GENEHUNTER grows as the

desired power increases and as the threshold becomes more stringent.
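The way a minimal sample size can be read off a power curve is sketched below. This is only an illustration: linear interpolation stands in for the spline smooth used in the study, and the power values are hypothetical placeholders, not results from the simulations.

```python
def interpolated_power(sizes, powers, n):
    """Linearly interpolate the estimated power at sample size n."""
    pairs = list(zip(sizes, powers))
    for (n0, p0), (n1, p1) in zip(pairs, pairs[1:]):
        if n0 <= n <= n1:
            return p0 + (p1 - p0) * (n - n0) / (n1 - n0)
    raise ValueError("n lies outside the range of sizes studied")


def min_sample_size(sizes, powers, target):
    """Smallest integer number of pedigrees whose interpolated power
    reaches `target`; None if the target is not reached in the range."""
    for n in range(sizes[0], sizes[-1] + 1):
        if interpolated_power(sizes, powers, n) >= target:
            return n
    return None


# Hypothetical power estimates at 5, 10, 15 and 20 pedigrees.
sizes = [5, 10, 15, 20]
powers = [0.20, 0.50, 0.70, 0.90]

n_65 = min_sample_size(sizes, powers, 0.65)  # first size reaching 65% power
n_95 = min_sample_size(sizes, powers, 0.95)  # not reached within the range
print(n_65, n_95)
```

A return value of None plays the role of the ‘-’ entries in Table 3.3, where the required sample size lies beyond the largest size studied.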

                    Model I          Model II
Power   Level     SIMPLE   GH      SIMPLE   GH
50%     .01          2      2         2      5
50%     .001         4      5         5     11
50%     .0001        6      7         8     17
50%     .00001       8     10        11     24
65%     .01          3      4         4      8
65%     .001         6      6         7     15
65%     .0001        8     10        11     23
65%     .00001      11     13        13     29
80%     .01          5      6         6     12
80%     .001         7     10        11     21
80%     .0001       11     14        14     29
80%     .00001      14     17        18     37

Table 3.2: Sample size estimates for models I & II. For nominal significance levels of .01, .001, .0001 and .00001 we report the minimal sample size necessary (based on a spline fit) to achieve 50%, 65% and 80% power. The thresholds used for asymptotic significance levels of .01, .001, .0001 and .00001 were 2.33, 3.09, 3.72 and 4.27, respectively.
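The correspondence between the four thresholds and the nominal levels can be checked numerically. The sketch below assumes that the standardized score is asymptotically standard normal and that the test is one-sided; under that assumption the upper-tail probability at each threshold should sit close to its nominal level.

```python
from statistics import NormalDist

# Thresholds and nominal one-sided significance levels quoted in the text.
THRESHOLDS = {2.33: 0.01, 3.09: 0.001, 3.72: 0.0001, 4.27: 0.00001}


def upper_tail(z: float) -> float:
    """P(Z > z) for a standard normal Z."""
    return 1.0 - NormalDist().cdf(z)


tail_probs = {z: upper_tail(z) for z in THRESHOLDS}

for z, nominal in THRESHOLDS.items():
    print(f"threshold {z:.2f}: nominal {nominal:g}, normal tail {tail_probs[z]:.2e}")
```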


Power   Level    SIMPLE   GH
40%     .01        11     18
40%     .001       28     42
50%     .01        17     26
50%     .001       36      -
65%     .01        26     36
65%     .001       48      -

Table 3.3: Sample size estimates for model III. For nominal significance levels of .01 and .001 we report the minimal sample size necessary (based on a spline fit) to achieve 40%, 50% and 65% power. The thresholds used for asymptotic significance levels of .01 and .001 were 2.33 and 3.09, respectively. The cases marked by ‘-’ indicate that the required sample size is greater than 50.

3.2 Type I error

We studied the type I error rates for a data set of 15 pedigrees, which was chosen to

reflect a realistic situation. To estimate type I error we simulated the genotypes for 10,000

pedigrees using the same pedigree structure and missing data pattern used in the previous

power study (Figure 3.1), fixing the last generation as all affected. From these initially

simulated pedigrees we re-sampled 2,000 data sets of size 15 pedigrees with replacement.

We then calculated the proportion of data sets with standardized scores exceeding each

of four thresholds to estimate the type I error rates. The results for both SIMPLE and

GENEHUNTER are shown in Table 3.4. GENEHUNTER dropped 17 (46%) members in

each of the pedigrees simulated. The estimated type I error rates were close to the nominal

significance levels.
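The resampling scheme described above can be sketched as follows. Two things here are assumptions rather than details taken from the study: the pool of null scores is a hypothetical stand-in drawn from a standard normal distribution, and the score of a data set is taken to be the sum of its per-pedigree standardized scores divided by the square root of the number of pedigrees.

```python
import math
import random


def type_one_error_rates(pool, n_per_set, n_sets, thresholds, seed=0):
    """Estimate type I error rates by resampling null pedigrees.

    pool: standardized scores, one per pedigree simulated under no linkage.
    Assumption: a data set's score is sum(scores) / sqrt(n_per_set).
    """
    rng = random.Random(seed)
    exceed = {t: 0 for t in thresholds}
    for _ in range(n_sets):
        sample = rng.choices(pool, k=n_per_set)  # sampling with replacement
        z = sum(sample) / math.sqrt(n_per_set)
        for t in thresholds:
            if z > t:
                exceed[t] += 1
    return {t: exceed[t] / n_sets for t in thresholds}


# Hypothetical stand-in for the 10,000 pedigrees simulated under the null.
pool_rng = random.Random(1)
null_pool = [pool_rng.gauss(0.0, 1.0) for _ in range(10_000)]

rates = type_one_error_rates(null_pool, 15, 2_000, [2.33, 3.09, 3.72, 4.27])
print(rates)
```

With 2,000 resampled data sets the smallest non-zero rate that can be observed is .0005, so the estimates at the two most stringent thresholds are necessarily coarse.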


Nominal        Empirical
Level       SIMPLE     GH
.01          .008     .005
.001         .0005    .003
.0001        0        .002
.00001       0        0

Table 3.4: For nominal levels of .01, .001, .0001 and .00001 we report the estimated type I error rates for a sample of 15 pedigrees. The thresholds used for asymptotic significance levels of .01, .001, .0001 and .00001 were 2.33, 3.09, 3.72 and 4.27, respectively.

3.3 Discussion

One advantage of this method over the HMM is that it can process larger pedigrees

which can lead to an increase in power. We demonstrated the potential gain in power in our

simulation study using $S_{\text{pairs}}$ and three genetic models, although the magnitude of power

gains varied from model to model. Substantial power gains are observed under models II

and III, while the gains under model I are minimal. The different levels of power gains in

the three models are due to the differences in the amount of IBD information carried by the

affected individuals dropped. We note that using MCMC methods would yield comparable

results as these methods can process the same data as sequential imputation. However, we

would expect sequential imputation to be more efficient than MCMC for pedigrees that are

not too complex, such as the pedigrees studied.

We would expect the gains in power to be even greater with $S_{\text{all}}$ due to the nature of the statistic. Unlike $S_{\text{pairs}}$, $S_{\text{all}}$ gives increasing scores to the larger number of affected

pedigree members sharing an allele IBD. Since GENEHUNTER often discards affected


members, we would expect this to adversely affect the power to a greater degree with $S_{\text{all}}$ than with $S_{\text{pairs}}$. One drawback of using $S_{\text{all}}$, however, is the computational intensity of its

current implementation. Markianos et al. (2001b) have addressed this issue and proposed

a method to reduce the computational burden.


CHAPTER 4

HAPLOTYPING: AN APPLICATION

Haplotype reconstruction of many markers in pedigrees plays an important role in localizing disease-causing genes. Haplotype reconstruction is the attempt to reconstruct the haplotypes in pedigrees given genotype information, $\mathbf{y}_M$. Often the genotype information does not determine haplotypes exactly due to missing data or low heterozygosity of

markers. Various methods have been proposed to reconstruct haplotypes. In a recently

published article (Qian and Beckmann, 2002), a six-rule algorithm for the reconstruc-

tion of minimum-recombinant haplotype (MRH) configurations in pedigrees was proposed.

The authors compared their rule-based method to Tapadar’s evolution-based MRH method (Tapadar et al., 2000). Neither method, however, explores the entire haplotype space, nor does either provide the probabilities of haplotype configurations. The rule-based MRH method is further limited to pedigrees with “informative” or “partially informative” members. A

pedigree that is missing genotype information for two mating founders, for example, could

not be analyzed with this method. This places severe and often unrealistic restrictions on

the application of this method.


An alternative method is to derive a set of highly probable haplotype configurations

given the marker data. Determining these posterior probabilities by exact methods for

large pedigrees with many markers can be computationally infeasible, however. Monte

Carlo methods offer a viable solution to this computational challenge. Markov chain Monte

Carlo methods have been implemented for haplotype reconstruction (Lin and Speed, 1997;

Sobel et al., 1996), but they are subject to slow convergence due to high correlation between

samples. We propose to use sequential imputation to determine the haplotype configura-

tions with the highest posterior probabilities. Haplotype reconstruction with sequential

imputation is easy to implement and computationally efficient.

Furthermore, sequential imputation can easily incorporate crossover interference in de-

termining the haplotype configuration probabilities. Interference plays an influential role in

the formation of gametes and modeling it is essential to accurately determine haplotypes.

Yet the aforementioned MRH methods cannot incorporate it in their derivations, and exact probability methods often ignore it due to computational difficulties when analyzing many loci, especially in the presence of missing data. We will describe the methodology for estimating the haplotype configuration probabilities via sequential imputation and then apply

the methodology to a real data set using different models of interference.

4.1 Methodology

The ordered genotypes, $\mathbf{z}$, determine a haplotype configuration, $H$. To estimate the posterior probability of a haplotype configuration, $H$, we first sample the ordered genotypes, $\mathbf{z} = (z_1, \ldots, z_M)$, sequentially, conditioned on the marker data, and store the appropriate weight, $w(\mathbf{z})$. We do this $N$ times and then take a weighted average of the sample realizations that yield ordered genotypes which correspond to a haplotype configuration to estimate its probability.

To get a sample of size $N$ of the ordered genotypes and weights we follow steps 1 through 3 in Chapter 2 for $N$ imputations to get ordered genotypes $\mathbf{z}_1, \ldots, \mathbf{z}_N$ with corresponding weights $w(\mathbf{z}_1), \ldots, w(\mathbf{z}_N)$. We form the estimator of $P(H \mid \mathbf{y}_M)$, the posterior probability of haplotype configuration $H$:

$$\hat{P}(H \mid \mathbf{y}_M) = \sum_{i=1}^{N} I(\mathbf{z}_i \to H)\, \frac{w(\mathbf{z}_i)}{w_+},$$

where $\mathbf{z}_i \to H$ means that the ordered genotypes $\mathbf{z}_i$ correspond to configuration $H$, $I(\cdot)$ is the indicator function, i.e. $I(A) = 1$ if $A$ is true, otherwise $I(A) = 0$, and $w_+ = \sum_{i=1}^{N} w(\mathbf{z}_i)$ is the sum of the weights.

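Using equation (2.1) we get:

$$
\begin{aligned}
E\left[ I(\mathbf{z} \to H)\, w(\mathbf{z}) \mid \mathbf{y}_M \right]
&= \sum_{\mathbf{z}} I(\mathbf{z} \to H)\, w(\mathbf{z})\, P^{*}(\mathbf{z} \mid \mathbf{y}_M) \\
&= \sum_{\mathbf{z}} I(\mathbf{z} \to H)\, w(\mathbf{z})\, \frac{P(\mathbf{z} \mid \mathbf{y}_M)\, P(\mathbf{y}_M)}{w(\mathbf{z})} \\
&= P(\mathbf{y}_M) \sum_{\mathbf{z}} I(\mathbf{z} \to H)\, P(\mathbf{z} \mid \mathbf{y}_M) \\
&= P(\mathbf{y}_M)\, P(H \mid \mathbf{y}_M).
\end{aligned}
$$

Since $w(\mathbf{z})$ is an unbiased estimator of $P(\mathbf{y}_M)$, by the above results our estimator is consistent.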
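The estimator above amounts to a weighted relative frequency, which the following sketch illustrates. The sampler that produces the imputations is not shown; the function name, the (label, weight) input format and the toy values are assumptions made for illustration only.

```python
from collections import defaultdict


def estimate_posteriors(samples):
    """Weighted relative frequency of each haplotype configuration.

    samples: iterable of (configuration_label, weight) pairs, one pair
    per imputation, assumed to come from a sequential imputation sampler.
    """
    totals = defaultdict(float)
    weight_sum = 0.0
    for config, weight in samples:
        totals[config] += weight   # sum of weights of imputations mapping to config
        weight_sum += weight       # sum of all weights
    if weight_sum <= 0.0:
        raise ValueError("weights must have a positive sum")
    return {config: total / weight_sum for config, total in totals.items()}


# Toy input: three imputations, two of which map to configuration "C".
toy_samples = [("C", 2.0), ("D", 1.0), ("C", 1.0)]
posteriors = estimate_posteriors(toy_samples)
print(posteriors)
```

Because the weights are normalized by their own sum, the estimated probabilities of all configurations seen in the sample add up to one.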


4.2 Results

We used SIMPLE to derive the most probable haplotype configurations from data of

a published pedigree in a study of Episodic ataxia/myokymia by Litt et al. (1994) and

compared the results to those given in other articles. This pedigree is shown in Figure 4.1.

It consists of 27 members genotyped at 9 markers on chromosome 12. Two founders, 2001

and 1011, are completely missing their marker information. The marker names with respect

to their physical order are D12S91, D12S100, CACNL1A1, D12S372, pY2/1, pY21/1,

KCNA5, D12S99, and D12S93. The estimated recombination fractions (Dausset et al.,

1990; Litt et al., 1994) between the 9 markers (in the physical order of the markers) were

.01, .01, .03, .01, .02, .01, .01 and .01. The number of alleles at each marker (in the physical

order of the markers) were 6, 6, 4, 4, 6, 4, 8 and 6.

Although the heterozygosities at each of the markers were published in Litt et al. (1994)

the actual allele frequencies were not provided. The genotypes for the 2 founders with

missing information can be unambiguously determined at 8 of the 9 markers in founder

2001 and 6 of the 9 markers in founder 1011. The genotype at the 9th marker for founder 2001 is ambiguous. By inspection of the data it can be seen that at least one of the alleles is 4. By further inspection it can be seen that only a genotype of 4 4 would lead to no recombinations between the 8th and the 9th markers and any other genotype would imply a recombination which has a probability of .01. Since alleles used in genetic studies usually have frequencies greater than .01 we argue that this genotype is most likely 4 4. Similarly, we argue that the ambiguous genotypes at the 1st, 4th and 9th markers for founder 1011

are most likely 3 3, 3 3 and 4 4, respectively. For this reason we assume that the markers


Figure 4.1: Four haplotype configurations, A-D, from pedigree in episodic ataxia study. A single arrow indicates a recombination between the two adjacent loci. A double arrow indicates two distinct possible locations for the recombination. This image was obtained on November 3, 2002 from http://www.journals.uchicago.edu/AJHG/journal/issues/v70n6/013591/fg2.h.gif


have equal allele frequencies since this assumption will have little impact on the estimated

probabilities of the haplotype configurations because most of the missing genotypes of the

founders can be unambiguously determined by the genotype information of their children

and the ambiguous genotypes are most likely as described above, regardless of the allele

frequencies.

We analyzed the pedigree in Figure 4.1 under 7 different models of interference and

compared the results to those given by authors who have studied the same pedigree. We

used the $\chi^2$ model of interference with intensity parameters $m = 0$ (no interference), 1, 2, 3, 4, 5 and 6. We sampled N=100, 500, 1,000 (1K) and 100,000 (100K) ordered genotypes under each interference model. The haplotype configurations being reported all have a posterior probability greater than 1% under at least one of the interference models for N=100,000. 100,000 imputations took 4 minutes to draw on a Linux machine with an AMD 1800+MP processor and 3 GB of RAM. The results for N=100,000 imputations are summarized in Table 4.1. The results for all other sample sizes are summarized in Table 4.2. The empty cells

correspond to configurations with probabilities less than 1%. The configurations (‘cfg’)

will now be described.

The 4 most probable haplotype configurations under no interference ($m = 0$) were C, D,

A and B (see Figure 4.1) with posterior probabilities .28, .27, .07 and .07, respectively. All

other haplotype configurations had posterior probabilities of less than 1%. These matched

the configurations derived by Qian and Beckmann (2002). Note that they derived compa-

rable relative frequencies (.4, .4, .1 and .1 for configurations C, D, A and B, respectively)


Intensity parameter m = 0, 1, 2, 3, 4, 5, 6 (non-empty cells listed in order of increasing m)

cfg     Posterior probabilities
A       .07 .05 .02
B       .07 .05 .02
C       .28 .21 .09 .02
D       .27 .21 .09 .02
A^1     .01 .03 .05 .05 .05 .05
B^1     .01 .03 .05 .05 .05 .05
C^1     .05 .15 .20 .21 .21 .21
D^1     .05 .14 .19 .20 .21 .21
A^2     .01 .02 .02 .02 .02
B^2     .01 .02 .02 .02
C^2     .02 .05 .06 .06 .07 .07
D^2     .02 .05 .06 .07 .07 .07

Table 4.1: Haplotype configuration probabilities with N=100,000 imputations.

based on the estimated recombination fractions. Sobel et al. (1996) found that configu-

ration B was the most probable configuration using simulated annealing. Lin and Speed

(1997) analyzed the pedigree in Figure 4.1 using a Gibbs-Jump algorithm with a sample size of only 100. They found configurations C and B to be the most probable, with posterior probabilities of .41 and .09, respectively, and any other configuration had a posterior probability

of less than 3%. Even when we took a sample of size 100 with SIMPLE assuming no

interference (see Table 4.2) we estimated the 4 most probable haplotype configurations as

C, D, A and B with posterior probabilities .22, .22, .07 and .05, respectively. All other

configurations had a probability of less than 1%.

Although the rule-based MRH method yielded the same haplotype configurations in

this particular example, it does not provide their posterior probability nor does it explore


Intensity parameter m = 0, 1, 2, 3, 4, 5, 6 (non-empty cells listed in order of increasing m)

N       cfg     Posterior probabilities
100     A       .07 .04 .03
        B       .05 .01
        C       .22 .19 .06
        D       .22 .30 .13
        A^1     .02 .03 .04 .04 .03 .03
        B^1     .04 .06 .02 .09 .02 .02
        C^1     .07 .22 .11 .30 .11 .10
        D^1     .04 .25 .10 .22 .10 .10
        A^2     .14 .15 .15
        B^2
        C^2     .28 .30 .30
        D^2     .14 .15 .15

500     A       .07 .03 .02
        B       .07 .06 .01
        C       .31 .19 .08 .02
        D       .21 .20 .10
        A^1     .01 .04 .04 .05 .06 .06
        B^1     .02 .02 .03 .01 .03 .03
        C^1     .06 .15 .21 .20 .23 .24
        D^1     .08 .12 .21 .21 .20 .19
        A^2     .03
        B^2     .03 .05
        C^2     .01 .07 .05 .14 .05 .05
        D^2     .01 .10 .05 .11 .10

1K      A       .07 .05 .02
        B       .06 .04 .02
        C       .28 .20 .09 .02
        D       .29 .24 .09 .02
        A^1     .01 .04 .05 .04 .05 .05
        B^1     .04 .05 .06 .05 .05
        C^1     .06 .13 .18 .22 .20 .20
        D^1     .07 .15 .18 .17 .20 .20
        A^2
        B^2     .03 .06 .05 .06 .06
        C^2     .05 .02 .09 .02 .02
        D^2     .02 .03 .12 .09 .12 .13

Table 4.2: Haplotype configuration probabilities.


the entire haplotype space. Furthermore, the MRH method cannot incorporate interference, which can lead to drastically different results.

Under positive interference there were 8 additional configurations that appeared with

posterior probabilities exceeding 1%. All 8 configurations differed from one of the 4 con-

figurations derived by Qian and Beckmann (2002) in only the haplotype of the founder

labeled 1001. For each of these configurations, we will call the corresponding configuration that differs from it in only the haplotype of 1001 its conjugate.

The different haplotypes for founder 1001 are given in Table 4.3. The haplotype d for 1001 was common in all 4 configurations derived by Qian and Beckmann (2002): A, B, C, D. We label each new configuration $x^{\#}$, where $x$ is the conjugate of $x^{\#}$, and ‘#’ is the haplotype for 1001. So $A^2$ has conjugate configuration A with haplotype 2 for person 1001.

Label   Haplotype
d       1 3 9 5 6 3 3 1 6
        3 3 9 4 6 4 4 5 6
1       1 3 9 5 6 4 4 5 6
        3 3 9 4 6 3 3 1 6
2       1 3 9 5 6 3 4 5 6
        3 3 9 4 6 4 3 1 6

Table 4.3: Haplotypes for 1001.

Configurations A, B, C, and D all had at least 5 recombinations: 3 single recombinations and 1 double recombination. The other 8 configurations had at least 6 recombinations, but they were all single recombinations.


This difference in recombination patterns was solely due to the haplotype of member

1001. The haplotype d for 1001 observed in configurations A, B, C and D yielded one

double recombination in his child 100 and no recombinations in the other two children

(102 and 1) in all 4 configurations. In contrast, the haplotypes 1 and 2 observed in the other

8 configurations (under interference) yielded a single recombination in all three children

of 1001 but no double recombinations. The recombination patterns observed in the other

members were identical to each configuration’s conjugate.

Positive interference makes multiple recombinations less probable than under no interference. The stronger the positive interference the less probable are multiple recombinations. When we did not account for interference ($m = 0$) only configurations A, B, C and D had posterior probabilities greater than 1%. As the positive interference increased, i.e. as $m$ increased from 0 to 6, the configurations with only single recombinations became more probable, even though there were more recombinations overall. In fact, for $m \ge 4$ the

posterior probabilities for configurations A, B, C and D were all less than 1% and the other

8 configurations all had posterior probabilities greater than 1%. In light of the fact that

interference is known to occur in humans, it is important to take it into account. Ignoring it

can lead to drastically different results as observed in this exercise.

4.3 Discussion

Though the rule-based MRH method finds all configurations with the minimum number

of observed recombinations, it does not distinguish between single versus multiple recom-

binations on the same chromosome nor does it account for the varying distances between

markers when minimizing the number of recombinations. Furthermore, it does not explore


the entire haplotype space nor provide posterior probabilities for the haplotype configura-

tions that it finds.

Probability methods are advantageous over this rule-based MRH method because they

are not limited by missing data and they are based on the probability of the haplotype

configurations given the pedigree data. They also provide more flexibility to an investigator

by giving a set of haplotype configurations with probabilities attached to them from which the investigator can choose based on his/her own expertise (Lin and Speed, 1997).

Unlike exact probability methods, the computational cost of sequential imputation scales

linearly in both the number of pedigree members and markers.

MCMC methods will yield the same results as sequential imputation after large enough

samples are drawn. But we expect sequential imputation to be more efficient than most

MCMC methods for simple pedigrees such as the one studied in Figure 4.1.

None of the methods used to reconstruct the haplotypes in Figure 4.1 accommodated

crossover interference except for sequential imputation. It is impossible to include it in the

rule-based MRH method and computationally infeasible to adequately include it in exact

probability methods for many markers. None of the MCMC methods that were mentioned

in this article accounted for it (though it could be done). Yet interference plays an influential role in determining haplotype configurations. Indeed, the $\chi^2$ model with intensity parameter $m = 4$ has been found to fit well with human pedigree data (Lin and Speed, 1996)

and yet under this model none of the haplotype configurations that were derived by the

other methods had a posterior probability greater than 1%!


CHAPTER 5

INTERFERENCE STUDY

There is strong evidence that positive interference occurs during meiosis in humans

(Kwiatkowski et al., 1993). Accounting for interference is important when carrying out

multipoint analysis. Simulation studies have shown increased efficiencies in exclusion

mapping and gene ordering when accounting for interference using the $\chi^2$ model in small

human pedigrees (Lin and Speed, 1999).

We did a simulation study to examine the effect on power, precision and accuracy in

gene mapping when interference is ignored in a large pedigree with missing data. Interference was modeled by the $\chi^2$ model. The factors that we considered were the disease

model (the penetrances and allele frequencies of the disease gene), the location of the dis-

ease gene, the number of markers, the number of alleles of the markers and the genetic

distances between the markers. We assumed that the order of the markers was known and

there were no genotyping errors.

We used a single large pedigree in all of our simulations. Since the pedigree was large

and there were multiple markers exact methods could not be used to analyze this data with

interference. We used sequential imputation to analyze the data instead.


5.1 Study Design for Simulation

We considered two genetic models for the disease data: complex and dominant. The

complex model had penetrances of .05, .5 and .75 for the homozygous normal, heterozy-

gous and homozygous mutant genotypes with a mutant allele frequency of 30%. The dom-

inant model had penetrances of 0, .9 and .95 for the homozygous normal, heterozygous and

homozygous mutant genotypes with a mutant allele frequency of 10%.

For both disease models we simulated data for 8 markers and varied the number of alleles (4 or 8), the distance between the equally spaced markers (1, 5 or 10 centimorgans, cM) and the location of the disease gene (in the middle of the set of markers or outside of it). (When the disease gene was simulated outside

of the set of markers it was simulated at a distance equal to the marker interval width for

that model. So if the width was 5 cM it was simulated 5 cM outside of the set of markers.)

In addition, we repeated the same 12 configurations for the dominant disease model but

used 4 markers instead of 8. So there were 36 configurations in total. Under all of these

configurations the meioses were simulated under the $\chi^2$ model for the chiasma process with intensity parameter $m = 4$.

The pedigree used in the study contained 35 members over 3 generations. The pedigree

structure is shown in Figure 5.1. There were 6 individuals designated as missing which is

indicated by the slash through their individual symbol. Neither their affectation status nor

marker data were used in the analyses. A pedigree was ascertained if at least one member

was affected in each of the sibships of the last generation.
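The count of 36 data configurations in the design above can be verified by enumerating the factor levels. The dictionary keys used here are illustrative names, not identifiers from the study.

```python
from itertools import product

models = ["complex", "dominant"]
n_alleles = [4, 8]
widths_cm = [1, 5, 10]
locations = ["mid", "out"]

# 8-marker configurations for both disease models (2 x 2 x 3 x 2 = 24).
configs = [
    {"model": m, "markers": 8, "alleles": a, "width_cM": w, "location": loc}
    for m, a, w, loc in product(models, n_alleles, widths_cm, locations)
]

# The 12 dominant-model configurations repeated with 4 markers instead of 8.
configs += [
    {"model": "dominant", "markers": 4, "alleles": a, "width_cM": w, "location": loc}
    for a, w, loc in product(n_alleles, widths_cm, locations)
]

print(len(configs))
```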


Figure 5.1: The pedigree used in interference study. The slash indicates that the marker and disease data will be missing for that individual.

5.2 Methodology

As in previous chapters we partition the data into the $M + 1$ loci, $\mathbf{y} = (y_1, \ldots, y_M, y_D)$, where $\mathbf{y}_M = (y_1, \ldots, y_M)$ is the data on the $M$ markers and $y_D$ is the data on the affectation

status for the pedigree members. To estimate the LOD scores we followed the algorithm

described in Irwin et al. (1994). We employed the reweighting technique suggested by

Irwin et al. (1994) to reduce the computational cost in estimating the likelihood of the

disease at different locations. Since the computational cost in this application of sequential

imputation involved peeling the disease locus, the savings were substantial. We used the $\chi^2$ model of the chiasma process with intensity parameter $m = 4$ to account for interference.

For each pedigree we sampled the ordered genotypes N � � 9 ����� times. We estimated

the likelihood of the disease in the middle of each marker interval. We used reweighting

to estimate the likelihood at 6 interior points (equally spaced) and each of the flanking


markers. We then calculated the estimate of the LOD scores at each marker and the 6

interior points within each marker interval.

We did a preliminary study to determine the number of pedigrees necessary to achieve

sufficient power. The disease model that we used in the power analysis was a fully pene-

trant dominant disease gene with a mutant allele frequency of 10%. The disease gene was

simulated in the middle of a marker map with 8 markers. The 8 markers each had 8 equally

frequent alleles. The meioses were simulated under the $\chi^2$ model with interference intensity parameter $m = 4$. We simulated 100 pedigrees under this model for the disease and

marker data using the pedigree (with the same missing data pattern as described previously)

in Figure 5.1.

Over 75% of the pedigrees yielded LOD scores over 3 under the assumption of no inter-

ference. We resampled 5 pedigrees from the simulated pedigrees 100 times and combined

their LOD scores (thus estimating the power for a data set of 5 pedigrees). 100% of these

data sets had LOD scores over 3. We decided that a data set of 5 pedigrees should be

adequate to get sufficient power under the more realistic genetic models that we already

described.

When the disease gene was simulated in the middle of the marker map the disease re-

gion was defined as the marker interval containing the disease gene, including both flanking

markers. When the disease gene was simulated outside of the marker map the disease re-

gion was defined as half of the adjacent interval to where the disease gene was simulated,

including the adjacent marker. To compare power we calculated the percentage of data sets

that had maximum LOD scores exceeding 3 in the disease region under both interference


models for each data configuration. These percentages were our power estimates. If the

LOD scores under interference were more powerful than under no interference we would

expect these estimates to reflect that difference.
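The power estimate described above can be sketched as follows. The sketch assumes that pedigrees are resampled with replacement and that the LOD curves of the pedigrees in a data set are combined by adding them point by point; the function name and the toy curves are hypothetical.

```python
import random


def estimate_power(lod_curves, region, n_per_set=5, n_sets=100,
                   threshold=3.0, seed=0):
    """Share of resampled data sets whose combined LOD curve exceeds
    `threshold` somewhere in the disease region.

    lod_curves: one list of LOD scores per pedigree, all on the same grid.
    region: grid indices that make up the disease region.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sets):
        chosen = rng.choices(lod_curves, k=n_per_set)     # with replacement
        combined = [sum(vals) for vals in zip(*chosen)]   # LODs add across pedigrees
        if max(combined[i] for i in region) > threshold:
            hits += 1
    return hits / n_sets


# Toy curves on a 3-point grid; index 1 plays the role of the disease region.
strong = [[0.0, 1.0, 0.2]] * 20   # five such pedigrees sum to a LOD of 5.0
weak = [[0.0, 0.5, 0.2]] * 20     # five such pedigrees sum to a LOD of 2.5

power_strong = estimate_power(strong, region=[1])
power_weak = estimate_power(weak, region=[1])
print(power_strong, power_weak)
```

Running the same function on the curves estimated with and without interference gives the pair of power estimates compared for each data configuration.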

To compare precision we looked at the behavior of the LOD scores outside of the dis-

ease region when the maximum LOD scores were found in the disease region. When the

LOD score is calculated outside of the disease region we would want it to be lower than

inside the disease region. If incorporating interference makes the analysis more precise, we would expect the corresponding LOD scores to be lower under interference than under no interference outside of the disease region. To quantify this we calculated the mean difference in LOD scores (no interference $-$ interference) at all interior points outside of the disease region. (We excluded estimates at the markers since these were at times $-\infty$.) Within each configuration we took the mean of these mean differences. For simplicity we will refer to this statistic as the mean LOD difference. We would expect the mean LOD differences to be positive and large outside the disease region if incorporating interference improves precision. We only considered data sets where the maximum LOD score under

both interference models occurred in the disease region. For relative comparison we also

calculated the mean LOD difference inside the disease region for each data configuration.
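A minimal sketch of the mean LOD difference follows. It assumes each data set is represented by a pair of LOD curves on a common grid of interior points; the function name and the toy values are illustrative.

```python
def mean_lod_difference(datasets, points):
    """Mean over data sets of the mean (no interference - interference)
    LOD difference at the grid indices in `points`.

    datasets: list of (lod_no_interference, lod_interference) curve pairs.
    """
    per_set = []
    for lod_ni, lod_i in datasets:
        diffs = [lod_ni[k] - lod_i[k] for k in points]
        per_set.append(sum(diffs) / len(diffs))   # mean difference within a data set
    return sum(per_set) / len(per_set)            # mean of the mean differences


# Toy data: index 1 is inside the disease region, indices 0 and 2 are outside.
toy_sets = [
    ([1.0, 2.0, 0.0], [0.5, 2.0, -1.0]),
    ([0.0, 3.0, 1.0], [0.0, 3.0, 0.0]),
]

outside = mean_lod_difference(toy_sets, points=[0, 2])
inside = mean_lod_difference(toy_sets, points=[1])
print(outside, inside)
```

In this toy case the curves agree inside the region and the interference curve is lower outside it, which is the pattern that would indicate a gain in precision.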

To evaluate accuracy we examined the location of the maximum LOD scores. If the

analysis under interference is more accurate, the maximum LOD scores should occur more

often in the disease region under interference than no interference. To assess this we ex-

amined the joint distribution of the location of the maximum LOD scores for each data

configuration relative to the disease region. For each data configuration, we calculated the


percentage of data sets where the maximum LOD scores under both interference models

occurred in the disease region, where only one occurred in the disease region and where

neither occurred in the disease region.
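The joint classification of the maximum LOD locations can be sketched as below. Splitting the "only one" case by which analysis placed its maximum in the disease region is a choice made here for illustration; the function names and toy curves are hypothetical.

```python
from collections import Counter


def argmax(curve):
    """Index of the largest value in a list of LOD scores."""
    return max(range(len(curve)), key=lambda k: curve[k])


def location_table(datasets, region):
    """Percentage of data sets by whether the maximum LOD score falls in
    the disease region under both analyses, only one, or neither.

    datasets: list of (lod_no_interference, lod_interference) curve pairs.
    """
    region = set(region)
    counts = Counter()
    for lod_ni, lod_i in datasets:
        in_ni = argmax(lod_ni) in region
        in_i = argmax(lod_i) in region
        if in_ni and in_i:
            counts["both"] += 1
        elif in_i:
            counts["interference only"] += 1
        elif in_ni:
            counts["no interference only"] += 1
        else:
            counts["neither"] += 1
    keys = ("both", "interference only", "no interference only", "neither")
    return {k: 100.0 * counts[k] / len(datasets) for k in keys}


toy_pairs = [
    ([0.0, 2.0, 1.0], [0.0, 2.5, 0.5]),   # both maxima at index 1
    ([3.0, 2.0, 1.0], [0.0, 2.5, 0.5]),   # only the interference analysis
    ([3.0, 2.0, 1.0], [0.0, 0.5, 2.5]),   # neither
    ([0.0, 2.0, 1.0], [0.0, 2.5, 0.5]),   # both
]
table = location_table(toy_pairs, region=[1])
print(table)
```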

5.3 Results

For each of the 36 data configurations described in the previous section we generated

100 pedigrees. We resampled 100 data sets of 5 pedigrees within each configuration and

estimated the LOD scores under interference (using the correct $\chi^2$ model) and also under

no interference at the markers and 6 interior points. All other parameters were consistent

with the true data/disease model for each analysis.

5.3.1 Power

The power estimates were comparable under both interference models. We see from

Table 5.1 that the power estimates did not differ from each other by more than 5 percentage

points for any data configuration. The disease model and data configuration of course

affected the power. As expected, the power was much higher under the dominant model than under the complex model, and the power tended to increase as the marker intervals became finer.

But the difference in power between the interference models was never substantial.

5.3.2 Precision/Accuracy

While the powers across the entire marker map were quite similar, there were substan-

tial differences in terms of precision and accuracy under the two interference models.


                    Complex          Dominant          Dominant
                      m=8              m=8               m=4
Loc    n     W      I      NI        I      NI         I      NI
mid    4     1      8%     8%      100%   100%       100%   100%
mid    4     5      6      6        97     97         98     98
mid    4    10     11     13        98     98         97     97
out    4     1      9      9        98     98         98     98
out    4     5     12     13        73     73         58     63
out    4    10      2      2        52     53         48     50
mid    8     1      6      6       100    100        100    100
mid    8     5     31     32        95     95         99     99
mid    8    10      8      7       100    100         98     98
out    8     1      5      5        93     94         97     97
out    8     5     16     16        75     76         84     85
out    8    10      0      0        41     42         30     33

Table 5.1: Power estimates. The percentage of data sets that had a maximum LOD score greater than 3 is reported. Loc=location of disease gene (‘mid’=middle of marker map, ‘out’=outside of marker map), n=number of alleles, W=width of interval between markers, m=number of markers, I=analysis under interference, NI=analysis under no interference.


When the maximum LOD score under both interference models occurred within the

disease region the scores were often very close within that region, but outside of the region

the scores under interference tended to drop off faster. The LOD scores for a typical data set where this occurred are shown in Figure 5.2. This data set came from a simulation done

with the dominant disease simulated in the middle of 8 markers with 8 alleles each and a 10

cM width between markers. We can see from Figure 5.2 that the scores were very close in

the middle interval where the disease gene is located, but outside of this interval the scores

under interference were relatively much lower than the scores under no interference.

[Figure 5.2 plot omitted; axes: Location (cM) versus LOD.]

Figure 5.2: An example of LOD scores from data with interference. The solid and dotted lines correspond to LOD scores estimated with and without interference, respectively.

As a result, the mean LOD differences were much smaller in magnitude inside the disease

region than outside the disease region for many of the data configurations. The mean


LOD difference was as high as 1.32 outside the disease region compared to .03 inside the

disease region. Roughly half of the non-zero mean LOD differences were negative inside

the disease region, i.e. the mean of those mean LOD scores under interference was higher

than the mean of the mean LOD scores under no interference. We reported the mean LOD

differences in Table 5.2 for both inside the disease region and outside the disease region

for each data configuration. We see that the mean LOD differences were all non-negative

outside of the disease region. And this difference was consistently greater in magnitude than

the mean LOD difference inside the disease region. Thus, by this measure of precision,

incorporating interference in linkage analysis tended to lead to more precise estimates of

the disease location than not incorporating interference.
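The precision measure used here can be sketched for a single data set as follows. This is only an illustration under stated assumptions: the function name, the grid of positions and the way the disease region is passed in are hypothetical, and the sketch assumes the difference (no interference minus interference) is averaged over the evaluated map positions before being averaged over data sets.

```python
def mean_lod_difference(positions, lod_ni, lod_i, region):
    """Mean (no interference - interference) LOD difference inside and outside a region.

    positions : map positions (cM) at which both LOD curves were evaluated
    lod_ni    : LOD scores estimated under no interference
    lod_i     : LOD scores estimated under interference
    region    : (low, high) bounds of the disease region in cM
    """
    low, high = region
    inside, outside = [], []
    for pos, ni, i in zip(positions, lod_ni, lod_i):
        (inside if low <= pos <= high else outside).append(ni - i)

    def mean(values):
        return sum(values) / len(values) if values else float("nan")

    return mean(inside), mean(outside)


# Toy curves: identical inside the region, lower under interference outside it.
print(mean_lod_difference([0, 10, 20, 30, 40],
                          [-2, 1, 3, 1, -2],
                          [-4, 0.5, 3, 0.5, -4],
                          region=(15, 25)))
```

A large positive outside value together with an inside value near zero corresponds to the pattern reported for the dominant model.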

The apparent increase in precision depended on a number of factors. There was a

negligible effect on precision for the complex model, whereas in the dominant model with

8 markers the increase in precision was at times substantial depending on the number of

alleles and number of markers. When the disease was simulated in the middle for the

dominant model the increase in precision was consistently higher for 8 markers versus 4

markers. The precision increased in all but one case for 8 alleles relative to 4 alleles. (For

width 1 cM, the mean LOD difference was .1 for 4 alleles versus .09 for 8 alleles.) As the

width of the marker interval increased the precision also tended to increase.

When the disease was simulated outside of the marker map the effects were less clear.

But in general the precision increased as the marker interval increased, holding all other

factors constant.


                  Complex, m=8     Dominant, m=8     Dominant, m=4
Loc   n    W  Outside   Inside  Outside   Inside  Outside   Inside
mid   4    1      .00      .00      .10      .00      .00      .00
mid   4    5      .02     -.01      .49      .06      .16      .01
mid   4   10      .03     -.01     1.06     -.05      .50     -.04
out   4    1      .00      .00      .20     -.01      .45      .00
out   4    5      .02      .02     1.20      .03      .39     -.02
out   4   10      .03     -.04     1.22      .10      .62     -.16
mid   8    1      .00      .00      .09      .00      .02      .00
mid   8    5      .01      .00      .74     -.01      .22      .00
mid   8   10      .03     -.01     1.16      .00      .67      .00
out   8    1      .00      .00      .09     -.01      .10      .00
out   8    5      .02      .00     1.32      .03      .51      .00
out   8   10      .02      .02     1.25      .06     1.24      .21

Table 5.2: Difference of LOD scores (no interference - interference) inside the disease region (‘Inside’) and outside the disease region (‘Outside’). These differences are reported for data sets in which the maximum LOD scores under both interference models occurred in the disease region (‘RR’). Loc=location of disease gene (‘mid’=middle of marker map, ‘out’=outside of marker map), n=number of alleles, W=width of interval between markers, m=number of markers.


The accuracy also increased when accounting for the presence of interference. The

maximum LOD scores tended to occur more often in the disease region under interference

than no interference. The difference in the rates ranged from 4 percentage points below to

26 percentage points above when comparing interference to no interference.

The joint distribution of the locations of the maximum LOD scores for interference and

no interference is summarized in Table 5.3. ‘RR’ is when the maximum LOD score evalu-

ated under interference and no interference both occur in the disease region. ‘RW’ is when

the maximum LOD score occurs in the disease region under interference, but not under no

interference. ‘WR’ is when the maximum LOD score occurs in the disease region under

no interference, but not under interference. And the remaining category (when neither of

the maximum LOD scores occur in the disease region) can be inferred from the other 3

categories.
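The four categories can be made concrete with a small sketch. The function name, the toy curves and the (low, high) representation of the disease region are assumptions made for illustration; ties in the maximum are resolved by taking the first position, which the text does not specify.

```python
def classify_max_lod(positions, lod_i, lod_ni, region):
    """Return 'RR', 'RW', 'WR' or 'WW' for one data set.

    The first letter refers to the analysis under interference and the
    second to the analysis under no interference; 'R' means the maximum
    LOD score falls in the disease region, 'W' means it does not.
    """
    low, high = region

    def max_in_region(lods):
        best = max(range(len(lods)), key=lods.__getitem__)
        return low <= positions[best] <= high

    first = "R" if max_in_region(lod_i) else "W"
    second = "R" if max_in_region(lod_ni) else "W"
    return first + second


# Toy example: the interference curve peaks inside the region, the other does not.
print(classify_max_lod([0, 10, 20], [0.0, 3.0, 1.0], [2.0, 1.0, 0.0], region=(5, 15)))
```

Tabulating these labels over all simulated data sets for a configuration would give percentages of the kind reported in Table 5.3.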

As with precision, the improvements in accuracy are most apparent for the dominant

model. In the dominant model, the largest factor affecting the increase in accuracy was the

width of the marker intervals when the disease was simulated outside of the marker map.

When the disease was simulated in the middle the improvement in accuracy was virtually

negligible.

5.4 Discussion

Goldstein et al. (1995) demonstrated that substantial gains in efficiency were possible in

exclusion mapping and gene ordering for completely informative data with the chi-square model

of interference. Lin and Speed (1999) showed that large gains (though not as substantial)


                  Complex, m=8     Dominant, m=8     Dominant, m=4
Loc   n    W    RR    RW    WR    RR    RW    WR    RR    RW    WR
mid   4    1   30%    0%    1%   55%    0%    4%   68%    1%    2%
mid   4    5    42     1     2    81     3     2    82     5     1
mid   4   10    54     6     4    93     1     1    96     1     1
out   4    1    34     0     0    53     2     1    69     2     0
out   4    5    37     6     6    48    22     0    34    18     0
out   4   10    39    22     1    28    24     0    27    18     0
mid   8    1    37     0     1    59     3     2    80     2     0
mid   8    5    68     3     2    97     0     1    93     0     0
mid   8   10    71     2     2    87     1     2    95     0     1
out   8    1    35     0     1    56     8     0    56     0     0
out   8    5    33     4     4    74     7     0    56    17     0
out   8   10    15     1     3    46    26     0    23    23     0

Table 5.3: Maximum LOD scores by locations. ‘RR’ is when the maximum LOD score evaluated under interference and no interference both occur in the disease region. ‘RW’ is when the maximum LOD score occurs in the disease region under interference, but not under no interference. ‘WR’ is when the maximum LOD score occurs in the disease region under no interference, but not under interference. Loc=location of disease gene (‘mid’=middle of marker map, ‘out’=outside of marker map), n=number of alleles, W=width of interval between markers, m=number of markers.


could also be achieved with more realistic small human pedigree data (sizes 7 and 10) for

the same problems.

We addressed the effect of accounting for interference on power, precision and accuracy

in linkage analysis using a large pedigree (35 members) with missing data. We found

that although power was not affected, the precision and accuracy were improved when

accounting for interference. While the LOD scores were very similar inside the disease

region (when the maximum LOD scores occurred in the disease region under both

interference models) they were often relatively much lower outside of the disease region

under interference.

Furthermore, the maximum LOD scores tended to occur more often in the disease re-

gion under interference than no interference. Both measures of precision and accuracy

under the complex model showed very little difference whereas the dominant model often

showed substantial differences. The improvement increased with the number of alleles and

the number of markers. This can be explained by the increased informativeness of the data.

As the number of alleles and markers increased, the data became more informative and the

chi-square model fit the data even better and thus became more precise and accurate. Likewise,

with the dominant model the data was more informative for linkage than with the complex

model and hence the chi-square model was more precise and accurate. Similarly, Lin and Speed

(1999) found increases in relative efficiency for the number of alleles but not for the num-

ber of markers. This may be due to the fact that they considered pedigrees with no missing

data (other than phase information). With missing data, as in this simulation study, the


number of markers has more of an effect on the informativeness of the data, as our

results indicate.

As the data became more informative for linkage the power also increased, as expected.

The chi-square model did not show any increased power relative to the no interference model,

however. This apparent paradox can be explained by the nature of linkage analysis and the

chi-square model. When the data was informative for linkage there was more evidence of

recombination between the disease gene and points outside the disease region than inside the disease

region. This led to more multiple recombinations outside of the disease region than inside.

For the distances analyzed, multiple recombinations are much less probable under the

chi-square model with m = 4 than under no interference. When there are fewer recombinations the two

probabilities are more similar. Inside the disease region, when there was no evidence of

recombination, the probabilities under the chi-square model (m = 4) were similar to the

probabilities under no interference and hence the LOD scores were similar. Outside of the

disease region, by contrast, when there was strong evidence for recombination, the probabilities were

distinct, with the probabilities under the chi-square model being lower. Hence while power was

similar, the chi-square model proved to be more precise and accurate.

Another apparent paradox was that the relative accuracy increased as the marker in-

terval widths increased. The informativeness of the data actually decreases as the interval

width increases. Hence the power decreases and one would likewise expect from arguments

similar to above that the precision and accuracy should also decrease. But the precision and

accuracy of the chi-square model relative to the no interference model actually increased when the

widths increased from 1 to 10 cM. At 1 cM there was virtually no difference in accuracy.


This is because the chi-square model yields probabilities very similar to those of the no

interference model at such small distances (Lin and Speed, 1999), whereas at 10 cM the

probabilities are more distinct and the resulting difference between the two models can be

seen. Increases in efficiency were similarly found by Lin and Speed (1999) when going

from 5 to 10 cM.
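The dependence on distance can be illustrated with the two-locus recombination fraction alone. The sketch below is not taken from this thesis or from SIMPLE: it assumes the stationary chi-square (gamma renewal) chiasma process with parameter m together with Mather's formula, which gives r(d) = (1/2)[1 - e^{-y} sum_{j=0}^{m} (y^j / j!)(1 - j/(m+1))] with y = 2(m+1)d and d in Morgans. The function name and the choice m = 4 are illustrative. With m = 0 the expression reduces to Haldane's no interference map function.

```python
import math


def chi_square_recomb(d_morgans: float, m: int) -> float:
    """Two-locus recombination fraction under the chi-square model (assumed closed form)."""
    y = 2.0 * (m + 1) * d_morgans
    # Probability of no chiasma in the interval under the stationary renewal process.
    p_no_chiasma = math.exp(-y) * sum(
        (y ** j / math.factorial(j)) * (1.0 - j / (m + 1)) for j in range(m + 1)
    )
    return 0.5 * (1.0 - p_no_chiasma)  # Mather's formula


for cm in (1, 5, 10):
    d = cm / 100.0
    no_interference = chi_square_recomb(d, 0)
    with_interference = chi_square_recomb(d, 4)
    print(cm, round(no_interference, 5), round(with_interference, 5))
```

Under these assumptions the two recombination fractions nearly coincide at 1 cM and separate as the interval widens. The multi-locus probabilities used in the analysis, which involve multiple recombinations, are not computed by this sketch.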


CHAPTER 6

DISCUSSION

Exact methods for linkage analysis are limited computationally in either the size of the

pedigree or the number of loci they can handle. As a result, investigators often have to

reduce the size of the data they are studying. This is usually done by a combination of

reducing the number of markers, subsetting a pedigree or splitting a pedigree into two or

more separate pedigrees. This can lead to a loss of power and even bias.

Monte Carlo methods are an important alternative to exact methods. They can often

handle both large pedigrees and a large number of loci. Markov chain Monte Carlo meth-

ods have been successfully implemented in large and complex pedigrees with many loci.

But due to the nature of pedigree data the MCMC samples are highly correlated and con-

vergence can be slow.

For simple pedigrees or pedigrees with at most 1 or 2 loops, sequential imputation

is a viable alternative to MCMC methods. Since it draws independent samples from its

sampling distribution sequential imputation is expected to be more efficient than MCMC

methods for many problems. Of course if the pedigree is too complex such that a single


locus peel is impossible then sequential imputation is not even viable and MCMC methods

are the appropriate choice.

Sequential imputation has been implemented to estimate LOD scores for simple pedi-

grees. We extended sequential imputation to handle complex pedigrees and showed how

it can be used to estimate any IBD sharing statistic of the form given in equation (1.2).

We compared the power of analyzing entire pedigrees with sequential imputation versus

analyzing reduced pedigrees with GENEHUNTER, a popular software package which cal-

culates linkage statistics exactly. We showed that analyzing the entire pedigree can lead

to substantial increases in power versus reducing them, which GENEHUNTER had to do

since the pedigrees were too large to process with its exact method.

We also incorporated the chi-square model of the chiasma process to carry out linkage analysis.

This model can lead to varying degrees of interference, from no interference to arbitrarily

strong positive interference. Exact methods which utilize HMMs, such as GENEHUNTER,

rely on the assumption of no interference in their algorithm. We showed the potential

benefits of accounting for interference in a haplotype study and linkage study.

In the haplotype study we showed that accounting for interference can lead to dramat-

ically different results. In the linkage study, we showed that the estimates of the gene

locations were more precise and accurate when accounting for interference.

In the algorithm described in this thesis, we decomposed the data into the information

that we have at the F loci and sequentially imputed the ordered-genotypes locus by locus.

We note that other decompositions are possible. For instance, one could decompose the

data into sets of loci. This would involve a multilocus peel per iteration, which obviously


increases the computational cost. The advantage is that it should decrease the Monte Carlo

variability and hence require fewer iterations to reach the same accuracy. Furthermore, the

order of the sequential imputation does not have to be the physical order of the loci. In fact,

the simulation variability should decrease by processing the more informative loci first.

SIMPLE, by default, uses the number of alleles as a measure of informativeness and sorts

the loci accordingly. The user may override this default and provide his/her own process

order.

Currently SIMPLE can estimate LOD scores and two popular nonparametric IBD sharing

statistics, and can produce all pairwise IBD sharing estimates with their weights. These IBD

sharing estimates and weights can be used to estimate other statistics of interest. We have

shown the flexibility and power of sequential imputation in linkage analysis. There are

still many other applications and improvements that can be made to the implementation of

sequential imputation in SIMPLE.

6.1 Efficiency

SIMPLE could be made more computationally efficient with a genotype elimination

algorithm. Without any genotype elimination, peeling a single person in a simple pedigree

involves $G$, $G^2$ or $G^3$ computations, depending on the type of peel involved, where $G$ is

the number of possible ordered genotypes. Even for peeling just one locus, as in SIMPLE,

this number can get large. For a single locus with 12 alleles, $G$ = 144 and $G^3$ = 2,985,984! For a complex

pedigree the computational cost can be far greater, even if the pedigree is not

too complex. For example, the pedigree presented in Figure 2.1 in chapter 2 has a single


marriage loop. Yet peeling one of these people involved $G^4$ computations. For 12 alleles,

$G^4$ = 429,981,696!

Currently there is some genotype elimination done in SIMPLE. For example, if every-

one in a pedigree is genotyped then G is effectively reduced to at most 2 and the compu-

tational cost of peeling anyone is at most 8 in a simple pedigree. But in the presence of

missing data, the genotype elimination algorithm can be improved. For example, peeling

people with missing data often involves summing over the number of alleles squared. An

obvious improvement would be to reduce the number of alleles at each locus to the number

of alleles observed plus one (for the unobserved alleles). This is what we did manually

to analyze the pedigree in Figure 2.1 in chapter 2 and it drastically reduced the computa-

tional time. We processed 8 markers with 9 to 12 alleles, as described in chapter 2. We

reduced the effective number of alleles to at most 5 using the genotype elimination step

just described. To impute 5,000 genotypes on a Linux machine with 512 MB RAM and a

1 GHz AMD Athlon processor took less than 2 hours after this genotype elimination. To

impute just 10 genotypes on the same machine but without the genotype elimination took

more than 24 minutes! With this application of genotype elimination the computational

time went from a matter of days to less than 2 hours.
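The arithmetic behind these costs is easy to reproduce. The sketch below uses hypothetical helper names and assumes, as in the text, that the number of ordered genotypes at a locus is the square of the number of alleles and that all unobserved alleles can be pooled into a single extra allele.

```python
def ordered_genotypes(n_alleles: int) -> int:
    """Number of ordered genotypes at one locus (ordered pairs of alleles)."""
    return n_alleles ** 2


def peel_cost(n_alleles: int, power: int) -> int:
    """Computations for one peel: G to the given power."""
    return ordered_genotypes(n_alleles) ** power


def reduced_allele_count(observed_alleles, n_alleles: int) -> int:
    """Observed distinct alleles plus one pooled allele for the unobserved ones."""
    distinct = len(set(observed_alleles))
    return min(distinct + 1, n_alleles)


print(peel_cost(12, 3))   # simple pedigree, most expensive peel
print(peel_cost(12, 4))   # peel across the marriage loop
print(peel_cost(5, 4))    # same peel after reducing to 5 effective alleles
print(reduced_allele_count([1, 3, 3, 7], n_alleles=12))
```

Going from 12 to 5 effective alleles shrinks the loop peel by a factor of (144/25)^4, roughly a thousandfold, which is in line with the drop in running time described above.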

6.2 Interference

All of these estimates can be done under the chi-square model of the chiasma process for simple

or complex pedigrees, as demonstrated in this thesis. We could implement other models of

interference in SIMPLE as well. Any multi-locus feasible map function or chiasma model

yielding recombination probabilities can be incorporated into SIMPLE. For any multi-locus

feasible map function, we can derive the recombination probabilities using the formula

derived by Schnell (1961). Given the recombination probabilities, we can use them to

calculate the multi-locus probabilities in steps 1 and 2 in the algorithm in chapter 2 and

thus incorporate interference via the corresponding map function or chiasma process.

6.3 Quantitative Trait Statistics

As already described, we can calculate any IBD sharing statistic of the form given in

equation (1.2). This includes quantitative traits as well as qualitative traits. For example,

the maximum likelihood estimate of the Haseman-Elston statistic (Haseman and Elston,

1972; Kruglyak and Lander, 1995) for quantitative traits can be estimated by SIMPLE.

The Haseman-Elston statistic is based on the regression model

$$D_i = \alpha + \beta \pi_i + \varepsilon_i, \qquad i = 1, \ldots, m,$$

where $D_i$ is the squared difference of phenotypes between the $i$th pair of sibs, $\pi_i$ is the proportion of IBD sharing between the pair, and $\varepsilon_i \overset{\text{iid}}{\sim} N(0, \sigma^2)$. If $\beta$ is significantly less than 0 then there is evidence of linkage. So we wish to estimate $\beta$, but we don't know the proportion of IBD sharing between the pairs, $\boldsymbol{\pi} = (\pi_1, \ldots, \pi_m)$. We can use an EM algorithm derived by Kruglyak and Lander (1995) to get the ML estimate of $\beta$.

In this application of the EM algorithm the complete data is $(\boldsymbol{\pi}, \mathbf{D})$, where $\boldsymbol{\pi} = (\pi_1, \ldots, \pi_m)$ are the proportions of IBD sharing for all $m$ pairs of sibs and $\mathbf{D} = (D_1, \ldots, D_m)$ are the squared differences of the observed quantitative phenotypes for the $m$ pairs of sibs. The observed data is $(y, \mathbf{D})$, where $y$ denotes the observed marker data. Note that $\mathbf{D}$ is just a function of the trait phenotypes.

In order to implement this EM algorithm one needs to know the probability distribution of the $\pi_i$'s, $P(\pi_i \mid y)$. This can be computationally prohibitive. We propose to use sequential imputation to estimate these distributions.

We will outline the algorithm for estimating these distributions. Following steps 1 through 4 in the algorithm in chapter 2 we produce a sample of $N$ weights $w(1), \ldots, w(N)$ and $N$ inheritance vectors. Given an inheritance vector we can determine $\pi_i$.

To estimate the probability $P(\pi_i = \pi \mid y)$ we take the weighted average over the sample realizations that yield $\pi$:

$$\hat{P}(\pi_i = \pi \mid y) = \sum_{j=1}^{N} I\big(\pi_i^{(j)} = \pi\big) \, \frac{w(j)}{w(+)},$$

where $\pi_i^{(j)}$ is the value of $\pi_i$ determined by the $j$th inheritance vector, $I(\cdot)$ is the indicator function, i.e. $I(A) = 1$ if $A$ is true and $I(A) = 0$ otherwise, and $w(+) = \sum_{j=1}^{N} w(j)$ is the sum of the weights. This is a consistent estimator of the desired probability. Using these estimated probabilities we implement the EM algorithm described in Kruglyak and Lander (1995) to derive a maximum likelihood estimate of the Haseman-Elston statistic.
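The weighted estimate of the IBD sharing distribution for one sib pair can be sketched as follows. The function name and the toy inputs are hypothetical; the sketch assumes that the IBD proportion implied by each imputed inheritance vector has already been computed, and it simply accumulates the normalized importance weights of the realizations that yield each value.

```python
from collections import defaultdict


def ibd_distribution(pi_values, weights):
    """Estimate P(pi_i = pi | data) as a weighted relative frequency.

    pi_values : IBD proportion for the pair implied by each imputed inheritance vector
    weights   : importance sampling weight of each imputation
    """
    if len(pi_values) != len(weights) or not weights:
        raise ValueError("need one weight per imputed inheritance vector")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    dist = defaultdict(float)
    for pi, w in zip(pi_values, weights):
        dist[pi] += w / total
    return dict(dist)


# Toy sample of four imputations with unequal weights.
print(ibd_distribution([0.5, 1.0, 0.5, 0.0], [2.0, 1.0, 1.0, 4.0]))
```

The resulting probabilities sum to one and could then be passed to an EM routine as the distribution of the unobserved IBD proportions.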


APPENDIX A

SIMPLE DOCUMENTATION

The URL for the software:

http://www.stat.ohio-state.edu/~statgen/SOFTWARE/SIMPLE

===================== INSTRUCTIONS ======================

This document contains instructions for compiling and
running SIMPLE (Sequential Imputation for Multi-Point
Linkage Estimation). SIMPLE uses the same input files used
in GENEHUNTER and takes similar commands.

SIMPLE is made up of two programs written in C. The first
program is 'simple' which produces importance sampling
weights and (optionally) simulated in-phase genotypes
(if NPL or QTL analysis) and likelihood (if LOD analysis)
ratios. This output is written to the screen and may be
redirected to a file. The second program, 'scan', reads in
the output from simple and produces the desired statistics
(which are written to the screen). Both programs take
the same commands.

/********** Set up **********/

untar and unzip SIMPLE.tar.gz


>gzip -d < SIMPLE.tar.gz | tar xvf -

This will produce a directory called /SIMPLE in the
current directory. /SIMPLE contains: Makefile file,
simple (compiled), scan (compiled), /src directory, /check
directory and instruct.txt file (this document). Makefile
is used to compile simple and scan. /src directory
contains the relevant code. /check directory contains
sample pedigree files to run SIMPLE and ensure that it
was compiled correctly.

/************* Compiling *************/

simple and scan were compiled on Solaris 7. To recompile
them, type

>make simple scan

/********************************* Basic form for running simple *********************************/

>simple [command file] > ’simple output file’

Basic form for running scan

>scan [command file] < 'simple output file' > 'scan output file'

The command file allows for changes from the default setup.
As in GENEHUNTER the command file must at least contain the
'load' and 'scan' commands to specify the input files.
The following commands may be used, with the defaults
given in [].


/************ Commands ************/

>load [linkloci.dat]

Gives the GENEHUNTER compatible marker-locus data (allele
frequencies for each genetic marker, frequency and
penetrance information for the disease). The format of
this file must be identical to the Linkage parameter file
(output from the PREPLINK program).

>scan [ped.dat]

Gives the location of the file with the pedigree, marker,
and disease data. The pedigree should be in the Linkage
pedigree input format (before running MAKEPED or doing
any preprocessing!). Each line of this file must have
the following structure:

 3   12   8   9   1   2   1   1 2 ... 0 0  1.269
(a)  (b) (c) (d) (e) (f) (g)  (h -------)   (i)

(a) pedigree name
(b) individual ID #
(c) father's ID #
(d) mother's ID #
(e) sex (1=MALE, 2=FEMALE)
(f) affectation status (0=UNKNOWN, 1=UNAFFECTED,
    2=AFFECTED)
(g) liability class (OPTIONAL) - classes specified
    in marker data file
(h) marker genotypes
(i) quantitative trait values
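As a rough illustration of this layout, the sketch below parses one pedigree line. It is not part of SIMPLE: the record type and function name are hypothetical, and the sketch assumes that the optional liability class column is absent, that the number of markers is known in advance, and that each line ends with exactly one quantitative trait value.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class PedRecord:
    pedigree: str
    individual: str
    father: str
    mother: str
    sex: int                          # 1=MALE, 2=FEMALE
    affection: int                    # 0=UNKNOWN, 1=UNAFFECTED, 2=AFFECTED
    genotypes: List[Tuple[int, int]]  # (0, 0) marks a missing genotype
    trait: Optional[float]            # None when the value is non-numeric (missing)


def parse_ped_line(line: str, n_markers: int) -> PedRecord:
    tokens = line.split()
    expected = 6 + 2 * n_markers + 1
    if len(tokens) != expected:
        raise ValueError(f"expected {expected} fields, found {len(tokens)}")
    alleles = [int(t) for t in tokens[6:6 + 2 * n_markers]]
    genotypes = list(zip(alleles[0::2], alleles[1::2]))
    try:
        trait = float(tokens[-1])
    except ValueError:
        trait = None  # non-numeric entry is treated as a missing trait value
    return PedRecord(tokens[0], tokens[1], tokens[2], tokens[3],
                     int(tokens[4]), int(tokens[5]), genotypes, trait)


record = parse_ped_line("3 12 8 9 1 2 1 2 0 0 1.269", n_markers=2)
print(record.genotypes, record.trait)
```

A file that uses the liability class column, or several trait columns, would need the field offsets adjusted accordingly.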

A '0' in any of the disease phenotype or marker genotype
positions (as in the genotypes for the last marker
above) indicates missing data. (See the example pedigree
file in /check.) By default, a non-numeric character
marks a missing value for a quantitative trait.
To change this to a numeric character use the
command:

>missingQT [ # ]

where '#' is the numerical character for a missing value
in the quantitative traits data.

You should only enter one pedigree at a time, though in
theory the software could handle more than one. However
this will put a strain on the memory. In general, you
will be able to handle bigger pedigrees if they are run
one at a time.

>maxiter [ 5000 ]

The number of imputed data sets to be generated by the
sequential imputation procedure. The null distribution
is estimated with 20*maxiter iterations.

>increment step [ 5 ]

Acts the same way as increment step does in GENEHUNTER.
However, simple does not currently calculate scores outside
the marker map. This should be changed in the future.
In addition the increment scan command of GENEHUNTER
has not been implemented, and may not be in the future
versions.

>seed [ 123456789 ]

As simple is a Monte Carlo procedure, a starting point
for the random number generator is required. While a
default value is given, this should be set for every run.
A valid integer between 0 and 2147483647 is required.
SIMPLE automatically drops a random integer in a file
called 'seed' after execution. This file may be appended
to the command file to accommodate changing the seed.

>debug

There is a debug mode in simple which allows intermediate
results to be examined. It was mainly added originally to
allow easier debugging as the code was being developed.
The default is not to debug. If you wish to use this
procedure, include 'debug' in the command file.

>process order [1 2 ... nmarkers]

The order that the markers are processed can be set.
The default is to process the markers in decreasing order
based on the number of alleles (most alleles processed
first, fewest alleles processed last). If two or more
markers have the same number of alleles, the lowest
numbered marker gets done first. If a disease locus is
to be processed, it is always processed last.
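The default ordering rule can be expressed compactly. The sketch below is an illustration only (the function name is hypothetical and SIMPLE itself is written in C); it covers the markers and leaves out the disease locus, which is processed last.

```python
def default_process_order(allele_counts):
    """Return 1-based marker numbers in the default processing order.

    allele_counts[k] is the number of alleles of marker k+1. Markers are
    sorted by decreasing number of alleles; ties go to the lowest numbered marker.
    """
    markers = range(1, len(allele_counts) + 1)
    return sorted(markers, key=lambda k: (-allele_counts[k - 1], k))


# Five hypothetical markers with 4, 8, 8, 2 and 6 alleles.
print(default_process_order([4, 8, 8, 2, 6]))
```

A user-supplied 'process order' line would simply replace the list returned here.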

>interference [ 19 0 ]

Currently only one interference model is included
in SIMPLE. The chisquare model with intensity m is
indicated by

>interference 19 m

Note that a chisquare model with m=0 is the no interference
model. This is the default setting. In the future, other
interference models will be added. The only change that
will be observed by the user is that additional options
to this command will become available.


>analysis [ BOTH ]

As in GENEHUNTER one may input 'BOTH', 'NPL' or 'LOD'.
In addition, one may input 'NONE'. This may be useful if
you just wish to conduct a QTL analysis.

>score [ PAIRS ]

As in GENEHUNTER one may input PAIRS or ALL.

>units [ ]

One may input 'cM' or 'rec-frac'. The default, as
in GENEHUNTER, is to assume that the distances are in
recombination fractions unless at least one distance is
greater than .5.

>peel [ <file name> ]

Indicate peeling order in file. This is only necessary
if the pedigree contains loops. The file should have
2 columns. The first column contains the cut set.
The second column contains the peel set. If there is more
than one member in any of the sets then separate them with
commas. The order of the cut and peel sets specifies the
peeling order. The last cut set, the root, should be 0.
An example peeling file is given below.

>849, 845        853
>844, 848        850
>844, 848        851
>839, 840        843
>839, 840        846
>839, 840        847
>845, 844        839,840
>849, 848        841,842
>849, 845, 844   848
>849, 845        844
>849             845
>0               849

>total stat

As in GENEHUNTER, total stat produces the scores for the
combined pedigrees. The default is to produce the scores
for the pedigrees separately.

>dumppairs

Print out the expected alleles shared IBD for each
relative pair. This is printed after the scores. The rows
correspond to the location and the columns correspond to
the pair. The header lists the PIDs of the pairs that
each column corresponds to. The first element of each
row is labeled by the location that it corresponds to.
Here is an example of the output from dumppairs:

>.
>.
>.
>===========================
>Expected alleles shared IBD
>===========================
>pedigree 1
>pair:            1            2            3 ...
>mem1:          839          839          839 ...
>mem2:          843          844          845 ...
> 0.00 1.000000e+00 1.000000e+00 1.000000e+00 ...
> 1.87 1.000000e+00 1.000000e+00 1.000000e+00 ...
> 3.73 1.000000e+00 1.000000e+00 1.000000e+00 ...
>.
>.
>.

The first pair in pedigree 1 contained the members 839
and 843 who had expected alleles shared IBD of 1, etc...

>dumpscores

Print out the raw scores simulated conditioned on the data.
The output is printed out before the estimated scores
are printed. The rows correspond to each iteration.
The first column is iteration number, the second column is
the weight and the rest of the columns correspond to the
simulated scores at each location. Here is an example
output from dumpscores:

>1 4.809104e-14 2.500000e-01 2.500000e-01 5.000000e-01 ...
>2 4.809104e-14 2.500000e-01 2.500000e-01 5.000000e-01 ...
>3 4.809104e-14 2.500000e-01 5.000000e-01 5.000000e-01 ...
>.
>.
>.

The first three iterations are shown. The weight for the
first iteration is 4.809104e-14. The score simulated in
the first iteration at the first location is 2.500000e-01,
etc...
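The dumped rows can be combined outside of SIMPLE. The sketch below is an assumption about how such rows might be post-processed, not a description of what scan does internally: it forms the importance-weighted average of the simulated scores at each location. The function name is hypothetical, the leading '>' is stripped if present, and filler rows such as '>.' are skipped.

```python
def weighted_scores(lines):
    """Importance-weighted mean score per location from dumpscores-style rows."""
    total_weight = 0.0
    sums = None
    for line in lines:
        fields = line.lstrip(">").split()
        if len(fields) < 3 or not fields[0].isdigit():
            continue  # skip filler or header rows
        weight = float(fields[1])
        scores = [float(x) for x in fields[2:]]
        if sums is None:
            sums = [0.0] * len(scores)
        if len(scores) != len(sums):
            raise ValueError("rows have different numbers of locations")
        for k, score in enumerate(scores):
            sums[k] += weight * score
        total_weight += weight
    if sums is None or total_weight == 0.0:
        raise ValueError("no usable rows")
    return [s / total_weight for s in sums]


rows = [
    ">1 4.809104e-14 2.500000e-01 2.500000e-01 5.000000e-01",
    ">2 4.809104e-14 2.500000e-01 2.500000e-01 5.000000e-01",
    ">3 4.809104e-14 2.500000e-01 5.000000e-01 5.000000e-01",
    ">.",
]
print(weighted_scores(rows))
```

Because the three toy rows carry equal weights, the result is just the plain average of each column.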

/************************ Example command file ************************/

The following example command file includes all possible
commands. In most cases the default values are used.
This would be from an example with 5 markers and a disease.

>load linkloci.dat
>scan ped.dat
>analysis BOTH
>score PAIRS
>maxiter 5000
>increment step 5
>seed 123456789
>process order 1 2 3 4 5
>interference 19 0
>units cM
>total stat
>dumppairs
>dumpscores

At the end of the output from simple and scan are written
the commands that were used for the session. (See the
example output in /check.)

/***************** Check results *****************/

/check contains an example data set in ped.dat and
linkloci.dat. After you compile SIMPLE you can check
the output from simple against 'out' and the output from
scan against 'scores'. The command file 'cmd' contains the
commands necessary to do the check and also has other
commands commented out which one may use. This command
file may serve as a template.

/***************************** Interpretation of results *****************************/

Appended at the end is a sample output file after
running SIMPLE on the sample data provided in the 'check'
directory. The command file was the following:

------------
Command file
------------

>load linkloci.dat
>scan ped.dat

We see that the command file simply indicated which locus
file to load and which pedigree file to scan. The default
statistics are both NPL-PAIRS and LOD scores. (See list
of commands used appended at the end of the output file.)

The first statistics that are reported (see file below)
are the estimated null mean and variance (under the columns
'mean' and 'var', respectively). Next are the NPL scores
at the different locations. The first column, 'ped',
indicates which pedigree these statistics correspond to.
The second column, 'pos', indicates the position that
the statistics were calculated at in cM. The third
column, 'S', contains the raw scores at those positions.
The fourth column, 'Z', contains the standardized scores at
those positions. The next to last column, 'SE(Z)', contains
the Monte Carlo SEs of the standardized scores. The last
column contains the estimated exact p-values.

The LOD scores follow the NPL scores. The first column,
'pos', contains the locations in cM where the scores were
calculated. The second column, 'LOD', contains the estimated
LOD scores. And the last column, 'SE(LOD)', contains the
Monte Carlo SEs of the estimated LOD scores.
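The relation between the 'S' and 'Z' columns can be checked by hand. The sketch below assumes the usual standardization Z = (S - mean) / sqrt(var) using the estimated null mean and variance; that SIMPLE computes Z in exactly this way is an inference from the sample output, where a null mean of 1.906368, a null variance of 0.597479 and a raw score of 1.625000 are reported alongside Z = -0.364010.

```python
import math


def standardized_score(raw: float, null_mean: float, null_var: float) -> float:
    """Standardize a raw NPL score with the estimated null mean and variance."""
    if null_var <= 0:
        raise ValueError("null variance must be positive")
    return (raw - null_mean) / math.sqrt(null_var)


# Values taken from the sample output for pedigree 1.
print(round(standardized_score(1.625000, 1.906368, 0.597479), 6))
print(round(standardized_score(3.375000, 1.906368, 0.597479), 6))
```

Both values agree with the 'Z' column of the sample output to within rounding of the printed mean and variance.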

-------------------
Example Output File
-------------------

>Null Scores
>ped      mean       var
>*** ********* *********
>  1  1.906368  0.597479
>
>ped   pos         S         Z +/-     SE(Z)  p-valu
>*** ***** ********* ********* *** ********* *******
>  1  0.00  1.625000 -0.364010 +/-  0.000000  0.6259
>  1  4.00  1.977075  0.091475 +/-  0.007706  0.2502
>  1  8.00  2.316438  0.530513 +/-  0.009376  0.2502
>  1 12.00  2.667263  0.984381 +/-  0.009395  0.2502
>  1 16.00  3.011613  1.429872 +/-  0.007738  0.2502
>  1 20.00  3.375000  1.899992 +/-  0.000000  0.1236
>  1 24.00  3.016950  1.436777 +/-  0.007702  0.2502
>  1 28.00  2.675387  0.994892 +/-  0.009369  0.2502
>  1 32.00  2.321700  0.537321 +/-  0.009394  0.2502
>  1 36.00  1.982575  0.098590 +/-  0.007749  0.2502
>  1 40.00  1.625000 -0.364010 +/-  0.000000  0.6259
>  1 44.00  1.986063  0.103102 +/-  0.007780  0.2502
>  1 48.00  2.325287  0.541962 +/-  0.009394  0.2502
>  1 52.00  2.661738  0.977233 +/-  0.009403  0.2502
>  1 56.00  3.014400  1.433478 +/-  0.007728  0.2502
>  1 60.00  3.375000  1.899992 +/-  0.000000  0.1236
>
>
>LOD Scores
>  pos           LOD SE(LOD)
>***** *****************
> 0.00 -10000.000000 0.0000
> 4.00     -1.600278 0.0000
> 8.00     -1.426529 0.0000
>12.00     -1.427397 0.0000
>16.00     -1.602947 0.0000
>20.00 -10000.000000 0.0000
>24.00     -1.599642 0.0000
>28.00     -1.426412 0.0000
>32.00     -1.427404 0.0000
>36.00     -1.602604 0.0000
>40.00 -10000.000000 0.0000
>44.00     -1.603792 0.0000
>48.00     -1.427640 0.0000
>52.00     -1.426351 0.0000
>56.00     -1.599998 0.0000
>60.00 -10000.000000 0.0000
>
>The following commands were used.
>An '*' indicates that the command was found in the
>command file 'cmd' and may not necessarily be the
>default value.
>-------------------------------------------------
>load linkloci.dat*
>scan ped.dat*
>analysis BOTH
>score ALL*
>units (AUTO) cM
>maxiter 5000
>increment step 5
>seed 123456789
>NO reweighting used
>process order (AUTO) 1 2 3 4 5
>interference 19 m=0

87

Page 101: Sequential imputation and multipoint linkage analysis

BIBLIOGRAPHY

Abecasis, G., Cherny, S., Cookson, W., and Cardon, L. (2002). Merlin-rapid analysis of dense genetic maps using sparse gene flow trees. Nature Genetics, 30:97–101.

Bergman, N. (2001). Posterior Cramer-Rao bounds for sequential estimation. In Doucet, A., de Freitas, N., and Gordon, N., editors, Sequential Monte Carlo methods in practice, pages 321–338. Springer-Verlag, New York.

Blake, A., Isard, M., and MacCormick, J. (2001). Statistical models of visual shape and motion. In Doucet, A., de Freitas, N., and Gordon, N., editors, Sequential Monte Carlo methods in practice, pages 339–357. Springer-Verlag, New York.

Cannings, C., Thompson, E., and Skolnick, M. (1978). Probability functions on complex pedigrees. Advances in Applied Probability, 10:26–61.

Cummings, K. (1997). Concepts of Genetics, volume 5. Prentice Hall, Upper Saddle River, NJ.

Dausset, J., Cann, H., Cohen, D., Lathrop, M., Lalouel, J., and White, R. (1990). Centre d’étude du Polymorphisme Humain (CEPH): collaborative genetic mapping of the human genome. Genomics, 6:575–577.

Elston, R. and Stewart, J. (1971). A general model for the genetic analysis of pedigree data. Human Heredity, 21:523–542.

Goldstein, D., Speed, T., and Zhao, H. (1995). Relative efficiencies of chi-square models of recombination for exclusion mapping and gene ordering. Genomics, 27:265–273.

Gudbjartsson, D., Jonasson, K., Frigge, M., and Kong, A. (2000). Allegro, a new computer program for multipoint linkage analysis. Nature Genetics, 25:12–13.

Haldane, J. (1919). The combination of linkage values and the calculation of distances between loci of linked factors. Journal of Genetics, 8:299–309.

Haseman, J. and Elston, R. (1972). The investigation of linkage between a quantitative trait and a marker locus. Behavior Genetics, 2:3–19.


Idury, R. and Elston, R. (1997). A faster and more general hidden Markov model algorithm for multipoint likelihood calculations. Human Heredity, 47:197–202.

Irwin, M., Cox, N., and Kong, A. (1994). Sequential imputation for multilocus linkage analysis. Proceedings of the National Academy of Sciences, 91:1684–1688.

Karlin, S. and Liberman, U. (1978). Classifications and comparisons of multilocus recombination distributions. Proceedings of the National Academy of Sciences, 75:6332–6336.

Kruglyak, L., Daly, M., and Lander, E. (1995). Rapid multipoint linkage analysis of recessive traits in nuclear families, including homozygosity mapping. American Journal of Human Genetics, 56:519–527.

Kruglyak, L., Daly, M., Reeve-Daly, M., and Lander, E. (1996). Parametric and nonparametric linkage analysis: A unified multipoint approach. American Journal of Human Genetics, 58:1347–1363.

Kruglyak, L. and Lander, E. (1995). Complete multipoint sib-pair analysis of qualitative and quantitative traits. American Journal of Human Genetics, 57:439–454.

Kruglyak, L. and Lander, E. (1998). Faster multipoint linkage analysis using Fourier transforms. Journal of Computational Biology, 5:1–7.

Kwiatkowski, D., Dib, C., Slaugenhaupt, S., Povey, S., Gusella, J., and Haines, J. (1993). An index marker map of chromosome 9 provides strong evidence of positive interference. American Journal of Human Genetics, 53:1279–1288.

Lander, E. and Green, P. (1987). Construction of multilocus genetic linkage maps in humans. Proceedings of the National Academy of Sciences, 84:2363–2367.

Lange, K. (1997). Mathematical and Statistical Methods for Genetic Analysis. Springer-Verlag, New York.

Lange, K. and Goradia, T. (1987). An algorithm for automatic genotype elimination. American Journal of Human Genetics, 40:250–256.

Lathrop, G., Lalouel, J., Julier, C., and Ott, J. (1984). Strategies for multilocus linkage in humans. Proceedings of the National Academy of Sciences, 81:3443–3446.

Lin, S. and Speed, T. (1996). Incorporating crossover interference into pedigree analysis using the χ² model. Human Heredity, 46:315–322.

Lin, S. and Speed, T. (1997). An algorithm for haplotype analysis. Journal of Computational Biology, 4:535–546.


Lin, S. and Speed, T. (1999). Relative efficiencies of the chi-square recombination models for gene mapping with human pedigree data. American Journal of Human Genetics, 63:81–95.

Litt, M., Kramer, P., Browne, D., Gancher, S., Brunt, E., Root, D., Phromchotikul, T., Dubay, C., and Nutt, J. (1994). A gene for Episodic Ataxia/Myokymia maps to chromosome 12p13. American Journal of Human Genetics, 55:702–709.

Luo, Y., Lin, S., and Irwin, M. (2001). Two-locus modeling of asthma in a Hutterite pedigree via Markov chain Monte Carlo. Genetic Epidemiology, 21(Supp 1):S24–S29.

Markianos, K., Daly, M., and Kruglyak, L. (2001a). Efficient multipoint linkage analysis through reduction of inheritance space. American Journal of Human Genetics, 68:963–977.

Markianos, K., Katz, A., and Kruglyak, L. (2001b). A new computational approach for rapid multipoint linkage analysis of qualitative and quantitative traits in large, complex pedigrees, and its implementation in GENEHUNTER. American Journal of Human Genetics, 69:228.

McPeek, M. (1999). Optimal allele-sharing statistics for genetic mapping using affected relatives. Genetic Epidemiology, 16:225–249.

O’Connell, J. and Weeks, D. (1995). The VITESSE algorithm for rapid exact multilocus linkage analysis via genotype set-recoding and fuzzy inheritance. Nature Genetics, 11:402–408.

O’Connell, J. and Weeks, D. (1999). An optimal algorithm for automatic genotype elimination. American Journal of Human Genetics, 65:1733–1740.

Ott, J. (1989). Computer-simulation methods in human linkage analysis. Proceedings of the National Academy of Sciences, 86:4175–4178.

Ploughman, L. and Boehnke, M. (1989). Estimating the power of a proposed linkage study for a complex trait. American Journal of Human Genetics, 44:543–551.

Qian and Beckmann (2002). Minimum-recombinant haplotyping in pedigrees. American Journal of Human Genetics, 70:1434–1445.

Schnell, F. (1961). Some general formulations of linkage effects in inbreeding. Genetics, 46:947–957.

Sobel, E., Lange, K., and O’Connell, J. (1996). Haplotyping algorithms. In Speed, T. and Waterman, M., editors, Genetic mapping and DNA sequencing, volume 81, pages 89–110. Springer-Verlag, New York.


Tapadar, P., Ghosh, S., and Majumder, P. (2000). Haplotyping in pedigrees via a genetic algorithm. Human Heredity, 50:43–56.

Thompson, E. (2000). Statistical inferences from genetic data on pedigrees, volume 6 of NSF-CBMS Regional Conference Series in Probability and Statistics. IMS, Beachwood, OH.

Warner, E., Foulkes, W., Goodwin, P., Meschino, W., Blondal, J., Paterson, C., Ozcelik, H., Goss, P., Allingham-Hawkins, D., Hamel, N., Prospero, L. D., Contiga, V., Serruya, C., Klein, M., Moslehi, R., Honeyford, J., Liede, A., Glendon, G., Brunet, J., and Narod, S. (1999). Prevalence and penetrance of BRCA1 and BRCA2 gene mutations in unselected Ashkenazi Jewish women with breast cancer. Journal of the National Cancer Institute, 91:1241–1247.

Weeks, D., Lathrop, G., and Ott, J. (1993). Multipoint mapping under genetic interference. Human Heredity, 43:86–97.

Whittemore, A. and Halpern, J. (1994). A class of tests for linkage using affected pedigree members. Biometrics, 50:118–127.

Zhao, H., McPeek, M., and Speed, T. (1995a). Statistical analysis of chromatid interference. Genetics, 139:1057–1065.

Zhao, H., Speed, T., and McPeek, M. (1995b). Statistical analysis of crossover interference using the chi-square model. Genetics, 139:1031–1044.
