Sequential Imputation and Linkage Analysis
DISSERTATION
Presented in Partial Fulfillment of the Requirements for
the Degree Doctor of Philosophy in the
Graduate School of The Ohio State University
By
Zachary Skrivanek, B.S., M.S.
* * * * *
The Ohio State University
2002
Dissertation Committee:
Shili Lin, Adviser
Mark Irwin
Steven MacEachern
Approved by
Adviser
Department of Statistics
©
Copyright by
Zachary Skrivanek
2002
ABSTRACT
Multilocus calculations using all available information on all pedigree members are im-
portant for linkage analysis. Exact calculation methods in linkage analysis are limited in
either the number of loci or the number of pedigree members they can handle. In this the-
sis, we propose a Monte Carlo method for linkage analysis based on sequential imputation.
Unlike exact methods, sequential imputation can handle both a moderate number of loci
and a large number of pedigree members. Sequential imputation does not have the prob-
lem of slow mixing encountered by Markov chain Monte Carlo methods because of high
correlation between samples from pedigree data. This Monte Carlo method is an applica-
tion of importance sampling in which we sequentially impute ordered genotypes locus by
locus and then impute inheritance vectors conditioned on these genotypes. The resulting
inheritance vectors together with the importance sampling weights are used to derive a con-
sistent estimator of any linkage statistic of interest. The linkage statistic can be parametric
or nonparametric; we focus on nonparametric linkage statistics. We showed that sequential
imputation can produce accurate estimates within reasonable computing time. Then we
performed a simulation study to illustrate the potential gain in power using our method for
multilocus linkage analysis with large pedigrees. We also showed how sequential imputa-
tion can be used in haplotype reconstruction, an important step in genetic mapping. In all
of the applications of sequential imputation we can incorporate interference, which often is
ignored in linkage analysis due to computational problems. We demonstrated the effect of
interference on haplotyping and linkage analysis. We have implemented sequential impu-
tation for multilocus linkage analysis in a user-friendly software package called SIMPLE
(Sequential Imputation for Multi-Point Linkage Estimation). SIMPLE currently can esti-
mate LOD scores, IBD sharing statistics and haplotype configuration probabilities for both
simple and complex pedigrees with or without interference.
This is dedicated to my father, Kenneth Skrivanek, for his unwavering support.
ACKNOWLEDGMENTS
I thank my advisors Mark Irwin and Shili Lin for their enormous dedication and contri-
bution to my research at Ohio State University.
The Collaborative Study on the Genetics of Alcoholism (COGA) (H. Begleiter, SUNY
HSCB Principal Investigator, T. Reich, Washington University, Co-Principal Investigator)
includes nine different centers where data collection, analysis, and/or storage takes place.
The nine sites and Principal Investigators and Co-Investigators are: Indiana University (T.-K.
Li, J. Nurnberger Jr., P.M. Conneally, H. Edenberg); University of Iowa (R. Crowe, S.
Kuperman); University of California at San Diego (M. Schuckit); University of Connecticut
(V. Hesselbrock); State University of New York, Health Sciences Center at Brooklyn (B.
Porjesz, H. Begleiter); Washington University in St. Louis (T. Reich, C.R. Cloninger, J.
Rice, A. Goate); Howard University (R. Taylor); Rutgers University (J. Tischfield); and
Southwest Foundation (L. Almasy). This national collaborative study is supported by the
NIH Grant U10AA08403 from the National Institute on Alcohol Abuse and Alcoholism
(NIAAA). GAW11 was supported by NIH grant GM31575.
VITA
March 17, 1970 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Born - New York, USA
Z. Skrivanek, S. Lin, M. Irwin, “Linkage Analysis with Sequential Imputation”. Department of Statistics, Ohio State University, Technical Report No. 689. August, 2002.
FIELDS OF STUDY
Major Field: Statistics
Studies in Linkage Analysis: Prof. Shili Lin and Mark Irwin
Meiosis is the process of forming gametes. During meiosis each chromosome of a homologous
pair replicates to give rise to two sister chromatids, which are connected to each other at
the centromere. These homologous chromosomes (consisting of 2 chromatids each) align
together to form a bundle of 4 chromatids. The homologs are bound by bands of protein
known as the synaptonemal complex (Cummings, 1997). As homologs begin to separate
from each other one or more areas between non-sister chromatids remain in contact at
locations known as chiasmata (chiasma singular). It is believed that at these chiasmata ex-
change of genetic material between non-sister chromatids occurs through a process known
as crossing over. Each chromatid in the pair of sister chromatids is assumed to have a
$\frac{1}{2}$ chance of participating in each crossover. It is further assumed that the probability
of a chromatid participating in one crossover is independent of previous crossovers for
that same chromatid, i.e. there is no chromatid interference. This assumption is largely
supported by empirical data (Zhao et al., 1995a).
There is, however, considerable evidence that the occurrence of a chiasma suppresses
the occurrence of another chiasma nearby (Weeks et al., 1993). This phenomenon is known
as (positive) chiasma interference or (positive) crossover interference or simply interfer-
ence.
Figure 1.1: Simplified depiction of meiosis with crossing over.
Not all crossovers may be observed due to the discrete nature of genetic data. We
only observe the phenotypes at loci on the chromosome, not on an actual interval of the
chromosome where the crossovers would occur. In fact, for two adjacent loci it is only
possible to observe a recombination of genetic material resulting from an odd number of
crossovers (the exact number not known). Consider the simplified depiction of crossover
during meiosis in Figure 1.1. The black chromatids are from one parent and the white
chromatids are from the other parent. Before separating, the two non-sister chromatids
exchange genetic material at the 2 points indicated in the picture (I in Figure 1.1). After
the homologous pairs separate (II in Figure 1.1) the sister chromatids eventually divide into
4 separate chromatids (III in Figure 1.1) which are allocated to 4 gametes. If the second
or third chromatid is passed on to the offspring we may observe a recombination between
the loci indicated by A and B in the picture since there was an odd number of crossovers
between A and B involving these two chromatids. If, on the other hand, we only had information on
A and C, there would be no recombination to observe on any of the chromatids, since there was an
even number of crossovers between these two loci on all of the chromatids.
The probability of a recombination between two loci is called the recombination fraction.
The recombination fraction, $\theta$, is bounded between 0 and $\frac{1}{2}$ if there is no chromatid
interference. This can be easily seen with Mather’s formula. Let $N_{ab}$ be the random
number of chiasmata between two loci $a$ and $b$. Mather’s formula is (Lange, 1997, p.
207):
$$\theta = \tfrac{1}{2}\, P(N_{ab} > 0).$$
The proof follows from the definition of recombination and the aforementioned assumptions.
When two loci are linked (i.e. they are on the same chromosome) the recombination
fraction is less than $\frac{1}{2}$. Otherwise they are unlinked and the recombination fraction is $\frac{1}{2}$.
The recombination fraction between two loci reflects their distance from each other
in the genome. The closer two loci are, the smaller the recombination fraction between the
two loci. The recombination fraction is not additive, however. To get this desired property
we use the genetic distance.
The genetic distance, $d$, between two loci is defined as the expected number of crossovers
between the two loci on a chromatid, i.e. $d = \frac{1}{2} E[N_{ab}]$. This metric has the advantage that
it is additive. The genetic distance, $d$, is measured in units called Morgans (or centiMorgans,
cM, by multiplying the Morgan by 100).
Map functions have been derived which map a recombination fraction to a genetic dis-
tance. These include the Haldane map function which assumes no interference (Haldane,
1919) and map functions derived from count-location models (Karlin and Liberman, 1978).
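As a small illustration of why distances, rather than recombination fractions, are additive, the sketch below uses the standard Haldane map function, $\theta = \frac{1}{2}(1 - e^{-2d})$, with $d$ in Morgans. The function names and the 10 cM intervals are illustrative choices and do not come from the original text.

```python
import math

def haldane_theta(d_morgans: float) -> float:
    """Recombination fraction for genetic distance d (Morgans), no interference."""
    return 0.5 * (1.0 - math.exp(-2.0 * d_morgans))

def haldane_distance(theta: float) -> float:
    """Inverse map: genetic distance (Morgans) for a recombination fraction < 1/2."""
    return -0.5 * math.log(1.0 - 2.0 * theta)

# Two adjacent intervals of 10 cM each (illustrative values).
d1, d2 = 0.10, 0.10
theta1, theta2 = haldane_theta(d1), haldane_theta(d2)

# Distances add across intervals, recombination fractions do not.
theta_total = haldane_theta(d1 + d2)
naive_sum = theta1 + theta2
print(theta_total, naive_sum)
```

Under no interference the combined recombination fraction is $\theta_1 + \theta_2 - 2\theta_1\theta_2$, which is always smaller than the naive sum because double recombinants cancel.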
An alternative to map functions is to model the chiasma process along the chromatid
bundle directly by a stationary renewal process in the genetic distance metric. For example,
we may model the distances between adjacent chiasmata as i.i.d. (suitably scaled) $\chi^2$ random
variables with $2(m+1)$ degrees of freedom. The parameter $m$ can be considered an intensity
parameter of interference. When $m = 0$, the point process
corresponds to the Poisson process and there is no interference. As $m$ increases so does the
level of interference.
The recombination model for this point process was derived by a number of authors
(Zhao et al., 1995b). Given a set of ordered loci and their corresponding (genetic) distances
one can compute recombination probabilities under the $\chi^2$ model for any intensity
parameter $m$ (Zhao et al., 1995b). Zhao et al. (1995b) showed that this model fits a wide
variety of recombination data well. Lin and Speed (1996) showed that the $\chi^2$ model with
intensity parameter $m = 4$ fits human pedigree data at least as well as, if not better than,
competing map functions.
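For two loci, the recombination fraction under the $\chi^2$ model has a closed form, which the following sketch evaluates. It assumes the usual construction of the model (chiasmata are every $(m+1)$-th point of a Poisson process with rate $2(m+1)$ per Morgan, with a uniformly random starting phase) together with Mather's formula. The function name is hypothetical, and the multilocus calculations used in the dissertation are not reproduced here.

```python
import math

def chi_square_theta(d_morgans: float, m: int) -> float:
    """Two-locus recombination fraction under the chi-square model.

    d_morgans : genetic distance between the loci, in Morgans
    m         : interference parameter (m = 0 means no interference)
    """
    y = 2.0 * (m + 1) * d_morgans  # expected number of underlying Poisson points
    # Probability that no chiasma falls between the two loci, averaged over
    # the random phase of the "every (m+1)-th point" thinning.
    no_chiasma = math.exp(-y) * sum(
        (y ** k / math.factorial(k)) * (1.0 - k / (m + 1)) for k in range(m + 1)
    )
    # Mather's formula: theta = 1/2 * P(at least one chiasma).
    return 0.5 * (1.0 - no_chiasma)

# At 20 cM, compare no interference (m = 0) with m = 4.
print(chi_square_theta(0.2, 0), chi_square_theta(0.2, 4))
```

With $m = 0$ the expression reduces to the Haldane map function. For larger $m$ the recombination fraction at a fixed short distance moves closer to the distance itself, because double crossovers become rarer.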
1.1.3 Pedigree Data & Inheritance Vectors
The data that we analyze in linkage analysis consists of pedigrees and information
on the individuals in the pedigrees. A pedigree contains members who are related either
through marriage or kinship. Founders are members whose parents are not included in the
pedigree and nonfounders are the rest of the members. By convention, nonfounders have
both of their parents in the pedigree. A pedigree has a “loop” if there is an individual in
the pedigree such that you can trace a path from that individual to connecting members and
eventually come back to the same member by a different path. For example, consider the
looped pedigree in Figure 1.2. Starting with member 7 you can trace a path from 7 to 4
to 5 to 8 and then back to 7 again. Pedigrees with at least one loop will be referred to as
“complex”; those without one will be called “simple”.
Figure 1.2: Pedigree with a loop.
The data on the individuals in a pedigree, $Y$, consist of their disease status, covariates
(e.g. age, weight, etc.) and marker data. In linkage analysis, pedigrees are ascertained or
included in the study based on certain criteria such as the number of affecteds. We partition
the data as $Y = (Y_M, y_D)$ into the marker information, $Y_M$, and the information on the trait of
interest, $y_D$.
By convention males are symbolized by squares and females are symbolized by circles.
The pedigree in Figure 1.3 has 7 members genotyped at 3 markers. Members 1, 2 and 4 are
founders, the rest are nonfounders.
Figure 1.3: Pedigree with genotypes.
The inheritance information in a pedigree can be completely described by a set of inheritance
vectors. The inheritance vector $v = (p_1, m_1, \dots, p_n, m_n)$ is a binary representation
of the inheritance information at a location in the genome for each of the $n$ nonfounders.
The $i^{th}$ nonfounder is assigned 2 bits, $p_i$ and $m_i$, corresponding to the genetic information
inherited from the father and mother. Each bit is either 1 or 0 depending on whether the
allele was inherited from the grandmother or grandfather, respectively. The inheritance
distribution, $P(v \mid Y_M)$, is the distribution of the inheritance vectors conditioned on the
observed marker data, $Y_M$. The inheritance vector at the first locus for nonfounders 3, 5, 6
and 7 in Figure 1.3 is (?,?, ?,0, ?,1, ?,0). The ‘?’ indicates that the inheritance bit cannot
be determined. Only the ancestral origins of the maternal alleles (the maternal inheritance
bits) for the children in the third generation (5, 6 and 7) can be determined. The genotypes
of the parents and grandparents of an individual are required (but not necessarily sufficient)
to determine his/her inheritance bits.
We can further determine that the maternal alleles at the first locus for persons 5 and 7 are copies of
the same allele. In this case they share the grandpaternal allele from their mother.
We say that this allele is shared Identical By Descent (IBD). On the other hand, although
persons 5 and 7 both inherited the same type of allele from their father, it is not clear whether this allele is
shared IBD, since one could be grandmaternal and the other could be grandpaternal. There
was an observed recombination between the first and second loci in member 7’s maternal
gamete indicated by an ‘x’. So the maternal gamete for individual 7 was a combination of
the two chromosomes in her mother. The first locus was grandpaternal and the next two
loci were grandmaternal. As a result, 5 and 7 do not share any other maternal alleles IBD.
The concepts of recombination and IBD play an important role in linkage analysis.
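To make the link between inheritance vectors and IBD sharing concrete, here is a minimal sketch that passes founder allele labels down a pedigree according to a fully specified inheritance vector. The pedigree is a hypothetical three-generation family that only loosely resembles the example above (founders 1, 2 and 4; nonfounders 3, 5, 6 and 7), and the paternal bits are made up, since they cannot be determined in the example. The function name and data layout are also assumptions.

```python
def drop_alleles(founders, nonfounders, inheritance_vector):
    """Assign founder allele labels to every pedigree member.

    founders           : list of founder ids
    nonfounders        : list of (child, father, mother), parents listed before children
    inheritance_vector : flat list (p_1, m_1, ..., p_n, m_n) of 0/1 bits;
                         0 = grandpaternal, 1 = grandmaternal
    Each person's alleles are stored as (paternal allele, maternal allele).
    """
    alleles = {}
    label = 1
    for person in founders:
        alleles[person] = (label, label + 1)  # two unique labels per founder
        label += 2
    for k, (child, father, mother) in enumerate(nonfounders):
        p_bit = inheritance_vector[2 * k]
        m_bit = inheritance_vector[2 * k + 1]
        # Bit 0 picks the parent's paternal allele, bit 1 the parent's maternal allele.
        alleles[child] = (alleles[father][p_bit], alleles[mother][m_bit])
    return alleles

founders = [1, 2, 4]
nonfounders = [(3, 1, 2), (5, 4, 3), (6, 4, 3), (7, 4, 3)]
# Maternal bits for 5, 6 and 7 are 0, 1, 0; the remaining bits are invented.
v = [0, 1, 1, 0, 0, 1, 0, 0]

alleles = drop_alleles(founders, nonfounders, v)
shared_5_7 = set(alleles[5]) & set(alleles[7])
print(alleles, shared_5_7)
```

In this sketch persons 5 and 7 end up sharing one label, the grandpaternal allele transmitted through their mother, so that allele is IBD between them.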
1.2 Linkage Analysis
Linkage analysis assesses whether a locus of interest is linked to a set of markers. That
is, it tests whether the locus is on the same chromosome as a set of markers. The hypotheses
being tested are:
$H_0$: disease gene not linked
$H_a$: disease gene linked
The LOD score is a popular parametric statistic in linkage analysis. It is the logarithm
base 10 of the likelihood of the disease gene at a specific location linked to the markers
(an alternative hypothesis scenario) divided by the likelihood of the disease gene not linked
to the markers (the null case). Traditionally, a LOD score above 3 is used as a criterion for
linkage.
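The simplest instance of a LOD score is the two-point case with phase-known, fully informative meioses, where the likelihood is binomial in the number of recombinants. The sketch below covers only that textbook special case, under those stated assumptions. It is not the multipoint pedigree likelihood used in this dissertation, and the function name is hypothetical.

```python
import math

def two_point_lod(recombinants: int, meioses: int, theta: float) -> float:
    """LOD score at recombination fraction theta (0 < theta <= 0.5).

    Assumes every meiosis can be scored unambiguously as recombinant or not.
    """
    nonrecombinants = meioses - recombinants
    log_lik_linked = (recombinants * math.log10(theta)
                      + nonrecombinants * math.log10(1.0 - theta))
    log_lik_unlinked = meioses * math.log10(0.5)  # theta = 1/2 under no linkage
    return log_lik_linked - log_lik_unlinked

# One recombinant in ten meioses, evaluated at theta = 0.1 (illustrative numbers).
print(two_point_lod(1, 10, 0.1))
```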
Linkage analysis extracts inheritance information from pedigree data to evaluate the
cosegregation of marker and trait alleles. Thus it is important to utilize available infor-
mation on multiple markers and all pedigree members. Unfortunately, algorithms for exact
analysis are computationally limited in either the number of markers or the number of pedi-
gree members they can handle. Peeling and the Hidden Markov Model (HMM) approaches
are two such exact methods that are most frequently used.
Peeling (Elston and Stewart, 1971; Cannings et al., 1978) is a computational algorithm
that successively aggregates inheritance information from pedigree members. The algo-
rithm scales linearly with the number of pedigree members, but exponentially with the
number of loci. Genotype elimination (Lange and Goradia, 1987; O’Connell and Weeks,
1999) and set-recoding (O’Connell and Weeks, 1995) have been proposed to reduce the
computational requirements so that data from more loci can be processed jointly. Despite
these improvements, peeling is still limited in the number of loci that it can handle.
Monte Carlo alternatives to exact calculation include Markov chain Monte Carlo (MCMC) and
sequential imputation. MCMC algorithms can be designed such that they scale linearly in
both the number of loci and the number of pedigree
members (Thompson, 2000). Thus, MCMC is an extremely powerful estimation method
that can practically deal with any number of loci and pedigree of arbitrary size and com-
plexity (Luo et al., 2001). However, due to strong dependencies among realizations of the
Markov chain, convergence can be slow (Thompson, 2000).
Sequential imputation is another Monte Carlo method that has been successfully ap-
plied to a variety of areas (Blake et al., 2001; Bergman, 2001). Irwin et al. (1994) illus-
trated how to use sequential imputation in linkage analysis to calculate the likelihood (and
hence LOD scores), utilizing the peeling algorithm for a single locus, which results in an
algorithm that also scales linearly in both the number of loci and the number of pedigree
members. For pedigrees that are not very complex (i.e., single-locus peelable), sequen-
tial imputation is expected to be more efficient computationally than MCMC methods in
most circumstances. However, it should be noted that sequential imputation is not meant
to be a replacement for MCMC, as it cannot handle very complex pedigrees, such as the
1544-member Hutterite pedigree successfully dealt with using MCMC methods (Luo et al.,
2001).
1.3 NPL Statistics
In this dissertation we will extend sequential imputation to nonparametric linkage anal-
ysis. This is an important step forward in making sequential imputation a viable alternative
for linkage analysis, as nonparametric linkage analysis is frequently more suited for ana-
lyzing complex traits whose underlying genetic model is unknown or unclear. We will now
describe nonparametric linkage (NPL) statistics.
NPL statistics measure IBD sharing among affecteds at a locus and compare the ob-
served sharing to what would be expected if the locus was not linked to a disease locus.
NPL statistics make no explicit assumptions about the trait model, hence they are nonpara-
metric. If the sharing significantly exceeds the expected value under the null model then
there is evidence of linkage.
The NPL statistics are based on a scoring function which scores the amount of IBD
sharing there is among affecteds. The scoring function is designed to give higher scores
under linkage than no linkage.
1.3.1 Scoring Functions
A scoring function, $S := S(v, y_D)$ for inheritance vector $v$ and observed disease phenotypes
$y_D$, measures the amount of IBD sharing among the affecteds. Whittemore and
Halpern (1994) presented two scoring functions, $S_{pairs}$ and $S_{all}$, which are popular today
in linkage analysis. $S_{pairs}$ assigns $\frac{1}{2}$, $\frac{1}{4}$ or 0 to each pair of affecteds that share 2, 1 or 0
alleles IBD, respectively, and then takes the average of the scores from all possible pairs in
a group of affecteds to score IBD sharing in the entire pedigree. For example, suppose two
sibs are affected in a pedigree and have the following inheritance vector (1,0, 1,0), which
implies that they both inherited the grandmaternal allele from their father and the grandpaternal
allele from their mother. Therefore they share two alleles IBD and would contribute
$\frac{1}{2}$ to the numerator of the score for the pedigree. The scores for all pairs of affecteds are
added together and the sum is divided by $\binom{a}{2}$, where $a$ is the number of affecteds in the
pedigree. $S_{pairs}$ gives increasing scores as the number of alleles shared IBD between a pair
of affecteds increases.
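The pairwise score can be sketched directly from the IBD labels carried by the affecteds. The code below scores each pair by the proportion of its four allele pairings that match, which gives $\frac{1}{2}$, $\frac{1}{4}$ or 0 for 2, 1 or 0 alleles shared IBD between non-inbred individuals. The function name, the label values and the dictionary layout are illustrative assumptions.

```python
from itertools import combinations

def s_pairs(affected_alleles):
    """Average pairwise IBD score over all pairs of affecteds.

    affected_alleles : dict mapping each affected to a tuple of two founder
                       allele labels (paternal, maternal); needs >= 2 entries.
    """
    pairs = list(combinations(affected_alleles, 2))
    total = 0.0
    for i, j in pairs:
        # Count matching labels among the 4 ways to pick one allele from each person.
        matches = sum(a == b
                      for a in affected_alleles[i]
                      for b in affected_alleles[j])
        total += matches / 4.0
    return total / len(pairs)

# Two affected sibs with inheritance vector (1,0, 1,0): if the father carries
# labels (1, 2) and the mother (3, 4), both sibs carry labels (2, 3).
both_shared = s_pairs({"sib1": (2, 3), "sib2": (2, 3)})
one_shared = s_pairs({"sib1": (2, 3), "sib2": (2, 4)})
print(both_shared, one_shared)
```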
In contrast, $S_{all}$ gives increasing scores as the number of affecteds sharing an allele IBD
increases. It is defined as (Kruglyak et al., 1996):
$$S_{all} = 2^{-a} \sum_{h} \left[ \prod_{i=1}^{2f} b_i(h)! \right],$$
where $h$ is a collection of alleles obtained by choosing one allele from each of the affected
individuals, and $b_i(h)$ denotes the number of times that the $i^{th}$ founder allele appears in $h$
(for $i = 1, \dots, 2f$), where $f$ is the number of founders. The sum is taken over the
$2^a$ possible ways to choose $h$, where $a$ is the number of affecteds.
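The definition of $S_{all}$ can be evaluated by brute-force enumeration when the number of affecteds is small. The sketch below does exactly that for a dictionary of IBD labels. The function name and example labels are assumptions, and founder alleles that never appear in $h$ contribute $0! = 1$, so they can be skipped.

```python
from collections import Counter
from itertools import product
from math import factorial

def s_all(affected_alleles):
    """S_all by enumerating all 2^a ways to pick one allele per affected.

    affected_alleles : dict mapping each affected to a tuple of two founder
                       allele labels.
    """
    people = list(affected_alleles)
    total = 0
    for h in product(*(affected_alleles[p] for p in people)):
        term = 1
        for count in Counter(h).values():  # b_i(h) for labels present in h
            term *= factorial(count)
        total += term
    return total / 2 ** len(people)

no_sharing = s_all({"a": (1, 2), "b": (3, 4)})
full_sharing = s_all({"a": (2, 3), "b": (2, 3)})
print(no_sharing, full_sharing)
```

The enumeration is exponential in the number of affecteds, so it is only meant to make the formula concrete, not to be efficient.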
The (raw) score, S, from a pedigree is then standardized by the mean and standard
deviation under no linkage, $\mu_0$ and $\sigma_0$, respectively, to form the standardized score Z:
$$Z = \frac{S - \mu_0}{\sigma_0}. \qquad (1.1)$$
1.3.2 The Statistic
Rarely can the inheritance vector $v$ be determined completely given the data in a pedigree,
$Y$. Instead, we derive the expected value of the score conditioned on $Y_M$, defined
previously as the marker data (Kruglyak et al., 1996):
$$E[S(v, y_D) \mid Y_M] = \sum_{v} S(v, y_D)\, P(v \mid Y_M). \qquad (1.2)$$
We note that, if we add genetic parameters for the disease model to the score function,
the statistic in the form (1.2) becomes a parametric statistic. In fact, the familiar LOD
score is included in this class (Kruglyak et al., 1996). For ease of notation we will let
$\bar{S} := E[S(v, y_D) \mid Y_M]$. This should not be confused with the sample mean, however, since
this mean is derived with respect to the inheritance distribution, $P(v \mid Y_M)$.
Following Kruglyak et al. (1996) we standardize the expected raw score by the same
mean and standard deviation used in equation (1.1):
$$\bar{Z} = \frac{\bar{S} - \mu_0}{\sigma_0}.$$
We note that this is the correct null mean since
$$\mu_0 = E[S] = E[E[S \mid Y_M]],$$
but the null variance is actually conservative since
$$\sigma_0^2 = \mathrm{Var}[S] = \mathrm{Var}[E[S \mid Y_M]] + E[\mathrm{Var}[S \mid Y_M]] \geq \mathrm{Var}[E[S \mid Y_M]]$$
and strict inequality will always hold unless $Y_M$ determines $v$ (and hence determines S), in
which case
$$\bar{S} = E[S(v, y_D) \mid Y_M] = S(v, y_D) = S.$$
Using the null variance of $S(v, y_D)$ as a substitute for the null variance of $E[S \mid Y_M]$ was
suggested by Kruglyak et al. (1996) as the “perfect data approximation”. The variance of
$E[S \mid Y_M]$ is difficult to calculate analytically. It could be estimated via simulation, but we
will not pursue that here.
Suppose a data set contains $n$ pedigrees with scores $\bar{S}_1, \dots, \bar{S}_n$ and null means and
variances $\mu_{0,1}, \dots, \mu_{0,n}$ and $\sigma^2_{0,1}, \dots, \sigma^2_{0,n}$, respectively. We can standardize the sum of the
raw scores by
$$\frac{\sum_{i=1}^{n} \bar{S}_i - \sum_{i=1}^{n} \mu_{0,i}}{\sqrt{\sum_{i=1}^{n} \sigma^2_{0,i}}}.$$
This statistic has a null mean of 0 and variance $\leq 1$ and is asymptotically normal. This
standardization was suggested by McPeek (1999) and considered “optimal” (under certain
conditions and a criterion of power). We will use this standardization throughout this dissertation.
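Both standardizations are simple arithmetic once the per-pedigree null moments are available. A minimal sketch follows; the function names and the numbers are hypothetical.

```python
import math

def standardize(score, null_mean, null_sd):
    """Single-pedigree standardized score, as in equation (1.1)."""
    return (score - null_mean) / null_sd

def combined_statistic(scores, null_means, null_variances):
    """Standardized sum of raw scores across pedigrees."""
    numerator = sum(scores) - sum(null_means)
    return numerator / math.sqrt(sum(null_variances))

# Two pedigrees with made-up scores and null moments.
z_single = standardize(0.5, 0.25, 0.125)
z_combined = combined_statistic([2.0, 1.0], [1.0, 1.0], [0.25, 0.75])
print(z_single, z_combined)
```

Because the sum is divided by the square root of the summed variances, pedigrees with larger null variances contribute more to both numerator and denominator.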
CHAPTER 2
SEQUENTIAL IMPUTATION FOR NPL ANALYSIS
The idea is to estimate the linkage statistic in equation (1.2) via sequential imputation
instead of calculating it exactly. Sequential imputation is an application of importance
sampling. We first impute ordered genotypes sequentially locus by locus via single-locus
peels. We then simulate inheritance vectors conditioned on these multilocus genotypes.
The inheritance vectors along with the importance sampling weights can be used to estimate
any linkage statistic of the form given in equation (1.2).
2.1 The Algorithm
We decompose the marker data, $Y_M$, further into the information we have on the $M$
markers, $Y_M = \{y_1, \dots, y_M\}$. We denote the ordered genotypes at the $M$ markers,
$\{g_1, \dots, g_M\}$, as $G$. We peel the first locus and impute the ordered genotypes at this locus (step
1). We sequentially impute the ordered genotypes of the rest of the loci locus by locus
conditioned on the previously imputed genotypes and then form the importance sampling
weight (steps 2 and 3). Then we simulate the inheritance vector $v$ at a particular location
given the simulated ordered genotypes at the $M$ markers (step 4). Finally, we calculate the
score using $v$ (step 5).
Step 1. Calculate $P(y_1)$ and simulate $g_1$ from $P(g_1 \mid y_1)$.
Step 2. For $t = 2, \dots, M$ we carry out the following steps:
(a) Calculate $P(y_t \mid y_1, g_1, \dots, y_{t-1}, g_{t-1})$.
(b) Derive $P(g_t \mid y_1, g_1, \dots, y_{t-1}, g_{t-1}, y_t)$.
(c) Simulate $g_t$ from $P(g_t \mid y_1, g_1, \dots, y_{t-1}, g_{t-1}, y_t)$.
Step 3. Form $w(G) = P(y_1) \prod_{t=2}^{M} P(y_t \mid y_1, g_1, \dots, y_{t-1}, g_{t-1})$.
Step 4. Simulate $v$ at a location of interest according to $P(v \mid G)$, where $G$ are the
ordered genotypes simulated in steps 1-3. Note that $P(v \mid G) = P(v \mid G, Y_M)$.
Step 5. Calculate the score $S(v, y_D)$.
Steps 1 to 5 are carried out $N$ times to form $w(G_1), \dots, w(G_N)$ and $S(v_1, y_D), \dots, S(v_N, y_D)$.
The probability calculations and the simulations in steps 1 and 2 are done by means of
single locus peeling and sampling using reverse peeling (Ploughman and Boehnke, 1989;
Ott, 1989).
Irwin et al. (1994) show that the sampling distribution of the ordered genotypes,
$P^*(G \mid Y_M)$, satisfies:
$$P^*(G \mid Y_M) = \frac{P(G \mid Y_M)\, P(Y_M)}{w(G)}. \qquad (2.1)$$
Taking the expectation of the weighted score under this sampling distribution gives
$$\begin{aligned}
E^*[S(v, y_D)\, w(G) \mid Y_M] &= \sum_{v} \sum_{G} S(v, y_D)\, w(G)\, P(v \mid G, Y_M)\, P^*(G \mid Y_M) \\
&= \sum_{v} S(v, y_D) \sum_{G} P(v \mid G, Y_M)\, \frac{P(G \mid Y_M)\, P(Y_M)}{w(G)}\, w(G) \\
&= \sum_{v} S(v, y_D) \sum_{G} P(v, G \mid Y_M)\, P(Y_M) \\
&= \sum_{v} S(v, y_D)\, P(v \mid Y_M)\, P(Y_M) \\
&= P(Y_M)\, E[S(v, y_D) \mid Y_M].
\end{aligned}$$
This result and the fact that the average of the weights is an unbiased estimator of $P(Y_M)$
(Irwin et al., 1994) give us a consistent estimator for the linkage statistic in (1.2):
$$\hat{E}[S(v, y_D) \mid Y_M] = \sum_{j=1}^{N} S(v_j, y_D)\, \frac{w(G_j)}{w(\cdot)}, \qquad (2.2)$$
where $w(\cdot) = \sum_{j=1}^{N} w(G_j)$. So the estimate is a weighted average of the scores.
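The mechanics of the weights and of the weighted-average estimator can be seen on a deliberately tiny stand-in problem. The sketch below replaces the pedigree by a single chain of hidden 0/1 states along the loci (adjacent states differ with probability equal to a recombination fraction) and replaces single-locus peeling by a two-state sum. Observations are either missing or a noisy reading of the hidden state. All names, the error rate `eps` and the example data are assumptions made for illustration, and the score is taken directly on the imputed states, so step 4 is skipped. A brute-force enumeration provides the exact answer for comparison.

```python
import itertools
import random

def emission(y, g, eps):
    """P(y | g): y is None (missing) or a noisy reading of the hidden bit g."""
    if y is None:
        return 1.0
    return 1.0 - eps if y == g else eps

def transition(g_prev, g, theta):
    """P(g | g_prev) for adjacent loci with recombination fraction theta."""
    return theta if g != g_prev else 1.0 - theta

def impute_once(ys, thetas, eps, rng):
    """One sequential imputation: returns (imputed bits, importance weight)."""
    # Step 1: P(y_1), then draw g_1 from P(g_1 | y_1).
    joint = [0.5 * emission(ys[0], g, eps) for g in (0, 1)]
    weight = sum(joint)
    bits = [0 if rng.random() < joint[0] / weight else 1]
    # Steps 2-3: impute locus by locus and accumulate the predictive probabilities.
    for t in range(1, len(ys)):
        joint = [transition(bits[-1], g, thetas[t - 1]) * emission(ys[t], g, eps)
                 for g in (0, 1)]
        predictive = sum(joint)  # P(y_t | earlier data and imputed states)
        weight *= predictive
        bits.append(0 if rng.random() < joint[0] / predictive else 1)
    return bits, weight

def estimate(ys, thetas, eps, score, n_imputations, seed=0):
    """Weighted-average estimate of E[score | data] and the mean weight."""
    rng = random.Random(seed)
    draws = [impute_once(ys, thetas, eps, rng) for _ in range(n_imputations)]
    total_weight = sum(w for _, w in draws)
    weighted = sum(score(bits) * w for bits, w in draws) / total_weight
    return weighted, total_weight / n_imputations

def exact(ys, thetas, eps, score):
    """Exact E[score | data] and P(data) by enumerating all hidden configurations."""
    p_data, s_sum = 0.0, 0.0
    for bits in itertools.product((0, 1), repeat=len(ys)):
        p = 0.5 * emission(ys[0], bits[0], eps)
        for t in range(1, len(ys)):
            p *= (transition(bits[t - 1], bits[t], thetas[t - 1])
                  * emission(ys[t], bits[t], eps))
        p_data += p
        s_sum += score(bits) * p
    return s_sum / p_data, p_data

ys = [1, None, 0, 0]          # the second locus is unobserved
thetas = [0.1, 0.1, 0.1]
score = lambda bits: bits[1]  # indicator that the hidden state at locus 2 is 1
print(estimate(ys, thetas, 0.1, score, n_imputations=5000))
print(exact(ys, thetas, 0.1, score))
```

In this toy the mean weight approximates the probability of the data and the weighted average approximates the conditional expectation of the score, mirroring the roles of $w(\cdot)/N$ and equation (2.2).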
The only disease data that we use to calculate the nonparametric IBD scores (step 5 in
the algorithm) is the affectation status. To calculate the score, $S(v, y_D)$, we first assign each
of the founders two unique labels, known as IBD states. We pass the IBD states down the
pedigree using the simulated inheritance vector. We then measure the amount of IBD states
in common amongst the affecteds via the IBD statistics.
2.2 The Null Distribution
The IBD statistic measures the amount of IBD sharing. If the amount of sharing among
the affecteds is significantly more than what would be expected under random segregation
and independent assortment, then there is evidence of linkage. Therefore it is necessary to
measure the mean and variance of the scores under random segregation and independent
assortment, the null case. To estimate the null mean and variance we simply pass the IBD
states through the pedigree with 50% probability of a particular state being passed on to
an offspring and calculate the score. We repeat this process many times to get a sample of
the scores from the null distribution. The mean and variance of this sample give unbiased
estimates of the null mean and variance. We then standardize the estimated score by the null
mean and null standard deviation to form the standardized statistic, $\hat{\bar{Z}}$. Furthermore, the simulated
scores under the null distribution are used to estimate the exact p-value. We note that this
leads to conservative estimates of the standardized statistic and p-value, as pointed out by
Kruglyak et al. (1996).
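The null simulation described here amounts to repeated "gene dropping" with fair coin flips. The sketch below does this for an affected sib pair with the pairwise score. The pedigree, the function names and the number of replicates are illustrative assumptions, and the score function is a compact restatement of the pairwise scoring rule described earlier.

```python
import random
from itertools import combinations
from statistics import mean, variance

def drop_at_random(founders, nonfounders, rng):
    """Pass founder labels down the pedigree, each parental allele with probability 1/2."""
    alleles = {}
    for k, person in enumerate(founders):
        alleles[person] = (2 * k + 1, 2 * k + 2)
    for child, father, mother in nonfounders:  # parents listed before children
        alleles[child] = (rng.choice(alleles[father]), rng.choice(alleles[mother]))
    return alleles

def s_pairs(alleles, affecteds):
    """Average over affected pairs of the proportion of matching allele pairings."""
    pairs = list(combinations(affecteds, 2))
    total = sum(sum(a == b for a in alleles[i] for b in alleles[j]) / 4.0
                for i, j in pairs)
    return total / len(pairs)

def null_scores(founders, nonfounders, affecteds, n_reps, seed=0):
    """Sample of scores under random segregation (no linkage)."""
    rng = random.Random(seed)
    return [s_pairs(drop_at_random(founders, nonfounders, rng), affecteds)
            for _ in range(n_reps)]

def empirical_p_value(observed, scores):
    """Fraction of null scores at least as large as the observed score."""
    return sum(s >= observed for s in scores) / len(scores)

founders = ["father", "mother"]
nonfounders = [("sib1", "father", "mother"), ("sib2", "father", "mother")]
scores = null_scores(founders, nonfounders, ["sib1", "sib2"], n_reps=20000)
print(mean(scores), variance(scores), empirical_p_value(0.5, scores))
```

For a sib pair the null sharing probabilities are 1/4, 1/2 and 1/4 for 2, 1 and 0 alleles, so the simulated mean and variance should be close to 0.25 and 1/32.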
2.3 The Software Package
We have implemented sequential imputation for linkage analysis in a software package
called SIMPLE (Sequential Imputation for MultiPoint Linkage Estimation). The nonparametric
IBD statistics currently available in SIMPLE include the score functions $S_{all}$ and
$S_{pairs}$, among others. Furthermore, SIMPLE can calculate LOD scores. SIMPLE
takes input files with the same format as those used in GENEHUNTER, enabling the user
to easily switch to SIMPLE if the pedigree is too large to be handled by GENEHUNTER
in its entirety. The software is freely available from Ohio State University’s Statistical
Genetics web site. The URL and documentation for the software are provided in appendix
A.
2.4 Computational Requirements
Producing the weights and ordered genotypes (steps 1-3) takes the majority of the com-
puting time. To complete a single imputation we need to do a single locus peel for each
marker and then do reverse peeling (Ploughman and Boehnke, 1989; Ott, 1989) to simulate
the ordered genotypes. So the complexity and memory requirements are the same as those
required to do $M$ single locus peels. The key difference in computational cost between this
algorithm and a standard peeling algorithm for linkage analysis such as that implemented
in LINKAGE (Lathrop et al., 1984) is that we are only doing a single locus peel at a time,
so the calculations are linear in the number of markers. Efficiencies in peeling algorithms
can be applied to the peeling step here to improve the overall efficiency. Currently some
genotype elimination has been implemented in SIMPLE to achieve such efficiencies. As in
peeling, this stage is sensitive to missing data.
In step 4 in the algorithm, we simulate the inheritance vector at a location of interest,
conditioned on the simulated ordered genotypes. For one imputation this involves simulat-
ing inheritance bits for two times the number of nonfounders, resulting in the calculations
being linear in the number of pedigree members. The computational time required for cal-
culating the score (step 5 in the algorithm) depends on its complexity; see Markianos et
al. (2001a) for a detailed discussion. Missing data has no effect on either of these last two
steps since they are conditioned on complete ordered genotypes.
The memory is most influenced by the number of loci being analyzed. This is because
we store the joint recombination probabilities across all loci, leading to the storage be-
ing exponential in the number of loci being analyzed. In steps 1 through 3 we store the
Table 2.1: We report the time and memory requirements to complete 1,000 imputations of steps 1-3 and 4 & 5 of the algorithm (including the calculation of the estimate) for eight markers in each of three pedigrees of sizes small (52 members), medium (86 members) and large (100 members). Results are reported per disease location for steps 4 & 5. Note that the time units are different for steps 1-3 and steps 4 & 5.
The time and memory requirements to produce the weights and ordered genotypes
(steps 1-3) for the small and medium pedigrees were similar. Though the medium pedi-
gree was substantially larger than the small pedigree, they both had a comparable amount
of missing data. This would explain why they took a similar amount of time and memory
to be analyzed. On the other hand, the large pedigree had twice as much missing data and
therefore took more than twice as long and almost twice as much memory as the other two
pedigrees to be analyzed. The memory requirements to simulate the inheritance vectors
(step 4), calculate the scores (step 5) and form the weighted estimates were the same for
all three pedigrees. This is expected since the number of loci (8 markers and 1 point) being
analyzed was the same for all three pedigrees. On the other hand, the time increased as
the size of the pedigree increased since the number of inheritance vectors to be simulated
increased accordingly.
2.5 Accuracy of Estimates
We did a number of validation studies of SIMPLE using GENEHUNTER to verify that
the scores were being estimated accurately within reasonable computing time. The scores
were always quite close to the true scores produced by GENEHUNTER. Of course the ac-
curacy is a function of the number of imputations. To estimate the necessary sample size
to reach a certain desired accuracy one may run SIMPLE for a small number of imputa-
tions (say 100) to estimate the sampling variability (which is automatically calculated in
SIMPLE). From this estimate one can calculate the necessary number of imputations.
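The pilot-run calculation can be written in one line if we assume that the Monte Carlo standard error shrinks in proportion to $1/\sqrt{N}$, which is the usual behaviour for an average of independent draws. The function name and the numbers below are hypothetical and are not taken from SIMPLE.

```python
import math

def required_imputations(pilot_n, pilot_se, target_se):
    """Number of imputations needed to reach target_se.

    Assumes the standard error scales as 1 / sqrt(number of imputations).
    """
    return math.ceil(pilot_n * (pilot_se / target_se) ** 2)

# A pilot run of 100 imputations with standard error 0.5; target 0.125.
print(required_imputations(100, 0.5, 0.125))
```

Cutting the standard error to a quarter therefore costs sixteen times as many imputations.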
To illustrate the accuracy of SIMPLE, we analyzed pedigree 76 of the COGA (Collaborative
Study on the Genetics of Alcoholism) data set from Genetic Analysis Workshop
11. We removed three members so GENEHUNTER could analyze it. The pedigree is
shown in Figure 2.1. Note that it has a marriage loop. There are fourteen members in
the (reduced) pedigree with four founders. Eight markers are used from chromosome one:
D1S1613, D1S550, D1S532, D1S1588, D1S1631, D1S1675, D1S534, D1S1595. They
have nine to twelve alleles with an average heterozygosity of 0.75. The markers are spaced
11.2, 8.4, 18.1, 12.5, 11.9, 9.0 and 9.8 cM apart. Two founders (14%) are missing all of
their marker data. In addition, seven other members (50%) are missing data on D1S1631,
two members (14%) are missing data on D1S534 and three members (21%) are missing
data on other markers.
Figure 2.1: Pedigree used for validation study. The individuals marked with a slash have neither marker data nor information on disease phenotypes.
The linkage statistics $S_{pairs}$ and $S_{all}$ were estimated at five locations between each
adjacent pair of markers, using both GENEHUNTER and SIMPLE with 5,000 imputations.
As can be seen from the plots in Figure 2.2, the estimated standardized scores produced by
SIMPLE were quite close to the true scores produced by GENEHUNTER. The scores plus
and minus 3 standard errors are plotted in Figure 2.3.
Figure 2.2: Scores produced by GENEHUNTER and SIMPLE are plotted by the line and circles, respectively. $S_{pairs}$ scores are plotted in the top frame (y-axis: Pairs) and $S_{all}$ scores are plotted in the bottom frame (y-axis: All). The markers are indicated by the extended tick marks (at 0.0, 11.2, 19.6, 37.7, 50.2, 62.1, 71.1 and 80.9 cM) and the locations in cM are indicated on the x-axis of the bottom plot (Location on Chrom 1 (cM)).
Figure 2.3: Scores produced by GENEHUNTER and SIMPLE are plotted by the line and dots (with vertical bars indicating 3 standard errors above and below estimated scores), respectively. $S_{pairs}$ scores are plotted in the top frame (y-axis: Pairs) and $S_{all}$ scores are plotted in the bottom frame (y-axis: All). The markers are indicated by the extended tick marks (at 0.0, 11.2, 19.6, 37.7, 50.2, 62.1, 71.1 and 80.9 cM) and the locations in cM are indicated on the x-axis of the bottom plot (Location on Chrom 1 (cM)).
2.6 To Reweight or Not?
In the methods described above, we simulate the inheritance vectors (step 4) at ev-
ery location of interest (usually the entire chromosome in which the markers reside) and
then estimate the statistic using the simulated inheritance vectors. Alternatively, we could
simulate inheritance vectors at only a few locations of the chromosome and estimate the
linkage statistics at neighboring locations by reweighting, another importance sampling
idea exploited by Irwin et al. (1994). For instance, suppose that inheritance vectors were
simulated at position $t_0$. We can estimate the statistic at a nearby location, say $t_1$, by:

$$\hat{E}_{t_1}\left[S(v, y_D)\right] = \frac{1}{w(+)} \sum_{i=1}^{N} S(v^{(i)}, y_D)\, w_{t_0,t_1}(v^{(i)}, g^{(i)}). \qquad (2.3)$$

where

$$\begin{aligned}
w_{t_0,t_1}(v, g) &= \frac{P_{t_1}(v \mid g)}{P_{t_0}(v \mid g)}\, w(g) && (2.4)\\
&= \frac{P_{t_1}(v \mid g, \mathbf{y}_M)}{P_{t_0}(v \mid g, \mathbf{y}_M)}\, w(g) && (2.5)\\
&= \frac{P_{t_1}(v \mid g, \mathbf{y}_M)}{P_{t_0}(v \mid g, \mathbf{y}_M)}\, \frac{P(g \mid \mathbf{y}_M)}{P^{*}(g \mid \mathbf{y}_M)}\, P(\mathbf{y}_M)\\
&= \frac{P_{t_1}(v, g \mid \mathbf{y}_M)}{P^{*}_{t_0}(v, g \mid \mathbf{y}_M)}\, P(\mathbf{y}_M). && (2.6)
\end{aligned}$$

As pointed out previously, conditioned on the ordered genotypes, the inheritance vectors
are independent of the observed marker data, $\mathbf{y}_M$ (equation (2.5)). For ease of notation we
will drop the subscripts $t_0, t_1$ in the notation for this new weight. The reweighted statistic
in equation (2.3) is a consistent estimator of the linkage statistic at $t_1$. To see why this is a
consistent estimator we note that, taking the expectation under the sampling distribution $P^{*}_{t_0}$:

$$\begin{aligned}
E^{*}_{t_0}\left[S(v, y_D)\, w(v, g)\right] &= \sum_{v} \sum_{g} S(v, y_D)\, w(v, g)\, P^{*}_{t_0}(v, g \mid \mathbf{y}_M)\\
&= \sum_{v} \sum_{g} S(v, y_D)\, P(\mathbf{y}_M)\, \frac{P_{t_1}(v, g \mid \mathbf{y}_M)}{P^{*}_{t_0}(v, g \mid \mathbf{y}_M)}\, P^{*}_{t_0}(v, g \mid \mathbf{y}_M)\\
&= P(\mathbf{y}_M) \sum_{v} S(v, y_D) \sum_{g} P_{t_1}(v, g \mid \mathbf{y}_M)\\
&= P(\mathbf{y}_M) \sum_{v} S(v, y_D)\, P_{t_1}(v \mid \mathbf{y}_M)\\
&= P(\mathbf{y}_M)\, E_{t_1}\left[S(v, y_D) \mid \mathbf{y}_M\right].
\end{aligned}$$

By the fact that the mean of the weights is an unbiased estimator of the probability of
the data (Irwin et al., 1994) and the above results, the importance sampling estimate in
equation (2.3) is consistent.
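The consistency argument can be checked numerically on a toy problem. The sketch below is not taken from SIMPLE: every table in it is a made-up stand-in (a binary "ordered genotype" g, a three-state "inheritance vector" v, and the score S(v) = v), chosen only so that the exact answer can be computed by hand. It samples g from a sampling distribution, draws v at the simulation position, reweights to the target position, and divides by the sum of the original weights.

```python
import random

# Hypothetical toy tables (assumptions for illustration only).
JOINT_G_Y = [0.03, 0.07]                    # P(g, y_M); sums to P(y_M) = 0.10
PROPOSAL_G = [0.5, 0.5]                     # sampling distribution P*(g | y_M)
P_T0 = [[0.6, 0.3, 0.1], [0.2, 0.3, 0.5]]   # P_{t0}(v | g), used for simulation
P_T1 = [[0.5, 0.3, 0.2], [0.1, 0.3, 0.6]]   # P_{t1}(v | g), the target position


def score(v: int) -> float:
    """Toy linkage score S(v)."""
    return float(v)


def reweighted_estimate(n: int, seed: int = 0) -> float:
    """Self-normalized importance sampling estimate of E_{t1}[S(v) | y_M]."""
    rng = random.Random(seed)
    numerator = 0.0
    w_plus = 0.0  # sum of the original weights w(g)
    for _ in range(n):
        g = rng.choices([0, 1], weights=PROPOSAL_G)[0]
        w = JOINT_G_Y[g] / PROPOSAL_G[g]                # w(g)
        v = rng.choices([0, 1, 2], weights=P_T0[g])[0]  # simulate v at t0
        w_new = w * P_T1[g][v] / P_T0[g][v]             # reweight from t0 to t1
        numerator += score(v) * w_new
        w_plus += w
    return numerator / w_plus


def exact_value() -> float:
    """Exact E_{t1}[S(v) | y_M] for the toy tables."""
    p_y = sum(JOINT_G_Y)
    return sum(
        JOINT_G_Y[g] / p_y * sum(P_T1[g][v] * score(v) for v in range(3))
        for g in range(2)
    )


print(exact_value(), reweighted_estimate(100_000))
```

For these tables the exact value is 1.26, and the reweighted estimate should settle near it as the number of samples grows. How quickly it settles depends on how different the two conditional tables are.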
The main issue with importance sampling is not bias but rather variance (Irwin et al.,
1994). To illustrate the potential problems with reweighting we will use a pedigree with 3
generations, 7 siblings in the last generation as shown in Figure 2.4. Two of the sibs are
labeled as affecteds. There are 4 markers spread evenly over 30 cM. The inheritance bits for
the gametes at the 4 markers can be determined in all 7 children in the last generation. (Here
“gamete” is used loosely to refer to the inherited chromosome upon which the markers are
located.) The gametes are listed in Table 2.2. Children 9 through 12 had gametes a and b.
Child 13 had gametes b and d. Child 7 had gametes d and ¢ and child 8 had gametes ¢ and� . There were 10 observed recombinations between the second and third markers and no
observed recombinations elsewhere.
Figure 2.4: Pedigree used to illustrate reweighting. The two affecteds are shaded.
We analyzed this pedigree with $S_{pairs}$ using SIMPLE with a sample size of N = [value illegible]
at 21 interior points, with reweighting and without reweighting. The reweighting was done
by imputing the inheritance vector in the middle of each interval and estimating scores in
the interior of the interval using equation (2.3). The scores on the markers were estimated
by imputing the inheritance vector at the marker and estimating the score. They were also
estimated by reweighting from the scores simulated in the middle of each flanking interval.
So there were 2 to 3 estimates (including the reweighted estimate(s) from the adjacent
interval(s) and the non-reweighted estimate at the marker) for the scores at each marker
depending on whether there were 1 or 2 flanking intervals, respectively. The scores without
reweighting were estimated by simulating the inheritance vector at each point of interest
and estimating the score according to equation (2.2). We estimated the scores 10 times
with each method to capture the variability of the scores. The scores are plotted for both
methods in Figure 2.5. The curve corresponding to the true scores (which was calculated
with GENEHUNTER) is also included. The scores without reweighting are tightly packed
around the true scores. We see that the reweighted estimates do well in the first and last
intervals but do poorly in the middle interval as the scores approach the flanking markers. In
the middle interval the true scores form a diagonal between the scores at the two markers
but the 10 estimated curves grossly diverge from the diagonal as they approach the two
markers.
Table 2.2: Gametes of children in last generation.
Figure 2.5: The left plot (panel title: Reweight) is of the scores estimated with reweighting (10 estimates) and the right plot (panel title: No Reweight) is of the scores estimated without reweighting (10 estimates). The curves without reweighting are very close to the truth. (x-axis: Location (cM), 0 to 30; y-axis: Pairs.)
The performance of the reweighted scores at interior points in the middle interval close
to the flanking markers was much poorer than anywhere in the other two intervals because
the variability was much higher at these points. In all 3 intervals the variability increases as
the location of the reweighted scores approaches a flanking marker. But 10 gametes have
observed recombinants in the middle interval whereas there are no observed recombinants
in any other interval. As a result a recombination is guaranteed between the simulation
position $t_0$ and one of the two adjacent markers in all of the simulations for these
recombinant gametes. On the other hand a recombination is much less probable between the
middle of the first or last intervals and the adjacent markers. Assuming no interference,
the probability of a recombination between $t_0$ and each of the two adjacent markers in an
outside interval was approximately .0025 for each of the recombinant gametes.
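The figure of approximately .0025 can be reproduced with a short calculation. The sketch below rests on two assumptions that the text does not state explicitly: that "no interference" corresponds to Haldane's map function, and that the 4 evenly spread markers are 10 cM apart, so that the midpoint of an interval lies 5 cM from each flanking marker. The function names are hypothetical.

```python
import math


def haldane_theta(d_cm: float) -> float:
    """Recombination fraction for a map distance in cM under Haldane's map function."""
    return 0.5 * (1.0 - math.exp(-2.0 * d_cm / 100.0))


def prob_recombination_on_both_sides(half_interval_cm: float) -> float:
    """P(recombination between the midpoint and each flanking marker,
    given that the two flanking markers show no recombination).

    Without interference the two half-intervals recombine independently, so
    "no recombination between the markers" means either neither half or both
    halves recombined.
    """
    theta = haldane_theta(half_interval_cm)
    both = theta * theta
    neither = (1.0 - theta) ** 2
    return both / (both + neither)


print(round(prob_recombination_on_both_sides(5.0), 4))
```

Under the same assumptions, a gamete that is recombinant between the two flanking markers must have recombined in exactly one of the two half-intervals, which is why a recombination next to the midpoint is guaranteed in the middle interval.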
The variability of the reweighted scores will decrease as the sample size increases. For
the same pedigree data, we took 10 independent estimates of the scores for 1000, 5000,
25000 and 125000 imputations. The curves for the reweighted scores are plotted in Figure
2.6. The variability between the curves decreases as the number of imputations increases,
as expected, and the estimated scores converge to the truth.
Figure 2.6: 10 scores using reweighting are plotted for sample sizes N=1K, 5K, 25K and 125K (one panel per sample size; x-axis: Location (cM), 0 to 30; y-axis: Pairs).
In the sequential imputation method proposed by Irwin et al. (1994) for estimating the
likelihood, reweighting saved considerable computational time because it translated directly
into fewer single-locus peels for the disease gene. In this case, however, there is no such
clear advantage to using reweighting, and hence the practice is not adopted here.
CHAPTER 3
POWER STUDY
To illustrate the potential benefit to multipoint linkage analysis of processing all
members of a large pedigree, we performed a simulation study. We used the $S_{pairs}$
statistic to analyze the full pedigree shown in Figure 3.1 with SIMPLE, and then analyzed the
same pedigree using GENEHUNTER, which needed to discard some members of
the pedigree. The pedigree had 37 members, 11 of whom were founders, and 5 members
had missing marker and disease data. The ascertainment criterion was that at least one sib in
each of the seven sibships in the last generation had to be affected.
We used 6 markers with equally frequent alleles for each marker. The markers were
spaced 15 cM apart. We simulated the marker and disease data under three disease models.
In all three cases the disease data was simulated at a locus in the middle of the marker map
at 37.5 cM. In model I, the penetrances for genotypes aa, Aa and AA were 0, .9 and .95
with a disease allele frequency P(A) = .1. In model II, the penetrances were .05, .4 and .6
with a disease allele frequency .05. In the third model the penetrances were .05, .5 and .7
with a disease allele frequency of .3.
Figure 3.1: Pedigree structure for the power study. The individuals marked with a slash have neither marker nor disease data.
Five hundred pedigrees were simulated under each of the three models. GENEHUNTER had to
drop between 14 (38%) and 20 (54%) members in order to process the pedigrees. To estimate
power for a single pedigree, we calculated the proportion of pedigrees that had a maximum
score exceeding a certain threshold. Four threshold levels were considered: 2.33, 3.09,
3.72 and 4.27, as suggested by Kruglyak et al. (1996). These thresholds correspond to
asymptotic significance levels .01, .001, .0001 and .00001, respectively.
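The correspondence between thresholds and significance levels can be checked against the upper tail of the standard normal distribution. The sketch below assumes that the levels are one-sided tail probabilities of a standard normal score. The names `THRESHOLD_TO_LEVEL` and `upper_tail` are illustrative.

```python
from statistics import NormalDist

# Thresholds and nominal levels as quoted in the text.
THRESHOLD_TO_LEVEL = {2.33: 0.01, 3.09: 0.001, 3.72: 0.0001, 4.27: 0.00001}


def upper_tail(z: float) -> float:
    """One-sided tail probability P(Z > z) for a standard normal Z."""
    return 1.0 - NormalDist().cdf(z)


for z, level in THRESHOLD_TO_LEVEL.items():
    print(f"threshold {z}: tail probability {upper_tail(z):.2e}, nominal level {level}")
```

Each tail probability agrees with its nominal level to within a few percent.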
From the initially simulated pedigrees, we re-sampled, with replacement, 500 data sets
of size $n$, with $n$ ranging from 2 to 50 pedigrees for each of the three models. We estimated
powers by the proportion of data sets with standardized scores that exceeded the threshold
values.
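The re-sampling scheme can be sketched as follows. The text does not say how the per-pedigree standardized scores are combined into a data-set score, so the sketch assumes the equal-weight combination (the sum over pedigrees divided by the square root of the number of pedigrees, taken at each location, followed by the maximum over locations). All function names and the example scores are hypothetical.

```python
import math
import random


def combined_max_score(pedigree_scores):
    """Maximum over locations of the equally weighted combined score.

    pedigree_scores: list of per-pedigree score lists, one score per location.
    Assumption: combined score at a location = sum of scores / sqrt(n).
    """
    n = len(pedigree_scores)
    n_locations = len(pedigree_scores[0])
    return max(
        sum(scores[j] for scores in pedigree_scores) / math.sqrt(n)
        for j in range(n_locations)
    )


def estimate_power(score_pool, data_set_size, threshold, n_data_sets=500, seed=0):
    """Proportion of re-sampled data sets whose combined maximum exceeds threshold."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_data_sets):
        # Draw data_set_size pedigrees from the pool with replacement.
        sample = rng.choices(score_pool, k=data_set_size)
        if combined_max_score(sample) > threshold:
            hits += 1
    return hits / n_data_sets


# Made-up pool of three pedigrees scored at two locations.
pool = [[1.2, 0.4], [2.0, 1.1], [0.3, -0.2]]
print(estimate_power(pool, data_set_size=5, threshold=2.33))
```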
3.1 Results
The results for a single pedigree are summarized in Table 3.1. The power estimates are
all low since the data set only consists of a single pedigree. The power is consistently higher
under SIMPLE versus GENEHUNTER under all three models and all threshold levels.
            Model I          Model II         Model III
Level       SIMPLE   GH      SIMPLE   GH      SIMPLE   GH
.01         44%      40%     38%      26%     21%      19%
.001        26       24      23       12      11       7
.0001       15       10      15       5       5        3
.00001      8        3       10       3       2        1
Table 3.1: Power estimates for a single pedigree. Power was defined as the percentage of pedigrees that exceeded certain thresholds. The thresholds used for asymptotic significance levels of .01, .001, .0001 and .00001 were 2.33, 3.09, 3.72 and 4.27, respectively.
The power estimates under all three models for the data sets with different pedigree
sizes are plotted in Figure 3.2 for thresholds 2.33 and 3.09 and Figure 3.3 for thresholds
3.72 and 4.27. We added a spline smooth curve to each of the plots. The power esti-
mates increase as the sample sizes increase, as expected. The power under SIMPLE is also
consistently above the power under GENEHUNTER.
For the first two models, we calculated the minimal sample sizes needed, based on
the spline smooth curve, to reach 50%, 65% and 80% power for each of the threshold
levels: 2.33, 3.09, 3.72 and 4.27. The results are summarized in Table 3.2. Since the
power was much weaker for the third model, we reported the corresponding results for
powers 40%, 50% and 65% at thresholds 2.33 and 3.09 for this latter model. The results
for this latter model are summarized in Table 3.3. For model I, SIMPLE requires slightly
fewer pedigrees to achieve the same power as GENEHUNTER. For model II,
SIMPLE requires approximately half the number of pedigrees as GENEHUNTER. In the
third model, GENEHUNTER needs approximately 50% more pedigrees than SIMPLE to
achieve the same power. In all three models, the reduction in the number of pedigrees
necessary to achieve the given powers using SIMPLE versus GENEHUNTER grows as the
desired power increases and as the threshold becomes more stringent.
Figure 3.2: Power curves for SIMPLE (solid line and ‘O’) and GENEHUNTER (dashed line and ‘+’) based on thresholds of 2.33 (nominal level .01) and 3.09 (nominal level .001) for all three genetic models (one panel per model; x-axis: No. Peds; y-axis: Power).
Figure 3.3: Power curves for SIMPLE (solid line and ‘O’) and GENEHUNTER (dashed line and ‘+’) based on thresholds of 3.72 (nominal level .0001) and 4.27 (nominal level .00001) for all three genetic models (one panel per model; x-axis: No. Peds; y-axis: Power).
Table 3.2: Sample size estimates for models I & II. For nominal significance levels of .01, .001, .0001 and .00001 we report the minimal sample size necessary (based on a spline fit) to achieve 50%, 65% and 80% power. The thresholds used for asymptotic significance levels of .01, .001, .0001 and .00001 were 2.33, 3.09, 3.72 and 4.27, respectively.
Table 3.3: Sample size estimates for model III. For nominal significance levels of .01 and .001 we report the minimal sample size necessary (based on a spline fit) to achieve 40%, 50% and 65% power. The thresholds used for asymptotic significance levels of .01 and .001 were 2.33 and 3.09, respectively. The cases marked by ‘-’ indicate that the required sample size is greater than 50.
3.2 Type I error
We studied the type I error rates for a data set of 15 pedigrees, which was chosen to
reflect a realistic situation. To estimate type I error we simulated the genotypes for 10,000
pedigrees using the same pedigree structure and missing data pattern used in the previous
power study (Figure 3.1), fixing the last generation as all affected. From these initially
simulated pedigrees we re-sampled 2,000 data sets of size 15 pedigrees with replacement.
We then calculated the proportion of data sets with standardized scores exceeding each
of four thresholds to estimate the type I error rates. The results for both SIMPLE and
GENEHUNTER are shown in Table 3.4. GENEHUNTER dropped 17 (46%) members in
each of the pedigrees simulated. The estimated type I error rates were close to the nominal levels.
Table 3.4: For nominal levels of .01, .001, .0001 and .00001 we report the estimated type I error rates for a sample of 15 pedigrees. The thresholds used for asymptotic significance levels of .01, .001, .0001 and .00001 were 2.33, 3.09, 3.72 and 4.27, respectively.
3.3 Discussion
One advantage of this method over the HMM is that it can process larger pedigrees,
which can lead to an increase in power. We demonstrated the potential gain in power in our
simulation study using $S_{pairs}$ and three genetic models, although the magnitude of the power
gains varied from model to model. Substantial power gains are observed under models II
and III, while the gains under model I are minimal. The different levels of power gains in
the three models are due to the differences in the amount of IBD information carried by the
affected individuals dropped. We note that using MCMC methods would yield comparable
results as these methods can process the same data as sequential imputation. However, we
would expect sequential imputation to be more efficient than MCMC for pedigrees that are
not too complex, such as the pedigrees studied.
We would expect the gains in power to be even greater with $S_{all}$ due to the nature of
the statistic. Unlike $S_{pairs}$, $S_{all}$ gives increasingly higher scores as more affected
pedigree members share an allele IBD. Since GENEHUNTER often discards affected
members, we would expect this to adversely affect the power to a greater degree with $S_{all}$
than with $S_{pairs}$. One drawback of using $S_{all}$, however, is the computational intensity of its
current implementation. Markianos et al. (2001b) have addressed this issue and proposed
a method to reduce the computational burden.
CHAPTER 4
HAPLOTYPING: AN APPLICATION
Haplotype reconstruction of many markers in pedigrees plays an important role in localizing
disease-causing genes. Haplotype reconstruction is the attempt to reconstruct the
haplotypes in pedigrees given genotype information, $\mathbf{y}_M$. Often the genotype information
does not determine haplotypes exactly due to missing data or low heterozygosity of
markers. Various methods have been proposed to reconstruct haplotypes. In a recently
published article (Qian and Beckmann, 2002), a six-rule algorithm for the reconstruc-
tion of minimum-recombinant haplotype (MRH) configurations in pedigrees was proposed.
The authors compared their rule based method to Tapadar’s evolution-based MRH method
(Tapadar et al., 2000). Neither method, however, explores the entire haplotype space or
provides the probabilities of haplotype configurations. The rule-based MRH method
is further limited to pedigrees with “informative” or “partially informative” members. A
pedigree that is missing genotype information for two mating founders, for example, could
not be analyzed with this method. This places severe and often unrealistic restrictions on
the application of this method.
An alternative method is to derive a set of highly probable haplotype configurations
given the marker data. Determining these posterior probabilities by exact methods for
large pedigrees with many markers can be computationally infeasible, however. Monte
Carlo methods offer a viable solution to this computational challenge. Markov chain Monte
Carlo methods have been implemented for haplotype reconstruction (Lin and Speed, 1997;
Sobel et al., 1996), but they are subject to slow convergence due to high correlation between
samples. We propose to use sequential imputation to determine the haplotype configura-
tions with the highest posterior probabilities. Haplotype reconstruction with sequential
imputation is easy to implement and computationally efficient.
Furthermore, sequential imputation can easily incorporate crossover interference in de-
termining the haplotype configuration probabilities. Interference plays an influential role in
the formation of gametes and modeling it is essential to accurately determine haplotypes.
Yet the aforementioned MRH methods cannot incorporate it in their derivations, and exact
probability methods often ignore it due to computational difficulties when analyzing many
loci, especially in the presence of missing data. We will describe the methodology for
estimating the haplotype configuration probabilities via sequential imputation and then apply
the methodology to a real data set using different models of interference.
4.1 Methodology
The ordered genotypes, $g$, determine a haplotype configuration, H. To estimate the
posterior probability of a haplotype configuration, H, we first sample the ordered genotypes,
$g = (g_1, \ldots, g_M)$, sequentially, conditioned on the marker data, and store the appropriate
weight, $w(g)$. We do this N times and then take a weighted average of the sample
realizations that yield ordered genotypes which correspond to a haplotype configuration to
estimate its probability.
To get a sample of size N of the ordered genotypes and weights we follow steps 1
through 3 in chapter 2 for N imputations to get ordered genotypes $g^{(1)}, \ldots, g^{(N)}$ with
corresponding weights $w(g^{(1)}), \ldots, w(g^{(N)})$. We form the estimator of $P(H \mid \mathbf{y}_M)$, the posterior
probability of haplotype configuration H:

$$\hat{P}(H \mid \mathbf{y}_M) = \sum_{i=1}^{N} I\big(g^{(i)} \to H\big)\, \frac{w(g^{(i)})}{w(+)},$$

where $g^{(i)} \to H$ means that the ordered genotypes $g^{(i)}$ determine the configuration H,
$I(\cdot)$ is the indicator function, i.e. $I(\cdot) = 1$ if $\cdot$ is true, otherwise $I(\cdot) = 0$,
and $w(+) = \sum_{i=1}^{N} w(g^{(i)})$ is the sum of the weights.
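The estimator amounts to adding up the weights of the imputations that yield each configuration and dividing by the total weight. A minimal sketch, assuming each imputation has already been reduced to a pair (configuration label, weight): the function name and the example labels and weights are hypothetical.

```python
from collections import defaultdict


def haplotype_posteriors(samples):
    """Weighted-average estimate of the posterior probability of each configuration.

    samples: iterable of (configuration_label, weight) pairs, one per imputation.
    Returns a dict mapping each observed configuration to its estimated probability.
    """
    totals = defaultdict(float)
    w_plus = 0.0  # sum of all weights
    for config, weight in samples:
        totals[config] += weight
        w_plus += weight
    if w_plus <= 0.0:
        raise ValueError("the weights must have a positive sum")
    return {config: total / w_plus for config, total in totals.items()}


# Made-up imputations: configuration label and importance weight.
print(haplotype_posteriors([("C", 2.0), ("D", 1.0), ("C", 1.0)]))
```

Configurations that never appear among the imputations get an estimated probability of zero, so rare configurations can only be resolved with a large enough N.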
Figure 4.1: Four haplotype configurations, A-D, from pedigree in episodic ataxia study. A single arrow indicates a recombination between the two adjacent loci. A double arrow indicates two distinct possible locations for the recombination. This image was obtained on November 3, 2002 from http://www.journals.uchicago.edu/AJHG/journal/issues/v70n6/013591/fg2.h.gif
have equal allele frequencies since this assumption will have little impact on the estimated
probabilities of the haplotype configurations because most of the missing genotypes of the
founders can be unambiguously determined by the genotype information of their children
and the ambiguous genotypes are most likely as described above, regardless of the allele
frequencies.
We analyzed the pedigree in Figure 4.1 under 7 different models of interference and
compared the results to those given by authors who have studied the same pedigree. We
used the $\chi^2$ model of interference with intensity parameters $m$ = 0 (no interference), 1, 2, 3, 4,
5 and 6. We sampled N=100, 500, 1,000 (1K) and 100,000 (100K) ordered genotypes under
each interference model. The haplotype configurations being reported all have a posterior
probability greater than 1% under at least one of the interference models for N=100,000.
Drawing 100,000 imputations took 4 minutes on a Linux machine with an AMD 1800+MP
processor and 3 GB of RAM. The results for N=100,000 imputations are summarized in
Table 4.1. The results for all other sample sizes are summarized in Table 4.2. The empty cells
correspond to configurations with probabilities less than 1%. The configurations (‘cfg’)
will now be described.
The 4 most probable haplotype configurations under no interference ($m$ = 0) were C, D,
A and B (see Figure 4.1) with posterior probabilities .28, .27, .07 and .07, respectively. All
other haplotype configurations had posterior probabilities of less than 1%. These matched
the configurations derived by Qian and Beckmann (2002). Note that they derived compa-
rable relative frequencies (.4, .4, .1 and .1 for configurations C, D, A and B, respectively)
Configurations A, B, C, and D all had at least 5 recombinations: 3 single recombinations
and 1 double recombination. The other 8 configurations, in contrast, had at least 6
recombinations, all of them single recombinations.
This difference in recombination patterns was solely due to the haplotype of member
1001. The haplotype d for 1001 observed in configurations A, B, C and D yielded one
double recombination in his child 100 and no recombinations in the other two children
(102 and 1) in all 4 configurations. In contrast, the haplotypes 1 and 2 observed in the other
8 configurations (under interference) yielded a single recombination in all three children
of 1001 but no double recombinations. The recombination patterns observed in the other
members were identical to each configuration’s conjugate.
Positive interference makes multiple recombinations less probable than under no interference.
The stronger the positive interference, the less probable multiple recombinations are.
When we did not account for interference ($m$ = 0) only configurations A, B, C and D
had posterior probabilities greater than 1%. As the positive interference increased, i.e. as
$m$ increased from 0 to 6, the configurations with only single recombinations became more
probable, even though there were more recombinations overall. In fact, for sufficiently large
$m$ [threshold illegible] the posterior probabilities for configurations A, B, C and D were all
less than 1% and the other 8 configurations all had posterior probabilities greater than 1%.
In light of the fact that interference is known to occur in humans, it is important to take it
into account. Ignoring it can lead to drastically different results as observed in this exercise.
4.3 Discussion
Though the rule-based MRH method finds all configurations with the minimum number
of observed recombinations, it does not distinguish between single versus multiple recom-
binations on the same chromosome nor does it account for the varying distances between
markers when minimizing the number of recombinations. Furthermore, it does not explore
the entire haplotype space nor provide posterior probabilities for the haplotype configura-
tions that it finds.
Probability methods have advantages over this rule-based MRH method because they
are not limited by missing data and they are based on the probability of the haplotype
configurations given the pedigree data. They also provide more flexibility to an investigator
by giving a set of haplotype configurations with probabilities attached to them, from which
the investigator can choose based on his/her own expertise (Lin and Speed, 1997).
Unlike that of exact probability methods, the computational cost of sequential imputation
scales linearly in both the number of pedigree members and the number of markers.
MCMC methods will yield the same results as sequential imputation after large enough
samples are drawn. But we expect sequential imputation to be more efficient than most
MCMC methods for simple pedigrees such as the one studied in Figure 4.1.
None of the methods used to reconstruct the haplotypes in Figure 4.1 accommodated
crossover interference except for sequential imputation. It is impossible to include it in the
rule-based MRH method and computationally infeasible to adequately include it in exact
probability methods for many markers. None of the MCMC methods that were mentioned
in this article accounted for it (though it could be done). Yet interference plays an influential
role in determining haplotype configurations. Indeed, the $\chi^2$ model with intensity
parameter $m$ = 4 has been found to fit well with human pedigree data (Lin and Speed, 1996)
and yet under this model none of the haplotype configurations that were derived by the
other methods had a posterior probability greater than 1%!
CHAPTER 5
INTERFERENCE STUDY
There is strong evidence that positive interference occurs during meiosis in humans
(Kwiatkowski et al., 1993). Accounting for interference is important when carrying out
multipoint analysis. Simulation studies have shown increased efficiencies in exclusion
mapping and gene ordering when accounting for interference using the $\chi^2$ model in small
human pedigrees (Lin and Speed, 1999).
We did a simulation study to examine the effect on power, precision and accuracy in
gene mapping when interference is ignored in a large pedigree with missing data. Interference
was modeled by the $\chi^2$ model. The factors that we considered were the disease
model (the penetrances and allele frequencies of the disease gene), the location of the dis-
ease gene, the number of markers, the number of alleles of the markers and the genetic
distances between the markers. We assumed that the order of the markers was known and
there were no genotyping errors.
We used a single large pedigree in all of our simulations. Since the pedigree was large
and there were multiple markers exact methods could not be used to analyze this data with
interference. We used sequential imputation to analyze the data instead.
We considered two genetic models for the disease data: complex and dominant. The
complex model had penetrances of .05, .5 and .75 for the homozygous normal, heterozy-
gous and homozygous mutant genotypes with a mutant allele frequency of 30%. The dom-
inant model had penetrances of 0, .9 and .95 for the homozygous normal, heterozygous and
homozygous mutant genotypes with a mutant allele frequency of 10%.
For both disease models we simulated data for 8 markers and varied the number of
alleles per marker (4 or 8), the distance between the equally spaced markers (1, 5
or 10 centimorgans) and the location of the disease gene (in the middle of the set of
markers or outside of the set of markers). (When the disease gene was simulated outside
of the set of markers it was simulated at a distance equal to the marker interval width for
that model. So if the width was 5 cM it was simulated 5 cM outside of the set of markers.)
In addition, we repeated the same 12 configurations for the dominant disease model but
used 4 markers instead of 8. So there were 36 configurations in total. Under all of these
configurations the meioses were simulated under the $\chi^2$ model for the chiasma process with
intensity parameter $m$ = [value illegible].
The pedigree used in the study contained 35 members over 3 generations. The pedigree
structure is shown in Figure 5.1. There were 6 individuals designated as missing which is
indicated by the slash through their individual symbol. Neither their affectation status nor
marker data were used in the analyses. A pedigree was ascertained if at least one member
was affected in each of the sibships of the last generation.
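The 36 simulation configurations described earlier in this section can be enumerated explicitly, which makes the count easy to verify. The sketch below only restates the design factors given in the text. The variable names, dictionary keys and the labels "middle" and "outside" are illustrative choices.

```python
from itertools import product

ALLELES_PER_MARKER = (4, 8)
MARKER_SPACINGS_CM = (1, 5, 10)
DISEASE_LOCATIONS = ("middle", "outside")
# (disease model, number of markers): both models with 8 markers,
# plus the dominant model repeated with 4 markers.
MODEL_AND_MARKERS = (("complex", 8), ("dominant", 8), ("dominant", 4))


def design_grid():
    """All simulation configurations as a list of dictionaries."""
    return [
        {
            "disease_model": model,
            "n_markers": n_markers,
            "n_alleles": n_alleles,
            "spacing_cm": spacing,
            "disease_location": location,
        }
        for (model, n_markers), n_alleles, spacing, location in product(
            MODEL_AND_MARKERS, ALLELES_PER_MARKER, MARKER_SPACINGS_CM, DISEASE_LOCATIONS
        )
    ]


grid = design_grid()
print(len(grid))  # 2 allele counts x 3 spacings x 2 locations x 3 model/marker arms
```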
Figure 5.1: The pedigree used in the interference study. The slash indicates that the marker and disease data are missing for that individual.
5.2 Methodology
As in previous chapters we partition the data into the $M + 1$ loci,
$\mathbf{y} = \{y_1, \ldots, y_M, y_D\}$, where $\mathbf{y}_M = \{y_1, \ldots, y_M\}$ is the data on the $M$ markers
and $y_D$ is the data on the affectation
status for the pedigree members. To estimate the LOD scores we followed the algorithm
described in Irwin et al. (1994). We employed the reweighting technique suggested by
Irwin et al. (1994) to reduce the computational cost in estimating the likelihood of the
disease at different locations. Since the computational cost in this application of sequential
imputation involved peeling the disease locus, the savings were substantial. We used the
$\chi^2$ model of the chiasma process with intensity parameter $m$ = [value illegible] to account for
interference.
For each pedigree we sampled the ordered genotypes N = [value illegible] times. We estimated
the likelihood of the disease in the middle of each marker interval. We used reweighting
to estimate the likelihood at 6 interior points (equally spaced) and each of the flanking
markers. We then calculated the estimate of the LOD scores at each marker and the 6
interior points within each marker interval.
We did a preliminary study to determine the number of pedigrees necessary to achieve
sufficient power. The disease model that we used in the power analysis was a fully pene-
trant dominant disease gene with a mutant allele frequency of 10%. The disease gene was
simulated in the middle of a marker map with 8 markers. The 8 markers each had 8 equally
frequent alleles. The meioses were simulated under the $\chi^2$ model with interference intensity
parameter $m$ = [value illegible]. We simulated 100 pedigrees under this model for the disease and
marker data using the pedigree (with the same missing data pattern as described previously)
in Figure 5.1.
Over 75% of the pedigrees yielded LOD scores over 3 under the assumption of no inter-
ference. We resampled 5 pedigrees from the simulated pedigrees 100 times and combined
their LOD scores (thus estimating the power for a data set of 5 pedigrees). 100% of these
data sets had LOD scores over 3. We decided that a data set of 5 pedigrees should be
adequate to get sufficient power under the more realistic genetic models that we already
described.
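The resampling step in the preliminary power study can be sketched as follows. This is a minimal illustration, not the code used in the thesis: it assumes each pedigree's LOD scores are stored as a list over a common grid of positions, that pedigrees are drawn with replacement, and that the LOD scores of independent pedigrees are combined by summing them position by position. The function name `bootstrap_power` and its parameters are hypothetical.

```python
import random


def bootstrap_power(lod_curves, set_size=5, n_sets=100, threshold=3.0, seed=0):
    """Estimate power as the fraction of resampled data sets whose
    combined maximum LOD score exceeds `threshold`.

    lod_curves: list of per-pedigree LOD curves, each a list of LOD
                scores evaluated on the same grid of map positions.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sets):
        # Draw `set_size` pedigrees with replacement (an assumption).
        sample = [rng.choice(lod_curves) for _ in range(set_size)]
        # LOD scores of independent pedigrees add at each position.
        combined = [sum(values) for values in zip(*sample)]
        if max(combined) > threshold:
            hits += 1
    return hits / n_sets
```

With 100 simulated pedigrees, `bootstrap_power(curves)` would mirror the 100 resampled data sets of 5 pedigrees described above.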
When the disease gene was simulated in the middle of the marker map the disease re-
gion was defined as the marker interval containing the disease gene, including both flanking
markers. When the disease gene was simulated outside of the marker map the disease re-
gion was defined as half of the adjacent interval to where the disease gene was simulated,
including the adjacent marker. To compare power we calculated the percentage of data sets
that had maximum LOD scores exceeding 3 in the disease region under both interference
models for each data configuration. These percentages were our power estimates. If the
LOD scores under interference were more powerful than under no interference we would
expect these estimates to reflect that difference.
To compare precision we looked at the behavior of the LOD scores outside of the dis-
ease region when the maximum LOD scores were found in the disease region. When the
LOD score is calculated outside of the disease region we would want it to be lower than
inside the disease region. If incorporating interference makes the analysis more precise,
we would expect the corresponding LOD scores to be lower under interference than under no
interference outside of the disease region. To quantify this we calculated the mean dif-
ference in LOD scores (no interference - interference) at all interior points outside of the
disease region. (We excluded estimates at the markers since these were at times $-\infty$.)
Within each configuration we took the mean of these mean differences. For simplicity we
will refer to this statistic as the mean LOD difference. We would expect the mean LOD
differences to be positive and large outside the disease region if incorporating interference
improves precision. We only considered data sets where the maximum LOD score under
both interference models occurred in the disease region. For relative comparison we also
calculated the mean LOD difference inside the disease region for each data configuration.
To evaluate accuracy we examined the location of the maximum LOD scores. If the
analysis under interference is more accurate, the maximum LOD scores should occur more
often in the disease region under interference than no interference. To assess this we ex-
amined the joint distribution of the location of the maximum LOD scores for each data
configuration relative to the disease region. For each data configuration, we calculated the
percentage of data sets where the maximum LOD scores under both interference models
occurred in the disease region, where only one occurred in the disease region and where
neither occurred in the disease region.
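The precision and accuracy measures defined in this section can be sketched in a few lines. This is an illustrative sketch under stated assumptions: LOD curves are lists over a common grid of positions, `in_region` is a hypothetical list of booleans flagging positions inside the disease region, and the label 'WW' for the case where neither maximum falls in the disease region is a name introduced here only for illustration. Non-finite scores (such as $-\infty$ at markers) are skipped when averaging.

```python
import math


def classify_max_location(lod_int, lod_noint, in_region):
    """Return a two-letter code: first letter for the analysis under
    interference, second for no interference ('R' = maximum LOD falls
    in the disease region, 'W' = it does not)."""
    def hits_region(lod):
        best = max(range(len(lod)), key=lambda k: lod[k])
        return in_region[best]
    first = "R" if hits_region(lod_int) else "W"
    second = "R" if hits_region(lod_noint) else "W"
    return first + second


def mean_lod_difference(lod_int, lod_noint, in_region, inside=False):
    """Mean of (no interference - interference) over positions inside
    or outside the disease region, skipping non-finite scores."""
    diffs = [
        n - i
        for i, n, r in zip(lod_int, lod_noint, in_region)
        if r == inside and math.isfinite(i) and math.isfinite(n)
    ]
    return sum(diffs) / len(diffs)
```

A positive value of `mean_lod_difference(..., inside=False)` corresponds to LOD scores dropping off faster under interference outside the disease region.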
5.3 Results
For each of the 36 data configurations described in the previous section we generated
100 pedigrees. We resampled 100 data sets of 5 pedigrees within each configuration and
estimated the LOD scores under interference (using the correct $\chi^2$ model) and also under
no interference at the markers and 6 interior points. All other parameters were consistent
with the true data/disease model for each analysis.
5.3.1 Power
The power estimates were comparable under both interference models. We see from
Table 5.1 that the power estimates did not differ from each other by more than 5 percentage
points for any data configuration. The disease model and data configuration of course
affected the power. As expected the power was much higher under the dominant model
than under the complex model and the power tended to increase as the marker intervals became finer.
But the difference in power between the interference models was never substantial.
5.3.2 Precision/Accuracy
While the powers across the entire marker map were quite similar, there were substan-
tial differences in terms of precision and accuracy under the two interference models.
Table 5.1: Power estimates. The percentage of data sets that had a maximum LOD score greater than 3 is reported. Loc=location of disease gene (‘mid’=middle of marker map, ‘out’=outside of marker map), n=number of alleles, W=width of interval between markers, m=number of markers.
When the maximum LOD score under both interference models occurred within the
disease region the scores were often very close within that region, but outside of the region
the scores under interference tended to drop off faster. The LOD scores for a typical data
set where this occurred are shown in Figure 5.2. This data set came from a simulation done
with the dominant disease simulated in the middle of 8 markers with 8 alleles each and a 10
cM width between markers. We can see from Figure 5.2 that the scores were very close in
the middle interval where the disease gene is located, but outside of this interval the scores
under interference were relatively much lower than the scores under no interference.
[Plot: LOD score (vertical axis) versus Location in cM (horizontal axis).]
Figure 5.2: An example of LOD scores from data with interference. The solid and dotted lines correspond to LOD scores estimated with and without interference, respectively.
As a result, the mean LOD differences were much smaller in magnitude inside the disease
region than outside the disease region for many of the data configurations. The mean
LOD difference was as high as 1.32 outside the disease region compared to .03 inside the
disease region. Roughly half of the non-zero mean LOD differences were negative inside
the disease region, i.e. the mean of those mean LOD scores under interference was higher
than the mean of the mean LOD scores under no interference. We reported the mean LOD
differences in Table 5.2 for both inside the disease region and outside the disease region
for each data configuration. We see that the mean LOD differences were all non-negative
outside of the disease region. And this difference was consistently greater in magnitude than
the mean LOD difference inside the disease region. Thus, by this measure of precision,
incorporating interference in linkage analysis tended to lead to more precise estimates of
the disease location than not incorporating interference.
The apparent increase in precision depended on a number of factors. There was a
negligible effect on precision for the complex model, whereas in the dominant model with
8 markers the increase in precision was at times substantial depending on the number of
alleles and number of markers. When the disease was simulated in the middle for the
dominant model the increase in accuracy was consistently higher for 8 markers versus 4
markers. The precision increased in all but one case for 8 alleles relative to 4 alleles. (For
width 1 cM, the mean LOD difference was .1 for 4 alleles versus .09 for 8 alleles.) As the
width of the marker interval increased the precision also tended to increase.
When the disease was simulated outside of the marker map the effects were less clear.
But in general the precision increased as the marker interval increased, holding all other
Table 5.2: Difference of LOD scores (no interference - interference) inside the disease region (‘Inside’) and outside the disease region (‘Outside’). These differences are reported for data sets in which the maximum LOD scores under both interference models occurred in the disease region (‘RR’). Loc=location of disease gene (‘mid’=middle of marker map, ‘out’=outside of marker map), n=number of alleles, W=width of interval between markers, m=number of markers.
The accuracy also increased when accounting for the presence of interference. The
maximum LOD scores tended to occur more often in the disease region under interference
than no interference. The difference in the rates ranged from 4 percentage points below to
26 percentage points above when comparing interference to no interference.
The joint distribution of the locations of the maximum LOD scores for interference and
no interference is summarized in Table 5.3. ‘RR’ is when the maximum LOD score evalu-
ated under interference and no interference both occur in the disease region. ‘RW’ is when
the maximum LOD score occurs in the disease region under interference, but not under no
interference. ‘WR’ is when the maximum LOD score occurs in the disease region under
no interference, but not under interference. And the remaining category (when neither of
the maximum LOD scores occur in the disease region) can be inferred from the other 3
categories.
As with precision, the improvements in accuracy are most apparent for the dominant
model. In the dominant model, the largest factor affecting the increase in accuracy was the
width of the marker intervals when the disease was simulated outside of the marker map.
When the disease was simulated in the middle the improvement in accuracy was virtually
negligible.
5.4 Discussion
Goldstein et al. (1995) demonstrated substantial gains in efficiency were possible in
exclusion mapping and gene ordering for completely informative data with the $\chi^2$ model
of interference. Lin and Speed (1999) showed that large gains (though not as substantial)
Table 5.3: Maximum LOD scores by locations. ‘RR’ is when the maximum LOD score evaluated under interference and no interference both occur in the disease region. ‘RW’ is when the maximum LOD score occurs in the disease region under interference, but not under no interference. ‘WR’ is when the maximum LOD score occurs in the disease region under no interference, but not under interference. Loc=location of disease gene (‘mid’=middle of marker map, ‘out’=outside of marker map), n=number of alleles, W=width of interval between markers, m=number of markers.
could also be achieved with more realistic small human pedigree data (sizes 7 and 10) for
the same problems.
We addressed the effect of accounting for interference on power, precision and accuracy
in linkage analysis using a large pedigree (35 members) with missing data. We found
that although power was not affected, the precision and accuracy were improved when
accounting for interference. While the LOD scores were very similar inside the disease
region (when the maximum LOD scores both occurred in the disease region under both
interference models) they were often relatively much lower outside of the disease region
under interference.
Furthermore, the maximum LOD scores tended to occur more often in the disease re-
gion under interference than no interference. Both measures of precision and accuracy
under the complex model showed very little difference whereas the dominant model often
showed substantial differences. The improvement increased with the number of alleles and
the number of markers. This can be explained by the increased informativeness of the data.
As the number of alleles and markers increased the data became more informative and the $\chi^2$ model fit the data even better and thus became more precise and accurate. Likewise,
with the dominant model the data was more informative for linkage than with the complex
model and hence the $\chi^2$ model was more precise and accurate. Similarly, Lin and Speed
(1999) found increases in relative efficiency for the number of alleles but not for the num-
ber of markers. This may be due to the fact that they considered pedigrees with no missing
data (other than phase information). With missing data, as in this simulation study, the
number of markers has more of an effect on the informativeness of the data as evidenced
by this simulation study.
As the data became more informative for linkage the power also increased, as expected.
The $\chi^2$ model did not show any increased power relative to the no interference model, however.
This apparent paradox can be explained by the nature of linkage analysis and the $\chi^2$ model. When the data was informative for linkage there was more evidence of recombination
between the disease gene and points outside the disease region than inside the disease
region. This led to more multiple recombinations outside of the disease region than inside.
For the distances analyzed, multiple recombinations are much less probable under the $\chi^2$ model with intensity parameter $m$ than under no interference. When there are fewer recombinations the two
probabilities are more similar. Inside the disease region, when there was no evidence of
recombination, the probabilities under the $\chi^2$ model were similar to the probabilities
under no interference and hence the LOD scores were similar. Whereas outside of the
disease region, when there was strong evidence for recombination, the probabilities were
distinct, with the probabilities under the $\chi^2$ model being lower. Hence while power was
similar, the $\chi^2$ model proved to be more precise and accurate.
Another apparent paradox was that the relative accuracy increased as the marker in-
terval widths increased. The informativeness of the data actually decreases as the interval
width increases. Hence the power decreases and one would likewise expect from arguments
similar to above that the precision and accuracy should also decrease. But the precision and
accuracy of the $\chi^2$ model relative to the no interference model actually increased when the
widths increased from 1 to 10 cM. At 1 cM there was virtually no difference in accuracy.
This is due to the fact that the $\chi^2$ model yields very similar probabilities to the no
interference model at such small distances (Lin and Speed, 1999). Whereas at 10 cM the
probabilities are more distinct and the resulting difference between the two models can be
seen. Increases in efficiency were similarly found by Lin and Speed (1999) when going
from 5 to 10 cM.
CHAPTER 6
DISCUSSION
Exact methods for linkage analysis are limited computationally in either the size of the
pedigree or the number of loci they can handle. As a result, investigators often have to
reduce the size of the data they are studying. This is usually done by a combination of
reducing the number of markers, subsetting a pedigree or splitting a pedigree into two or
more separate pedigrees. This can lead to a loss of power and even bias.
Monte Carlo methods are an important alternative to exact methods. They can often
handle both large pedigrees and a large number of loci. Markov chain Monte Carlo meth-
ods have been successfully implemented in large and complex pedigrees with many loci.
But due to the nature of pedigree data the MCMC samples are highly correlated and con-
vergence can be slow.
For simple pedigrees or pedigrees with at most 1 or 2 loops, sequential imputation
is a viable alternative to MCMC methods. Since it draws independent samples from its
sampling distribution sequential imputation is expected to be more efficient than MCMC
methods for many problems. Of course if the pedigree is too complex such that a single
locus peel is impossible then sequential imputation is not even viable and MCMC methods
are the appropriate choice.
Sequential imputation has been implemented to estimate LOD scores for simple pedi-
grees. We extended sequential imputation to handle complex pedigrees and showed how
it can be used to estimate any IBD sharing statistic of the form given in equation (1.2).
We compared the power of analyzing entire pedigrees with sequential imputation versus
analyzing reduced pedigrees with GENEHUNTER, a popular software package which cal-
culates linkage statistics exactly. We showed that analyzing the entire pedigree can lead
to substantial increases in power versus reducing them, which GENEHUNTER had to do
since the pedigrees were too large to process with its exact method.
We also incorporated the $\chi^2$ model of the chiasma process to carry out linkage analysis.
This model can lead to varying degrees of interference, from no interference to arbitrarily
strong positive interference. Exact methods which utilize HMMs, such as GENEHUNTER,
rely on the assumption of no interference in their algorithm. We showed the potential
benefits of accounting for interference in a haplotype study and linkage study.
In the haplotype study we showed that accounting for interference can lead to dramat-
ically different results. In the linkage study, we showed that the estimates of the gene
locations were more precise and accurate when accounting for interference.
In the algorithm described in this thesis, we decomposed the data into the information
that we have at the $L$ loci and sequentially imputed the ordered genotypes locus by locus.
We note that other decompositions are possible. For instance, one could decompose the
data into sets of loci. This would involve a multilocus peel per iteration, which obviously
increases the computational cost. The advantage is that it should decrease the Monte Carlo
variability and hence require fewer iterations to reach the same accuracy. Furthermore, the
order of the sequential imputation does not have to be the physical order of the loci. In fact,
the simulation variability should decrease by processing the more informative loci first.
SIMPLE, by default, uses the number of alleles as a measure of informativeness and sorts
the loci accordingly. The user may override this default and provide his/her own process
order.
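The default ordering rule can be sketched as a single sort. This is a minimal illustration rather than SIMPLE's C implementation: markers are processed in decreasing order of their number of alleles, and ties go to the lowest-numbered marker, as stated in the `process order` entry of the SIMPLE instructions. The function name and the 1-based marker numbering are assumptions made for the example.

```python
def default_process_order(n_alleles):
    """Return 1-based marker numbers sorted by decreasing number of
    alleles; ties are broken in favour of the lower marker number.

    n_alleles[k] is the number of alleles at marker k + 1.
    """
    markers = range(1, len(n_alleles) + 1)
    return sorted(markers, key=lambda k: (-n_alleles[k - 1], k))


# Markers 2 and 3 share the largest allele count, so marker 2 goes first.
print(default_process_order([4, 8, 8, 2]))  # prints [2, 3, 1, 4]
```

A disease locus, if present, would be appended after the markers, since it is always processed last.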
Currently SIMPLE can estimate LOD scores, two popular nonparametric IBD sharing
statistics and produce all pairwise IBD sharing estimates with their weights. These IBD
sharing estimates and weights can be used to estimate other statistics of interest. We have
shown the flexibility and power of sequential imputation in linkage analysis. There are
still many other applications and improvements that can be made to the implementation of
sequential imputation in SIMPLE.
6.1 Efficiency
SIMPLE could be made more computationally efficient with a genotype elimination
algorithm. Without any genotype elimination peeling a single person in a simple pedigree
involves $G$, $G^2$ or $G^3$ computations, depending on the type of peel involved, where $G$ is
the number of possible ordered genotypes. Even for peeling just one locus, as in SIMPLE,
this number can get large. For a single locus with 12 alleles $G = 144$ and $G^3 = 2,985,984$! For a complex
pedigree the computational cost can be much greater, even if the pedigree is not
too complex. For example, the pedigree presented in Figure 2.1 in chapter 2 has a single
marriage loop. Yet peeling one of these people involved $G^4$ computations. For 12 alleles
$G^4 = 429,981,696$!
Currently there is some genotype elimination done in SIMPLE. For example, if every-
one in a pedigree is genotyped then G is effectively reduced to at most 2 and the compu-
tational cost of peeling anyone is at most 8 in a simple pedigree. But in the presence of
missing data, the genotype elimination algorithm can be improved. For example, peeling
people with missing data often involves summing over the number of alleles squared. An
obvious improvement would be to reduce the number of alleles at each locus to the number
of alleles observed plus one (for the unobserved alleles). This is what we did manually
to analyze the pedigree in Figure 2.1 in chapter 2 and it drastically reduced the computa-
tional time. We processed 8 markers with 9 to 12 alleles, as described in chapter 2. We
reduced the effective number of alleles to at most 5 alleles using the genotype elimination
just described. To impute 5,000 genotypes on a Linux machine with 512 MB RAM and a
1 GHz AMD Athlon processor took less than 2 hours after this genotype elimination. To
impute just 10 genotypes on the same machine but without the genotype elimination took
more than 24 minutes! With this application of genotype elimination the computational
time went from a matter of days to less than 2 hours.
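A rough feel for why reducing the number of alleles helps can be had from a small calculation. The sketch below assumes that the cost of a peel is a power of the number of ordered genotypes (the square of the number of alleles): the cube reproduces the 2,985,984 figure quoted for 12 alleles, while the fourth power used for the marriage-loop case is an assumption of this example. The function names are hypothetical and the counts ignore every other source of cost.

```python
def ordered_genotypes(n_alleles):
    """Number of ordered genotypes at a locus with n_alleles alleles."""
    return n_alleles ** 2


def peel_cost(n_alleles, exponent):
    """Computations for one peel, taken as G ** exponent."""
    return ordered_genotypes(n_alleles) ** exponent


print(peel_cost(12, 3))  # prints 2985984
print(peel_cost(12, 4))  # prints 429981696
# After reducing every marker to at most 5 effective alleles:
print(peel_cost(5, 3))   # prints 15625
print(peel_cost(5, 4))   # prints 390625
```

Under these assumptions, going from 12 to 5 effective alleles cuts the per-peel count by roughly three orders of magnitude in the fourth-power case, which is in line with the drop in running time reported above.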
6.2 Interference
All of these estimates can be done under the $\chi^2$ model of the chiasma process for simple
or complex pedigrees, as demonstrated in this thesis. We could implement other models of
interference in SIMPLE as well. Any multi-locus feasible map function or chiasma model
yielding recombination probabilities can be incorporated into SIMPLE. For any multi-locus
feasible map function we can derive the recombination probabilities using the formula
derived by Schnell (1961). Given the recombination probabilities, we can use them to
calculate the multi-locus probabilities in steps 1 and 2 in the algorithm in chapter 2 and
thus incorporate interference via the corresponding map function or chiasma process.
6.3 Quantitative Trait Statistics
As already described, we can calculate any IBD sharing statistic of the form given in
equation (1.2). This includes quantitative traits as well as qualitative traits. For example,
the maximum likelihood estimate of the Haseman-Elston statistic (Haseman and Elston,
1972; Kruglyak and Lander, 1995) for quantitative traits can be estimated by SIMPLE.
The Haseman-Elston statistic is based on the regression model $D_i = \alpha + \beta \pi_i + \epsilon_i$, $i = 1, \ldots, n$, where $D_i$ is the squared difference of phenotypes between the $i$th pair of
sibs, $\pi_i$ is the proportion of IBD sharing between the pair and $\epsilon_i \overset{iid}{\sim} N(0, \sigma^2)$. If $\beta$ is
significantly less than 0 then there is evidence of linkage. So we wish to estimate $\beta$ but we
don’t know the proportion of IBD sharing between the pairs, $\pi = (\pi_1, \ldots, \pi_n)$. We can use
an EM algorithm derived by Kruglyak and Lander (1995) to get the ML estimate of $\beta$.
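For intuition, the regression behind the Haseman-Elston statistic can be sketched for the idealized case in which the IBD sharing proportions are known. This is not the EM algorithm of Kruglyak and Lander (1995), which handles unknown sharing proportions; it is only an ordinary least-squares fit of the squared sib-pair differences on the sharing proportions, with hypothetical variable names.

```python
def haseman_elston_fit(ibd, sq_diff):
    """Least-squares intercept and slope for regressing squared
    sib-pair phenotype differences on IBD sharing proportions.

    A clearly negative slope is the signal for linkage.
    """
    n = len(ibd)
    mean_x = sum(ibd) / n
    mean_y = sum(sq_diff) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(ibd, sq_diff))
    sxx = sum((x - mean_x) ** 2 for x in ibd)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Toy data lying exactly on the line 4 - 4 * (IBD proportion).
intercept, slope = haseman_elston_fit([0.0, 0.5, 1.0, 0.5], [4.0, 2.0, 0.0, 2.0])
print(intercept, slope)  # prints 4.0 -4.0
```

In practice the sharing proportions are not observed, which is why their conditional distributions given the marker data are needed.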
In this application of the EM algorithm the complete data is $(\pi, D)$, where $\pi = (\pi_1, \ldots, \pi_n)$ are the proportions of IBD sharing for all $n$ pairs of sibs and $D = (D_1, \ldots, D_n)$ are the squared
differences of the observed quantitative phenotypes for the $n$ pairs of sibs. The observed data
is $(y_M, D)$. Note that $D$ is just a function of the trait phenotypes, $y_D$.
In order to implement this EM algorithm one needs to know the probability distribution
of the $\pi_i$’s, $P(\pi_i \mid y_M)$. This can be computationally prohibitive. We propose to use sequential
imputation to estimate these distributions.
We will outline the algorithm for estimating these distributions. Following steps 1
through 4 in the algorithm in chapter 2 we produce a sample of $N$ weights $w(1), \ldots, w(N)$ and $N$ inheritance vectors. Given an inheritance vector we can determine $\pi_i$.
To estimate the probability $P(\pi_i = \pi \mid y_M)$ we take the weighted average of the sample realizations that
yield $\pi$:
$$\hat{P}(\pi_i = \pi \mid y_M) = \sum_{j=1}^{N} I\left(\pi_i^{(j)} = \pi\right) \frac{w(j)}{w(+)},$$
where $\pi_i^{(j)}$ is the value of $\pi_i$ determined by the $j$th inheritance vector, $I(A)$ is the indicator function, i.e. $I(A) = 1$ if $A$ is true, otherwise $I(A) = 0$,
and $w(+) = \sum_{j=1}^{N} w(j)$ is the sum of the weights. This is a consistent estimator of the
desired probability. Using these estimated probabilities we implement the EM algorithm
described in Kruglyak and Lander (1995) to derive a maximum likelihood estimate of the
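The weighted estimator of the IBD sharing distribution described in this section can be sketched as follows. This is a minimal illustration that assumes the IBD sharing proportion implied by each imputed inheritance vector has already been computed for the sib pair of interest; the function name and argument names are hypothetical and do not correspond to SIMPLE's code.

```python
from collections import defaultdict


def weighted_ibd_distribution(ibd_values, weights):
    """Estimate P(sharing = v | marker data) for one sib pair.

    ibd_values[j]: sharing proportion implied by the j-th imputed
                   inheritance vector.
    weights[j]:    importance sampling weight of the j-th imputation.
    """
    total = sum(weights)  # the sum of the weights
    probs = defaultdict(float)
    for value, weight in zip(ibd_values, weights):
        # Each realization contributes its normalized weight.
        probs[value] += weight / total
    return dict(probs)
```

The returned probabilities sum to one by construction, and realizations with larger importance weights count proportionally more.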
This document contains instructions for compiling and running SIMPLE (Sequential Imputation for Multi-Point Linkage Estimation). SIMPLE uses the same input files used in GENEHUNTER and takes similar commands.
SIMPLE is made up of two programs written in C. The first program is ’simple’, which produces importance sampling weights and (optionally) simulated in-phase genotypes (if NPL or QTL analysis) and likelihood ratios (if LOD analysis). This output is written to the screen and may be redirected to a file. The second program, ’scan’, reads in the output from simple and produces the desired statistics (which are written to the screen). Both programs take the same commands.
/********** Set up **********/
untar and unzip SIMPLE.tar.gz
>gzip -d < SIMPLE.tar.gz | tar xvf -
This will produce a directory called /SIMPLE in the current directory. /SIMPLE contains: Makefile file, simple (compiled), scan (compiled), /src directory, /check directory and instruct.txt file (this document). Makefile is used to compile simple and scan. /src directory contains the relevant code. /check directory contains sample pedigree files to run SIMPLE and ensure that it was compiled correctly.
/************* Compiling *************/
simple and scan were compiled on Solaris 7. To recompile them type
>make simple scan
/********************************* Basic form for running simple *********************************/
The command file allows for changes from the default setup. As in GENEHUNTER the command file must at least contain the ‘load’ and ‘scan’ commands to specify the input files. The following commands may be used, with the defaults given in [].
/************ Commands ************/
>load [linkloci.dat]
Gives the GENEHUNTER compatible marker-locus data (allele frequencies for each genetic marker, frequency and penetrance information for the disease). The format of this file must be identical to the Linkage parameter file (output from the PREPLINK program).
>scan [ped.dat]
Gives the location of the file with the pedigree, marker, and disease data. The pedigree should be in the Linkage pedigree input format (before running MAKEPED or doing any preprocessing!). Each line of this file must have the following structure:
(a) pedigree name
(b) individual ID #
(c) father’s ID #
(d) mother’s ID #
(e) sex (1=MALE, 2=FEMALE)
(f) affectation status (0=UNKNOWN, 1=UNAFFECTED, 2=AFFECTED)
(g) liability class (OPTIONAL) - classes specified in marker data file
(h) marker genotypes
(i) quantitative trait values
A ‘0’ in any of the disease phenotype or marker genotype positions (as in the genotypes for the last marker above) indicates missing data. (See the example pedigree file in /check.) By default, a non-numeric character marks a missing value for quantitative traits. To change this to a numeric character use the command:
>missingQT [ # ]
where ‘#’ is the numerical character for a missing value in the quantitative traits data.
You should only enter one pedigree at a time, though in theory the software could handle more than one. However, entering several pedigrees at once will put a strain on the memory. In general, you will be able to handle bigger pedigrees if they are run one at a time.
>maxiter [ 5000 ]
The number of imputed data sets to be generated by the sequential imputation procedure. The null distribution is estimated with 20*maxiter iterations.
>increment step [ 5 ]
Acts the same way as increment step does in GENEHUNTER. However, simple does not currently calculate scores outside the marker map. This should be changed in the future. In addition the increment scan command of GENEHUNTER has not been implemented, and may not be in the future versions.
>seed [ 123456789 ]
As simple is a Monte Carlo procedure, a starting point for the random number generator is required. While a default value is given, this should be set for every run. A valid integer between 0 and 2147483647 is required. SIMPLE automatically drops a random integer in a file called ’seed’ after execution. This file may be appended to the command file to accommodate changing the seed.
>debug
There is a debug mode in simple which allows intermediate results to be examined. It was mainly added originally to allow easier debugging as the code was being developed. The default is not to debug. If you wish to use this procedure, include ’debug’ in the command file.
>process order [1 2 ... nmarkers]
The order that the markers are processed can be set. The default is to process the markers in decreasing order based on the number of alleles (most alleles processed first, fewest alleles processed last). If two or more markers have the same number of alleles, the lowest numbered marker gets done first. If a disease locus is to be processed, it is always processed last.
>interference [ 19 0 ]
Currently only one interference model is included in SIMPLE. The chisquare model with intensity m is indicated by
>interference 19 m
Note that a chisquare model with m=0 is the no interference model. This is the default setting. In the future, other interference models will be added. The only change that will be observed by the user is that additional options to this command will become available.
>analysis [ BOTH ]
As in GENEHUNTER one may input ’BOTH’, ’NPL’ or ’LOD’. In addition, one may input ’NONE’. This may be useful if you just wish to conduct a QTL analysis.
>score [ PAIRS ]
As in GENEHUNTER one may input PAIRS or ALL.
>units [ ]
One may input ’cM’ or ’rec-frac’. The default, as in GENEHUNTER, is to assume that the distances are in recombination fractions unless at least one distance is greater than .5.
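The default rule for the units can be written as a one-line check. This is only an illustrative sketch of the rule as stated, not SIMPLE's C code; the function name is hypothetical.

```python
def infer_units(distances):
    """Return 'cM' if any inter-marker distance exceeds 0.5,
    otherwise 'rec-frac' (recombination fractions)."""
    return "cM" if any(d > 0.5 for d in distances) else "rec-frac"


print(infer_units([0.1, 0.2, 0.05]))  # prints rec-frac
print(infer_units([4.0, 4.0, 4.0]))   # prints cM
```

A map given in cM with every distance at or below .5 would be read as recombination fractions under this rule, so setting the units explicitly is the safer choice for dense maps.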
>peel [ <file name> ]
Indicate peeling order in file. This is only necessary if the pedigree contains loops. The file should have 2 columns. The first column contains the cut set. The second column contains the peel set. If there is more than one member in any of the sets then separate them with commas. The order of the cut and peel sets specifies the peeling order. The last cut set, the root, should be 0. An example peeling file is given below.
As in GENEHUNTER, total stat produces the scores for the combined pedigrees. The default is to produce the scores for the pedigrees separately.
>dumppairs
Print out the expected alleles shared IBD for each relative pair. This is printed after the scores. The rows correspond to the location and the columns correspond to the pair. The header lists the PIDs of the pairs that each column corresponds to. The first element of each row is labeled by the location that it corresponds to. Here is an example of the output from dumppairs:
The first pair in pedigree 1 contained the members 839 and 843 who had expected alleles shared IBD of 1, etc...
>dumpscores
Print out the raw scores simulated conditioned on the data. The output is printed out before the estimated scores are printed. The rows correspond to each iteration. The first column is iteration number, the second column is the weight and the rest of the columns correspond to the simulated scores at each location. Here is an example output from dumpscores:
The first three iterations are shown. The weight for the first iteration is 4.809104e-14. The score simulated in the first iteration at the first location is 2.500000e-01, etc...
/************************ Example command file ************************/
The following example command file includes all possible commands. In most cases the default values are used. This would be from an example with 5 markers and a disease.
/check contains an example data set in ped.dat and linkloci.dat. After you compile SIMPLE you can check the output from simple with out and the output from scan with scores. The command file ’cmd’ contains the commands necessary to do the check and also has other commands commented out which one may use. This command file may serve as a template.
/***************************** Interpretation of results *****************************/
Appended at the end is a sample output file after running SIMPLE on the sample data provided in the ’check’ directory. The command file was the following:
------------Command file------------
>load linkloci.dat>scan ped.dat
We see that the command file simply indicated which locusfile to load and which pedigree file to scan. The defaultstatistics are both NPL-PAIRS and LOD scores. (See listof commands used appended at the end of the output file.)
The first statistics that are reported (see file below)are the estimated null mean and variance (under the columns’mean’ and ’var’, respectively). Next are the NPL scoresat the different locations. The first column, ‘ped’,indicates which pedigree these statistics correspond to.The second column, ‘pos’, indicates the position thatthe statistics were calculated at in cM. The thirdcolumn, ‘S’, are the raw scores at those positions.The fourth column, ‘Z’, are the standardized scores atthose positions. The next to last column, ‘SE(Z)’ arethe monte carlo SEs of the standardized scores. The lastcolumn are the estimated exact p-values.
The LOD scores follow the NPL scores. The first column, ‘pos’, gives the locations in cM where the scores were calculated. The second column, ‘LOD’, gives the estimated LOD scores. The last column, ‘SE(LOD)’, gives the Monte Carlo SEs of the estimated LOD scores.
>32.00 -1.427404 0.0000
>36.00 -1.602604 0.0000
>40.00 -10000.000000 0.0000
>44.00 -1.603792 0.0000
>48.00 -1.427640 0.0000
>52.00 -1.426351 0.0000
>56.00 -1.599998 0.0000
>60.00 -10000.000000 0.0000
>
>The following commands were used.
>An ’*’ indicates that the command was found in the
>command file ’cmd’ and may not necessarily be the
>default value.
>-------------------------------------------------
>load linkloci.dat*
>scan ped.dat*
>analysis BOTH
>score ALL*
>units (AUTO) cM
>maxiter 5000
>increment step 5
>seed 123456789
>NO reweighting used
>process order (AUTO) 1 2 3 4 5
>interference 19 m=0
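The LOD rows in the listing above have three numeric columns (‘pos’, ‘LOD’, ‘SE(LOD)’), so they can be picked out of the output file mechanically. The following Python sketch is not part of SIMPLE; the function name `parse_lod_rows` is hypothetical, and it assumes each output line carries a leading ’>’ and whitespace-separated fields as shown. Lines that do not consist of exactly three numbers, such as the command summary, are skipped.

```python
from typing import Iterable, List, Tuple


def parse_lod_rows(lines: Iterable[str]) -> List[Tuple[float, float, float]]:
    """Collect (pos, LOD, SE) from lines that hold exactly three numbers."""
    rows = []
    for line in lines:
        fields = line.strip().lstrip(">").split()
        if len(fields) != 3:
            continue  # not a LOD row (e.g. NPL rows or commands)
        try:
            pos, lod, se = (float(x) for x in fields)
        except ValueError:
            continue  # three fields, but not all numeric
        rows.append((pos, lod, se))
    return rows


# Rows copied from the sample output, plus two non-data lines.
sample = [
    ">32.00 -1.427404 0.0000",
    ">36.00 -1.602604 0.0000",
    ">40.00 -10000.000000 0.0000",
    ">44.00 -1.603792 0.0000",
    ">48.00 -1.427640 0.0000",
    ">52.00 -1.426351 0.0000",
    ">56.00 -1.599998 0.0000",
    ">60.00 -10000.000000 0.0000",
    ">units (AUTO) cM",
    ">maxiter 5000",
]
lod_rows = parse_lod_rows(sample)
# Position with the largest estimated LOD score among the parsed rows.
best = max(lod_rows, key=lambda row: row[1])
```

The value -10000.000000 is passed through unchanged; how it should be interpreted is not stated in this section, so the sketch does not treat it specially.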
BIBLIOGRAPHY
Abecasis, G., Cherny, S., Cookson, W., and Cardon, L. (2002). Merlin-rapid analysis of dense genetic maps using sparse gene flow trees. Nature Genetics, 30:97–101.
Bergman, N. (2001). Posterior Cramer-Rao bounds for sequential estimation. In Doucet, A., de Freitas, N., and Gordon, N., editors, Sequential Monte Carlo methods in practice, pages 321–338. Springer-Verlag, New York.
Blake, A., Isard, M., and MacCormick, J. (2001). Statistical models of visual shape and motion. In Doucet, A., de Freitas, N., and Gordon, N., editors, Sequential Monte Carlo methods in practice, pages 339–357. Springer-Verlag, New York.
Cannings, C., Thompson, E., and Skolnick, M. (1978). Probability functions on complex pedigrees. Advances in Applied Probability, 10:26–61.
Cummings, K. (1997). Concepts of Genetics, volume 5. Prentice Hall, Upper Saddle River, NJ.
Dausset, J., Cann, H., Cohen, D., Lathrop, M., Lalouel, J., and White, R. (1990). Centre d’Étude du Polymorphisme Humain (CEPH): collaborative genetic mapping of the human genome. Genomics, 6:575–577.
Elston, R. and Stewart, J. (1971). A general model for the genetic analysis of pedigree data. Human Heredity, 21:523–542.
Goldstein, D., Speed, T., and Zhao, H. (1995). Relative efficiencies of chi-square models of recombination for exclusion mapping and gene ordering. Genomics, 27:265–273.
Gudbjartsson, D., Jonasson, K., Frigge, M., and Kong, A. (2000). Allegro, a new computer program for multipoint linkage analysis. Nature Genetics, 25:12–13.
Haldane, J. (1919). The combination of linkage values and the calculation of distances between loci of linked factors. Journal of Genetics, 8:299–309.
Haseman, J. and Elston, R. (1972). The investigation of linkage between a quantitative trait and a marker locus. Behavior Genetics, 2:3–19.
Idury, R. and Elston, R. (1997). A faster and more general hidden Markov model algorithm for multipoint likelihood calculations. Human Heredity, 47:197–202.
Irwin, M., Cox, N., and Kong, A. (1994). Sequential imputation for multilocus linkage analysis. Proceedings of the National Academy of Sciences, 91:1684–1688.
Karlin, S. and Liberman, U. (1978). Classifications and comparisons of multilocus recombination distributions. Proceedings of the National Academy of Sciences, 75:6332–6336.
Kruglyak, L., Daly, M., and Lander, E. (1995). Rapid multipoint linkage analysis of recessive traits in nuclear families, including homozygosity mapping. American Journal of Human Genetics, 56:519–527.
Kruglyak, L., Daly, M., Reeve-Daly, M., and Lander, E. (1996). Parametric and nonparametric linkage analysis: A unified multipoint approach. American Journal of Human Genetics, 58:1347–1363.
Kruglyak, L. and Lander, E. (1995). Complete multipoint sib-pair analysis of qualitative and quantitative traits. American Journal of Human Genetics, 57:439–454.
Kruglyak, L. and Lander, E. (1998). Faster multipoint linkage analysis using Fourier transforms. Journal of Computational Biology, 5:1–7.
Kwiatkowski, D., Dib, C., Slaugenhaupt, S., Povey, S., Gusella, J., and Hains, J. (1993). An index marker map of chromosome 9 provides strong evidence of positive interference. American Journal of Human Genetics, 53:1279–1288.
Lander, E. and Green, P. (1987). Construction of multilocus genetic linkage maps in humans. Proceedings of the National Academy of Sciences, 84:2363–2367.
Lange, K. (1997). Mathematical and Statistical Methods for Genetic Analysis. Springer-Verlag, New York.
Lange, K. and Goradia, T. (1987). An algorithm for automatic genotype elimination. American Journal of Human Genetics, 40:250–256.
Lathrop, G., Lalouel, J., Julier, C., and Ott, J. (1984). Strategies for multilocus linkage in humans. Proceedings of the National Academy of Sciences, 81:3443–3446.
Lin, S. and Speed, T. (1996). Incorporating crossover interference into pedigree analysis using the χ² model. Human Heredity, 46:315–322.
Lin, S. and Speed, T. (1997). An algorithm for haplotype analysis. Journal of Computational Biology, 4:535–546.
Lin, S. and Speed, T. (1999). Relative efficiencies of the chi-square recombination models for gene mapping with human pedigree data. American Journal of Human Genetics, 63:81–95.
Litt, M., Kramer, P., Browne, D., Gancher, S., Brunt, E., Root, D., Phromchotikul, T., Dubay, C., and Nutt, J. (1994). A gene for Episodic Ataxia/Myokymia maps to chromosome 12p13. American Journal of Human Genetics, 55:702–709.
Luo, Y., Lin, S., and Irwin, M. (2001). Two-locus modeling of asthma in a Hutterite pedigree via Markov chain Monte Carlo. Genetic Epidemiology, 21(Supp 1):S24–S29.
Markianos, K., Daly, M., and Kruglyak, L. (2001a). Efficient multipoint linkage analysis through reduction of inheritance space. American Journal of Human Genetics, 68:963–977.
Markianos, K., Katz, A., and Kruglyak, L. (2001b). A new computational approach for rapid multipoint linkage analysis of qualitative and quantitative traits in large, complex pedigrees, and its implementation in GENEHUNTER. American Journal of Human Genetics, 69:228.
McPeek, M. (1999). Optimal allele-sharing statistics for genetic mapping using affected relatives. Genetic Epidemiology, 16:225–249.
O’Connell, J. and Weeks, D. (1995). The VITESSE algorithm for rapid exact multilocus linkage analysis via genotype set-recoding and fuzzy inheritance. Nature Genetics, 11:402–408.
O’Connell, J. and Weeks, D. (1999). An optimal algorithm for automatic genotype elimination. American Journal of Human Genetics, 65:1733–1740.
Ott, J. (1989). Computer-simulation methods in human linkage analysis. Proceedings of the National Academy of Sciences, 86:4175–4178.
Ploughman, L. and Boehnke, M. (1989). Estimating the power of a proposed linkage study for a complex trait. American Journal of Human Genetics, 44:543–551.
Qian and Beckmann (2002). Minimum-recombinant haplotyping in pedigrees. American Journal of Human Genetics, 70:1434–1445.
Schnell, F. (1961). Some general formulations of linkage effects in inbreeding. Genetics, 46:947–957.
Sobel, E., Lange, K., and O’Connell, J. (1996). Haplotyping algorithms. In Speed, T. and Waterman, M., editors, Genetic mapping and DNA sequencing, volume 81, pages 89–110. Springer-Verlag, New York.
Tapadar, P., Ghosh, S., and Majumder, P. (2000). Haplotyping in pedigrees via a genetic algorithm. Human Heredity, 50:43–56.
Thompson, E. (2000). Statistical inferences from genetic data on pedigrees, volume 6 of NSF-CBMS Regional Conference Series in Probability and Statistics. IMS, Beachwood, OH.
Warner, E., Foulkes, W., Goodwin, P., Meschino, W., Blondal, J., Paterson, C., Ozcelik, H., Goss, P., Allingham-Hawkins, D., Hamel, N., Prospero, L. D., Contiga, V., Serruya, C., Klein, M., Moslehi, R., Honeyford, J., Liede, A., Glendon, G., Brunet, J., and Narod, S. (1999). Prevalence and penetrance of BRCA1 and BRCA2 gene mutations in unselected Ashkenazi Jewish women with breast cancer. Journal of the National Cancer Institute, 91:1241–1247.
Weeks, D., Lathrop, G., and Ott, J. (1993). Multipoint mapping under genetic interference. Human Heredity, 43:86–97.
Whittemore, A. and Halpern, J. (1994). A class of tests for linkage using affected pedigree members. Biometrics, 50:118–127.
Zhao, H., McPeek, M., and Speed, T. (1995a). Statistical analysis of chromatid interference. Genetics, 139:1057–1065.
Zhao, H., Speed, T., and McPeek, M. (1995b). Statistical analysis of crossover interference using the chi-square model. Genetics, 139:1031–1044.