Simulation Study on the Methods for Mapping Quantitative Trait Loci in Inbred Line Crosses A Dissertation Submitted in Partial Fulfilment of the Requirements for the Degree of DOCTOR OF PHILOSOPHY in Genetics and Plant Breeding by SHENGCHU WANG Zhejiang University Hangzhou, Zhejiang, China 2000
A Ph.D. DISSERTATION
Simulation Study of the Methods for Mapping
Quantitative Trait Loci in Inbred Line Crosses
By
Shengchu Wang
Major: Genetics and Plant Breeding
Supervisors: Dr. Jun Zhu and Dr. Zhao-Bang Zeng
Zhejiang University
Hangzhou, Zhejiang China
2000
DEDICATION
To My Wife, Xiu-Juan Rong
And Daughter, Min-Xue Wang
Acknowledgments
I would like to express my special thanks to my advisor, Dr. Jun Zhu, for his important
direction, encouragement and support during my doctoral study and dissertation research.
The experience of studying with Dr. Zhu was beneficial and unforgettable.
I would like to express my sincere thanks to Dr. Zhao-Bang Zeng for supporting
me financially to carry out part of my dissertation research in the US and for giving me a
great deal of help in my research and my life while I stayed at NCSU. Thanks also to Dr.
Bruce Weir for hosting me in his lab and for good advice on my research. I
would like to express my gratitude to my wife and daughter for their support and
patience.
I am grateful to Dr. Xin-Fu Yan, Dr. Yue-Fu Liu, Dr. Rong-Ling Wu, Hai-Ming Xu,
Ci-Xin He, and everyone who helped me during my dissertation research. I also wish
to express my thanks to my colleagues at the computer centre, Zhejiang University, for
their support of my doctoral study and dissertation research.
Abstract
With the rapid advance of molecular genetics, it is now much easier to obtain well-distributed
genetic markers in almost every organism. Therefore, as a major
direction of quantitative genetics, various statistical methods have been developed to
detect or map quantitative trait loci (QTL) by using genetic marker information. In
this dissertation, the principles and models of various QTL mapping methods are
summarized. These methods include single marker analysis, interval mapping
1-1 HISTORY OF THE QTL MAPPING WORK
1-2 MOLECULAR MARKERS
1-3 EXPERIMENTAL DESIGN
1-4 MODELS AND SOFTWARE
1-5 SIMULATION VS. REAL DATA
1-6 MAP FUNCTIONS AND MARKER ANALYSIS
1-7 PURPOSE OF THIS RESEARCH
2. REVIEW OF MAJOR QTL MAPPING METHODS
2-1 ONE MARKER METHOD
    1. Statistic Bases for One Marker Method
    2. The t-test Method
    3. Likelihood Ratio Test Method
    4. Simple Regression Method
2-2 INTERVAL MAPPING METHOD
    1. Conditional Probabilities of QTL Genotypes
    2. Genetic Model
    3. Maximum Likelihood Analysis
    4. Likelihood Ratio Test
2-4 MIXED LINEAR MODEL APPROACH
    1. Genetic Model
    2. Likelihood Analysis
    3. Hypothesis Test
    4. A Model for GE Interaction
    5. A Model for QTL Epistasis
3-1 SIMULATION MODEL AND DATA
    1. Genetic Model for Simulation
    2. Parameter Setting
    3. Simulation Procedure
    4. Format of the Simulation Data
3-2 SINGLE MARKER ANALYSIS
3-3 COMPARING DIFFERENT MAPPING METHOD
    1. Parameters Setting
    2. Estimation of QTL Effects
    3. Power and False Positive
    4. Positions and Effects of Detected QTLs
    5. The LR Profile
3-4 CONSIDER THE COMPLICATED QTL MAPPING SITUATIONS
    1. Parameters Setting
    2. Performance of IM and CIM Methods
    3. Using MCIM Method
4. MODEL SELECTION AND CRITERIA
4-1 MIM AND MODEL SELECTION
4-2 MODEL EVALUATION STANDARD
4-3 MODEL SELECTION STRATEGY AND CRITERIA
4-4 PROCEDURE OF MODEL SELECTION
4-5 SUMMARY OF CRITERIA FOR MODEL SELECTION
    1. Adjusted R2
    2. Mallow’s Cp (Mallows 1973)
    3. Mean Squared Error Prediction (Aitkin 1974, Miller 1990)
    4. BIC and Related Criteria
4-6 SIMULATION STUDIES OF CRITERIA
    1. FW and BW Methods
    2. Criteria and the Various Parameters
    3. Experimental Criteria
1-1 History of the QTL Mapping Work
It is generally believed that the rediscovery of Mendelian genetics in 1900 marked the
beginning of modern genetics. Demonstrations of the inheritance of discrete
characters, such as purple vs. white flowers and smooth vs. wrinkled seeds, made it clear
that traits are controlled by genetic factors, or genes, which are inherited from
generation to generation. Later, great efforts were made to understand
how genes affect discrete characters, or qualitative traits, and especially
how genes are transmitted from parents to their offspring.
However, most economically and biologically important traits are not
qualitative but quantitative in nature. Here quantitative means that the trait
values cannot be divided into a few discrete categories and are distributed
continuously over a range in a population. Examples of quantitative traits are
crop yield, plant height, resistance to diseases, weight gain in mice, and egg or milk
production in animals. Because of the complex nature of quantitative inheritance,
progress in quantitative genetics has lagged far behind that in Mendelian genetics. The
traditional way to study quantitative traits is to partition the phenotypic variance into
various genetic and non-genetic variance components:
$V_P = V_G + V_e = V_A + V_D + V_I + V_e$
Here the phenotypic variance $V_P$ is partitioned into two components: the genetic part
$V_G$ and the environmental and residual part $V_e$. The genetic variance can be further
partitioned into additive ($V_A$), dominance ($V_D$) and epistatic ($V_I$) variances. It is also
possible to partition $V_G$ into other variance components, depending on the application.
For example:
$V_G = V_A + V_D + V_L + V_M$
where $V_L$ is the sex-linkage component and $V_M$ is the maternal variance component
(Zhu and Weir, 1996).
These variance components can be estimated under special breeding designs
(Cockerham 1961, Eberhart et al. 1966, Falconer 1996, Zhu 1998). The estimates
allow us to evaluate the relative importance of the various determinants of the phenotypic
variance. The ratio $V_G/V_P$ is called heritability in the broad sense, and $V_A/V_P$ is
called heritability in the narrow sense, or simply heritability ($h^2$). Heritability measures
the degree to which phenotypic deviations are transmitted genetically from parents to their
offspring, and it is useful in predicting the response to selection.
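The two heritability ratios can be illustrated with a minimal sketch. The variance components below are hypothetical numbers chosen only for illustration, and the function names are not from the dissertation. The prediction of the response to selection uses the standard breeder's equation R = h2 * S, which is a textbook result assumed here rather than stated in the text.

```python
def heritabilities(v_a, v_d, v_i, v_e):
    """Return (broad-sense, narrow-sense) heritability from variance components."""
    v_g = v_a + v_d + v_i          # total genetic variance V_G
    v_p = v_g + v_e                # phenotypic variance V_P
    return v_g / v_p, v_a / v_p


def response_to_selection(h2, selection_differential):
    """Breeder's equation R = h2 * S (textbook result, assumed here)."""
    return h2 * selection_differential


# Hypothetical variance components, chosen only for illustration.
broad, narrow = heritabilities(v_a=4.0, v_d=1.0, v_i=1.0, v_e=4.0)
print(broad, narrow)                        # 0.6 0.4
print(response_to_selection(narrow, 10.0))  # 4.0
```

With these numbers, 60% of the phenotypic variance is genetic but only 40% is additive, so a selection differential of 10 units would be expected to move the offspring mean by about 4 units.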
The questions of how genes contribute to quantitative trait values and why
trait values are continuously distributed can be answered in part by the polygene
theory (Johannsen 1909, Nilsson-Ehle 1909, East 1916). In this theory, a quantitative
trait is controlled by many genes with small effects and is, at the same time,
easily influenced by environmental effects. However, it is very difficult to dissect the
individual genes controlling a quantitative trait by classical quantitative genetic
means. Consequently, breeders usually have no idea about the number, locations and
effects of the individual genes involved in the inheritance of target quantitative traits
(Comstock 1978). These genes are called quantitative trait loci (QTLs). Without
information on the number, locations, and effects of the QTLs, it is impossible to
manipulate them by genetic engineering and thereby improve an organism’s traits.
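The polygene argument can be made concrete with a small simulation. This is a sketch under simplifying assumptions that are not taken from the dissertation: loci are unlinked, purely additive, with equal effects and allele frequency 0.5, and the environmental deviation is normal. All names and parameter values are hypothetical.

```python
import random


def simulate_trait(n_individuals, n_loci, effect, env_sd, rng):
    """Trait value = effect * (number of 'high' alleles) + normal environmental noise.

    Each individual carries 2 * n_loci alleles; each is 'high' with probability 0.5.
    """
    values = []
    for _ in range(n_individuals):
        copies = sum(rng.random() < 0.5 for _ in range(2 * n_loci))
        values.append(effect * copies + rng.gauss(0.0, env_sd))
    return values


rng = random.Random(1)
# One major gene, no environmental noise: a discrete (Mendelian) trait.
few = simulate_trait(2000, n_loci=1, effect=10.0, env_sd=0.0, rng=rng)
# Fifty small-effect genes plus environmental noise: a quantitative trait.
many = simulate_trait(2000, n_loci=50, effect=0.2, env_sd=1.0, rng=rng)

print(sorted(set(few)))                      # at most three phenotype classes
print(len(set(round(v, 3) for v in many)))   # many distinct values
```

The single-gene trait falls into at most three classes, while the polygenic trait with environmental noise produces values that are effectively continuous, which is the pattern the polygene theory is meant to explain.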
The history of QTL mapping can be traced back to the 1920s. Sax (1923) used
morphological markers to demonstrate an association between seed weight and seed
coat colour in beans. Thoday (1961) used multiple genetic markers to systematically
map the individual polygenes that control a quantitative trait. He noted: “The
main practical limitation of the technique seems to be the availability of suitable
markers”. The number of morphological and protein markers is clearly
very limited. Therefore, molecular genetic markers are the natural choice for detecting or
mapping QTLs.
Nowadays, it is much easier to obtain well-distributed genetic markers in almost every
organism, because of the rapid advance of molecular genetic technology. Various statistical
methods have been developed to detect or map QTLs by using genetic marker
information. Lander and Botstein (1989) proposed the interval mapping (IM) method,
which uses two adjacent markers to bracket a region and tests for the existence of a QTL
by performing a likelihood ratio test at every position in the region. The method has
been shown to be more powerful and to require fewer progeny than one-marker methods.
However, interval mapping has some drawbacks. Because it is a one-QTL
model, the estimated positions of QTLs can be seriously biased when more than one
QTL is located on the same chromosome (Knott and Haley 1992; Martinez and Curnow
1992).
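One ingredient of interval mapping is the probability of each QTL genotype at a putative position, given the genotypes of the two flanking markers. The sketch below computes that probability for a backcross to the high line, assuming no interference (Haldane's map function). It covers only this ingredient, not the likelihood ratio test itself, and the function names and the 10 cM interval are hypothetical.

```python
import math


def haldane_r(x_morgans):
    """Recombination fraction for a map distance in Morgans (no interference)."""
    return 0.5 * (1.0 - math.exp(-2.0 * x_morgans))


def prob_qtl_hh(m1, m2, pos, interval):
    """P(QTL genotype is HH | flanking marker genotypes) in a backcross to the high line.

    m1, m2   : 1 if the left/right marker is HH, 0 if it is HL
    pos      : distance of the putative QTL from the left marker (Morgans)
    interval : distance between the two markers (Morgans, must be > 0)
    """
    r1 = haldane_r(pos)               # left marker to QTL
    r2 = haldane_r(interval - pos)    # QTL to right marker
    r = haldane_r(interval)           # left marker to right marker
    if m1 == 1 and m2 == 1:
        return (1 - r1) * (1 - r2) / (1 - r)
    if m1 == 1 and m2 == 0:
        return (1 - r1) * r2 / r
    if m1 == 0 and m2 == 1:
        return r1 * (1 - r2) / r
    return r1 * r2 / (1 - r)


# Scan a hypothetical 10 cM interval in 2.5 cM steps.
for step in range(5):
    pos = 0.025 * step
    print(pos, [round(prob_qtl_hh(a, b, pos, 0.10), 4)
                for a, b in ((1, 1), (1, 0), (0, 1), (0, 0))])
```

In a full interval mapping analysis these probabilities would act as mixing proportions in the likelihood evaluated at each test position.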
Later, several attempts were made to solve this problem. Zeng (1993)
proved an important property of multiple regression analysis in relation to QTL
mapping: “If there is no epistasis, the partial regression coefficient of a trait on a
marker depends only on those QTLs that are in the interval bracketed by the two
neighbouring markers and is independent of QTLs located in other intervals”. Zeng
(1994) proposed an improved method, called composite interval mapping (CIM), by
combining interval mapping with multiple regression analysis. Jansen (1993)
proposed a similar strategy. Composite interval mapping has been shown to perform
better than interval mapping when multiple QTLs are linked. Recently an
extended method called multiple interval mapping (MIM) has been proposed (Kao,
Zeng and Teasdale 1999). This method fits all QTLs in the model simultaneously and is
able to analyse QTL epistasis and the associated statistical issues.
A new methodology based on the mixed linear model approach (MCIM) was also
proposed (Zhu, 1998, 1999; Zhu and Weir, 1998) for systematically mapping QTLs.
The MCIM method performs very similarly to Zeng’s CIM
method (see Chapter 3). However, unlike CIM, the MCIM method does not require
selecting background control markers or setting the mapping window size.
The MCIM method also has the advantage of being easy to extend
to more complicated QTL mapping situations, such as QTL epistasis and QTL by
environment interaction.
1-2 Molecular Markers
In the classical quantitative genetic approach, the units of analysis are genetic variances
rather than the underlying genes themselves. However, individual QTLs can be dissected by
using linked marker loci. This approach has long been recognized (Sax 1923;
Rasmusson 1933; Thoday 1961), but until recently it was regarded as of minor
importance because of the lack of sufficient genetic markers. Thanks to modern
molecular biology, this situation has changed dramatically. The ability to
detect genetic variation directly at the DNA level has resulted in an essentially endless
supply of markers for any species of interest. Not surprisingly, there has been an
explosion in the use of marker-based methods in quantitative genetics.
The first molecular markers used were allozymes, protein variants detected by
differences in migration on starch gels in an electric field. This class of markers has
been extensively applied to a variety of genetic problems (Tanksley and Rick 1980;
Delourme and Eber 1992; Baes and Van Cutsem 1993; Kindiger and Vierling 1994).
Allozymic variants have the advantage of being relatively inexpensive to score in
large numbers of individuals, but there is often insufficient protein variation for
high-resolution mapping. This is the reason why the rapid development of QTL
mapping did not start with the advent of allozymic markers.
As methods for evaluating variation directly at the DNA level became widely
available during the mid-1980s, DNA-based markers largely replaced allozymes in
mapping studies. DNA is the genetic material of organisms and genetic differences
between individuals will be reflected directly by the nucleotide sequences of DNA
molecules. There are effectively no limitations on either the genomic location or the
number of DNA markers.
A wide variety of techniques can be used to measure DNA variation. Direct
sequencing of DNA provides the ultimate measure of genetic variation, but much
quicker scoring of variation is sufficient for most purposes. These methods include
Restriction Fragment Length Polymorphisms (RFLPs), the Polymerase Chain Reaction
(PCR), Randomly Amplified Polymorphic DNAs (RAPDs) and microsatellite DNAs.
More recently developed methods include Representational
Difference Analysis (RDA) and Genomic Mismatch Scanning (GMS).
RFLPs are one of the simplest and most widely used types of DNA marker. The approach
is to digest DNA with a variety of restriction enzymes, each of which cuts the DNA at
a specific sequence, or restriction site. When the digested DNA is run on a gel under
an electric current, the fragments separate according to size. DNA
from different individuals can differ in the resulting fragment lengths. If we attempted to
score the entire genome for fragment lengths, the result would be a complete smear on the gel.
Instead, individual bands are isolated from this smear by using labelled DNA probes
that have base-pair complementarity to particular regions of the genome. Each RFLP
probe generally scores a single marker locus, and the marker alleles are codominant,
as heterozygotes and homozygotes can be distinguished. RFLP markers were first used
in the construction of human genetic maps (Botstein et al. 1980; Donis-Keller et
al. 1987), and their use has since been extended to other species (Beckmann and
Soller 1983, 1986a, 1986b; Soller and Beckmann 1988).
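The idea behind an RFLP can be shown with a toy digest. This is only an illustrative sketch: the two allele sequences are made up, the DNA is treated as a linear single strand, and the default site mimics EcoRI, which recognizes GAATTC and cuts after the first G. A single base change that destroys one site changes the fragment lengths, and a heterozygote shows the bands of both alleles.

```python
def digest(sequence, site="GAATTC", cut_offset=1):
    """Cut a linear DNA string at every occurrence of `site`.

    Returns the sorted fragment lengths. The defaults mimic EcoRI (G^AATTC).
    """
    cuts = []
    start = sequence.find(site)
    while start != -1:
        cuts.append(start + cut_offset)
        start = sequence.find(site, start + 1)
    bounds = [0] + cuts + [len(sequence)]
    return sorted(b - a for a, b in zip(bounds, bounds[1:]))


# Hypothetical alleles: allele_b has lost the second restriction site (A -> G).
allele_a = "ATGC" * 3 + "GAATTC" + "ATGC" * 5 + "GAATTC" + "TTAA"
allele_b = "ATGC" * 3 + "GAATTC" + "ATGC" * 5 + "GAGTTC" + "TTAA"

bands_a = digest(allele_a)
bands_b = digest(allele_b)
print(bands_a)                                # [9, 13, 26]
print(bands_b)                                # [13, 35]
print(sorted(set(bands_a) | set(bands_b)))    # heterozygote: [9, 13, 26, 35]
```

Because the heterozygote pattern differs from both homozygote patterns, all three genotypes can be told apart, which is what codominance means for this marker type.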
PCR is a rather different molecular marker approach that uses short primers for
DNA replication to delimit fragment sizes. When a region is flanked by oppositely oriented
primer-binding sequences that are sufficiently close together, the PCR reaction
replicates this region, generating an amplified fragment. If primer-binding sites are
missing or are too far apart, the PCR reaction fails and no fragments are generated for
that region. The RAPD method (Williams et al. 1990) follows a similar procedure, in which
sequence polymorphisms are detected by using random short sequences as primers.
The advantage is that a single primer can reveal several loci at once, each
corresponding to a different region of the genome with appropriate primer sites. RAPDs
also require smaller amounts of DNA. However, RAPD markers are dominant, so
the marker genotype can be ambiguous. Ragot and Hoisington (1993) concluded that
RAPDs are suitable for modest numbers of individuals, while RFLPs are better for
larger studies.
Microsatellite DNAs, short arrays of simple repeated sequences, tend to be very
highly polymorphic. Since array length is scored, microsatellites are codominant:
heterozygotes show two different lengths and hence can be distinguished from
homozygotes. This kind of marker is especially suitable for outbred populations,
because analysis of such populations is most efficient with marker loci having a large
number of alleles.
RDA and GMS are two recently developed, advanced methods. Both methods
examine the entire genome, allowing one to isolate only those sequences that are
shared by two populations (GMS) or those that differ between populations (RDA).
Good use of these methods will very likely provide powerful approaches for the
isolation of QTLs (Lander 1993, Aldhous 1994). Besides the commonly used
markers described above, other categories of markers can also be very useful in some cases.
The linear arrangement of markers along the chromosomes, or genome, of a
species is called a marker linkage map. Map information is very important for various
kinds of QTL mapping research. Many saturated marker maps, in which
markers cover the whole genome at reasonable spacing, have been published for
many organisms (Halward et al. 1993, Xu et al. 1994, Causse et al. 1994, Viruel et al.
1995, Hallden et al. 1996). With such saturated maps, many research
areas have become more likely to succeed. These include studies of the
evolutionary process of organisms through comparative mapping (Lagercrantz et al.
1996, Simon et al. 1997), marker-assisted selection to improve breeding efficiency
(Lee 1995, Hamalainen et al. 1997) and marker-based cloning (Xu 1994).
It is necessary to distinguish between physical maps and
genetic maps. The set of hereditary material transmitted from parent to offspring is
known as the genome, and it consists of molecules of DNA (deoxyribonucleic acid)
arranged in chromosomes. The DNA itself is characterized by its nucleotide sequence,
that is, the sequence of the bases A, C, G and T. A physical map is an ordering of features
of interest along the chromosome in which the metric is the number of base pairs
between features. This is the level of detail needed for molecular studies, and there are
several techniques available for physical mapping of discrete genetic markers or traits.
In this dissertation, however, genetic maps are the main concern. In a genetic map, distances
depend on the level of recombination expected between two points. An individual
receives one copy of each heritable unit (allele) from each parent at each location
(locus) of the genome. The combination of units (haplotype) at different locations
(loci) that the individual transmits to the next generation need not be one of the
parental sets, because recombination may have taken place during meiosis, the process
that produces eggs or sperm. That is, through crossing-over events, the alleles in a haploid
egg or sperm of a diploid may come from either of the two parental chromosomes.
Although there is generally a monotonic relation between physical and recombination
distance, the relation is not a simple one.
1-3 Experimental Design
Crosses between completely inbred lines that differ in the trait of interest offer
an ideal setting for detecting and mapping QTLs by marker-trait associations. The
reason is that all F1 individuals are genetically identical and show complete linkage
disequilibrium for genes that differ between the inbred lines. A number of designs have
been proposed to exploit these features. These designs produce various mapping
populations, including backcross, intercross, doubled haploid
and recombinant inbred line populations. Most inbred line cross
designs involve crop plants, but they have also been applied to a number of
animal species, especially mice (reviewed by Frankel 1995).
Here we call the two parental inbred lines P1 and P2; one is the low (L)
line and the other is the high (H) line. The F1 individuals receive a copy of each
chromosome from each of the two parental lines, and so, wherever the parental lines
differ, they are heterozygous. All F1 individuals are genetically identical and have
the genotype HL at each locus. Almost all experimental designs start from
the F1 generation.
In a backcross design, the F1 individuals are crossed to one of the two parental
lines, for example the high line. The backcross progeny, which may number from 100
to over 1000, receive one chromosome from the F1 and one from the high parental line.
Thus, at each locus, they have genotype either HH or HL. As a result of crossing over
during meiosis, the process that forms the gametes, the
chromosome received from the F1 is a mosaic of the two parental chromosomes. At
each locus, there is a one-half chance of receiving the allele from the high parental line
and a one-half chance of receiving the allele from the low parental line. The chromosome
received will alternate between stretches of L’s and H’s.
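The mosaic structure of a backcross chromosome can be sketched by simulation. The code below is an illustration, not the simulation procedure used in this dissertation. It assumes no interference, so recombination in adjacent intervals is independent and governed by Haldane's map function. The marker map, seed and function names are hypothetical.

```python
import math
import random


def haldane_r(distance_cM):
    """Recombination fraction for a map distance in centiMorgans (no interference)."""
    return 0.5 * (1.0 - math.exp(-2.0 * distance_cM / 100.0))


def f1_gamete(marker_positions_cM, rng):
    """Alleles ('H' or 'L') at each marker on the chromosome transmitted by an F1."""
    allele = rng.choice("HL")                 # first marker: 50:50
    gamete = [allele]
    for left, right in zip(marker_positions_cM, marker_positions_cM[1:]):
        if rng.random() < haldane_r(right - left):
            allele = "L" if allele == "H" else "H"   # switch parental strand
        gamete.append(allele)
    return gamete


def backcross_individual(marker_positions_cM, rng):
    """Marker genotypes (HH or HL) of one progeny from F1 x high line."""
    return ["HH" if a == "H" else "HL" for a in f1_gamete(marker_positions_cM, rng)]


rng = random.Random(2000)
positions = [0, 10, 20, 30, 40, 50]           # hypothetical marker map in cM
progeny = [backcross_individual(positions, rng) for _ in range(20000)]
print(progeny[0])

# Observed recombination fraction between the first two markers vs. expectation.
recombinants = sum(ind[0] != ind[1] for ind in progeny)
print(recombinants / len(progeny), haldane_r(10))
```

Each simulated individual shows runs of HH and HL genotypes along the chromosome, and the observed recombination fraction between adjacent markers should be close to the value implied by their 10 cM spacing.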
Another common experimental design used in plants is the intercross design. An F2
population is produced by selfing or sib mating F1 individuals. The F2 individuals
receive two sets of chromosomes from the F1 generation, each of which is a
combination of parental chromosomes. Thus, at each locus, the F2 individuals
have the genotype HH, HL or LL. The F2 population provides the most genetic
information among the different types of mapping populations (Lander et al. 1987) and is
relatively easy to obtain.
A doubled haploid (DH) population is composed of many DH lines, which are
usually developed from the pollen of an F1 plant through anther culture and
chromosome doubling. Individuals of a DH line are homozygous,
with genotype HH or LL at each locus along the chromosome. DH populations are also
called permanent populations because there is no segregation in later
generations. The advantage of a DH population is that the marker data can be used
repeatedly in different locations and years under various experimental designs.
However, the rate at which pollen is successfully turned into DH plants may vary with
the pollen genotype, and this can cause segregation distortion and false linkage
between some marker loci.
A recombinant inbred line (RIL) population is constructed by selfing or sib
mating individuals for many generations, starting from the F2, by single seed descent
until almost all of the segregating loci become homozygous. RIL populations
have been developed in recent years in rice, maize, barley and other crops (Burr et al.
1988, Reiter et al. 1992, Li et al. 1995). The advantage of a RIL population is that the genetic
distances are enlarged compared with those obtained from F2 or BC populations,
because many generations of selfing or sib mating increase the chance of
recombination. Therefore, it may be useful for increasing the precision of QTL
mapping. However, with a limited number of generations of selfing or sib
mating, not all individuals in a RIL population will be
homozygous at all segregating loci, which decreases the efficiency of QTL mapping to
some extent.
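The enlargement of genetic distances in RILs can be quantified. The sketch below uses the classical result of Haldane and Waddington (1931), which is not cited in the text and is brought in here as an assumption: after inbreeding to fixation, the proportion of recombinant lines R is 2r/(1+2r) under selfing and 4r/(1+6r) under sib mating, where r is the single-meiosis recombination fraction. The function names are hypothetical.

```python
def ril_recombination_selfing(r):
    """Proportion of recombinant RILs after selfing to fixation: R = 2r / (1 + 2r)."""
    return 2.0 * r / (1.0 + 2.0 * r)


def ril_recombination_sib(r):
    """Proportion of recombinant RILs after sib mating to fixation: R = 4r / (1 + 6r)."""
    return 4.0 * r / (1.0 + 6.0 * r)


for r in (0.01, 0.05, 0.10, 0.50):
    print(r,
          round(ril_recombination_selfing(r), 4),
          round(ril_recombination_sib(r), 4))
```

For tightly linked loci the observed recombinant proportion is roughly twice (selfing) or four times (sib mating) the per-meiosis value, while unlinked loci stay at 0.5 in both cases.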
Different experimental design populations are used for different QTL mapping
studies. In this dissertation, the B1 and DH populations are used as the main examples
because of their simplicity: at each locus in the genome, the progeny of a B1 or DH
population have only two possible genotypes. However, the principles and results
obtained here are easy to extend to other experimental design populations.
1-4 Models and Software
The QTL information (numbers, positions, effects, etc.) of an experimental
population is unobservable. In an experiment, one can observe only the trait
phenotype and marker information for each individual. QTL mapping rests on the idea
that genetic markers that tend to be transmitted together with specific values of the trait
are likely to be close to a gene affecting that trait. Therefore,
genetic and statistical models are very important for describing the data and extracting
the QTL information from the data.
Genetic models describe an organism’s genetic behaviour, such as
recombination events and additive, dominance, or epistatic effects. For more
than two markers on a chromosome, the simplifying assumption is that recombination
in one interval is independent of recombination in other intervals. This
assumption is called no interference, and under it the occurrence of crossovers
along a chromosome can be regarded as a Poisson process. Therefore, Haldane’s
mapping function (Haldane 1919) can be used to describe the relationship between
the recombination fraction r and the genetic distance x.
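Haldane's mapping function has the closed form r = (1 - exp(-2x))/2, with x measured in Morgans, and its inverse is x = -ln(1 - 2r)/2. A minimal sketch of both directions follows. The function names are hypothetical, and the last lines check the property that follows from no interference: map distances add across adjacent intervals, while recombination fractions combine as r = r1 + r2 - 2*r1*r2.

```python
import math


def haldane_r_from_x(x_morgans):
    """Recombination fraction r for a genetic distance x (Morgans)."""
    return 0.5 * (1.0 - math.exp(-2.0 * x_morgans))


def haldane_x_from_r(r):
    """Genetic distance x (Morgans) for a recombination fraction 0 <= r < 0.5."""
    if not 0.0 <= r < 0.5:
        raise ValueError("r must satisfy 0 <= r < 0.5")
    return -0.5 * math.log(1.0 - 2.0 * r)


for x_cM in (1, 10, 50, 100):
    print(x_cM, round(haldane_r_from_x(x_cM / 100.0), 4))

# Distances add across adjacent intervals; recombination fractions do not.
r1, r2 = haldane_r_from_x(0.10), haldane_r_from_x(0.20)
print(haldane_r_from_x(0.30), r1 + r2 - 2 * r1 * r2)
```

For short distances r is close to x (1 cM corresponds to about 1% recombination), and r approaches 0.5 as the distance grows.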
Statistical models are the means of obtaining QTL information from
experimental data through association analysis and statistical calculation. Without an
appropriate statistical model, there is no way to retrieve QTL information from the
experimental data, which include quantitative phenotypes and molecular markers.
The statistical model is therefore critical for mapping QTLs, and a large number of new
models have been proposed since the 1980s (Weller 1986, Lander and Botstein 1989,
Haley and Knott 1992, Jansen 1992, 1993, Zeng 1993, 1994, Zhu 1998, Kao et al. 1999
etc.).
These statistical models (methods) can be classified according to the number of markers
used or the techniques applied (Liu 1997, Hoeschele et al. 1991). Classified by
marker number, they include “single marker methods”, “flanking marker
methods” and “multiple marker methods”. Classified by technique, they include “least
squares methods”, “regression methods”, “maximum likelihood methods”, and “mixed
linear model approaches”. In summary, these methods range from
simple to complicated, from detecting QTL-marker associations to locating QTL
positions and estimating their effects, and from low resolution and power to high
resolution and power. These methods are discussed in more detail in later chapters.
A calculator can be used to solve statistical problems when the data set is not
very large and the method is not too complicated. However, computer programs are
usually used to analyse data sets by statistical means. Several
commercial software packages currently exist for statistical analysis. These
general-purpose packages include SAS, SPSS, SPLUS, and
STATISTICA. It is possible to use such software for QTL mapping
analysis (Haley and Knott, 1992). However, the methods for QTL mapping are
usually complicated and not standardized, so mapping QTLs with these packages is
usually inefficient and sometimes impossible. Therefore, many
computer programs based on specific statistical methods have been developed for
QTL mapping (Lander and Botstein 1989, Basten et al. 1994, Wang 1999).
Mapmaker/QTL (Lander et al. 1987), based on the classical interval mapping principles,
is one of the popular QTL mapping programs. It has
versions for PC, Macintosh, and UNIX systems and uses a command-driven user
interface: a series of commands must be executed for the different stages,
such as data input, the various mapping functions, and output of the results.
QTL Cartographer (Basten et al. 1994) is another popular QTL mapping program,
developed according to Zeng’s composite interval mapping method (Zeng 1994). The
software has versions for PC and UNIX. However, the original software
uses several commands to carry out the mapping tasks, which is sometimes confusing. We
have developed a Windows version of QTL Cartographer that has a
user-friendly interface and graphical presentation of results. The new
version of QTL Cartographer should be much easier to use; the software is
described in more detail later.
Other software is also available for QTL mapping, such as QTLSTAT (Liu and
Knapp, 1992), PGRI (Lu and Liu, 1995), MAPQTL (Van Ooijen and Maliepaard
1996) and Map Manager QTL (Manly et al. 1996). Obviously, these programs are not as
popular as Mapmaker/QTL and QTL Cartographer. However, it is believed that QTL
mapping software based on new methods will gradually be accepted by genetic
researchers over time. Advanced statistical methods and a good user interface should
be the most important factors for this kind of software.
1-5 Simulation vs. Real Data
A statistical model is used for describing a real biological or genetic system.
Because such a system is so complicated and some factors are unknown, it is
impossible to include all the factors (parameters) in a model. Therefore, it is
reasonable that there are several statistical models for QTL mapping research. Some
of these are quite complicated and others may be very simple. The properties of
an estimator for the statistical model can be obtained parametrically if the distribution
of the estimator is known and well characterized. However, in most models for QTL
mapping, it is usually too complicated to get the properties of the estimators
parametrically. Therefore, computer simulation is necessary for obtaining the
properties and checking the performance of the models and methods. There is no way
to examine a model’s performance by using real (experimental) data because the
true parameters are unknown. The advantage of using computer simulation data is that
we know the true parameters, which can be compared with the estimates from the
model.
The data for QTL mapping have two components: the map information and the cross
information. The map information data set contains the marker positions and orders
for each chromosome or linkage group of an experimental organism. Figure 1-1 is the
estimated genetic map for the X chromosome of the mouse and Table 1-1 is the map
data in QTL Cartographer format.
Figure 1-1. Marker information for the X chromosome in the mouse data. The numbers are distances in cM between two markers and the labels are the marker names.
It is very easy to use the above formula to calculate the pairwise recombination
frequency between each pair of markers. From these calculations, we can decide the
linkage groups. A linkage group is a group of markers in which each marker is linked
(r < 0.5) to at least one other marker. If a marker is not linked to any marker in a
linkage group, it does not belong to that group and most likely belongs to some other
linkage group. In theory, the number of linkage groups should equal the number of
chromosomes. However, sometimes the number of linkage groups is greater than the
number of chromosomes because of sampling variance and the limited sample size. In
other words, some of the recombination events are not detected by the experiment. In
this case, it is necessary to increase the sample size or to do more experiments.
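The grouping rule described above (a marker joins a group when it is linked to at least one member) amounts to finding the connected components of a linkage graph. The following is a minimal Python sketch under stated assumptions: the marker names, the pairwise recombination frequencies, and the cut-off `r_max` are all hypothetical, and real mapping software would normally decide linkage with a LOD or chi-square criterion rather than a bare threshold on r.

```python
def linkage_groups(markers, r, r_max=0.5):
    """Group markers so that each marker is linked (r < r_max) to at least
    one other marker of its group (connected components of the linkage graph)."""
    parent = {m: m for m in markers}

    def find(m):
        # Follow parent links up to the representative of m's group.
        while parent[m] != m:
            parent[m] = parent[parent[m]]
            m = parent[m]
        return m

    for (a, b), r_ab in r.items():
        if r_ab < r_max:
            parent[find(a)] = find(b)  # merge the two groups

    groups = {}
    for m in markers:
        groups.setdefault(find(m), []).append(m)
    return sorted(sorted(g) for g in groups.values())


# Hypothetical pairwise recombination frequencies (illustration only).
EXAMPLE_R = {
    ("m1", "m2"): 0.10, ("m2", "m3"): 0.20, ("m1", "m3"): 0.27,
    ("m4", "m5"): 0.15,
    ("m1", "m4"): 0.50, ("m1", "m5"): 0.52, ("m2", "m4"): 0.49,
    ("m2", "m5"): 0.50, ("m3", "m4"): 0.51, ("m3", "m5"): 0.50,
}
MARKERS = ["m1", "m2", "m3", "m4", "m5"]

print(linkage_groups(MARKERS, EXAMPLE_R, r_max=0.4))
print(linkage_groups(MARKERS, EXAMPLE_R, r_max=0.5))
```

With the stricter cut-off the sketch returns two groups; with r < 0.5 the single estimate of 0.49 is enough to merge them. This illustrates why sampling variance in small samples can change the number of linkage groups.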
Figure 1-2. A linkage group structure for simulation study. Numbers above the markers are distances between two adjacent markers in cM and numbers below are marker names.
    Marker order:    2 - 3 - 5 - 1 - 4
    Distances (cM):  20.3, 17.6, 7.8, 21.0
Table 1-5. Simulation data set of marker genotypes for a Backcross population.
                     Markers                                Markers
Individual    1    2    3    4    5      Individual    1    2    3    4    5
     1       AA   AA   AA   Aa   AA          16       AA   AA   Aa   AA   AA
     2       AA   Aa   Aa   AA   AA          17       Aa   Aa   Aa   Aa   Aa
     3       Aa   Aa   Aa   Aa   Aa          18       Aa   Aa   Aa   Aa   Aa
     4       AA   Aa   Aa   AA   Aa          19       Aa   Aa   Aa   AA   Aa
     5       AA   AA   AA   AA   AA          20       Aa   Aa   Aa   Aa   Aa
     6       AA   Aa   AA   AA   AA          21       AA   AA   AA   AA   AA
     7       AA   AA   Aa   Aa   Aa          22       AA   AA   AA   AA   AA
     8       AA   AA   AA   AA   AA          23       Aa   Aa   Aa   Aa   Aa
     9       Aa   Aa   Aa   Aa   Aa          24       Aa   AA   Aa   AA   Aa
    10       AA   AA   AA   Aa   AA          25       Aa   Aa   Aa   Aa   Aa
    11       Aa   Aa   Aa   Aa   Aa          26       AA   AA   AA   AA   AA
    12       AA   AA   AA   AA   Aa          27       AA   Aa   Aa   AA   Aa
    13       AA   Aa   Aa   AA   AA          28       AA   AA   AA   AA   AA
    14       Aa   Aa   Aa   Aa   Aa          29       Aa   AA   Aa   Aa   Aa
    15       AA   AA   AA   AA   AA          30       AA   AA   AA   AA   AA
Table 1-5 is a simulation data set that includes the marker genotypes of a B1 population
with 5 markers and 30 individuals produced from the linkage structure shown in
Figure 1-2. Here the Haldane map function has been used. The numbers of
recombination events between two markers, which are the counts of changes from
genotype AA to Aa or from genotype Aa to AA, are presented in Table 1-6. Table
1-6 also includes the recombination frequencies, which are the numbers of
recombination events divided by the total number of individuals, 30.
It is very important to know the markers’ orders and positions along the linkage
group or chromosome. We can estimate this information from the table of
recombination frequencies (Table 1-6). From Table 1-6 we know the smallest value is
0.13, which is the recombination frequency between markers 1 and 5 and between
markers 3 and 5. Here we choose 3-5 as the starting point (1-5 could also be chosen).
We then find the smallest value either from the marker 3 side (2-3 is 0.17) or from the
marker 5 side (5-1 is 0.13), and the new order becomes 3-5-1. The next marker picked
is 4 (1-4 is 0.17) and the new order is 3-5-1-4. Therefore the final order is 2-3-5-1-4.
After obtaining the marker order, it is easy to estimate the map distances between
markers by using the recombination frequencies and an appropriate map function. For
example, the recombination frequency between markers 2 and 3 is 0.17 and the
distance is 20.8 cM when calculated with formula (1-2). The final result is shown in
Figure 1-3.
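The counting, ordering, and distance steps of this example can be sketched in Python as follows. This is an illustration under stated assumptions, not the procedure of any particular mapping program: the genotypes are those of Table 1-5 coded as strings ("0" = AA, "1" = Aa), the function names are hypothetical, and formula (1-2) is assumed to be the Haldane function d = -50 ln(1 - 2r) cM, which reproduces the 20.8 cM quoted above when r is rounded to 0.17.

```python
import math

# Table 1-5, one string per individual for markers 1..5 ("0" = AA, "1" = Aa).
GENOTYPES = [
    "00010", "01100", "11111", "01101", "00000", "01000", "00111", "00000",
    "11111", "00010", "11111", "00001", "01100", "11111", "00000", "00100",
    "11111", "11111", "11101", "11111", "00000", "00000", "11111", "10101",
    "11111", "00000", "01101", "00000", "10111", "00000",
]


def recombination_counts(genotypes, n_markers=5):
    """Count the individuals whose genotypes differ between each marker pair."""
    counts = {}
    for a in range(n_markers):
        for b in range(a + 1, n_markers):
            counts[(a + 1, b + 1)] = sum(g[a] != g[b] for g in genotypes)
    return counts


def pair_count(counts, a, b):
    return counts[(min(a, b), max(a, b))]


def greedy_order(counts, start, n_markers=5):
    """Extend the chain from whichever end has the closest unplaced marker."""
    order = list(start)
    remaining = [m for m in range(1, n_markers + 1) if m not in order]
    while remaining:
        left = min(remaining, key=lambda m: pair_count(counts, order[0], m))
        right = min(remaining, key=lambda m: pair_count(counts, order[-1], m))
        if pair_count(counts, order[0], left) < pair_count(counts, order[-1], right):
            order.insert(0, left)
            remaining.remove(left)
        else:
            order.append(right)
            remaining.remove(right)
    return order


def haldane_cm(r):
    """Map distance in cM under the Haldane function (assumed formula 1-2)."""
    return -50.0 * math.log(1.0 - 2.0 * r)


counts = recombination_counts(GENOTYPES)
order = greedy_order(counts, start=(3, 5))
print("order:", order)
for a, b in zip(order, order[1:]):
    r = round(pair_count(counts, a, b) / len(GENOTYPES), 2)  # rounded as in the text
    print(f"{a}-{b}: r = {r:.2f}, distance = {haldane_cm(r):.1f} cM")
```

The distances are computed from frequencies rounded to two decimals because that is what the worked example in the text uses; the unrounded frequencies would give slightly different values.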
Table 1-6. The count (frequencies) of recombination events.
Figure 1-3. Estimated linkage group structure for the simulation data set.
    Marker order:               2 - 3 - 5 - 1 - 4
    Recombination frequencies:  0.17, 0.13, 0.13, 0.17
    Distances (cM):             20.8, 15.1, 15.1, 20.8
Comparing Figure 1-3 with Figure 1-2, the estimated marker order is correct but the
distances between markers are not very accurate. This is quite reasonable
considering such a small sample size (only 30 individuals). As the sample size
increases, the estimates will become more precise.
In this simple example, it seems quite easy to obtain the marker order and the
estimates of the marker distances by counting the recombination events. However, as
the marker number increases, the problem of ordering a set of genetic markers becomes
more difficult. This problem is equivalent to the famous “Travelling Salesman
Problem”. One of the criteria for comparing two different orders is to minimize the
Sum of Adjacent Recombination Fractions (SAR). For the above example, the SAR
value for the final order 2-3-5-1-4 is 0.17+0.13+0.13+0.17 = 0.60. Another criterion
is SAL, which stands for Sum of Adjacent Likelihood Functions.
The main problem in ordering the markers is not the criterion but the
computation time. As the marker number increases, the number of possible orders
quickly becomes computationally unmanageable. Therefore, the only practical
way to solve this problem is to find a better (not necessarily the best) order through
some kind of search procedure. Several methods have been proposed since 1980.
These methods include Branch and Bound (Thompson, 1984), Simulated Annealing
(Weeks and Lange, 1987), Seriation (Buetow and Chakravarti, 1987a, 1987b), and
Rapid Chain Delineation (Doerge, 1993). A number of software packages are available
for ordering markers and estimating the distances between markers; MAPMAKER
(Lander et al. 1987) is one of them.
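To make the computational argument concrete, the sketch below evaluates the SAR criterion for every order of the five markers of Table 1-5 and counts how many distinct orders exist for larger marker sets. It is an exhaustive search written only for illustration (none of the search methods cited above is implemented), the function names are hypothetical, and the genotype coding ("0" = AA, "1" = Aa) is an assumption of the example.

```python
import itertools
import math

# Table 1-5, one string per individual for markers 1..5 ("0" = AA, "1" = Aa).
GENOTYPES = [
    "00010", "01100", "11111", "01101", "00000", "01000", "00111", "00000",
    "11111", "00010", "11111", "00001", "01100", "11111", "00000", "00100",
    "11111", "11111", "11101", "11111", "00000", "00000", "11111", "10101",
    "11111", "00000", "01101", "00000", "10111", "00000",
]


def pairwise_frequencies(genotypes, n_markers=5):
    """Recombination frequency for every marker pair, keyed by the pair."""
    freq = {}
    for a, b in itertools.combinations(range(n_markers), 2):
        events = sum(g[a] != g[b] for g in genotypes)
        freq[frozenset((a + 1, b + 1))] = events / len(genotypes)
    return freq


def sar(order, freq):
    """Sum of Adjacent Recombination fractions for one marker order."""
    return sum(freq[frozenset(pair)] for pair in zip(order, order[1:]))


def best_order_by_sar(markers, freq):
    """Exhaustive search; an order and its reverse describe the same map."""
    candidates = (p for p in itertools.permutations(markers) if p[0] < p[-1])
    return min(candidates, key=lambda p: sar(p, freq))


def n_distinct_orders(n_markers):
    """Number of orders to evaluate: n!/2."""
    return math.factorial(n_markers) // 2


FREQ = pairwise_frequencies(GENOTYPES)
best = best_order_by_sar(range(1, 6), FREQ)
print("best order:", best, "SAR =", round(sar(best, FREQ), 2))
for n in (5, 10, 20):
    print(n, "markers:", n_distinct_orders(n), "orders")
```

For five markers only 60 orders need to be scored, but the count grows factorially, which is why heuristic searches are used in practice.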
3. Marker Segregation Analysis
It is also important to do the Mendelian segregation test for each marker to test for
segregation distortion. By expectation, the segregation ratio should be
1:1 for BC, DH, or RIL populations and 1:2:1 for the intercross population. In a
backcross population, the cross between A/A and A/a produces the zygotes AA and
Aa with the same expected number, n/2. Table 1-7 shows the expected and
observed numbers for the simulation data set shown in Table 1-5. A test statistic
can be constructed by using χ2 under the null hypothesis p(AA) = p(Aa) = 0.5
(Mendelian segregation), as shown in formula (1-4). In this example, the number of
individuals is n = 30, and n1 and n2 are the observed numbers of genotypes AA and
Aa at each marker position.
$$ \chi^2 = \sum_{i=1}^{2} \frac{(\mathrm{Obs.\#} - \mathrm{Exp.\#})^2}{\mathrm{Exp.\#}} = \frac{(n_1 - n_2)^2}{n} \sim \chi^2_{1} $$  (1-4)
Rejecting H0 means the deviation from Mendelian segregation is significant; this
phenomenon is called segregation distortion. Segregation distortion can be caused
by sampling variation. However, sometimes it has a genetic cause, such as different
selection forces on different types of zygotes. Significant segregation
distortion can bias the estimation of the recombination frequency (distance) between
markers. It can also reduce the power to identify QTLs and bias the estimation of QTL
positions and effects.
Table 1-7. Marker segregation analysis for the simulation data set.
Markers                  Marker 1      Marker 2      Marker 3      Marker 4      Marker 5
Genotypes                AA    Aa      AA    Aa      AA    Aa      AA    Aa      AA    Aa
1Frequency under H0      ½     ½       ½     ½       ½     ½       ½     ½       ½     ½
Expected number          15    15      15    15      15    15      15    15      15    15
Observed number          18    12      15    15      12    18      17    13      14    16
χ2 value                 1.20          0.00          1.20          0.53          0.13
p-value                  >0.250        >0.995        >0.250        >0.250        >0.500
1H0: null hypothesis.
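The χ2 values of Table 1-7 can be checked with a few lines of Python. This is only a sketch: the function names are hypothetical, and the p-value uses the identity P(χ2 with 1 d.f. > x) = erfc(sqrt(x/2)), which gives exact tail probabilities in place of the bracketed values printed in the table.

```python
import math


def segregation_chi2(n_aa, n_het):
    """Chi-square statistic for 1:1 segregation, (n1 - n2)^2 / n as in (1-4)."""
    n = n_aa + n_het
    return (n_aa - n_het) ** 2 / n


def chi2_pvalue_1df(x):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))


# Observed numbers (AA, Aa) per marker, taken from Table 1-7.
OBSERVED = {1: (18, 12), 2: (15, 15), 3: (12, 18), 4: (17, 13), 5: (14, 16)}

for marker, (n_aa, n_het) in OBSERVED.items():
    x = segregation_chi2(n_aa, n_het)
    print(f"Marker {marker}: chi2 = {x:.2f}, p = {chi2_pvalue_1df(x):.3f}")
```

None of the five markers comes close to the usual 5% critical value of 3.84, so no segregation distortion is indicated in this simulated data set.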
1-7 Purpose of This Research
The purpose of QTL mapping is to identify and locate QTLs along the chromosomes
of a species through special experimental designs and genetic marker information.
QTL information such as number, locations, and effects can help geneticists and
breeders to improve the quality and quantity of plants or animals. However, QTL
mapping methods are fundamentally based on statistical principles. It is important to
understand these principles before using a particular QTL mapping method to analyse
an experimental data set. Moreover, it is also useful to compare different QTL mapping
methods in order to understand their performance under different circumstances. This
kind of study can help users to choose the appropriate QTL mapping method according
to their experimental requirements and provides a basis for understanding the results
of a QTL mapping analysis.
In this research, large-scale computer simulations have been conducted to study
and compare the performance of the major QTL mapping methods. These
methods include the Interval Mapping method, the Composite Interval Mapping method,
and the Mixed-model based CIM mapping method. We have also conducted a series of
simulation studies to identify model selection criteria, which are the critical part of
multiple QTL mapping methods. The computer software accompanying a particular
QTL mapping method is very important because the method is usually too complicated
to use without it. However, most existing QTL mapping software uses a command-driven
interface and is usually not very convenient to use. We have developed QTL mapping
software with a user-friendly interface and result visualization. The software is called
“Windows QTL Cartographer” (Wang et al. 1999); it has been posted on the Internet
and has many users.
2. Review of Major QTL Mapping Methods
2-1 One Marker Method
The one-marker method is based on the simple idea that if there is an association
between marker type and trait value, it is likely that a QTL is close to that
marker locus. The approach has been applied in many studies of QTLs for various
organisms such as Drosophila (Thoday, 1961), maize (Edwards et al., 1987) and
tomato (Weller, Soller and Brody, 1988).
Table 2-1. Trait mean and distribution for various populations.
Population    1Genotype    Mean           Distribution
P1            MQ / MQ      µ1 = µ + a     N(µ1, σ2)
P2            mq / mq      µ2 = µ − a     N(µ2, σ2)
F1            MQ / mq      µ12 = µ + d    N(µ12, σ2)
1M or m means marker and Q or q indicates QTL.
Table 2-2. Frequencies and mean effects for various marker-QTL genotypes in B1 population.
MN / mn    (1 – r12)/2    MQN / mqn    (1−r1)(1−r2)/2    Pr(Qq) = [(1−r1)(1−r2)] / (1– r12) ≈ 1
1Probability of the marker class.
2Probability of the marker – QTL genotype.
3Conditional probability for the QTL genotype according to formula (2-5), here p equal to r1 / r12.
Under the assumption of no interference (Haldane), the relationship between r12
and r1, r2 is r12 = r1 + r2 − 2 r1 r2, while under complete interference it is
r12 = r1 + r2. When r12 is small, gamete frequencies are essentially
identical under either interference assumption. Because the QTL genotype is not
observed, we can only use the observable marker genotypes to infer it. Table 2-5
shows the probabilities of the QTL genotypes given the genotypes of the two flanking
markers.
2. Genetic Model
For a backcross population, to analyse a QTL located in an interval flanked by
markers M and N, the interval mapping method assumes the following linear model:
$$ y_j = \mu + b^* x_j^* + e_j, \qquad j = 1, 2, \ldots, n $$  (2-6)
where $b^*$ is the effect of the putative QTL,
$x_j^* = 1$ if the QTL genotype is QQ and $x_j^* = 0$ if the QTL genotype is Qq, and
$e_j \sim N(0, \sigma^2)$.
In the model, the variable x* indicates the QTL genotype, which is
unobserved. However, the probabilities of the possible QTL genotypes can be inferred
from the genotypes of the two flanking markers, as shown in Table 2-5 and summarized
in Table 2-6. For the backcross population, we can define
$$ p_{jk} = \mathrm{Prob}(x_j^* = k \mid M, N, p), \qquad k = 0, 1, $$
where $p = r_1 / r_{12}$ and the approximation is obtained by assuming that double
recombination events can be ignored.
Table 2-6. The probabilities of possible QTL genotypes conditional on marker classes.
                                       QTL Genotype
Marker Classes   Numbers   QQ (1)                             Qq (0)
MN / MN          n1        (1−r1)(1−r2) / (1−r12) ≈ 1         r1 r2 / (1−r12) ≈ 0
MN / Mn          n2        (1−r1) r2 / r12 ≈ 1 − p            r1 (1−r2) / r12 ≈ p
MN / mN          n3        r1 (1−r2) / r12 ≈ p                (1−r1) r2 / r12 ≈ 1 − p
MN / mn          n4        r1 r2 / (1−r12) ≈ 0                (1−r1)(1−r2) / (1−r12) ≈ 1
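The entries of Table 2-6 can be evaluated numerically to see how close the approximations 1, 0, p and 1 − p are. The sketch below assumes no interference, so that r12 = r1 + r2 − 2 r1 r2; the function name and the example values of r1 and r2 are hypothetical.

```python
def qtl_genotype_probs(r1, r2):
    """Return {marker class: (Pr(QQ), Pr(Qq))} for a backcross, assuming
    no interference so that r12 = r1 + r2 - 2*r1*r2."""
    r12 = r1 + r2 - 2.0 * r1 * r2
    return {
        "MN/MN": ((1 - r1) * (1 - r2) / (1 - r12), r1 * r2 / (1 - r12)),
        "MN/Mn": ((1 - r1) * r2 / r12, r1 * (1 - r2) / r12),
        "MN/mN": (r1 * (1 - r2) / r12, (1 - r1) * r2 / r12),
        "MN/mn": (r1 * r2 / (1 - r12), (1 - r1) * (1 - r2) / (1 - r12)),
    }


# Hypothetical example: QTL at r1 = 0.05 from marker M and r2 = 0.10 from marker N.
R1, R2 = 0.05, 0.10
R12 = R1 + R2 - 2.0 * R1 * R2
P = R1 / R12  # relative position of the putative QTL in the interval

for marker_class, (pr_qq, pr_qh) in qtl_genotype_probs(R1, R2).items():
    print(f"{marker_class}: Pr(QQ) = {pr_qq:.4f}, Pr(Qq) = {pr_qh:.4f}")
print(f"p = {P:.4f}, 1 - p = {1 - P:.4f}")
```

Within each marker class the two probabilities sum to one, and the gap between the exact values and p or 1 − p is of the order of the double-recombination term r1 r2 / r12.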
3. Maximum Likelihood Analysis
For model (2-6), there are two possible QTL genotypes, each of which can be true
with a certain probability. The distribution of the model is a mixture of normal
distributions and the likelihood function can be defined as
$$ L(\mu, b^*, \sigma^2) = \prod_{j=1}^{n} \left[ p_{j1}\,\phi\!\left(\frac{y_j - \mu - b^*}{\sigma}\right) + p_{j0}\,\phi\!\left(\frac{y_j - \mu}{\sigma}\right) \right] $$  (2-7)
where $\phi(z) = \frac{1}{\sqrt{2\pi}} e^{-z^2/2}$ is the standard normal density function.
In likelihood function (2-7), the parameters include:
$\mu$ - the mean of the model
$b^*$ - the effect of the putative QTL
$p = r_1 / r_{12}$ - the position of the putative QTL relative to the flanking markers
$\sigma^2$ - the residual variance of the model
The data for the analysis include:
$y_j$ - the phenotypic value of a quantitative trait for each individual
Genotypes of markers for each individual, which contribute to the analysis through
$p_{jk}$, k = 0, 1; j = 1, 2, …, n
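The maximum likelihood analysis of a mixture model is usually carried out with an
Expectation-Maximization (EM) algorithm. EM is an iterative procedure, and the E-step
for likelihood function (2-7) is to calculate:
$$ P_j = \frac{p_{j1}\,\phi\!\left(\frac{y_j - \mu - b^*}{\sigma}\right)}{p_{j1}\,\phi\!\left(\frac{y_j - \mu - b^*}{\sigma}\right) + p_{j0}\,\phi\!\left(\frac{y_j - \mu}{\sigma}\right)} $$
The M-step is to calculate:
$$ \hat{\mu} = \sum_{j=1}^{n} \left( y_j - P_j b^* \right) \Big/ n $$
$$ \hat{b}^* = \sum_{j=1}^{n} P_j \left( y_j - \mu \right) \Big/ \sum_{j=1}^{n} P_j $$
$$ \hat{\sigma}^2 = \frac{1}{n} \sum_{j=1}^{n} \left[ \left( y_j - \mu \right)^2 - P_j\, b^{*2} \right] $$
This process is iterated until the estimates converge.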
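The E-step and M-step above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used by any mapping program: the simulated backcross data, all function names, and the parameter values are hypothetical; the prior probabilities are the exact no-interference entries of Table 2-6; within one M-step the newly updated mean is used when updating the effect and the variance; and the normal density includes the 1/σ factor so that the log-likelihood can be used for a LOD score.

```python
import math
import random


def normal_pdf(y, mean, sd):
    z = (y - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def prior_qq(m, n, r1, r2):
    """Pr(x* = 1 | flanking markers) from Table 2-6 (1 = homozygous marker)."""
    r12 = r1 + r2 - 2.0 * r1 * r2
    if m == n:
        same = (1 - r1) * (1 - r2) / (1 - r12)
        return same if m == 1 else 1.0 - same
    flank = (1 - r1) * r2 / r12
    return flank if m == 1 else 1.0 - flank


def simulate_backcross(n, mu, b, sd, r1, r2, rng):
    """Hypothetical data: trait values and prior QQ probabilities."""
    y, p1 = [], []
    for _ in range(n):
        x = rng.randint(0, 1)                     # QTL genotype: 1 = QQ, 0 = Qq
        m = x if rng.random() >= r1 else 1 - x    # left marker recombines with prob. r1
        k = x if rng.random() >= r2 else 1 - x    # right marker recombines with prob. r2
        y.append(mu + b * x + rng.gauss(0.0, sd))
        p1.append(prior_qq(m, k, r1, r2))
    return y, p1


def log_likelihood(y, p1, mu, b, var):
    sd = math.sqrt(var)
    return sum(math.log(p * normal_pdf(v, mu + b, sd) + (1 - p) * normal_pdf(v, mu, sd))
               for v, p in zip(y, p1))


def em_interval_mapping(y, p1, max_iter=1000, tol=1e-10):
    n = len(y)
    mu = sum(y) / n
    var = sum((v - mu) ** 2 for v in y) / n
    b = 0.0
    for _ in range(max_iter):
        sd = math.sqrt(var)
        # E-step: posterior probability P_j that individual j has genotype QQ.
        post = []
        for v, p in zip(y, p1):
            qq = p * normal_pdf(v, mu + b, sd)
            qh = (1 - p) * normal_pdf(v, mu, sd)
            post.append(qq / (qq + qh))
        # M-step: the three updates given in the text.
        new_mu = sum(v - P * b for v, P in zip(y, post)) / n
        new_b = sum(P * (v - new_mu) for v, P in zip(y, post)) / sum(post)
        new_var = sum((v - new_mu) ** 2 - P * new_b ** 2 for v, P in zip(y, post)) / n
        done = abs(new_mu - mu) + abs(new_b - b) + abs(new_var - var) < tol
        mu, b, var = new_mu, new_b, new_var
        if done:
            break
    return mu, b, var


def lod_score(y, p1):
    """log10 of the likelihood ratio between H1 (EM estimates) and H0 (b* = 0)."""
    n = len(y)
    mu0 = sum(y) / n
    var0 = sum((v - mu0) ** 2 for v in y) / n
    mu, b, var = em_interval_mapping(y, p1)
    return (log_likelihood(y, p1, mu, b, var) - log_likelihood(y, p1, mu0, 0.0, var0)) / math.log(10.0)


y, p1 = simulate_backcross(1000, mu=10.0, b=1.0, sd=1.0, r1=0.05, r2=0.05, rng=random.Random(1))
print("mu, b*, sigma^2:", em_interval_mapping(y, p1))
print("LOD:", lod_score(y, p1))
```

In a genome scan the same calculation would be repeated at each tested position, with the prior probabilities recomputed from the r1 and r2 of that position.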
4. Likelihood Ratio Test
The test statistic can be constructed using a likelihood ratio expressed as a LOD
(log of odds) score:
$$ LOD = -\log_{10} \frac{L(\hat{\mu}_0,\, b^* = 0,\, \hat{\sigma}_0^2)}{L(\hat{\mu},\, \hat{b}^*,\, \hat{\sigma}^2)} $$
under the hypotheses
$$ H_0: b^* = 0 \quad \text{and} \quad H_1: b^* \neq 0 $$
Assuming that the putative QTL is located at the position indicated by
$p = r_1 / r_{12}$, we can get the maximum likelihood estimates of $\mu, b^*, \sigma^2$
under H1 as $\hat{\mu}, \hat{b}^*, \hat{\sigma}^2$ and under H0 as
$\hat{\mu}_0, \hat{\sigma}_0^2$ with $b^*$ constrained to zero. The LOD score test
is essentially the same test as the usual likelihood ratio test:
$$ LR = -2 \ln \frac{L(\hat{\mu}_0,\, b^* = 0,\, \hat{\sigma}_0^2)}{L(\hat{\mu},\, \hat{b}^*,\, \hat{\sigma}^2)} $$
and the relationship between the LOD value and the LR value is
$$ LOD = \tfrac{1}{2} (\log_{10} e)\, LR = 0.217\, LR $$
The test can be performed at any position covered by markers, and thus the method
provides a systematic strategy of searching for QTL. The amount of support for a QTL
at a particular map position is often displayed graphically as a likelihood map
(profile), which plots the likelihood ratio test statistic as a function of the
map position of the putative QTL. If the LOD score in a region exceeds a pre-defined
critical threshold, a QTL is indicated in the neighbourhood of the maximum of the
LOD score, with the width of the neighbourhood defined by a one- or two-LOD support
interval (Lander and Botstein 1989). By the properties of maximum likelihood
analysis, the estimates of the locations and effects of QTL are asymptotically unbiased
if the assumption that there is at most one QTL on a chromosome is true.
The test statistic LR at a given position is expected to be asymptotically
chi-square distributed with one degree of freedom under the null hypothesis for the
backcross design and with two degrees of freedom for the F2 design (Lander and
Botstein 1989, Van Ooijen 1992, Zeng 1994). However, because the test is usually
performed over the whole genome, there is a multiple testing problem. The distribution
of the maximum LR or LOD score over the whole genome under the null hypothesis
becomes very complicated. An asymptotic theory based on an
Ornstein-Uhlenbeck diffusion process for determining appropriate genome-wise
critical values has been developed by Lander and Botstein (1989), Feingold et al.
(1993) and Lander and Schork (1994). Lander and Botstein (1989) suggested that a
typical LOD score threshold should be between 2 and 3 to ensure a 5% overall false
positive rate for detecting QTL.
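The relation LOD = 0.217 LR, together with the chi-square approximation with one degree of freedom, gives a quick way to see why a genome-wide threshold must be far stricter than a single-test 5% level. The sketch below is illustrative only: the function names are hypothetical, and the p-value it prints is the pointwise (single position, backcross) value, not a genome-wide one.

```python
import math


def lod_to_lr(lod):
    """LR = 2 ln(10) * LOD, the inverse of LOD = 0.217 LR."""
    return 2.0 * math.log(10.0) * lod


def lr_to_lod(lr):
    return 0.5 * math.log10(math.e) * lr


def pointwise_pvalue_1df(lr):
    """Tail probability of a chi-square with 1 d.f. at a single tested position."""
    return math.erfc(math.sqrt(lr / 2.0))


for lod in (2.0, 2.5, 3.0):
    lr = lod_to_lr(lod)
    print(f"LOD = {lod:.1f}  LR = {lr:.2f}  pointwise p = {pointwise_pvalue_1df(lr):.1e}")
```

A LOD threshold of 2.5 corresponds to LR of about 11.5, well above the single-test 5% critical value of 3.84 for one degree of freedom.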
2-3 Composite Interval Mapping
For the interval mapping method, the estimated locations and effects of QTL tend to
be asymptotically unbiased if there is only one segregating QTL on a chromosome.
However, if there is more than one QTL on a chromosome, the test statistic at the
position being tested will be affected by all those QTL, and the estimated positions and
effects of the QTL identified by this method are likely to be biased. This is the
‘ghost QTL’ problem. One of the reasons for these shortcomings is that the test used
in the interval mapping method is not an interval test. In an interval test, the effect
of the QTL within a defined interval should be independent of the effects of QTL
outside the region. Otherwise, even when there is no QTL within an interval, the
likelihood profile on the interval can still exceed the threshold significantly if there
is a QTL in some nearby region on the same chromosome.
In order to overcome the shortcomings of the interval mapping method, Zeng (1994)
proposed an improved method called composite interval mapping, which combines
interval mapping with multiple regression analysis. Let us first review some relevant
theory of multiple regression analysis for QTL mapping (Zeng 1993).
1. Properties of Multiple Regression Analysis
Due to the linear structure of the locations of genes on chromosomes, multiple
regression analysis has a very important property: the partial regression
coefficient of a trait on a marker is expected to depend only on those QTLs that are
located in the intervals bracketed by the two neighbouring markers. It is independent
of any other QTL outside the region if there is no crossing-over interference and no
epistasis. However, interference and epistasis will introduce non-linearity into the
model.
Suppose we regress the trait value y on t markers observed in a B1 population:
$$ y_j = \mu + \sum_{k=1}^{t} b_k x_{jk} + e_j $$
where $x_{jk}$ is the indicator value (1 or 0) of the kth marker in the jth individual,
and $b_k$ is the partial regression coefficient of the phenotype y on the kth marker
conditional on all other markers. $b_k$ can also be denoted as $b_{yk \cdot s_k}$, where
$s_k$ denotes the set which includes all markers except the kth marker.
Since $x_{jk}$ takes a value of 1 or 0 with equal probability, the variance of the kth
marker in the population is $\sigma_k^2 = 1/4$. It is easy to show that the covariance
between the ith and kth markers is $\sigma_{ik} = (1 - 2 r_{ik})/4$, where $r_{ik}$ is the
recombination frequency between marker i and marker k. The covariance between the
trait value y and the kth marker is:
$$ \sigma_{yk} = \sum_{u=1}^{m} (1 - 2 r_{uk})\, \delta_u / 4 $$
where $\delta_u$ is the effect of the uth QTL and $r_{uk}$ is the recombination frequency
between the uth QTL and the kth marker.
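The marker variance of 1/4 and the covariance (1 − 2r)/4 can be checked with a small simulation. The sketch below is an illustration only: the function names, the sample size, and the recombination frequency r = 0.2 are hypothetical choices.

```python
import random


def simulate_marker_pair(n, r, rng):
    """Backcross indicator values (1 or 0) for two markers with recombination r."""
    xi, xk = [], []
    for _ in range(n):
        a = rng.randint(0, 1)                    # first marker, 1 or 0 with equal probability
        b = a if rng.random() >= r else 1 - a    # second marker differs after a recombination
        xi.append(a)
        xk.append(b)
    return xi, xk


def covariance(u, v):
    n = len(u)
    mean_u = sum(u) / n
    mean_v = sum(v) / n
    return sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v)) / n


r = 0.2
xi, xk = simulate_marker_pair(50000, r, random.Random(3))
print("variance of marker i:", round(covariance(xi, xi), 4), "expected 0.25")
print("covariance:", round(covariance(xi, xk), 4), "expected", (1 - 2 * r) / 4)
```

Unlinked markers (r = 0.5) give an expected covariance of zero, which is the basis of the independence property described above.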
With these basic equations, any conditional variance and covariance can be
derived. The variance of marker k conditional on marker i is:
where $r_{ij}$ is the recombination frequency between QTL i and QTL j.
2. Parameter Setting
The first step in producing simulation data is to set the mapping parameters, such
as the experimental population (B1 or DH), sample size (n), trait mean (µ), map
function (Haldane or Kosambi), and marker genotype coding (for example, 1 for one
genotype and 0 for the other). In particular, it is important to define the chromosome
information, such as the number of chromosomes and the number and positions of the
markers on each chromosome. Table 3-1 shows an example of parameter settings for the
mapping information.
Table 3-1. An example of parameters setting for simulation mapping information.
                                                                        Marker genotype
Population   Sample Size   Trait Mean   Map Function   Chromosomes     Mm      MM
B1           200           15.8         Haldane        9               1       0
The second step is to set the parameters of the QTLs, such as the heritability (h2),
the ratio C of epistatic variance to additive variance, which is defined as VI / VA
(see formulae 3-2A and 3-3A), and the QTL number, positions, and effects. One example
of the parameter settings is shown in Table 3-2. Using this information, it is easy to
produce the additive (α) – epistatic (β) upper-triangle matrix shown in Table 3-3. The
QTL effects can be adjusted according to h2, C, and Ve as follows.
Table 3-2. An example of parameters setting for QTL information.
QTL Number   Heritability   C = VI / VA   Additive Effect          Epistatic Effect
8            0.6            0.1           1Sign: Both (1:3)        Sign: Same
                                          2Distribution: γ-2.1     Distribution: γ-0.3
1Effects can be in the same direction or in both directions, in which case a ratio can be indicated. 2Effects can be chosen from different distributions, such as gamma (with one parameter), normal or even.
Assume the heritability is $h^2$ and $C = V_I / V_A$; then $V_e = \frac{1 - h^2}{h^2} V_G$.
Note: We can use formulae (3-2A), (3-2B) and (3-3A), (3-3B) to calculate VI and
VA. After setting the values of αi and βij, the values of βij should be adjusted
according to the value of C.
If $R = \frac{V_I}{C\,V_A} \neq 1$, then $\beta_{ij} = \beta_{ij} / \sqrt{R}$ to ensure that R = 1 and $C = V_I / V_A$.
Finally, the QTL effects are standardized by adjusting the values of α and β to
$\alpha / \sqrt{V_e}$ and $\beta / \sqrt{V_e}$, so that the value of $V_e$ is equal to 1.
Table 3-3. An example of simulation parameters setting for positions and effects of QTLs. Here, VA = 1.364, VI = 0.136, Ve = 1.0, C = VI / VA = 0.10, h2 = 0.60.
Chromosome        1      1      3      3      7      7      7      9
Positions (cM)    11.7   31.8   9.1    43.1   11.8   40.2   65.9   21.9
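The variance settings quoted with Table 3-3 can be checked against the heritability relation. The sketch below assumes that the total genetic variance is VG = VA + VI and that h2 = VG / (VG + Ve); both are assumptions of this illustration, since formulae (3-2A) to (3-3B) are not reproduced here, and the function names are hypothetical.

```python
def residual_variance(h2, v_a, v_i):
    """V_e implied by h2 = V_G / (V_G + V_e), assuming V_G = V_A + V_I."""
    v_g = v_a + v_i
    return (1.0 - h2) / h2 * v_g


def epistatic_ratio(v_a, v_i):
    """C = V_I / V_A."""
    return v_i / v_a


# Values quoted with Table 3-3.
V_A, V_I, H2 = 1.364, 0.136, 0.60

print("Ve =", round(residual_variance(H2, V_A, V_I), 3))
print("C  =", round(epistatic_ratio(V_A, V_I), 2))
```

With these values the residual variance comes out at 1.0 and C at 0.10, in agreement with the standardized setting of Table 3-3.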
The density of the genetic markers affects both the power of QTL detection and the
probability of false QTL detection, as shown in Table 3-9.
When marker density increases, there is no apparent gain of power for detecting QTLs
with large effects (Q2-1L and Q2-2L) by any of the three QTL mapping methods. But the
MCIM method tends to be more powerful than the other two methods (IM and CIM) for
detecting QTLs with small effects. For linked QTLs with opposite effects (Q1-2M and
Q1-3M), the MCIM method shows a great improvement in detection power, while the CIM
method performs quite poorly. This may suggest that increasing marker density is
sometimes even harmful for the CIM method. The QTL Q1-3M still cannot be detected by
the IM method as the marker density increases.
The impact of the sample size on the power of QTL detection and the probability
of false QTL detection is shown in Table 3-10. Basically, the power of QTL
detection increases with the sample size for all three mapping methods.
In particular, the CIM method improves greatly in both the power of QTL detection
and the probability of false QTL detection when the sample size is increased to
300.
Table 3-9. Power of QTL detection and the probability of false QTL detected under Model-II when chromosomes = 3, average marker distance = 4 cM, and threshold value is LOD = 2.5.
1Probability of false QTL detected in whole genome.
Table 3-10. Power of QTL detection and the probability of false QTL detected under Model-II for different sample sizes (threshold value with LOD = 2.5).
The performance of a QTL mapping analysis is also affected by the tuning
parameters of the method itself. Before the analysis, the CIM
method requires parameters such as the “window size” and the “control marker
number” to be set. In this simulation study, we simply use the default parameters,
which are 10 cM for the “window size” and 5 for the “control marker number”. However,
changing these parameters in the CIM method sometimes has a great influence on the
power of QTL detection and the probability of false QTL detection, as shown in
Table 3-11. On the other hand, because the MCIM method treats the background
control markers as random effects, the influence of the control markers is much
smaller than in the CIM method.
Table 3-11. Power of QTL detection and the probability of false QTL detected under different number of background control markers in model-II (threshold value with LOD = 2.5).
The position estimates and the 95% experimental confidence intervals (ECI) for the
detected QTLs are summarized in Table 3-12 for Model-I and
Model-II with the threshold set to LOD = 2.5. For the two QTLs with large effects,
the position estimates are quite accurate, with small ECIs, for all three mapping
methods. The average range of the ECI is 14 cM, 8.3 cM, and 9.5 cM for the IM, CIM,
and MCIM methods, respectively. Unlike for the CIM and MCIM methods, the average
range of the ECI increases considerably (from 11 cM to 17 cM) from Model-I to
Model-II for the IM method. For QTLs with medium effects, the position estimates
become less accurate and the ECIs become larger. For example, the average range of
the ECI is almost doubled for the medium-effect QTLs in Model-I with the CIM and MCIM
methods (15 cM for CIM and 20 cM for MCIM). For the two small-effect QTLs, it is
difficult to obtain a good position estimate and a reasonable ECI because this kind of
QTL is detected only a few times in 500 replications, owing to the extremely low power
of QTL detection.
For the single-QTL Model-I, the estimated effects of the two large QTLs
(Q1-1L and Q4-1L) are unbiased, as shown in Table 3-13. However, the
effects tend to be overestimated for the QTLs with medium and
small effects. The reason is that the detection power for this kind of QTL is much less
than 100%. That is, in each replication we only pick an LR peak that exceeds the
predefined threshold value as an identified QTL. Clearly, a large
LR peak tends to come with a large estimate of the QTL effect compared with a small LR
peak. Therefore, in a real QTL mapping situation, a QTL identified with a
medium or small effect is likely to have a slightly overestimated effect. The
overestimation of QTL effects could be larger for two linked QTLs, such as Q2-1L and
Q2-2L in Model-II. This may imply that QTL linkage affects the estimation of
QTL effects. Comparing the three QTL mapping methods, the CIM method performs
well in the estimation of QTL effects and the ECI for QTLs with medium effects,
partially due to its high power of QTL detection for this kind of QTL.
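The selection effect described above (only replications whose peak exceeds the threshold contribute an effect estimate) can be reproduced with a deliberately simplified sketch. It is not the simulation design of this chapter: the effect estimate is assumed to be normally distributed around the true effect with a known standard error, LR is approximated by the squared z-score, and the function name and all numerical values are hypothetical.

```python
import math
import random


def mean_detected_effect(true_effect, se, lod_threshold, n_rep, rng):
    """Return (power, mean estimate among detected replications)."""
    # LR threshold equivalent to the LOD threshold; LR is approximated by z^2.
    z_crit = math.sqrt(2.0 * math.log(10.0) * lod_threshold)
    estimates = (rng.gauss(true_effect, se) for _ in range(n_rep))
    detected = [est for est in estimates if abs(est) / se > z_crit]
    power = len(detected) / n_rep
    mean_est = sum(detected) / len(detected) if detected else float("nan")
    return power, mean_est


rng = random.Random(5)
for label, effect in (("small", 0.3), ("large", 1.0)):
    power, mean_est = mean_detected_effect(effect, se=0.15, lod_threshold=2.5,
                                           n_rep=20000, rng=rng)
    print(f"{label}: true = {effect}, power = {power:.3f}, mean detected = {mean_est:.3f}")
```

Under these assumptions the large effect is detected almost every time and its mean estimate stays close to the true value, while the small effect is detected rarely and its mean estimate among detections is well above the true value.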
Table 3-12. The simulation results of the position for the detected QTLs under Model-I and Model-II when the threshold value is set to LOD = 2.5.
QTL 1Pos IM CIM MCIM Genome 2Est 3ECI Est ECI Est ECI
For the three QTL mapping methods (IM, CIM, and MCIM), the average
mapping results for the two QTL models are shown in Figure 3-1. For Model-I, all
three mapping methods performed quite well, giving unbiased estimates of
QTL positions and effects and LR values that depended on the QTL effects. For
Model-II, the two QTLs with small effects (Q3-1S and Q3-2S) are undetectable by all
three methods of QTL mapping. All three methods have very high power to
detect QTLs with large effects (Q2-1L and Q2-2L). However, compared with the CIM and
MCIM methods, the IM method shows more noise between these two QTLs with large
effects, and this kind of noise could be harmful if these LR peaks were considered
as QTLs. For the three QTLs with medium effects on chromosome 1, the highest LR
value is obtained by the CIM method for Q1-1M and Q1-3M, but by the MCIM method for
Q1-2M. The IM method has a very low LR value for Q1-3M, possibly because it has no
detection power for linked QTLs with opposite effects. For Q1-2M, the CIM method has
a very low LR value because of the closeness of the first two QTLs.
Figure 3-1. The average (over 500 simulations) QTL mapping LR profiles and additive effect profiles for the two QTL setting models. The long vertical bars are chromosomes and the short vertical bars are QTL positions and effects. The small dots distributed along the horizontal bars are genetic markers. Only the chromosomes with QTL (1, 2, 3) are shown for Model-II.
    Model-I QTLs:   Q1-1L  Q2-1M  Q3-1S  Q4-1L  Q5-1M  Q7-1M  Q9-1S
    Model-II QTLs:  Q1-1M  Q1-2M  Q1-3M  Q2-1L  Q2-2L  Q3-1S  Q3-2S
3-4 Consider the Complicated QTL Mapping Situations
In Section 3-3, the performance of the IM, CIM, and MCIM QTL mapping methods
under the simple additive situation was studied. However, in many real QTL
mapping experiments, more complicated situations such as QTL by environment
interaction and QTL epistasis generally exist. In this section, we conduct
simulation studies of the IM and CIM methods under these complicated QTL
mapping situations. The performance of the MCIM method is also studied when the
corresponding mixed linear models are used for the QTL by environment model (Model
2-13) and the QTL epistatic model (Model 2-14).
1. Parameters Setting
To study QTL by environment interaction, Model-AE for the
simulation study is based on the following parameter settings: 300 replications in
total; a DH population with 100 individuals; 3 environments, each with
2 repeats; a population mean of 15.6; the Haldane mapping function; a genome of 3
chromosomes, each with 11 markers; and an average marker distance of 10 cM, with
positions having a certain deviation. Each chromosome has one QTL and the
heritability is 0.36. The QTL positions, the QTL main effects and the QE interaction
effects are shown in Table 3-14. Notice that among these three QTLs, Q1-1 has a large
QE interaction effect but no main effect. Q2-1 and Q3-1 have the same QTL main
effect. Q2-1 has QE interaction effects but Q3-1 has none.
Table 3-14. QTL parameters setting of Model-AE for simulation study of QTL by environment interaction.
1Use data from all environments together. 2E1, E2, and E3 represent only using the data from environment 1, environment 2, and environment 3 respectively.
The power of QTL detection and the probability of false QTL detection when using
data from the various environments are shown in Table 3-17. For the first QTL (Q1-1),
there is no detection power when using data from all environments together because
its main effect is 0. The power is quite low when using the data of environment
1 or environment 2 because of the small effects of QE1 and QE2. However, the power of
QTL detection is quite high in environment 3 because the effect of QE3 is
relatively high (0.72). Q2-1 has both a main effect and QE interaction effects, and the
power of QTL detection is quite high when using data from all environments together
and from environment 1 or environment 2. It is interesting that
the power of QTL detection in environment 3 is almost 0. The reason is not
that the QE3 interaction effect is zero but that the main effect and the QE3
effect cancel out. For the last QTL (Q3-1), the difference in the power of QTL
detection between using data from all environments together and using data from only
one environment is caused by the change in sample size and not by QE interaction
effects.
In terms of QTL detection power, the overall performance of the IM method and
the CIM method is quite similar in this simulation. There are two possible reasons.
First, the IM method performs quite well under the one-QTL model;
second, the small sample size (only 100 individuals, and the 2 repeats seem to be of
little help) does more harm to the CIM method than to the IM method. However, in terms
of the probability of false QTL detection, the CIM method still performs much better
than the IM method.
Table 3-17. The power of QTL detection and the probability of false QTL detected for IM and CIM methods under various environments (Threshold LOD = 2.5).
            E1-3              E1                E2                E3
QTLs     IM      CIM      IM      CIM      IM      CIM      IM      CIM
Q1-1     1.95    0.98    14.33   21.00    15.67   23.67    91.67   96.67
Q2-1    98.05  100       94.33   97.33    92.67   86.67     0.67    0.00
Q3-1    97.07   98.05    51.67   51.67    45.33   50.33    61.33   73.00
1QTL positions in cM. 2Average QTL additive effects for the detected QTLs.
3. Using MCIM Method
- QTL by Environment Interaction
Using the mixed linear model approach (2-13), the MCIM method can analyse the QTL
mapping data from all environments together. As shown in Table 3-23, the simulation
results indicate that the estimation of QTL main effects and the prediction of QTL
by environment interaction effects are unbiased. On the other hand, it is difficult
to obtain unbiased estimates or predictions of QE interaction effects by using the
IM or CIM method (Table 3-16).
Table 3-23. Estimation of main effect and QE interaction effects on the QTL positions for Model-AE when using MCIM method with mixed linear model approach.
1The 95% ECI of the QTL positions. 2The QTL main effects. 3The QTL by environment 1 interaction effects.
- QTL Epistasis
Table 3-25. The estimation of additive and epistatic effects on the QTL positions for Model-AA by using MCIM method when the mixed linear model is used (300 replications).
1Parameters setting for QTL effects. 2Estimation of the QTL effects.
According to the simulation study (3-4-2), epistatic effects reduce the efficiency
and degrade the results of QTL mapping when the IM or CIM method is used. In
addition, there is no way to estimate QTL epistatic effects with these two
methods. On the other hand, by using the mixed linear model approach (2-14), the
MCIM method can analyse QTL additive effects and QTL epistatic effects at the same
time by fitting two intervals into the model. QTLs with additive effects and (or)
epistatic effects can be located through a two-dimensional search procedure
(Wang et al. 1999).
Table 3-25 shows the estimates of the QTL additive effects and QTL epistatic
effects for Model-AA obtained with the MCIM method. The simulation results indicate
that the estimates of most additive and epistatic effects are unbiased. The linkage
between Q1-1 and Q1-2 may cause the overestimation of the additive effects of these
two QTLs.
Table 3-26. The power of QTL detection and the estimation of additive and epistatic effects on the QTL positions for Model-AA and Model-A by using MCIM method when the mixed linear model is used (300 replications and threshold is LOD = 2.5).
Table 3-26 shows the power of QTL detection and the estimates of QTL effects at
the known QTL positions for Model-AA and Model-A when the MCIM method is used.
Q2-2 and Q3-1 show a large improvement in QTL detection power for Model-AA in
contrast with Model-A, and the same is true for Q1-3. This result implies that QTL
epistatic effects can improve the QTL detection power when using the MCIM method.
The estimates of QTL epistatic effects for the simple additive Model-A are almost 0.
This indicates that there is no harm in mapping a simple additive model with the
extended epistatic model of the MCIM method.
Table 3-27. The power of QTL detection, probability of false QTL detected, and the estimation of QTL positions for the detected QTLs for Model-AA and Model-A by using MCIM method with simple additive model (300 replications and threshold is LOD = 2.5).
Power QTL position (ECI) QTLs Model-AA Model-A Model-AA Model-A
simultaneously to construct multiple putative QTLs in the model for QTL mapping. It
is a multiple-QTL oriented method that combines QTL mapping analysis with the
analysis of the genetic architecture of quantitative traits. Through a search
algorithm, the method can simultaneously obtain detailed information about the
significant QTLs, such as their number, positions, effects and interactions.
The search strategy of the MIM method is to select the best (or a better) genetic
model in the parameter space. In other words, it is a model selection problem.
Therefore, model selection is the key component of the analysis and the basis of
genetic parameter estimation and data interpretation in any QTL mapping method that
uses multiple intervals. Model selection in a high and unknown dimension is very
complicated. Appropriate criteria or stopping rules for model selection are very
important but very difficult to decide.
In this research, we study the properties of the criteria for model selection in
the QTL mapping framework. Here only the ideal case is considered. First, the cross
design is a backcross using pure-breeding parental lines, for its simplicity.
However, the results can be extended to other experimental designs with only two
different marker genotypes, such as DH and RIL populations. For more complicated
populations such as F2, the basic principles still hold. Second, all QTL effects are
assumed to be the same and all markers equally spaced, for the sake of standardizing
the criteria. Finally, all QTLs are positioned exactly on markers.
As an example, the parameter setting for the starting model (Model-S) is as follows:
the sample size is 300; the whole genome has 3 chromosomes, each with 14 evenly
distributed markers at a marker distance of 8 cM; and there are 8 QTLs in total with
the same effect, 4 on chromosome 1, 1 on chromosome 2, and 3 on chromosome 3. When
the heritability is set to 0.8, 0.5, and 0.2, the QTL effects are 1.014, 0.507, and
0.169 respectively.
Under this situation, we can simply do model selection on the markers by least
squares multiple regression. The results obtained under this assumption are still
useful. First, if the marker density is high, the distance between a QTL and the
nearest marker is very small and can be ignored. Second, for a sparse marker map,
MIM can use the maximum likelihood method (Kao and Zeng 1997) to estimate the QTL
position from the marker genotypes and positions, but the principle of model
selection remains the same. Our results are therefore still useful for MIM model
selection in practice.
4-2 Model Evaluation Standard
Consider a multiple regression model for a backcross population:

$$y_i = \mu + \sum_{j=1}^{M} \beta_j X_{ij} + \varepsilon_i \qquad (4\text{-}1)$$

where $y_i$ is the trait value of individual i, $\mu$ is the mean of the model and M
is the number of markers fitted in the model. $\beta_j$ is the partial regression
coefficient (marker effect) for marker j and $X_{ij}$ is the marker indicator
variable for individual i and marker j. For the backcross population, $X_{ij}$ has
two possible values, for example, 1 for the MM and 0 for the Mm marker genotype.
$\varepsilon_i$ is a random residual variable assumed to be normally distributed
with mean 0 and variance $\sigma^2$.
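Model (4-1) can be illustrated with a small simulation. The following is only a minimal sketch for one chromosome: the Haldane map function, the random seed, and the positions and size of the two QTLs are assumptions chosen for illustration and are not the settings used in this study.

```python
import math
import random


def haldane_r(d_cm):
    """Recombination fraction for a map distance in cM (Haldane map function, an assumption)."""
    return 0.5 * (1.0 - math.exp(-2.0 * d_cm / 100.0))


def simulate_backcross(n_ind, n_markers, spacing_cm, effects, mu=0.0, sigma=1.0, rng=None):
    """Return (X, y): X[i][j] is 1 (MM) or 0 (Mm); y follows model (4-1)."""
    rng = rng or random.Random()
    r = haldane_r(spacing_cm)
    X, y = [], []
    for _ in range(n_ind):
        row = [rng.randint(0, 1)]
        for _ in range(n_markers - 1):
            # The genotype switches between adjacent markers with probability r.
            row.append(1 - row[-1] if rng.random() < r else row[-1])
        X.append(row)
        y.append(mu + sum(b * x for b, x in zip(effects, row)) + rng.gauss(0.0, sigma))
    return X, y


# Hypothetical example: 14 markers spaced 8 cM apart, QTLs sitting on markers 4 and 10.
effects = [0.0] * 14
effects[3] = effects[9] = 1.0
X, y = simulate_backcross(300, 14, 8.0, effects, rng=random.Random(1))
```

Each row of X is one individual, so X and y can be passed directly to any least squares routine that fits (4-1).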
The goal of model selection is to find a better (not necessarily the best) model with
M markers through a search procedure. Hopefully, these markers are QTLs in our ideal
situation. In doing this, there are two possible errors we can make. The first type of
error (called $\alpha$) is that some selected markers in the final model are not QTLs.
This kind of error is related to the Type I error in some sense. The second type of
error (called $\beta$) is that some QTLs (markers) are not included in the final
model; this kind of error is related to the detection power ($1 - \beta$) of QTL. It
is very important to balance these two types of error in model selection practice.
As a model selection standard, we have defined three parameters, $\alpha$, $\beta$,
and $\chi$, for measuring the degree of fit between the selected model and the real
model. Assume the real QTL number is N and the identified QTL number is n, the real
QTL position (cM) is P and the identified QTL position is p, and the positions are
measured from the beginning of the chromosome.
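
$$\alpha = \frac{1}{N} \sum_{c} (n_c - N_c), \quad \text{if } n_c \ge N_c, \quad c = 1, 2, \ldots, C \qquad (4\text{-}2)$$

$$\beta = \frac{1}{N} \sum_{c} (N_c - n_c), \quad \text{if } n_c \le N_c, \quad c = 1, 2, \ldots, C \qquad (4\text{-}3)$$

$$\chi = \frac{\sum_{c} \sum_{t} |P_{ct} - p_{ct}|}{\sum_{c} R_c}, \quad c = 1, 2, \ldots, C \text{ and } t = 1, 2, \ldots, R_c, \quad R_c = \min(n_c, N_c) \qquad (4\text{-}4)$$
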
where C is the chromosome number, $N_c$ and $n_c$ are the real and identified QTL
numbers on chromosome c, $\alpha$ is the percentage of wrongly identified QTLs,
$\beta$ is the percentage of missed QTLs, and $\chi$ is the average distance between
the identified QTLs and the real QTLs.
For each chromosome, if the real QTL number is not equal to the identified QTL
number, there are many ways to pair the real and identified QTL positions. The
criterion used here is to choose the pairing that minimizes the total distance on
each chromosome.
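The three evaluation parameters and the minimum-distance pairing can be sketched as follows. This is an illustration under stated assumptions, not the program used in this study: the surplus and missing QTLs on each chromosome are both divided by the real QTL number N, the pairing is found by exhaustive enumeration, and the function names and the example positions are hypothetical.

```python
from itertools import permutations


def min_total_distance(real, found):
    """Smallest total |P - p| over all pairings of min(len(real), len(found)) QTLs."""
    short, long_ = (real, found) if len(real) <= len(found) else (found, real)
    if not short:
        return 0.0
    return min(
        sum(abs(s - l) for s, l in zip(short, choice))
        for choice in permutations(long_, len(short))
    )


def evaluate(real_by_chr, found_by_chr):
    """Both arguments map a chromosome label to a list of QTL positions in cM."""
    n_real = sum(len(v) for v in real_by_chr.values())  # N, assumed to be > 0
    extra = missed = pairs = 0
    dist = 0.0
    for c in set(real_by_chr) | set(found_by_chr):
        real = real_by_chr.get(c, [])
        found = found_by_chr.get(c, [])
        extra += max(len(found) - len(real), 0)   # surplus QTLs when n_c >= N_c
        missed += max(len(real) - len(found), 0)  # missing QTLs when n_c <= N_c
        pairs += min(len(real), len(found))       # R_c
        dist += min_total_distance(real, found)
    alpha = extra / n_real
    beta = missed / n_real
    chi = dist / pairs if pairs else float("nan")
    return alpha, beta, chi


# Hypothetical example: one surplus QTL on chromosome 1, one missed QTL on chromosome 2.
real = {1: [8.0, 40.0, 72.0], 2: [48.0]}
found = {1: [8.0, 48.0, 72.0, 96.0], 2: []}
alpha, beta, chi = evaluate(real, found)
```

Exhaustive enumeration is acceptable here because the number of QTLs per chromosome is small; for many QTLs an assignment algorithm would be preferable.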
4-3 Model Selection Strategy and Criteria
One of the difficulties of model selection is that there are too many potential
models to consider. In our situation, if the total marker number is M, there are
about $2^M$ possible models. For example, if the total marker number is 100 for the
whole genome, there are more than $1.2 \times 10^{30}$ possible models, and it is
infeasible to test all of them to obtain the best model.
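The size of the model space can be checked directly. The sketch below counts the models for M = 100 markers; the evaluation rate of one billion models per second is a purely hypothetical figure used only to show the order of magnitude.

```python
def count_models(m):
    """Each of the m markers is either in the model or not, giving 2**m models."""
    return 2 ** m


M = 100
total = count_models(M)

# Hypothetical rate, for illustration only: one billion models fitted per second.
models_per_second = 1e9
seconds_per_year = 3600 * 24 * 365
years_needed = total / models_per_second / seconds_per_year

print(f"{total:.3e} possible models, about {years_needed:.1e} years at the assumed rate")
```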
We can divide all possible models into two major groups: models with the same
number of regressors and models with different numbers of regressors. If the whole
genome has M markers, all possible models can be divided into M+1 classes, and each
class contains the models with the same number of regressors (from the model with M
regressors to the one with no regressor, only the model mean). The criterion for
selecting the best model among models with the same number of regressors is
relatively simple. The best model is the one with the largest coefficient of
determination ($R^2$) or the smallest residual sum of squares (RSS). From formulas
(4-5) and (4-6), it is easy to see that $R^2$ can be considered a measure of the
goodness of fit of the model, and that maximizing $R^2$ is equivalent to minimizing
RSS.

$$R^2 = \frac{\sum_i (\hat{Y}_i - \bar{Y})^2}{\sum_i (Y_i - \bar{Y})^2} \qquad (4\text{-}5)$$

$$RSS = (1 - R^2) \sum_i (Y_i - \bar{Y})^2 \qquad (4\text{-}6)$$
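The relation between $R^2$ and RSS can be verified numerically. The following minimal sketch fits a single marker by least squares; the eight data points are invented for illustration.

```python
def fit_line(x, y):
    """Least squares intercept and slope for a single regressor."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return my - slope * mx, slope


# Hypothetical marker genotypes (1 = MM, 0 = Mm) and trait values.
x = [0, 1, 0, 1, 1, 0, 1, 0]
y = [1.1, 2.3, 0.7, 1.9, 2.6, 1.2, 2.0, 0.9]

intercept, slope = fit_line(x, y)
y_hat = [intercept + slope * a for a in x]
y_bar = sum(y) / len(y)

ss_total = sum((b - y_bar) ** 2 for b in y)
r2 = sum((f - y_bar) ** 2 for f in y_hat) / ss_total      # formula (4-5)
rss_from_r2 = (1.0 - r2) * ss_total                       # formula (4-6)
rss_direct = sum((b - f) ** 2 for b, f in zip(y, y_hat))  # definition of RSS
```

The two RSS values agree up to rounding error because the total sum of squares is fixed for a given sample, which is why ranking models by $R^2$ or by RSS gives the same order.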
Comparing models with different numbers of regressors is the most difficult task in
model selection. The reason is that as the number of regressors increases, the value
of $R^2$ never decreases and the RSS never increases. Therefore, it is impossible to
decide which model is better by simply comparing the values of $R^2$ or RSS. One
must decide what increase in $R^2$ is required before accepting an additional
regressor or, thinking in the backward direction, what decrease in $R^2$ is
acceptable when dropping a regressor.
In summary, there are two kinds of criteria to deal with in model selection
practice. The first is to find the best model within a class of models with the
same number of regressors. In this case, the criterion itself is simple (use $R^2$
or RSS), but the difficulty is that there are too many models to consider. One way
to solve this problem is to use a procedure that searches through a limited space to
find a better model (there is no way to guarantee the best one). Such search methods
include forward, backward, stepwise, and branch-and-bound procedures. The second is
to find the best model among models with different numbers of regressors by using
certain criteria. Our research deals with the first problem, but the focus is on the
second one: the criteria or stopping rules for model selection.
4-4 Procedure of Model Selection
The first step of model selection is to find the best (or at least a better) model
for each class with the same number of regressors without doing an exhaustive
search. For the situation of M markers, we can find the M+1 models

$$\eta_p, \quad p = 0, 1, \ldots, M$$

where p is the number of regressors in the model.
The forward stepwise selection (FW) or backward elimination (BW) method can be
used for this purpose. The FW method chooses the subset models by adding one
regressor at a time to the previously chosen model. It starts by choosing the
one-regressor model, selecting the regressor that contributes the largest sum of
squares (SS) to the model. At each successive step, the regressor not already in the
model that causes the largest decrease in the RSS (that is, has the largest partial
sum of squares) is added to the model. This procedure can go on until all regressors
are in the model. The BW method chooses the subset models by starting with the full
model and then eliminating, at each step, the one regressor whose deletion causes
the RSS to increase the least. This is the regressor in the current model that has
the smallest partial sum of squares. This procedure can also go on until the model
contains no regressor (only the model's mean is left).
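The FW and BW procedures described above can be sketched in a few lines. This is a minimal illustration rather than the program used in this study: the least squares fit solves the normal equations by Gaussian elimination, the five markers are simulated as independent (unlinked) for simplicity, and the effects, seed and function names are hypothetical.

```python
import random


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= f * M[col][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][k] * x[k] for k in range(i + 1, n))) / M[i][i]
    return x


def rss(X, y, cols):
    """RSS of y regressed on the model mean plus the marker columns in cols."""
    Z = [[1.0] + [row[j] for j in cols] for row in X]
    p = len(Z[0])
    ZtZ = [[sum(z[a] * z[b] for z in Z) for b in range(p)] for a in range(p)]
    Zty = [sum(z[a] * yi for z, yi in zip(Z, y)) for a in range(p)]
    coef = solve(ZtZ, Zty)
    return sum((yi - sum(c * v for c, v in zip(coef, z))) ** 2 for z, yi in zip(Z, y))


def forward(X, y):
    """FW: add the regressor giving the largest decrease in RSS at each step."""
    chosen, remaining, path = [], list(range(len(X[0]))), {0: []}
    while remaining:
        best = min(remaining, key=lambda j: rss(X, y, chosen + [j]))
        chosen = chosen + [best]
        remaining.remove(best)
        path[len(chosen)] = sorted(chosen)
    return path


def backward(X, y):
    """BW: drop the regressor whose deletion increases RSS the least at each step."""
    chosen = list(range(len(X[0])))
    path = {len(chosen): sorted(chosen)}
    while chosen:
        drop = min(chosen, key=lambda j: rss(X, y, [k for k in chosen if k != j]))
        chosen = [k for k in chosen if k != drop]
        path[len(chosen)] = sorted(chosen)
    return path


# Hypothetical data: markers 0 and 2 carry QTLs, the other three do not.
rng = random.Random(0)
X = [[rng.randint(0, 1) for _ in range(5)] for _ in range(100)]
y = [1.0 + 2.0 * row[0] + 3.0 * row[2] + rng.gauss(0.0, 0.1) for row in X]
fw_path, bw_path = forward(X, y), backward(X, y)
```

Both functions return one model per class, indexed by the number of regressors p, so `fw_path[2]` and `bw_path[2]` are the two-regressor models chosen by each procedure.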
Compared with the exhaustive search, the FW or BW method saves a great amount of
computation time. The cost is that once a regressor is included by the FW method, it
always stays in the subsequent models, and once a regressor is excluded by the BW
method, it has no chance to enter the subsequent models again. Therefore, there is
no guarantee that the model selected by the FW or BW method is the best model in
each class with the same number of regressors. However, in our situation, due to the
linear structure of marker positions and QTL locations, it is expected that the
model selected by the FW or BW method is the best model in each class. Zeng (1993)
proved an important property of the partial regression coefficient in multiple
regression analysis: the partial regression coefficient is expected to depend only
on those QTLs that are located in the intervals bracketed by the two neighboring
markers if there is no crossing-over interference and no epistasis, as shown in
formula (4-7).

$$b_i \approx \sum_{x_{i-1} < g_k < x_i} \frac{r_{x_{i-1} g_k}}{r_{x_{i-1} x_i}} \delta_k + \sum_{x_i < g_k < x_{i+1}} \frac{r_{g_k x_{i+1}}}{r_{x_i x_{i+1}}} \delta_k \qquad (4\text{-}7)$$

In our ideal situation (Model-S), the partial regression coefficient is expected to
equal the QTL effect for markers with a QTL and 0 for markers without a QTL. Even
when the QTLs are not exactly on the markers, the partial regression coefficients of
markers near the QTLs are expected to be larger than those of markers far away from
the QTLs. Because the partial regression coefficient is directly related to the
partial sum of squares, it is expected that the markers with QTLs will be selected
first by the FW method and the markers without QTLs will be eliminated first by the
BW method.
Table 4-1 shows the simulated average $R^2$ values of the true model and of the
model selected by the BW method when Model-S is used. For each replicated sample,
the $R^2$ of the true model is calculated by fitting the true parameters (real QTL
number and positions) into the multiple regression model. The $R^2$ of the selected
model is calculated by selecting the model with 8 regressors from the full model by
using the BW method. Therefore, the only difference between these two models is the
marker (QTL) positions. The BW method clearly does very well, and its average $R^2$
is even larger than that of the true model in some cases. Due to sampling variance,
the true model is not necessarily the model with maximum $R^2$, but it is usually a
good one (has small standard deviation).
Table 4-1. Comparing the coefficient of determination (R2) between the model selected by backward procedure and the true model for the Model-S. The sample size is 300 and the replication is 1000 times.
Backward Selected Models True Models Heritability 1Low 2.5% 2Mean 3Up 2.5% Low 2.5% Mean Up 2.5%
Figure 4-2. The value of M under various parameter settings for α = 0.05. From top to bottom: marker density, samples, chromosomes; the solid line is Model-S.