Genes and MS in Tasmania, cont.
Lecture 5, Statistics 246, February 3, 2004
Mapping genes contributing to complex diseases
MS susceptibility genes are difficult to map
MS is a complex disease. Analyses with traditional methods such as single marker association studies and standard linkage approaches (affected sib-pairs, pedigrees, etc.) have failed to agree on genomic regions other than the HLA region.
There are a variety of possible reasons for this:
• Allelic and locus heterogeneity (no single gene model fits all)
• Significant environmental influences
• Imprecise phenotyping
Linkage vs Association studies
• Linkage mapping: tests for cosegregation of a marker allele with the disease within families
• Association mapping: seeks a marker allele that is present more frequently in cases than in controls; all affected individuals are treated as distant relatives
– Case/control studies
– Transmission disequilibrium test (needs triads)
We will do a quick review of association mapping before turning to our MS study.
5
Linkage disequilibrium
Suppose that we have a marker with just two alleles, M and m say, having frequencies p and 1-p, and a (not necessarily linked) disease locus with alleles D and d, having frequencies q and 1-q. A (haploid) gamete must have one of the four combinations (haplotypes) DM, Dm, dM or dm. Let the frequencies in a population of these four haplotypes be x1, x2, x3 and x4.
Under independence, we would have x1 = pq, x2 = (1-p)q, x3 = p(1-q) and x4 = (1-p)(1-q). Deviations of the observed haplotype frequencies from these products are termed linkage disequilibrium (LD), or, better, gametic association.
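The deviation from independence can be summarised by a single number, x1 - pq, which equals x1·x4 - x2·x3 when the four frequencies sum to 1. The following is a minimal sketch of that calculation; the function name `ld_summary`, the name `delta` for the coefficient (the slides already use D for the disease allele), and the example frequencies are assumptions made for illustration.

```python
def ld_summary(x1, x2, x3, x4):
    """Return (p, q, delta) for the haplotype frequencies of DM, Dm, dM, dm."""
    total = x1 + x2 + x3 + x4
    if abs(total - 1.0) > 1e-9:
        raise ValueError("haplotype frequencies must sum to 1")
    p = x1 + x3          # frequency of marker allele M (haplotypes DM and dM)
    q = x1 + x2          # frequency of disease allele D (haplotypes DM and Dm)
    delta = x1 - p * q   # deviation of the DM frequency from independence
    return p, q, delta


# Hypothetical frequencies, chosen only for illustration.
p, q, delta = ld_summary(0.4, 0.1, 0.1, 0.4)
print(f"p = {p:.2f}, q = {q:.2f}, delta = {delta:.3f}")
```

A value of `delta` equal to zero corresponds to gametic independence; the other three haplotype frequencies deviate from their products by the same amount, up to sign.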
If inheriting the allele D at the disease locus increases the chance of getting the disease, and the disease and marker loci are in LD, then the frequencies of the marker alleles M and m will differ between diseased and non-diseased individuals. This observation is the basis of association studies.
Case-control studies in genetic epidemiology
Case-control studies compare case and control allele frequencies at markers or candidate genes (the “exposure” variables). All the standard potential drawbacks of such studies apply, with the similarity of the two base populations being the most critical here. It is thought to be relatively easy for samples from racially mixed populations to differ in allele frequencies, and hard to deal with this in the genetic context. Key term: population structure.
If our cases are MS patients, who are our controls? It would be rare for a study to be able to afford or get ethics approval to carry out random sampling of the relevant background population. More commonly, controls are people such as blood donors, whose blood (DNA) has been collected for other purposes. How close will they be to a random sample from the case population?
In an effort to deal with this, the TDT, which follows, in effect uses untransmitted genotypes as controls, bypassing any population structure.
The transmission-disequilibrium test
In its simplest form, the TDT, as it is called, starts with two parents and an affected child. It considers a biallelic marker locus at which all three are typed and for which we can determine which maternal and paternal alleles were transmitted and which were not.
For example, if the parents were a1/a2 and a1/a1, and the affected offspring was a1/a2, then a2 was transmitted and a1 was not transmitted by the first parent.
From a random sample of such trios (called triads), a 2×2 table can be built up giving the number of times a1 and a2 were transmitted and were not transmitted, respectively, and a simple test can be derived. Many generalizations of this procedure now exist; see notes for Stat 260, 1998, Week 5.
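One standard form of the simple test is McNemar's statistic on the heterozygous parents: if b parents transmitted a1 but not a2, and c transmitted a2 but not a1, then (b - c)²/(b + c) is referred to a chi-square distribution with 1 degree of freedom. The sketch below assumes that form, which may not be the exact version in the course notes. The function name `tdt` and the counts in the example are hypothetical.

```python
from math import erfc, sqrt


def tdt(transmissions, allele_1="a1", allele_2="a2"):
    """McNemar-type TDT from (transmitted, untransmitted) pairs, one per parent.

    Homozygous parents (transmitted == untransmitted) are uninformative
    and are ignored.
    """
    b = sum(1 for t, u in transmissions if (t, u) == (allele_1, allele_2))
    c = sum(1 for t, u in transmissions if (t, u) == (allele_2, allele_1))
    if b + c == 0:
        raise ValueError("no heterozygous (informative) parents")
    statistic = (b - c) ** 2 / (b + c)
    # Upper tail of chi-square with 1 df: P(Z^2 > s) = erfc(sqrt(s / 2)).
    p_value = erfc(sqrt(statistic / 2.0))
    return b, c, statistic, p_value


# Hypothetical data: 30 heterozygous parents transmit a1, 15 transmit a2,
# and 10 parents are a1/a1 homozygotes.
data = [("a1", "a2")] * 30 + [("a2", "a1")] * 15 + [("a1", "a1")] * 10
b, c, statistic, p_value = tdt(data)
print(f"b = {b}, c = {c}, chi-square = {statistic:.2f}, p = {p_value:.4f}")
```

In the trio from the example above, the first parent contributes the pair ("a2", "a1") and the second parent, being a1/a1, contributes nothing to the statistic.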
What is a haplotype?
• A collection of alleles derived from the same chromosome
[Figure: genotypes (chromosome phase unknown) are converted by haplotype reconstruction into haplotypes (chromosome phase known)]
Haplotype mapping
If alleles at a disease locus are associated with alleles at one nearby marker locus on gametes, they are likely to be associated with alleles at other nearby marker loci, and hence with marker haplotypes.
A potentially more powerful way to locate disease genes is to search for associations between marker haplotypes and disease.
There are two possible problems here stemming from the fact that there can be a very large number of marker haplotypes: we may have to deal with very small frequencies, and we have a multiple testing problem.
LD, haplotype mapping and background LD
Searching for common or rare haplotypes in cases alone is one form of association mapping. It has been successful, as very substantial LD can arise around disease loci. In general, controls are necessary, as the background LD can be large.
That is, there can be substantial LD between putative disease gene alleles and alleles of nearby markers, without there being any causal link between the gene and the disease. We call this background LD.
Background LD can be large
– when the population is young
– when the number of founders is small (bottlenecks)
– through admixture of populations
Exercises on LD
1. Show that, under a random mating assumption, the long-term values of the frequencies x1, x2, x3 and x4 on page 5 above are pq, (1-p)q, p(1-q) and (1-p)(1-q) (Week 5, Stat 260, 1998).
2. Demonstrate that a mixture (e.g. 50:50) of two populations, each initially in linkage equilibrium (LE) at two loci, will typically not be in LE.
3. Explain why a single mutant arising by chance will initially be in strong LD with alleles at loci near the locus at which it arises.
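As a numeric check on exercise 2 (an illustration, not a proof), the sketch below builds two populations that are each in LE and mixes them. The allele frequencies and the helper names `le_haplotypes`, `mix` and `delta` are hypothetical choices for this example.

```python
def le_haplotypes(p, q):
    """Haplotype frequencies (DM, Dm, dM, dm) under linkage equilibrium."""
    return (p * q, (1 - p) * q, p * (1 - q), (1 - p) * (1 - q))


def mix(pop_a, pop_b, weight_a=0.5):
    """Haplotype frequencies in a mixture of two populations."""
    return tuple(weight_a * a + (1 - weight_a) * b for a, b in zip(pop_a, pop_b))


def delta(haplotypes):
    """LD coefficient x1*x4 - x2*x3 (zero under linkage equilibrium)."""
    x1, x2, x3, x4 = haplotypes
    return x1 * x4 - x2 * x3


# Hypothetical populations: marker frequency p and disease allele frequency q.
pop_a = le_haplotypes(p=0.2, q=0.1)
pop_b = le_haplotypes(p=0.8, q=0.5)
mixed = mix(pop_a, pop_b, weight_a=0.5)

print(f"delta in A: {delta(pop_a):.4f}")
print(f"delta in B: {delta(pop_b):.4f}")
print(f"delta in 50:50 mixture: {delta(mixed):.4f}")
```

For mixing weight w, the coefficient in the mixture works out to w(1-w)(pA - pB)(qA - qB), so it vanishes only if the two populations agree in allele frequency at one of the loci. That is the sense of "typically" in the exercise.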
Mapping MS genes in Tasmania
Tasmania
Area: 67,800 km²
Population: 470,000
Capital city: Hobart (~200,000)
Tasmanian Population Growth
[Figure: population growth curve with five marked events]
1: First settled by Europeans (1803)
2: 24,000 free settlers, 19,000 convicts (1836)
3: Civil registration of births and marriages (1838)
4: End of convict transportation (1853)
5: “The Gold Rush” (1860s)
Mapping with haplotype sharing
[Figure: timeline from the 1800-1850s to 2000, spanning 6-8 generations]
Premise: Tasmanians share large(ish) segments of haplotypes because they are distantly related. Similarly, our MS patients should share these large(ish) segments, but even more so (in size and in number) in regions around MS susceptibility genes.
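Haplotypes are “eroded” by recombination
[Figure: an ancestral chromosome is broken down by recombination over time (generations/meioses), leaving shorter segments around the MS locus in present-day chromosomes; annotated 25 cM (SD=18)]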
Recombination events can help to map genes with precision, but they erode haplotypes, making them more difficult to detect.
What might have happened in the population?
• A mutation arises in, or is introduced to, a population leading to disease (say MS) in those individuals
• The mutation arises on the background of a unique haplotype
• As this mutation spreads through the population (by chance, or inbreeding) so do remnants of this original haplotype by hitchhiking (linkage disequilibrium)
[Figure: an ancestral susceptibility haplotype passed down over time to several MS cases]
Design of the Tasmanian MS study
What strategy could be used to map MS susceptibility genes in Tasmania?
• Too few affected sib pairs/multiplex families for a conventional linkage approach
• Prefer a model-free (non-parametric) approach
A haplotype-based case-control study design seemed appropriate.
MS study in Tasmania: design
• Collect as many MS cases with ancestral links to Tasmania as possible, and a suitable (not necessarily equal) number of similar, socioeconomically and geographically matched unrelated controls
• Around each case and each control, collect a constellation of ~ 4 close relatives for (probabilistic) haplotype reconstruction
• Infer genome-wide haplotypes for all cases and controls
• Carry out a case/control study with the haplotypes, seeking regions of the genome shared more by the cases, in comparison with the controls
Analysis options
• Transmitted case haplotypes
• Untransmitted case haplotypes
• Transmitted control haplotypes
• Untransmitted control haplotypes
Colour key from the original figure: Green: hope to find signal. Red: hope to find nothing.
First mathematical questions
• Resolution of genome-wide scan (length of likely shared chromosomal segments)
• Nature and number of relatives needed to permit the reconstruction of accurate haplotypes with high probability
Average length of shared chromosomal segments
Exercise. Assume the Poisson model for crossovers along a chromosome. What are the mean and variance of the length in cM of the chromosomal segments shared by individuals with a common ancestor 7 generations back?
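One way to set up this calculation is sketched below, under stated assumptions: we look at the segment around a fixed locus, ignore chromosome ends, and take crossovers in each meiosis to form a Poisson process of rate 1 per Morgan. If n meioses separate the two chromosomes being compared, the distance to the nearest crossover on each side of the locus is exponential with mean 100/n cM, so the segment length is the sum of two such distances. The function name and the choice of n are assumptions; reading "7 generations back" as n = 14 (7 meioses down each line of descent) is one interpretation, not the only one.

```python
from math import sqrt


def shared_segment_stats(meioses):
    """Mean, variance and SD (in cM, cM^2, cM) of the shared segment length.

    Assumes the length is the sum of two independent exponential distances,
    each with mean 100 / meioses cM (Poisson crossovers, no chromosome ends).
    """
    if meioses <= 0:
        raise ValueError("meioses must be positive")
    one_side_mean = 100.0 / meioses
    mean = 2.0 * one_side_mean
    variance = 2.0 * one_side_mean ** 2   # variance of an exponential is mean^2
    return mean, variance, sqrt(variance)


for n in (8, 14):
    mean, variance, sd = shared_segment_stats(n)
    print(f"n = {n:2d} meioses: mean = {mean:.1f} cM, SD = {sd:.1f} cM")
```

With n = 8 the sketch gives a mean of 25 cM and an SD of about 17.7 cM, close to the "25 cM (SD=18)" annotation on the recombination slide, which may be how that figure was obtained.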
Nature and number of relatives needed to give accurate haplotypes
Exercise. Explain why it is that when we have both sets of parental genotypes, and the markers are reasonably polymorphic, we can reconstruct an individual’s haplotypes with high probability. What are the difficult cases?
If we have no parents, or just one parent, and the genotypes of grandparents, siblings or offspring are available, which are most informative for an individual’s haplotype reconstruction?
Reconstructing haplotypes from genotypes
• Observe genotyping data for an individual
At marker 1 : (1,3)
At marker 2 : (b,d)
• Reconstruct the haplotype by inferring recombination events from genotypes of relatives
At marker 1 : Mum (1,2) Dad (3,4)
At marker 2 : Mum (a,b) Dad (c,d)
[Figure: the two reconstructed haplotypes across Marker 1 and Marker 2: 1-b and 3-d]
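The example above can be worked through mechanically when both parents are typed. The sketch below covers only this simplest case: at each marker it assigns one of the child's alleles to each parent when exactly one assignment is consistent with the parental genotypes. It is not the probabilistic reconstruction from several relatives that the study design calls for. The function name and the dictionary layout are assumptions.

```python
def phase_from_parents(child, mum, dad):
    """Return (maternal, paternal) haplotypes as lists, one entry per marker.

    Each argument maps a marker name to a pair of alleles. An entry is None
    where the parental origin of the child's alleles cannot be resolved.
    """
    maternal, paternal = [], []
    for marker, (x, y) in child.items():
        # Collect every (from_mum, from_dad) assignment the parents allow.
        options = set()
        for from_mum, from_dad in ((x, y), (y, x)):
            if from_mum in mum[marker] and from_dad in dad[marker]:
                options.add((from_mum, from_dad))
        if not options:
            raise ValueError(f"Mendelian inconsistency at {marker}")
        if len(options) == 1:
            from_mum, from_dad = next(iter(options))
        else:
            from_mum = from_dad = None   # both assignments fit: ambiguous
        maternal.append(from_mum)
        paternal.append(from_dad)
    return maternal, paternal


# Genotypes from the slide.
child = {"marker1": ("1", "3"), "marker2": ("b", "d")}
mum = {"marker1": ("1", "2"), "marker2": ("a", "b")}
dad = {"marker1": ("3", "4"), "marker2": ("c", "d")}

maternal, paternal = phase_from_parents(child, mum, dad)
print("maternal haplotype:", "-".join(maternal))
print("paternal haplotype:", "-".join(paternal))
```

An ambiguous marker arises, for example, when the child and both parents carry the same heterozygous genotype; that is one of the difficult cases raised in the earlier exercise.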
Genotyping
Use STR (short tandem repeat) markers, also known as microsatellite markers
…AGCTAGCGCGC….GCGCGGCATTA…
…AGCTAGCGCGC….GCGCGGCGCATTA…
Eventual plan: 5 cM genome wide scan (~ 800 markers) with dinucleotide STRs
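Alleles at a dinucleotide STR differ in the number of copies of the two-base repeat, as in the two sequences above. The sketch below counts the longest run of a repeat motif in a sequence. The two toy sequences are hypothetical (they are not the elided sequences from the slide), and the function name is an assumption.

```python
import re


def longest_repeat_run(sequence, motif="GC"):
    """Number of motif copies in the longest uninterrupted tandem run."""
    runs = re.findall(f"(?:{re.escape(motif)})+", sequence)
    return max((len(run) // len(motif) for run in runs), default=0)


# Hypothetical alleles: same flanking sequence, different repeat counts.
allele_short = "AGCTA" + "GC" * 5 + "ATTA"
allele_long = "AGCTA" + "GC" * 7 + "ATTA"

short_count = longest_repeat_run(allele_short)
long_count = longest_repeat_run(allele_long)
print(f"short allele: {short_count} repeats, long allele: {long_count} repeats")
print(f"length difference: {len(allele_long) - len(allele_short)} bp")
```

In practice STR alleles are usually called from fragment lengths rather than from sequence, so this is only meant to show what distinguishes one allele from another.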