ASSOCIATION MAPPING
INTRODUCTION TO GENE MAPPING: Association Mapping
Association mapping seeks to identify specific functional variants
(i.e. loci, alleles) linked to phenotypic differences in a trait, to
facilitate detection of trait-causing DNA sequence polymorphisms and/or
selection of genotypes that closely resemble the phenotype. Also
known as Linkage Disequilibrium (LD) mapping or Association
Genetics, it is a population-based survey used to identify
trait-marker relationships based on LD.
Leilani Nora, Assistant Scientist
ASSOCIATION VS. QTL MAPPING
Attribute: Detection goal
  QTL mapping: Quantitative trait locus, a wide region within specific pedigrees
  Association genetics: Quantitative trait nucleotide, physically as close as possible to causative sequences
Attribute: Resolution of causative trait polymorphism
  QTL mapping: Low (only moderate-density linkage maps required)
  Association genetics: High (disequilibrium within small physical regions, requiring many markers)
Attribute: Experimental populations
  QTL mapping: Defined pedigrees, e.g. backcross, F2, RI, three- and two-generation pedigrees
  Association genetics: LD experiments: unrelated individuals (unstructured populations), large numbers of small unrelated families
Attribute: Marker discovery costs
  QTL mapping: Moderate
  Association genetics: Moderate for few traits, high for many traits
Attribute: Extent of inference
  QTL mapping: Pedigree specific, except where the species has high extant LD
  Association genetics: Species or subspecies wide
Attribute: Number of markers required for genome coverage
  QTL mapping: 10^2 to low 10^3
  Association genetics: 10^5 for small genomes, ~10^9 for large genomes
WHY ASSOCIATION GENETICS?
 - The higher resolution afforded by the use of unstructured
   populations allows the intriguing possibility of identifying the
   genes, or even the specific nucleotides, underpinning trait variation.
 - It offers the opportunity to use molecular markers to enhance rates
   of genetic gain, including utilization of specific genes from
   non-elite germplasm in a more directed and efficient manner.
HARDY WEINBERG EQUILIBRIUM
HARDY WEINBERG EQUILIBRIUM (HWE)
 - The Hardy-Weinberg model describes and predicts genotype and allele
   frequencies in a non-evolving population.
 - It is an expression of the notion of a population in genetic
   equilibrium and is a basic principle of population genetics.
HWE ASSUMPTIONS
Assumption          | Exception                                        | Effect
Random mating       | Inbreeding / outbreeding                         | Decrease or increase in heterozygosity
No migration        | Migration                                        | Homogenizes different populations
No mutation         | Mutation (source of all variation)               | Increases heterozygosity
No selection        | Directional / disruptive / stabilizing selection | Reduces or increases variation, depending on the type of selection
Infinite population | Genetic drift                                    | Reduces variation
HARDY WEINBERG EQUILIBRIUM
1st Generation: Genotype and Allele Frequencies
Consider a locus with two alleles, A and a. We can use a Punnett
square to produce all possible combinations of these gametes
(Table 1). Assume that in the first generation the population is not
in HWE and the genotype frequencies are as shown in Table 2.
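HARDY WEINBERG EQUILIBRIUM
1st Generation: Genotype and Allele Frequencies
Table 1. Punnett square of gametes
            Female
Male        A      a
A           AA     Aa
a           Aa     aa
Table 2. Genotype frequencies (population not in HWE)
Genotype    AA       Aa       aa
Freq        P(AA)    P(Aa)    P(aa)
where: P(AA) + P(Aa) + P(aa) = 1
Allele frequencies for a population not in HWE, as passed on to the next generation:
$$ p = P(A) = P(AA) + \tfrac{1}{2} P(Aa) $$
$$ q = P(a) = P(aa) + \tfrac{1}{2} P(Aa) $$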
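As a minimal sketch of how allele frequencies follow from genotype frequencies, the base R snippet below uses the genotype counts reported for Snp4 later in these slides (109, 105, 29), relabelled here as AA, Aa and aa. The variable names are illustrative assumptions, not part of any package.

```r
# Genotype counts at one diallelic locus (A/A, A/B, B/B relabelled as AA, Aa, aa)
counts <- c(AA = 109, Aa = 105, aa = 29)
n <- sum(counts)

# Genotype frequencies P(AA), P(Aa), P(aa); they sum to 1
geno_freq <- counts / n

# Each heterozygote contributes one copy of each allele
p <- geno_freq[["AA"]] + 0.5 * geno_freq[["Aa"]]   # P(A)
q <- geno_freq[["aa"]] + 0.5 * geno_freq[["Aa"]]   # P(a)

print(round(c(p = p, q = q), 4))
```

The two values sum to 1 and agree, after rounding, with the allele proportions 0.66 and 0.34 shown in the Snp4 summary.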
When a population is in HWE, the next generation will have the same
genotype frequencies as well as the same allele frequencies.
Genotype frequencies for a population in HWE:
P(AA) = P(A) P(A) = p^2
P(Aa) = 2 P(A) P(a) = 2pq
P(aa) = P(a) P(a) = q^2
HARDY WEINBERG EQUILIBRIUM
For example, consider a diallelic locus with alleles A and a with
frequencies 0.85 and 0.15, respectively. If the locus is in HWE,
calculate the expected genotype frequencies.
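The exercise above can be worked through in a few lines of base R. This is only a sketch: it applies p^2, 2pq and q^2 to the stated allele frequencies and then checks that the allele frequency of A is unchanged in the next generation.

```r
p <- 0.85  # frequency of allele A
q <- 0.15  # frequency of allele a

# Expected genotype frequencies under HWE
expected <- c(AA = p^2, Aa = 2 * p * q, aa = q^2)
print(expected)

# Allele frequency of A recovered from the genotype frequencies
p_next <- expected[["AA"]] + 0.5 * expected[["Aa"]]
print(p_next)
```

The expected frequencies are 0.7225 (AA), 0.255 (Aa) and 0.0225 (aa), and the recovered frequency of A is again 0.85.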
HARDY WEINBERG EQUILIBRIUM
Violation of HWE Assumptions
When a locus is not in HWE, this suggests that one or more of the
Hardy-Weinberg assumptions is false. Departure from HWE has been
used to infer the existence of natural selection, to argue for the
existence of assortative (non-random) mating, and to infer genotyping
errors. It is therefore of interest to test whether a population is
in HWE at a locus. The two most popular ways of testing HWE are:
 - Chi-square test
 - Exact test
CHI-SQUARE GOODNESS OF FIT
Compares observed genotype counts with the values expected under
Hardy-Weinberg. For a locus with two alleles, we might construct a
table as follows:
Genotype              AA       Aa         aa
Observed              n_AA     n_Aa       n_aa
Expected under HWE    n p^2    2 n p q    n q^2
CHI-SQUARE GOODNESS OF FIT
Test statistic for allelic association:
$$ \chi^2 = \sum_{\text{genotypes}} \frac{(\text{Observed count} - \text{Expected count})^2}{\text{Expected count}} \sim \chi^2_{(1\,\text{df})} \quad \text{under } H_0 $$
where:
 - n is the number of individuals in the sample
 - p is the probability that a random allele in the population is of type A
 - q is the probability that a random allele in the population is of type a
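The goodness-of-fit statistic can be computed by hand in base R. The sketch below is not the genetics package implementation; `hwe_chisq` is a hypothetical helper name, and the p-value here comes from the asymptotic chi-square distribution with 1 df rather than from simulation. The counts are the Snp4 genotype counts shown later in these slides.

```r
# Hypothetical helper: chi-square goodness-of-fit test of HWE from genotype counts
hwe_chisq <- function(n_AA, n_Aa, n_aa) {
  n <- n_AA + n_Aa + n_aa
  p <- (2 * n_AA + n_Aa) / (2 * n)   # estimated frequency of allele A
  q <- 1 - p                         # estimated frequency of allele a
  observed <- c(AA = n_AA, Aa = n_Aa, aa = n_aa)
  expected <- c(AA = n * p^2, Aa = 2 * n * p * q, aa = n * q^2)
  statistic <- sum((observed - expected)^2 / expected)
  # 3 genotype classes - 1 - 1 estimated allele frequency = 1 df
  p_value <- pchisq(statistic, df = 1, lower.tail = FALSE)
  list(expected = expected, statistic = statistic, p_value = p_value)
}

res <- hwe_chisq(109, 105, 29)
print(round(res$expected, 2))
print(round(res$statistic, 4))
```

For these counts the statistic rounds to 0.2298, the same X-squared value that the HWE.chisq() illustration reports for Snp4.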
DATAFRAME: ge03d1p1.csv Dataframe with 250 observations and 7
variables.
DATAFRAME: ge03d1p1.csv
Read data file ge03d1p1.csv
> assoc1
> library(genetics)
> summary(Snp4)
Number of samples typed: 243 (97.2%)
Allele Frequency: (2 alleles)
     Count  Proportion
A      323        0.66
B      163        0.34
NA      14          NA
Genotype Frequency:
     Count  Proportion
A/A    109        0.45
A/B    105        0.43
B/B     29        0.12
NA       7          NA
Heterozygosity (Hu) = 0.4467269
Poly. Inf. Content  = 0.3464355
> table(Snp4)
A/A A/B B/B
109 105  29
PACKAGE genetics : HWE.chisq()
Tests the null hypothesis (Ho) that Hardy-Weinberg equilibrium holds,
using the chi-square method.
> HWE.chisq(x, ...)
# x: genotype or haplotype object
Illustration
> HWE.chisq(Snp4)
Pearson's Chi-squared test with simulated p-value (based on 10000 replicates)
data: tab
X-squared = 0.2298, df = NA, p-value = 0.6657
PACKAGE genetics : HWE.exact()
Exact test of HWE for 2-allele markers.
> HWE.exact(x, ...)
# x: genotype or haplotype object
Illustration
> HWE.exact(Snp4)
Exact Test for Hardy-Weinberg Equilibrium
data: snp4
N11 = 109, N12 = 105, N22 = 29, N1 = 323, N2 = 163, p-value = 0.666
LINKAGE DISEQUILIBRIUM
 - Also known as gametic phase disequilibrium, gametic disequilibrium
   and allelic association.
 - The non-random association of alleles at different loci.
 - It is the correlation between polymorphisms (e.g. SNPs) that is
   caused by their shared history of mutation and recombination.
 - LD and linkage are related, but they are distinctly different.
LINKAGE DISEQUILIBRIUM
 - Two loci, A and B, are said to be in linkage (or gametic)
   disequilibrium if their respective alleles do not associate
   independently in the studied population.
 - It occurs when genotypes at the two loci are not independent of one
   another.
 - If all polymorphisms were independent at the population level,
   association studies would have to examine every one of them.
 - Linkage disequilibrium makes tightly linked variants strongly
   correlated, producing cost savings for association studies.
MEASURE OF LINKAGE DISEQUILIBRIUM
Consider two loci (A and B), each segregating for two alleles (A, a
and B, b). There are four possible gametes (or haplotypes) present in
the population:
Gamete:       AB      Ab      aB      ab
Frequency:    X_AB    X_Ab    X_aB    X_ab
              Locus B
Locus A       B       b       Total
A             X_AB    X_Ab    p_A
a             X_aB    X_ab    q_a
Total         p_B     q_b     1.0
Allele frequencies can be expressed in terms of gamete frequencies:
p_A = X_AB + X_Ab,  q_a = X_aB + X_ab,  p_B = X_AB + X_aB,  q_b = X_Ab + X_ab
MEASURE OF LINKAGE DISEQUILIBRIUM
If the alleles at the two loci are randomly associated with one
another, then the frequencies of the four gametes are equal to the
products of the frequencies of the alleles:
Locus A | Locus B: B                      | Locus B: b                            | Total
A       | p_AB = p_A p_B                  | p_Ab = p_A q_b = p_A (1 - p_B)        | p_A
a       | p_aB = q_a p_B = (1 - p_A) p_B  | p_ab = q_a q_b = (1 - p_A)(1 - p_B)   | q_a
Total   | p_B                             | q_b                                   | 1.0
In this situation there is no linkage disequilibrium and gamete
frequencies can be accurately followed using allele frequencies.
COEFFICIENT OF LD
If alleles at the two loci are not randomly associated, then there
will be a deviation (D) from the expected frequencies:
Locus A | Locus B: B                | Locus B: b                | Total
A       | p_AB = p_A p_B + D_AB     | p_Ab = p_A q_b - D_AB     | p_A
a       | p_aB = q_a p_B - D_AB     | p_ab = q_a q_b + D_AB     | q_a
Total   | p_B                       | q_b                       | 1.0
This parameter D is the coefficient of linkage disequilibrium, first
proposed by Lewontin and Kojima (1960). The most common expression
of D is:
$$ D_{ij} = p_{ij} - p_i p_j \quad \text{or} \quad D_{AB} = p_{AB} - p_A p_B $$
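A small base R sketch of the coefficient D follows. The haplotype frequencies are hypothetical values chosen for illustration; the code derives the allele frequencies from them, computes D_AB, and then rebuilds the four gamete frequencies from the allele frequencies plus or minus D_AB.

```r
# Hypothetical gamete (haplotype) frequencies; they must sum to 1
x <- c(AB = 0.4, Ab = 0.1, aB = 0.2, ab = 0.3)

# Allele frequencies as sums of gamete frequencies
p_A <- x[["AB"]] + x[["Ab"]]
p_B <- x[["AB"]] + x[["aB"]]
q_a <- 1 - p_A
q_b <- 1 - p_B

# Coefficient of linkage disequilibrium
D_AB <- x[["AB"]] - p_A * p_B
print(D_AB)

# Gamete frequencies rebuilt from allele frequencies and D_AB
rebuilt <- c(AB = p_A * p_B + D_AB,
             Ab = p_A * q_b - D_AB,
             aB = q_a * p_B - D_AB,
             ab = q_a * q_b + D_AB)
print(rebuilt)
```

Here p_A = 0.5, p_B = 0.6 and D_AB = 0.1, and the rebuilt frequencies match the starting values, which is a quick consistency check on the signs used in the table.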
MEASURE OF LINKAGE DISEQUILIBRIUM
Normalized measure of Lewontin, D':
$$ D' = \frac{D_{AB}}{D_{\max}} $$
where:
$$ D_{\max} = \begin{cases} \min[p_A p_B,\; q_a q_b] & \text{if } D_{AB} < 0 \\ \min[p_A q_b,\; q_a p_B] & \text{if } D_{AB} > 0 \end{cases} $$
 - The absolute value of D' varies between 0 and 1 and allows one to
   assess the extent of linkage disequilibrium relative to the maximum
   possible value it can take.
 - |D'| will only be less than one if all four possible haplotypes are
   observed.
TEST OF LD
Chi-square test of linkage disequilibrium (D):
$$ \chi^2 = \frac{2n D_{AB}^2}{p_A (1 - p_A)\, p_B (1 - p_B)} \sim \chi^2_{(1)} $$
 - The statistic is compared with the threshold value obtained from
   the chi-square table with 1 df at a chosen level of significance.
 - n is the number of individuals in the population.
 - If significant, D is significantly different from 0 and the
   population under study is in linkage disequilibrium.
 - If not significant, D is not significantly different from 0 and the
   population under study is in linkage equilibrium.
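The normalized measure and the chi-square test can be sketched together in base R. `ld_test` is a hypothetical helper written for this illustration; the inputs (x_AB = 0.4, p_A = 0.5, p_B = 0.6, n = 100) are made-up values, and the statistic uses 2n because n individuals carry 2n gametes.

```r
# Hypothetical helper: D, D' and the chi-square test of LD for two diallelic loci
ld_test <- function(x_AB, p_A, p_B, n) {
  q_a <- 1 - p_A
  q_b <- 1 - p_B
  D <- x_AB - p_A * p_B
  # D_max depends on the sign of D
  D_max <- if (D < 0) min(p_A * p_B, q_a * q_b) else min(p_A * q_b, q_a * p_B)
  D_prime <- D / D_max
  # n individuals contribute 2n gametes
  chisq <- 2 * n * D^2 / (p_A * q_a * p_B * q_b)
  p_value <- pchisq(chisq, df = 1, lower.tail = FALSE)
  list(D = D, D_max = D_max, D_prime = D_prime, chisq = chisq, p_value = p_value)
}

res_ld <- ld_test(x_AB = 0.4, p_A = 0.5, p_B = 0.6, n = 100)
print(unlist(res_ld))
```

With these inputs D = 0.1, D_max = 0.2 and D' = 0.5; the statistic is far above the 1 df threshold at the 5% level (about 3.84), so D would be declared different from 0.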
MEASURE OF LINKAGE DISEQUILIBRIUM
Correlation between A and B alleles, \rho^2 or r^2:
$$ r^2 = \frac{D_{AB}^2}{p_A (1 - p_A)\, p_B (1 - p_B)} = \frac{\chi^2}{2n} $$
 - If allele frequencies are equal, r^2 varies between 0 and 1: it is
   1 when the two markers provide identical information and 0 when
   they are in perfect equilibrium.
 - As for D, the maximum value of r^2 depends on the allele
   frequencies, and one can determine an r' value in a manner
   analogous to D'.
ILLUSTRATION OF LD
Figure 1: Completely Correlated
 - Shows an example of LD where the two polymorphisms are completely
   correlated with one another.
 - Two linked mutations occur at a similar point in time and no
   recombination has occurred between the sites.
 - In this case, the history of mutation and recombination for the
   sites is the same.
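The r^2 measure can be checked numerically for a completely correlated pair of sites. In this base R sketch, `r_squared` is a hypothetical helper and both sets of haplotype frequencies are invented for illustration: the first has only two haplotypes (complete correlation), the second has three, so the sites are no longer completely correlated.

```r
# Hypothetical helper: r^2 from the four haplotype frequencies AB, Ab, aB, ab
r_squared <- function(x) {
  p_A <- x[["AB"]] + x[["Ab"]]
  p_B <- x[["AB"]] + x[["aB"]]
  D <- x[["AB"]] - p_A * p_B
  D^2 / (p_A * (1 - p_A) * p_B * (1 - p_B))
}

# Only two haplotypes present: the sites carry identical information
complete <- c(AB = 0.6, Ab = 0.0, aB = 0.0, ab = 0.4)

# Three haplotypes present: correlation is no longer complete
partial <- c(AB = 0.5, Ab = 0.3, aB = 0.0, ab = 0.2)

print(r_squared(complete))
print(r_squared(partial))
```

The first case gives r^2 = 1 and the second gives r^2 = 0.25, even though no fourth haplotype (and hence no evidence of recombination) appears in either.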
ILLUSTRATION OF LD
Figure 2: Not Completely Correlated
 - The polymorphisms are not completely correlated, but there is no
   evidence of recombination.
 - This type of LD structure develops when mutations occur on
   different allelic lineages.
 - This is the situation in which r^2 and D' act differently, with
   |D'| still equal to 1 but where r^2 can be much smaller.
LD ANALYSIS IN R: LD()
Computes pairwise linkage disequilibrium between genetic markers.
Usage
> LD(g1, g2, ...)
# g1: genotype object or dataframe containing genotype objects
# g2: genotype object (ignored if g1 is a dataframe)
Sample: LD()
> library(genetics)
> Snp4
> modelQT2
> summary(modelQT2)
Deviance Residuals:
    Min       1Q   Median       3Q      Max
-0.7433  -0.7204  -0.6715  -0.6715   1.7890
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  -1.3749     0.2386  -5.761 8.35e-09 ***
snp4A/B       0.1585     0.3331   0.476    0.634
snp4B/B       0.2297     0.4952   0.464    0.643
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 254.91 on 242 degrees of freedom
Residual deviance: 254.58 on 240 degrees of freedom
(7 observations deleted due to missingness)
AIC: 260.58
Number of Fisher Scoring iterations: 4
DETECTING POPULATION STRUCTURE
Non-parametric methods
 - Cluster Analysis
 - Multi-Dimensional Scaling
 - Principal Component Analysis
Parametric method
 - STRUCTURE
CLUSTER ANALYSIS
 - An exploratory technique which may be used to search for category
   structure based on natural groupings in the data, or to reduce a
   very large body of data to a relatively compact description.
 - No assumptions are made concerning the number of groups or the
   group structure.
 - Grouping is done on the basis of similarities or distances
   (dissimilarities).
 - There are various techniques for doing this, which may give
   different results. Thus the researcher should consider the validity
   of the clusters found.
HIERARCHICAL CLUSTER ANALYSIS
Steps in Performing Agglomerative Hierarchical Clustering
 1. Obtain the data matrix
 2. Standardize the data matrix if need be
 3. Generate the resemblance or distance matrix
 4. Execute the clustering method
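The four steps above can be sketched with base R functions (scale(), dist(), hclust() and cutree() from the stats and base packages). The data matrix is a small made-up example with two obvious groups; the choice of Euclidean distance and average linkage is an assumption for illustration, not a recommendation from these slides.

```r
# Step 1: data matrix (hypothetical: 6 individuals x 3 numeric variables)
x <- rbind(c(1.0, 2.1, 0.9),
           c(1.2, 1.9, 1.1),
           c(0.8, 2.0, 1.0),
           c(5.1, 6.0, 4.9),
           c(4.9, 6.2, 5.2),
           c(5.0, 5.9, 5.0))
rownames(x) <- paste0("ind", 1:6)

# Step 2: standardize each column to mean 0 and unit variance
x_std <- scale(x)

# Step 3: distance (dissimilarity) matrix
d <- dist(x_std, method = "euclidean")

# Step 4: agglomerative clustering, here with average linkage
hc <- hclust(d, method = "average")

# Cut the tree into two groups
groups <- cutree(hc, k = 2)
print(groups)
```

Calling plot(hc) would draw the dendrogram; swapping method for "single", "complete", "ward.D2" or "centroid" gives the other linkage rules listed in these slides.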
DISTANCE / DISSIMILARITY MATRIX
CLUSTERING METHOD
Types of Hierarchical Agglomerative Clustering
 1. Single Linkage (SLINK)
 2. Complete Linkage (CLINK)
 3. Average Linkage (ALINK)
 4. Ward's Method (minimizes the error SS)
 5. Centroid Method
Partitioning Methods
 1. K-means clustering
 2. K-centroids
DATAFRAME: AMP2009.csv
Read data file AMP2009.csv
> AMP09
> abline(v=0, lty=2)
> abline(h=0, lty=2)
PRINCIPAL COMPONENT ANALYSIS
 - A data-analytic method which provides a specific set of projections
   that represent a given data set in fewer dimensions.
 - Used to transform correlated variables into uncorrelated ones, in
   other words to sphere the data.
 - The final rationale for this technique is that it finds the linear
   combinations of the data which have relatively large (or relatively
   small) variability.
PRINCIPAL COMPONENT ANALYSIS : prcomp()
Performs a principal component analysis on the given data matrix and
returns the results as an object of class prcomp.
> prcomp(x, center=TRUE, scale.=FALSE, ...)
# x: a numeric or complex matrix (or dataframe) which provides the data for PCA
# center: a logical value indicating whether the variables should be shifted to be zero centered
# scale.: a logical value indicating whether the variables should be scaled to have unit variance before the analysis
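A minimal sketch of prcomp() on a numerically coded genotype matrix is shown below. The matrix is invented for illustration (6 individuals, 4 SNPs, genotypes coded 1/2/3) and is built so that the first three individuals resemble each other and differ from the last three; the object names are assumptions.

```r
# Hypothetical numeric genotype matrix: 6 individuals x 4 SNPs, coded 1/2/3
geno <- matrix(c(1, 1, 3, 3,
                 1, 2, 3, 3,
                 1, 1, 3, 2,
                 3, 3, 1, 1,
                 3, 3, 1, 2,
                 3, 2, 1, 1),
               nrow = 6, byrow = TRUE,
               dimnames = list(paste0("ind", 1:6), paste0("SNP", 1:4)))

# PCA on centered, unscaled data
pca <- prcomp(geno, center = TRUE, scale. = FALSE)

# Proportion of variance explained by each component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
print(round(var_explained, 3))

# Scores of the individuals on the principal components
scores <- pca$x
print(round(scores[, 1:2], 3))
```

In this toy data the sign of the PC1 score separates the two groups of individuals, which is the kind of pattern a PC1 versus PC2 scatter plot is meant to reveal.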
PRINCIPAL COMPONENT ANALYSIS : predict()
A generic function for predictions from the results of various model
fitting functions.
> predict(object, ...)
# object: a model object for which prediction is desired
DATA FRAME: GenoNum
Convert SNP data to numeric
> GenoNum
> str(GenoNum)
int [1:207, 1:184] 3 3 3 2 3 2 2 2 2 2 ...
- attr(*, "dimnames")=List of 2
..$ : chr [1:207] "AMP001" "AMP002" "AMP003" "AMP004" ...
..$ : chr [1:184] "SNP1" "SNP2" "SNP3" "SNP4" ...
PRINCIPAL COMPONENT ANALYSIS : prcomp()
> PCAMP
> scores
> plot(PCAMP$"x"[,1], PCAMP$"x"[,2], xlab="PC1", ylab="PC2", type="p")
STRUCTURE VERSION 2.3
 - Implements a model-based clustering method for inferring population
   structure using genotype data from unlinked markers.
 - Based on Bayesian statistics, so it can accommodate prior knowledge
   about the population structure.
 - Individuals in the sample are assigned (probabilistically) to
   populations, or jointly to two or more populations if their
   genotypes indicate that they are admixed.
 - To download the software, use the link below:
   http://pritch.bsd.uchicago.edu/structure.html
 - For more details on how to use the software, you can contact
   Dr. Ken McNally.
SAMPLE OUTPUT USING STRUCTURE
Figure 1. Bar plot of estimates of Q. Each individual is
represented by a single vertical line broken into K colored
segments, with lengths proportional to each of the K inferred
clusters. The numbers 1 to 4 correspond to the predefined
populations.
SAMPLE OUTPUT USING STRUCTURE
Figure 2. Triangle plot of the Q-matrix. Each individual is
represented by a colored point. The colors correspond to the prior
population labels.
When K=3 the ancestry vectors can be plotted onto a triangle, as
shown. For a given point, each of the three components is given by
the distance to one edge of the triangle. Individuals who are in
one of the corners are therefore assigned completely to one
population or another.
ASSOCIATION ANALYSIS Using GenABEL
FEATURES OF PACKAGE GenABEL
 - Specifically designed for GWAS
 - Provides specific facilities for storage and manipulation of large data
 - Very fast tests for GWAS
 - Specific functions to analyze and display the results
 - More efficient than the package genetics
DESCRIPTION OF gwaa.data-class
 - In GenABEL, a special data class, gwaa.data-class, is used to store
   GWA data. It includes the phenotypic and genotypic data, and the
   chromosome and location of every SNP.
 - An object of a class has slots, which may contain actual data or
   objects of other classes.
 - At the first level, a gwaa.data-class object has the slot phdata,
   which contains all the phenotypic information in a dataframe.
 - The other slot is gtdata, which contains all GWA genetic
   information in an object of class snp.data.
 - For every SNP it is desirable to know the details of coding and
   strand (+, -, top, bot).
EXPLORING gwaa.data-class
> library(GenABEL)
> data(ge03d2ex)
> str(ge03d2ex)
Formal class 'gwaa.data' [package "GenABEL"] with 2 slots
  ..@ phdata:'data.frame': 136 obs. of 8 variables:
  .. ..$ id : chr [1:136] "id199" "id287" "id300"...
  .. ..$ sex : int [1:136] 1 0 1 0 0 1 1 0 0 1 ...
  .....
  ..@ gtdata:Formal class 'snp.data' [package "GenABEL"] with 11 slots
  .. .. ..@ nbytes : num 34
  .. .. ..@ nids : int 136
  .. .. ..@ nsnps : int 4000
  .. .. ..@ idnames : chr [1:136] "id199" "id287"...
  .. .. ..@ snpnames : chr [1:4000] "rs7435137"...
  .. .. ..@ chromosome: Factor w/ 4 levels "1","2","3","X" ....
STRUCTURE OF gwaa.data-class
EXPLORING gwaa.data-class
# Summary of phenotype data
> summary(ge03d2ex@phdata)
# No. of people in a study
> ge03d2ex@gtdata@nids
# No. of SNPs
> ge03d2ex@gtdata@nsnps
# SNP names
> ge03d2ex@gtdata@snpnames[1:10]
# Chromosome labels
> ge03d2ex@gtdata@chromosome[1:10]
# SNP map position/location
> ge03d2ex@gtdata@map[1:10]
IMPORT DATA TO GenABEL
To import data to GenABEL, two files need to be prepared:
 - Phenotypic data
 - Genotypic data
Description of the phenotypic data file:
 - The first line must consist of variable names.
 - The first column must contain the unique ID, named id.
 - The second column should be named sex (0=female, 1=male).
 - Other columns in the file should contain phenotypic information.
 - Missing values should be coded as NA.
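A phenotypic file in the layout described above can be produced with base R alone. This is only a sketch: the individuals and the height column are invented, the space delimiter is an assumption, and the file is written to a temporary directory so nothing in the working directory is overwritten.

```r
# Hypothetical phenotype table: id first, sex second (0 = female, 1 = male)
pheno <- data.frame(
  id     = c("id1", "id2", "id3"),
  sex    = c(1L, 0L, 1L),
  height = c(170.2, NA, 165.0),   # missing value kept as NA
  stringsAsFactors = FALSE
)

# Write with a header line of variable names and NA for missing values
pheno_file <- file.path(tempdir(), "Pheno.dat")
write.table(pheno, file = pheno_file, sep = " ", quote = FALSE,
            row.names = FALSE, na = "NA")

# Read the file back to confirm the layout
check <- read.table(pheno_file, header = TRUE, stringsAsFactors = FALSE)
print(check)
```

Reading the file back is a cheap way to confirm that the header, the column order and the NA coding survived the round trip before the file is passed on to GenABEL.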
IMPORT DATA TO GenABEL
Example of a phenotypic file: Pheno.csv
Save this file as Pheno.dat
IMPORT DATA TO GenABEL
Description of the genotypic data file:
 - For every SNP, information on map position, chromosome, and strand
   should be provided.
 - For every individual, every SNP genotype should be provided.
 - GenABEL provides a number of functions to convert these data from
   different formats to the internal GenABEL raw format:
> convert.snp.illumina()
> convert.snp.tped()
> convert.snp.ped()
> convert.snp.text()
IMPORT DATA TO GenABEL: convert.snp.text()
Converts a genotypic data file to a file in the raw internal data format.
> convert.snp.text(infile, outfile, ...)
# infile: input data file
#   1st line: IDs
#   2nd line: names of all SNPs
#   3rd line: list of chromosomes the SNPs belong to
#   4th line: genomic positions of the SNPs
#   5th line: genetic data
# outfile: output data file
IMPORT DATA TO GenABEL: convert.snp.text()
Sample genotypic data file: Geno.csv
Save this file as Geno.dat
IMPORT DATA TO GenABEL: load.gwaa.data()
Loads data (genotypes and phenotypes) from files into a gwaa.data object.
> load.gwaa.data(phenofile="pheno.dat", genofile="geno.raw", sort=TRUE)
# phenofile: data table with phenotypes
# genofile: internally formatted genotypic data file, produced by convert.snp.text()
# sort: logical value indicating whether SNPs should be sorted in ascending order according to chromosome and position
IMPORT DATA TO GenABEL: load.gwaa.data()
> convert.snp.text("Geno.dat", "Geno.raw")
> genphen
DESCRIPTIVE STATISTICS OF PHENOTYPE: descriptives.trait()
> descriptives.trait(data, by.var)
# data: an object of snp.data-class or gwaa.data-class
# by.var: a binary trait; the analysis is done separately for each group
DESCRIPTIVE STATISTICS OF PHENOTYPE: descriptives.trait()
> descriptives.trait(ge03d2ex)
         No     Mean      SD
id      136       NA      NA
sex     136    0.529   0.501
age     136   49.069  12.926
dm2     136    0.632   0.484
height  135  169.440   9.814
weight  135   87.397  25.510
diet    136    0.059   0.236
bmi     135   30.301   8.082
DESCRIPTIVE STATISTICS OF PHENOTYPE: descriptives.trait()
> descriptives.trait(ge03d2ex, by=ge03d2ex@phdata$dm2)
        No(by.var=0)     Mean      SD  No(by.var=1)     Mean      SD    Ptt    Pkw  Pexact
id                50       NA      NA            86       NA      NA     NA     NA      NA
sex               50    0.420   0.499            86    0.593   0.494  0.053  0.052   0.074
age               50   47.038  13.971            86   50.250  12.206  0.179  0.205      NA
dm2               50       NA      NA            86       NA      NA     NA     NA      NA
height            49  167.671   8.586            86  170.448  10.362  0.097  0.141      NA
weight            49   76.534  17.441            86   93.587  27.337  0.000  0.000      NA
diet              50    0.060   0.240            86    0.058   0.235  0.965  0.965   1.000
bmi               49   27.304   6.463            86   32.008   8.441  0.000  0.001      NA
DESCRIPTIVE STATISTICS OF MARKER: descriptives.marker()
Generates descriptive summary tables for genotypic data.
> descriptives.marker(data, digits)
# data: an object of snp.data-class or gwaa.data-class
# digits: number of digits to be printed
DESCRIPTIVE STATISTICS OF MARKERS: descriptives.marker()
> descriptives.marker(ge03d2ex)
$'Cumulative distr. of different alpha'
> pop
[printed vector of 0/1 values: mostly 0, with a few 1s]
STRATIFIED ASSOCIATION
> data1.sa
> plot(cleanqt, cex=0.5, pch=19, ylim=c(1,4))
> add.plot(data1.sa, col="green", cex=1.2)
> add.plot(origdata, col="red", cex=1.2)
STRATIFIED ASSOCIATION Comparison of Structured Association
analysis
PCA USING PRICE'S METHOD: egscore()
Fast score test for association (FASTA) between a trait and a genetic
polymorphism, adjusted for possible stratification by principal
components.
> egscore(formula, data, kin)
# formula: formula describing fixed effects; (y ~ a + b) means the outcome y depends on two covariates, a and b
# data: an object of gwaa.data-class
# kin: kinship matrix as returned by ibs
PCA USING PRICE'S METHOD: egscore()
> data1.eg
> plot(cleanqt, cex=0.5, pch=19, ylim=c(1,5))
> add.plot(data1.sa, col="green", cex=1.2)
> add.plot(data1.eg, col="red", cex=1.3)
[Plot: x-axis Chromosome (1, 2, 3, X)]