SINGLE NUCLEOTIDE POLYMORPHISM ANALYSIS IN APPLICATION TO FINE GENE MAPPING
by
Manish Sampat Pungliya
A Thesis
submitted to the Faculty of the
WORCESTER POLYTECHNIC INSTITUTE
in partial fulfillment of the requirements for the
Degree of Master of Science
In
Biology
By
___________________
May 2001
APPROVED BY:
Dr. Julia Krushkal, Major Advisor
Dr. Elizabeth Ryder, Advisor on Record
Dr. Carolina Ruiz, Thesis Committee
Dr. David Adams, Thesis Committee
Dr. Ronald Cheetham, Head of the Department
ABSTRACT
Single nucleotide polymorphisms (SNPs) are single base variations among
groups of individuals. In order to study their properties in fine gene mapping, I
considered their occurrence as transitions and transversions. The aim of the study was
to classify each polymorphism depending upon whether it was a transition or
transversion and to calculate the proportions of transitions and transversions in the
SNP data from the public databases. This ratio was found to be 2.35 for data from the
Whitehead Institute for Genome Research database, 2.003 from the Genome
Database, and 2.086 from the SNP Consortium database. These results indicate that
the ratio of the numbers of transitions to transversions was very different from the
expected ratio of 0.5. To study the effect of different transition to transversion ratios
in fine gene mapping, a simulation study was performed to generate nucleotide
sequence data. The study investigated the effect of different transition to transversion
ratios on the linkage disequilibrium (LD) parameter, which is frequently used in
association analysis to identify functional mutations. My results showed no
considerable effect of different transition to transversion ratios on LD. I also studied
the distribution of allele frequencies of biallelic SNPs from the Genome Database.
My results showed that the frequencies of the most common SNP alleles are normally
distributed, with a mean allele frequency of 0.7520 and a standard deviation of 0.1272.
These results can be useful in future studies for simulating SNP behavior. I also studied the simulated data
provided by the Genetic Analysis Workshop 12 to identify functional SNPs in
candidate genes by using the genotype-specific linkage disequilibrium method.
ACKNOWLEDGMENTS
I am very thankful to the many people who were directly and indirectly involved
in the completion of my thesis. This is the only way to express my sincere gratitude
towards them. First of all, I would like to thank my advisor Dr. Julia Krushkal for
giving me the unique opportunity of working in the exciting field of Bioinformatics.
Her continuous encouragement and support made this thesis one of the most
important things to be cherished throughout my life.
I extend my special gratitude to Dr. Elizabeth Ryder, Dr. David Adams and
Dr. Carolina Ruiz for their sincere suggestions and helpful discussions throughout my
years of graduate study. Special thanks go to Dr. Matthew Ward for providing me
with the program ‘Scansort’. I am also thankful to Christopher Shoemaker and
Michael Sao Pedro, who did the data conversion and provided me with the formatted
data for Genetic Analysis Workshop 12 (GAW12). I would also like to
mention my friend Raju Subramanian for guiding me in writing algorithms in C++.
Analysis of the GAW12 data presented in this thesis was funded by the grant
“Computational algorithms for analysis of genomic data” from the Research
Development Council of Worcester Polytechnic Institute. I am also thankful to the
National Institutes of Health grant GM31575 to GAW12.
Finally, I express my gratitude to my parents for their unconditional love and
moral support throughout my graduate study. And last but not least, I would like to
thank all my friends at WPI who have made my stay here a memorable part of my
life.
TABLE OF CONTENTS
Abstract
Acknowledgments
List of figures and tables
Introduction
Thesis objectives
Part I: Investigation of the effect of transition to transversion ratio
    Methods
    Results
    Discussion
Part II: Application of SNPs in fine gene mapping
    Introduction
    Methods
Figure 7: A sample output of TREEVOLVE for sample size of 7 with sequence length of 43. There are no SNPs in nucleotide sequences shown because only few initial bases are shown for each sequence for illustration.
Table 2: Simulations performed for the different combinations of parameters.
Transition/          Recombination rate
transversion ratio   0.0      10⁻⁸     3×10⁻⁵   10⁻³
0.5                  √        √        √        √
1.0                  √        √        √        √
2.35                 √        √        √        √
3.0                  √        √        √        √
5.0                  √        √        √        √
LINKAGE DISEQUILIBRIUM ANALYSIS OF SIMULATED
DATA FROM TREEVOLVE
Different types of software are used to compute values of LD (linkage
disequilibrium) from sequence data. In the present study, the program Arlequin
(Schneider et al. 1997) was used for calculating the LD in the data simulated by
TREEVOLVE. It is versatile software for the analysis of genetic data of various
types, including RFLPs, DNA sequence, and microsatellite data. It allows
one to perform a number of statistical tests using population data, including linkage
disequilibrium analysis. Arlequin has a user-friendly graphical interface. Its
other advantage is that one can run a number of input files in a single run with the
same or different parameter lists by creating a batch file, thus reducing the analysis
time for the user. An example of an Arlequin input file used in this study is shown
in figure 8.
During analysis by Arlequin, there were 200 input files (replicates from
TREEVOLVE) for each set of parameters. All these input files were analyzed
sequence1 1 ACAACAGCTAATGAGCTTATATTTTCATGACATAACGGGAAC
sequence2 1 ACAACAGCTAATGAGCTTATATTTTCATGACATAACGGGAAC
sequence3 1 ACAACAGCTAATGAGCTTATATTTTCATGACATAACGGGAAC
sequence4 1 ACAACAGCTAATGAGCTTATATTTTCATGACATAACGGGAAC
sequence5 1 ACAACAGCTAATGAGCTTATATTTTCATGACATAACGGGAAC
}
Figure 8: Sequence.arp, an input file for Arlequin for the calculation of LD. The field under SampleData is the output of TREEVOLVE with a transition to transversion ratio of 0.5 and a recombination rate of 0.0. (Only the first five sequences are shown, with only forty nucleotides in each sequence. The data simulated by TREEVOLVE had 200 such sequences with 1000 nucleotides in each sequence for each set of parameters.)
The main aim was to understand the distribution of SNP variants in the human
population. This knowledge may be helpful in the future when generating simulated
SNP data.
To examine the distribution of SNP allele frequencies in the human population,
I analyzed the allele frequency data collected for the SNPs from the Genome
Database. I selected the common alleles (i.e. those with a frequency of
more than 0.5). The distribution of allele frequencies is shown in figure 13. This
distribution suggests that the frequency of the most common alleles in the SNP data
follows a normal distribution with a mean allele frequency of 0.7520 and a standard
deviation of 0.1272.
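The selection and summary steps described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name and the example frequencies are hypothetical and are not taken from the Genome Database, and each SNP is assumed to be biallelic and represented by the frequency of one of its two alleles.

```python
from statistics import mean, stdev

def common_allele_frequencies(allele_freqs):
    """For each biallelic SNP, keep the frequency of its more common allele.

    allele_freqs holds the frequency of one allele per SNP (hypothetical input
    layout). SNPs whose two alleles are equally frequent (0.5) are skipped,
    because the text defines a common allele as one with frequency above 0.5.
    """
    common = []
    for p in allele_freqs:
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"allele frequency out of range: {p}")
        major = max(p, 1.0 - p)  # frequency of the more common allele
        if major > 0.5:
            common.append(major)
    return common

# Made-up example frequencies, for illustration only.
example = [0.30, 0.80, 0.55, 0.10, 0.50]
common = common_allele_frequencies(example)
print(f"n = {len(common)}, mean = {mean(common):.4f}, SD = {stdev(common):.4f}")
```

With real data, the mean and standard deviation printed here would correspond to the summary values reported in the text.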
The Kurtosis plot analysis, performed to assess the symmetry and normality
of the allele frequency data, gave
K2 = 2.016915 (0.25 < P < 0.50, from the chi-squared distribution
table with two degrees of freedom).
Thus, the K2 result is consistent with the data being normally distributed. These
results will help in future simulations of SNP data that take allele frequencies
into account.
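The P-value bracket quoted for K2 can be checked directly, because the upper-tail probability of a chi-squared distribution with two degrees of freedom has the closed form P = exp(−x/2). The short Python sketch below applies that formula to the reported K2 value; the function name is hypothetical, and the sketch does not reproduce the computation of K2 itself.

```python
import math

def chi2_upper_tail_df2(x):
    """Upper-tail probability of a chi-squared variable with 2 degrees of freedom."""
    if x < 0:
        raise ValueError("a chi-squared statistic cannot be negative")
    return math.exp(-x / 2.0)

k2 = 2.016915  # K2 value reported in the text
p_value = chi2_upper_tail_df2(k2)
print(f"K2 = {k2}, P = {p_value:.4f}")
```

The resulting probability lies between 0.25 and 0.50, in agreement with the bracket read from the table.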
Figure 13: Allele frequency distribution for all common alleles of SNPs from human chromosome 6.
[Histogram omitted. Title: Allele frequency distribution. X-axis: allele frequency (bins 0.52, 0.58, 0.64, 0.7, 0.76, 0.82, 0.88, 0.94, More). Y-axis: frequency (0 to 12).]
EFFECT OF THE RATIO OF TRANSITIONS TO
TRANSVERSIONS ON LINKAGE DISEQUILIBRIUM
To study the effect of different ratios of transitions to transversions on
linkage disequilibrium, which is a frequently used parameter in association analysis, a
simulation study was performed. After running the simulations for each set of
parameters and analyzing the simulated data with Arlequin, I calculated the mean
linkage disequilibrium value for each set of parameters. The results of this analysis
are shown in table 5.
Table 5: Mean linkage disequilibrium values ± standard deviation for different values of the transition to transversion ratio at different recombination rates.
The relationships between different transition to transversion ratios and mean
LD at three recombination rates of 0.0, 10⁻⁸ and 3×10⁻⁵ are shown in figures 14, 15
and 16, respectively, with error bars indicating standard deviation (SD). No variable
loci were observed for data simulated under the highest recombination rate (10⁻³). I
also observed a variation in the number of polymorphic loci for different
recombination rates, with the number of SNP loci decreasing as the recombination
rate increased.
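To make the averaging step concrete, the following Python sketch computes a pairwise LD value for every pair of polymorphic sites in a set of simulated sequences and then summarizes the values across replicates as a mean and standard deviation. It is a minimal illustration under stated assumptions, not a description of Arlequin: it uses the classical coefficient D = p(AB) − p(A)p(B) in absolute value, which may differ from the statistic Arlequin reports, and all function names and example sequences are hypothetical.

```python
from itertools import combinations
from statistics import mean, stdev

def pairwise_d(haplotypes, i, j):
    """Classical LD coefficient D = p_AB - p_A * p_B for sites i and j.

    The alleles carried by the first sequence are used as the reference
    alleles A and B (an arbitrary choice made for this sketch).
    """
    n = len(haplotypes)
    a, b = haplotypes[0][i], haplotypes[0][j]
    p_a = sum(h[i] == a for h in haplotypes) / n
    p_b = sum(h[j] == b for h in haplotypes) / n
    p_ab = sum(h[i] == a and h[j] == b for h in haplotypes) / n
    return p_ab - p_a * p_b

def polymorphic_sites(haplotypes):
    """Positions at which the sequences carry more than one nucleotide."""
    return [k for k in range(len(haplotypes[0]))
            if len({h[k] for h in haplotypes}) > 1]

def mean_abs_d(haplotypes):
    """Mean |D| over all pairs of polymorphic sites, or None if there is no pair."""
    sites = polymorphic_sites(haplotypes)
    values = [abs(pairwise_d(haplotypes, i, j)) for i, j in combinations(sites, 2)]
    return mean(values) if values else None

# Two tiny made-up replicates (the simulations in the text used 200 sequences of 1000 bp).
replicate_1 = ["ACGT", "ACGT", "TCGA", "TCGA"]  # sites 0 and 3 fully associated
replicate_2 = ["ACGT", "ACGA", "TCGT", "TCGA"]  # sites 0 and 3 independent
per_replicate = [v for v in (mean_abs_d(replicate_1), mean_abs_d(replicate_2))
                 if v is not None]
print(f"mean LD = {mean(per_replicate):.4f}, SD = {stdev(per_replicate):.4f}")
```

Replicates without at least two polymorphic sites return None and are skipped, which mirrors the situation described above in which no variable loci were observed at the highest recombination rate.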
Figure 14: Plot of mean LD vs. α/2β at recombination rate (r) equal to 0.0.
Error bars indicate the standard deviations (SD).
Figure 15: Plot of mean LD vs. α/2β at recombination rate (r) equal to 10⁻⁸.
Error bars indicate the standard deviations (SD).
Figure 16: Plot of mean LD vs. α/2β at recombination rate (r) equal to 3×10⁻⁵.
Error bars indicate the standard deviations (SD).
DISCUSSION
According to the one-parameter model, the expected ratio of transitions to
transversions is 0.5: all nucleotide substitutions are assumed to occur at the same
rate, and each nucleotide can undergo one transition but two different
transversions. To see the actual proportion of transitions to transversions in the
general human population, I collected data from publicly available databases. The
SNP data from these databases showed that transitions
are more frequent than transversions (transitions were approximately twice as
frequent as transversions, a ratio about four times the expected value of 0.5).
Therefore, the two-parameter model (where transition and
transversion rates differ) may better approximate SNP behavior. My analyses
showed that the proportion of transitions to transversions (70% to
30%, 66% to 33%, and 68% to 32% in the Whitehead Institute for Genome Research,
the Genome Database and the SNP Consortium databases, respectively) was very
different from that expected (33% to 66%) under the scenario of no differences in
mutation rates between different nucleotide changes.
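The classification underlying these ratios can be illustrated with a short Python sketch. A substitution between two purines (A, G) or between two pyrimidines (C, T) is a transition; any other substitution is a transversion. The function names, the "c/t" allele notation, and the example SNP list are assumptions made for illustration and do not reproduce the program used in this study.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def classify_snp(alleles):
    """Classify a biallelic SNP written like 'c/t' as a transition or a transversion."""
    a, b = (x.strip().upper() for x in alleles.split("/"))
    if a == b or not ({a, b} <= (PURINES | PYRIMIDINES)):
        raise ValueError(f"not a biallelic nucleotide SNP: {alleles!r}")
    same_class = {a, b} <= PURINES or {a, b} <= PYRIMIDINES
    return "transition" if same_class else "transversion"

def ts_tv_ratio(snps):
    """Ratio of the number of transitions to the number of transversions."""
    kinds = [classify_snp(s) for s in snps]
    return kinds.count("transition") / kinds.count("transversion")

# Made-up SNP list for illustration: four transitions and two transversions.
observed = ["c/t", "a/g", "a/c", "g/t", "c/t", "a/g"]
print("example ratio:", ts_tv_ratio(observed))

# All 12 possible nucleotide changes, each counted once, as under equal rates.
all_changes = [f"{x}/{y}" for x in "ACGT" for y in "ACGT" if x != y]
print("expected ratio under equal rates:", ts_tv_ratio(all_changes))
```

Enumerating all twelve possible changes gives four transitions and eight transversions, which is where the expected ratio of 0.5 under equal substitution rates comes from.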
To study the distribution of the frequencies of the most common alleles in the
SNP data obtained from the Genome Database (GDB), I considered all biallelic
SNPs in this data set. The allele frequency distribution of the most common alleles
turned out to be normal, with a mean allele frequency of 0.7520 and a standard deviation
of 0.1272. The Kurtosis plot analysis of these allele frequency data
showed that the distribution of allele frequencies is symmetric as well as normal (K2 =
2.016915, corresponding to 0.25 < P < 0.50). This analysis will help in the future for
simulating SNP behavior, because allele frequencies play an important role in
detecting a strong association between a marker SNP and a susceptibility locus. If
marker allele frequencies are substantially different from susceptibility allele
frequencies, then one needs a large sample size or a large number of markers or both
to detect a strong association (McCarthy and Hilfiker 2000). Therefore, the allele
frequency distribution of common SNPs should help future simulation studies
of SNP behavior in association studies that involve allele
frequencies.
I analyzed the effect of different transition to transversion ratios (α/2β) on
linkage disequilibrium. Results from the linkage disequilibrium analysis of the simulated
population showed that the linkage disequilibrium remained approximately the same for
different transition to transversion ratios for the parameters used in the simulations.
The results obtained were for a sequence length of 1000 bp. As the recombination
rate increased, the mean LD value decreased. The increase in recombination rate also
resulted in a reduction in the number of polymorphic loci, thus reducing the strength of
LD. The reduction in the number of loci could be related to the small sequence length
(1000 bp) in the simulations and the high recombination rate. The average extent of useful
levels of LD in the general human population is approximately 3 kb, as shown by
Kruglyak (1999). It is possible that the α/2β ratio might have an effect on LD when
a longer sequence length is used. Longer sequences can be investigated in future
studies. One can also analyze the transitions and transversions separately in the
SNP data obtained from the simulation studies. In the present study, the LD was
evaluated between the polymorphic loci without considering whether each SNP was a
transition or a transversion.
No polymorphic sites were observed in the sequences simulated by
TREEVOLVE for a recombination rate of 10⁻³. This result was observed with a short
sequence length and a high recombination rate, which reduces the strength of LD
considerably, because the chances of a polymorphism being fixed in a population are
considerably lower at a higher recombination rate while simulating the sequences. This
phenomenon is a consequence of the way TREEVOLVE simulates the data. The
program tries to simulate a polymorphism in the population while generating the
sequences and not while simulating the population tree. When the recombination rate
is much higher than the mutation rate, TREEVOLVE simulates a population that does
not have polymorphism. These results support those previously obtained by Kruglyak
(1999), in an independent analysis, in which he found no LD at a recombination rate of
3×10⁻⁴ or higher (corresponding to a physical distance of approximately 30 kb). At such
a high rate of recombination, there is greater separation between polymorphic loci.
Such a large separation probably cannot be accommodated in a sequence length
of 1000 bp, resulting in the absence of any polymorphic loci in the population.
Part II
Application of SNPs in fine gene mapping
INTRODUCTION
As described in the earlier part of this thesis, association analysis has been of
considerable importance in fields ranging from population genetics and
pharmacogenomics to population genetic epidemiology and toxicology. An association
is said to exist between two phenotypes (one of which results in a disease) if
they occur in the same individual more often than expected by chance. To investigate
whether the two phenotypes are associated, one collects two groups of
individuals, one of affected individuals and one of controls (unaffected
individuals). Then, by counting the individuals who have the disease and the
other phenotype and the individuals who have the disease but not the other phenotype, one
can perform a standard 2×2 chi-square allelic association analysis or a 3×2 chi-square
genotypic association analysis to examine the significance of the association.
A combination of alleles at a specific gene characterizes the genotype of an
individual. Alleles and genotypes play a very important role in association analysis. If
a gene predisposes to a disease, it is possible that only one of the alleles of that
gene is actually responsible for the disease predisposition. Therefore, allelic
association was a common method of fine gene mapping until recently, although allelic
heterogeneity may complicate this method (Terwilliger and Weiss 1998). Genotypic
association analysis can be used in place of allelic association. In allelic
association, alleles are used to identify an association with disease. In contrast, in
genotypic analysis one looks at the association of a disease with genotypes (similar to
the genotypic linkage disequilibrium described by Weir 1996). In genotypic association
analysis one studies differences in genotype frequencies between healthy and affected
individuals to find whether any genotype is associated with the disease.
Objective
In this part, I describe an approach for fine gene mapping using SNPs from a
general population by using genotype-specific disequilibrium analysis. The approach
is different from the allelic disequilibrium analysis, because it considers the frequency
of a genotype and not of individual alleles.
Description of the simulated data
A committee of the Genetic Analysis Workshop 12 (GAW12), organized by the
Southwest Foundation for Biomedical Research, San Antonio, Texas, USA, simulated
the data for GAW12 (2000). The data used in this thesis were simulated for a large
general population. The disease prevalence in the population was about 25%. The
data were provided for 23 extended pedigrees with 1497 total individuals (1000
living) in the population. The disease was more prevalent in females than in males.
Five quantitative risk factors (Q1 to Q5) and two environmental factors (E1 and E2)
were also associated with the disease. Seven major genes (MG1 to MG7) influenced
these five quantitative risk factors. A major gene is a gene
related to the disease risk. The generating model is summarized in
figure 17; this model was not known during the analysis. In the data available to me,
each living individual had information on affectation status, age at last exam, age at
onset of disease if affected, five quantitative risk factors and two environmental
factors, as well as genotypic data for SNPs present in 7 candidate genes. These
candidate genes, numbered 1 through 7, were genes that might potentially affect the
disease; their data were provided by GAW12 for analysis. These candidate genes
were present in the major genes. The goal of the study described in this thesis was to
test whether any of these candidate genes contributed to the disease, and to
identify any functional SNPs that could be related to the disease risk. The data were
provided for 50 such replicates (50,000 living individuals). There were 165 founders
in each replicate. There was a total of 9515 original SNPs in the population of 50,000
individuals. The organizers also indicated which replicate
was the best in the data. This was replicate 42, whose data included
contributions from all the factors discussed above. It represented the best sample
population among all 50 replicates, with the mean simulated parameters being the
closest to the parameters originally used in the simulations.
Figure 17: The phenotypic model of the data simulated by GAW12. Boxes indicate items provided in the GAW12 data set, including quantitative traits, affectation status, age at onset, environmental factors (E1 and E2) and household membership (HH). Circles indicate genetic factors including seven major genes and a mitochondrial (Mito) component.
METHODS
Individuals used in the study
The study was performed on pedigree founders as well as all the living
pedigree members. A founder is an individual in a family tree who does not have any
living ancestors and is not related to anyone in his or her generation (Figure 18).
Figure 18: Founders in a pedigree. [Pedigree diagram omitted; its labels mark founders, nonfounders, and dead individuals.]
I considered founders initially, because they were unrelated. Such an approach
minimizes the correlations among family members in the SNP association analysis.
Several data sets were used in this study.
1) 8250 pedigree founders from all 50 replicates, studying all candidate genes
1 to 7.
2) 8250 pedigree founders from all 50 replicates, analyzing only genes 1
and 2 separately.
3) 165 pedigree founders from the best replicate 42, using only genes 1 and 2
separately.
4) 1000 living individuals from the best replicate 42, using only genes 1 and 2
separately.
Genotypic and phenotypic data
I considered the genotypic data for biallelic SNPs from 7 candidate genes for
each individual along with his or her affectation status. The genotypic data were for
715 candidate SNPs selected after the data reduction. The data for each SNP genotype
were provided by GAW12 in a numerically coded format, i.e. 11, 12, or 22, instead of the
nucleotides. Therefore, the transition or transversion nature of the polymorphisms
was not taken into account.
Data reduction
The genotypic data set was very large. Considering all 50 replicates, there
were 50,000 individuals. To reduce the dimensionality of the data set, only those
SNPs that were present in the pedigree founders in each of the 50
replicates were considered; they were selected by using a program called
DATACONVERT (SaoPedro, unpublished). As
a result, 715 SNPs were obtained from the total of 9515 SNPs after the data
reduction. I considered the SNPs from both coding and non-coding regions in the
analysis.
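The selection rule described above amounts to a set intersection across replicates, which can be sketched in Python as follows. This is not the DATACONVERT program, which is unpublished. The function name and the input layout (one set of SNP indices per replicate, listing the SNPs found in that replicate's founders) are assumptions made for illustration.

```python
def snps_in_every_replicate(replicate_snp_sets):
    """Return the sorted SNP indices that occur in every replicate's founder set."""
    sets = [set(s) for s in replicate_snp_sets]
    if not sets:
        return []
    return sorted(set.intersection(*sets))

# Made-up example with three replicates instead of 50.
founder_snps = [
    {1, 2, 3, 5},  # SNPs seen in the founders of replicate 1
    {2, 3, 5, 8},  # replicate 2
    {2, 5, 9},     # replicate 3
]
print(snps_in_every_replicate(founder_snps))
```

Only SNPs 2 and 5 occur in all three example replicates, so only those two would be retained.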
Sorting technique
The first step in the analysis was to sort the data. I considered affection status
as the variable of interest. I used the program Scansort (Ward 2000, unpublished), which
sorts the data according to the sum of the absolute differences between the frequencies of
healthy and affected subjects with the three SNP genotypes (11, 12, or 22, where 1 is
a wild type, or ancestral, allele and 2 is a mutated allele). The output of the program
is an eleven-dimensional data set, which contains one record for each SNP position in
the following order:
a. SNP index (original order)
b. Frequency of healthy individuals with genotype 11 (f11h)
c. Frequency of healthy individuals with genotype 12 (f12h)
d. Frequency of healthy individuals with genotype 22 (f22h)
e. Frequency of affected (sick) individuals with genotype 11 (f11s)
f. Frequency of affected (sick) individuals with genotype 12 (f12s)
g. Frequency of affected (sick) individuals with genotype 22 (f22s)
h. Sum of all the absolute differences (d)
i. Absolute difference between b and e
j. Absolute difference between c and f
k. Absolute difference between d and g
Table 6 represents this information in tabular form.
Table 6: Calculation of maximum difference by Scansort program.
Individual         Genotype
                   11      12      22
Healthy            f11h    f12h    f22h
Sick (Affected)    f11s    f12s    f22s
Scansort calculates d, the sum of the absolute differences, by using the
following formula.
d = (|f11h − f11s| + |f12h − f12s| + |f22h − f22s|) / (number of individuals in the data set)
Scansort sorts the SNP data according to the d value. Scansort therefore
orders the SNP data in a useful way, with the SNPs whose genotype proportions differ
most between healthy and affected individuals at the top of the list. The
output of Scansort was then used for the conventional 3×2 chi-square analysis.
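The ranking by d can be sketched in Python as shown below. This is not the Scansort program itself, which is unpublished; it only applies the formula given above. The sketch assumes that f11h through f22s are genotype counts (which is why the sum is divided by the number of individuals), and the function names, record layout, and example counts are hypothetical.

```python
def scansort_d(healthy, sick, n_individuals):
    """Sum of absolute genotype differences between groups, divided by sample size.

    healthy and sick are (n11, n12, n22) genotype counts; treating the f values
    of the formula as counts is an assumption of this sketch.
    """
    return sum(abs(h - s) for h, s in zip(healthy, sick)) / n_individuals

def sort_snps_by_d(records, n_individuals):
    """Return SNP indices ordered from the largest d value to the smallest.

    records maps a SNP index to a (healthy_counts, sick_counts) pair.
    """
    scored = [(scansort_d(h, s, n_individuals), idx)
              for idx, (h, s) in records.items()]
    return [idx for d, idx in sorted(scored, key=lambda t: -t[0])]

# Made-up counts for two SNPs in a sample of 200 individuals.
records = {
    1: ((60, 30, 10), (50, 35, 15)),  # d = (10 + 5 + 5) / 200 = 0.1
    2: ((40, 40, 20), (10, 30, 60)),  # d = (30 + 10 + 40) / 200 = 0.4
}
print(sort_snps_by_d(records, n_individuals=200))
```

SNP 2 has the larger d value and is therefore listed first, matching the ordering described above.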
Comparative statistical analysis of genotype frequencies
I used the chi-square statistic to identify the SNPs significantly associated with
the disease. The Bonferroni correction was applied to correct the significance levels
for multiple testing. Statistical analysis was performed using Microsoft Excel.
The chi-square statistic is calculated by using the following formula, summed over all cells of the contingency table:
$$\chi^2 = \sum \frac{(\text{observed frequency} - \text{expected frequency})^2}{\text{expected frequency}}$$
The chi-square analysis for each individual SNP position was performed by
considering a 3×2 contingency table (Table 7).
Table 7: Chi-square analysis for an individual SNP position.
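The test for a single SNP can be sketched in Python as follows. The analysis in this thesis was carried out in Microsoft Excel, so this code is only an equivalent illustration. Function names and example counts are hypothetical, and using the 715 candidate SNPs as the number of tests in the Bonferroni correction is an assumption. Because a 3×2 table has (3−1)×(2−1) = 2 degrees of freedom, the P-value has the closed form exp(−χ²/2).

```python
import math

def chi_square_3x2(healthy, sick):
    """Pearson chi-square test for a 3x2 table of genotype counts by disease status.

    healthy and sick are (n11, n12, n22) counts. Returns (statistic, p_value).
    Every genotype must be observed at least once across both groups.
    """
    col_totals = [sum(healthy), sum(sick)]
    row_totals = [h + s for h, s in zip(healthy, sick)]
    total = sum(row_totals)
    stat = 0.0
    for r, (h, s) in enumerate(zip(healthy, sick)):
        for c, observed in enumerate((h, s)):
            expected = row_totals[r] * col_totals[c] / total
            if expected == 0:
                raise ValueError("expected count of zero; the table is degenerate")
            stat += (observed - expected) ** 2 / expected
    # With 2 degrees of freedom the upper-tail probability is exp(-stat / 2).
    return stat, math.exp(-stat / 2.0)

def bonferroni_threshold(alpha, n_tests):
    """Per-test significance level after the Bonferroni correction."""
    return alpha / n_tests

# Made-up genotype counts for one SNP.
stat, p_value = chi_square_3x2(healthy=(60, 30, 10), sick=(30, 40, 30))
threshold = bonferroni_threshold(0.05, 715)
print(f"chi-square = {stat:.3f}, P = {p_value:.2e}, significant: {p_value < threshold}")
```

In this example the expected counts are 45, 35 and 20 in each column, so the statistic is 150/7, and the SNP would remain significant after the correction.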
27) The International SNP Map Working Group. A map of human genome
sequence variation containing 1.42 million single nucleotide polymorphisms.
Nature 409: 928-933 (2001).
28) Venter, C. et al. The sequence of the human genome. Science 291(5507):
1304-1351 (2001).
29) Weir, B. Disequilibrium. In: Weir, B. S. editor. Genetic data analysis II -
methods for discrete population genetic data. Sunderland, MA: Sinauer
Associates, Inc. p. 91-139 (1996).
30) Zar, J. H. Testing for goodness of fit. In: Zar, J. H. editor. Biostatistical
analysis. New Jersey: Prentice Hall. p. 461-485 (1999) (a).
Zar, J. H. The normal distribution. In: Zar, J. H. editor. Biostatistical analysis.
New Jersey: Prentice Hall. p. 65-90 (1999) (b).
31) Zollner, S. and Haeseler, A. A Coalescent approach to study linkage
disequilibrium between single-nucleotide polymorphisms. Am. J. Hum.
Genet. 66: 615-628 (2000).
APPENDICES
Appendix 1: An example of the SNP data for human chromosome 6 from the Whitehead Institute of Genome Research. In all there were 146 SNPs. (cR corresponds to the radiation hybrid distance.) VNTR (variable number of tandem repeats) are the top and bottom markers for that particular SNP position.
SNP NAME TYPE GENETIC
TSC0069979 Chr6 6p25 AL021328.1 NT_000213 c/t
Note: This example represents 20 SNPs out of 5102 SNPs collected from the SNP Consortium database for this project.