The First International Seminar on Science and Technology January 24, 2009
ISBN : 978 – 979 – 19201 – 0 – 0
Objective Bayesian Approach for SNP Data: Method, Simulation Study and Application
Adi Setiawan
Program Studi Matematika, Fakultas Sains dan Matematika
Universitas Kristen Satya Wacana, Jl. Diponegoro 52-60, Salatiga 50711, Indonesia (adi_setia_03@yahoo.com)
Abstract
Bayesian inference is often criticized for its reliance on prior distributions, whose choice influences the conclusion. In particular, in testing theory the necessity of assigning prior probabilities to the two hypotheses appears awkward. The objective Bayesian approach answers this criticism by an objective choice of priors. It aims at producing inference statements that depend only on the assumed model and the available data. In this paper we propose to use an objective Bayesian approach to find genes associated with a complex disease of interest, using SNPs (single nucleotide polymorphisms) as markers. A simulation study is then used to describe the properties of this approach. Finally, the approach is applied to whole-genome SNP data from samples of cases and controls in a case-control association study.
Keywords: objective Bayesian, single nucleotide polymorphism, case-control association study, intrinsic statistic.
Introduction
Genetic epidemiologists have taken up the challenge of identifying genetic polymorphisms involved in the development of complex diseases. Statistical methods have been developed for analyzing the relation between genetic polymorphisms such as SNPs (single nucleotide polymorphisms) and complex diseases in genetic association studies (for example, see [6], [4] and [1]). This paper describes an objective Bayesian approach for analyzing SNP data in case-control association studies. In case-control studies, we compare the genotypes of individuals who have a disease (cases) with the genotypes of individuals without the disease (controls). The proportions in each group having a characteristic of interest (for instance, a genotype of a given type) are then compared to determine whether there is an association between the complex disease and the characteristic of interest. The properties of the objective Bayesian approach are described by simulation. Finally, the approach is applied to whole-genome SNP data.
Statistical Method
Suppose that we observe data X whose distribution is governed by a parameter θ belonging to a parameter set Θ and we wish to investigate whether θ belongs to a subset Θ0 ⊂ Θ. Given a prior on Θ and a function δ( θ , Θ0) measuring discrepancy between a parameter θ and the null parameter set Θ0, it is natural to base inference on the posterior discrepancy
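$$ d(\Theta_0, X) = \int_{\Theta} \delta(\theta, \Theta_0)\, \pi(\theta \mid X)\, d\theta, $$
where π(θ | X) is the posterior distribution of θ. Bernardo and Rueda [2] propose to use a reference prior and the (symmetrized) Kullback-Leibler divergence as the discrepancy measure. The latter is defined as
$$ \delta(\theta, \Theta_0) = \inf_{\theta_0 \in \Theta_0} \min\{\, K(p(x \mid \theta_0),\, p(x \mid \theta)),\; K(p(x \mid \theta),\, p(x \mid \theta_0)) \,\} \qquad (1) $$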
for x → p(x | θ) the density of X and K the Kullback-Leibler divergence of p1 from p2,
$$ K(p_1, p_2) = \int p_1(x) \log\left( \frac{p_1(x)}{p_2(x)} \right) dx. $$
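For discrete distributions the integral in K becomes a sum, which makes the asymmetry of K and the role of the minimum in (1) easy to see. The following is a minimal sketch under that assumption; the function names and the two example distributions are illustrative and do not come from the paper.

```python
import math


def kl_divergence(p1, p2):
    """K(p1, p2) = sum_x p1(x) * log(p1(x) / p2(x)) for discrete distributions.

    Assumes p2(x) > 0 wherever p1(x) > 0; terms with p1(x) = 0 contribute 0.
    """
    total = 0.0
    for a, b in zip(p1, p2):
        if a > 0.0:
            total += a * math.log(a / b)
    return total


def symmetrized_discrepancy(p1, p2):
    """min{K(p1, p2), K(p2, p1)}, the quantity inside the infimum in (1)."""
    return min(kl_divergence(p1, p2), kl_divergence(p2, p1))


# Two illustrative distributions on two points (not taken from the paper).
p = [0.5, 0.5]
r = [0.9, 0.1]
print(kl_divergence(p, r), kl_divergence(r, p), symmetrized_discrepancy(p, r))
```

The two directed divergences differ for this pair, so the symmetrized version picks the smaller one; taking the infimum over the null set then gives the discrepancy in (1).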
Bernardo and Rueda [2] call the posterior discrepancy d(Θ0, X) based on these choices the intrinsic statistic. Reference priors are proposed as consensus priors designed to have a minimal effect, relative to the data, on the posterior inference. If the observed data X follow a smooth one-parameter model, the reference prior is Jeffreys' prior, i.e. $\pi(\theta) \propto i(\theta)^{1/2}$. For a k-parameter model with k > 1 we can also use Jeffreys' prior, which then takes the form
$$ \pi(\theta) \propto \left[ \det( i(\theta) ) \right]^{1/2}, $$
where i(θ) is the Fisher information function.
We shall apply this approach to testing whether the distributions of two independent observations X and Y are governed by the same parameter. The full parameter is a pair (θ, θ′) ranging over a product set, and the observation (X, Y) possesses density (x, y) → p(x | θ) p(y | θ′). The null model corresponds to the parameter set Θ0 = { (θ, θ′) : θ = θ′ ∈ Θ }. Because X and Y are independent, the reference prior is the product of the reference priors for the two parameters θ and θ′. In
our examples these are the Jeffreys' priors. The intrinsic statistic can be considered a measure of evidence against the null model. Bernardo and Rueda [2] claim that its numerical value may be interpreted on an absolute scale, independent of sample size and dimensionality. The null model is rejected if the intrinsic statistic d(Θ0,(X, Y)) is sufficiently large. Formally, if d(Θ0,(X, Y)) ≤ 2.5 then we have no evidence that the null model is false; if 2.5 < d(Θ0,(X, Y)) ≤ 5 then we have mild evidence that the null model is false; if 5 < d(Θ0,(X, Y)) ≤ 8.5 then we have strong evidence that the null model is false; and finally, if d(Θ0,(X, Y)) > 8.5 then we have conclusive evidence that the null model is false.
Single Marker Genotype-Based Method
The methods are based on a case-control design and try to find marker loci that are associated with the disease by comparing genotype frequencies between random samples of cases (diseased individuals) and controls. The methods can be classified as single-marker, double-marker or multiple-marker methods according to whether they take into account frequencies of markers at one locus or combinations of markers at two or more loci. In current practice, phase is not observed, i.e. the raw data consist of unordered genotypes at the marker loci and not of multi-locus haplotypes. For single-marker methods phase is not considered important, because it is commonly assumed that the paternal or maternal origin of an allele is not important for determining phenotype. Single-marker methods are, however, classified as allele-based or genotype-based according to whether they assume Hardy-Weinberg equilibrium or not. If $p_{AA}$, $p_{Aa}$ and $p_{aa}$ are the relative frequencies of genotypes AA, Aa and aa in the population, then Hardy-Weinberg equilibrium is said to hold if
$$ p_{AA} = p_A^2, \qquad p_{Aa} = 2\, p_A\, p_a, \qquad p_{aa} = p_a^2 $$
for $p_A$ and $p_a = 1 - p_A$ the frequencies of alleles A and a in the population. Allele-based single-marker methods parameterize the genotype frequencies through the single parameter $p_A$, using the preceding identities, whereas genotype-based methods let the vector $(p_{AA}, p_{Aa}, p_{aa})$ vary freely over the two-dimensional unit simplex. Under the assumptions of infinite population size, discrete generations, random mating, no selection, no migration, no mutation and equal initial genotype frequencies in the two sexes, Hardy-Weinberg equilibrium arises after one generation, and thereafter the genotype frequencies in the population are constant from generation to generation.
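The Hardy-Weinberg identities can be checked numerically. The sketch below maps an allele frequency to the implied genotype frequencies and tests whether a given genotype vector satisfies the identities; the function names and the tolerance are illustrative assumptions, not part of the paper.

```python
def hardy_weinberg_frequencies(p_A):
    """Genotype frequencies (p_AA, p_Aa, p_aa) implied by allele frequency p_A."""
    if not 0.0 <= p_A <= 1.0:
        raise ValueError("p_A must lie in [0, 1]")
    p_a = 1.0 - p_A
    return (p_A ** 2, 2.0 * p_A * p_a, p_a ** 2)


def allele_frequency(genotype_freqs):
    """Frequency of allele A: every AA carries two copies, every Aa one."""
    p_AA, p_Aa, _ = genotype_freqs
    return p_AA + 0.5 * p_Aa


def is_in_hardy_weinberg(genotype_freqs, tol=1e-9):
    """True if the genotype frequencies match those implied by their own p_A."""
    expected = hardy_weinberg_frequencies(allele_frequency(genotype_freqs))
    return all(abs(g - e) <= tol for g, e in zip(genotype_freqs, expected))


print(hardy_weinberg_frequencies(0.5))
print(is_in_hardy_weinberg((0.2, 0.2, 0.6)))
```

A genotype vector such as (0.2, 0.2, 0.6) has allele frequency 0.3 but does not satisfy the identities, which is the situation a genotype-based method can represent and an allele-based method cannot.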
Let (p1, p2, p3) and (q1, q2, q3) be the genotype frequencies in the populations of controls and cases, respectively. We take random samples of n controls and m cases, respectively. The layout of the data is given in Table 1. Testing association between the genotype of the marker locus and the disease is equivalent to testing the null hypothesis H0 : p = q versus the alternative hypothesis H1 : p ≠ q where p = (p1, p2, p3) and q = (q1, q2, q3).
|          | AA      | Aa      | aa      | Total |
| -------- | ------- | ------- | ------- | ----- |
| Controls | X1      | X2      | X3      | n     |
| Cases    | Y1      | Y2      | Y3      | m     |
| Pooled   | X1 + Y1 | X2 + Y2 | X3 + Y3 | n + m |

Table 1. Number of genotypes in the control and case samples.
Suppose that the random vectors X = (X1, X2, X3) and Y = (Y1, Y2, Y3) are the numbers of genotypes AA, Aa and aa in the samples of controls and cases, respectively. These vectors possess Multi(n, p) and Multi(m, q) distributions, respectively. The probability density function of X is
$$ f(x \mid p) = \binom{n}{x}\, p_1^{x_1} p_2^{x_2} p_3^{x_3}, \qquad \binom{n}{x} = \frac{n!}{x_1!\, x_2!\, x_3!}, $$
where $x_1, x_2, x_3 \in \{0, 1, 2, \ldots, n\}$, $x_1 + x_2 + x_3 = n$, $0 < p_1, p_2, p_3 < 1$, and $p_3 = 1 - p_1 - p_2$. The Fisher information matrix can be computed to be
$$ I(p_1, p_2) = \begin{pmatrix} \dfrac{n(1 - p_2)}{p_1 (1 - p_1 - p_2)} & \dfrac{n}{1 - p_1 - p_2} \\[2ex] \dfrac{n}{1 - p_1 - p_2} & \dfrac{n(1 - p_1)}{p_2 (1 - p_1 - p_2)} \end{pmatrix}. $$
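Jeffreys' prior is proportional to the square root of the determinant of this matrix. As a worked step, writing $p_3 = 1 - p_1 - p_2$ and using the entries $n(1-p_2)/(p_1 p_3)$, $n/p_3$ and $n(1-p_1)/(p_2 p_3)$, the determinant simplifies as follows:

```latex
\begin{aligned}
\det I(p_1, p_2)
  &= \frac{n^2 (1 - p_1)(1 - p_2)}{p_1 p_2 (1 - p_1 - p_2)^2}
     - \frac{n^2}{(1 - p_1 - p_2)^2} \\
  &= \frac{n^2 \left[ (1 - p_1)(1 - p_2) - p_1 p_2 \right]}{p_1 p_2 (1 - p_1 - p_2)^2} \\
  &= \frac{n^2 (1 - p_1 - p_2)}{p_1 p_2 (1 - p_1 - p_2)^2}
   = \frac{n^2}{p_1 p_2 p_3}.
\end{aligned}
```

Taking the square root and dropping the constant n gives a prior proportional to $(p_1 p_2 p_3)^{-1/2}$.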
The reference prior is given by
$$ \pi_1(p) \propto \left( \frac{1}{p_1 p_2 p_3} \right)^{1/2} = p_1^{\frac{1}{2} - 1}\, p_2^{\frac{1}{2} - 1}\, p_3^{\frac{1}{2} - 1}. $$
Thus, the reference prior is the Dirichlet(1/2, 1/2, 1/2)-distribution. The reference prior for q is the same
distribution, and the reference prior for (p, q) is the independent combination of the two marginal priors. The joint probability density function of X and Y is
$$ h(x, y \mid p, q) = f(x \mid p)\, g(y \mid q) = \binom{n}{x} p_1^{x_1} p_2^{x_2} p_3^{x_3} \binom{m}{y} q_1^{y_1} q_2^{y_2} q_3^{y_3}. $$
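Then the posterior density function of (p, q) given X = x and Y = y is
$$ \begin{aligned} \pi_3(p, q \mid x, y) &\propto \pi_1(p)\, \pi_2(q)\, h(x, y \mid p, q) \\ &\propto p_1^{\frac{1}{2} - 1} p_2^{\frac{1}{2} - 1} p_3^{\frac{1}{2} - 1}\; q_1^{\frac{1}{2} - 1} q_2^{\frac{1}{2} - 1} q_3^{\frac{1}{2} - 1}\; p_1^{x_1} p_2^{x_2} p_3^{x_3}\; q_1^{y_1} q_2^{y_2} q_3^{y_3} \\ &= p_1^{x_1 + \frac{1}{2} - 1} p_2^{x_2 + \frac{1}{2} - 1} p_3^{x_3 + \frac{1}{2} - 1}\; q_1^{y_1 + \frac{1}{2} - 1} q_2^{y_2 + \frac{1}{2} - 1} q_3^{y_3 + \frac{1}{2} - 1}. \end{aligned} $$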
Thus the posterior π3(p, q | x, y) is the independent combination of the Dirichlet(x1 + 1/2, x2 + 1/2, x3 + 1/2)-distribution for p and the Dirichlet(y1 + 1/2, y2 + 1/2, y3 + 1/2)-distribution for q.
We wish to test the null hypothesis H0 : p = q. Let Θ = { ( p, q) : 0 < p1, p2, p3 < 1, p1 + p2 + p3 = 1, 0 < q1, q2, q3 < 1, q1 + q2 + q3 = 1 }. The quotient of the densities of (X;Y) for two parameter values (p; q) and (r; s) is given by
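$$ \frac{h(x, y \mid p, q)}{h(x, y \mid r, s)} = \frac{\binom{n}{x} \binom{m}{y}\, p_1^{x_1} p_2^{x_2} p_3^{x_3}\, q_1^{y_1} q_2^{y_2} q_3^{y_3}}{\binom{n}{x} \binom{m}{y}\, r_1^{x_1} r_2^{x_2} r_3^{x_3}\, s_1^{y_1} s_2^{y_2} s_3^{y_3}} = \left( \frac{p_1}{r_1} \right)^{x_1} \left( \frac{p_2}{r_2} \right)^{x_2} \left( \frac{p_3}{r_3} \right)^{x_3} \left( \frac{q_1}{s_1} \right)^{y_1} \left( \frac{q_2}{s_2} \right)^{y_2} \left( \frac{q_3}{s_3} \right)^{y_3} $$
and
$$ \begin{aligned} \log\left( \frac{h(x, y \mid p, q)}{h(x, y \mid r, s)} \right) &= x_1 \log\left( \frac{p_1}{r_1} \right) + x_2 \log\left( \frac{p_2}{r_2} \right) + x_3 \log\left( \frac{p_3}{r_3} \right) \\ &\quad + y_1 \log\left( \frac{q_1}{s_1} \right) + y_2 \log\left( \frac{q_2}{s_2} \right) + y_3 \log\left( \frac{q_3}{s_3} \right). \end{aligned} $$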
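The Kullback-Leibler divergence between the probability density functions h(x, y | p, q) and h(x, y | r, s) is the expected value of this expression under the parameter value (p, q), given by
$$ \begin{aligned} K[(p, q) \mid (r, s)] &= E_{(p,q)}\left[ X_1 \log\left( \frac{p_1}{r_1} \right) + X_2 \log\left( \frac{p_2}{r_2} \right) + X_3 \log\left( \frac{p_3}{r_3} \right) \right] \\ &\quad + E_{(p,q)}\left[ Y_1 \log\left( \frac{q_1}{s_1} \right) + Y_2 \log\left( \frac{q_2}{s_2} \right) + Y_3 \log\left( \frac{q_3}{s_3} \right) \right] \\ &= n p_1 \log\left( \frac{p_1}{r_1} \right) + n p_2 \log\left( \frac{p_2}{r_2} \right) + n p_3 \log\left( \frac{p_3}{r_3} \right) \\ &\quad + m q_1 \log\left( \frac{q_1}{s_1} \right) + m q_2 \log\left( \frac{q_2}{s_2} \right) + m q_3 \log\left( \frac{q_3}{s_3} \right) \\ &= n L(p \mid r) + m L(q \mid s), \end{aligned} $$
where
$$ L(p \mid r) = p_1 \log\left( \frac{p_1}{r_1} \right) + p_2 \log\left( \frac{p_2}{r_2} \right) + p_3 \log\left( \frac{p_3}{r_3} \right). $$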
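The identity K[(p, q) | (r, s)] = n L(p | r) + m L(q | s) can be checked for small samples by summing the log density ratio over every possible pair of count vectors. The sketch below does this by brute force; the function names and the example parameter values are illustrative assumptions, and all probabilities are assumed strictly positive.

```python
import math


def L(p, r):
    """L(p | r) = sum_i p_i * log(p_i / r_i)."""
    return sum(pi * math.log(pi / ri) for pi, ri in zip(p, r) if pi > 0.0)


def kl_joint(p, q, r, s, n, m):
    """Closed form: K[(p, q) | (r, s)] = n L(p | r) + m L(q | s)."""
    return n * L(p, r) + m * L(q, s)


def multinomial_pmf(x, p):
    """Multinomial probability of the count vector x under cell probabilities p."""
    coef = math.factorial(sum(x))
    for xi in x:
        coef //= math.factorial(xi)
    prob = float(coef)
    for xi, pi in zip(x, p):
        prob *= pi ** xi
    return prob


def compositions(n):
    """All (x1, x2, x3) of non-negative integers with x1 + x2 + x3 = n."""
    return [(a, b, n - a - b) for a in range(n + 1) for b in range(n + 1 - a)]


def kl_by_enumeration(p, q, r, s, n, m):
    """Expected log density ratio under (p, q), summed over all outcomes (x, y)."""
    total = 0.0
    for x in compositions(n):
        for y in compositions(m):
            h_pq = multinomial_pmf(x, p) * multinomial_pmf(y, q)
            h_rs = multinomial_pmf(x, r) * multinomial_pmf(y, s)
            total += h_pq * math.log(h_pq / h_rs)
    return total


p, q = (0.2, 0.3, 0.5), (0.6, 0.3, 0.1)   # illustrative values only
r, s = (0.3, 0.3, 0.4), (0.5, 0.2, 0.3)
print(kl_joint(p, q, r, s, 3, 2), kl_by_enumeration(p, q, r, s, 3, 2))
```

The enumeration grows quickly with n and m, so it is only useful as a check; the closed form is what one would use in practice.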
The intrinsic discrepancy between h(x, y | p, q) and h(x, y | p0, q0), where (p0, q0) ranges over Θ0 (so that q0 = p0), is
$$ \delta[\Theta_0, (p, q)] = \inf_{p_0} \min\{\, K[(p, q) \mid (p_0, q_0)],\; K[(p_0, q_0) \mid (p, q)] \,\}, $$
where the infimum is taken over all probability vectors p0 = (p01, p02, p03). Then the intrinsic statistic is given by
$$ d[\Theta_0, (x, y)] = \iint_{\Theta} \delta[\Theta_0, (p, q)]\; \pi_3(p, q \mid x, y)\, dp\, dq. $$
The intrinsic statistic cannot be found in closed form, but may easily be computed by Monte Carlo integration.
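A minimal sketch of such a Monte Carlo computation follows. It is not the author's implementation. It assumes that the inner infimum over p0 is evaluated in closed form: the infimum of K[(p, q) | (p0, p0)] is attained at the pooled vector p0 = (n p + m q)/(n + m), and the infimum of K[(p0, p0) | (p, q)] equals -(n + m) log Σ p_i^a q_i^b with a = n/(n + m) and b = m/(n + m). Both follow from minimising over the simplex. The function names, the number of draws and the seed are illustrative choices; the counts are those of Table 2.

```python
import math
import random


def sample_dirichlet(alpha, rng):
    """Draw one Dirichlet(alpha) vector by normalising independent Gamma draws."""
    gammas = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(gammas)
    return [g / total for g in gammas]


def L(p, r):
    """L(p | r) = sum_i p_i log(p_i / r_i); terms with p_i = 0 contribute 0."""
    return sum(pi * math.log(pi / ri) for pi, ri in zip(p, r) if pi > 0.0)


def intrinsic_discrepancy(p, q, n, m):
    """delta[Theta_0, (p, q)], using closed-form infima over p0 (an assumption of this sketch)."""
    a, b = n / (n + m), m / (n + m)
    # inf over p0 of K[(p, q) | (p0, p0)], attained at the pooled vector.
    pooled = [a * pi + b * qi for pi, qi in zip(p, q)]
    k_from_alternative = n * L(p, pooled) + m * L(q, pooled)
    # inf over p0 of K[(p0, p0) | (p, q)] = -(n + m) log sum_i p_i^a q_i^b.
    z = sum(pi ** a * qi ** b for pi, qi in zip(p, q))
    k_from_null = -(n + m) * math.log(z)
    # Guard against tiny negative values caused by rounding when p is close to q.
    return max(0.0, min(k_from_alternative, k_from_null))


def intrinsic_statistic(x, y, draws=20000, seed=0):
    """Monte Carlo average of the discrepancy over draws from the Dirichlet posteriors."""
    rng = random.Random(seed)
    n, m = sum(x), sum(y)
    alpha_p = [xi + 0.5 for xi in x]
    alpha_q = [yi + 0.5 for yi in y]
    total = 0.0
    for _ in range(draws):
        p = sample_dirichlet(alpha_p, rng)
        q = sample_dirichlet(alpha_q, rng)
        total += intrinsic_discrepancy(p, q, n, m)
    return total / draws


# Genotype counts (AA, Aa, aa) of controls and cases from Table 2.
print(intrinsic_statistic((11, 14, 2), (29, 2, 0)))
```

The Monte Carlo error shrinks with the square root of the number of draws, so the estimate can be made as precise as needed at modest cost.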
In the following, we illustrate the techniques on data collected by the Department of Medical Genetics at the Vrije Universiteit Medical Center Amsterdam. The genotype data came from a genetically isolated population in Turkey with a current population size of around 6000 people. Ninety percent of these people are supposed to be descendants of 23 families that originally inhabited the region approximately 400 years ago. Genotyping was done by applying the Affymetrix 10K SNP chip to 27 controls and 31 cases. We summarized characteristics of the 11229 SNPs, such as the identity of the SNP in the chromosome and the genotype of every individual in the control and case samples. The genotypes of individuals are coded as AA, AB or BB; a missing genotype is coded as “NoCall”, meaning that the marker did not pass the discrimination filter [7].
A case-control analysis was carried out for the biallelic marker given by the SNP with identity 1513978 on chromosome 2; the results are given in Table 2. Using Table 2 and the objective Bayesian approach for single marker genotype-based methods, we find the intrinsic statistic 9.8827. Thus, we have
conclusive evidence that there is an association between the marker and the disease.
|          | AA | Aa | aa | Total |
| -------- | -- | -- | -- | ----- |
| Controls | 11 | 14 | 2  | 27    |
| Cases    | 29 | 2  | 0  | 31    |
| Pooled   | 40 | 16 | 2  | 58    |

Table 2. Number of genotypes in the control and case samples.
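The evidence scale quoted earlier (thresholds 2.5, 5 and 8.5) can be written as a small lookup, which makes the reading of a value such as 9.8827 explicit. The function name and the returned labels are illustrative; only the thresholds come from the text.

```python
def evidence_category(d):
    """Map an intrinsic statistic d to the evidence scale used in this paper."""
    if d <= 2.5:
        return "no evidence"
    if d <= 5.0:
        return "mild evidence"
    if d <= 8.5:
        return "strong evidence"
    return "conclusive evidence"


# Intrinsic statistic reported for the SNP of Table 2.
print(evidence_category(9.8827))
```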
Simulation Study and Application
Genotypes AA, Aa and aa are generated for 50 individuals in the control sample using the parameter (pAA, pAa, paa) = (0.2, 0.2, 0.6). Simulated data for 50 individuals in the case sample are generated in the same way. The simulated control and case samples and the related intrinsic statistics are given in Table 3. Based on the table, we conclude that the intrinsic statistic tends to give no evidence that the null model is false (there is no association between the marker and the disease). Table 4 presents simulated data and their intrinsic statistics when we use the parameters (pAA, pAa, paa) = (0.2, 0.2, 0.6) and (0.2, 0.6, 0.2) to generate the control sample and the case sample, respectively. Based on the table, as we expect, the intrinsic statistic tends to give strong evidence that the null model is false (there is an association between the marker and the disease).
The simulation can be extended to different sample sizes. Table 5 presents the intrinsic statistics obtained with sample sizes 50, 100 and 500. We conclude that larger sample sizes tend to give larger intrinsic statistics.
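The data-generating step of such a simulation can be sketched as follows. Each individual's genotype is drawn independently with the stated probabilities, so the count vector is multinomial. The function name and the seed are illustrative assumptions; the sketch only generates the samples and does not reproduce the values in the paper's tables.

```python
import random

GENOTYPES = ("AA", "Aa", "aa")


def simulate_genotype_counts(size, freqs, rng):
    """Draw `size` genotypes with probabilities `freqs`; return counts (AA, Aa, aa)."""
    sample = rng.choices(GENOTYPES, weights=freqs, k=size)
    return tuple(sample.count(g) for g in GENOTYPES)


rng = random.Random(2009)   # arbitrary seed for reproducibility
controls = simulate_genotype_counts(50, (0.2, 0.2, 0.6), rng)
cases_null = simulate_genotype_counts(50, (0.2, 0.2, 0.6), rng)   # no association
cases_alt = simulate_genotype_counts(50, (0.2, 0.6, 0.2), rng)    # association
print(controls, cases_null, cases_alt)
```

Repeating the draw for sample sizes 50, 100 and 500 and feeding each pair of count vectors to an intrinsic-statistic routine gives the kind of comparison described above.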
The objective Bayesian approach can then be applied to each SNP in the whole-genome 10K SNP data. In this approach, a SNP is called associated with the complex disease of interest if its intrinsic statistic is larger than 5. As the method claims to produce a measure of evidence that can be interpreted on an absolute scale, no correction for multiple testing appears to be necessary. Based on this approach, we find 111 associated markers.
Table 3. The result of simulated data in 50 controls sample, 50 cases sample and the related intrinsic statistic.
In this paper, we have described an objective Bayesian approach to analyzing SNP data in case-control association studies. A simulation study was carried out to describe the properties of the approach, which was then applied to a whole-genome association study. The research can be extended to SNP data with more than 200K SNPs across the whole genome, as in [5].
References
[1] Balding, D. J. 2006. A tutorial on statistical methods for population association studies. Nature Reviews Genetics 7: 781.
[2] Bernardo, J. M. & Rueda, R. 2002. Bayesian Hypothesis Testing: A Reference Approach. International Statistical Review 70: 351-372.
[4] Heidema, A. G., J. M. A. Boer, N. Nagelkerke, E. C. M. Mariman, D. L. van der A, E. J. M. Feskens. 2006. The challenge for genetic epidemiologists: how to analyze large numbers of SNPs in relation to complex diseases. BMC Genetics 7: 23.
[5] Hoggart, C. J., J. C. Whittaker, M. De Iorio, D. J. Balding. 2008. Simultaneous Analysis of All SNPs in Genome-Wide and Re-Sequencing Association Studies. PLoS Genetics 4 (7).
[6] Hu, N., C. Wang, Y. Hu, H. H. Yang, C. Giffen, Z. Tang, X. Han, A. M. Goldstein, M. R. Emmert-Buck, K. H. Buetow, P. R. Taylor, M. P. Lee. 2005. Genome-Wide Association Study in Esophageal Cancer Using GeneChip Mapping 10K Array. Cancer Research 65 (7): 2542-2546.
[7] Setiawan, A. 2007. Statistical Data Analysis of Genetic Data in Twin Studies and Association Studies. Ph.D. thesis, Vrije Universiteit, Amsterdam. ISBN 978-90-9021728.