Research Article
Detecting Genetic Interactions for Quantitative Traits Using $m$-Spacing Entropy Measure

Jaeyong Yee,1 Min-Seok Kwon,2 Seohoon Jin,3 Taesung Park,4 and Mira Park5

1 Department of Physiology and Biophysics, Eulji University, Daejeon, Republic of Korea
2 Department of Bioinformatics, Seoul National University, Seoul, Republic of Korea
3 Department of Informational Statistics, Korea University, Jochiwon, Republic of Korea
4 Department of Statistics, Seoul National University, Seoul, Republic of Korea
5 Department of Preventive Medicine, Eulji University, Daejeon, Republic of Korea

Correspondence should be addressed to Mira Park; mira@eulji.ac.kr
Received 14 November 2014; Revised 4 February 2015; Accepted 8 March 2015
Academic Editor: Xiang-Yang Lou
Copyright © 2015 Jaeyong Yee et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
A number of statistical methods for detecting gene-gene interactions have been developed in genetic association studies with binary traits. However, many phenotype measures are intrinsically quantitative, and categorizing continuous traits may not always be straightforward or meaningful. The association of gene-gene interactions with an observed distribution of such phenotypes needs to be investigated directly, without categorization. Information gain based on an entropy measure has previously been successful in identifying genetic associations with binary traits. We extend the usefulness of this information gain by proposing a nonparametric method for evaluating the conditional entropy of a quantitative phenotype associated with a given genotype. Hence, the information gain can be obtained for any phenotype distribution. Because no functional form, such as Gaussian, is assumed for the entire distribution of a trait or for a given genotype, this method is expected to be robust enough to be applied to any phenotypic association data. Here we show its use to successfully identify the main effect as well as the genetic interactions associated with a quantitative trait.
1. Introduction
Recent advances in high-throughput genotyping techniques have produced massive volumes of genetic data. Although it is common to analyze single SNP effects extensively, such approaches cannot adequately explain the intricate genetic contributions to complex diseases such as hypertension, diabetes, and certain psychiatric disorders. Consequently, a large share of the genetic component remains unexplained. Gene-gene interaction analysis may be one way to address this missing heritability problem [1].
For case-control studies, which formulate the measures for a binary trait, a number of statistical methods for detecting gene-gene interactions have been proposed. One of the most popular methods is multifactor dimensionality reduction (MDR) [2], which converts a high-dimensional contingency table to a one-dimensional model without raising the issue of sparse cells. Several variants of MDR have recently been developed [3–8], while another approach was developed [9–11] from information theory [12, 13]. More recently, an entropy-based approach, which utilizes the relative gain of information as well as its standardized measure, has also been proposed [14].
However, for quantitative traits such as blood pressure, body mass index, and patient survival times, relatively few attempts have been made to analyze genetic interactions. Because many phenotype measures are intrinsically quantitative, and categorizing a continuous trait may not always be straightforward or meaningful, the association of gene-gene interactions with an observed distribution of such phenotypes needs to be investigated directly, without categorization. To that end, introducing a new statistic is one way to tackle the problem [15]. Extending the MDR algorithm to continuous traits, as in the generalized MDR (GMDR) and the model-based MDR (MB-MDR), has been proposed [3, 6]. More recently, a quantitative MDR (QMDR) was proposed that replaces the balanced accuracy metric with a $t$-test statistic [16]. However, these MDR-based approaches may oversimplify the original data to some degree through classification of phenotypes. An entropy-based approach may well be an alternative model. Entropy is commonly used in information theory to measure the uncertainty of random variables [12, 13], and information gain, or mutual information, has been shown to be useful for representing association strengths [17–19]. Although the usefulness of such information-theoretical methods is well known, statistical methods based on this approach for analyzing gene-gene interactions of quantitative traits are rarely found, with the exception of one specific case [20]. However, the application of that method may be limited by its assumption of a normal distribution.

Hindawi Publishing Corporation, BioMed Research International, Volume 2015, Article ID 523641, 10 pages, http://dx.doi.org/10.1155/2015/523641
Here we extend the usefulness of the information concept to quantitative traits by considering nonparametric estimates based on sample-spacing, or $m$-spacing [22–25], for the conditional entropy of a quantitative phenotype given a genotype. The challenge, therefore, is to couple a nonparametric entropy estimator with correct and stable information gains. We thus developed the information gain standardized (IGS) approach and applied it to datasets composed of several genotypes and a quantitative trait. This approach can be considered an extension of previous work on categorical traits [14] to quantitative phenotypes. Unlike other methods such as the variants of MDR, the proposed method does not attempt in any way to classify quantitative phenotypes but instead handles them directly, which provides the intrinsic advantage of removing the chance of misclassification. While a previous entropy-based method of analyzing quantitative traits assumed the shape of the trait distribution to be normal [20], our method does not need to specify the distribution to estimate the association. Any regular or irregular distribution can be handled without difficulty. Although this is also an advantage of GMDR and QMDR, we propose a method that takes the advantageous characteristics from both of those methods. We also performed extensive simulation studies to compare the power of the proposed method with that of QMDR and GMDR, demonstrating its advantage in detection power.
In the following sections, after a brief review of nonparametric entropy estimation, we describe a new method for modeling genetic interactions. In Materials and Methods, a nonparametric entropy estimator is modified so that it couples successfully with genetic datasets. In Results and Discussions, the information gain standardized (IGS) approach is evaluated on both simulated and real datasets.
2. Materials and Methods
2.1. Estimation of the Entropy for a Continuous Variable. If $X$ is a random vector with probability density function $f(x)$, its differential entropy is defined by

$$H(f) = -\int f(x) \ln f(x)\, dx. \tag{1}$$
A well-known approach for estimating this entropy is to use plug-in estimates. In this approach, $f(x)$ is first estimated using a standard density estimation method, such as a histogram or kernel density estimator, and the entropy is then computed. Integral, resubstitution, splitting-data, and cross-validation estimates are among the usual plug-in estimates [22]. Another approach is based on sample-spacing. Let $X_k$ be a set of independent and identically distributed real-valued random variables with corresponding order statistics $X_{n,k}$. Here $n$ represents the total number of measured samples. For arbitrary integers $i$ and $m$ satisfying $1 \le i < i + m \le n$, a spacing of order $m$, or $m$-spacing, is defined as $X_{n,i+m} - X_{n,i}$. A density estimate based on sample-spacing $m$ is then constructed as
$$f_n(x) = \frac{m}{n} \cdot \frac{1}{X_{n,im} - X_{n,(i-1)m}}, \tag{2}$$
where $x \in [X_{n,(i-1)m}, X_{n,im})$ [14]. This density estimate is consistent if, as $n \to \infty$, $m \to \infty$ and $m/n \to 0$ [22]. Several variations of an entropy estimator with minor differences have been proposed, all based on the above density estimate [23, 24]. Among them, the following was reported to approximate the entropy with lowered variance [25]:
$$H_{m,n} = \frac{1}{n - m} \sum_{k=1}^{n-m} \ln\left( \frac{n}{m} \left( X_{n,k+m} - X_{n,k} \right) \right). \tag{3}$$
The asymptotic bias of this estimator can be corrected by adding terms that include the digamma function [22, 28]:
$$H_{m,n} = \frac{1}{n - m} \sum_{k=1}^{n-m} \ln\left( \frac{n}{m} \left( X_{n,k+m} - X_{n,k} \right) \right) - \frac{\Gamma'(m)}{\Gamma(m)} + \ln m. \tag{4}$$
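As a minimal sketch of estimators (3) and (4), the following pure-Python code may help make the indexing concrete. The function names `digamma_int` and `m_spacing_entropy` are illustrative and do not come from the article. The digamma term $\Gamma'(m)/\Gamma(m)$ is evaluated with the standard identity $\psi(m) = -\gamma + \sum_{k=1}^{m-1} 1/k$ for positive integers $m$, and the sample is assumed to contain no tied values, since a zero spacing has no logarithm.

```python
import math
import random

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def digamma_int(m):
    """Digamma function at a positive integer: psi(m) = -gamma + sum_{k<m} 1/k."""
    return -EULER_GAMMA + sum(1.0 / k for k in range(1, m))


def m_spacing_entropy(sample, m, bias_correction=True):
    """m-spacing entropy estimate in nats: (3), or (4) when bias-corrected.

    Assumes continuous data without ties; a zero spacing makes math.log fail.
    """
    x = sorted(sample)  # order statistics X_{n,1} <= ... <= X_{n,n}
    n = len(x)
    if not 1 <= m < n:
        raise ValueError("m must satisfy 1 <= m < n")
    total = 0.0
    for k in range(n - m):  # zero-based k covers k = 1, ..., n - m of (3)
        total += math.log(n / m * (x[k + m] - x[k]))
    estimate = total / (n - m)
    if bias_correction:
        estimate += -digamma_int(m) + math.log(m)
    return estimate


# Illustrative use on a standard normal sample (sizes chosen arbitrarily).
rng = random.Random(0)
data = [rng.gauss(0.0, 1.0) for _ in range(400)]
print("estimate with m = 20:", m_spacing_entropy(data, m=20))
print("analytic value      :", math.log(math.sqrt(2.0 * math.pi * math.e)))
```

The correction term $-\psi(m) + \ln m$ equals the Euler-Mascheroni constant at $m = 1$ and shrinks roughly like $1/(2m)$, which is one way to see why the two estimators coincide for large $m$.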
As $m$ increases, the correction terms become negligible and the two estimators coincide. Our evaluation of the entropy of a phenotype, $H(P)$, of a quantitative trait is based on this estimator.
2.2. Modification of the $m$-Spacing Based Entropy Estimator. The estimator in (4) has both $n$ and $m$ as parameters. In genetic association studies, a sample size $n$ of several hundred is common. However, when the conditional entropy is estimated, a minor allele may correspond to a much smaller number of samples. Moreover, the choice of the sample-spacing $m$ affects the resulting entropy estimate. Therefore, an entropy estimation scheme is required that is independent of the number of samples and does not need a particular value of the sample-spacing to be chosen. To illustrate this requirement, an ensemble of 3000 sets of random deviates from $N(0, 1^2)$ was generated for each data point in Figure 1, where the mean and standard deviation of the estimates are plotted for each ensemble. In the left panel of Figure 1, $m$ is fixed to 10 and 20 while $n$ is varied. The analytic formula for the entropy of a normal distribution is as follows [20], where $e$ is Euler's number:
$$H = \ln\left( \sigma \sqrt{2 \pi e} \right). \tag{5}$$
[Figure 1, panel (a): estimated entropy $H$ (vertical axis, 1.0 to 1.5) versus number of samples $n$ (horizontal axis, $10^1$ to $10^5$) for $m = 10$ and $m = 20$, with $\sigma = 1.0$. Panel (b): estimated entropy $H$ (1.0 to 1.5) versus sample-spacing $m$ (0 to 400) for $n = 400$, with $\sigma = 1.0$.]

Figure 1: The $n$-dependence (a) and $m$-dependence (b) of the entropy estimator $H_{m,n}$. An ensemble of 3000 sets of random samplings from $N(0, 1^2)$ was constructed and used for each point in the plot. In (a), the sample-spacing $m$ was fixed while the number of samples $n$ was varied to evaluate the $n$-dependence of the entropy estimator. In (b), $n$ was fixed and $m$ was varied to show the $m$-dependence. Analytically obtained true values are represented by the arrowed horizontal lines.
The calculated value of (5) is marked on the vertical axis by a horizontal arrow, with the corresponding $\sigma$ above it. The obvious $n$-dependence of the estimator can be seen in this plot, where the estimate approaches the analytic value as $n$ increases, with $\sqrt{n}$-consistency as expected [24]. In Figure 1(b), $n$ is fixed to 400 while $m$ is varied. In this plot, the estimated entropy again changes in value throughout the possible range of $m$. The estimated value is always smaller than the analytically calculated value. Therefore, assigning a particular value to $m$, such as the typical choice $\sqrt{n}$ [25], would not be appropriate in this sampling range. Because of these $n$- and $m$-dependences, the estimator in (4) needs to be modified. We therefore modify the entropy estimator in (4) as follows:
$$H_{\langle m \rangle, n} = \frac{1}{n - 1} \sum_{m=1}^{n-1} \left( \frac{1}{n - m} \sum_{k=1}^{n-m} \ln\left( \frac{n}{m} \left( X_{n,k+m} - X_{n,k} \right) \right) - \frac{\Gamma'(m)}{\Gamma(m)} + \ln m \right). \tag{6}$$
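The averaging over $m$ in (6) can be sketched in a few lines of Python. This is an illustrative implementation under stated assumptions, not the authors' code: the name `averaged_m_spacing_entropy` is hypothetical, the input is assumed to have no tied values, and the digamma term is accumulated through the harmonic sum, which is valid for integer $m$.

```python
import math
import random

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def averaged_m_spacing_entropy(sample):
    """Estimator (6): the bias-corrected estimator (4) averaged over m = 1..n-1.

    Runs in O(n^2) time. Assumes at least two distinct, untied values.
    """
    x = sorted(sample)
    n = len(x)
    if n < 2:
        raise ValueError("at least two samples are required")
    total = 0.0
    harmonic = 0.0  # sum_{k=1}^{m-1} 1/k, so that psi(m) = harmonic - gamma
    for m in range(1, n):
        inner = sum(math.log(n / m * (x[k + m] - x[k])) for k in range(n - m))
        inner /= n - m
        psi_m = harmonic - EULER_GAMMA
        total += inner - psi_m + math.log(m)
        harmonic += 1.0 / m
    return total / (n - 1)


# Illustrative use: the estimate for a few sample sizes drawn from N(0, 1).
rng = random.Random(0)
for n in (25, 100, 400):
    data = [rng.gauss(0.0, 1.0) for _ in range(n)]
    print(n, averaged_m_spacing_entropy(data))
```

Because no spacing parameter has to be chosen, the only input is the sample itself, which is the property the modification is designed to provide. The quadratic cost is the price of averaging over every $m$.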
In this modification, the entropy estimator is averaged over the possible $m$ values for each $n$, which is denoted by $\langle m \rangle$. This estimator is used to plot the entropy versus the number of samples in Figure 2. Over a wide range of $n$, this entropy estimator yields very stable values, in contrast to Figure 1(a). The increase in the extremely small $n$ range should be within the tolerable error in an application of genome-wide association, as the contribution to the conditional entropy by such a minor allele would be suppressed by the weighting factor of the marginal probability, which is proportional to the number of corresponding samples. Analytically obtained entropy values for $N(0, \sigma^2)$ with three different values of $\sigma$ are marked on the vertical axis on the right-hand side. Regardless of the value of $\sigma$, the difference $\Delta$ between the analytically obtained value and the value given by the estimator stays essentially the same. Considering that an association study measures the difference between the entropy and the corresponding conditional entropy, stability is a more critical issue than the absolute value of the estimates. Therefore, compensation of this $\Delta$ is not necessary as long as it is stable. Furthermore, the underestimation of the entropy shown in the plot should have little effect on the association strength. Hence, an entropy estimator has been set up that satisfies practical $n$-independence without the need to find a proper sample-spacing.
2.3. Evaluation of a Conditional Entropy. Now let $G$ be a categorical variable assigned to each sample measurement $X_k$. $G$ may be a genotype given by a measured SNP or a combination of SNPs, while $X_k$ represents the measured value of a phenotype. For detecting the main effect of a single SNP, $G$ consists of three categories: $G = 0$, $G = 1$, and $G = 2$. For detecting the interaction between $\mathrm{SNP}_i$ and $\mathrm{SNP}_j$, $G$ consists of 9 categories such that $G = 0 = (\mathrm{SNP}_i = 0, \mathrm{SNP}_j = 0)$, $G = 1 = (\mathrm{SNP}_i = 0, \mathrm{SNP}_j = 1)$, $G = 2 = (\mathrm{SNP}_i = 0, \mathrm{SNP}_j = 2)$, $G = 3 = (\mathrm{SNP}_i = 1, \mathrm{SNP}_j = 0)$, $\ldots$, and $G = 8 = (\mathrm{SNP}_i = 2, \mathrm{SNP}_j = 2)$. Detection of higher order interactions can be performed in the same way by expanding the categories of $G$. Then an estimator for each specific component of the conditional entropy, $H(P \mid G = g)$, can be constructed using the genotype-selected subset of measurements $X_{n_g,k}$ along with an individual sample-spacing $m_g$. Extending (6) while applying the above argument readily produces the estimators for the entropy of a phenotype and for the conditional entropy. Here $d$ denotes the order of a gene-gene interaction.
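$$H(P) = \frac{1}{n - 1} \sum_{m=1}^{n-1} \left( \frac{1}{n - m} \sum_{k=1}^{n-m} \ln\left( \frac{n}{m} \left( X_{n,k+m} - X_{n,k} \right) \right) - \frac{\Gamma'(m)}{\Gamma(m)} + \ln m \right), \tag{7}$$

$$H(P \mid G) = \sum_{g=0}^{3^d - 1} \frac{n_g}{n} \cdot \frac{1}{n_g - 1} \sum_{m_g=1}^{n_g - 1} \left( \frac{1}{n_g - m_g} \sum_{k=1}^{n_g - m_g} \ln\left( \frac{n_g}{m_g} \left( X_{n_g,k+m_g} - X_{n_g,k} \right) \right) - \frac{\Gamma'(m_g)}{\Gamma(m_g)} + \ln m_g \right), \tag{8}$$

where $n_g$ is the number of samples carrying genotype $g$, so that $n_g / n$ is the marginal probability weighting each component.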
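The genotype coding and the weighted conditional entropy described above can be sketched as follows. All function names are hypothetical, and two choices are assumptions rather than details from the article: genotype combinations are coded as base-3 integers, which reproduces the listed mapping such as $(0, 1) \to 1$ and $(2, 2) \to 8$, and genotype groups with fewer than two samples are skipped because no spacing can be formed from them. Phenotype values are assumed to be untied.

```python
import math
import random

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def entropy(sample):
    """Bias-corrected m-spacing entropy averaged over m = 1..n-1, as in (6)."""
    x = sorted(sample)
    n = len(x)
    total, harmonic = 0.0, 0.0
    for m in range(1, n):
        inner = sum(math.log(n / m * (x[k + m] - x[k])) for k in range(n - m))
        total += inner / (n - m) - (harmonic - EULER_GAMMA) + math.log(m)
        harmonic += 1.0 / m
    return total / (n - 1)


def encode_genotype(snps):
    """Map a tuple of SNP values (each 0, 1, or 2) to one category 0..3^d - 1."""
    category = 0
    for value in snps:
        category = 3 * category + value
    return category


def conditional_entropy(phenotype, genotype):
    """Sum over genotype categories of (n_g / n) * H(P | G = g)."""
    groups = {}
    for y, g in zip(phenotype, genotype):
        groups.setdefault(g, []).append(y)
    n = len(phenotype)
    # Assumption: groups with a single sample are skipped (no spacing exists).
    return sum(len(ys) / n * entropy(ys) for ys in groups.values() if len(ys) > 1)


def information_gain(phenotype, genotype):
    """IG = H(P) - H(P | G)."""
    return entropy(phenotype) - conditional_entropy(phenotype, genotype)


# Illustrative data: two SNPs whose joint genotype shifts the phenotype mean.
rng = random.Random(0)
snp_i = [rng.choice((0, 1, 2)) for _ in range(90)]
snp_j = [rng.choice((0, 1, 2)) for _ in range(90)]
pair = [encode_genotype(p) for p in zip(snp_i, snp_j)]
trait = [rng.gauss(2.0 if a == b else 0.0, 1.0) for a, b in zip(snp_i, snp_j)]
print("IG for the SNP pair:", information_gain(trait, pair))
```

Weighting each component by $n_g / n$ is what keeps sparsely populated genotype categories from dominating the result.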
2.4. Standardized Measure of an Association Strength. Since differential entropy values are scale-dependent, when the above estimators are calculated with $X_i$ and $cX_i$ (where $c$ is a constant scale factor), the difference would be $\ln c$. For example, if the phenotype is height, it may be measured in meters or centimeters. In this case, the scale factor is 100. Nevertheless, the association strength should be the same. Also note that a negative value is perfectly legitimate for a differential entropy. Information gain, IG, in the form defined with discrete entropies [14], satisfies scale independence while correctly representing an association strength without being affected by negative values. Therefore, it retains its usefulness as a measure of association strength:
$$\mathrm{IG} = H(P) - H(P \mid G). \tag{9}$$
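The scale argument can be checked numerically with the analytic Gaussian entropy (5). The sketch below is only an illustration: the standard deviations and genotype frequencies are made-up values, and treating every distribution as Gaussian is a simplification used to keep the arithmetic transparent. It shows that each entropy shifts by $\ln c$ under rescaling, that an entropy can be negative, and that the shift cancels in (9) because the genotype weights sum to one.

```python
import math


def gaussian_entropy(sigma):
    """Differential entropy in nats of a normal distribution, formula (5)."""
    return math.log(sigma * math.sqrt(2.0 * math.pi * math.e))


def information_gain(h_total, weights, h_conditional):
    """IG = H(P) - sum_g p_g * H(P | G = g), as in (9)."""
    return h_total - sum(w * h for w, h in zip(weights, h_conditional))


scale = 100.0                            # meters to centimeters
sd_total_m = 0.07                        # hypothetical SD of height in meters
sd_by_genotype_m = [0.05, 0.06, 0.065]   # hypothetical within-genotype SDs
weights = [0.64, 0.32, 0.04]             # hypothetical genotype frequencies

ig_m = information_gain(
    gaussian_entropy(sd_total_m),
    weights,
    [gaussian_entropy(s) for s in sd_by_genotype_m],
)
ig_cm = information_gain(
    gaussian_entropy(scale * sd_total_m),
    weights,
    [gaussian_entropy(scale * s) for s in sd_by_genotype_m],
)

print("H(P) in meters     :", gaussian_entropy(sd_total_m))   # negative value
print("H(P) in centimeters:", gaussian_entropy(scale * sd_total_m))
print("IG in meters       :", ig_m)
print("IG in centimeters  :", ig_cm)
```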
IG can be readily estimated with the above estimator (7). IG standardized (IGS) is set up with the means and standard deviations of IGs obtained from repeated shuffling of the phenotypes while all genotypes remain fixed [14].
[Figure 2: estimated entropy $H_{\langle m \rangle}$ (vertical axis, 0.5 to 2.0) versus number of samples $n$ (horizontal axis, $10^1$ to $10^5$) for $\sigma = 0.7$, 1.0, and 1.4. The offset from the analytic value is marked as $\Delta = 0.183$ for each of the three curves.]

Figure 2: The $n$-independence and constant offset from the true value of the estimates averaged over all possible $m$ values for each $n$. Each symbol represents a result of samplings from $N(0, \sigma^2)$. While varying $n$, the number of samples, the estimated entropy values were averaged over all possible sample-spacing values $m$; $\langle m \rangle$ denotes this averaging, which should not depend on weighting due to the virtually identical standard deviations shown in Figure 1(b). Over a wide range of $n$, the estimated entropy stays effectively the same, showing $n$-independence in the range of practical sample sizes. Moreover, the almost flat line connecting the symbols shifts up or down, following exactly the change of the true value indicated by the horizontal arrows. The rise in the extremely small $n$ range should be within the tolerable error of any specific application because the contribution to the conditional entropy by such a case would be suppressed by weighting based on the marginal probability, which is proportional to $n$.
Let $\mathrm{IG}^{(1)}_i$ denote the maximum IG of the $i$th permuted dataset. Then the mean and standard deviation of $\mathrm{IG}^{(1)}_1, \mathrm{IG}^{(1)}_2, \ldots, \mathrm{IG}^{(1)}_n$ can be computed as follows:

$$\overline{\mathrm{IG}}_p = \frac{\sum_{i=1}^{n} \mathrm{IG}^{(1)}_i}{n}, \qquad S_p = \sqrt{\frac{\sum_{i=1}^{n} \left( \mathrm{IG}^{(1)}_i - \overline{\mathrm{IG}}_p \right)^2}{n - 1}}, \tag{10}$$
where $n$ is the number of permuted datasets. Now IGS is defined as follows:
$$\mathrm{IGS} = \frac{\mathrm{IG} - \overline{\mathrm{IG}}_p}{S_p}. \tag{11}$$
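The standardization in (10) and (11) can be sketched as below. This is an illustrative reading of the procedure rather than the authors' implementation: the function names, the number of permutations, and the toy data are assumptions, the maximum in each permuted dataset is taken over the candidate genotype vectors passed in, and groups with fewer than two samples are skipped when the conditional entropy is estimated.

```python
import math
import random
import statistics

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def entropy(sample):
    """Bias-corrected m-spacing entropy averaged over m = 1..n-1, as in (6)."""
    x = sorted(sample)
    n = len(x)
    total, harmonic = 0.0, 0.0
    for m in range(1, n):
        inner = sum(math.log(n / m * (x[k + m] - x[k])) for k in range(n - m))
        total += inner / (n - m) - (harmonic - EULER_GAMMA) + math.log(m)
        harmonic += 1.0 / m
    return total / (n - 1)


def information_gain(phenotype, genotype):
    """IG = H(P) - H(P | G), with components weighted by n_g / n."""
    groups = {}
    for y, g in zip(phenotype, genotype):
        groups.setdefault(g, []).append(y)
    n = len(phenotype)
    cond = sum(len(ys) / n * entropy(ys) for ys in groups.values() if len(ys) > 1)
    return entropy(phenotype) - cond


def igs(phenotype, genotypes, target, n_perm=20, seed=0):
    """Standardized IG of genotypes[target] against a permutation null."""
    rng = random.Random(seed)
    observed = information_gain(phenotype, genotypes[target])
    shuffled = list(phenotype)
    null_max = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # shuffle phenotypes, keep genotypes fixed
        null_max.append(max(information_gain(shuffled, g) for g in genotypes))
    mean_p = statistics.fmean(null_max)
    s_p = statistics.stdev(null_max)  # uses the n - 1 denominator of (10)
    return (observed - mean_p) / s_p


# Illustrative data: three SNP-like vectors, only the first affects the trait.
rng = random.Random(2)
snps = [[i % 3 for i in range(60)],
        [(i // 3) % 3 for i in range(60)],
        [(i // 9) % 3 for i in range(60)]]
trait = [5.0 * g + rng.gauss(0.0, 1.0) for g in snps[0]]
for index in range(3):
    print("SNP", index, "IGS:", igs(trait, snps, index))
```

Taking the maximum IG over all candidates in each permuted dataset makes the null reference conservative, since an observed IG is compared with the best value that chance alone produced.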
3. Results and Discussions
3.1. Demonstration of the $m$-Spacing Method. To show the plausibility of the proposed $m$-spacing method, a representative result is shown in Figure 3 using a dataset whose quantitative trait was generated from a normal distribution with a single causal SNP pair, simulated as described in the next section. The sample size of the dataset was 400, with 20 SNPs. In panel (a), the association strengths obtained by $m$-spacing and GMDR are plotted as horizontal and vertical coordinates, respectively. Filled triangles represent the main effects, while open circles are for the 2nd order interactions. Both methods identify the same single SNP pair having a prominent interaction, plotted in the upper right corner. One of the SNPs was found to produce a main effect, in contrast to the others. Again, both methods agree on this result. $P$ values obtained by permutation are given in the boxes for those selected points. Association strengths of the 3rd order interactions are plotted with a plus sign. Although no 3rd order interaction is simulated in the dataset, the combinations of SNPs made by adding a single SNP to the causal pair are expected to have high association values. Those points are clustered near the identified causal pair in the upper right corner. In panel (b) of Figure 3, the same comparison was made using the results from $m$-spacing and QMDR. Both comparisons show consistent results between the proposed $m$-spacing method and GMDR or QMDR. Note that IGS, instead of IG, was used. The distribution of the IG values from a dataset shifts upward with increasing order of interaction, because the more conditions are applied, the less entropy is left. In other words, as the order of interaction increases, the conditional entropy $H(P \mid G)$ tends to decrease while $H(P)$ remains the same. Therefore, IGS is vital if one needs to compare the association strengths between genotypes from different orders of interactions. Figure 3 shows that the simulated causal pair has the largest IGS value among all points from different orders of interactions.

[Figure 3, panel (a): balanced accuracy, BA (by GMDR) (vertical axis, 0.5 to 0.7) versus IGS (by $m$-spacing) (horizontal axis, -4 to 8). Panel (b): $t$-statistic (by QMDR) (vertical axis, 0 to 6) versus IGS (by $m$-spacing) (horizontal axis, -4 to 8). Legend in both panels: main effect, 2-order, 3-order. Boxed permutation $P$ values: $P < 0.001$ and $P = 0.003$.]

Figure 3: Comparison of the QMDR, GMDR, and $m$-spacing methods. Association strengths obtained by GMDR versus $m$-spacing (a) and by QMDR versus $m$-spacing (b) are compared for a simulated dataset. All three methods were used to evaluate the main effect as well as 2nd and 3rd order interactions. The dataset was designed to have one 2nd order interaction causal pair.
3.2. Generation of the Simulation Data. To examine the performance of the $m$-spacing method, an extensive set of simulation data was generated. First, three types of quantitative trait distributions were considered: normal, gamma, and a mixture of the two. With a single causal pair designed, 70 different penetrance models based on [21] were incorporated. For the case of a normal distribution, a phenotype value $y$ associated with two interacting SNPs was drawn from a normal distribution defined by the penetrance value $f_{ij}$ tabulated for each possible combination of genotypes $(i, j)$.
Here $f_{ij}$ represents the penetrance values tabulated for every simulated model, which can be found in [21]. It is tabulated for each possible pair of genotypes $(i, j)$. The 70 penetrance models cover 14 combinations of two minor allele frequencies (MAFs) and seven heritability values. Specifically, we considered MAFs of 0.2 and 0.4 and heritability values of 0.01, 0.025, 0.05, 0.1, 0.2, 0.3, and 0.4. Three different values (0.8, 1.0, and 1.2) of the variance $\sigma$ were used independently for the high- and low-risk groups, resulting in 9 combinations. The grouping constraint for the generated event was set such that the averaged $y$ of the high-risk group should be larger than or equal to the overall average. The averaged $y$ of the low-risk group should be less than the overall average. In Figure 4, 9 possible distributions of a generated phenotype are shown. In this example, the sample size is 400. The high- and low-risk groups have the same number of samples, and both have a variance of 1.0. For gamma distributions, phenotype values were drawn from a gamma distribution instead.

[Figure 4: nine panels of phenotype distributions (density on the vertical axis, 0.0 to 0.4), one per genotype pair. Genotype pair: penetrance, number of samples. (0, 0): 0.486, 171. (0, 1): 0.960, 78. (0, 2): 0.538, 11. (1, 0): 0.947, 80. (1, 1): 0.004, 30. (1, 2): 0.811, 6. (2, 0): 0.640, 16. (2, 1): 0.606, 8. (2, 2): 0.909, 0.]

Figure 4: Demonstration of the simulation scheme. Phenotype distributions are plotted for the genotypes of two interacting SNPs, as denoted in the parentheses on top of each plot. SNPs may take values of 0, 1, and 2, or AA, Aa, and aa. For this particular dataset, the MAF was set to 0.200. On the bottom of each plot, the penetrance value for this particular model is given, which is taken from [21]. Inside each plot, the number of samples generated to satisfy the simulation constraint is given. The vertical dotted lines mark the mean values of the high- and low-risk groups. By constraint, the line on the left is for the low-risk group.
The shape and scale parameters $k$ and $\theta$ were determined by $f_{ij}$ and $\sigma$ using the relationships $f_{ij} = k\theta$ and $\sigma = k\theta^2$. Penetrance models were classified by 7 heritability values, 0.01, 0.025, 0.05, 0.1, 0.2, 0.3, and 0.4, resulting in 10 models for each heritability. The generated data files had a sample size of 400 with 20 SNPs. In all, $3 \times 70 \times 9 = 1890$ different conditions were set up, with 100 simulated data files generated for each condition.

[Figure 5: three panels, (a), (b), and (c), each plotting hit ratio (vertical axis, 0.0 to 1.0) versus heritability (horizontal axis, 0.0 to 0.4) for $m$-spacing, QMDR, and GMDR.]

Figure 5: Comparison of the hit ratios, or detection probabilities, among the proposed $m$-spacing method, QMDR, and GMDR. Genomic datasets were generated based on 70 different penetrance functions [21], which were in turn classified into 7 distinct values of heritability. For each model, the phenotype values are simulated with normal (a), gamma (b), and mixed (c) distributions. High- and low-risk groups in a quantitative trait overlapped with 9 different combinations of the standard deviations. Considering all of the above, 100 data files were generated for each case, adding up to 9000 simulated files being examined for each point in the plot.
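The gamma parameterization in Section 3.2 can be inverted in closed form: from $f_{ij} = k\theta$ and $\sigma = k\theta^2$ it follows that $\theta = \sigma / f_{ij}$ and $k = f_{ij}^2 / \sigma$. The sketch below illustrates only this step, with hypothetical function names, and treats $\sigma$ as the variance, as the text does. It does not reproduce the high- and low-risk grouping constraint or any other detail of the authors' simulation pipeline.

```python
import random


def gamma_params(mean, variance):
    """Solve mean = k * theta and variance = k * theta**2 for (k, theta)."""
    if mean <= 0 or variance <= 0:
        raise ValueError("mean and variance must be positive")
    theta = variance / mean        # scale
    k = mean * mean / variance     # shape
    return k, theta


def draw_gamma_phenotypes(mean, variance, size, rng):
    """Draw `size` phenotype values from the gamma distribution defined above."""
    k, theta = gamma_params(mean, variance)
    # random.gammavariate takes (shape, scale) in this order.
    return [rng.gammavariate(k, theta) for _ in range(size)]


# Illustrative use with one penetrance value and the three variance settings.
rng = random.Random(0)
for variance in (0.8, 1.0, 1.2):
    k, theta = gamma_params(0.96, variance)
    values = draw_gamma_phenotypes(0.96, variance, 1000, rng)
    print(variance, k, theta, sum(values) / len(values))
```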
3.3. Comparison of the Detection Probability and Type I Error. The "hit ratio," or detection power, of the IGS was evaluated and compared. The simulated data files described in the previous subsection were used. All of them had a single causal pair to identify. In addition to our proposed $m$-spacing method, QMDR and GMDR were used to compare the results. Figure 5 shows the comparison. Panels (a), (b), and (c) are for quantitative traits with normal, gamma, and mixed distributions, respectively. Seventy penetrance models were grouped into 7 cases of heritability on the horizontal axis, while all 9 combinations of the variances in the high- and low-risk distributions were merged into each heritability case. With a normal distribution, as shown in Figure 5(a), the performance of $m$-spacing was in between those of QMDR and GMDR for higher values of heritability. However, in the range of heritability less than 0.2, $m$-spacing performs best. Note that QMDR shows higher detection probability than GMDR throughout the range. In the case of a gamma distribution, as shown in Figure 5(b), the performance of QMDR drops rapidly as the heritability decreases, while the hit ratios of $m$-spacing and GMDR stay better than that of QMDR and are comparable to each other. Note the switch of the performance ranks of GMDR and QMDR with the change of the phenotype distribution. What QMDR does is essentially a dichotomization of the observed values of the quantitative phenotypes. Therefore, it should do better with well-defined symmetric distributions, such as a normal distribution, than with an asymmetric one (e.g., a gamma distribution). The proposed $m$-spacing method is expected to be effective regardless of the shape of the phenotype distribution because it makes no assumptions regarding the distribution and is therefore nonparametric, as demonstrated in Figures 5(a) and 5(b). This nonparametric property is again confirmed in Figure 5(c), which shows that $m$-spacing outperforms QMDR and GMDR throughout the whole range of heritability in the case of the mixed form of phenotype distribution. Among the three methods examined, $m$-spacing was the most robust, performing consistently within the range of conditions for the simulation.
To estimate the type I error rate, null datasets were generated under the same scheme as used for the detection power analysis, except that no causal pair was included. There are then 20 SNPs, none of whose pairs is expected to have an association. Permutation $P$ values for a particular pair were obtained by permuting each dataset 1000 times. We took the significance level $\alpha$ as 0.05 and computed the ratio of permutation $P$ values smaller than or equal to $\alpha$. We report this ratio as the type I error rate in Table 1, whose accuracy to one decimal place when expressed in percent was ensured by the number of permutations. Table 1 presents the type I error rate for each combination of three trait distributions, two MAFs, and seven heritability values, along with the overall estimates. Throughout these conditions, the type I error rates are gathered tightly around 5%, with a maximum and minimum of 5.4% and 4.3%, respectively. Moreover, there is no sign of dependence on the trait shape, heritability, or MAF. Therefore, our proposed method preserved the type I error rate under these conditions.
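The bookkeeping behind this estimate is small enough to sketch directly. The helper names are hypothetical, and the permutation $P$ value is taken here as the plain fraction of permuted statistics at least as large as the observed one, which is an assumption: the article does not state whether a corrected form such as $(b + 1)/(N + 1)$ was used.

```python
def permutation_p_value(observed, null_statistics):
    """Fraction of permuted statistics that are >= the observed statistic."""
    exceed = sum(1 for s in null_statistics if s >= observed)
    return exceed / len(null_statistics)


def type_i_error_rate(p_values, alpha=0.05):
    """Fraction of null datasets whose permutation P value is <= alpha."""
    rejected = sum(1 for p in p_values if p <= alpha)
    return rejected / len(p_values)


# Illustrative use with made-up numbers.
print(permutation_p_value(2.0, [1.0, 2.0, 3.0, 4.0]))
print(type_i_error_rate([0.01, 0.05, 0.20, 0.90]))
```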
3.4. Application to Real Data. A full-scale real dataset from the Korean Association Resource (KARE) project [20] was analyzed to investigate the effectiveness of the $m$-spacing method. Among the available phenotypes, "height" was chosen, with a sample size of 8,842 from the population-based cohort. The total number of SNPs was 327,872, spanning 22 chromosomes. The "height" phenotype was close to a normal distribution, so the $m$-spacing method may not take advantage of the shape of the phenotype distribution, as discussed in the previous subsection. Table 2 lists the SNPs selected by the $m$-spacing method (IGS) that had the strongest main effects. Out of 10 selected SNPs, rs2079795 and rs6440003 coincide with two previous reports [26, 27], and two more matched SNPs, rs11989122 and rs1344672, were found by our analysis using the same tool as in [26] but with the newly imputed dataset. $P$ values were estimated by permutation of the phenotype values to make null distributions. Permutations were iterated 100,000 and 10,000 times for the main effect and the interaction, respectively. A clear distinction between rs11989122 and the other selected SNPs can be seen in the IGS values. In Table 3, the 2nd order gene-gene interaction result is given. The top selected pair (rs6499786, rs1788421) was found to have the strongest association with "height," but the distinction was not as obvious as in the case of the main effect.

Table 1: Type I error estimation with the significance level $\alpha$ of 0.05.
4. Conclusion
In this paper, we present a modified $m$-spacing method for genome-wide association studies with a quantitative trait. The robustness of this method makes it useful for a wide range of sample sizes, while the original $m$-spacing method yields a reliable result only for datasets with a large sample size. Extensive simulation was performed to produce datasets with different shapes of phenotype distributions while varying the penetrance functions and adjusting the heritability as well. Among the compared methods, the causal pair detection probability of the proposed method was the least affected by the distribution shape and heritability, while GMDR and QMDR showed more dependency. In the simulations, the proposed $m$-spacing method outperformed the others in the lower heritability range regardless of the shape of the trait distribution. In the higher heritability region, the performance of the proposed method is comparable to that of GMDR or QMDR, whichever shows better performance in that region. This should lead to versatile applicability of our nonparametric method for quantitative traits with various characteristics. We applied this method to successfully identify the main effect and gene-gene interactions for the phenotype "height" with the full set of KARE samples. Although several of them overlapped with a previous report, new interactions were also found. Because "height" is presumed to be a trait with a normal distribution and a higher heritability, our method can be said to have performed successfully even without an advantage over other methods. More extensive study is needed for quantitative traits having various characteristics to further demonstrate the expected robustness of our modified $m$-spacing method.
Table 2: Application of the $m$-spacing method to a full set of KARE samples with the phenotype "height": main effect.

Main effect
rs ID        Chromosome   IGS       P value      Previous report
rs11989122   8            11.3892   1 × 10^-5    5.89 × 10^-6 (*)
rs7316119    12           8.7531    1 × 10^-5    -
rs936634     18           8.6125    2 × 10^-5    -
rs7632381    3            7.8235    1 × 10^-5    -
rs2079795    17           7.6542    1 × 10^-5    2.92 × 10^-6, Ref. [26]
rs1344672    3            7.6177    1 × 10^-5    5.21 × 10^-7 (*)
rs2523865    6            7.6044    4 × 10^-5    -
rs3790199    20           7.5362    2 × 10^-5    -
rs6440003    3            7.5231    1 × 10^-5    3.87 × 10^-7, Ref. [27]
rs17628655   19           7.5117    6 × 10^-5    -
(*) Identified using the same method as [26] but with imputed data, which is the same one we analyzed.
Table 3: Application of the $m$-spacing method to a full set of KARE samples with the phenotype "height": 2nd order interaction.

| rs ID | Chromosome | rs ID | Chromosome | IGS | $P$ value |
|---|---|---|---|---|---|
| rs6499786 | 16 | rs1788421 | 21 | 4.6197 | $1 \times 10^{-4}$ |
| rs2529232 | 7 | rs1788421 | 21 | 4.3869 | $1 \times 10^{-4}$ |
| rs2241704 | 19 | rs1788421 | 21 | 4.3855 | $1 \times 10^{-4}$ |
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
Acknowledgment
This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education, Science and Technology (NRF-2013R1A1A2062848, NRF-2012R1A3A2026438).
References
[1] O. Zuk, E. Hechter, S. R. Sunyaev, and E. S. Lander, "The mystery of missing heritability: genetic interactions create phantom heritability," Proceedings of the National Academy of Sciences of the United States of America, vol. 109, no. 4, pp. 1193–1198, 2012.
[2] M. D. Ritchie, L. W. Hahn, N. Roodi et al., "Multifactor-dimensionality reduction reveals high-order interactions among estrogen-metabolism genes in sporadic breast cancer," American Journal of Human Genetics, vol. 69, no. 1, pp. 138–147, 2001.
[3] X.-Y. Lou, G.-B. Chen, L. Yan et al., "A generalized combinatorial approach for detecting gene-by-gene and gene-by-environment interactions with application to nicotine dependence," American Journal of Human Genetics, vol. 80, no. 6, pp. 1125–1137, 2007.
[4] Y. Chung, S. Y. Lee, R. C. Elston, and T. Park, "Odds ratio based multifactor-dimensionality reduction method for detecting gene-gene interactions," Bioinformatics, vol. 23, no. 1, pp. 71–76, 2007.
[5] S. Yeoun Lee, Y. Chung, R. C. Elston, Y. Kim, and T. Park, "Log-linear model-based multifactor dimensionality reduction method to detect gene-gene interactions," Bioinformatics, vol. 23, no. 19, pp. 2589–2595, 2007.
[6] M. L. Calle, V. Urrea, G. Vellalta, N. Malats, and K. V. Steen, "Improving strategies for detecting genetic patterns of disease susceptibility in association studies," Statistics in Medicine, vol. 27, no. 30, pp. 6532–6546, 2008.
[7] W. S. Bush, T. L. Edwards, S. M. Dudek, B. A. McKinney, and M. D. Ritchie, "Alternative contingency table measures improve the power and detection of multifactor dimensionality reduction," BMC Bioinformatics, vol. 9, article 238, 2008.
[8] K. Kim, M.-S. Kwon, S. Oh, and T. Park, "Identification of multiple gene-gene interactions for ordinal phenotypes," BMC Medical Genomics, vol. 6, supplement 2, article S9, 2013.
[9] G. Kang, W. Yue, J. Zhang, Y. Cui, Y. Zuo, and D. Zhang, "An entropy-based approach for testing genetic epistasis underlying complex diseases," Journal of Theoretical Biology, vol. 250, no. 2, pp. 362–374, 2008.
[10] C. Dong, X. Chu, Y. Wang et al., "Exploration of gene-gene interaction effects using entropy-based methods," European Journal of Human Genetics, vol. 16, no. 2, pp. 229–235, 2008.
[11] C. Wu, S. Li, and Y. Cui, "Genetic association studies: an information content perspective," Current Genomics, vol. 13, no. 7, pp. 566–573, 2012.
[12] C. E. Shannon, "A mathematical theory of communication," The Bell System Technical Journal, vol. 27, pp. 379–423, 1948.
[13] R. M. Gray, Entropy and Information Theory, Springer, New York, NY, USA, 2nd edition, 2011.
[14] J. Yee, M.-S. Kwon, T. Park, and M. Park, "A modified entropy-based approach for identifying gene-gene interactions in case-control study," PLoS ONE, vol. 8, no. 7, Article ID e69321, 2013.
[15] M. Li, C. Ye, W. Fu, R. C. Elston, and Q. Lu, "Detecting genetic interactions for quantitative traits with U-statistics," Genetic Epidemiology, vol. 35, no. 6, pp. 457–468, 2011.
[16] J. Gui, J. H. Moore, S. M. Williams et al., "A simple and computationally efficient approach to multifactor dimensionality reduction analysis of gene-gene interactions for quantitative traits," PLoS ONE, vol. 8, no. 6, Article ID e66545, 2013.
[17] L. Paninski, "Estimation of entropy and mutual information," Neural Computation, vol. 15, no. 6, pp. 1191–1253, 2003.
[18] N. N. Schraudolph, "Gradient-based manipulation of non-parametric entropy estimates," IEEE Transactions on Neural Networks, vol. 14, no. 2, pp. 1–10, 2004.
[19] K. Torkkola, "Feature extraction by non-parametric mutual information maximization," Journal of Machine Learning Research, vol. 3, no. 7-8, pp. 1415–1438, 2003.
[20] P. Chanda, L. Sucheston, S. Liu, A. Zhang, and M. Ramanathan, "Information-theoretic gene-gene and gene-environment interaction analysis of quantitative traits," BMC Genomics, vol. 10, article 1471, pp. 509–530, 2009.
[21] D. R. Velez, B. C. White, A. A. Motsinger et al., "A balanced accuracy function for epistasis modeling in imbalanced datasets using multifactor dimensionality reduction," Genetic Epidemiology, vol. 31, no. 4, pp. 306–315, 2007.
[22] J. Beirlant, E. J. Dudewicz, L. Gyorfi, and E. C. van der Meulen, "Nonparametric entropy estimation: an overview," International Journal of Mathematical and Statistical Sciences, vol. 6, no. 1, pp. 17–39, 1997.
[23] A. B. Tsybakov and E. C. van der Meulen, "Root-n consistent estimators of entropy for densities with unbound support," Scandinavian Journal of Statistics, vol. 23, pp. 75–83, 1992.
[24] F. El Haje Hussein and Y. Golubev, "On entropy estimation by m-spacing method," Journal of Mathematical Sciences, vol. 163, pp. 290–309, 2009.
[25] E. G. Learned-Miller and J. W. Fisher III, "ICA using spacings estimates of entropy," Journal of Machine Learning Research, vol. 4, pp. 1271–1295, 2004.
[26] Y. S. Cho, M. J. Go, Y. J. Kim et al., "A large-scale genome-wide association study of Asian populations uncovers genetic factors influencing eight quantitative traits," Nature Genetics, vol. 41, no. 5, pp. 527–534, 2009.
[27] M. N. Weedon, H. Lango, C. M. Lindgren et al., "Genome-wide association analysis identifies 20 loci that influence adult height," Nature Genetics, vol. 40, no. 5, pp. 575–583, 2008.
[28] H. Singh, N. Misra, V. Hnizdo, A. Fedorowicz, and E. Demchuk, "Nearest neighbor estimates of entropy," American Journal of Mathematical and Management Sciences, vol. 23, no. 3-4, pp. 301–321, 2003.
proposed [3, 6]. More recently, a quantitative MDR (QMDR) was proposed that replaces the balanced accuracy metric with a $t$-test statistic [16]. However, these MDR-based approaches may oversimplify the original data to some degree through the classification of phenotypes. An entropy-based approach may well be an alternative. Entropy is commonly used in information theory to measure the uncertainty of random variables [12, 13], and information gain, or mutual information, has been shown to be useful for representing association strengths [17–19]. Although the usefulness of such information-theoretic methods is well known, statistical methods based on this approach for analyzing gene-gene interactions of quantitative traits are rarely found, with the exception of one specific case [20]. The application of that method, however, may be limited because it assumes a normal distribution.
Here, we extend the usefulness of the information concept to quantitative traits by considering nonparametric estimates based on sample-spacing, or $m$-spacing [22–25], for the conditional entropy of a quantitative phenotype given a genotype. The challenge, therefore, is to couple a nonparametric entropy estimator to correct and stable information gains. We thus developed the information gain standardized (IGS) approach and applied it to datasets composed of several genotypes and a quantitative trait. This approach can be considered an extension of previous work on categorical traits [14] to quantitative phenotypes. Unlike other methods such as the variants of MDR, the proposed method does not attempt in any way to classify quantitative phenotypes but instead handles them directly, which provides the intrinsic advantage of removing the chance of misclassification. While a previous entropy-based method for analyzing quantitative traits assumed the trait distribution to be normal [20], our method does not need to specify the distribution to estimate the association; neither regular nor irregular distributions should cause any difficulty. Although this is also an advantage of GMDR or QMDR, we propose a method that takes the advantageous characteristics from both kinds of methods. We also performed extensive simulation studies to compare the power of the proposed method with that of QMDR and GMDR, demonstrating its advantage in detection power.
In the following sections, after a brief review of nonparametric entropy estimation, we describe a new method for modeling genetic interactions. In Materials and Methods, a nonparametric entropy estimator is shown to couple successfully with genetic datasets through our modification. In Results and Discussions, the application of this information gain standardized (IGS) approach is evaluated for both simulated and real datasets.
2. Materials and Methods
2.1. Estimation of the Entropy for a Continuous Variable. If $X$ is a random vector with probability density function $f(x)$, its differential entropy is defined by

$$H(X) = -\int f(x)\,\ln f(x)\,dx. \tag{1}$$

A well-known approach for estimating this quantity is to use plug-in estimates. In this approach, $f(x)$ is first estimated using a standard density estimation method, such as a histogram or kernel density estimator, and the entropy is then computed from the estimated density. Integral, resubstitution, splitting-data, and cross-validation estimates are among the usual plug-in estimates [22]. Another approach is based on sample-spacing. Let $X_k$ ($k = 1, \ldots, n$) be a set of independent and identically distributed real-valued random variables with corresponding order statistics $X_{n,k}$. Here $n$ represents the total number of measured samples. For arbitrary integers $i$ and $m$ satisfying $1 \le i < i + m \le n$, a spacing of order $m$, or $m$-spacing, is defined as $X_{n,i+m} - X_{n,i}$. A density estimate based on the sample-spacing $m$ is then constructed as

$$f_n(x) = \frac{m}{n}\,\frac{1}{X_{n,im} - X_{n,(i-1)m}}, \tag{2}$$

where $x \in [X_{n,(i-1)m}, X_{n,im})$ [14]. This density estimate is consistent if, as $n \to \infty$, $m \to \infty$ and $m/n \to 0$ [22]. Several variations of an entropy estimator with minor differences have been proposed, all based on the above density estimate [23, 24]. Among them, the following was reported to give an approximation with lowered variance [25]:

$$H_{m,n} = \frac{1}{n-m}\sum_{k=1}^{n-m}\ln\!\left(\frac{n}{m}\left(X_{n,k+m} - X_{n,k}\right)\right). \tag{3}$$

The asymptotic bias of this estimator can be corrected by adding terms that include the digamma function $\Gamma'(m)/\Gamma(m)$ [22, 28]:

$$H_{m,n} = \frac{1}{n-m}\sum_{k=1}^{n-m}\ln\!\left(\frac{n}{m}\left(X_{n,k+m} - X_{n,k}\right)\right) - \frac{\Gamma'(m)}{\Gamma(m)} + \ln m. \tag{4}$$

As $m$ increases, the correction terms become negligible and the two estimators coincide. Our evaluation of the entropy $H(P)$ of a quantitative phenotype $P$ is based on this estimator.
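As an illustration only, the estimators in (3) and (4) can be sketched in Python as follows. The function names `m_spacing_entropy` and `digamma_int` are hypothetical and do not come from the original work. The digamma function is evaluated for integer $m$ through the identity $\Gamma'(m)/\Gamma(m) = -\gamma + \sum_{j=1}^{m-1} 1/j$, and the data are assumed to contain no tied values, since a zero spacing would make the logarithm undefined.

```python
import math
import random

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def digamma_int(m):
    """Digamma function Gamma'(m)/Gamma(m) for a positive integer m."""
    return -EULER_GAMMA + sum(1.0 / j for j in range(1, m))


def m_spacing_entropy(samples, m, bias_corrected=True):
    """Entropy estimate (in nats) from m-spacings of the order statistics.

    bias_corrected=False gives estimator (3); True adds the
    -digamma(m) + ln(m) terms of estimator (4).
    Assumes 1 <= m < n and no tied sample values.
    """
    x = sorted(samples)  # order statistics X_{n,1} <= ... <= X_{n,n}
    n = len(x)
    if not 1 <= m < n:
        raise ValueError("m must satisfy 1 <= m < n")
    total = sum(math.log(n / m * (x[k + m] - x[k])) for k in range(n - m))
    estimate = total / (n - m)
    if bias_corrected:
        estimate += math.log(m) - digamma_int(m)
    return estimate


if __name__ == "__main__":
    rng = random.Random(0)
    data = [rng.gauss(0.0, 1.0) for _ in range(5000)]
    m = round(math.sqrt(len(data)))  # the typical choice m = sqrt(n)
    print(m_spacing_entropy(data, m))
```

Because each spacing is multiplied by the same factor when the data are rescaled, the estimate for $cX$ differs from the estimate for $X$ by exactly $\ln c$, which is the scale dependence discussed later for the association measure.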
2.2. Modification of the $m$-Spacing Based Entropy Estimator. The estimator in (4) has both $n$ and $m$ as parameters. In genetic association studies, a sample size $n$ of several hundred is common. However, when the conditional entropy is estimated, a genotype carrying a minor allele may have a much smaller number of corresponding samples. Moreover, the choice of the sample-spacing $m$ affects the resulting entropy estimate. Therefore, an entropy estimation scheme is required that is independent of the number of samples and does not need a particular value of the sample-spacing to be chosen. To illustrate this requirement, an ensemble of 3000 sets of random deviates from $N(0, 1^2)$ was generated for each data point in Figure 1, where the mean and standard deviation of the estimates are plotted for each ensemble. In the left panel of Figure 1, $m$ is fixed to 10 and 20 while $n$ is varied. The analytic formula for the entropy of a normal distribution is as follows [20], where $e$ is Euler's number:

$$H = \ln\left(\sigma\sqrt{2\pi e}\right). \tag{5}$$
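For reference, (5) can be evaluated directly for the standard deviations that appear in Figures 1 and 2. The helper name `normal_entropy` is hypothetical; this is only a small numerical check, not part of the original analysis.

```python
import math


def normal_entropy(sigma):
    """Differential entropy (in nats) of N(mu, sigma^2), equation (5)."""
    return math.log(sigma * math.sqrt(2.0 * math.pi * math.e))


if __name__ == "__main__":
    for sigma in (0.7, 1.0, 1.4):
        print(f"sigma = {sigma}: H = {normal_entropy(sigma):.4f}")
```

For $\sigma = 0.7$, $1.0$, and $1.4$ this gives approximately 1.0623, 1.4189, and 1.7554 nats, so doubling $\sigma$ adds $\ln 2$ to the entropy.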
[Figure 1 plots omitted. Panel (a): estimated entropy $H$ (vertical axis, 1.0 to 1.5) versus the number of samples $n$ (horizontal axis, $10^1$ to $10^5$) for fixed $m = 10$ and $m = 20$, with $\sigma = 1.0$. Panel (b): estimated entropy $H$ (1.0 to 1.5) versus the sample-spacing $m$ (0 to 400) for fixed $n = 400$, with $\sigma = 1.0$.]

Figure 1: The $n$-dependence (a) and $m$-dependence (b) of the entropy estimator $H_{m,n}$. An ensemble of 3000 sets of random samples from $N(0, 1^2)$ was constructed and used for each point in the plot. In (a), the sample-spacing $m$ was fixed while the number of samples $n$ was varied to evaluate the $n$-dependence of the entropy estimator. In (b), $n$ was fixed and $m$ was varied to show the $m$-dependence. Analytically obtained true values are represented by the arrowed horizontal lines.
The value calculated from (5) is marked on the vertical axis by a horizontal arrow with the corresponding $\sigma$ above it. The obvious $n$-dependence of the estimator can be seen in this plot, where the estimate approaches the analytic value as $n$ increases, with $\sqrt{n}$-consistency as expected [24]. In Figure 1(b), $n$ is fixed to 400 while $m$ is varied. In this plot, the estimated entropy again changes in value throughout the possible range of $m$, and the estimated value is always smaller than the analytically calculated value. Therefore, assigning a particular value to $m$, such as the typical choice $\sqrt{n}$ [25], would not be appropriate in this sampling range. Because of these $n$- and $m$-dependences, the estimator in (4) needs to be modified. We therefore modify the entropy estimator in (4) as follows:

$$H_{\langle m\rangle,n} = \frac{1}{n-1}\sum_{m=1}^{n-1}\left(\frac{1}{n-m}\sum_{k=1}^{n-m}\ln\!\left(\frac{n}{m}\left(X_{n,k+m} - X_{n,k}\right)\right) - \frac{\Gamma'(m)}{\Gamma(m)} + \ln m\right). \tag{6}$$
In this modification, the entropy estimator is averaged over the possible $m$ values for each $n$, which is denoted by $\langle m\rangle$. This estimator is used to plot the entropy versus the number of samples in Figure 2. Over a wide range of $n$, this entropy estimator yields very stable values, in contrast to Figure 1(a). The increase in the extremely small $n$ range should be within the tolerable error in a genome-wide association application, because the contribution of such a minor allele to the conditional entropy is suppressed by the weighting factor of the marginal probability, which is proportional to the number of corresponding samples. Analytically obtained entropy values for $N(0, \sigma^2)$ with three different values of $\sigma$ are marked on the vertical axis on the right-hand side. Regardless of the value of $\sigma$, the difference $\Delta$ between the analytically obtained value and the value given by the estimator stays essentially the same. Considering that an association study measures the difference between the entropy and the corresponding conditional entropy, stability is a more critical issue than the absolute value of the estimates. Therefore, compensation for this $\Delta$ is not necessary as long as it is stable. Furthermore, the underestimation of the entropy shown in the plot should have little effect on the association strength. Hence, an entropy estimator has been set up that satisfies practical $n$-independence without the need to find a proper sample-spacing.
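The averaging in (6) can be sketched as below. This is a minimal illustration under the assumption of untied sample values; `averaged_m_spacing_entropy` is a hypothetical name, and the digamma term is accumulated with the recurrence $\psi(m+1) = \psi(m) + 1/m$, where $\psi(m) = \Gamma'(m)/\Gamma(m)$. The direct double sum costs on the order of $n^2$ logarithm evaluations per call.

```python
import math
import random

EULER_GAMMA = 0.5772156649015329  # digamma(1) = -EULER_GAMMA


def averaged_m_spacing_entropy(samples):
    """Estimator (6): bias-corrected m-spacing estimate averaged over m = 1..n-1."""
    x = sorted(samples)
    n = len(x)
    if n < 2:
        raise ValueError("need at least two samples")
    psi = -EULER_GAMMA  # digamma(1)
    total = 0.0
    for m in range(1, n):
        inner = sum(math.log(n / m * (x[k + m] - x[k])) for k in range(n - m))
        total += inner / (n - m) - psi + math.log(m)
        psi += 1.0 / m  # digamma(m + 1) = digamma(m) + 1/m
    return total / (n - 1)


if __name__ == "__main__":
    rng = random.Random(0)
    true_value = math.log(math.sqrt(2.0 * math.pi * math.e))  # sigma = 1
    for n in (50, 200, 800):
        data = [rng.gauss(0.0, 1.0) for _ in range(n)]
        # Offset of the estimate from the analytic value in (5).
        print(n, averaged_m_spacing_entropy(data) - true_value)
```

If the behaviour reported in Figure 2 carries over to this sketch, the printed offsets should be negative and roughly constant across $n$.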
2.3. Evaluation of a Conditional Entropy. Now let $G$ be a categorical variable assigned to each sample measurement $X_k$. $G$ may be a genotype given by a measured SNP or a combination of SNPs, while $X_k$ represents the measured value of a phenotype. For detecting the main effect of a single SNP, $G$ consists of three categories: $G = 0$, $G = 1$, and $G = 2$. For detecting the interaction between $\mathrm{SNP}_i$ and $\mathrm{SNP}_j$, $G$ consists of 9 categories such that $G = 0 = (\mathrm{SNP}_i = 0, \mathrm{SNP}_j = 0)$, $G = 1 = (\mathrm{SNP}_i = 0, \mathrm{SNP}_j = 1)$, $G = 2 = (\mathrm{SNP}_i = 0, \mathrm{SNP}_j = 2)$, $G = 3 = (\mathrm{SNP}_i = 1, \mathrm{SNP}_j = 0)$, ..., and $G = 8 = (\mathrm{SNP}_i = 2, \mathrm{SNP}_j = 2)$. Detection of higher order interactions can be performed in the same way by expanding the categories of $G$. Then an estimator for each specific component of the conditional entropy, $H(P \mid G = g)$, can be constructed using the genotype-selected subset of measurements $X_{n_g,k}$ along with an individual sample-spacing $m_g$. Extending (6) with the above argument readily produces the estimators for the entropy of a phenotype and the conditional entropy. Here $d$ denotes the order of a gene-gene interaction, $n_g$ is the number of samples with genotype $g$, and each component is weighted by the marginal probability $n_g/n$:

$$H(P) = \frac{1}{n-1}\sum_{m=1}^{n-1}\left(\frac{1}{n-m}\sum_{k=1}^{n-m}\ln\!\left(\frac{n}{m}\left(X_{n,k+m} - X_{n,k}\right)\right) - \frac{\Gamma'(m)}{\Gamma(m)} + \ln m\right),$$

$$H(P \mid G) = \sum_{g=0}^{3^d-1}\frac{n_g}{n}\,\frac{1}{n_g-1}\sum_{m_g=1}^{n_g-1}\left(\frac{1}{n_g-m_g}\sum_{k=1}^{n_g-m_g}\ln\!\left(\frac{n_g}{m_g}\left(X_{n_g,k+m_g} - X_{n_g,k}\right)\right) - \frac{\Gamma'(m_g)}{\Gamma(m_g)} + \ln m_g\right). \tag{7}$$
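A minimal sketch of how the conditional entropy could be assembled from per-genotype estimates is given below. It assumes that the weights are the observed genotype fractions $n_g/n$, as described above, and it makes one choice that the text does not specify: genotype groups with fewer than two samples contribute zero, since the averaged estimator is undefined for them. The names `conditional_entropy`, `joint_genotype`, and `averaged_m_spacing_entropy` are hypothetical, and the toy trait in the usage example is invented for illustration.

```python
import math
import random
from collections import defaultdict

EULER_GAMMA = 0.5772156649015329


def averaged_m_spacing_entropy(samples):
    """Estimator (6), averaged over all spacings m = 1..n-1 (no ties assumed)."""
    x = sorted(samples)
    n = len(x)
    psi, total = -EULER_GAMMA, 0.0
    for m in range(1, n):
        inner = sum(math.log(n / m * (x[k + m] - x[k])) for k in range(n - m))
        total += inner / (n - m) - psi + math.log(m)
        psi += 1.0 / m
    return total / (n - 1)


def joint_genotype(snps):
    """Encode d SNP values (each 0, 1, or 2) as one category in 0..3**d - 1."""
    code = 0
    for value in snps:
        code = 3 * code + value
    return code


def conditional_entropy(phenotypes, genotypes):
    """H(P | G): per-genotype entropy estimates weighted by n_g / n."""
    groups = defaultdict(list)
    for y, g in zip(phenotypes, genotypes):
        groups[g].append(y)
    n = len(phenotypes)
    result = 0.0
    for values in groups.values():
        if len(values) >= 2:  # assumption: smaller groups contribute zero
            result += len(values) / n * averaged_m_spacing_entropy(values)
    return result


if __name__ == "__main__":
    rng = random.Random(0)
    pairs = [(rng.randrange(3), rng.randrange(3)) for _ in range(300)]
    g = [joint_genotype(p) for p in pairs]
    # Hypothetical toy trait: the mean shifts only for the (1, 1) genotype.
    y = [rng.gauss(2.0 if p == (1, 1) else 0.0, 1.0) for p in pairs]
    ig = averaged_m_spacing_entropy(y) - conditional_entropy(y, g)
    print("information gain:", ig)
```

The base-3 encoding reproduces the category numbering used above, for example $(0, 1) \mapsto 1$, $(1, 0) \mapsto 3$, and $(2, 2) \mapsto 8$.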
2.4. Standardized Measure of an Association Strength. Differential entropy values are scale-dependent: when the above estimators are calculated with $X_i$ and $cX_i$ (where $c$ is a constant scale factor), the difference is $\ln c$. For example, if the phenotype is height, it may be measured in meters or centimeters; in this case the scale factor is 100. Nevertheless, the association strength should be the same. Also note that a negative value is perfectly legitimate for a differential entropy. Information gain (IG), in the form defined with discrete entropies [14], satisfies scale independence while correctly representing an association strength without being affected by negative values. Therefore, it retains its usefulness as a measure of association strength:

$$\mathrm{IG} = H(P) - H(P \mid G). \tag{9}$$

IG can be readily estimated with the above estimator (7). The standardized IG (IGS) is set up with the mean and standard deviation of the IGs obtained from repeated shuffling of the phenotypes while all genotypes remain fixed [14].
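The scale independence of IG can be checked in a few lines. This short derivation is added for illustration; it assumes $c > 0$ and that the conditional entropy is a weighted sum of per-genotype entropies with weights $n_g/n$ that sum to one.

```latex
\begin{align*}
H(cP) &= H(P) + \ln c, \\
H(cP \mid G) &= \sum_{g} \frac{n_g}{n}\,\bigl(H(P \mid G = g) + \ln c\bigr)
              = H(P \mid G) + \ln c, \\
\mathrm{IG}(cP) &= H(cP) - H(cP \mid G) = H(P) - H(P \mid G) = \mathrm{IG}(P).
\end{align*}
```

The $\ln c$ terms cancel, so measuring height in meters or in centimeters gives the same IG.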
[Figure 2 plot omitted. Estimated entropy $H_{\langle m\rangle,n}$ (vertical axis, 0.5 to 2.0) versus the number of samples $n$ (horizontal axis, $10^1$ to $10^5$) for $\sigma = 0.7$, $1.0$, and $1.4$. The offset from the analytic value is annotated as $\Delta = 0.183$ for each of the three curves.]

Figure 2: The $n$-independence and constant offset from the true value of the estimates averaged over all possible $m$ values for each $n$. Each symbol represents a result of samplings from $N(0, \sigma^2)$. While varying the number of samples $n$, the estimated entropy values were averaged over all possible values of the sample-spacing $m$; $\langle m\rangle$ denotes this averaging, which should not depend on weighting because of the virtually identical standard deviations shown in Figure 1(b). Over a wide range of $n$, the estimated entropy stays effectively the same, showing $n$-independence in the practical range of sample numbers. Moreover, the almost flat line connecting the symbols shifts up or down following exactly the change of the true value indicated by the horizontal arrows. The rise in the extremely small $n$ range should be within the tolerable error of any specific application, because the contribution to the conditional entropy from such a case is suppressed by weighting based on the marginal probability, which is proportional to $n$.
Let $\mathrm{IG}^{(1)}_i$ denote the maximum IG of the $i$th permuted dataset. Then the mean and standard deviation of $\mathrm{IG}^{(1)}_1, \mathrm{IG}^{(1)}_2, \ldots, \mathrm{IG}^{(1)}_n$ can be computed as follows:

$$\overline{\mathrm{IG}}_p = \frac{\sum_{i=1}^{n}\mathrm{IG}^{(1)}_i}{n}, \qquad S_p = \sqrt{\frac{\sum_{i=1}^{n}\left(\mathrm{IG}^{(1)}_i - \overline{\mathrm{IG}}_p\right)^2}{n-1}}, \tag{10}$$

where $n$ here is the number of permuted datasets. Now IGS is defined as follows:

$$\mathrm{IGS} = \frac{\mathrm{IG} - \overline{\mathrm{IG}}_p}{S_p}. \tag{11}$$
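The standardization in (10) and (11) can be sketched as follows. This is an illustrative outline rather than the authors' implementation: the function names are hypothetical, the maximum IG of each permuted dataset is taken over whatever candidate genotype vectors are supplied, untied phenotype values are assumed, and genotype groups with fewer than two samples are skipped.

```python
import math
import random
import statistics
from collections import defaultdict

EULER_GAMMA = 0.5772156649015329


def entropy(samples):
    """Estimator (6): m-spacing entropy averaged over m (no ties assumed)."""
    x = sorted(samples)
    n = len(x)
    psi, total = -EULER_GAMMA, 0.0
    for m in range(1, n):
        inner = sum(math.log(n / m * (x[k + m] - x[k])) for k in range(n - m))
        total += inner / (n - m) - psi + math.log(m)
        psi += 1.0 / m
    return total / (n - 1)


def information_gain(y, g, h_y=None):
    """IG = H(P) - H(P | G), equation (9); groups with < 2 samples are skipped."""
    if h_y is None:
        h_y = entropy(y)
    groups = defaultdict(list)
    for value, code in zip(y, g):
        groups[code].append(value)
    h_cond = sum(len(v) / len(y) * entropy(v) for v in groups.values() if len(v) >= 2)
    return h_y - h_cond


def igs(y, candidates, n_perm=100, seed=0):
    """Standardize each candidate's IG by the permutation null of the maximum IG."""
    rng = random.Random(seed)
    h_y = entropy(y)  # unchanged by shuffling, so computed once
    observed = {name: information_gain(y, g, h_y) for name, g in candidates.items()}
    null_max = []
    shuffled = list(y)
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # phenotypes shuffled, genotypes fixed
        null_max.append(max(information_gain(shuffled, g, h_y) for g in candidates.values()))
    mean_p = statistics.mean(null_max)
    s_p = statistics.stdev(null_max)  # n - 1 denominator, as in (10)
    return {name: (ig - mean_p) / s_p for name, ig in observed.items()}


if __name__ == "__main__":
    rng = random.Random(1)
    causal = [rng.randrange(3) for _ in range(90)]
    noise = [rng.randrange(3) for _ in range(90)]
    y = [rng.gauss(1.5 * g, 1.0) for g in causal]  # hypothetical toy trait
    print(igs(y, {"causal": causal, "noise": noise}))
```

Because shuffling does not change the set of phenotype values, $H(P)$ is computed once and only the conditional entropy is re-estimated for each permutation.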
3. Results and Discussions
3.1. Demonstration of the $m$-Spacing Method. To show the plausibility of the proposed $m$-spacing method, a representative result is shown in Figure 3 using a dataset whose quantitative trait was generated from a normal distribution with a single causal SNP pair, simulated as described in the next section. The sample size of the dataset was 400 with 20 SNPs. In panel (a), the association strengths obtained by $m$-spacing and GMDR are plotted as horizontal and vertical coordinates, respectively. Filled triangles represent the main effects, while open circles represent the 2nd order interactions. Both methods identify the same single SNP pair as having a prominent interaction, plotted in the upper right corner. One of the SNPs was found to produce a main effect, in contrast to the others; again, both methods agree on this result. $P$ values obtained by permutation are given in the boxes for those selected points. Association strengths of the 3rd order interactions are plotted with plus signs. Because no 3rd order interaction was simulated in the dataset, only the combinations of SNPs made by adding a single SNP to the causal pair are expected to have high association values. Those points are clustered near the identified causal pair in the upper right corner. In panel (b) of Figure 3, the same comparison is made using the results from $m$-spacing and QMDR. Both comparisons show consistent results between the proposed $m$-spacing method and GMDR or QMDR.

[Figure 3 plots omitted. Panel (a): balanced accuracy (BA, by GMDR; vertical axis, 0.5 to 0.7) versus IGS (by $m$-spacing; horizontal axis, -4 to 8). Panel (b): $t$-statistic (by QMDR; vertical axis, 0 to 6) versus IGS (by $m$-spacing; horizontal axis, -4 to 8). Symbols distinguish main effects, 2nd order interactions, and 3rd order interactions; the boxed permutation $P$ values are $P < 0.001$ and $P = 0.003$.]

Figure 3: Comparison of the QMDR, GMDR, and $m$-spacing methods. Association strengths obtained by GMDR versus $m$-spacing (a) and by QMDR versus $m$-spacing (b) are compared for a simulated dataset. All three methods were used to evaluate the main effect as well as the 2nd and 3rd order interactions. The dataset was designed to have one 2nd order interaction causal pair.

Note that IGS, instead of IG, was used. The distribution of the IG values from a dataset shifts upward with increasing order of interaction: the more conditions are applied, the less entropy may be left to find. In other words, as the order of interaction increases, the conditional entropy $H(P \mid G)$ tends to decrease while $H(P)$ remains the same. Therefore, IGS is vital if one needs to compare the association strengths between genotypes from different orders of interaction. Figure 3 shows that the simulated causal pair has the largest IGS value among all points from the different orders of interaction.
3.2. Generation of the Simulation Data. To examine the performance of the $m$-spacing method, an extensive set of simulation data was generated. First, three types of quantitative trait distribution were considered: a normal distribution, a gamma distribution, and a mixture of the two. With a single causal pair designed, 70 different penetrance models based on [21] were incorporated. For the case of a normal distribution, a phenotype value $y$ associated with two interacting SNPs was drawn from a normal distribution determined by the penetrance value $f_{ij}$ tabulated for the corresponding combination of genotypes $(i, j)$. The values of $f_{ij}$ for every simulated model can be found in [21]. The 70 penetrance models cover 14 combinations of two minor allele frequencies (MAFs) and seven heritability values: the MAFs were 0.2 and 0.4, and the heritability was 0.01, 0.025, 0.05, 0.1, 0.2, 0.3, and 0.4. Three different values (0.8, 1.0, and 1.2) of the variance $\sigma$ were used independently for the high- and low-risk groups, resulting in 9 combinations. The grouping constraint for the generated events was set such that the average $y$ of the high-risk group is larger than or equal to the overall average and the average $y$ of the low-risk group is less than the overall average. In Figure 4, the 9 possible distributions of a generated phenotype are shown. In this example, the sample size is 400; the high- and low-risk groups have the same number of samples, and both have a variance of 1.0.

[Figure 4 plots omitted: nine phenotype distributions, one per genotype combination. The genotype pair, penetrance value, and number of generated samples shown for each panel are listed below.]

| Genotype (SNP 1, SNP 2) | Penetrance | Number of samples |
|---|---|---|
| (0, 0) | 0.486 | 171 |
| (0, 1) | 0.960 | 78 |
| (0, 2) | 0.538 | 11 |
| (1, 0) | 0.947 | 80 |
| (1, 1) | 0.004 | 30 |
| (1, 2) | 0.811 | 6 |
| (2, 0) | 0.640 | 16 |
| (2, 1) | 0.606 | 8 |
| (2, 2) | 0.909 | 0 |

Figure 4: Demonstration of the simulation scheme. Phenotype distributions are plotted for the genotypes of the two interacting SNPs, as denoted in the parentheses on top of each plot. SNPs may take values of 0, 1, and 2, or AA, Aa, and aa. For this particular dataset, the MAF was set to 0.200. At the bottom of each plot, the penetrance value for this particular model is given, which is taken from [21]. Inside each plot, the number of samples generated to satisfy the simulation constraint is given. The vertical dotted lines mark the mean values of the high- and low-risk groups. By constraint, the line on the left is for the low-risk group.

For gamma distributions, the shape and scale parameters $k$ and $\theta$ were determined by $f_{ij}$ and $\sigma$ using the relationships $f_{ij} = k\theta$ and $\sigma = k\theta^2$. Penetrance models were classified by 7 heritability values (0.01, 0.025, 0.05, 0.1, 0.2, 0.3, and 0.4), resulting in 10 models for each heritability. The generated data files had a sample size of 400 with 20 SNPs. In all, $3 \times 70 \times 9 = 1890$ different conditions were set up, with 100 simulated data files generated for each condition.

[Figure 5 plots omitted. Panels (a), (b), and (c): hit ratio (vertical axis) versus heritability (horizontal axis, 0.0 to 0.4) for $m$-spacing, QMDR, and GMDR.]

Figure 5: Comparison of the hit ratios, or detection probabilities, among the proposed $m$-spacing method, QMDR, and GMDR. Genomic datasets were generated based on 70 different penetrance functions [21], which were in turn classified into 7 distinct values of heritability. For each model, the phenotype values were simulated with normal (a), gamma (b), and mixed (c) distributions. High- and low-risk groups in a quantitative trait overlapped with 9 different combinations of the standard deviations. Considering all of the above, 100 data files were generated for each case, adding up to 9000 simulated files examined for each point in the plot.
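The parameter relationships above can be turned into a small sampling sketch. Several points here are assumptions rather than details confirmed by the text: the normal trait is taken to have mean $f_{ij}$ and variance $\sigma$, genotypes are drawn under Hardy-Weinberg proportions, and the high- and low-risk grouping constraint is not implemented. The function names are hypothetical; the penetrance values are those of the Figure 4 example.

```python
import random

# Penetrance values of the Figure 4 example (MAF 0.2), keyed by (SNP 1, SNP 2).
PENETRANCE = {
    (0, 0): 0.486, (0, 1): 0.960, (0, 2): 0.538,
    (1, 0): 0.947, (1, 1): 0.004, (1, 2): 0.811,
    (2, 0): 0.640, (2, 1): 0.606, (2, 2): 0.909,
}


def gamma_parameters(mean, variance):
    """Solve mean = k * theta and variance = k * theta**2 for (k, theta)."""
    theta = variance / mean
    k = mean / theta
    return k, theta


def draw_genotype(maf, rng):
    """Assumption: Hardy-Weinberg proportions; returns the minor allele count."""
    return sum(rng.random() < maf for _ in range(2))


def draw_phenotype(pair, variance, rng, dist="normal"):
    """Draw one trait value whose mean is the penetrance of the genotype pair."""
    mean = PENETRANCE[pair]
    if dist == "normal":
        return rng.gauss(mean, variance ** 0.5)  # gauss expects a standard deviation
    k, theta = gamma_parameters(mean, variance)
    return rng.gammavariate(k, theta)  # shape k, scale theta


if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(5):
        pair = (draw_genotype(0.2, rng), draw_genotype(0.2, rng))
        print(pair, draw_phenotype(pair, 1.0, rng), draw_phenotype(pair, 1.0, rng, "gamma"))
```

Solving the two relationships gives $\theta = \sigma / f_{ij}$ and $k = f_{ij}^2 / \sigma$, so a small penetrance combined with a variance near 1 produces a strongly skewed gamma distribution.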
3.3. Comparison of the Detection Probability and Type I Error. The "hit ratio," or detection power, of the IGS was evaluated and compared. The simulated data files described in the previous subsection were used; all of them had a single causal pair to identify. In addition to our proposed $m$-spacing method, QMDR and GMDR were applied for comparison. Figure 5 shows the comparison. Panels (a), (b), and (c) are for quantitative traits with normal, gamma, and mixed distributions, respectively. The seventy penetrance models were grouped into 7 cases of heritability on the horizontal axis, while all 9 combinations of the variances in the high- and low-risk distributions were merged into each heritability case. With a normal distribution, as shown in Figure 5(a), the performance of $m$-spacing was in between those of QMDR and GMDR for higher values of heritability. However, in the range of heritability less than 0.2, $m$-spacing performs best. Note that QMDR shows a higher detection probability than GMDR throughout the range. In the case of a gamma distribution, as shown in Figure 5(b), the performance of QMDR drops rapidly as the heritability decreases, whereas the hit ratios of $m$-spacing and GMDR stay better than that of QMDR and are comparable to each other. Note the switch of the performance ranks of GMDR and QMDR with the change of the phenotype distribution. What QMDR does is essentially the dichotomization of the observed values of the quantitative phenotype. Therefore, it should do better with well-defined symmetric distributions, such as a normal distribution, than with an asymmetric one (e.g., a gamma distribution). The proposed $m$-spacing method is expected to be effective regardless of the shape of the phenotype distribution because it makes no assumptions regarding the distribution and is therefore nonparametric, as demonstrated in Figures 5(a) and 5(b). This nonparametric property is again confirmed in Figure 5(c), which shows that $m$-spacing outperforms QMDR and GMDR throughout the whole range of heritability in the case of the mixed form of phenotype distribution. Among the three methods examined, $m$-spacing was the most robust, performing consistently within the range of conditions for the simulation.

To estimate the type I error rate, null datasets were generated under the same scheme as used for the detection power analysis, except that no causal pair was included. These datasets therefore contain 20 SNPs of which no pair is expected to have an association. Permutation $P$ values for a particular pair were obtained by permuting each dataset 1000 times. We took the significance level $\alpha$ as 0.05 and computed the ratio of the permutation $P$ values smaller than or equal to $\alpha$. We report this ratio as the type I error rate in Table 1; its accuracy to one decimal place, when expressed in percent, was ensured by the number of permutations. Table 1 presents the type I error rate for each combination of three trait distributions, two MAFs, and seven heritability values, along with the overall estimates. Throughout these conditions, the type I error rates are gathered tightly around 5%, with a maximum and minimum of 5.4% and 4.3%, respectively. Moreover, there is no sign of dependence on the trait shape, heritability, or MAF. Therefore, our proposed method preserved the type I error rate under these conditions.
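The type I error procedure described above can be outlined as follows. This is a generic sketch, not the authors' code: the helper names are hypothetical, and the $(\text{count} + 1)/(B + 1)$ form of the permutation $P$ value is an assumed convention, since the text does not state which one was used. The toy experiment replaces the IG statistic with a standard normal draw purely to keep the example short.

```python
import random


def permutation_p_value(observed, null_statistics):
    """Share of permuted statistics at least as large as the observed one.

    Assumption: the (count + 1) / (B + 1) convention is used so that the
    P value is never exactly zero.
    """
    exceed = sum(1 for s in null_statistics if s >= observed)
    return (exceed + 1) / (len(null_statistics) + 1)


def type_one_error_rate(p_values, alpha=0.05):
    """Fraction of null datasets whose permutation P value is <= alpha."""
    return sum(1 for p in p_values if p <= alpha) / len(p_values)


if __name__ == "__main__":
    rng = random.Random(0)
    p_values = []
    for _ in range(500):  # 500 toy null datasets
        observed = rng.gauss(0.0, 1.0)
        null = [rng.gauss(0.0, 1.0) for _ in range(1000)]  # 1000 permutations
        p_values.append(permutation_p_value(observed, null))
    print("estimated type I error rate:", type_one_error_rate(p_values))
```

With a well-calibrated test, the printed rate should lie close to the nominal level $\alpha = 0.05$.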
Table 1: Type I error estimation with the significance level $\alpha$ of 0.05.

3.4. Application to Real Data. A full-scale real dataset from the Korean Association Resource (KARE) project [20] was analyzed to investigate the effectiveness of the $m$-spacing method. Among the available phenotypes, "height" was chosen, with a sample size of 8842 from the population-based cohort. The total number of SNPs was 327872, spanning 22 chromosomes. The "height" phenotype was close to a normal distribution, so the $m$-spacing method may not gain an advantage from the shape of the phenotype distribution, as discussed in the previous subsection. Table 2 lists the SNPs selected by the $m$-spacing method (IGS) that had the strongest main effects. Out of the 10 selected SNPs, rs2079795 and rs6440003 coincide with two previous reports [26, 27], and two more matching SNPs, rs11989122 and rs1344672, were found when the analysis was repeated using the same tool as in [26] but with the newly imputed dataset. $P$ values were estimated by permuting the phenotype values to make null distributions. Permutations were iterated 100000 and 10000 times for the main effect and the interaction, respectively. A clear distinction between rs11989122 and the other selected SNPs can be seen in the IGS values. In Table 3, the 2nd order gene-gene interaction result is given. The top selected pair (rs6499786, rs1788421) was found to have the strongest association with "height," but the distinction was not as obvious as in the case of the main effect.
4 Conclusion
In this paper we present a modified 119898-spacing method forgenome-wide association studies with a quantitative traitThe robustness of this method makes it useful for a widerange of sample sizes while the original 119898-spacing methodyields a reliable result only for datasets with a large samplesize Extensive simulation was performed to produce thedatasets with different shapes of phenotype distributionswhile varying the penetrance functions and adjusting theheritability as well Causal pair detection probability wasunaffected the most by the compared methods based on thedistribution shape and heritability while GMDR and QMDRshowed more dependency The proposed 119898-spacing methodis proven to outperform the others regardless of the shape ofthe trait distribution and also the range of lower heritabilityIn the higher heritability region the performance of theproposedmethod is comparable to that of GMDR or QMDRwhichever shows better performance in that region Thiswould lead to versatile applicability of our nonparametricmethod for quantitative traits with various characteristicsWe applied this method to successfully identify the maineffect and gene-gene interactions for the phenotype ldquoheightrdquowith the full set of KARE samples Although several of themoverlapped with a previous report new interactions werealso found Because ldquoheightrdquo is presumed to be a trait with anormal distribution having a higher heritability our methodmay be said to have performed successfullywith no advantageover other methods More extensive study is needed forquantitative traits having various characteristics to furtherdemonstrate the expected robustness of our modified 119898-spacing method
Table 2: Application of the m-spacing method to a full set of KARE samples with the phenotype “height”: main effect.

Main effect
rs ID        Chromosome   IGS       P value      Previous report
rs11989122   8            11.3892   1 × 10^-5    *5.89 × 10^-6
rs7316119    12           8.7531    1 × 10^-5    -
rs936634     18           8.6125    2 × 10^-5    -
rs7632381    3            7.8235    1 × 10^-5    -
rs2079795    17           7.6542    1 × 10^-5    2.92 × 10^-6, Ref [26]
rs1344672    3            7.6177    1 × 10^-5    *5.21 × 10^-7
rs2523865    6            7.6044    4 × 10^-5    -
rs3790199    20           7.5362    2 × 10^-5    -
rs6440003    3            7.5231    1 × 10^-5    3.87 × 10^-7, Ref [27]
rs17628655   19           7.5117    6 × 10^-5    -

*Identified using the same method as [26] but with imputed data, which is the same dataset we analyzed.
Table 3: Application of the m-spacing method to a full set of KARE samples with the phenotype “height”: 2nd order interaction.

2nd order interaction
rs ID        Chromosome   rs ID        Chromosome   IGS      P value
rs6499786    16           rs1788421    21           4.6197   1 × 10^-4
rs2529232    7            rs1788421    21           4.3869   1 × 10^-4
rs2241704    19           rs1788421    21           4.3855   1 × 10^-4
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
Acknowledgment
This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education, Science and Technology (NRF-2013R1A1A2062848, NRF-2012R1A3A2026438).
References
[1] O. Zuk, E. Hechter, S. R. Sunyaev, and E. S. Lander, “The mystery of missing heritability: genetic interactions create phantom heritability,” Proceedings of the National Academy of Sciences of the United States of America, vol. 109, no. 4, pp. 1193–1198, 2012.
[2] M. D. Ritchie, L. W. Hahn, N. Roodi et al., “Multifactor-dimensionality reduction reveals high-order interactions among estrogen-metabolism genes in sporadic breast cancer,” American Journal of Human Genetics, vol. 69, no. 1, pp. 138–147, 2001.
[3] X.-Y. Lou, G.-B. Chen, L. Yan et al., “A generalized combinatorial approach for detecting gene-by-gene and gene-by-environment interactions with application to nicotine dependence,” American Journal of Human Genetics, vol. 80, no. 6, pp. 1125–1137, 2007.
[4] Y. Chung, S. Y. Lee, R. C. Elston, and T. Park, “Odds ratio based multifactor-dimensionality reduction method for detecting gene-gene interactions,” Bioinformatics, vol. 23, no. 1, pp. 71–76, 2007.
[5] S. Yeoun Lee, Y. Chung, R. C. Elston, Y. Kim, and T. Park, “Log-linear model-based multifactor dimensionality reduction method to detect gene-gene interactions,” Bioinformatics, vol. 23, no. 19, pp. 2589–2595, 2007.
[6] M. L. Calle, V. Urrea, G. Vellalta, N. Malats, and K. V. Steen, “Improving strategies for detecting genetic patterns of disease susceptibility in association studies,” Statistics in Medicine, vol. 27, no. 30, pp. 6532–6546, 2008.
[7] W. S. Bush, T. L. Edwards, S. M. Dudek, B. A. McKinney, and M. D. Ritchie, “Alternative contingency table measures improve the power and detection of multifactor dimensionality reduction,” BMC Bioinformatics, vol. 9, article 238, 2008.
[8] K. Kim, M.-S. Kwon, S. Oh, and T. Park, “Identification of multiple gene-gene interactions for ordinal phenotypes,” BMC Medical Genomics, vol. 6, supplement 2, article S9, 2013.
[9] G. Kang, W. Yue, J. Zhang, Y. Cui, Y. Zuo, and D. Zhang, “An entropy-based approach for testing genetic epistasis underlying complex diseases,” Journal of Theoretical Biology, vol. 250, no. 2, pp. 362–374, 2008.
[10] C. Dong, X. Chu, Y. Wang et al., “Exploration of gene-gene interaction effects using entropy-based methods,” European Journal of Human Genetics, vol. 16, no. 2, pp. 229–235, 2008.
[11] C. Wu, S. Li, and Y. Cui, “Genetic association studies: an information content perspective,” Current Genomics, vol. 13, no. 7, pp. 566–573, 2012.
[12] C. E. Shannon, “A mathematical theory of communication,” The Bell System Technical Journal, vol. 27, pp. 379–423, 1948.
[13] R. M. Gray, Entropy and Information Theory, Springer, New York, NY, USA, 2nd edition, 2011.
[14] J. Yee, M.-S. Kwon, T. Park, and M. Park, “A modified entropy-based approach for identifying gene-gene interactions in case-control study,” PLoS ONE, vol. 8, no. 7, Article ID e69321, 2013.
[15] M. Li, C. Ye, W. Fu, R. C. Elston, and Q. Lu, “Detecting genetic interactions for quantitative traits with U-statistics,” Genetic Epidemiology, vol. 35, no. 6, pp. 457–468, 2011.
[16] J. Gui, J. H. Moore, S. M. Williams et al., “A simple and computationally efficient approach to multifactor dimensionality reduction analysis of gene-gene interactions for quantitative traits,” PLoS ONE, vol. 8, no. 6, Article ID e66545, 2013.
[17] L. Paninski, “Estimation of entropy and mutual information,” Neural Computation, vol. 15, no. 6, pp. 1191–1253, 2003.
[18] N. N. Schraudolph, “Gradient-based manipulation of nonparametric entropy estimates,” IEEE Transactions on Neural Networks, vol. 14, no. 2, pp. 1–10, 2004.
[19] K. Torkkola, “Feature extraction by non-parametric mutual information maximization,” Journal of Machine Learning Research, vol. 3, no. 7-8, pp. 1415–1438, 2003.
[20] P. Chanda, L. Sucheston, S. Liu, A. Zhang, and M. Ramanathan, “Information-theoretic gene-gene and gene-environment interaction analysis of quantitative traits,” BMC Genomics, vol. 10, article 1471, pp. 509–530, 2009.
[21] D. R. Velez, B. C. White, A. A. Motsinger et al., “A balanced accuracy function for epistasis modeling in imbalanced datasets using multifactor dimensionality reduction,” Genetic Epidemiology, vol. 31, no. 4, pp. 306–315, 2007.
[22] J. Beirlant, E. J. Dudewicz, L. Gyorfi, and E. C. van der Meulen, “Nonparametric entropy estimation: an overview,” International Journal of Mathematical and Statistical Sciences, vol. 6, no. 1, pp. 17–39, 1997.
[23] A. B. Tsybakov and E. C. van der Meulen, “Root-n consistent estimators of entropy for densities with unbounded support,” Scandinavian Journal of Statistics, vol. 23, pp. 75–83, 1992.
[24] F. El Haje Hussein and Y. Golubev, “On entropy estimation by m-spacing method,” Journal of Mathematical Sciences, vol. 163, pp. 290–309, 2009.
[25] E. G. Learned-Miller and J. W. Fisher III, “ICA using spacings estimates of entropy,” Journal of Machine Learning Research, vol. 4, pp. 1271–1295, 2004.
[26] Y. S. Cho, M. J. Go, Y. J. Kim et al., “A large-scale genome-wide association study of Asian populations uncovers genetic factors influencing eight quantitative traits,” Nature Genetics, vol. 41, no. 5, pp. 527–534, 2009.
[27] M. N. Weedon, H. Lango, C. M. Lindgren et al., “Genome-wide association analysis identifies 20 loci that influence adult height,” Nature Genetics, vol. 40, no. 5, pp. 575–583, 2008.
[28] H. Singh, N. Misra, V. Hnizdo, A. Fedorowicz, and E. Demchuk, “Nearest neighbor estimates of entropy,” American Journal of Mathematical and Management Sciences, vol. 23, no. 3-4, pp. 301–321, 2003.
Figure 1: The $n$-dependence (a) and $m$-dependence (b) of the entropy estimator $H_{m,n}$. An ensemble of 3000 sets of random samples from $N(0, 1^2)$ was constructed and used for each point in the plot. The sample-spacing $m$ was fixed while varying the number of samples $n$ (a) to evaluate the $n$-dependence of the entropy estimator. In (b), $n$ was fixed and $m$ was varied to show the $m$-dependence. Analytically obtained true values are represented by the arrowed horizontal lines.
The calculated value of (5) is marked on the vertical axis by a horizontal arrow with the corresponding $\sigma$ above it. The obvious $n$-dependence of the estimator can be seen in this plot, where the estimate approaches the analytic value as $n$ increases, with $\sqrt{n}$-consistency as expected [24]. In Figure 1(b), $n$ is fixed to 400 while $m$ is varied. In this plot, the estimated entropy again changes in value throughout the possible range of $m$, and the estimated value is always smaller than the analytically calculated value. Therefore, assigning a particular value to $m$, such as $\sqrt{n}$, the typical choice [25], would not be appropriate in this sampling range. Because of these $n$- and $m$-dependences, we modify the entropy estimator in (4) as follows:
$$H_{\langle m\rangle,n} = \frac{1}{n-1}\sum_{m=1}^{n-1}\left(\frac{1}{n-m}\sum_{k=1}^{n-m}\ln\left(\frac{n}{m}\left(X_{n,k+m}-X_{n,k}\right)\right)-\frac{\Gamma'(m)}{\Gamma(m)}+\ln m\right). \tag{6}$$
In this modification, the entropy estimator is averaged over the possible $m$ values for each $n$, which is denoted by $\langle m\rangle$. This estimator is used to plot the entropy versus the number of samples in Figure 2. Over a wide range of $n$, this entropy estimator yields very stable values, in contrast to Figure 1(a). The increase in the extremely small $n$ range should be within the tolerable error in an application to genome-wide association, as the contribution to the conditional entropy by such a minor allele would be suppressed by the weighting factor of the marginal probability, which should be proportional to the number of corresponding samples. Analytically obtained entropy values for $N(0, \sigma^2)$ with three different $\sigma$'s are marked on the vertical axis on the right-hand side. Regardless of the value of $\sigma$, the differences between the analytically obtained value and the values given by the estimator stay essentially the same. Considering that the association study measures the difference between the entropy and the corresponding conditional entropy, the stability should be a more critical issue than the absolute value of the estimates. Therefore, compensation for this difference $\Delta$ would not be necessary as long as it is stable. Furthermore, the underestimation of the entropy shown in the plot should have little effect on the association strength. Hence, an entropy estimator has been set up that should satisfy practical $n$-independence without the need to find a proper sample-spacing.
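The averaged estimator in (6) can be sketched in a few lines of standard-library Python. This is only an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the sample is assumed to contain no tied values (a zero spacing would make the logarithm undefined), and $\Gamma'(m)/\Gamma(m)$ is evaluated for integer $m$ through the harmonic-number identity $\psi(m) = -\gamma + \sum_{j=1}^{m-1} 1/j$.

```python
import math
import random

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def digamma_int(m):
    """Return Gamma'(m)/Gamma(m) for a positive integer m (harmonic-number identity)."""
    return -EULER_GAMMA + sum(1.0 / j for j in range(1, m))


def m_spacing_entropy(sorted_x, m):
    """Entropy estimate for one fixed spacing m: the bracketed term in (6)."""
    n = len(sorted_x)
    log_sum = sum(math.log(n / m * (sorted_x[k + m] - sorted_x[k]))
                  for k in range(n - m))
    return log_sum / (n - m) - digamma_int(m) + math.log(m)


def averaged_entropy(x):
    """Average the fixed-m estimates over m = 1, ..., n - 1, as in (6)."""
    xs = sorted(x)  # assumption: no ties, so every spacing is positive
    n = len(xs)
    return sum(m_spacing_entropy(xs, m) for m in range(1, n)) / (n - 1)


if __name__ == "__main__":
    rng = random.Random(1)
    for sigma in (0.7, 1.0, 1.4):
        sample = [rng.gauss(0.0, sigma) for _ in range(300)]
        # Analytic differential entropy of N(0, sigma^2)
        true_value = 0.5 * math.log(2 * math.pi * math.e * sigma ** 2)
        print(sigma, round(averaged_entropy(sample), 3), round(true_value, 3))
```

The double loop costs on the order of $n^2$ logarithms, which is acceptable for a sketch but would likely need vectorization for genome-wide use.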
2.3. Evaluation of a Conditional Entropy. Now let $G$ be a categorical variable assigned to each sample measurement $X_k$. $G$ may be a genotype given by a measured SNP or a combination of SNPs, while $X_k$ represents the measured value of a phenotype. For detecting the main effect of a single SNP, $G$ consists of three categories: $G = 0$, $G = 1$, and $G = 2$. For detecting the interaction between $\mathrm{SNP}_i$ and $\mathrm{SNP}_j$, $G$ consists of 9 categories such that $G = 0 = (\mathrm{SNP}_i = 0, \mathrm{SNP}_j = 0)$, $G = 1 = (\mathrm{SNP}_i = 0, \mathrm{SNP}_j = 1)$, $G = 2 = (\mathrm{SNP}_i = 0, \mathrm{SNP}_j = 2)$, $G = 3 = (\mathrm{SNP}_i = 1, \mathrm{SNP}_j = 0)$, ..., and $G = 8 = (\mathrm{SNP}_i = 2, \mathrm{SNP}_j = 2)$. Detection of higher order interactions can be performed in the same way by expanding the categories of $G$. Then an estimator for each specific component of the conditional entropy, $H(P \mid G = g)$, can be constructed using the genotype-selected subset of measurements $X_{n_g,k}$ along with an individual sample-spacing $m_g$. Extending (6) while applying the above argument readily produces the estimators for the entropy of a phenotype and the conditional entropy. Here $d$ denotes the order of a gene-gene interaction.
$$H(P) = \frac{1}{n-1}\sum_{m=1}^{n-1}\left(\frac{1}{n-m}\sum_{k=1}^{n-m}\ln\left(\frac{n}{m}\left(X_{n,k+m}-X_{n,k}\right)\right)-\frac{\Gamma'(m)}{\Gamma(m)}+\ln m\right). \tag{7}$$
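One form of the conditional entropy estimator that is consistent with the description above is sketched here. It is offered as an assumption rather than as the authors' exact expression: each genotype class $g$ is given the $m$-averaged estimator on its own $n_g$ samples, and the classes are weighted by the marginal probability $n_g/n$, with $3^d$ classes for an interaction of order $d$.

```latex
H(P \mid G) = \sum_{g=0}^{3^{d}-1} \frac{n_g}{n}\,
  \frac{1}{n_g-1}\sum_{m_g=1}^{n_g-1}
  \left(
    \frac{1}{n_g-m_g}\sum_{k=1}^{n_g-m_g}
      \ln\!\left(\frac{n_g}{m_g}\left(X_{n_g,k+m_g}-X_{n_g,k}\right)\right)
    - \frac{\Gamma'(m_g)}{\Gamma(m_g)} + \ln m_g
  \right)
```

With $d = 1$ the outer sum runs over the three genotypes of a single SNP, and with $d = 2$ over the nine joint genotypes listed above.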
2.4. Standardized Measure of an Association Strength. Since differential entropy values are scale-dependent, when the above estimators are calculated with $X_i$ and $cX_i$ (where $c$ is a constant scale factor), the difference would be $\ln c$. For example, if the phenotype is height, it may be measured in meters or centimeters, in which case the scale factor is 100. Nevertheless, the association strength should be the same. Also note that a negative value is perfectly legitimate for a differential entropy. Information gain (IG), in the form defined with discrete entropies [14], should satisfy scale independence while correctly representing an association strength without being affected by negative values. Therefore, it should retain its usefulness as a measure of an association strength:

$$\mathrm{IG} = H(P) - H(P \mid G). \tag{9}$$

IG can be readily estimated with the above estimator (7). The standardized IG (IGS) is set up with the mean and standard deviation of IGs obtained from repeated shuffling of the phenotypes while all genotypes remain fixed [14].
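The information gain in (9) and its scale independence can be illustrated with a short sketch. The function names are hypothetical, the conditional entropy is taken as the sample-share-weighted sum of per-genotype estimates, and, as a further assumption not taken from the text, genotype classes with fewer than two samples are skipped because the estimator is undefined for them.

```python
import math
import random
from collections import defaultdict

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def averaged_entropy(x):
    """m-spacing entropy averaged over m = 1..n-1; needs n >= 2 and no tied values."""
    xs = sorted(x)
    n = len(xs)
    total = 0.0
    psi = -EULER_GAMMA  # digamma(1)
    for m in range(1, n):
        inner = sum(math.log(n / m * (xs[k + m] - xs[k])) for k in range(n - m))
        total += inner / (n - m) - psi + math.log(m)
        psi += 1.0 / m  # digamma(m + 1) = digamma(m) + 1/m
    return total / (n - 1)


def information_gain(phenotype, genotype):
    """IG = H(P) - H(P | G), with H(P | G) weighted by the class shares n_g / n."""
    groups = defaultdict(list)
    for y, g in zip(phenotype, genotype):
        groups[g].append(y)
    n = len(phenotype)
    # Assumption: classes with fewer than 2 samples contribute nothing.
    h_cond = sum(len(ys) / n * averaged_entropy(ys)
                 for ys in groups.values() if len(ys) >= 2)
    return averaged_entropy(phenotype) - h_cond


if __name__ == "__main__":
    rng = random.Random(0)
    geno = [i % 3 for i in range(90)]
    height_m = [rng.gauss(1.6 + 0.05 * g, 0.06) for g in geno]
    height_cm = [100.0 * y for y in height_m]
    # The two values agree up to rounding: ln(100) cancels in the difference.
    print(information_gain(height_m, geno), information_gain(height_cm, geno))
```

When every class has at least two samples the weights sum to one, so the $\ln c$ shift added to $H(P)$ is removed exactly by the same shift in $H(P \mid G)$.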
Figure 2: The $n$-independence and constant offset from the true value of the estimates averaged over all possible $m$ values for each $n$. Each symbol represents a result of sampling from $N(0, \sigma^2)$. While varying $n$, the number of samples, the estimated entropy values were averaged over all the possible sample-spacing values $m$; $\langle m\rangle$ denotes this averaging, which should not depend on weighting because of the virtually identical standard deviations shown in Figure 1(b). Over a wide range of $n$, the estimated entropy stays effectively the same, showing $n$-independence in the practical range of sample numbers. Moreover, the almost flat line connecting the symbols shifts up or down following exactly the change of the true value indicated by the horizontal arrows. The rise in the extremely small $n$ range should be within the tolerable error of any specific application, because the contribution to the conditional entropy by such a case would be suppressed by weighting based on the marginal probability, which should be proportional to $n$.
Axes: $H_{\langle m\rangle}$ (vertical, 0.5 to 2.0) versus $n$-sample (horizontal, logarithmic, $10^1$ to $10^5$). Curves: $\sigma$ = 0.7, 1.0, and 1.4, each with offset $\Delta$ = 0.183 from the true value.
Let $\mathrm{IG}^{(1)}_i$ denote the maximum IG of the $i$th permuted dataset. Then the mean and standard deviation of $\mathrm{IG}^{(1)}_1, \mathrm{IG}^{(1)}_2, \ldots, \mathrm{IG}^{(1)}_n$ can be computed as follows:

$$\overline{\mathrm{IG}}_p = \frac{\sum_{i=1}^{n}\mathrm{IG}^{(1)}_i}{n}, \qquad S_p = \sqrt{\frac{\sum_{i=1}^{n}\left(\mathrm{IG}^{(1)}_i-\overline{\mathrm{IG}}_p\right)^2}{n-1}}, \tag{10}$$

where $n$ is the number of permuted datasets. Now IGS is defined as follows:

$$\mathrm{IGS} = \frac{\mathrm{IG}-\overline{\mathrm{IG}}_p}{S_p}. \tag{11}$$
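The standardization in (10) and (11) can be sketched as below. The helper names are hypothetical, and `mean_gap` is only a stand-in association statistic so that the block runs on its own; in the method described here the statistic would be the estimated IG.

```python
import random
import statistics


def igs(ig_observed, ig_null):
    """Standardize an observed value with the null mean and SD, as in (10) and (11)."""
    mean_p = statistics.mean(ig_null)
    s_p = statistics.stdev(ig_null)  # sample SD with n - 1 in the denominator
    return (ig_observed - mean_p) / s_p


def max_null_statistics(phenotype, genotype_columns, statistic, n_perm, seed=0):
    """For each phenotype shuffle, keep the maximum statistic over all genotype columns."""
    rng = random.Random(seed)
    shuffled = list(phenotype)
    null_values = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # genotypes stay fixed; only phenotypes are permuted
        null_values.append(max(statistic(shuffled, g) for g in genotype_columns))
    return null_values


def mean_gap(phenotype, genotype):
    """Hypothetical stand-in statistic: |mean(y | g > 0) - mean(y | g == 0)|."""
    carriers = [y for y, g in zip(phenotype, genotype) if g > 0]
    others = [y for y, g in zip(phenotype, genotype) if g == 0]
    return abs(statistics.mean(carriers) - statistics.mean(others))


if __name__ == "__main__":
    rng = random.Random(42)
    snps = [[rng.choice((0, 1, 2)) for _ in range(200)] for _ in range(5)]
    y = [rng.gauss(0.8 * (snps[0][i] > 0), 1.0) for i in range(200)]
    null = max_null_statistics(y, snps, mean_gap, n_perm=200)
    print(igs(mean_gap(y, snps[0]), null))
```

Taking the maximum over all genotype columns within each shuffle follows the definition of $\mathrm{IG}^{(1)}_i$ above, so the resulting score is measured against the strongest chance association in each permuted dataset.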
3. Results and Discussions
3.1. Demonstration of the m-Spacing Method. To show the plausibility of the proposed m-spacing method, a representative result is shown in Figure 3 using a dataset whose quantitative trait was generated from a normal distribution with a single causal SNP pair, simulated as described in the next section. The sample size of the dataset was 400 with 20 SNPs. In panel (a), the association strengths obtained by m-spacing and GMDR are plotted as horizontal and vertical coordinates, respectively. Filled triangles represent the main effects, while open circles are for the 2nd order interactions. Both methods identify the same single SNP pair as having a prominent interaction, plotted in the upper right corner. One of the SNPs was found to produce a main effect, in contrast to the others; again, both methods agree on this result. P values obtained by permutation are given in the boxes for those selected points. Association strengths of the 3rd order interactions are plotted with plus signs. Because no 3rd order interaction was simulated into the dataset, only the combinations of SNPs made by adding a single SNP to the causal pair are expected to have high association values. Those points are clustered near the identified causal pair in the upper right corner. In panel (b) of Figure 3, the same comparison was made using the results from m-spacing and QMDR. Both comparisons show consistent results between the proposed m-spacing method and GMDR or QMDR. Note that IGS was used instead of IG. The distribution of the IG values from a dataset shifts upward as the order of interaction increases: the more conditions are applied, the less entropy may be left to find. In other words, as the order of interaction increases, the conditional entropy $H(P \mid G)$ tends to decrease while $H(P)$ remains the same. Therefore, IGS is vital if one needs to compare the association strengths between genotypes from different orders of interactions. Figure 3 shows that the simulated causal pair has the largest IGS value among all points from different orders of interactions.

Figure 3: Comparison of the QMDR, GMDR, and m-spacing methods. Association strengths obtained by GMDR versus m-spacing (a) and by QMDR versus m-spacing (b) are compared for a simulated dataset. All three methods were used to evaluate the main effect as well as 2nd and 3rd order interactions. The dataset was designed to have one 2nd order interaction causal pair.
Axes: IGS (by m-spacing) on the horizontal axis of both panels; BA (by GMDR) on the vertical axis of (a) and t-statistic (by QMDR) on the vertical axis of (b). Legend: main effect, 2-order, 3-order. Boxed permutation P values: P < 0.001 and P = 0.003.
3.2. Generation of the Simulation Data. To examine the performance of the m-spacing method, an extensive set of simulation data was generated. First, three types of quantitative trait distributions were considered: a normal distribution, a gamma distribution, and a mixture of the two. With a single causal pair designed, 70 different penetrance models based on [21] were incorporated. For the case of a normal distribution, a phenotype value $y$ associated with two interacting SNPs was drawn from a normal distribution defined by the penetrance value $f_{ij}$ tabulated for the corresponding combination of genotypes.
Here $f_{ij}$ represents the penetrance values tabulated for every model simulated, which can be found in [21]. It is tabulated for each possible pair of genotypes $(i, j)$. The 70 penetrance models cover 14 different combinations of two minor allele frequencies (MAFs) and seven heritability values. Specifically, we considered MAFs of 0.2 and 0.4 and heritabilities of 0.01, 0.025, 0.05, 0.1, 0.2, 0.3, and 0.4. Three different values (0.8, 1.0, and 1.2) of the variance $\sigma$ were used independently for the high- and low-risk groups, resulting in 9 combinations. The grouping constraint for the generated events was set such that the average $y$ of the high-risk group should be larger than or equal to the overall average, and the average $y$ of the low-risk group should be less than the overall average. In Figure 4, the 9 possible distributions of a generated phenotype are shown. In this example, the sample size is 400. The high- and low-risk groups have the same number of samples, and both have a variance of 1.0. For gamma distributions, phenotype values were generated as follows.

Figure 4: Demonstration of the simulation scheme. Phenotype distributions are plotted for the genotypes of two interacting SNPs, as denoted in the parentheses on top of each plot. SNPs may take values of 0, 1, and 2, or AA, Aa, and aa. For this particular dataset, the MAF was set to 0.200. At the bottom of each plot, the penetrance value for this particular model, taken from [21], is given. Inside each plot, the number of samples generated to satisfy the simulation constraint is given. The vertical dotted lines mark the mean values of the high- and low-risk groups. By constraint, the line on the left is for the low-risk group.
Panel values (genotype: penetrance, number of samples): (0, 0): 0.486, 171; (0, 1): 0.960, 78; (0, 2): 0.538, 11; (1, 0): 0.947, 80; (1, 1): 0.004, 30; (1, 2): 0.811, 6; (2, 0): 0.640, 16; (2, 1): 0.606, 8; (2, 2): 0.909, 0.
The shape and scale parameters $k$ and $\theta$ were determined by $f_{ij}$ and $\sigma$ using the relationships $f_{ij} = k\theta$ and $\sigma = k\theta^2$. Penetrance models were classified by 7 heritability values, 0.01, 0.025, 0.05, 0.1, 0.2, 0.3, and 0.4, resulting in 10 models for each heritability. The generated data files had a sample size of 400 with 20 SNPs. In all, 3 × 70 × 9 = 1890 different conditions were set up, with 100 simulated data files generated for each condition.

Figure 5: Comparison of the hit ratios, or detection probabilities, among the proposed m-spacing method, QMDR, and GMDR. Genomic datasets were generated based on 70 different penetrance functions [21], which were in turn classified into 7 distinct values of heritability. For each model, the phenotype values were simulated with normal (a), gamma (b), and mixed (c) distributions. High- and low-risk groups in a quantitative trait overlapped with 9 different combinations of the standard deviations. Considering all of the above, 100 data files were generated for each case, adding up to 9000 simulated files being examined for each point in the plot.
Axes: hit ratio (vertical) versus heritability (horizontal, 0.0 to 0.4) in each panel. Curves: m-spacing, QMDR, and GMDR.
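The per-genotype phenotype draw can be sketched as follows. This is a simplified illustration with hypothetical function names: it treats $f_{ij}$ as the mean and $\sigma$ as the variance of the trait (which is what the relations $f_{ij} = k\theta$ and $\sigma = k\theta^2$ imply for a gamma distribution), requires $f_{ij} > 0$ in the gamma case, and leaves out the high- and low-risk grouping constraint and the mixed distribution.

```python
import random


def gamma_params(f_ij, sigma):
    """Solve f_ij = k * theta and sigma = k * theta**2 for (k, theta); needs f_ij > 0."""
    theta = sigma / f_ij
    k = f_ij / theta
    return k, theta


def draw_phenotype(f_ij, sigma, dist, rng):
    """Draw one phenotype value with mean f_ij and variance sigma (an assumption)."""
    if dist == "normal":
        return rng.gauss(f_ij, sigma ** 0.5)  # gauss expects a standard deviation
    if dist == "gamma":
        k, theta = gamma_params(f_ij, sigma)
        return rng.gammavariate(k, theta)  # arguments are (shape, scale)
    raise ValueError("dist must be 'normal' or 'gamma'")


if __name__ == "__main__":
    rng = random.Random(7)
    # Penetrance 0.486 is the (0, 0) entry shown in Figure 4; variance 1.0.
    for dist in ("normal", "gamma"):
        ys = [draw_phenotype(0.486, 1.0, dist, rng) for _ in range(5)]
        print(dist, [round(y, 3) for y in ys])
```

Solving the two relations gives $\theta = \sigma / f_{ij}$ and $k = f_{ij}^2 / \sigma$, so small penetrance values lead to shape parameters well below one and a strongly skewed trait.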
3.3. Comparison of the Detection Probability and Type I Error. The “hit ratio,” or detection power, of the IGS was evaluated and compared. The simulated data files described in the previous subsection were used; all of them had a single causal pair to identify. In addition to our proposed m-spacing method, QMDR and GMDR were used for comparison. Figure 5 shows the comparison. Panels (a), (b), and (c) are for quantitative traits with normal, gamma, and mixed distributions, respectively. The seventy penetrance models were grouped into 7 cases of heritability on the horizontal axis, while all 9 combinations of the variances in the high- and low-risk distributions were merged into each heritability case. With a normal distribution, as shown in Figure 5(a), the performance of m-spacing was between those of QMDR and GMDR for higher values of heritability. However, in the range of heritability less than 0.2, m-spacing performs best. Note that QMDR shows a higher detection probability than GMDR throughout the range. In the case of a gamma distribution, as shown in Figure 5(b), the performance of QMDR drops rapidly as the heritability decreases, whereas the hit ratios of m-spacing and GMDR stay better than that of QMDR and are comparable to each other. Note the switch in the performance ranks of GMDR and QMDR with the change of the phenotype distribution. What QMDR does is essentially a dichotomization of the observed values of the quantitative phenotype. Therefore, it should do better with well-defined symmetric distributions, such as a normal distribution, than with an asymmetric one (e.g., a gamma distribution). The proposed m-spacing method is expected to be effective regardless of the shape of the phenotype distribution because it makes no assumptions regarding the distribution and is therefore nonparametric, as demonstrated in Figures 5(a) and 5(b). This nonparametric character is again confirmed in Figure 5(c), which shows that m-spacing outperforms QMDR and GMDR throughout the whole range of heritability in the case of the mixed form of phenotype distribution. Among the three methods examined, m-spacing was the most robust, performing consistently within the range of conditions for the simulation.
To estimate the type I error rate, null datasets were generated under the same scheme as used for the detection power analysis, except that no causal pair was included. Each dataset thus has 20 SNPs, none of whose pairs is expected to have an association. Permutation P values for a particular pair were obtained by permuting each dataset 1000 times. We took the significance level $\alpha$ as 0.05 and computed the ratio of the permutation P values smaller than or equal to $\alpha$. We report this ratio as the type I error rate in Table 1; its accuracy to one decimal place, when expressed in percent, was ensured by the number of permutations. Table 1 presents the type I error rate for each combination of three trait distributions, two MAFs, and seven heritability values, along with the overall estimates. Throughout these conditions, the type I error rates are gathered tightly around 5%, with a maximum and minimum of 5.4% and 4.3%, respectively. Moreover, there is no sign of dependence on the trait shape, heritability, or MAF. Therefore, our proposed method preserved the type I error rate under these conditions.
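The bookkeeping behind this estimate can be sketched as follows. The function names are hypothetical, and the P value is defined here as the share of permuted statistics at least as large as the observed one, which is an assumption because the text does not spell out the exact convention.

```python
def permutation_p_value(observed, null_statistics):
    """Share of permuted statistics that are at least as large as the observed one."""
    hits = sum(1 for s in null_statistics if s >= observed)
    return hits / len(null_statistics)


def type_i_error_rate(p_values, alpha=0.05):
    """Share of null datasets whose permutation P value is <= alpha."""
    return sum(1 for p in p_values if p <= alpha) / len(p_values)
```

With 1000 permutations per dataset, the P values move in steps of 0.001, that is, 0.1 when expressed in percent, which matches the one-decimal accuracy mentioned above.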
3.4. Application to Real Data. A full-scale real dataset from the Korean Association Resource (KARE) project [20] was analyzed to investigate the effectiveness of the m-spacing method. Among the available phenotypes, “height” was chosen, with a sample size of 8842 from the population-based cohort. The total number of SNPs was 327,872, spanning 22 chromosomes. The “height” phenotype was close to a normal distribution, so the m-spacing method may not take advantage of the shape of the phenotype distribution, as discussed in the previous subsection. Table 2 lists the SNPs selected by the m-spacing method (IGS) that had the strongest main effects. Out of the 10 selected SNPs, rs2079795 and rs6440003 coincide with two previous reports [26, 27]. Two more matched SNPs, rs11989122 and rs1344672, could be found as results of our analysis using the same tool as in [26] but with the newly imputed dataset. P values were estimated by permuting the phenotype values to build null distributions. Permutations were iterated 100,000 and 10,000 times for the main effect and the interaction, respectively. A clear distinction between rs11989122 and the other selected SNPs can be seen in the IGS values. In Table 3, the 2nd order gene-gene interaction result is given. The top selected pair (rs6499786, rs1788421) was found to have the strongest association with “height,” but the distinction was not as obvious as in the case of the main effect.

Table 1: Type I error estimation with the significance level $\alpha$ of 0.05.
[27] M N Weedon H Lango C M Lindgren et al ldquoGenome-wide association analysis identifies 20 loci that influence adultheightrdquo Nature Genetics vol 40 no 5 pp 575ndash583 2008
[28] H Singh NMisra V Hnizdo A Fedorowicz and E DemchukldquoNearest neighbor estimates of entropyrdquo American Journal ofMathematical and Management Sciences vol 23 no 3-4 pp301ndash321 2003
$G = 1 = (\mathrm{SNP}_i = 0, \mathrm{SNP}_j = 1)$, $G = 2 = (\mathrm{SNP}_i = 0, \mathrm{SNP}_j = 2)$, $G = 3 = (\mathrm{SNP}_i = 1, \mathrm{SNP}_j = 0)$, ..., and $G = 8 = (\mathrm{SNP}_i = 2, \mathrm{SNP}_j = 2)$. Detection of higher order interactions can be performed in the same way by expanding the categories of $G$. Then an estimator for each specific component of the conditional entropy, $H(P \mid G = g)$, can be constructed using the genotype-selected subset of measurements $X_{n_g,k}$ along with an individual sample-spacing of $m_g$. Extending (6) while applying the above argument readily produces the estimators for the entropy of a phenotype and the conditional entropy. Here $d$ denotes the order of a gene-gene interaction.

$$H(P) = \frac{1}{n-1} \sum_{m=1}^{n-1} \left( \frac{1}{n-m} \sum_{k=1}^{n-m} \ln\left( \frac{n}{m} \left( X_{n,k+m} - X_{n,k} \right) \right) \right)$$
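As a hedged illustration of the estimator above, the following minimal Python sketch averages the $m$-spacing estimate over all spacings and then applies it within each genotype category. The function names, the base-3 genotype encoding helper, and the weighting of each category by $n_g/n$ are assumptions made for illustration (the weighting follows the remark that contributions are weighted by the marginal probability). Tied phenotype values, which produce zero spacings, are not handled.

```python
import math

def m_spacing_entropy(values):
    """Entropy estimate averaged over all sample spacings m = 1..n-1."""
    x = sorted(values)              # order statistics X_{n,1} <= ... <= X_{n,n}
    n = len(x)
    if n < 2:
        raise ValueError("at least two observations are required")
    total = 0.0
    for m in range(1, n):
        inner = 0.0
        for k in range(n - m):
            # ln((n / m) * (X_{n,k+m} - X_{n,k})); raises on ties (zero spacing)
            inner += math.log(n / m * (x[k + m] - x[k]))
        total += inner / (n - m)
    return total / (n - 1)

def combine_genotypes(snps):
    """Hypothetical helper: map d genotypes in {0, 1, 2} to G in 0..3**d - 1."""
    g = 0
    for s in snps:
        g = 3 * g + s
    return g

def conditional_entropy(phenotype, categories):
    """Assumed form: sum over g of (n_g / n) * H(P | G = g)."""
    n = len(phenotype)
    groups = {}
    for y, g in zip(phenotype, categories):
        groups.setdefault(g, []).append(y)
    h = 0.0
    for ys in groups.values():
        if len(ys) >= 2:            # assumption: categories with < 2 samples are skipped
            h += len(ys) / n * m_spacing_entropy(ys)
    return h
```

With this encoding, `combine_genotypes([1, 0])` gives category 3 and `combine_genotypes([2, 2])` gives category 8, matching the pair labels in the text. The double loop costs on the order of $n^2$ logarithms per call, which is acceptable for a sketch but would need care at genome-wide scale.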
2.4. Standardized Measure of an Association Strength. Since differential entropy values are scale-dependent, when the above estimators are calculated with $X_i$ and $cX_i$ (where $c$ is a constant scale factor), the two results differ by $\ln c$. For example, if the phenotype is height, it may be measured in meters or centimeters; in this case the scale factor is 100. Nevertheless, the association strength should be the same in both cases. Also note that a negative value is perfectly legitimate for a differential entropy. Information gain (IG), in the form defined with discrete entropies [14], should satisfy scale independence while correctly representing an association strength without being affected by negative values. Therefore, it should retain its usefulness as a measure of an association strength.
$$\mathrm{IG} = H(P) - H(P \mid G) \qquad (9)$$
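The scale argument can be made explicit with a short derivation. It is a sketch that assumes $c > 0$ and that the conditional entropy is estimated as a weighted average over genotype categories with weights $w_g$ summing to one (for example $w_g = n_g/n$); the symbols $w_g$ and $\hat{H}$ are introduced here only for illustration. Scaling every measurement by $c$ multiplies every spacing by $c$, so each logarithm in the spacing estimator gains $\ln c$:

```latex
\begin{align*}
\hat{H}(cP) &= \hat{H}(P) + \ln c, \\
\hat{H}(cP \mid G) &= \sum_g w_g \left[ \hat{H}(P \mid G = g) + \ln c \right]
                    = \hat{H}(P \mid G) + \ln c, \\
\mathrm{IG}(cP) &= \hat{H}(cP) - \hat{H}(cP \mid G) = \mathrm{IG}(P).
\end{align*}
```

Under these assumptions the $\ln c$ terms cancel, so a height recorded in centimeters and the same height recorded in meters give the same IG.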
IG can be readily estimated with the above estimator (7). The standardized IG (IGS) is constructed from the mean and standard deviation of the IGs obtained by repeatedly shuffling the phenotypes while all genotypes remain fixed [14].
[Figure 2 plot: estimated entropy $H_{\langle m \rangle}$ (vertical axis, 0.5 to 2.0) versus the number of samples $n$ (horizontal axis, $10^1$ to $10^5$) for $\sigma = 0.7$, $1.0$, and $1.4$; each series is annotated with $\Delta = 0.183$.]

Figure 2: The $n$-independence and constant offset from the true value of the estimates averaged over all possible $m$ values for each $n$. Each symbol represents a result of samplings from $N(0, \sigma^2)$. While varying $n$, the number of samples, the estimated entropy values were averaged over all the possible $m$ sample-spacing values. $\langle m \rangle$ denotes this averaging, which should not depend on weighting because of the virtually identical standard deviations shown in Figure 1(b). Over a wide range of $n$, the estimated entropy stays effectively the same, showing $n$-independence in the range of practical sample sizes. Moreover, the almost flat line connecting each symbol shifts up or down following exactly the change of the true value indicated by the horizontal arrows. The rise in the extremely small $n$ range should be within the tolerable error of any specific application, because the contribution to the conditional entropy by such a case would be suppressed by weighting based on the marginal probability, which should be proportional to $n$.
Let $\mathrm{IG}^{(1)}_i$ denote the maximum IG of the $i$th permuted dataset. Then the mean and standard deviation of $\mathrm{IG}^{(1)}_1, \mathrm{IG}^{(1)}_2, \ldots, \mathrm{IG}^{(1)}_n$ can be computed as follows:

$$\mathrm{IG}_p = \frac{\sum_{i=1}^{n} \mathrm{IG}^{(1)}_i}{n}, \qquad S_p = \sqrt{\frac{\sum_{i=1}^{n} \left( \mathrm{IG}^{(1)}_i - \mathrm{IG}_p \right)^2}{n-1}}, \qquad (10)$$

where $n$ is the number of permuted datasets. Now IGS is defined as follows:

$$\mathrm{IGS} = \frac{\mathrm{IG} - \mathrm{IG}_p}{S_p}. \qquad (11)$$
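The standardization in (10) and (11) can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation: the function names, the default number of permutations, the skipping of genotype categories with fewer than two samples, and the weighting of categories by $n_g/n$ are all assumptions. The entropy helper repeats the averaged spacing estimator so that the block is self-contained.

```python
import math
import random
import statistics

def m_spacing_entropy(values):
    """Entropy estimate averaged over all spacings m = 1..n-1 (no ties allowed)."""
    x = sorted(values)
    n = len(x)
    total = 0.0
    for m in range(1, n):
        inner = sum(math.log(n / m * (x[k + m] - x[k])) for k in range(n - m))
        total += inner / (n - m)
    return total / (n - 1)

def information_gain(phenotype, genotype):
    """IG = H(P) - H(P | G), with H(P | G) weighted by n_g / n (assumed form)."""
    n = len(phenotype)
    groups = {}
    for y, g in zip(phenotype, genotype):
        groups.setdefault(g, []).append(y)
    h_cond = 0.0
    for ys in groups.values():
        if len(ys) >= 2:  # assumption: categories too small to estimate are skipped
            h_cond += len(ys) / n * m_spacing_entropy(ys)
    return m_spacing_entropy(phenotype) - h_cond

def igs(phenotype, genotypes, target, n_perm=200, seed=0):
    """Standardize the IG of genotypes[target] against permuted maxima."""
    rng = random.Random(seed)
    observed = information_gain(phenotype, genotypes[target])
    shuffled = list(phenotype)
    null_max = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # shuffle phenotypes; genotypes stay fixed
        null_max.append(max(information_gain(shuffled, g) for g in genotypes))
    mean_p = statistics.mean(null_max)
    s_p = statistics.stdev(null_max)  # n - 1 denominator, as in (10)
    return (observed - mean_p) / s_p
```

Here `genotypes` is a list of category vectors (one per SNP or SNP combination of the same order), and `target` is the index of the combination being scored. Taking the maximum over all candidates in each permuted dataset mirrors the definition of $\mathrm{IG}^{(1)}_i$.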
3. Results and Discussions

3.1. Demonstration of the $m$-Spacing Method. To show the plausibility of the proposed $m$-spacing method, a representative result is shown in Figure 3, using a dataset whose quantitative trait was generated from a normal distribution with a single causal SNP pair, simulated as described in the next section. The sample size of the dataset was 400 with 20 SNPs. In panel (a), the association strengths obtained by $m$-spacing and GMDR are plotted as horizontal and vertical coordinates, respectively. Filled triangles represent the main effects, while open circles are for the 2nd order interactions. Both methods identify the same single SNP pair having a prominent interaction, plotted in the upper right corner. One of the SNPs was found to produce a main effect, in contrast to the others; again, both methods agree on this result. $P$ values obtained by permutation are given in the boxes for those selected points. Association strengths of the 3rd order interactions are plotted with a plus sign. Because no 3rd order interaction was simulated into the dataset, the 3rd order combinations expected to have high association values are those made by adding a single SNP to the causal pair. Those points are clustered near the identified causal pair in the upper right corner. In panel (b) of Figure 3, the same comparison was made using the results from $m$-spacing and QMDR. Both comparisons show consistent results between the proposed $m$-spacing method and GMDR or QMDR. Note that IGS instead of IG was used. The distribution of the IG values from a dataset shifts toward higher values as the order of interaction increases, because the more conditions are applied, the less entropy is left to find. In other words, as the order of interaction increases, the conditional entropy $H(P \mid G)$ tends to decrease while $H(P)$ remains the same. Therefore, IGS is vital if one needs to compare the association strengths between genotypes from different orders of interactions. Figure 3 shows that the simulated causal pair has the largest IGS value among all points from different orders of interactions.

[Figure 3 plots: (a) BA (by GMDR) versus IGS (by $m$-spacing); (b) $t$-statistic (by QMDR) versus IGS (by $m$-spacing). Legend: main effect, 2-order, 3-order. Boxed annotations: $P < 0.001$ and $P = 0.003$.]

Figure 3: Comparison of the QMDR, GMDR, and $m$-spacing methods. Association strengths obtained by GMDR versus $m$-spacing (a) and by QMDR versus $m$-spacing (b) are compared for a simulated dataset. All three methods were used to evaluate the main effect as well as 2nd and 3rd order interactions. The dataset was designed to have one 2nd order interaction causal pair.
3.2. Generation of the Simulation Data. To examine the performance of the $m$-spacing method, an extensive set of simulation data was generated. First, three types of quantitative trait distributions were considered: normal, gamma, and a mixture of those two types. With a single causal pair designed, 70 different penetrance models based on [21] were incorporated. For the case of a normal distribution, a phenotype value $y$ associated with two interacting SNPs was drawn from a normal distribution defined by the penetrance values tabulated for the possible combinations of genotypes, as follows:
Here $f_{ij}$ represents the penetrance values tabulated for every model simulated, which can be found in [21]. It is tabulated for each possible pair of genotypes $(i, j)$. In the 70 different penetrance models, 14 different combinations of two different minor allele frequencies (MAFs) and seven different heritability values were considered. Specifically, we considered the cases when the MAFs were 0.2 and 0.4 and when the heritability was 0.01, 0.025, 0.05, 0.1, 0.2, 0.3, and 0.4. Three different values (0.8, 1.0, and 1.2) of the variance $\sigma$ were used independently for the high- and low-risk groups, resulting in 9 combinations. The grouping constraint for the generated event was set such that the averaged $y$ of the high-risk group should be larger than or equal to the overall average, and the averaged $y$ of the low-risk group should be less than the overall average. In Figure 4, 9 possible distributions of a generated phenotype are shown. In this example, the sample size is 400. The high- and low-risk groups have the same number of samples, and both have a variance of 1.0.

[Figure 4 plots: nine phenotype histograms, one per genotype pair. Genotype pair: penetrance, number of samples. (0, 0): 0.486, 171; (0, 1): 0.960, 78; (0, 2): 0.538, 11; (1, 0): 0.947, 80; (1, 1): 0.004, 30; (1, 2): 0.811, 6; (2, 0): 0.640, 16; (2, 1): 0.606, 8; (2, 2): 0.909, 0.]

Figure 4: Demonstration of the simulation scheme. Phenotype distributions were plotted to associate with the genotypes by two interacting SNPs, as denoted in the parentheses on top of each plot. SNPs may take values of 0, 1, and 2, or AA, Aa, and aa. For this particular dataset, the MAF was set to 0.200. On the bottom of each plot, the penetrance value for this particular model is given, which is taken from [21]. Inside each plot, the number of samples generated to satisfy the simulation constraint is given. The vertical dotted lines are for the mean values of the high- and low-risk groups. By constraint, the line on the left is for the low-risk group.

For gamma distributions, phenotype values follow the rule below.
The shape and scale parameters $k$ and $\theta$ were determined by $f_{ij}$ and $\sigma$ using the relationships $f_{ij} = k\theta$ and $\sigma = k\theta^2$. Penetrance models were classified by 7 heritability values, 0.01, 0.025, 0.05, 0.1, 0.2, 0.3, and 0.4, resulting in 10 models for each heritability. The generated data files had a sample size of 400 with 20 SNPs. In all, 3 × 70 × 9 = 1890 different conditions were set up, with 100 simulated data files generated for each condition.

[Figure 5 plots: hit ratio (vertical axis) versus heritability (horizontal axis, 0.0 to 0.4) for $m$-spacing, QMDR, and GMDR in panels (a), (b), and (c).]

Figure 5: Comparison of the hit ratios, or the detection probabilities, among the proposed $m$-spacing method, QMDR, and GMDR. Genomic datasets were generated based on 70 different penetrance functions [21], which were in turn classified into 7 distinct values of heritability. For each model, the phenotype values are simulated with normal (a), gamma (b), and mixed (c) distributions. High- and low-risk groups in a quantitative trait overlapped with 9 different combinations of the standard deviations. Considering all of the above, 100 data files were generated for each case, adding up to 9000 simulated files being examined for each point in the plot.
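The relationships $f_{ij} = k\theta$ and $\sigma = k\theta^2$ stated for the gamma case can be inverted to $\theta = \sigma / f_{ij}$ and $k = f_{ij}^2 / \sigma$. The short Python sketch below illustrates this inversion and one way to draw a phenotype value from the resulting distribution. The function names are hypothetical, and drawing $y$ directly from a gamma distribution with these parameters is an assumption, since the sampling rule itself is not reproduced in the text.

```python
import random

def gamma_params(mean, variance):
    """Solve mean = k * theta and variance = k * theta**2 for (k, theta)."""
    theta = variance / mean   # scale parameter
    k = mean / theta          # shape parameter, equal to mean**2 / variance
    return k, theta

def draw_gamma_phenotype(mean, variance, rng):
    """Assumed sampling rule: y ~ Gamma(shape=k, scale=theta)."""
    k, theta = gamma_params(mean, variance)
    # random.gammavariate(alpha, beta) takes the shape first, then the scale
    return rng.gammavariate(k, theta)
```

For example, a mean of 0.5 with a variance of 1.0 gives $k = 0.25$ and $\theta = 2.0$, which satisfies both relationships.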
3.3. Comparison of the Detection Probability and Type I Error. The "hit ratio," or detection power, of the IGS was evaluated and compared. The simulated data files described in the previous subsection were used; all of them had a single causal pair to identify. In addition to our proposed $m$-spacing method, QMDR and GMDR were used to compare the results. Figure 5 shows the comparison. Panels (a), (b), and (c) are for quantitative traits with normal, gamma, and mixed distributions, respectively. The seventy penetrance models were grouped into 7 cases of heritability on the horizontal axis, while all 9 combinations of the variances in the high- and low-risk distributions were merged into each heritability case. With a normal distribution, as shown in Figure 5(a), the performance of $m$-spacing was in between those of QMDR and GMDR for higher values of heritability. However, in the range of heritability less than 0.2, $m$-spacing performs best. Note that QMDR shows a higher detection probability than GMDR throughout the range. In the case of a gamma distribution, as shown in Figure 5(b), the performance of QMDR drops rapidly as the heritability decreases, whereas the hit ratios of $m$-spacing and GMDR stay better than that of QMDR and are comparable to each other. Note the switch of the performance ranks of GMDR and QMDR with the change of the phenotype distribution. What QMDR does is essentially the dichotomization of the observed values of the quantitative phenotypes. Therefore, it should do better with well-defined symmetric distributions, such as a normal distribution, than with an asymmetric one (e.g., a gamma distribution). The proposed $m$-spacing method is expected to be effective regardless of the shape of the phenotype distribution because it makes no assumptions regarding the distribution and is therefore nonparametric, as demonstrated in Figures 5(a) and 5(b). This nonparametric property is again confirmed in Figure 5(c), which shows that $m$-spacing outperforms QMDR and GMDR throughout the whole range of heritability in the case of the mixed form of phenotype distribution. Among the three methods examined, $m$-spacing was the most robust, performing consistently within the range of conditions for the simulation.
To estimate the type I error rate, null datasets were generated under the same scheme as used for the detection power analysis, except that no causal pair was included. Each null dataset thus has 20 SNPs, none of whose pairs is expected to have an association. Permutation $P$ values for a particular pair were obtained by permuting each dataset 1000 times. We took the significance level $\alpha$ as 0.05 and computed the ratio of the permutation $P$ values smaller than or equal to $\alpha$. We report this ratio as the type I error rate in Table 1, whose accuracy to one decimal place when expressed in percent was ensured by the number of permutations. Table 1 presents the type I error rate for each combination of three trait distributions, two MAFs, and seven heritability values, along with the overall estimates. Throughout these conditions, the type I error rates are gathered tightly around 5%, with a maximum and minimum of 5.4% and 4.3%, respectively. Moreover, there is no sign of dependence on the trait shape, heritability, or MAF. Therefore, our proposed method preserved the type I error rate under these conditions.
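The two quantities used in this procedure can be sketched in a few lines of Python. The function names are hypothetical, and the $P$ value convention shown (the fraction of permuted statistics at least as large as the observed one) is an assumption, since the exact convention is not stated in the text.

```python
def permutation_p_value(observed, null_statistics):
    """Fraction of permuted statistics at least as large as the observed one."""
    exceed = sum(1 for s in null_statistics if s >= observed)
    return exceed / len(null_statistics)

def type_one_error_rate(p_values, alpha=0.05):
    """Ratio of null-dataset P values smaller than or equal to alpha."""
    hits = sum(1 for p in p_values if p <= alpha)
    return hits / len(p_values)
```

With 1000 permutations per dataset, the $P$ values computed this way are multiples of 0.001, and a method that preserves the type I error should return a rate close to $\alpha$ when `type_one_error_rate` is applied to the $P$ values from the null datasets.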
Table 1: Type I error estimation with the significance level $\alpha$ of 0.05.

3.4. Application to Real Data. A full-scale real dataset from the Korean Association Resource (KARE) project [20] was analyzed to investigate the effectiveness of the $m$-spacing method. Among the available phenotypes, "height" was chosen, with a sample size of 8842 from the population-based cohort. The total number of SNPs was 327872, spanning 22 chromosomes. The "height" phenotype was close to a normal distribution, so the $m$-spacing method may not gain an advantage from the shape of the phenotype distribution, as discussed in the previous subsection. Table 2 lists the SNPs selected by the $m$-spacing method (IGS) that had the strongest main effects. Out of the 10 selected SNPs, rs2079795 and rs6440003 coincide with two previous reports [26, 27]. Two more matched SNPs, rs11989122 and rs1344672, could be found when we ran the same tool as in [26] on the newly imputed dataset. $P$ values were estimated by permuting the phenotype values to construct null distributions. Permutations were iterated 100000 and 10000 times for the main effect and the interaction, respectively. A clear distinction between rs11989122 and the other selected SNPs can be seen in the IGS values. In Table 3, the 2nd order gene-gene interaction result is given. The top selected pair (rs6499786, rs1788421) was found to have the strongest association with "height," but the distinction was not as obvious as in the case of the main effect.
4. Conclusion

In this paper, we present a modified $m$-spacing method for genome-wide association studies with a quantitative trait. The robustness of this method makes it useful for a wide range of sample sizes, whereas the original $m$-spacing method yields a reliable result only for datasets with a large sample size. Extensive simulation was performed to produce datasets with different shapes of phenotype distributions while varying the penetrance functions and adjusting the heritability as well. Among the compared methods, the causal pair detection probability of the proposed method was the least affected by the distribution shape and heritability, while GMDR and QMDR showed more dependency. The proposed $m$-spacing method was shown to outperform the others in the range of lower heritability, regardless of the shape of the trait distribution. In the higher heritability region, the performance of the proposed method is comparable to that of GMDR or QMDR, whichever shows better performance in that region. This would lead to versatile applicability of our nonparametric method for quantitative traits with various characteristics. We applied this method to successfully identify the main effects and gene-gene interactions for the phenotype "height" with the full set of KARE samples. Although several of them overlapped with previous reports, new interactions were also found. Because "height" is presumed to be a trait with a normal distribution and a higher heritability, our method may be said to have performed successfully with no advantage over other methods. More extensive study is needed for quantitative traits having various characteristics to further demonstrate the expected robustness of our modified $m$-spacing method.
Table 2: Application of the $m$-spacing method to a full set of KARE samples with the phenotype "height": main effect.

rs ID         Chromosome   IGS       P value      Previous report
rs11989122    8            11.3892   1 × 10^-5    *5.89 × 10^-6
rs7316119     12           8.7531    1 × 10^-5    -
rs936634      18           8.6125    2 × 10^-5    -
rs7632381     3            7.8235    1 × 10^-5    -
rs2079795     17           7.6542    1 × 10^-5    2.92 × 10^-6, Ref. [26]
rs1344672     3            7.6177    1 × 10^-5    *5.21 × 10^-7
rs2523865     6            7.6044    4 × 10^-5    -
rs3790199     20           7.5362    2 × 10^-5    -
rs6440003     3            7.5231    1 × 10^-5    3.87 × 10^-7, Ref. [27]
rs17628655    19           7.5117    6 × 10^-5    -

*Identified using the same method as [26] but with imputed data, which is the same one we analyzed.

Table 3: Application of the $m$-spacing method to a full set of KARE samples with the phenotype "height": 2nd order interaction.

rs ID        Chromosome   rs ID        Chromosome   IGS      P value
rs6499786    16           rs1788421    21           4.6197   1 × 10^-4
rs2529232    7            rs1788421    21           4.3869   1 × 10^-4
rs2241704    19           rs1788421    21           4.3855   1 × 10^-4
Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgment

This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education, Science and Technology (NRF-2013R1A1A2062848, NRF-2012R1A3A2026438).
References

[1] O. Zuk, E. Hechter, S. R. Sunyaev, and E. S. Lander, "The mystery of missing heritability: genetic interactions create phantom heritability," Proceedings of the National Academy of Sciences of the United States of America, vol. 109, no. 4, pp. 1193–1198, 2012.
[2] M. D. Ritchie, L. W. Hahn, N. Roodi et al., "Multifactor-dimensionality reduction reveals high-order interactions among estrogen-metabolism genes in sporadic breast cancer," American Journal of Human Genetics, vol. 69, no. 1, pp. 138–147, 2001.
[3] X.-Y. Lou, G.-B. Chen, L. Yan et al., "A generalized combinatorial approach for detecting gene-by-gene and gene-by-environment interactions with application to nicotine dependence," American Journal of Human Genetics, vol. 80, no. 6, pp. 1125–1137, 2007.
[4] Y. Chung, S. Y. Lee, R. C. Elston, and T. Park, "Odds ratio based multifactor-dimensionality reduction method for detecting gene-gene interactions," Bioinformatics, vol. 23, no. 1, pp. 71–76, 2007.
[5] S. Yeoun Lee, Y. Chung, R. C. Elston, Y. Kim, and T. Park, "Log-linear model-based multifactor dimensionality reduction method to detect gene-gene interactions," Bioinformatics, vol. 23, no. 19, pp. 2589–2595, 2007.
[6] M. L. Calle, V. Urrea, G. Vellalta, N. Malats, and K. V. Steen, "Improving strategies for detecting genetic patterns of disease susceptibility in association studies," Statistics in Medicine, vol. 27, no. 30, pp. 6532–6546, 2008.
[7] W. S. Bush, T. L. Edwards, S. M. Dudek, B. A. McKinney, and M. D. Ritchie, "Alternative contingency table measures improve the power and detection of multifactor dimensionality reduction," BMC Bioinformatics, vol. 9, article 238, 2008.
[8] K. Kim, M.-S. Kwon, S. Oh, and T. Park, "Identification of multiple gene-gene interactions for ordinal phenotypes," BMC Medical Genomics, vol. 6, supplement 2, article S9, 2013.
[9] G. Kang, W. Yue, J. Zhang, Y. Cui, Y. Zuo, and D. Zhang, "An entropy-based approach for testing genetic epistasis underlying complex diseases," Journal of Theoretical Biology, vol. 250, no. 2, pp. 362–374, 2008.
[10] C. Dong, X. Chu, Y. Wang et al., "Exploration of gene-gene interaction effects using entropy-based methods," European Journal of Human Genetics, vol. 16, no. 2, pp. 229–235, 2008.
[11] C. Wu, S. Li, and Y. Cui, "Genetic association studies: an information content perspective," Current Genomics, vol. 13, no. 7, pp. 566–573, 2012.
[12] C. E. Shannon, "A mathematical theory of communication," The Bell System Technical Journal, vol. 27, pp. 379–423, 1948.
[13] R. M. Gray, Entropy and Information Theory, Springer, New York, NY, USA, 2nd edition, 2011.
[14] J. Yee, M.-S. Kwon, T. Park, and M. Park, "A modified entropy-based approach for identifying gene-gene interactions in case-control study," PLoS ONE, vol. 8, no. 7, Article ID e69321, 2013.
[15] M. Li, C. Ye, W. Fu, R. C. Elston, and Q. Lu, "Detecting genetic interactions for quantitative traits with U-statistics," Genetic Epidemiology, vol. 35, no. 6, pp. 457–468, 2011.
[16] J. Gui, J. H. Moore, S. M. Williams et al., "A simple and computationally efficient approach to multifactor dimensionality reduction analysis of gene-gene interactions for quantitative traits," PLoS ONE, vol. 8, no. 6, Article ID e66545, 2013.
[17] L. Paninski, "Estimation of entropy and mutual information," Neural Computation, vol. 15, no. 6, pp. 1191–1253, 2003.
[18] N. N. Schraudolph, "Gradient-based manipulation of non-parametric entropy estimates," IEEE Transactions on Neural Networks, vol. 14, no. 2, pp. 1–10, 2004.
[19] K. Torkkola, "Feature extraction by non-parametric mutual information maximization," Journal of Machine Learning Research, vol. 3, no. 7-8, pp. 1415–1438, 2003.
[20] P. Chanda, L. Sucheston, S. Liu, A. Zhang, and M. Ramanathan, "Information-theoretic gene-gene and gene-environment interaction analysis of quantitative traits," BMC Genomics, vol. 10, article 1471, pp. 509–530, 2009.
[21] D. R. Velez, B. C. White, A. A. Motsinger et al., "A balanced accuracy function for epistasis modeling in imbalanced datasets using multifactor dimensionality reduction," Genetic Epidemiology, vol. 31, no. 4, pp. 306–315, 2007.
[22] J. Beirlant, E. J. Dudewicz, L. Gyorfi, and E. C. van der Meulen, "Nonparametric entropy estimation: an overview," International Journal of Mathematical and Statistical Sciences, vol. 6, no. 1, pp. 17–39, 1997.
[23] A. B. Tsybakov and E. C. van der Meulen, "Root-n consistent estimators of entropy for densities with unbound support," Scandinavian Journal of Statistics, vol. 23, pp. 75–83, 1992.
[24] F. El Haje Hussein and Y. Golubev, "On entropy estimation by m-spacing method," Journal of Mathematical Sciences, vol. 163, pp. 290–309, 2009.
[25] E. G. Learned-Miller and J. W. Fisher III, "ICA using spacings estimates of entropy," Journal of Machine Learning Research, vol. 4, pp. 1271–1295, 2004.
[26] Y. S. Cho, M. J. Go, Y. J. Kim et al., "A large-scale genome-wide association study of Asian populations uncovers genetic factors influencing eight quantitative traits," Nature Genetics, vol. 41, no. 5, pp. 527–534, 2009.
[27] M. N. Weedon, H. Lango, C. M. Lindgren et al., "Genome-wide association analysis identifies 20 loci that influence adult height," Nature Genetics, vol. 40, no. 5, pp. 575–583, 2008.
[28] H. Singh, N. Misra, V. Hnizdo, A. Fedorowicz, and E. Demchuk, "Nearest neighbor estimates of entropy," American Journal of Mathematical and Management Sciences, vol. 23, no. 3-4, pp. 301–321, 2003.
Figure 3 Comparison of the QMDR GMDR and119898-spacing methods Association strengths obtained by GMDR versus119898-spacing (a) andby QMDR versus119898-spacing (b) are compared for a simulated dataset All three methods were used to evaluate the main effect as well as 2ndand 3rd order interactions The dataset was designed to have one 2nd order interaction causal pair
a representative result is shown in Figure 3 using adataset whose quantitative trait was generated from anormal distribution with a single causal SNP pair simulatedas described in the next section The sample size of thedataset was 400 with 20 SNPs In panel (a) the associationstrengths obtained by 119898-spacing and GMDR are plottedas horizontal and vertical coordinates respectively Filledtriangles represent the main effects while open circles arefor the 2nd order interactions Both methods identify thesame single SNP pair having a prominent interaction plottedin the upper right corner One of the SNPs was found toproduce the main effect in contrast to others Again theresult is agreed by both methods 119875 values obtained bypermutation are given in the boxes for those selected pointsAssociation strengths of the 3rd order interactions areplotted with a plus sign Because no 3rd order interaction issimulated into the dataset the combinations of SNPs madeby adding a single SNP to the causal pair are expected tohave high association values Those points are clustered nearthe identified causal pair in the upper right corner In panel(b) of Figure 3 the same comparison was made using theresult from 119898-spacing and QMDR Both comparisons showconsistent results between the proposed 119898-spacing methodand GMDR or QMDR Note that IGS instead of IG was usedThe distribution of the IG values from a dataset would shift toa higher direction with increased order of interactionsThusthe more conditions applied the less entropy may be left tofind In other words as the order of interaction increasesthe conditional entropy 119867(119875 | 119866) tends to decrease while
119867(119875) remains the same Therefore IGS is vital if one needsto compare the association strengths between genotypesfrom different orders of interactions Figure 3 shows that thesimulated causal pair has the largest IGS value among allpoints from different orders of interactions
32 Generation of the Simulation Data To examine theperformance of the 119898-spacing method an extensive set ofsimulation data was necessarily generated First three typesof quantitative trait distributions were considered Two ofthem were normal and gamma distributions and anotherone was a mixture of those two types With single causalpair designed 70 different penetrance models based on [21]were incorporated For the case of a normal distribution aphenotype value 119910 associated with two interacting SNPswas selected from a normal distribution as defined bythe penetrance values tabled for possible combinations ofgenotypes associated as follows
Here 119891119894119895 represents the penetrance values tabled for everymodel simulated and can be found in [21] It is tabulatedfor each possible pair of genotypes (119894 119895) In 70 differentpenetrancemodels 14 different combinations of two differentminor allele frequencies (MAFs) and seven different heri-tability values were considered Specifically we consideredthe cases when the MAFs were 02 and 04 and when theheritability was 001 0025 005 01 02 03 and 04 Three
6 BioMed Research International
(0 0)
0486
171(0 1)
0960
78
minus4 minus2 4
(0 2)
0538
11
(1 0)
0947
80
minus3 minus1
(1 1)
0004
30(1 2)
0811
6
(2 0)
0640
16
minus2 minus1
(2 1)
0606
8
minus2 minus1
(2 2)
0909
0
minus2 0 2 4
00
02
04
minus2 0 2
1 2 3
1 2 3
0 24
00
02
04
00
02
04
minus2 0
0 1 2 30
2 4
00
02
04
minus2 0 2 4
00
02
04
00
02
04
minus2 0 2 4
00
02
04
00
02
04
00
02
04
Figure 4 Demonstration of the simulation scheme Phenotype distributions were plotted to associate with the genotypes by two interactingSNPs as denoted in the parentheses on top of each plot SNPs may take values of 0 1 and 2 or AA Aa and aa For this particular dataset theMAF was set to 0200 On the bottom of each plot the penetrance value for this particular model is given which is taken from [21] Insideeach plot the number of samples generated to satisfy the simulation constraint is given The vertical dotted lines are for the mean values ofthe high- and low-risk groups By constraint the line on the left is for the low-risk group
different values (0.8, 1.0, and 1.2) of the variance $\sigma$ were used independently for the high- and low-risk groups, resulting in 9 combinations. The grouping constraint for the generated event was set such that the averaged $y$ of the high-risk group should be larger than or equal to the overall average. The averaged $y$ of the low-risk group should be less than the overall average. In Figure 4, 9 possible distributions of a generated phenotype are shown. In this example, the sample size is 400. The high- and low-risk groups have the same number of samples, and both have a variance of 1.0. For gamma distributions, phenotype values were generated with shape and scale parameters $k$ and $\theta$ determined by $f_{ij}$ and $\sigma$ using the relationships $f_{ij} = k\theta$ and $\sigma = k\theta^{2}$. Penetrance models were classified by 7 heritability values, 0.01, 0.02, 0.05, 0.1, 0.2, 0.3, and 0.4, resulting in 10 models for each heritability. The generated data files had a sample size of 400 with 20 SNPs. In all, 3 × 70 × 9 = 1890 different conditions were set up, with 100 simulated data files generated for each condition.
[Figure 5 plots: panels (a), (b), and (c); horizontal axis: heritability (0.0 to 0.4); vertical axis: hit ratio; curves: $m$-spacing, QMDR, and GMDR.]
Figure 5: Comparison of the hit ratios, or the detection probabilities, among the proposed $m$-spacing method, QMDR, and GMDR. Genomic datasets were generated based on 70 different penetrance functions [21], which were in turn classified into 7 distinct values of heritability. For each model, the phenotype values are simulated with normal (a), gamma (b), and mixed (c) distributions. High- and low-risk groups in a quantitative trait overlapped with 9 different combinations of the standard deviations. Considering all of the above, 100 data files were generated for each case, adding up to 9000 simulated files being examined for each point in the plot.
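The gamma parameterization used in the simulation can be made concrete with a short sketch. It solves the two stated relationships, mean $f_{ij} = k\theta$ and variance $\sigma = k\theta^{2}$, for the shape and scale, then draws values with the Python standard library. The function names and the example mean and variance are hypothetical and are not taken from the authors' simulation code.

```python
import random
import statistics


def gamma_params(mean, variance):
    """Solve mean = k * theta and variance = k * theta**2 for (k, theta)."""
    if mean <= 0 or variance <= 0:
        raise ValueError("mean and variance must be positive")
    theta = variance / mean   # scale
    k = mean / theta          # shape, equal to mean**2 / variance
    return k, theta


def draw_gamma_phenotypes(mean, variance, n, rng):
    """Draw n phenotype values from a gamma distribution with the given moments."""
    k, theta = gamma_params(mean, variance)
    # random.gammavariate(alpha, beta): alpha is the shape, beta is the scale.
    return [rng.gammavariate(k, theta) for _ in range(n)]


# Hypothetical genotype cell: mean 0.5 (an assumed value) and variance 1.0.
k, theta = gamma_params(0.5, 1.0)
sample = draw_gamma_phenotypes(0.5, 1.0, 20000, random.Random(0))
sample_mean = statistics.fmean(sample)

# Number of conditions: 3 distributions x 70 penetrance models x 9 variance pairs.
n_conditions = 3 * 70 * 9
```

With these inputs the solved parameters reproduce the requested mean and variance exactly, and the sample mean should sit close to the requested mean.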
3.3. Comparison of the Detection Probability and Type I Error. The “hit ratio,” or detection power, of the IGS was evaluated and compared. The simulated data files described in the previous subsection were used. All of them had a single causal pair to identify. In addition to our proposed $m$-spacing method, QMDR and GMDR were used to compare the results. Figure 5 shows the comparison. Panels (a), (b), and (c) are for quantitative traits with normal, gamma, and mixed distributions, respectively. Seventy penetrance models were grouped into 7 cases of heritability on the horizontal axis, while all 9 combinations of the variances in the high- and low-risk distributions were merged into each heritability case. With a normal distribution, as shown in Figure 5(a), the performance of $m$-spacing was in between those of QMDR and GMDR for higher values of heritability. However, in the range of heritability less than 0.2, $m$-spacing performs best. Note that QMDR shows a higher detection probability than GMDR throughout the range. In the case of a gamma distribution, as shown in Figure 5(b), the performance of QMDR drops rapidly as the heritability decreases, whereas the hit ratios of $m$-spacing and GMDR stay better than that of QMDR and are comparable to each other. Note the switch in the performance ranks of GMDR and QMDR with the change of the phenotype distribution. What QMDR does is essentially the dichotomization of the observed values of the quantitative phenotypes. Therefore, it should do better with well-defined symmetric distributions, such as a normal distribution, than with an asymmetric one (e.g., a gamma distribution). The proposed $m$-spacing method is expected to be effective regardless of the shape of the phenotype distribution because it makes no assumptions regarding the distribution and is therefore nonparametric, as demonstrated in Figures 5(a) and 5(b). This nonparametric behavior is again confirmed in Figure 5(c), which shows that $m$-spacing outperforms QMDR and GMDR throughout the whole range of heritability in the case of the mixed form of phenotype distribution. Among the three methods examined, $m$-spacing was the most robust, performing consistently within the range of conditions for the simulation.
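As an illustration of how a hit ratio of this kind can be computed, the sketch below counts the fraction of simulated datasets in which the top-scoring SNP pair is the planted causal pair. Treating a "hit" as the top-ranked pair matching the causal pair is an assumption made here for illustration. The score tables are made-up numbers standing in for any per-pair score (IGS or otherwise), and the function names are hypothetical.

```python
from math import comb


def top_pair(scores):
    """Return the SNP pair with the largest score in one dataset."""
    return max(scores, key=scores.get)


def hit_ratio(score_tables, causal_pair):
    """Fraction of datasets whose top-ranked pair is the causal pair."""
    causal = tuple(sorted(causal_pair))
    hits = sum(1 for scores in score_tables if top_pair(scores) == causal)
    return hits / len(score_tables)


# With 20 SNPs there are comb(20, 2) candidate pairs to rank in each dataset.
n_pairs = comb(20, 2)

# Toy score tables (made-up values), keyed by sorted SNP index pairs.
tables = [
    {(0, 1): 0.9, (0, 2): 0.1, (1, 2): 0.2},
    {(0, 1): 0.3, (0, 2): 0.5, (1, 2): 0.1},
    {(0, 1): 0.7, (0, 2): 0.2, (1, 2): 0.6},
    {(0, 1): 0.8, (0, 2): 0.4, (1, 2): 0.3},
]
ratio = hit_ratio(tables, causal_pair=(1, 0))
```

In this toy input the causal pair is ranked first in three of the four tables.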
To estimate the type I error rate, null datasets were generated under the same scheme as used for the detection power analysis, except that no causal pair was included. In these datasets, none of the pairs formed by the 20 SNPs is expected to have an association. Permutation $P$ values for a particular pair were obtained by permuting each dataset 1000 times. We took the significance level $\alpha$ as 0.05 and computed the ratio of the permutation $P$ values smaller than or equal to $\alpha$. We report this ratio as the type I error rate in Table 1; its accuracy to one decimal place, when expressed in percent, was ensured by the number of permutations. Table 1 presents the type I error rate for each combination of three trait distributions, two MAFs, and seven heritability values, along with the overall estimates. Throughout these conditions, the type I error rates are gathered tightly around 5%, with a maximum and minimum of 5.4% and 4.3%, respectively. Moreover, there is no sign of dependence on the trait shape, heritability, or MAF. Therefore, our proposed method preserved the type I error rate under these conditions.
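The permutation procedure described above can be sketched as follows. The test statistic here is a simple absolute difference of group means, used only as a stand-in for the IGS, and the $(b + 1)/(B + 1)$ form of the permutation $P$ value is one common convention that is assumed rather than taken from the paper. All function names are hypothetical.

```python
import random
import statistics


def mean_difference(genotypes, phenotypes):
    """Stand-in statistic: |mean(y | g == 1) - mean(y | g == 0)|."""
    g1 = [y for g, y in zip(genotypes, phenotypes) if g == 1]
    g0 = [y for g, y in zip(genotypes, phenotypes) if g == 0]
    return abs(statistics.fmean(g1) - statistics.fmean(g0))


def permutation_p_value(genotypes, phenotypes, statistic, n_perm, rng):
    """Permute phenotype values and compare each statistic with the observed one."""
    observed = statistic(genotypes, phenotypes)
    shuffled = list(phenotypes)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if statistic(genotypes, shuffled) >= observed:
            exceed += 1
    # Assumed convention: add one to numerator and denominator.
    return (exceed + 1) / (n_perm + 1)


def type_one_error_rate(p_values, alpha=0.05):
    """Fraction of null-dataset P values that are <= alpha."""
    return sum(1 for p in p_values if p <= alpha) / len(p_values)


# Toy data with a strong (non-null) signal, only to exercise the functions.
genotypes = [0] * 10 + [1] * 10
phenotypes = [float(i) for i in range(20)]
p = permutation_p_value(genotypes, phenotypes, mean_difference, 200, random.Random(1))
```

Applied to a collection of null datasets, `type_one_error_rate` returns the proportion that would be reported in a table such as Table 1.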
Table 1: Type I error estimation with the significance level $\alpha$ of 0.05.
3.4. Application to Real Data. A full-scale real dataset from the Korean Association Resource (KARE) project [20] was analyzed to investigate the effectiveness of the $m$-spacing method. Among the available phenotypes, “height” was chosen, with a sample size of 8842 from the population-based cohort. The total number of SNPs was 327872, spanning 22 chromosomes. The “height” phenotype was close to a normal distribution, so the $m$-spacing method may not gain an advantage from the shape of the phenotype distribution, as discussed in the previous subsection. Table 2 lists the SNPs selected by the $m$-spacing method (IGS) that had the strongest main effects. Out of the 10 selected SNPs, rs2079795 and rs6440003 coincide with two previous reports [26, 27], and two more matched SNPs, rs11989122 and rs1344672, could be found as results of our analysis using the same tool as in [26] but using the newly imputed dataset. $P$ values were estimated by permutation of the phenotype values to make null distributions. Permutations were iterated 100000 and 10000 times for the main effect and the interaction, respectively. A clear distinction between rs11989122 and the other selected SNPs can be seen in the IGS values. In Table 3, the 2nd order gene-gene interaction result is given. The top selected pair (rs6499786, rs1788421) was found to have the strongest association with “height,” but the distinction was not as obvious as in the case of the main effect.
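Two pieces of arithmetic behind this analysis can be checked with a few lines: the size of the pairwise search space for 327872 SNPs, and the smallest nonzero $P$ value that a given number of permutations can resolve. The sketch assumes the permutation $P$ value is estimated as a simple proportion of permutations, which the text does not state explicitly; the function name is hypothetical.

```python
from math import comb

N_SNPS = 327872

# Number of distinct SNP pairs in an exhaustive 2nd order scan (about 5.4e10).
n_pairs = comb(N_SNPS, 2)


def smallest_nonzero_p(n_permutations):
    """Resolution of a P value estimated as (count / n_permutations)."""
    return 1.0 / n_permutations


main_effect_floor = smallest_nonzero_p(100_000)   # main-effect permutations
interaction_floor = smallest_nonzero_p(10_000)    # interaction permutations
```

Under that assumption, the resolutions are $10^{-5}$ and $10^{-4}$, which match the smallest $P$ values listed in Tables 2 and 3.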
4. Conclusion
In this paper, we present a modified $m$-spacing method for genome-wide association studies with a quantitative trait. The robustness of this method makes it useful for a wide range of sample sizes, whereas the original $m$-spacing method yields a reliable result only for datasets with a large sample size. Extensive simulation was performed to produce datasets with different shapes of phenotype distributions while varying the penetrance functions and adjusting the heritability as well. Among the compared methods, the causal pair detection probability of the proposed method was the least affected by the distribution shape and heritability, while GMDR and QMDR showed more dependency. The proposed $m$-spacing method was shown to outperform the others in the range of lower heritability, regardless of the shape of the trait distribution. In the higher heritability region, the performance of the proposed method is comparable to that of GMDR or QMDR, whichever shows better performance in that region. This would lead to versatile applicability of our nonparametric method for quantitative traits with various characteristics. We applied this method to successfully identify the main effect and gene-gene interactions for the phenotype “height” with the full set of KARE samples. Although several of them overlapped with a previous report, new interactions were also found. Because “height” is presumed to be a trait with a normal distribution having a higher heritability, our method may be said to have performed successfully with no advantage over other methods. More extensive study is needed for quantitative traits having various characteristics to further demonstrate the expected robustness of our modified $m$-spacing method.
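For orientation, the following sketch shows the classical (Vasicek-type) $m$-spacing estimate of differential entropy and an information gain built from it. It is not the authors' modified estimator and does not reproduce the IGS. The default choice of $m$ near $\sqrt{n}$, the clamping at the sample boundaries, and the skipping of genotype cells with fewer than two samples are all assumptions made for this illustration.

```python
import math


def m_spacing_entropy(values, m=None):
    """Classical m-spacing estimate of differential entropy, in nats."""
    x = sorted(values)
    n = len(x)
    if n < 2:
        raise ValueError("need at least two observations")
    if m is None:
        m = round(math.sqrt(n))      # common heuristic, assumed here
    m = max(1, min(m, n // 2))
    total = 0.0
    for i in range(n):
        lo = x[max(i - m, 0)]        # clamp indices at the boundaries
        hi = x[min(i + m, n - 1)]
        spacing = max(hi - lo, 1e-300)   # guard against ties giving log(0)
        total += math.log(n / (2 * m) * spacing)
    return total / n


def information_gain(phenotypes, genotypes, m=None):
    """H(Y) - sum_g p(g) * H(Y | G = g), with entropies from m-spacing."""
    groups = {}
    for g, y in zip(genotypes, phenotypes):
        groups.setdefault(g, []).append(y)
    n = len(phenotypes)
    conditional = 0.0
    for ys in groups.values():
        if len(ys) >= 2:             # very small cells are skipped in this sketch
            conditional += len(ys) / n * m_spacing_entropy(ys, m)
    return m_spacing_entropy(phenotypes, m) - conditional


# Evenly spaced points on [0, 1]: the true entropy of Uniform(0, 1) is 0.
grid = [i / 999 for i in range(1000)]
h_uniform = m_spacing_entropy(grid, m=32)
```

Because the estimator depends on the data only through spacings, rescaling the phenotype by a constant shifts the estimate by the logarithm of that constant, and a genotype that separates the phenotype into well-separated groups yields a clearly positive information gain.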
Table 2: Application of the $m$-spacing method to a full set of KARE samples with the phenotype “height”: main effect.
| rs ID | Chromosome | IGS | $P$ value | Previous report |
|---|---|---|---|---|
| rs11989122 | 8 | 113892 | $1 \times 10^{-5}$ | *$5.89 \times 10^{-6}$ |
| rs7316119 | 12 | 87531 | $1 \times 10^{-5}$ | - |
| rs936634 | 18 | 86125 | $2 \times 10^{-5}$ | - |
| rs7632381 | 3 | 78235 | $1 \times 10^{-5}$ | - |
| rs2079795 | 17 | 76542 | $1 \times 10^{-5}$ | $2.92 \times 10^{-6}$, Ref [26] |
| rs1344672 | 3 | 76177 | $1 \times 10^{-5}$ | *$5.21 \times 10^{-7}$ |
| rs2523865 | 6 | 76044 | $4 \times 10^{-5}$ | - |
| rs3790199 | 20 | 75362 | $2 \times 10^{-5}$ | - |
| rs6440003 | 3 | 75231 | $1 \times 10^{-5}$ | $3.87 \times 10^{-7}$, Ref [27] |
| rs17628655 | 19 | 75117 | $6 \times 10^{-5}$ | - |
*Identified using the same method as [26] but with imputed data, which is the same dataset we analyzed.
Table 3: Application of the $m$-spacing method to a full set of KARE samples with the phenotype “height”: 2nd order interaction.
| rs ID | Chromosome | rs ID | Chromosome | IGS | $P$ value |
|---|---|---|---|---|---|
| rs6499786 | 16 | rs1788421 | 21 | 46197 | $1 \times 10^{-4}$ |
| rs2529232 | 7 | rs1788421 | 21 | 43869 | $1 \times 10^{-4}$ |
| rs2241704 | 19 | rs1788421 | 21 | 43855 | $1 \times 10^{-4}$ |
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
Acknowledgment
This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education, Science and Technology (NRF-2013R1A1A2062848, NRF-2012R1A3A2026438).
References
[1] O. Zuk, E. Hechter, S. R. Sunyaev, and E. S. Lander, “The mystery of missing heritability: genetic interactions create phantom heritability,” Proceedings of the National Academy of Sciences of the United States of America, vol. 109, no. 4, pp. 1193–1198, 2012.
[2] M. D. Ritchie, L. W. Hahn, N. Roodi et al., “Multifactor-dimensionality reduction reveals high-order interactions among estrogen-metabolism genes in sporadic breast cancer,” American Journal of Human Genetics, vol. 69, no. 1, pp. 138–147, 2001.
[3] X.-Y. Lou, G.-B. Chen, L. Yan et al., “A generalized combinatorial approach for detecting gene-by-gene and gene-by-environment interactions with application to nicotine dependence,” American Journal of Human Genetics, vol. 80, no. 6, pp. 1125–1137, 2007.
[4] Y. Chung, S. Y. Lee, R. C. Elston, and T. Park, “Odds ratio based multifactor-dimensionality reduction method for detecting gene-gene interactions,” Bioinformatics, vol. 23, no. 1, pp. 71–76, 2007.
[5] S. Yeoun Lee, Y. Chung, R. C. Elston, Y. Kim, and T. Park, “Log-linear model-based multifactor dimensionality reduction method to detect gene-gene interactions,” Bioinformatics, vol. 23, no. 19, pp. 2589–2595, 2007.
[6] M. L. Calle, V. Urrea, G. Vellalta, N. Malats, and K. V. Steen, “Improving strategies for detecting genetic patterns of disease susceptibility in association studies,” Statistics in Medicine, vol. 27, no. 30, pp. 6532–6546, 2008.
[7] W. S. Bush, T. L. Edwards, S. M. Dudek, B. A. McKinney, and M. D. Ritchie, “Alternative contingency table measures improve the power and detection of multifactor dimensionality reduction,” BMC Bioinformatics, vol. 9, article 238, 2008.
[8] K. Kim, M.-S. Kwon, S. Oh, and T. Park, “Identification of multiple gene-gene interactions for ordinal phenotypes,” BMC Medical Genomics, vol. 6, supplement 2, article S9, 2013.
[9] G. Kang, W. Yue, J. Zhang, Y. Cui, Y. Zuo, and D. Zhang, “An entropy-based approach for testing genetic epistasis underlying complex diseases,” Journal of Theoretical Biology, vol. 250, no. 2, pp. 362–374, 2008.
[10] C. Dong, X. Chu, Y. Wang et al., “Exploration of gene-gene interaction effects using entropy-based methods,” European Journal of Human Genetics, vol. 16, no. 2, pp. 229–235, 2008.
[11] C. Wu, S. Li, and Y. Cui, “Genetic association studies: an information content perspective,” Current Genomics, vol. 13, no. 7, pp. 566–573, 2012.
[12] C. E. Shannon, “A mathematical theory of communication,” The Bell System Technical Journal, vol. 27, pp. 379–423, 1948.
[13] R. M. Gray, Entropy and Information Theory, Springer, New York, NY, USA, 2nd edition, 2011.
[14] J. Yee, M.-S. Kwon, T. Park, and M. Park, “A modified entropy-based approach for identifying gene-gene interactions in case-control study,” PLoS ONE, vol. 8, no. 7, Article ID e69321, 2013.
[15] M. Li, C. Ye, W. Fu, R. C. Elston, and Q. Lu, “Detecting genetic interactions for quantitative traits with U-statistics,” Genetic Epidemiology, vol. 35, no. 6, pp. 457–468, 2011.
[16] J. Gui, J. H. Moore, S. M. Williams et al., “A simple and computationally efficient approach to multifactor dimensionality reduction analysis of gene-gene interactions for quantitative traits,” PLoS ONE, vol. 8, no. 6, Article ID e66545, 2013.
[17] L. Paninski, “Estimation of entropy and mutual information,” Neural Computation, vol. 15, no. 6, pp. 1191–1253, 2003.
[18] N. N. Schraudolph, “Gradient-based manipulation of nonparametric entropy estimates,” IEEE Transactions on Neural Networks, vol. 14, no. 2, pp. 1–10, 2004.
[19] K. Torkkola, “Feature extraction by non-parametric mutual information maximization,” Journal of Machine Learning Research, vol. 3, no. 7-8, pp. 1415–1438, 2003.
[20] P. Chanda, L. Sucheston, S. Liu, A. Zhang, and M. Ramanathan, “Information-theoretic gene-gene and gene-environment interaction analysis of quantitative traits,” BMC Genomics, vol. 10, article 1471, pp. 509–530, 2009.
[21] D. R. Velez, B. C. White, A. A. Motsinger et al., “A balanced accuracy function for epistasis modeling in imbalanced datasets using multifactor dimensionality reduction,” Genetic Epidemiology, vol. 31, no. 4, pp. 306–315, 2007.
[22] J. Beirlant, E. J. Dudewicz, L. Gyorfi, and E. C. van der Meulen, “Nonparametric entropy estimation: an overview,” International Journal of Mathematical and Statistical Sciences, vol. 6, no. 1, pp. 17–39, 1997.
[23] A. B. Tsybakov and E. C. van der Meulen, “Root-n consistent estimators of entropy for densities with unbound support,” Scandinavian Journal of Statistics, vol. 23, pp. 75–83, 1992.
[24] F. El Haje Hussein and Y. Golubev, “On entropy estimation by $m$-spacing method,” Journal of Mathematical Sciences, vol. 163, pp. 290–309, 2009.
[25] E. G. Learned-Miller and J. W. Fisher III, “ICA using spacings estimates of entropy,” Journal of Machine Learning Research, vol. 4, pp. 1271–1295, 2004.
[26] Y. S. Cho, M. J. Go, Y. J. Kim et al., “A large-scale genome-wide association study of Asian populations uncovers genetic factors influencing eight quantitative traits,” Nature Genetics, vol. 41, no. 5, pp. 527–534, 2009.
[27] M. N. Weedon, H. Lango, C. M. Lindgren et al., “Genome-wide association analysis identifies 20 loci that influence adult height,” Nature Genetics, vol. 40, no. 5, pp. 575–583, 2008.
[28] H. Singh, N. Misra, V. Hnizdo, A. Fedorowicz, and E. Demchuk, “Nearest neighbor estimates of entropy,” American Journal of Mathematical and Management Sciences, vol. 23, no. 3-4, pp. 301–321, 2003.
Figure 4 Demonstration of the simulation scheme Phenotype distributions were plotted to associate with the genotypes by two interactingSNPs as denoted in the parentheses on top of each plot SNPs may take values of 0 1 and 2 or AA Aa and aa For this particular dataset theMAF was set to 0200 On the bottom of each plot the penetrance value for this particular model is given which is taken from [21] Insideeach plot the number of samples generated to satisfy the simulation constraint is given The vertical dotted lines are for the mean values ofthe high- and low-risk groups By constraint the line on the left is for the low-risk group
different values (08 10 and 12) of the variance 120590 were usedindependently for the high- and low-risk groups resulting in9 combinations The grouping constraint for the generatedevent was set such that the averaged 119910 of the high-risk groupshould be larger than or equal to the overall average Theaveraged 119910 of the low-risk group should be less than theoverall average In Figure 4 9 possible distributions of agenerated phenotype are shown In this example the sample
size is 400 The high- and low-risk groups have the samenumber of samples and both have a variance of 10 Forgammadistributions phenotype values follow the rule below
The shape and scale parameters 119896 and 120579 were determinedby 119891119894119895 and 120590 using the relationship 119891119894119895 = 119896120579 and 120590 = 119896120579
2Penetrance models were classified by 7 heritability values
BioMed Research International 7
15
10
05
00
GMDR
00 02 04
Heritability
Hit
ratio
m-spacing
QMDR
(a)
10
05
00
m-spacing
QMDR
GMDR
00 02 04
Heritability
Hit
ratio
(b)
10
05
00
GMDR
00 02 04
Heritability
Hit
ratio
m-spacing
QMDR
(c)
Figure 5 Comparison of the hit ratios or the detection probabilities among the proposed119898-spacing method QMDR and GMDR Genomicdatasets were generated based on 70 different penetrance functions [21] which were in turn classified into 7 distinct values of heritabilityFor each model the phenotype values are simulated with normal (a) gamma (b) and mixed (c) distributions High- and low-risk groupsin a quantitative trait overlapped with 9 different combinations of the standard deviations Considering all of the above 100 data files weregenerated for each case adding up to 9000 simulated files being examined for each point in the plot
001 002 005 01 02 03 and 04 resulting in 10 models foreach heritability The generated data files had a sample sizeof 400 with 20 SNPs In all 3 times 70 times 9 = 1890 differentconditionswere set up with 100 simulated data files generatedfor each condition
33 Comparison of the Detection Probability and Type IError The ldquohit ratiordquo or detection power of the IGS wasevaluated and compared Simulated data files described inthe previous subsection were used All of them had a singlecausal pair to identify In addition to our proposed119898-spacing
method QMDR and GMDR were used to compare theresults Figure 5 shows the comparison Panels (a) (b) and(c) are for the quantitative trait of normal gamma andmixeddistributions respectively Seventy penetrance models weregrouped into 7 cases of heritability on the horizontal axiswhile all 9 combinations of the variances in high- and low-risk distributions were merged into each heritability caseWith a normal distribution as shown in Figure 5(a) the119898-spacingrsquos performance was in between those of QMDRand GMDR for higher values of penetrance However in therange of penetrance less than 02 the 119898-spacing performs
8 BioMed Research International
best Note that theQMDR shows higher detection probabilitythan the GMDR throughout the range In the case of agamma distribution as shown in Figure 5(b) the QMDRrsquosperformance drops rapidly as the heritability decreases whenthe hit ratios of 119898-spacing as well as the GMDR stay betterthan that of QMDR and are comparable to each other Notethe switch of the GMDR and QMDRrsquos performance rankswith the change of the phenotype distribution What QMDRdoes is essentially the dichotomization of the observed valuesof the quantitative phenotypes Therefore it should do betterwith well-defined symmetric distributions such as a normaldistribution than with an asymmetric one (eg gamma dis-tribution)The proposed119898-spacingmethod is expected to beeffective regardless of the shape of the phenotype distributionbecause it makes no assumptions regarding the distributionand is therefore nonparametric as demonstrated in Figures5(a) and 5(b)This nonparameterization is again confirmed inFigure 5(c) showing that119898-spacing outperforms the QMDRand theGMDR throughout the whole range of heritability inthe case of themixed form of phenotype distribution Amongthe threemethods examined119898-spacing was themost robustperforming consistentlywithin the range of conditions for thesimulation
To estimate the type I error rate the null datasets weregenerated under the same scheme as used for the detectionpower analysis except that there was no causal pair intendedNow there are 20 SNPs that none of the pairs are expectedto have an association Permutation 119875 values for a particularpair were obtained by permuting each dataset 1000 timesWe took the significance level 120572 as 005 to get the ratio ofthe permutation 119875 values smaller than or equal to 120572 Wereport this ratio as the type I error rate in Table 1 whoseaccuracy to one decimal place when expressed in percent wasensured by the number of the permutation Table 1 presentsthe type I error rate for each combination of three traitdistributions two MAFs and seven heritability values alongwith the overall estimates Throughout these conditionsthe type I error rates are gathered tightly around 5 withmaximum and minimum of 54 and 43 respectivelyMoreover there exists no sign of the dependence on thetrait shape heritability and MAF Therefore our proposedmethod preserved the type I error rates on these condi-tions
34 Application to Real Data A full-scale real dataset fromthe Korean Association Resource (KARE) project [20] wasanalyzed to investigate the effectiveness of the 119898-spacingmethod Among the available phenotypes ldquoheightrdquo waschosenwith a sample size of 8842 from the population-basedcohortThe total number of SNPs was 327872 spanning over22 chromosomesThe ldquoheightrdquo phenotype showed to be closeto a normal distribution such that the119898-spacingmethodmaynot take advantage of the shape of the phenotype distributionas discussed in the previous subsection Table 2 lists theSNPs selected by the 119898-spacing method (IGS) that had thestrongest main effects Out of 10 selected SNPs rs2079795and rs6440003 coincide with two previous reports [26 27]although twomore matched SNPs rs11989122 and rs1344672could be found as results of our analysis using the same tool
Table 1 Type I error estimation with the significance level 120572 of 005
as in [26] but using the newly imputed dataset 119875 valueswere estimated by permutation of the phenotype values tomake null distributions Permutations were iterated 100000and 10000 times for the main effect and the interactionrespectively A clear distinction between rs11989122 and theother selected SNPs can be seen in the IGS values In Table 3the 2nd order gene-gene interaction result is given The topselected pair (rs6499786 rs1788421) was found to have thestrongest association with ldquoheightrdquo but the distinction wasnot so obvious compared to the case of the main effect
4 Conclusion
In this paper we present a modified 119898-spacing method forgenome-wide association studies with a quantitative traitThe robustness of this method makes it useful for a widerange of sample sizes while the original 119898-spacing methodyields a reliable result only for datasets with a large samplesize Extensive simulation was performed to produce thedatasets with different shapes of phenotype distributionswhile varying the penetrance functions and adjusting theheritability as well Causal pair detection probability wasunaffected the most by the compared methods based on thedistribution shape and heritability while GMDR and QMDRshowed more dependency The proposed 119898-spacing methodis proven to outperform the others regardless of the shape ofthe trait distribution and also the range of lower heritabilityIn the higher heritability region the performance of theproposedmethod is comparable to that of GMDR or QMDRwhichever shows better performance in that region Thiswould lead to versatile applicability of our nonparametricmethod for quantitative traits with various characteristicsWe applied this method to successfully identify the maineffect and gene-gene interactions for the phenotype ldquoheightrdquowith the full set of KARE samples Although several of themoverlapped with a previous report new interactions werealso found Because ldquoheightrdquo is presumed to be a trait with anormal distribution having a higher heritability our methodmay be said to have performed successfullywith no advantageover other methods More extensive study is needed forquantitative traits having various characteristics to furtherdemonstrate the expected robustness of our modified 119898-spacing method
BioMed Research International 9
Table 2 Application of the119898-spacing method to a full set of KARE samples with the phenotype ldquoheightrdquo main effect
Main effectrs ID Chromosome IGS 119875 value Previous reportrs11989122 8 113892 1 times 10
minus5 lowast589 times 10
minus6
rs7316119 12 87531 1 times 10minus5 mdash
rs936634 18 86125 2 times 10minus5 mdash
rs7632381 3 78235 1 times 10minus5 mdash
rs2079795 17 76542 1 times 10minus5 292 times 10
minus6
Ref [26]rs1344672 3 76177 1 times 10
minus5 lowast521 times 10
minus7
rs2523865 6 76044 4 times 10minus5 mdash
rs3790199 20 75362 2 times 10minus5 mdash
rs6440003 3 75231 1 times 10minus5 387 times 10
minus7
Ref [27]rs17628655 19 75117 6 times 10
minus5 mdashlowastIdentified using the same method as [26] but with imputed data which is the same one we analyzed
Table 3 Application of the119898-spacing method to a full set of KARE samples with the phenotype ldquoheightrdquo 2nd order interaction
2nd order interactionrs ID Chromosome rs ID Chromosome IGS 119875 valuers6499786 16 rs1788421 21 46197 1 times 10
minus4
rs2529232 7 rs1788421 21 43869 1 times 10minus4
rs2241704 19 rs1788421 21 43855 1 times 10minus4
Conflict of Interests
The authors declare that there is no conflict of interestsregarding the publication of this paper
Acknowledgment
This research was supported by the Basic Science ResearchProgram through the National Research Foundationof Korea (NRF) funded by the Ministry of EducationScience and Technology (NRF-2013R1A1A2062848 NRF-2012R1A3A2026438)
References
[1] O Zuk EHechter S R Sunyaev andE S Lander ldquoThemysteryof missing heritability genetic interactions create phantomheritabilityrdquo Proceedings of the National Academy of Sciences ofthe United States of America vol 109 no 4 pp 1193ndash1198 2012
[2] M D Ritchie L W Hahn N Roodi et al ldquoMultifactor-dimensionality reduction reveals high-order interactionsamong estrogen-metabolism genes in sporadic breast cancerrdquoAmerican Journal of Human Genetics vol 69 no 1 pp 138ndash1472001
[3] X-Y Lou G-B Chen L Yan et al ldquoA generalized combi-natorial approach for detecting gene-by-gene and gene-by-environment interactions with application to nicotine depen-dencerdquo American Journal of Human Genetics vol 80 no 6 pp1125ndash1137 2007
[4] Y Chung S Y Lee R C Elston and T Park ldquoOdds ratiobased multifactor-dimensionality reduction method for detect-ing gene-gene interactionsrdquo Bioinformatics vol 23 no 1 pp 71ndash76 2007
[5] S Yeoun Lee Y Chung R C Elston Y Kim and T ParkldquoLog-linear model-based multifactor dimensionality reductionmethod to detect gene-gene interactionsrdquo Bioinformatics vol23 no 19 pp 2589ndash2595 2007
[6] M L Calle V Urrea G Vellalta N Malats and K V SteenldquoImproving strategies for detecting genetic patterns of diseasesusceptibility in association studiesrdquo Statistics in Medicine vol27 no 30 pp 6532ndash6546 2008
[7] W S Bush T L Edwards SM Dudek B AMcKinney andMD Ritchie ldquoAlternative contingency tablemeasures improve thepower and detection of multifactor dimensionality reductionrdquoBMC Bioinformatics vol 9 article 238 2008
[8] K Kim M-S Kwon S Oh and T Park ldquoIdentification ofmultiple gene-gene interactions for ordinal phenotypesrdquo BMCMedical Genomics vol 6 supplement 2 article S9 2013
[9] G Kang W Yue J Zhang Y Cui Y Zuo and D Zhang ldquoAnentropy-based approach for testing genetic epistasis underlyingcomplex diseasesrdquo Journal ofTheoretical Biology vol 250 no 2pp 362ndash374 2008
[10] C Dong X Chu Y Wang et al ldquoExploration of gene-geneinteraction effects using entropy-based methodsrdquo EuropeanJournal of Human Genetics vol 16 no 2 pp 229ndash235 2008
[11] C Wu S Li and Y Cui ldquoGenetic association studies aninformation content perspectiverdquoCurrent Genomics vol 13 no7 pp 566ndash573 2012
10 BioMed Research International
[12] C E Shannon ldquoAmathematical theory of communicationrdquoTheBell System Technical Journal vol 27 pp 379ndash423 1948
[13] R M Gray Entropy and Information Theory Springer NewYork NY USA 2nd edition 2011
[14] J Yee M-S Kwon T Park and M Park ldquoA modified entropy-based approach for identifying gene-gene interactions in case-control studyrdquo PLoS ONE vol 8 no 7 Article ID e69321 2013
[15] M Li C Ye W Fu R C Elston and Q Lu ldquoDetecting geneticinteractions for quantitative traits with U-statisticsrdquo GeneticEpidemiology vol 35 no 6 pp 457ndash468 2011
[16] J Gui J H Moore S M Williams et al ldquoA simple and com-putationally efficient approach to multifactor dimensionalityreduction analysis of gene-gene interactions for quantitativetraitsrdquo PLoS ONE vol 8 no 6 Article ID e66545 2013
[17] L Paninski ldquoEstimation of entropy and mutual informationrdquoNeural Computation vol 15 no 6 pp 1191ndash1253 2003
[18] N N Schraudolph ldquoGrandient-based manipulation of non-parametric entropy estimatesrdquo IEEE Transactions on NeuralNetworks vol 14 no 2 pp 1ndash10 2004
[19] K Torkkola ldquoFeature extraction by non-parametric mutualinformation maximizationrdquo Journal of Machine LearningResearch vol 3 no 7-8 pp 1415ndash1438 2003
[20] P Chanda L Sucheston S Liu A Zhang andM RamanathanldquoInformation-theoretic gene-gene and gene-environment inter-action analysis of quantitative traitsrdquo BMC Genomics vol 10article 1471 pp 509ndash530 2009
[21] D R Velez B C White A A Motsinger et al ldquoA balancedaccuracy function for epistasismodeling in imbalanced datasetsusing multifactor dimensionality reductionrdquo Genetic Epidemi-ology vol 31 no 4 pp 306ndash315 2007
[22] J Beirlant E J Dudewicz L Gyorfi and E C van der MeulenldquoNonparametric entropy estimation an overviewrdquo InternationalJournal of Mathematical and Statistical Sciences vol 6 no 1 pp17ndash39 1997
[23] A B Tsybakov and E C van der Meulen ldquoRoot-n consistentestimators of entropy for densities with unbound supportrdquoScandinavian Journal of Statistics vol 23 pp 75ndash83 1992
[24] F El Haje Hussein and Y Golubev ldquoOn entropy estimation by119898-spacing methodrdquo Jounal of Mathematical Sciences vol 163pp 290ndash309 2009
[25] E G Learned-Miller and J W Fisher III ldquoICA using spacingsestimates of entropyrdquo Journal ofMachine Learning Research vol4 pp 1271ndash1295 2004
[26] Y S Cho M J Go Y J Kim et al ldquoA large-scale genome-wideassociation study of Asian populations uncovers genetic factorsinfluencing eight quantitative traitsrdquoNatureGenetics vol 41 no5 pp 527ndash534 2009
[27] M N Weedon H Lango C M Lindgren et al ldquoGenome-wide association analysis identifies 20 loci that influence adultheightrdquo Nature Genetics vol 40 no 5 pp 575ndash583 2008
[28] H Singh NMisra V Hnizdo A Fedorowicz and E DemchukldquoNearest neighbor estimates of entropyrdquo American Journal ofMathematical and Management Sciences vol 23 no 3-4 pp301ndash321 2003
Figure 5 Comparison of the hit ratios or the detection probabilities among the proposed119898-spacing method QMDR and GMDR Genomicdatasets were generated based on 70 different penetrance functions [21] which were in turn classified into 7 distinct values of heritabilityFor each model the phenotype values are simulated with normal (a) gamma (b) and mixed (c) distributions High- and low-risk groupsin a quantitative trait overlapped with 9 different combinations of the standard deviations Considering all of the above 100 data files weregenerated for each case adding up to 9000 simulated files being examined for each point in the plot
001 002 005 01 02 03 and 04 resulting in 10 models foreach heritability The generated data files had a sample sizeof 400 with 20 SNPs In all 3 times 70 times 9 = 1890 differentconditionswere set up with 100 simulated data files generatedfor each condition
33 Comparison of the Detection Probability and Type IError The ldquohit ratiordquo or detection power of the IGS wasevaluated and compared Simulated data files described inthe previous subsection were used All of them had a singlecausal pair to identify In addition to our proposed119898-spacing
method QMDR and GMDR were used to compare theresults Figure 5 shows the comparison Panels (a) (b) and(c) are for the quantitative trait of normal gamma andmixeddistributions respectively Seventy penetrance models weregrouped into 7 cases of heritability on the horizontal axiswhile all 9 combinations of the variances in high- and low-risk distributions were merged into each heritability caseWith a normal distribution as shown in Figure 5(a) the119898-spacingrsquos performance was in between those of QMDRand GMDR for higher values of penetrance However in therange of penetrance less than 02 the 119898-spacing performs
8 BioMed Research International
best Note that theQMDR shows higher detection probabilitythan the GMDR throughout the range In the case of agamma distribution as shown in Figure 5(b) the QMDRrsquosperformance drops rapidly as the heritability decreases whenthe hit ratios of 119898-spacing as well as the GMDR stay betterthan that of QMDR and are comparable to each other Notethe switch of the GMDR and QMDRrsquos performance rankswith the change of the phenotype distribution What QMDRdoes is essentially the dichotomization of the observed valuesof the quantitative phenotypes Therefore it should do betterwith well-defined symmetric distributions such as a normaldistribution than with an asymmetric one (eg gamma dis-tribution)The proposed119898-spacingmethod is expected to beeffective regardless of the shape of the phenotype distributionbecause it makes no assumptions regarding the distributionand is therefore nonparametric as demonstrated in Figures5(a) and 5(b)This nonparameterization is again confirmed inFigure 5(c) showing that119898-spacing outperforms the QMDRand theGMDR throughout the whole range of heritability inthe case of themixed form of phenotype distribution Amongthe threemethods examined119898-spacing was themost robustperforming consistentlywithin the range of conditions for thesimulation
To estimate the type I error rate, null datasets were generated under the same scheme as used for the detection power analysis, except that no causal pair was included. Each null dataset therefore contains 20 SNPs, none of whose pairs is expected to have an association. Permutation P values for a particular pair were obtained by permuting each dataset 1000 times. We took the significance level α as 0.05 and computed the ratio of the permutation P values smaller than or equal to α. We report this ratio as the type I error rate in Table 1; its accuracy to one decimal place, when expressed in percent, was ensured by the number of permutations. Table 1 presents the type I error rate for each combination of three trait distributions, two MAFs, and seven heritability values, along with the overall estimates. Throughout these conditions, the type I error rates are gathered tightly around 5%, with a maximum and minimum of 5.4% and 4.3%, respectively. Moreover, there is no sign of dependence on the trait shape, heritability, or MAF. Therefore, our proposed method preserved the type I error rate under these conditions.
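The procedure described above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: `between_group_spread` is a hypothetical toy statistic standing in for the IGS, the (k + 1)/(B + 1) form of the permutation P value is one common convention that the text does not specify, and the dataset counts and sizes are chosen only to keep the example fast.

```python
import numpy as np

def between_group_spread(trait, genotype):
    """Toy association statistic (a stand-in, not the IGS):
    variance of the per-genotype trait means."""
    means = [trait[genotype == g].mean() for g in np.unique(genotype)]
    return float(np.var(means))

def permutation_p_value(trait, genotype, stat_fn, n_perm=1000, rng=None):
    """Permutation P value obtained by shuffling the trait values.

    Uses (k + 1) / (n_perm + 1), where k counts permuted statistics that
    reach the observed one (convention assumed for illustration).
    """
    rng = np.random.default_rng(rng)
    observed = stat_fn(trait, genotype)
    exceed = sum(
        stat_fn(rng.permutation(trait), genotype) >= observed
        for _ in range(n_perm)
    )
    return (exceed + 1) / (n_perm + 1)

def type_i_error_rate(p_values, alpha=0.05):
    """Fraction of null P values smaller than or equal to alpha."""
    return float(np.mean(np.asarray(p_values) <= alpha))

def simulate_null_rate(n_datasets=200, n_samples=100, n_perm=200,
                       alpha=0.05, seed=0):
    """Generate null datasets (trait independent of genotype) and
    return the estimated type I error rate."""
    rng = np.random.default_rng(seed)
    p_values = []
    for _ in range(n_datasets):
        genotype = rng.integers(0, 3, size=n_samples)
        trait = rng.normal(size=n_samples)
        p_values.append(
            permutation_p_value(trait, genotype, between_group_spread,
                                n_perm=n_perm, rng=rng)
        )
    return type_i_error_rate(p_values, alpha)

if __name__ == "__main__":
    print("estimated type I error rate:", simulate_null_rate())
```

Because the null distribution is built from the data themselves, the estimated rate should stay near α whatever the shape of the trait distribution, which is the property Table 1 examines.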
Table 1: Type I error estimation with the significance level α of 0.05.

3.4. Application to Real Data. A full-scale real dataset from the Korean Association Resource (KARE) project [20] was analyzed to investigate the effectiveness of the m-spacing method. Among the available phenotypes, “height” was chosen, with a sample size of 8842 from the population-based cohort. The total number of SNPs was 327872, spanning 22 chromosomes. The “height” phenotype was shown to be close to a normal distribution, so the m-spacing method may not take advantage of the shape of the phenotype distribution, as discussed in the previous subsection. Table 2 lists the SNPs selected by the m-spacing method (IGS) that had the strongest main effects. Out of the 10 selected SNPs, rs2079795 and rs6440003 coincide with two previous reports [26, 27], while two more matched SNPs, rs11989122 and rs1344672, could be found as results of our analysis using the same tool as in [26] but with the newly imputed dataset. P values were estimated by permutation of the phenotype values to make null distributions. Permutations were iterated 100000 and 10000 times for the main effect and the interaction, respectively. A clear distinction between rs11989122 and the other selected SNPs can be seen in the IGS values. In Table 3, the 2nd order gene-gene interaction result is given. The top selected pair (rs6499786, rs1788421) was found to have the strongest association with “height,” but the distinction was not as obvious as in the case of the main effect.
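As a short worked note on the permutation counts quoted above: a permutation P value can only take values on a grid set by the number of permutations B. The expression below assumes the simple estimate k/B, where k is the number of permuted statistics at least as large as the observed one; the text does not state which convention was used, so this is a sketch rather than a description of the authors' computation.

```latex
\begin{align*}
\hat{p} &= \frac{k}{B}, \qquad
k = \#\bigl\{\, b : T^{(b)} \ge T_{\mathrm{obs}} \,\bigr\}, \qquad
\hat{p} \in \Bigl\{ 0, \tfrac{1}{B}, \tfrac{2}{B}, \dots, 1 \Bigr\}, \\
B &= 100000 \;\Rightarrow\; \tfrac{1}{B} = 1 \times 10^{-5}
\quad \text{(main effect)}, \\
B &= 10000 \;\Rightarrow\; \tfrac{1}{B} = 1 \times 10^{-4}
\quad \text{(2nd order interaction)}.
\end{align*}
```

Under this reading, a reported P value of 1 × 10^-5 for a main effect, or 1 × 10^-4 for an interaction, sits at the finest step that the permutation count allows.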
4. Conclusion
In this paper, we present a modified m-spacing method for genome-wide association studies with a quantitative trait. The robustness of this method makes it useful for a wide range of sample sizes, while the original m-spacing method yields a reliable result only for datasets with a large sample size. Extensive simulation was performed to produce datasets with different shapes of phenotype distributions while varying the penetrance functions and adjusting the heritability as well. Among the compared methods, the causal pair detection probability of the proposed method was the least affected by the distribution shape and heritability, while GMDR and QMDR showed more dependency. The proposed m-spacing method was shown to outperform the others in the range of lower heritability, regardless of the shape of the trait distribution. In the higher heritability region, the performance of the proposed method is comparable to that of GMDR or QMDR, whichever shows better performance in that region. This would lead to versatile applicability of our nonparametric method for quantitative traits with various characteristics. We applied this method to successfully identify the main effect and gene-gene interactions for the phenotype “height” with the full set of KARE samples. Although several of the findings overlapped with a previous report, new interactions were also found. Because “height” is presumed to be a trait with a normal distribution and a higher heritability, our method may be said to have performed successfully with no advantage over other methods. More extensive study is needed for quantitative traits having various characteristics to further demonstrate the expected robustness of our modified m-spacing method.
Table 2: Application of the m-spacing method to a full set of KARE samples with the phenotype “height”: main effect.

rs ID        Chromosome   IGS      P value       Previous report
rs11989122   8            113892   1 × 10^-5     *5.89 × 10^-6
rs7316119    12           87531    1 × 10^-5     —
rs936634     18           86125    2 × 10^-5     —
rs7632381    3            78235    1 × 10^-5     —
rs2079795    17           76542    1 × 10^-5     2.92 × 10^-6, Ref [26]
rs1344672    3            76177    1 × 10^-5     *5.21 × 10^-7
rs2523865    6            76044    4 × 10^-5     —
rs3790199    20           75362    2 × 10^-5     —
rs6440003    3            75231    1 × 10^-5     3.87 × 10^-7, Ref [27]
rs17628655   19           75117    6 × 10^-5     —

*Identified using the same method as [26] but with imputed data, which is the same one we analyzed.
Table 3: Application of the m-spacing method to a full set of KARE samples with the phenotype “height”: 2nd order interaction.

rs ID       Chromosome   rs ID       Chromosome   IGS     P value
rs6499786   16           rs1788421   21           46197   1 × 10^-4
rs2529232   7            rs1788421   21           43869   1 × 10^-4
rs2241704   19           rs1788421   21           43855   1 × 10^-4
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
Acknowledgment
This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education, Science and Technology (NRF-2013R1A1A2062848, NRF-2012R1A3A2026438).
References
[1] O. Zuk, E. Hechter, S. R. Sunyaev, and E. S. Lander, “The mystery of missing heritability: genetic interactions create phantom heritability,” Proceedings of the National Academy of Sciences of the United States of America, vol. 109, no. 4, pp. 1193–1198, 2012.
[2] M. D. Ritchie, L. W. Hahn, N. Roodi et al., “Multifactor-dimensionality reduction reveals high-order interactions among estrogen-metabolism genes in sporadic breast cancer,” American Journal of Human Genetics, vol. 69, no. 1, pp. 138–147, 2001.
[3] X.-Y. Lou, G.-B. Chen, L. Yan et al., “A generalized combinatorial approach for detecting gene-by-gene and gene-by-environment interactions with application to nicotine dependence,” American Journal of Human Genetics, vol. 80, no. 6, pp. 1125–1137, 2007.
[4] Y. Chung, S. Y. Lee, R. C. Elston, and T. Park, “Odds ratio based multifactor-dimensionality reduction method for detecting gene-gene interactions,” Bioinformatics, vol. 23, no. 1, pp. 71–76, 2007.
[5] S. Yeoun Lee, Y. Chung, R. C. Elston, Y. Kim, and T. Park, “Log-linear model-based multifactor dimensionality reduction method to detect gene-gene interactions,” Bioinformatics, vol. 23, no. 19, pp. 2589–2595, 2007.
[6] M. L. Calle, V. Urrea, G. Vellalta, N. Malats, and K. V. Steen, “Improving strategies for detecting genetic patterns of disease susceptibility in association studies,” Statistics in Medicine, vol. 27, no. 30, pp. 6532–6546, 2008.
[7] W. S. Bush, T. L. Edwards, S. M. Dudek, B. A. McKinney, and M. D. Ritchie, “Alternative contingency table measures improve the power and detection of multifactor dimensionality reduction,” BMC Bioinformatics, vol. 9, article 238, 2008.
[8] K. Kim, M.-S. Kwon, S. Oh, and T. Park, “Identification of multiple gene-gene interactions for ordinal phenotypes,” BMC Medical Genomics, vol. 6, supplement 2, article S9, 2013.
[9] G. Kang, W. Yue, J. Zhang, Y. Cui, Y. Zuo, and D. Zhang, “An entropy-based approach for testing genetic epistasis underlying complex diseases,” Journal of Theoretical Biology, vol. 250, no. 2, pp. 362–374, 2008.
[10] C. Dong, X. Chu, Y. Wang et al., “Exploration of gene-gene interaction effects using entropy-based methods,” European Journal of Human Genetics, vol. 16, no. 2, pp. 229–235, 2008.
[11] C. Wu, S. Li, and Y. Cui, “Genetic association studies: an information content perspective,” Current Genomics, vol. 13, no. 7, pp. 566–573, 2012.
[12] C. E. Shannon, “A mathematical theory of communication,” The Bell System Technical Journal, vol. 27, pp. 379–423, 1948.
[13] R. M. Gray, Entropy and Information Theory, Springer, New York, NY, USA, 2nd edition, 2011.
[14] J. Yee, M.-S. Kwon, T. Park, and M. Park, “A modified entropy-based approach for identifying gene-gene interactions in case-control study,” PLoS ONE, vol. 8, no. 7, Article ID e69321, 2013.
[15] M. Li, C. Ye, W. Fu, R. C. Elston, and Q. Lu, “Detecting genetic interactions for quantitative traits with U-statistics,” Genetic Epidemiology, vol. 35, no. 6, pp. 457–468, 2011.
[16] J. Gui, J. H. Moore, S. M. Williams et al., “A simple and computationally efficient approach to multifactor dimensionality reduction analysis of gene-gene interactions for quantitative traits,” PLoS ONE, vol. 8, no. 6, Article ID e66545, 2013.
[17] L. Paninski, “Estimation of entropy and mutual information,” Neural Computation, vol. 15, no. 6, pp. 1191–1253, 2003.
[18] N. N. Schraudolph, “Gradient-based manipulation of non-parametric entropy estimates,” IEEE Transactions on Neural Networks, vol. 14, no. 2, pp. 1–10, 2004.
[19] K. Torkkola, “Feature extraction by non-parametric mutual information maximization,” Journal of Machine Learning Research, vol. 3, no. 7-8, pp. 1415–1438, 2003.
[20] P. Chanda, L. Sucheston, S. Liu, A. Zhang, and M. Ramanathan, “Information-theoretic gene-gene and gene-environment interaction analysis of quantitative traits,” BMC Genomics, vol. 10, article 1471, pp. 509–530, 2009.
[21] D. R. Velez, B. C. White, A. A. Motsinger et al., “A balanced accuracy function for epistasis modeling in imbalanced datasets using multifactor dimensionality reduction,” Genetic Epidemiology, vol. 31, no. 4, pp. 306–315, 2007.
[22] J. Beirlant, E. J. Dudewicz, L. Gyorfi, and E. C. van der Meulen, “Nonparametric entropy estimation: an overview,” International Journal of Mathematical and Statistical Sciences, vol. 6, no. 1, pp. 17–39, 1997.
[23] A. B. Tsybakov and E. C. van der Meulen, “Root-n consistent estimators of entropy for densities with unbound support,” Scandinavian Journal of Statistics, vol. 23, pp. 75–83, 1992.
[24] F. El Haje Hussein and Y. Golubev, “On entropy estimation by m-spacing method,” Journal of Mathematical Sciences, vol. 163, pp. 290–309, 2009.
[25] E. G. Learned-Miller and J. W. Fisher III, “ICA using spacings estimates of entropy,” Journal of Machine Learning Research, vol. 4, pp. 1271–1295, 2004.
[26] Y. S. Cho, M. J. Go, Y. J. Kim et al., “A large-scale genome-wide association study of Asian populations uncovers genetic factors influencing eight quantitative traits,” Nature Genetics, vol. 41, no. 5, pp. 527–534, 2009.
[27] M. N. Weedon, H. Lango, C. M. Lindgren et al., “Genome-wide association analysis identifies 20 loci that influence adult height,” Nature Genetics, vol. 40, no. 5, pp. 575–583, 2008.
[28] H. Singh, N. Misra, V. Hnizdo, A. Fedorowicz, and E. Demchuk, “Nearest neighbor estimates of entropy,” American Journal of Mathematical and Management Sciences, vol. 23, no. 3-4, pp. 301–321, 2003.