Linkage Disequilibrium Estimation in Low Coverage
High-Throughput Sequencing Data

Timothy P. Bilton, John C. McEwan, Shannon M. Clarke, Rudiger Brauning, Tracey C. van Stijn, Suzanne J. Rowe and
Ken G. Dodds

AgResearch, Invermay Agricultural Centre, Mosgiel 9053, New Zealand

ABSTRACT High-throughput sequencing methods that multiplex a large number of individuals have provided
a cost-effective approach for discovering genome-wide genetic variation in large populations. These
sequencing methods are increasingly being utilized in population genetic studies across a diverse range of
species. Two side-effects of these methods, however, are (1) sequencing errors and (2) heterozygous
genotypes called as homozygous due to only one allele at a particular locus being sequenced, which occurs
when the sequencing depth is insufficient. Both of these errors have a profound effect on the estimation of
linkage disequilibrium and, if not taken into account, lead to inaccurate estimates. We developed a new
likelihood method, GUS-LD, to estimate pairwise linkage disequilibrium using low coverage sequencing
data that accounts for under-called heterozygous genotypes and sequencing errors. Our findings show
that accurate estimates were obtained using GUS-LD, whereas underestimation of linkage disequilibrium
results if no adjustment is made for the errors.

KEYWORDS genotyping-by-sequencing; linkage disequilibrium; maximum likelihood; allelic dropout; low coverage

AgResearch, Invermay Agricultural Centre, Private Bag 50034, Mosgiel 9053, New Zealand. E-mail: [email protected]

Genetics: Early Online, published on March 27, 2018 as 10.1534/genetics.118.300831
Copyright 2018.

Linkage disequilibrium (LD) is the term given to the non-random association of alleles located
at different loci in a population. Quantifying the level of LD or estimating the pairwise
LD between all loci in a population is of interest to many researchers as it has many important
applications. For example, in association mapping studies, LD is used to identify candidate regions of
the genome associated with a particular trait or disease and can provide finer resolution in mapping
compared to linkage based studies (Devlin and Risch 1995; Jorde 1995; Xiong and Guo 1997; Mackay
and Powell 2007). LD is affected by many genetic and evolutionary forces such as recombination,
admixture, migration, selection and gene flow among others (Terwilliger et al. 1998; Ardlie et al.
2002; Gaut and Long 2003; Slatkin 2008). Consequently, LD patterns can be used to quantify genetic
diversity and make inferences about the evolutionary history of natural populations (Nordborg and
Tavaré 2002; Slatkin 2008; Zhu et al. 2015). In addition, the relationship between map distance and
the level of LD can be used to estimate the effective population size (Sved 1971; Hill 1981; Hayes et al.
2003; Waples 2006; Sved et al. 2013).

Today, many species are being sequenced using high-throughput sequencing methods that multiplex
a large number of individuals. Some of the most popular sequencing methods are whole genome
sequencing and the reduced representation approaches such as genotyping-by-sequencing (Elshire
et al. 2011), whole-exome sequencing (Hodges et al. 2007), and restriction-site associated DNA (Baird
et al. 2008). These sequencing methods provide a low cost approach for performing genome-wide
genotyping and discovery of single nucleotide polymorphisms (SNPs) that does not require prior
genomic information. As a result, they have been applied in a plethora of plant, aquaculture, and
animal species, and are becoming the method of choice for many species, particularly for non-model
organisms (Andrews et al. 2016; Kim et al. 2016; Chung et al. 2017; Li and Wang 2017; Robledo et al.
2017). Genetic data generated using high-throughput sequencing methods are increasingly being
used to compute pairwise linkage disequilibrium estimates (e.g., Hohenlohe et al. 2012; Wang et al.
2013; Nimmakayala et al. 2014; Huang et al. 2014; Xu et al. 2014; Fè et al. 2015; Zhang et al. 2015;
Covarrubias-Pazaran et al. 2016; Van Wyngaarden et al. 2016; Gur et al. 2017; Sieber et al. 2017; Faville
et al. 2018).

A major disadvantage of high-throughput sequencing methods is that one or both of the alleles
at a particular locus may be missed for a given individual if the sequencing depth is low. If neither
allele is seen, a missing genotype results, while if only one of the two parental alleles is seen (possibly
multiple times), a heterozygous genotype may be called as homozygous (Dodds et al. 2015; Fragoso
et al. 2016). The latter case is also known as allelic dropout and is particularly problematic as genotype
calls with this type of missingness behave like genotyping errors, which have a profound impact on
the estimation of LD even when the error rates are low (Akey et al. 2001). An additional complication
of sequencing data is the presence of sequencing errors (bases that have been miscalled), which also
affect the estimation of genetic quantities such as recombination fractions (Bilton et al. 2018).

One way of removing genotyping errors resulting from low sequencing depth is to set genotype
calls with an associated read depth below some threshold value to missing. However, such filtering
results in fewer individuals and SNPs for a given sequencing cost (Dodds et al. 2015), and for
low coverage data may result in insufficient data to undertake the analysis. LD is often estimated
using haplotypes phased from genotype data via various software packages and algorithms such as
BEAGLE (Browning and Browning 2007), fastPHASE (Scheet and Stephens 2006), MaCH (Li et al.
2010), and FILLIN (Swarts et al. 2014). However, all of these approaches require that the chromosomal
order of the loci is known in order to infer haplotypes, which is not necessarily the case for reduced
representation sequencing data, particularly if SNPs are called de novo. Furthermore, many species
that are genotyped using sequencing methods are highly polymorphic and have low LD levels, and
phasing in such species can be problematic (Bukowicki et al. 2016). A few alternative approaches
for estimating LD from high-throughput sequencing data have been presented in the literature.
Feder et al. (2012) proposed estimating pairwise LD using reads that cover both loci while estimating
the allele frequencies using all the reads. This approach, however, is not applicable to short read
sequencing data (e.g., genotyping-by-sequencing) where most of the reads do not cover both sites.
Alternatively, it restricts the analysis to loci that are very close together, which limits its usefulness. Maruki
and Lynch (2014) presented a likelihood method for estimating the disequilibrium coefficient in
situations where there is a combination of reads that intersect both loci or only one of the two loci.
Their method accounts for sequencing errors but requires that additional erroneous alleles are called
in the alignment process, whereas most variant callers by default only allow for two alleles to be
called at a SNP.

We present a new method for estimating pairwise LD using low coverage sequencing data, without
requiring haplotype phasing, a known chromosomal order or filtering with regard to read depth.
In essence, our method is based on the likelihood method by Hill (1974), which estimates LD using
genotypic data in random mating populations, but is extended to account for errors resulting from
under-called heterozygotes and sequencing errors. Our method removes bias in LD estimation
caused by these errors but results in more variable estimates at low depth. We also examine the effect
that genotyping errors from low read depths and sequencing errors have on the estimation of LD.

Materials and Methods

Estimation of pairwise LD

Let $A_j$ and $B_j$ denote the reference and alternate allele at locus $j$ respectively and let $p_{A_j}$ and $p_{B_j}$
denote the allele frequencies for the reference and alternate allele at locus $j$ respectively. The linkage
disequilibrium coefficient is defined as (Lewontin and Kojima 1960)

\[
D = p_{A_1 A_2} - p_{A_1} p_{A_2}, \tag{1}
\]

where $p_{A_1 A_2}$ is the probability of observing a haplotype containing the major allele at both loci. Since
probabilities are required to be non-negative, D must satisfy the constraints (Lewontin 1964)

\[
\begin{aligned}
D &\geq \max\left(-p_{A_1} p_{A_2},\; -(1 - p_{A_1})(1 - p_{A_2})\right), \\
D &\leq \min\left(p_{A_1}(1 - p_{A_2}),\; p_{A_2}(1 - p_{A_1})\right).
\end{aligned}
\tag{2}
\]

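As a rough illustration of Equations (1) and (2), the following Python sketch computes D, its bounds, and the two derived measures D′ and r². The function name `ld_measures` is hypothetical, and the formulas used for D′ (D scaled by the bound on the same side of zero) and r² (D² divided by the product of the four allele frequencies) are the standard textbook definitions, assumed here rather than quoted from this section.

```python
def ld_measures(p_a1a2, p_a1, p_a2):
    """Sketch: D, its bounds, D' and r^2 for two biallelic loci.

    p_a1a2 is the frequency of the A1A2 haplotype; p_a1 and p_a2 are the
    frequencies of allele A at loci 1 and 2 (all assumed known here).
    """
    d = p_a1a2 - p_a1 * p_a2                                  # Equation (1)
    d_min = max(-p_a1 * p_a2, -(1 - p_a1) * (1 - p_a2))       # Equation (2), lower bound
    d_max = min(p_a1 * (1 - p_a2), p_a2 * (1 - p_a1))         # Equation (2), upper bound
    if not (d_min - 1e-12 <= d <= d_max + 1e-12):
        raise ValueError("haplotype frequency is inconsistent with the allele frequencies")

    # Standard definition (an assumption of this sketch): scale D by the
    # bound on the same side of zero, so D' keeps the sign of D.
    if d > 0:
        d_prime = d / d_max
    elif d < 0:
        d_prime = d / -d_min
    else:
        d_prime = 0.0

    # Standard definition (an assumption of this sketch): squared correlation.
    r2 = d ** 2 / (p_a1 * (1 - p_a1) * p_a2 * (1 - p_a2))
    return {"D": d, "D_min": d_min, "D_max": d_max, "D_prime": d_prime, "r2": r2}


if __name__ == "__main__":
    # Equal allele frequencies: D can reach 0.25, where r^2 is 1.
    print(ld_measures(0.5, 0.5, 0.5))
    # Unequal allele frequencies: D is capped at 0.125 and r^2 stays below 1.
    print(ld_measures(0.5, 0.5, 0.75))
```

With unequal allele frequencies the upper bound on D shrinks and r² cannot reach 1 even when D sits on its bound, which is why the different LD measures can behave differently at the edge of the parameter space.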
We let $G_{ij}$ denote the true genotype for individual $i$ at locus $j$ and $G_i = (G_{ij}, G_{ik})^T$ denote the true joint
genotype for individual $i$ between locus $j$ and $k$, where $j \neq k$, $i = 1, \ldots, n$ and $T$ denotes the transpose.
We let $AA_j$, $AB_j$, and $BB_j$ denote the reference homozygous genotype, heterozygous genotype, and
alternate homozygous genotype at locus $j$ respectively. For two biallelic loci, the nine joint genotypes
are $(AA_1, AA_2)^T$, $(AA_1, AB_2)^T$, $(AA_1, BB_2)^T$, $(AB_1, AA_2)^T$, $(AB_1, AB_2)^T$, $(AB_1, BB_2)^T$, $(BB_1, AA_2)^T$,
$(BB_1, AB_2)^T$, and $(BB_1, BB_2)^T$, which we denote by 1 to 9 respectively.

In sequencing data, the true genotypes are latent while the observed data consists of the number
of reads for the reference and alternate alleles. We denote the number of reads for the reference allele
for individual $i$ at locus $j$ by $Y_{ij}$, where $Y_{ij}$ is an integer value between 0 and the sequencing depth $d_{ij}$,
which is the sum of reference and alternate allele counts at locus $j$ in individual $i$. By the law of total
probability,

\[
P(Y_i) = \sum_{g=1}^{9} P(Y_i \mid G_i = g)\, P(G_i = g), \tag{3}
\]

where $Y_i = (Y_{i1}, Y_{i2})^T$ and $g = (g_1, g_2)^T$. If the number of observed reads for the reference allele
given the true genotype are independent between loci, Equation (3) simplifies to

\[
P(Y_i) = \sum_{g=1}^{9} \left( \prod_{j=1}^{2} P(Y_{ij} \mid G_{ij} = g_j) \right) P(G_i = g). \tag{4}
\]

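The structure of Equation (4) can be sketched in a few lines of Python. The per-locus read probability used below is an assumption of this sketch, since the relevant equation is not reproduced in this excerpt: reads are taken to be binomial, with a reference read arising with probability 1 − ε for a reference homozygote, 1/2 for a heterozygote, and ε for an alternate homozygote. The names `read_prob` and `individual_likelihood` are hypothetical.

```python
from math import comb

# Joint genotypes in the order 1..9 used in the text:
# (AA,AA), (AA,AB), (AA,BB), (AB,AA), ..., (BB,BB).
GENOTYPES = [(g1, g2) for g1 in ("AA", "AB", "BB") for g2 in ("AA", "AB", "BB")]


def read_prob(y, depth, genotype, eps):
    """P(Y = y | G) for y reference reads out of `depth` reads.

    Assumed binomial read model (not taken from this excerpt), with eps the
    probability that a single read reports the wrong allele.
    """
    p_ref = {"AA": 1.0 - eps, "AB": 0.5, "BB": eps}[genotype]
    return comb(depth, y) * p_ref ** y * (1.0 - p_ref) ** (depth - y)


def individual_likelihood(y, depth, joint_probs, eps):
    """Equation (4): sum over the nine joint genotypes of the product of
    per-locus read probabilities times the joint genotype probability."""
    total = 0.0
    for (g1, g2), p_g in zip(GENOTYPES, joint_probs):
        total += read_prob(y[0], depth[0], g1, eps) * read_prob(y[1], depth[1], g2, eps) * p_g
    return total


if __name__ == "__main__":
    uniform = [1.0 / 9.0] * 9  # placeholder genotype probabilities, for illustration only
    # One reference read at each locus, depth 1 at each locus.
    print(individual_likelihood((1, 1), (1, 1), uniform, eps=0.01))
```

A locus with depth zero contributes a factor of 1 for every genotype under this model, so missing genotypes drop out of the likelihood without special handling.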
Table 1 Joint genotype probabilities for two biallelic loci under the assumption of Hardy-Weinberg equilibrium
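The entries of Table 1 are not reproduced in this excerpt. As a hedged sketch of how such joint genotype probabilities can be derived, the Python below forms the four haplotype frequencies from the allele frequencies and D, then pairs haplotypes at random (the Hardy-Weinberg assumption named in the caption). The function name `joint_genotype_probs` is hypothetical, and the output is this sketch's derivation rather than a copy of the authors' table.

```python
from itertools import product


def joint_genotype_probs(p_a1, p_a2, d):
    """Probabilities of the nine joint genotypes (order 1..9 as in the text)
    under random union of haplotypes."""
    haplotypes = {
        ("A", "A"): p_a1 * p_a2 + d,
        ("A", "B"): p_a1 * (1 - p_a2) - d,
        ("B", "A"): (1 - p_a1) * p_a2 - d,
        ("B", "B"): (1 - p_a1) * (1 - p_a2) + d,
    }
    if min(haplotypes.values()) < -1e-12:
        raise ValueError("D lies outside the bounds implied by the allele frequencies")

    probs = [0.0] * 9
    for (h1, f1), (h2, f2) in product(haplotypes.items(), repeat=2):
        # Count alternate alleles per locus: 0 -> AA, 1 -> AB, 2 -> BB.
        n_b1 = (h1[0] == "B") + (h2[0] == "B")
        n_b2 = (h1[1] == "B") + (h2[1] == "B")
        probs[3 * n_b1 + n_b2] += f1 * f2
    return probs


if __name__ == "__main__":
    for d in (0.0, 0.25):
        print(d, joint_genotype_probs(0.5, 0.5, d))
```

The double heterozygote (genotype 5) collects contributions from both coupling and repulsion haplotype pairs, which is why the phase of that genotype cannot be read off directly from genotype data.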
average read depth increased, the number of under-called heterozygous genotypes in the datasets
decreased, resulting in less bias for LD estimates obtained from the standard likelihood method. For
mean depths greater than 10, the estimates from the standard approach coincided with GUS-LD
when the true LD was small or absent but were still biased when the true LD was large, which is due
to the presence of sequencing errors.

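The link between read depth and under-called heterozygotes can be made concrete with a short calculation. Assuming the two alleles of a heterozygote are sampled with equal probability and ignoring sequencing errors (both assumptions of this sketch), all d reads show the same allele with probability 2 × (1/2)^d. The function name `undercall_prob` is hypothetical.

```python
def undercall_prob(depth):
    """Probability that a heterozygote with `depth` >= 1 reads shows only one
    allele, assuming equal allele sampling and no sequencing error."""
    if depth < 1:
        raise ValueError("depth must be at least 1; depth 0 gives a missing genotype")
    return 2 * 0.5 ** depth


if __name__ == "__main__":
    for depth in (1, 2, 5, 10):
        print(depth, undercall_prob(depth))
```

At a depth of 1 every heterozygote appears homozygous, while at a depth of 10 fewer than 1 in 500 do, which is consistent with the bias of the standard approach shrinking as the mean depth grows.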
Figure 1 also shows the standard errors of the estimates for the three LD measures computed
using the two approaches. In general, the standard errors of the LD estimates computed under
GUS-LD were larger compared with those obtained under the standard likelihood approach, with
the difference decreasing as the average read depth increased. This increase in the standard errors
for GUS-LD was expected as there is extra sampling variation introduced into the sequencing data,
caused by not all alleles being observed. On the other hand, when the true value of D was close to or
on the lower or upper bound of its parameter space (Equation (2)), GUS-LD tended to yield smaller
standard errors than the standard approach.

The bias and standard errors of the LD estimates for alternative combinations of allele frequencies
are given in Figure S1 ($p_{A_1} = 0.5$, $p_{A_2} = 0.75$) and Figure S2 ($p_{A_1} = p_{A_2} = 0.9$). The results from
these simulations were mostly in agreement with those when $p_{A_1} = 0.5$ and $p_{A_2} = 0.5$. The standard
errors for the allele frequency estimates from GUS-LD and the standard approach for all three sets of
parameter values are given in Figure S3. Overall, the standard errors of the allele frequency estimates
were fairly similar between the two methods.

The bias and standard errors of the sequencing error estimates from GUS-LD for the first set of
simulations are given in Figure 2. At high mean depths, these estimates were unbiased across all the
different combinations of parameter values, whereas for low mean read depths the estimates were
generally biased upwards with the bias increasing as the mean depth decreased. The standard errors
of the sequencing error estimates were also smallest at higher mean depths and increased as the
mean depth decreased.

Figure 1 Bias of the LD estimates for D (A), D′ (C) and $r^2$ (E) and standard error (SE) of the LD
estimates for D (B), D′ (D) and $r^2$ (F) when $p_{A_1} = 0.5$, $p_{A_2} = 0.5$, ε = 0.01 and the true values
of D were −0.05, 0, 0.05, 0.15, and 0.25. The dashed lines represent the estimates obtained using
GUS-LD whereas the solid lines represent the estimates obtained using the standard likelihood
approach. The lower and upper bounds for D are −0.25 and 0.25 respectively.

For the second set of simulations, the mean square error (MSE) of the LD estimates for the
various pairwise LD measures is given in Figure 3, where the sequencing effort was 600 reads,
$p_{A_1} = p_{A_2} = 0.5$, ε = 0.01 and a range of values of D were used. The MSE for GUS-LD was lower
than that of the standard approach when the true LD was large or near its maximum value. When the true
LD was small, the standard approach had lower MSE at low depths but higher MSE at high depths.
This is due to LD estimates from the standard approach having a small bias and smaller standard
errors compared to GUS-LD when the true LD was small at low depths, whereas the presence of
sequencing errors results in the higher MSE at high depths. The MSE for GUS-LD was smallest
between mean depths of 2 and 5, where the actual depth at which the minimum occurred depended
on the true value of D and the LD measure. The MSE is larger at higher read depths for GUS-LD as
the increase in variability from having fewer individuals in the data sets was larger than the decrease
in variability from having high read depths. There was one exception to this trend that occurred
when the true value of D was equal to its upper bound (D = 0.25) for all the LD measures. In this
case, the mean square error was largest at smaller mean read depths and decreased as the mean read
depth increased. This is due to the fact that there is no variation or bias when the genotypes are
accurate for values of D that are on their upper or lower bound, but there is variation when there is
uncertainty in the genotype calls associated with low read depths.

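The trade-off described above can be illustrated with two small helpers. The first reproduces the sample sizes quoted for Figure 3 (300 individuals at a mean depth of 1 down to 20 at a mean depth of 15) under the assumption, inferred from those numbers rather than stated in this excerpt, that the 600 reads are shared across the two loci of each pair. The second computes MSE as squared bias plus variance. Both function names are hypothetical.

```python
def individuals_for_effort(effort, mean_depth, n_loci=2):
    """Number of individuals affordable at a given mean depth.

    n_loci=2 is an assumption inferred from the Figure 3 sample sizes.
    """
    return effort / (n_loci * mean_depth)


def mean_square_error(estimates, truth):
    """MSE of a set of estimates, computed as bias^2 + variance."""
    n = len(estimates)
    mean = sum(estimates) / n
    bias = mean - truth
    variance = sum((e - mean) ** 2 for e in estimates) / n
    return bias ** 2 + variance


if __name__ == "__main__":
    depths = (1, 2, 3, 4, 5, 7.5, 10, 15)
    print([individuals_for_effort(600, d) for d in depths])
    # Toy estimates of D around a true value of 0.25 (made-up numbers).
    print(mean_square_error([0.10, 0.30], truth=0.25))
```

Because MSE adds squared bias to variance, a method with larger standard errors can still win on MSE when it removes enough bias, and vice versa, which is the pattern reported for the two approaches.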
The mean square errors of the LD estimates for alternative combinations of allele frequencies when
the sequencing effort was fixed are given in Figure S4 ($p_{A_1} = 0.5$, $p_{A_2} = 0.75$) and Figure S5
($p_{A_1} = p_{A_2} = 0.9$). The results from these simulations were very similar to the case when $p_{A_1} = p_{A_2} = 0.5$,
although there were some differences. For example, the MSE across all the mean depths for D was
larger as the true value of D increased when $p_{A_1} = p_{A_2} = 0.9$, whereas the reverse was true when
$p_{A_1} = p_{A_2} = 0.5$ and when $p_{A_1} = 0.5$ and $p_{A_2} = 0.75$. Also, for $p_{A_1} = 0.5$ and $p_{A_2} = 0.75$, the MSE
for the LD measure $r^2$ did not decrease as the read depth increased when the true value of D
was on its upper boundary (D = 0.125), unlike for the other parameter combinations. This was due to
unequal allele frequencies meaning that the estimates of $r^2$ were not near its upper bound of 1. These
differences were due to the complex sampling properties of the various LD measures. Nevertheless,
the optimal sequencing depth was mostly between 2 and 5 across all scenarios and LD measures.

Figure 2 Bias of the sequencing error estimates, ε̂, from GUS-LD for simulation 1 (A), simulation 2
(C) and simulation 3 (E), and standard error (SE) of the sequencing error estimates, ε̂, from GUS-LD
for simulation 1 (B), simulation 2 (D) and simulation 3 (F), where the parameters used for each
simulation are given in Table 2.

Figure 3 Mean square error (MSE) of the LD estimates for D (A), D′ (B) and $r^2$ (C) for a fixed
average sequencing effort of 600 reads when $p_{A_1} = 0.5$, $p_{A_2} = 0.5$, ε = 0.01 and the true values of
D were −0.05, 0, 0.05, 0.15, and 0.25. The lower and upper bounds for D are −0.25 and 0.25
respectively. The dashed lines represent MSE for GUS-LD whereas the solid lines represent MSE for the
standard likelihood approach. The numbers of individuals in the simulated data sets were 300, 150,
100, 75, 60, 40, 30 and 20 at mean read depths of 1, 2, 3, 4, 5, 7.5, 10 and 15 respectively.

Deer dataset

The LD estimates between all pairs amongst a set of 38 SNPs are given in Figure 4 for the absolute
value of D′ and Figure 5 for $r^2$. For the former LD measure, a number of pairwise estimates computed
using GUS-LD were larger compared to the estimates obtained from the standard likelihood approach,
which is seen by the greater intensity of red across the heatmap in Figure 4B compared to Figure 4A.
Similarly, there were some pairwise estimates of $r^2$ that were larger under GUS-LD (Figure 5B)
compared to the standard likelihood approach (Figure 5A), which is seen by the fact that some of the
yellow squares in Figure 5A appear more orange in Figure 5B. The average value of all the pairwise
estimates for the two LD measures was larger under GUS-LD than the standard likelihood approach
(Table 3). Compared to the simulation results, the difference in the LD estimates between the two
approaches was not particularly large. This was due to a number of SNPs having high mean read
depths (Figure S6). Nevertheless, the P-values from a Wilcoxon signed-rank test comparing the mean
LD estimated from GUS-LD and the standard approach were very small (Table 3), giving strong
evidence that the mean estimated level of LD from GUS-LD was significantly larger than from the
standard approach. The distribution of the sequencing error estimates obtained from GUS-LD for all
SNP pairs is given in Figure S7, where the mean estimate was 0.14%.

Table 3 Average LD estimate across all pairs of SNPs for the deer dataset

LD Measure    Standard    GUS-LD    P-value (a)
|D′|          0.48        0.62      $< 10^{-6}$
$r^2$         0.028       0.040     $< 10^{-6}$

(a) P-value from a Wilcoxon signed-rank test comparing the mean LD estimated from the standard
approach and GUS-LD. The test was performed in the programming language R (R Core Team 2017)
using the wilcox.test function (paired=TRUE).

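For readers without R to hand, the paired comparison behind Table 3 can be approximated with the standard library alone. The sketch below is not the routine used by the authors: it implements the Wilcoxon signed-rank statistic with average ranks for ties and a plain normal approximation (no continuity or tie correction), so its P-values may differ slightly from those of wilcox.test. The function name `wilcoxon_signed_rank` and the example numbers are hypothetical.

```python
from math import erfc, sqrt


def wilcoxon_signed_rank(x, y):
    """Two-sided paired Wilcoxon signed-rank test (normal approximation).

    Returns (W_plus, p_value), where W_plus is the sum of ranks of the
    positive differences x - y. Zero differences are dropped.
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")

    # Rank the absolute differences, giving tied values their average rank.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2 + 1  # mean of ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1

    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mean = n * (n + 1) / 4
    sd = sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mean) / sd
    p_value = erfc(abs(z) / sqrt(2))  # two-sided tail of the standard normal
    return w_plus, p_value


if __name__ == "__main__":
    # Made-up paired LD estimates for five SNP pairs (illustration only).
    gus_ld = [0.62, 0.40, 0.55, 0.71, 0.33]
    standard = [0.48, 0.35, 0.41, 0.60, 0.30]
    print(wilcoxon_signed_rank(gus_ld, standard))
```

The test is paired because each SNP pair yields one estimate from each method; with 38 SNPs there are 38 × 37 / 2 = 703 such pairs.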
Discussion

The introduction of high-throughput sequencing methods that multiplex a large number of individuals
is driving forward research into many species, particularly non-model species, and is increasingly
being utilized by many researchers. However, analyzing sequencing data using existing analytical
tools and methods may, in some cases, be impractical or lead to erroneous results due to the added
complexity and nuances of the data compared to other genetic data types. Consequently, the development
of new methodological tools for analyzing sequencing data is needed, although the progress
of such tools has been slow compared to the sequencing technology (Gardner et al. 2014).

Our simulation results have demonstrated that genotyping errors associated with under-called
heterozygotes (e.g., allelic dropout) and miscalled bases lead to under-estimation of LD when
these errors are not taken into account. This is important as biased estimates of LD can have a
profound effect on downstream analyses. For example, in case-control association studies, it has
been shown using simulations that the presence of genotyping errors leads to reduced power in
detecting an association between a locus and phenotype (Gordon and Ott 2001; Gordon et al. 2002).
Russell and Fewster (2009) have also shown via simulations that allelic dropout results in positively
biased estimates of effective population size when calculated using LD information. This problem is
exacerbated for low coverage data as the rate of genotyping errors is much higher than those used
in these simulation studies. We have developed a new method, called GUS-LD, that accounts for
errors associated with under-called heterozygotes and miscalled bases in the estimation of LD. Our
results show that GUS-LD was able to greatly reduce bias in LD estimates at low sequencing depth,
although the variability of these estimates was larger compared to the standard approach at low
depths, which reflects the additional variation introduced into the data by uncertainty over whether
both alleles or only one allele were seen. This additional variation will affect downstream analyses
such that there will be less power to detect causal variants in association studies, more variable
estimates of effective population size and less precision in assessing genome quality. However, this
can be counteracted by sampling more individuals, since it can be more efficient to sample more
individuals at low depth than fewer at high depth as suggested by our simulation results and by
Maruki and Lynch (2014). The simulations also show that GUS-LD was able to reduce bias in LD
estimates caused by sequencing errors, especially at high depths when the true LD was moderate to
large.

Figure 4 Heatmaps of the absolute value of the pairwise estimates for D′ between all 38 SNPs in
the deer dataset using (A) the standard likelihood approach which does not account for under-called
heterozygous genotypes or sequencing errors and (B) GUS-LD.

Figure 5 Heatmaps of the pairwise estimates for $r^2$ between all 38 SNPs in the deer dataset using
(A) the standard likelihood approach which does not account for under-called heterozygous
genotypes or sequencing errors and (B) GUS-LD.

The sequencing error parameter, ε, in GUS-LD is specified in terms of a miscalled base for a given
read, which differs from the traditional specification that is in terms of a miscalled allele in a genotype
call. As a consequence, GUS-LD estimates the sequencing error rate from information provided
by the allele counts for the reference and alternate alleles. In addition, a smaller sequencing error
rate under the alternative specification can affect more genotype calls than under the traditional
specification for the same value of ε, especially if there are many reads associated with each genotype
call. This means that the estimate of ε from GUS-LD is likely to differ from sequencing error rates
generally quoted in the literature. For the deer data set, the mean sequencing error rate for a given
read was estimated at around 0.14%, which is of similar magnitude to the rate estimated by Bilton
et al. (2018) in a linkage context for genotyping-by-sequencing data. Simulation results suggest that
GUS-LD accurately estimates the sequencing error rate at high depths, but the estimates become
biased as the mean depth decreases. This bias is likely due to the inability to distinguish between
sequencing errors and true reads at very low depths. Nevertheless, GUS-LD still provided accurate
LD estimates, even when the sequencing error estimates themselves were biased.

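To see why a per-read error rate can touch more genotype calls than the same rate applied per genotype, consider the chance that at least one of d reads is miscalled. Assuming reads err independently with probability ε (an assumption also made by the model), this is 1 − (1 − ε)^d. The function name `prob_any_read_error` is hypothetical, and 0.0014 is simply the 0.14% figure quoted above.

```python
def prob_any_read_error(eps, depth):
    """Probability that at least one of `depth` reads is miscalled, assuming
    independent errors with per-read probability eps."""
    return 1.0 - (1.0 - eps) ** depth


if __name__ == "__main__":
    eps = 0.0014  # per-read rate of the magnitude reported for the deer data
    for depth in (1, 5, 10, 50):
        print(depth, prob_any_read_error(eps, depth))
```

The probability grows roughly in proportion to depth while ε × depth is small, so deeply sequenced genotypes are the ones most often affected by at least one erroneous read.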
With low coverage sequencing data, there are issues with estimating LD when the true parameter
value lies near or on the upper or lower bound of its parameter space (Equation (2)). Specifically,
the bias in the LD estimates increases as D approaches its upper or lower bound. This is even the
case for GUS-LD, which adjusts for genotyping errors associated with low read depths, although
the bias is significantly less than for the standard likelihood approach. This bias is caused by sampling
variation resulting in the maximum of the likelihood (Equation (6)) lying outside the parameter space of D,
whereas maximization is performed with respect to constraint (2). When genotype calls are accurate
and without error, this bias, in estimating D when its true value is near its upper or lower bound, is
absent.

There are many potential applications of using pairwise LD estimates from GUS-LD. For example,
they could be used for quantifying the extent of LD decay in populations relative to physical distance
from an assembly or genetic distance computed from a linkage analysis. This should prove a popular
application of GUS-LD since there are numerous studies already using sequencing data for this
purpose in a number of species (e.g., Huang et al. 2014; Nimmakayala et al. 2014; Fè et al. 2015;
Gur et al. 2017; Sieber et al. 2017), including one by Faville et al. (2018) which utilized GUS-LD. LD
estimates from GUS-LD can also be used in conjunction with the method of Sved (1971) to estimate
historic effective population size or the method of Waples (2006) to estimate contemporary effective
population size. Another application is assessing the quality of an assembly (e.g., Pernaci et al. 2014)
or ordering scaffolds, such as in the Locus Ordering by Dis-Equilibrium procedure (Khatkar et al.
2010). This application of LD is perhaps less well known but is particularly useful for sequencing
data, since assemblies are often fragmented or non-existent, and has already been used in a study by
Tennessen et al. (2017). One powerful application is combining LD estimates from GUS-LD with the
software package LDna (Kemppainen et al. 2015) to explore genome-wide LD and investigate the
evolutionary forces acting on a population. The advantage of combining these two approaches is
that no reference genome is required, meaning that it is applicable to any species and so will prove
particularly valuable for non-model species.

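As a hedged illustration of the effective population size application, the sketch below inverts the commonly quoted approximation attributed to Sved (1971), E[r²] ≈ 1 / (1 + 4·Ne·c), where c is the recombination fraction between the two loci. It ignores the sample-size corrections that practical estimators add, and the function name `sved_effective_size` is hypothetical.

```python
def sved_effective_size(r2, c):
    """Effective population size implied by E[r^2] = 1 / (1 + 4 * Ne * c).

    r2 is an (average) r^2 estimate for loci separated by recombination
    fraction c. No correction for finite sample size is applied.
    """
    if not 0.0 < r2 <= 1.0:
        raise ValueError("r2 must lie in (0, 1]")
    if c <= 0.0:
        raise ValueError("c must be positive")
    return (1.0 / r2 - 1.0) / (4.0 * c)


if __name__ == "__main__":
    # Made-up inputs: mean r^2 of 0.2 between loci 1 cM apart (c = 0.01).
    print(sved_effective_size(0.2, 0.01))
```

Because the estimate depends on 1 / r², downward bias in r² of the kind reported for the uncorrected approach translates into an inflated estimate of effective population size.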
For the methodology developed in this paper, a number of assumptions have been made. Firstly,
genotype calls observed in the sequencing data are assumed to be conditionally independent between
loci given the true genotype call. This assumption is reasonable provided that loci are not located
on the same sequencing read across individuals. Estimation of LD is unaffected by the presence of
genotyping errors resulting from low read depth when the loci are located on the same read, as the
true underlying haplotypes in the individuals are preserved. Depending on their settings, many
variant callers allow for multiple SNPs to be called on the same sequencing read. However, it is more
practical to only retain a single SNP from a given read as the loss of information is minimal and is
outweighed by the reduced computational time. Other assumptions include that missing genotypes
resulting from read depths of zero occur randomly, and that the alleles of the true genotype are
sampled randomly in the sequencing process. If the latter assumption does not hold, one allele will
be sampled more frequently than the other (e.g., preferential sampling). In this case, the proportion
of heterozygotes seen as homozygotes will be larger than expected under the model, which would
result in some bias in the LD estimates at low sequencing depth. If additional information is available,
then the probabilities in Equation (5) can be adjusted to reflect alternative sampling models. Lastly, it
is assumed that sequencing errors occur independently between reads. In reality, this assumption
may not hold, although it has been found to be reasonable in some scenarios (Bilton et al. 2018).

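The effect of preferential sampling can be quantified with a simple extension of the equal-sampling case. If the reference allele of a heterozygote is read with probability q instead of 1/2 (a hypothetical parameter introduced for this sketch, with sequencing errors ignored), all d reads show the same allele with probability q^d + (1 − q)^d, which is smallest at q = 1/2.

```python
def het_seen_as_hom(depth, q=0.5):
    """Probability that a heterozygote with `depth` >= 1 reads appears
    homozygous when the reference allele is read with probability q
    (no sequencing error)."""
    if depth < 1:
        raise ValueError("depth must be at least 1")
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be a probability")
    return q ** depth + (1.0 - q) ** depth


if __name__ == "__main__":
    for q in (0.5, 0.6, 0.7):
        print(q, [het_seen_as_hom(depth, q) for depth in (2, 5)])
```

Any departure of q from 1/2 raises the under-calling rate above what the equal-sampling model expects, which is the direction of bias described in the paragraph above.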
The main contributions of this paper are twofold. Firstly, we have demonstrated that there can be
significant bias in LD estimates from sequencing data when the read depth is low and the associated
errors are not taken into account. This highlights the need for practitioners to either remove these
errors by filtering or adjust their methodology to account for these errors. This is particularly
important as some LD analyses give no explicit mention of a minimum cut-off with respect to read
depth being used. Secondly, we have proposed GUS-LD as a new method to estimate LD using
low coverage sequencing data. GUS-LD will prove valuable to researchers seeking to undertake
population studies when cost constraints prohibit the production of high coverage sequencing data
or other types of genetic data. In fact, our simulation results suggest that it is more cost-efficient to
use low coverage data, as it allows more individuals to be sequenced for the same cost and results in
smaller mean square errors for the LD estimates. From our results, the optimal sequencing depth
was between 2 and 5, which was similar to the optimal read depth observed by Dodds et al. (2015) in
the context of relatedness estimation. GUS-LD also allows LD estimation using loci with a mixture of
high and low mean read depths, which is particularly useful as the sequencing depth typically varies
substantially between SNPs.

Acknowledgements

This work was funded by FarmIQ (Ministry for Primary Industries’ Primary Growth Partnership
fund) – FIQ Systems – Plate to Pasture (PGP06-09020) and the Ministry of Business, Innovation
and Employment (New Zealand), Contract C10X1306, “Genomics for Production & Security in a
Biological Economy” to AgResearch Ltd. We thank Landcorp Farming Limited for use of their data,
and two anonymous referees for their helpful comments.

Literature Cited

Akey, J. M., K. Zhang, M. Xiong, P. Doris, and L. Jin, 2001 The effect that genotyping errors have on
the robustness of common linkage-disequilibrium measures. Am. J. Hum. Genet. 68: 1447–1456.
Andrews, K. R., J. M. Good, M. R. Miller, G. Luikart, and P. A. Hohenlohe, 2016 Harnessing the power
of RADseq for ecological and evolutionary genomics. Nat. Rev. Genet. 17: 81–92.
Ardlie, K. G., L. Kruglyak, and M. Seielstad, 2002 Patterns of linkage disequilibrium in the human
genome. Nat. Rev. Genet. 3: 299–309.
Baird, N. A., P. D. Etter, T. S. Atwood, M. C. Currey, A. L. Shiver, et al., 2008 Rapid SNP discovery and
genetic mapping using sequenced RAD markers. PLoS One 3: e3376.
Bilton, T. P., M. R. Schofield, M. A. Black, D. Chagné, P. L. Wilcox, et al., 2018 Accounting for errors in
low coverage high-throughput sequencing data when constructing genetic maps using biparental