MPG NGS workshop I: SNP calling
Mark DePristo
Manager, Medical and Population Genetic Analysis
Genome Sequencing and Analysis Group
Medical and Population Genetics Program
Broad Institute of Harvard and MIT
02/04/10
Three-slide background on SNP calling in the GATK

SNP calling workflow
Data input and output (file size*)
• Call-ready BAM files (cleaned, dedupped, recalibrated, with well-formatted header): 200 Gb
• Raw variants (VCF) (all sites confidently containing non-reference bases; with genotypes): 1 Gb
• Filtered variants (VCF) (separate true segregating variation from machine artifacts): 1 Gb

Processing tools
• GATK unified genotyper
• GATK variant analysis
• GATK variant filtration
• Expert user judgement

Ease of use: Very easy. Tools are easy to use, but parameter selection requires significant expertise and judgement.
Runtime*: 10 hrs, Instant, 30 min, Days**

* Runtime and file sizes are for a single-sample 30x whole-genome BAM
** Potentially requires many rounds of experimentation and evaluation
GATK single-sample genotype likelihoods

$$L(G \mid D) = P(G)\,P(D \mid G) = \prod_{b \in \{\text{good bases}\}} P(b \mid G)$$

Here $P(G)$ is the prior for the genotype, $L(G \mid D)$ is the likelihood for the genotype, and $P(D \mid G)$ is the likelihood of the data given the genotype. The first equality is the Bayesian model; the product over bases is the independent base model.

• Priors are applied during the multi-sample calculation; P(G) = 1
• Likelihood of the data is computed using the pileup of bases and associated quality scores at a given locus
• Only "good bases" are included: those satisfying minimum base quality, read mapping quality, pair mapping quality, and NQS
• P(b | G) uses platform-specific confusion matrices
• L(G | D) is computed for all 10 genotypes

See http://www.broadinstitute.org/gsa/wiki/index.php/Unified_genotyper for more information
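The independent base model above can be sketched in a few lines of Python. This is a minimal illustration, not the GATK implementation: it assumes a uniform miscall model (an error is equally likely to produce any of the three other bases) in place of the platform-specific confusion matrices, and it applies only a base-quality cutoff. The function names and the `min_base_q` value are hypothetical.

```python
import math
from itertools import combinations_with_replacement

BASES = "ACGT"
# The 10 unordered diploid genotypes: AA, AC, AG, AT, CC, CG, CT, GG, GT, TT
GENOTYPES = ["".join(g) for g in combinations_with_replacement(BASES, 2)]


def phred_to_error(q):
    """Convert a Phred-scaled quality into an error probability."""
    return 10.0 ** (-q / 10.0)


def base_given_allele(base, allele, err):
    # Assumption: uniform miscall model, standing in for the
    # platform-specific confusion matrices mentioned on the slide.
    return 1.0 - err if base == allele else err / 3.0


def genotype_log10_likelihoods(pileup, min_base_q=10):
    """pileup is a list of (base, phred_quality) pairs at one locus.

    Returns log10 of prod_b P(b | G) for each of the 10 genotypes,
    with P(G) = 1 as in the single-sample calculation.
    """
    log_l = {g: 0.0 for g in GENOTYPES}
    for base, q in pileup:
        if q < min_base_q:  # hypothetical "good base" cutoff
            continue
        err = phred_to_error(q)
        for g in GENOTYPES:
            # Each read base is drawn from one of the two alleles with equal chance
            p = 0.5 * base_given_allele(base, g[0], err) \
                + 0.5 * base_given_allele(base, g[1], err)
            log_l[g] += math.log10(p)
    return log_l


# Toy pileup (made-up data): two A reads, two G reads, one low-quality A
pileup = [("A", 30), ("G", 30), ("A", 25), ("G", 35), ("A", 2)]
likelihoods = genotype_log10_likelihoods(pileup)
best = max(likelihoods, key=likelihoods.get)
```

With this toy pileup the heterozygous genotype AG has the highest likelihood, since each homozygous genotype must explain two high-quality bases as sequencing errors.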
We apply a generalization of the single-sample SNP caller to Pilot 1
• This approach allows us to combine weak single-sample calls to discover variation among samples with high confidence
[Diagram: sample-associated reads for Individual 1, Individual 2, ..., Individual N → genotype likelihoods → joint estimate across samples (genotype frequencies, allele frequency) → SNPs]
See http://www.broadinstitute.org/gsa/wiki/index.php/Unified_genotyper for more information
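To make the idea of a joint estimate across samples concrete, the sketch below estimates an alternate-allele frequency from per-sample genotype likelihoods. It uses a textbook expectation-maximization update under Hardy-Weinberg proportions for a biallelic site. That choice is an assumption made for illustration; the slides do not specify the estimator the GATK uses, and the function name and inputs are hypothetical.

```python
def estimate_alt_allele_frequency(sample_likelihoods, iterations=50, start=0.5):
    """EM estimate of the alt-allele frequency at one biallelic site.

    sample_likelihoods holds one (L_hom_ref, L_het, L_hom_alt) tuple per
    sample, on a linear (not log) scale.
    Assumption: genotypes follow Hardy-Weinberg proportions.
    """
    freq = start
    for _ in range(iterations):
        expected_alt = 0.0
        for l_rr, l_ra, l_aa in sample_likelihoods:
            # Posterior weight of each genotype given the current frequency
            w_rr = (1.0 - freq) ** 2 * l_rr
            w_ra = 2.0 * freq * (1.0 - freq) * l_ra
            w_aa = freq ** 2 * l_aa
            total = w_rr + w_ra + w_aa
            # Expected number of alt alleles carried by this sample
            expected_alt += (w_ra + 2.0 * w_aa) / total
        freq = expected_alt / (2.0 * len(sample_likelihoods))
    return freq


# Toy input (made-up likelihoods): three hom-ref samples and one het sample
samples = [(1.0, 1e-6, 1e-12)] * 3 + [(1e-6, 1.0, 1e-6)]
freq = estimate_alt_allele_frequency(samples)
```

For this toy input the estimate settles near 0.125, that is, one alternate allele among eight chromosomes.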
Making raw variant calls with the GATK unified genotyper

Running the UnifiedGenotyper
See http://www.broadinstitute.org/gsa/wiki/index.php/Unified_genotyper for more information
java -Xmx2048m -jar GenomeAnalysisTK.jar \
  -R /broad/1KG/reference/human_b36_both.fasta \
  -T UnifiedGenotyper \
  -D dbsnp_129_b36.rod \
  -varout NA19240.raw.vcf \
  -confidence 50 \
  --heterozygosity 1.000000e-03 \
  -I NA19240.SLX.bam
-confidence 50: minimum Phred-scaled confidence required to emit a SNP
--heterozygosity 1.000000e-03: 1 het per 1000 reference bases on average for a Yoruban
-I NA19240.SLX.bam: BAM file containing NA19240 SLX reads
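As a reminder of what a Phred-scaled threshold means, the sketch below converts between Phred scores and error probabilities using the standard definition Q = -10 log10(p). The helper names are hypothetical, and reading the confidence value as the probability that the call is wrong is an assumption about how the score is interpreted.

```python
import math


def phred_to_error_probability(q):
    """Standard Phred definition: p = 10^(-Q/10)."""
    return 10.0 ** (-q / 10.0)


def error_probability_to_phred(p):
    """Inverse of the above: Q = -10 * log10(p)."""
    return -10.0 * math.log10(p)


# A confidence threshold of 50 corresponds to an error probability of 1 in 100,000
threshold_error = phred_to_error_probability(50)

# QUAL values of the three raw calls shown in NA19240.raw.vcf
quals = [53.13, 331.37, 399.86]
emitted = [q for q in quals if q >= 50]
```

All three example calls clear the threshold of 50, which is why they appear in the raw VCF.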
Raw VCF calls (NA19240.raw.vcf)
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA19240
1 36496 . T A 53.13 0 <ATTRIBUTES> GT:DP:GQ 1/0:6:84.70
1 45162 rs10399749 C T 331.37 0 <ATTRIBUTES> GT:DP:GQ 0/1:27:99.00
1 48677 . G A 399.86 0 <ATTRIBUTES> GT:DP:GQ 1/0:25:99.00
<ATTRIBUTES>: long string of variant annotations (more info in a few slides)
SNP calling artifacts
• SNP calls are generally infested with false positives
  – From systematic machine artifacts, mismapped reads, aligned indels/CNV
  – Raw SNP calls might have between 5-20% FPs among novel calls
• Separating true variation from artifacts depends very much on the particulars of one's data and project goals
  – Whole-genome deep data, WG low-pass, hybrid capture, and pooled PCR have significantly different error modes
Filtering artifacts out of your SNP calls
• The GATK uses a three-pass approach
  – First, emit all sites potentially containing a true variant
  – Aggregate SNP covariates in the raw VCF to determine the relationship between each covariate and error [warning: requires user expertise]
  – Finally, apply these filters to the raw VCF using the GATK VariantFiltration tool
• We are currently working on a robust, easy-to-use automated tool
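The second pass, relating a covariate to error, can be illustrated by binning SNPs on one annotation and computing the transition/transversion (Ti/Tv) ratio per bin. The sketch below is a simplified stand-in for the GATK variant analysis step, not its actual code; the record layout, function names, and toy data are hypothetical.

```python
from collections import defaultdict

# Transitions are purine<->purine (A<->G) or pyrimidine<->pyrimidine (C<->T)
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def is_transition(ref, alt):
    return (ref, alt) in TRANSITIONS


def titv_by_bin(snps, annotation, bin_width):
    """Group SNPs by integer bin index of one annotation; return Ti/Tv per bin.

    snps is an iterable of dicts with 'ref', 'alt', and annotation values.
    SNPs lacking the annotation are skipped.
    """
    counts = defaultdict(lambda: [0, 0])  # bin index -> [transitions, transversions]
    for snp in snps:
        value = snp.get(annotation)
        if value is None:
            continue
        bin_index = int(value // bin_width)
        slot = 0 if is_transition(snp["ref"], snp["alt"]) else 1
        counts[bin_index][slot] += 1
    result = {}
    for bin_index in sorted(counts):
        ti, tv = counts[bin_index]
        result[bin_index] = ti / tv if tv else float("inf")
    return result


# Toy data (made up): low-depth sites look healthy, very deep sites do not
snps = [
    {"ref": "A", "alt": "G", "DP": 30},
    {"ref": "C", "alt": "T", "DP": 45},
    {"ref": "A", "alt": "C", "DP": 60},
    {"ref": "G", "alt": "T", "DP": 350},
    {"ref": "G", "alt": "A", "DP": 320},
    {"ref": "C", "alt": "A", "DP": 390},
]
ratios = titv_by_bin(snps, "DP", bin_width=100)
```

A bin whose Ti/Tv ratio falls well below the genome-wide expectation is a candidate range for a filter on that covariate.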
Variant annotations and filters
22 49582364 . A G 198.96 0 AB=0.67;AC=3;AF=0.50;AN=6;DP=87;Dels=0.00;HRun=1;MQ=71.31;MQ0=22;QD=2.29;SB=-31.76 GT:DP:GQ 0/1:12:99.00 0/1:11:89.43 0/1:28:37.78
VCF record for an A/G SNP at 22:49582364
Heterozygous genotype A/G in all three individuals
See http://www.broadinstitute.org/gsa/wiki/index.php/VariantAnnotator for more information
INFO field  Description
AB          Allele balance of ref/alt in hets
AC          No. chromosomes carrying alt allele
AF          Allele frequency
AN          Total no. of chromosomes
DP          Depth of coverage
HRun        Length of longest contiguous homopolymer
MQ          RMS MAPQ of all reads
MQ0         No. of MAPQ0 reads at locus
QD          QUAL score over depth
SB          Estimated SB score
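The annotations in the example record at 22:49582364 can be pulled apart with a few lines of Python. This is a minimal sketch for the simple `key=value` INFO layout shown on this slide, not a general VCF parser; the function names are hypothetical, and the record is joined with tabs here on the assumption that the file is tab-delimited.

```python
# The example record from the slide, with columns joined by tabs
record = "\t".join([
    "22", "49582364", ".", "A", "G", "198.96", "0",
    "AB=0.67;AC=3;AF=0.50;AN=6;DP=87;Dels=0.00;HRun=1;MQ=71.31;MQ0=22;QD=2.29;SB=-31.76",
    "GT:DP:GQ", "0/1:12:99.00", "0/1:11:89.43", "0/1:28:37.78",
])


def parse_info(info):
    """Turn 'AB=0.67;AC=3;...' into a dict of floats (flags become True)."""
    parsed = {}
    for item in info.split(";"):
        key, _, value = item.partition("=")
        parsed[key] = float(value) if value else True
    return parsed


def parse_samples(fmt, sample_columns):
    """Pair the FORMAT keys (GT:DP:GQ) with each sample's values."""
    keys = fmt.split(":")
    return [dict(zip(keys, column.split(":"))) for column in sample_columns]


fields = record.split("\t")
info = parse_info(fields[7])
samples = parse_samples(fields[8], fields[9:])
genotypes = [s["GT"] for s in samples]
```

In this record QD agrees with its definition as QUAL over depth: 198.96 / 87 rounds to 2.29.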
Selecting filtering thresholds
[Figure: four panels, one per annotation (AB, DP, MQ0, SB), plotting transition/transversion ratio (titv) and dbSNP/100 against covariate bin value]
Selected filters are: AB > 0.75 || DP > 300 || MQ0 > 40 || SB > -0.10 || 3 SNPs within 10 bp
Note: left-most values are SNPs without the displayed annotation
See http://www.broadinstitute.org/gsa/wiki/index.php/VariantFiltrationWalker for more information
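The annotation part of the selected filters can be written as a plain Python predicate to see which records it would flag. This is only an illustration of the boolean logic, not how VariantFiltration evaluates expressions. Treating a missing annotation as not triggering its clause is an assumption, and the function name is hypothetical.

```python
def matches_selected_filters(info):
    """Return True if a SNP's annotations match AB > 0.75 || DP > 300 || MQ0 > 40 || SB > -0.10.

    info maps annotation names to numbers.
    Assumption: an annotation that is absent never triggers its clause.
    """
    return (
        info.get("AB", float("-inf")) > 0.75
        or info.get("DP", float("-inf")) > 300
        or info.get("MQ0", float("-inf")) > 40
        or info.get("SB", float("-inf")) > -0.10
    )


# Annotations from the example record at 22:49582364
example = {"AB": 0.67, "DP": 87, "MQ0": 22, "SB": -31.76}
example_filtered = matches_selected_filters(example)

# Hypothetical variations of that record, each tripping one clause
too_deep = matches_selected_filters(dict(example, DP=350))
strand_biased = matches_selected_filters(dict(example, SB=-0.05))
```

The example record passes all four thresholds, so it would be kept unless the clustering rule applies.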
Running VariantFiltration
java -Xmx2048m -jar GenomeAnalysisTK.jar \
  -R /broad/1KG/reference/human_b36_both.fasta \
  -T VariantFiltration \
  -B variant,VCF,NA19240.raw.vcf \
  -D dbsnp_129_b36.rod \
  --clusterWindowSize 10 \
  --filterExpression "AB > 0.75 || DP > 300 || MQ0 > 40 || SB > -0.10" \
  -l INFO \
  -o NA19240.filtered.vcf
--filterExpression: expression describing SNPs that should be filtered out
--clusterWindowSize 10: filters out any group of 3 SNPs within 10 bp of each other
Filtered VCF calls (NA19240.filtered.vcf)
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA19240
1 36496 . T A 53.13 GATK_FILTER <ATTRIBUTES> GT:DP:GQ 1/0:6:84.70
1 45162 rs10399749 C T 331.37 0 <ATTRIBUTES> GT:DP:GQ 0/1:27:99.00
1 48677 . G A 399.86 0 <ATTRIBUTES> GT:DP:GQ 1/0:25:99.00
SNPs with poor characteristics have their FILTER field filled in
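The clustering rule (any group of 3 SNPs within 10 bp of each other) can be sketched as a sliding window over sorted positions on one chromosome. The exact window definition used by VariantFiltration is not given on the slides, so the condition below (last position minus first position of three consecutive SNPs is at most the window size) is an assumption, and the function name is hypothetical.

```python
def clustered_snps(positions, cluster_size=3, window=10):
    """Return the set of positions that fall in a cluster of nearby SNPs.

    positions are SNP coordinates on a single chromosome.
    Assumption: a cluster is cluster_size consecutive SNPs whose first and
    last positions differ by at most window bases.
    """
    ordered = sorted(positions)
    flagged = set()
    for i in range(len(ordered) - cluster_size + 1):
        group = ordered[i:i + cluster_size]
        if group[-1] - group[0] <= window:
            flagged.update(group)
    return flagged


# Toy positions (made up): one tight cluster, one pair, one isolated SNP
toy_flagged = clustered_snps([100, 104, 109, 500, 505, 900])

# Positions of the three example calls on chromosome 1
example_flagged = clustered_snps([36496, 45162, 48677])
```

Only the three SNPs at 100, 104, and 109 are flagged in the toy input; the three example calls are thousands of bases apart and are untouched by this rule.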
Raw and filtered autosomal calls for YRI daughter and trio

Call set | Callable bases [1] | # variants | dbSNP % | Known Ti/Tv (est. FP rate [2]) | Novel Ti/Tv (est. FP rate [2]) | Hapmap3 sensitivity [3] | Hapmap3 concordance [3]

Single individual calls from the GATK
Raw NA19240 | 2.70B (89%) | 4.52M | 77.83 | 2.07 (1.9%) | 1.81 (18.1%) | 99.41 | 99.85
Filtered NA19240 | | 4.26M | 80.42 | 2.10 (~0.0%) | 2.01 (5.6%) | 99.14 | 99.85

Daughter + parents multi-sample calls from the GATK
Raw YRI trio together | 2.5B (81%) | 6.24M | 71.65 | 2.07 (1.9%) | 1.80 (18.8%) | 99.62 | 99.85
Filtered YRI trio together | | 5.60M | 74.86 | 2.11 (~0.0%) | 2.02 (5.0%) | 99.29 | 99.85

[1] % of all 3.1B bases of the B36 human genome called with at least Q50 confidence
[2] Calculated as 1 - (titv_Observed - 0.5) / (titv_Expected - 0.5) with titv_Expected of 2.1
[3] NA19240 sensitivity and concordance results
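The false-positive estimate in footnote 2 is easy to reproduce. The sketch below implements that formula directly. Clamping negative values to zero is an assumption made here so that an observed Ti/Tv above the expected 2.1 reports no estimated false positives; the function name is hypothetical.

```python
def estimated_fp_rate(titv_observed, titv_expected=2.1):
    """Estimated false-positive rate: 1 - (observed - 0.5) / (expected - 0.5).

    The 0.5 is the Ti/Tv ratio the formula assigns to purely random errors.
    Assumption: negative estimates are clamped to 0.
    """
    rate = 1.0 - (titv_observed - 0.5) / (titv_expected - 0.5)
    return max(0.0, rate)


raw_novel = estimated_fp_rate(1.81)       # raw NA19240, novel sites
filtered_novel = estimated_fp_rate(2.01)  # filtered NA19240, novel sites
raw_known = estimated_fp_rate(2.07)       # raw NA19240, known sites
```

These evaluate to roughly 0.181, 0.056, and 0.019, in line with the 18.1%, 5.6%, and 1.9% reported for NA19240.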
Example novel variant
Chr1:67634785 in 3' untranslated region
Example scripts
• 1000 Genomes SLX YRI BAM files:
  – Locally available at: /humgen/gsa-hpprojects/1kg/1kg_pilot2/useTheseBamsForAnalyses/<sample>.SLX.bam
  – Available for download at 1000genomes.org
• Scripts and VCF files:
  – /humgen/gsa-scr1/pub/tutorials/MPG_workshop
Appendix
Choosing a minimum confidence score for a SNP
[Figure: three panels plotted against SNP confidence score (0 to 500), where each point summarizes the SNPs with confidence score within an interval: dbSNP rate (% SNPs in dbSNP 129), Ti/Tv rate (Ti/Tv ratio), and true positives on the 1KG custom Illumina chip (cumulative true positive SNPs). The default threshold is marked.]
• Each point on the plot includes ~3000 SNPs from NA19240
• The density of points across the confidence interval indicates the number of SNPs
• ~0.5% of SNPs have Q < 100, and only 2% have Q < 200
• The default Q50 threshold results in a highly sensitive call set