Best practices for Variant Calling with Pacific Biosciences data
Mauricio Carneiro, Ph.D. and Mark DePristo, Ph.D.
Genome Sequence and Analysis, Medical and Population Genetics
Wednesday, February 15, 12

The Current Pipeline
General best practice data processing and variant calling using the GATK

Our framework for variation discovery

Phase 1: NGS data processing (typically by lane)
  Input: raw reads
  Steps: mapping, local realignment, duplicate marking, base quality recalibration
  Output: analysis-ready reads

Phase 2: Variant discovery and genotyping (typically multiple samples simultaneously, but can be a single sample alone)
  Input: Sample 1 reads ... Sample N reads
  Steps: calling of SNPs, indels and structural variation (SV)
  Output: raw variants (raw SNPs, raw indels, raw SVs)

Phase 3: Integrative analysis
  Input: raw variants and external data (pedigrees, known variation, known genotypes, population structure)
  Steps: variant quality recalibration, genotype refinement
  Output: analysis-ready variants

DePristo, M., Banks, E., Poplin, R. et al. (2011) A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet.

Finding the true origin of each read is a computationally demanding first step
Mapping and alignment algorithms (Phase 1: NGS data processing)

[Figure: an enormous pile of short reads from NGS is mapped to Region 1, Region 2 and Region 3 of the reference genome.]
• Detects the correct origin of reads and flags them with high certainty
• Detects ambiguity in the origin of reads and flags them as uncertain

For more information see: Li and Homer (2010). A survey of sequence alignment algorithms for next-generation sequencing. Briefings in Bioinformatics.

Accurate read alignment through multiple sequence local realignment
(Phase 1: NGS data processing)

[Figure: Effect of MSA on alignments, NA12878, chr1:1,510,530-1,510,589. Panels: 1,000 Genomes Pilot 2 data, raw MAQ alignments; 1,000 Genomes Pilot 2 data, after MSA; HiSeq data, raw BWA alignments; HiSeq data, after MSA. Labeled sites: rs28782535, rs28783181, rs28788974, rs34877486.]

DePristo, M., Banks, E., Poplin, R. et al. (2011) A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet.

Accurate error modeling with base quality score recalibration
(Phase 1: NGS data processing)

[Figure: original vs. recalibrated base qualities for Illumina/GenomeAnalyzer, Roche/454, Life/SOLiD and Illumina/HiSeq 2000. Panel rows: empirical quality vs. reported quality (0 to 40); accuracy (empirical - reported quality, -10 to 10) vs. machine cycle, with first-of-pair and second-of-pair reads marked where applicable; accuracy (empirical - reported quality) vs. dinucleotide (AA ... TG).]

RMSE shown in the panels (original -> recalibrated), in extraction order:
  Empirical vs. reported quality:  5.242 -> 0.196;  2.556 -> 0.213;  1.215 -> 0.756;  5.634 -> 0.135
  Accuracy vs. machine cycle:      2.207 -> 0.186;  1.784 -> 0.136;  1.688 -> 0.213;  2.609 -> 0.089
  Accuracy vs. dinucleotide:       2.598 -> 0.052;  2.169 -> 0.135;  1.656 -> 0.088;  2.469 -> 0.083

Ryan Poplin
DePristo, M., Banks, E., Poplin, R. et al. (2011) A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet.

SNP and Indel calling is a large-scale Bayesian modeling problem
Bayesian model

4 SNP calling
4.1 Simple genotype likelihoods for presentations
\[ \Pr\{G \mid D\} = \frac{\Pr\{G\}\,\Pr\{D \mid G\}}{\sum_i \Pr\{G_i\}\,\Pr\{D \mid G_i\}} \qquad \text{[Bayes' rule]} \]
\[ \Pr\{D \mid G\} = \prod_j \left( \frac{\Pr\{D_j \mid H_1\}}{2} + \frac{\Pr\{D_j \mid H_2\}}{2} \right) \qquad \text{where } G = H_1 H_2 \]
Pr{D|H} is the haploid likelihood function.

4.1.1 SNP haploid likelihood
\[ \Pr\{D_j \mid H\} = \Pr\{D_j \mid b\} \qquad \text{[single base pileup]} \]
\[ \Pr\{D_j \mid b\} = \begin{cases} 1 - \epsilon_j & D_j = b, \\ \epsilon_j & \text{otherwise.} \end{cases} \]

4.1.2 Indel haploid likelihood
\[ \Pr\{D_j \mid H\} = \sum_{\text{alignments } \pi \text{ of } D_j \text{ to } H} \Pr\{D_j, \pi\} \]

Slide annotations: Pr{G} is the prior of the genotype, Pr{D|G} is the likelihood of the genotype, and the two 1/2 weights are the diploid assumption.

• Inference: what is the genotype G of each sample given read data D for each sample?
• Calculate via Bayes' rule the probability of each possible G
• Product expansion assumes reads are independent
• Relies on a likelihood function to estimate probability of sample data given proposed haplotype

See http://www.broadinstitute.org/gsa/wiki/index.php/Unified_genotyper for more information

SNP genotype likelihoods
• All diploid genotypes (AA, AC, ..., GT, TT) considered at each base
• Likelihood of genotype computed using only pileup of bases and associated quality scores at given locus
• Only "good bases" are included: those satisfying minimum base quality, mapping read quality, pair mapping quality, NQS

4.2 Genotype likelihoods
\[ \Pr\{D_i \mid GT_i\} = \prod_j \Pr\{D_{i,j} \mid GT_i\} \]
\[ \Pr\{D_{i,j} \mid GT_i = AB\} = \left( \Pr\{D_{i,j} \mid A\} + \Pr\{D_{i,j} \mid B\} \right) / 2 \]
\[ \Pr\{D_{i,j} \mid B\} = \begin{cases} 1 - \epsilon_{i,j} & D_{i,j} = B, \\ \epsilon_{i,j} \cdot \Pr\{B \text{ is true} \mid D_{i,j} \text{ is miscalled}\} & \text{otherwise.} \end{cases} \]

See http://www.broadinstitute.org/gsa/wiki/index.php/Unified_genotyper for more information

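The diploid SNP model above can be sketched in a few lines of Python. This is an illustration of the formulas only, not the Unified Genotyper's implementation: it uses the simple haploid likelihood (1 - eps on a match, eps otherwise), a flat genotype prior, and a made-up pileup. The function names, the prior and the pileup are all assumptions for the example.

```python
import math
from itertools import combinations_with_replacement

BASES = "ACGT"


def phred_to_error(q):
    """Convert a Phred-scaled base quality into an error probability."""
    return 10.0 ** (-q / 10.0)


def base_likelihood(observed, allele, q):
    """Pr{D_j | b}: 1 - eps if the observed base matches the allele, eps otherwise."""
    eps = phred_to_error(q)
    return 1.0 - eps if observed == allele else eps


def genotype_log10_likelihoods(pileup):
    """pileup is a list of (base, quality) pairs; returns {genotype: log10 Pr{D | G}}."""
    result = {}
    for h1, h2 in combinations_with_replacement(BASES, 2):  # AA, AC, ..., GT, TT
        total = 0.0
        for base, q in pileup:
            # Diploid assumption: each read comes from either haplotype with weight 1/2.
            p = 0.5 * base_likelihood(base, h1, q) + 0.5 * base_likelihood(base, h2, q)
            total += math.log10(max(p, 1e-300))  # floor keeps log10 defined for Q0 bases
        result[h1 + h2] = total
    return result


def genotype_posteriors(pileup, priors=None):
    """Bayes' rule over the ten diploid genotypes (flat prior unless one is given)."""
    log_lik = genotype_log10_likelihoods(pileup)
    if priors is None:
        priors = {g: 1.0 / len(log_lik) for g in log_lik}  # illustrative assumption
    best = max(log_lik.values())
    unnormalized = {g: priors[g] * 10.0 ** (log_lik[g] - best) for g in log_lik}
    norm = sum(unnormalized.values())
    return {g: v / norm for g, v in unnormalized.items()}


# Hypothetical pileup: five A and five C bases, all at Q30.
pileup = [("A", 30)] * 5 + [("C", 30)] * 5
posteriors = genotype_posteriors(pileup)
best_genotype = max(posteriors, key=posteriors.get)
print(best_genotype, round(posteriors[best_genotype], 6))
```

Working in log10 space and subtracting the largest log-likelihood before exponentiating avoids underflow at deep coverage, which matters for the ~100x to ~500x datasets discussed later.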
Variant Quality Score Recalibration (VQSR): modeling error properties of real polymorphism to determine the probability that novel sites are real

[Figure: The HapMap3 sites from NA12878 HiSeq calls are used to train the GMM. Shown here is the 2D plot of strand bias vs. the variant quality / depth for those sites.]
[Figure: Variants are scored based on their fit to the Gaussians. The variants (here just the novels) clearly separate into good and bad clusters.]

DePristo, M., Banks, E., Poplin, R. et al. (2011) A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet.

These methods are available in the Genome Analysis Toolkit (GATK)
• Most Broad Institute tools for the 1000 Genomes have been developed in the GATK

Genome Analysis Toolkit (GATK)
• Open-source map/reduce programming framework for developing analysis tools for next-gen sequencing data
• Easy-to-use, CPU and memory efficient, automatically parallelizing Java engine
http://www.broadinstitute.org/gsa/wiki/
McKenna et al. (2010) The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res.

SAM/BAM format
• Technology agnostic, binary, indexed, portable and extensible file format for NGS reads
• Also used in the Broad production pipeline
http://samtools.sourceforge.net/

VCF format
• Standard and accessible format for storing population variation and individual genotypes
http://vcftools.sourceforge.net/

1000 genomes GATK tools
Indel realignment, Base quality score recalibration, Unified Genotyper, VQSR, Variant Eval, many other analysis tools

Pacbio Processing Pipeline
How we apply our pipeline to Pacific Biosciences data: a step-by-step tutorial

Our framework for variation discovery, applied to PacBio data
[Figure: the three-phase framework diagram (Phase 1: NGS data processing; Phase 2: Variant discovery and genotyping; Phase 3: Integrative analysis) from DePristo, M., Banks, E., Poplin, R. et al. (2011), Nat Genet., with two annotations for PacBio data.]
• Not evaluated yet on PacBio data due to the small size of the datasets.
• Currently the GATK cannot perform indel realignment on Pacific Biosciences data, due to the high indel error rate and the long reads.

Pacbio Processing Pipeline
1. We start our processing pipeline with the filtered_subreads.fasta file produced by the PacBio software and turn it into a FASTQ file using the SMRT Pipeline scripts provided by PacBio.
2. Mapping and alignment are done using BWA with its heuristic Smith-Waterman algorithm (BWA-SW).
3. We sort the BAM file and add read group and sample information using Picard Tools: SortSam and AddOrReplaceReadGroups.
4. We recalibrate base qualities using the GATK's Base Quality Score Recalibration framework.

Data flow: FASTA -> FASTQ -> BWA-SW -> SAM -> Picard -> BAM -> Base Quality Score Recalibration -> Analysis Ready BAM

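Steps 2 and 3 can be sketched as command lines assembled in Python. This is a hedged illustration, not the pipeline's actual script: the file names, read group values and the per-tool Picard jar layout (SortSam.jar, AddOrReplaceReadGroups.jar) are assumptions, the reference is assumed to be already indexed with bwa index, and the FASTA to FASTQ conversion is left out because the slide does not name the SMRT Pipeline script.

```python
def bwa_sw_command(reference, fastq):
    """Step 2: BWA-SW alignment; `bwa bwasw` writes SAM to standard output."""
    return ["bwa", "bwasw", reference, fastq]


def sort_sam_command(sam_in, bam_out):
    """Step 3a: coordinate-sort the alignments with Picard SortSam."""
    return [
        "java", "-jar", "SortSam.jar",
        f"INPUT={sam_in}",
        f"OUTPUT={bam_out}",
        "SORT_ORDER=coordinate",
    ]


def add_read_groups_command(bam_in, bam_out, sample):
    """Step 3b: attach read group and sample information."""
    return [
        "java", "-jar", "AddOrReplaceReadGroups.jar",
        f"INPUT={bam_in}",
        f"OUTPUT={bam_out}",
        "RGID=pacbio_rg1",   # hypothetical read group id
        "RGLB=pacbio_lib1",  # hypothetical library name
        "RGPL=PACBIO",       # platform tag, assumed
        "RGPU=unit1",        # hypothetical platform unit
        f"RGSM={sample}",
    ]


# Hypothetical file names, for illustration only.
commands = [
    bwa_sw_command("reference.fasta", "filtered_subreads.fastq"),
    sort_sam_command("aligned.sam", "sorted.bam"),
    add_read_groups_command("sorted.bam", "sorted.rg.bam", "NA12878"),
]
for cmd in commands:
    print(" ".join(cmd))
```

The commands are only printed here. To execute them, the first one needs its standard output redirected to aligned.sam (for example with subprocess.run and the stdout argument).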
Why do we align with BWA and not BLASR?
• BWA is the standard aligner in the Broad's sequencing platform.
• BLASR is still responsible for generating the filtered sub-reads.
• With recent updates, BLASR-generated BAM files are a reasonable alternative for this step of the pipeline:
- The optional pipeline starts with a BLASR-generated BAM (skipping the BWA and Picard steps).
- Read group information and BQSR are still required steps.
- Works well, but generally with a smaller yield.
- We anticipate that further development of BLASR-generated BAMs could improve this alternate pipeline in the future.

Strict BLASR filtering reduces yield and eliminates the longer reads
[Figure: two alignment views of the same data, labeled "total mapped coverage: 74,735,274 bp" and "total mapped coverage: 19,562,290 bp".]
Aggressive BLASR clipping turns longer reads into "short" reads.

Introduction to Base Quality Score Recalibration
Sequencers provide estimates of error rate per nucleotide
... but they aren't very accurate
... and they aren't very informative

[Figure: two sets of panels, shown before and after recalibration: "Reported quality score histogram" (count vs. quality score, 0 to 40) and "Reported vs. empirical quality scores" (empirical quality score vs. reported quality score, both 0 to 40).]

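A base quality is a Phred-scaled error probability, Q = -10 log10(p), and the recalibration plots summarize how far reported qualities sit from empirical ones with an RMSE. The sketch below shows both calculations. The RMSE here is unweighted, which is an assumption (the plotted values may weight each bin by its number of observations), and the example numbers are made up.

```python
import math


def phred_to_error(q):
    """Phred quality -> probability that the base call is wrong."""
    return 10.0 ** (-q / 10.0)


def error_to_phred(p):
    """Error probability -> Phred quality."""
    return -10.0 * math.log10(p)


def rmse(reported, empirical):
    """Unweighted root mean squared difference between two quality lists."""
    if not reported or len(reported) != len(empirical):
        raise ValueError("need two non-empty lists of equal length")
    squared = [(e - r) ** 2 for r, e in zip(reported, empirical)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical quality bins: the sequencer is over-confident at high qualities.
reported_q = [10, 20, 30]
empirical_q = [12, 18, 25]
example_rmse = rmse(reported_q, empirical_q)
print(phred_to_error(20), round(example_rmse, 3))
```

A perfectly calibrated sequencer would give an RMSE of 0, which is why the recalibrated panels report values well below the original ones.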
Recalibration workflow
1. CountCovariates: Original BAM file + dbSNP / known sites (necessary) -> Covariates table (.csv)
2. TableRecalibration: Original BAM file + Covariates table (.csv) -> Recalibrated BAM file
3. CountCovariates: Recalibrated BAM file -> Recalibrated covariates table (.csv)
4. AnalyzeCovariates: Covariates table (.csv) -> Pre-recalibration analysis plots
5. AnalyzeCovariates: Recalibrated covariates table (.csv) -> Post-recalibration analysis plots

Running CountCovariates

java -Xmx4g -jar GenomeAnalysisTK.jar -R reference.fasta -D dbsnp.vcf -I original.bam -T CountCovariates -cov ReadGroupCovariate -cov QualityScoreCovariate -cov DinucCovariate -cov CycleCovariate -recalFile table.recal_data.csv

• -D dbsnp.vcf: a list of known polymorphic sites is necessary so that these sites do not count against the bases' mismatch rate.
• -cov ...: list of covariates to be used in the recalibration calculation.
• -recalFile table.recal_data.csv: CSV file containing covariate counts.

Table recalibration file (table.recal_data.csv):
# Counted Bases 143745620
ReadGroup,QualityScore,Dinuc,Cycle,nObservations,nMismatches,Qempirical
SRR001802,2,AA,-8,165,17,10
SRR001802,2,AA,-2,91,10,10
SRR001802,2,AA,3,5,4,1
SRR001802,2,AA,4,9,4,4
SRR001802,2,AA,7,12,4,5

See http://www.broadinstitute.org/gsa/wiki/index.php/Base_quality_score_recalibration for more information

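The Qempirical column of the covariates table can be reproduced from the two count columns. The sketch below parses the five sample rows and computes round(-10 log10(nMismatches / nObservations)). That formula matches the rows shown on the slide, but it is an assumption that CountCovariates computes the column exactly this way, and the handling of zero mismatches is a choice made only for this example.

```python
import csv
import io
import math

SAMPLE_TABLE = """# Counted Bases 143745620
ReadGroup,QualityScore,Dinuc,Cycle,nObservations,nMismatches,Qempirical
SRR001802,2,AA,-8,165,17,10
SRR001802,2,AA,-2,91,10,10
SRR001802,2,AA,3,5,4,1
SRR001802,2,AA,4,9,4,4
SRR001802,2,AA,7,12,4,5
"""


def empirical_quality(n_observations, n_mismatches):
    """Phred-scaled observed mismatch rate, rounded to the nearest integer."""
    # Assumption for this sketch: treat zero mismatches as one so log10 stays defined.
    rate = max(n_mismatches, 1) / n_observations
    return round(-10.0 * math.log10(rate))


def read_covariate_rows(text):
    """Parse the CSV body, skipping '#' comment lines."""
    lines = [line for line in text.splitlines() if line and not line.startswith("#")]
    return list(csv.DictReader(io.StringIO("\n".join(lines))))


rows = read_covariate_rows(SAMPLE_TABLE)
computed = [
    empirical_quality(int(r["nObservations"]), int(r["nMismatches"])) for r in rows
]
listed = [int(r["Qempirical"]) for r in rows]
print(computed, listed)
```

Every row here has a reported QualityScore of 2 but an empirical quality between 1 and 10, which is the kind of gap the recalibration step corrects.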
Running TableRecalibration

java -Xmx4g -jar GenomeAnalysisTK.jar -R Homo_sapiens_assembly18.fasta -I original.bam -T TableRecalibration -recalFile table.recal_data.csv -outputBam recal.bam

• -recalFile table.recal_data.csv: table recalibration file from the CountCovariates step.
• -outputBam recal.bam: the full recalibrated BAM file, a recalibrated copy of the original BAM file.

See http://www.broadinstitute.org/gsa/wiki/index.php/Base_quality_score_recalibration for more information

Running AnalyzeCovariates

java -Xmx4g -jar AnalyzeCovariates.jar -outputDir /path/to/output_dir/ -resources resources/ -recalFile table.recal_data.csv

• AnalyzeCovariates.jar: a separate .jar file distributed with the GATK.
• -outputDir: the directory in which to place the output analysis plots.
• -resources: points to the GATK installation's directory of R scripts, which are used for plotting the data.
• -recalFile: table recalibration file from either the before or after CountCovariates step.
• Output: many plots of base quality versus each covariate.

See http://www.broadinstitute.org/gsa/wiki/index.php/Base_quality_score_recalibration for more information

The Pacbio Processing Pipeline is available for educational purposes (but not supported)
Queue is part of the GATK and is a pipeline manager used internally at the Broad in most analysis projects (see http://www.broadinstitute.org/gsa/wiki/index.php/Queue)

java -Xmx4g -jar Queue.jar -S PacbioProcessingPipeline.scala -i filtered_subreads.fastq -D dbSNP.vcf -R reference.fasta -run

The -i input can be filtered_subreads.fastq, or blasr.bam with the extra -blasr option.

Calling SNPs and indels using PacBio data with the Unified Genotyper

java -Xmx4g -jar GenomeAnalysisTK.jar -T UnifiedGenotyper -I input.recal.bam -R reference.fasta -D dbsnp.vcf -deletions 0.5 -o myCalls.vcf -mbq 10

• -deletions 0.5: allows sites with 50% deletions to be analyzed.
• -mbq 10: a minimum base quality of 10 calibrates the UG for PacBio data (the average base quality is 20).

The ideal deletions and minimum base quality parameters for this specific dataset were determined systematically by measuring sensitivity/specificity to known variant calls in NA12878.

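The effect of the two thresholds can be illustrated on a toy pileup. This sketch is a reading of the slide annotations, not the Unified Genotyper's code: the PileupElement type, the 'D' marker for a deletion-spanning read, and the choice to skip a site only when the deletion fraction is strictly above the threshold are all assumptions.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PileupElement:
    base: str     # 'A', 'C', 'G', 'T', or 'D' for a read with a deletion at the site
    quality: int  # Phred-scaled base quality (ignored for 'D')


def usable_bases(
    pileup: List[PileupElement],
    min_base_quality: int = 10,           # mirrors -mbq 10
    max_deletion_fraction: float = 0.5,   # mirrors -deletions 0.5
) -> Optional[List[PileupElement]]:
    """Return the bases kept for calling, or None if the site is skipped."""
    if not pileup:
        return None
    n_deletions = sum(1 for p in pileup if p.base == "D")
    if n_deletions / len(pileup) > max_deletion_fraction:
        return None  # too many reads carry a deletion here
    return [p for p in pileup if p.base != "D" and p.quality >= min_base_quality]


# Hypothetical sites.
site_ok = [PileupElement("A", 22), PileupElement("C", 9),
           PileupElement("D", 0), PileupElement("D", 0)]
site_skipped = [PileupElement("A", 22), PileupElement("D", 0),
                PileupElement("D", 0), PileupElement("D", 0)]
print(usable_bases(site_ok), usable_bases(site_skipped))
```

With an indel-dominated error profile, a low deletion threshold would discard many otherwise callable sites, which is the motivation for raising it to 0.5 on PacBio data.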
Analyzing PacBio data
More information is available at the AGBT poster session (presentation Thursday 1:10 - 2:40pm)

A quick look at Pacific Biosciences data
[Figure: alignment view of the discovery dataset. Indels are the primary error mode (all purple markers). Notice the SNP.]
[Figure: bar chart of error rate (0% to 15%) by error type: insertions, deletions, mismatches.]

Long reads and deep coverage on all PacBio datasets

                   discovery   validation   cancer               1000G
average coverage   120x        104x         ~120x per sample     ~500x per sample
number of reads    36,918      305,581      89,934 per sample    256,989 per sample

[Figure: one panel each for the discovery dataset, validation dataset, breast cancer dataset and 1000G dataset.]

Sequencing bias is a known problem with NGS technologies that PacBio does not share
[Figure: normalized coverage by GC content contrasted with GC content of the genome, for E. coli, R. sphaeroides and P. falciparum.]
Come to Michael Ross' talk on Tuesday @ 7pm for a more thorough exploration of bias in the different sequencing technologies today.

The GATK Bayesian model much prefers the random error profile of PacBio to systematic errors
[Figure: the same genome region on both datasets (phasing dataset), one view labeled SYSTEMATIC ERROR and the other RANDOM ERROR.]

Long reads with a high indel error rate have a side effect: reference bias
[Figure: Allele balances for known variants in PacBio. Histogram of the number of heterozygous sites (0 to 80) by alternate allele fraction (0.0 to 1.0).]
Current tools are not capable of locally realigning PacBio data, but we anticipate that newer tools will improve this issue.

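Allele balance at a heterozygous site is the fraction of reads that support the alternate allele. Without bias it should center on 0.5, and reference bias pulls it lower. A minimal sketch follows, with made-up read counts and hypothetical function names.

```python
def alt_allele_fraction(ref_count, alt_count):
    """Fraction of reads at a site that support the alternate allele."""
    total = ref_count + alt_count
    if total == 0:
        raise ValueError("site has no coverage")
    return alt_count / total


def histogram(fractions, bins=10):
    """Count sites per equal-width bin over [0, 1]; 1.0 falls in the last bin."""
    counts = [0] * bins
    for f in fractions:
        counts[min(int(f * bins), bins - 1)] += 1
    return counts


# Hypothetical (ref_count, alt_count) pairs at known heterozygous sites.
sites = [(60, 40), (70, 30), (50, 50), (80, 20)]
fractions = [alt_allele_fraction(ref, alt) for ref, alt in sites]
mean_fraction = sum(fractions) / len(fractions)
print(histogram(fractions), round(mean_fraction, 2))
```

A mean well below 0.5 across known heterozygous sites, as in this toy input, is the signature of reference bias described on the slide.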
True variation missed by Pacbio due to reference bias
[Figure: alignment view from the 1000G dataset with "C" INSERTIONS highlighted.]
The alternate allele is hiding inside the insertions due to the low gap open penalty of the aligner.

The Base Quality Score framework does not account for indel errors
PacBio produces Q20 bases on average across datasets
[Figure: reported quality score histograms (number of bases vs. reported quality score, 0 to 50): discovery dataset, entropy = 2.178; validation dataset, entropy = 2.718; 1000G dataset, entropy = 2.769.]

PacBio base qualities are not affected by the length of the read
[Figure: empirical - reported quality (-10 to 10) vs. cycle (0 to 2500) for the discovery dataset. Before recalibration: RMSE_good = 7.196, RMSE_all = 7.211. After recalibration: RMSE_good = 0.559, RMSE_all = 0.877. Platform labels: SLX GA, 454, SOLiD, Complete Genomics, HiSeq, PacBio.]

Before recalibration:
• Even before recalibration, PacBio base qualities do not seem to be affected by the length of the read, unlike other technologies.
• The steady straight line breaks after 1250bp because we have very few reads that go that long (hence the light blue colored dots).

After recalibration:
• Recalibration helps make the straight line more dense and clear.
• The lack of data points still breaks the recalibrated line after 1250bp.

PacBio variation discovery and validation
Validating hard-to-call sites and a look at variation discovery using PacBio

How can we use PacBio data for human analysis?
• Is PacBio a good platform for follow-up validation today?
• Can we do SNP discovery with PacBio data?
• How does PacBio compare to other technologies?
Data and Definitions
• We have performed a number of experiments at the Broad using PacBio for human data analysis.
- discovery dataset (12/23/2010): 61 amplicons covering 177 kb from regions across chromosome 20 of NA12878 (1000G sample).
- validation dataset (1/20/2011): a set of hard-to-call NA12878 SNPs targeted with 2 kbp amplicons.
- breast cancer dataset (6/17/2011): 24 samples for tumor/normal validation analysis of 15 events against HiSeq, 454 and Sequenom.
- 1000G dataset (8/25/2011): 8 samples resequenced at 250 sites for follow-up validation against Illumina, Sanger and Sequenom.

Pacbio as a validation tool
• Follow-up validation is a major unmet need at the Broad and other centers.
• We carried out a follow-up validation assay using the de novo mutations previously validated by the 1000G project.
• Some are real de novo mutations.
• Most are machine artifacts already identified by follow-up validation in 1000G.
• These are hard-to-call sites that are prone to errors and really challenge the accuracy of sequencing technologies.

PacBio demonstrates great performance on hard-to-call sites

PacBio        known true variant site   known false variant site   predictive value
called alt    48                        5                          91% (PPV)
called ref    0                         67                         100% (NPV)

HiSeq         known true variant site   known false variant site   predictive value
called alt    48                        35                         58% (PPV)
called ref    0                         37                         100% (NPV)

Positive predictive value (PPV), or precision, is the proportion of subjects with positive test results who are correctly diagnosed.
Negative predictive value (NPV) is the proportion of subjects with a negative test result who are correctly diagnosed.
same sites on both tests
validation dataset
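As a quick check of the predictive values on this slide, the following Python sketch recomputes PPV and NPV from the four counts in each table. The function and variable names are illustrative only and are not part of any GATK tool.

```python
def predictive_values(alt_true, alt_false, ref_true, ref_false):
    """Return (PPV, NPV) from a 2x2 validation table.

    alt_true / alt_false: sites called alt that are known true / known false variants.
    ref_true / ref_false: sites called ref that are known true / known false variants.
    """
    ppv = alt_true / (alt_true + alt_false)   # correct alt calls among all alt calls
    npv = ref_false / (ref_true + ref_false)  # correct ref calls among all ref calls
    return ppv, npv


# Counts copied from the slide (validation dataset).
pacbio_ppv, pacbio_npv = predictive_values(48, 5, 0, 67)
hiseq_ppv, hiseq_npv = predictive_values(48, 35, 0, 37)

print(f"PacBio PPV {pacbio_ppv:.0%}, NPV {pacbio_npv:.0%}")
print(f"HiSeq  PPV {hiseq_ppv:.0%}, NPV {hiseq_npv:.0%}")
```

Both platforms call the same 48 true sites; the gap in PPV comes entirely from the number of known false sites that were called alt (5 versus 35).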
PacBio performs well in “apples to apples” comparison with MiSeq data

PacBio        known true variant site   known false variant site   predictive value
called alt    37                        1                          97% (PPV)
called ref    1                         59                         98% (NPV)

MiSeq         known true variant site   known false variant site   predictive value
called alt    38                        5                          88% (PPV)
called ref    0                         55                         100% (NPV)

Callouts on the tables:
- Site missed due to reference bias
- both technologies miscalled this site with the same “wrong” allele that is reported in our gold standard callset, making the truth status of this site questionable (possible Sanger trace error)
- 4 sites missed due to systematic error (probably misalignment)
same sites on both tests
validation dataset
1000G project validation experiment
• First we used Sequenom to validate 300 well-behaving SNP sites chosen to be polymorphic in at least 1 out of 8 specific samples from Illumina low pass data.
- Sequenom is the current standard validation tool at the Broad.
• Sequenom only had data for 250 sites.
• We used PacBio to validate all 300 sites and looked at the agreement between Sequenom and Pacbio.
PacBio adds valuable information to Sequenom validation

                PacBio ALT   PacBio REF
Sequenom ALT    218          7
Sequenom REF    8            12

Result PacBio                     No. occurrences   What went wrong
good sequencing                   1                 Sequenom was wrong
Alt allele placed on insertion    4                 PacBio reference bias
No coverage                       1                 Reads actually didn’t belong at location
Wrong ALT allele called           1                 UG triallelic issue

Visual classification             Result from PacBio
6 look incredibly good            5 ALTs, 1 reference bias
1 bad mapping quality             ALT
1 has nearby deletion (unclear)   Reads actually didn’t belong at location
50 sites not called by Sequenom   Many sites were ALT, others mismapped
1000G dataset
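The overall agreement between the two assays is not stated on the slide, but it follows directly from the four counts in the Sequenom versus PacBio table. A minimal sketch, with illustrative names:

```python
# Counts copied from the slide: keys are (Sequenom call, PacBio call).
table = {
    ("ALT", "ALT"): 218,
    ("ALT", "REF"): 7,
    ("REF", "ALT"): 8,
    ("REF", "REF"): 12,
}


def concordance(counts):
    """Fraction of sites where both assays report the same call."""
    agree = sum(n for (seq_call, pb_call), n in counts.items() if seq_call == pb_call)
    return agree / sum(counts.values())


total_sites = sum(table.values())
rate = concordance(table)
print(f"{total_sites} sites compared, {rate:.1%} concordant")
```

The 15 discordant sites (7 + 8) are the ones broken down by cause in the two smaller tables on the slide.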
Pacific Biosciences, Ion Torrent and MiSeq have good potential for validation experiments

               sensitivity   specificity   PPV      NPV
Ion (bwa-sw)   96.2%         100%          100%     54.5%
Ion (tmap)     96.2%         100%          100%     54.5%
MiSeq          98.1%         92.3%         99.6%    70.5%
PacBio         98.1%         100%          100%     68.7%

Low specificity indicates artifactual calls outside the scope of the validation
Ion Torrent has a low NPV but is good in most other metrics.
cancer dataset
           Illumina   Sequenom   PacBio   454
somatic    15         6          12       8
wildtype   0          6          1        0
unknown    0          3          2        7
high coverage and high specificity to targets
breast cancer validation experiment
base qualities are severely under calculated
Pacbio correctly identified a false positive in the original dataset
(unknown in sequenom and 454)
GATK performs very well for SNP discovery with PacBio data
                          MiSeq    HiSeq     PacBio
Gold Standard SNP calls   222      225       197
calls on HapMap           43       43        38
Sensitivity               99.1%    100.0%    87.6%
discovery
• Reference bias (17) and lack of coverage (11) were the reasons for missed sites in Pacbio
• MiSeq missing data are due to mismapping/artifact (2) or low coverage (1).
Broad’s somatic mutation caller (muTect) successfully calls PacBio data
• One tumor/normal pair called:
• 6,459 sites examined
• 4,837 sites covered (14x/8x)
• 1 true somatic mutation called (previously validated)
• 0 False Positives called
muTect is a GATK-based caller developed by the cancer group at the Broad Institute (https://confluence.broadinstitute.org/display/CGATools/MuTect)
PacBio data performs well with the GATK because...
• The errors are random (despite the high error rate).
• This non-systematic error mode is well handled by the GATK SNP calling mathematics.
• Very long reads make mapping very clear.
• Fewer mismappings of paralogous sequences.
• Structural variants are less prone to appear as SNPs.
Pacbio’s reference bias is currently the major limiting factor
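To illustrate why random errors are tolerable even at a high rate, here is a minimal sketch of a textbook diploid genotype likelihood, Pr{D|G}, computed over a pileup of bases. It is a simplification for illustration and not the GATK implementation; the 15% error rate, the alleles and the pileup are made-up values.

```python
import math


def genotype_log10_likelihoods(bases, error_rate, ref="A", alt="G"):
    """log10 Pr{D|G} for the three diploid genotypes at one site.

    Assumes errors are independent and uniformly random: a base matches its
    source allele with probability 1 - e, and is any one of the three other
    bases with probability e / 3.
    """
    def p_base(base, allele):
        return 1.0 - error_rate if base == allele else error_rate / 3.0

    genotypes = {"hom_ref": (ref, ref), "het": (ref, alt), "hom_alt": (alt, alt)}
    likelihoods = {}
    for name, (a1, a2) in genotypes.items():
        # Each read comes from either chromosome with probability 1/2.
        likelihoods[name] = sum(
            math.log10(0.5 * p_base(b, a1) + 0.5 * p_base(b, a2)) for b in bases
        )
    return likelihoods


# Hypothetical pileup: a het site with three random errors (C, C, T).
pileup = ["A"] * 10 + ["G"] * 10 + ["C", "C", "T"]
lik = genotype_log10_likelihoods(pileup, error_rate=0.15)
best = max(lik, key=lik.get)
print(best, {g: round(v, 2) for g, v in lik.items()})
```

Because the erroneous bases are scattered across all alternatives rather than piling up on one allele, they penalize every genotype equally and the het genotype still wins by several orders of magnitude. A systematic error, where the same wrong base recurs, would not cancel out in this way.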
Future of the GATK
What is the GSA team working on right now (that will impact PacBio data analysis)?
From reads to alleles: the first frontier
• Can’t calculate a likelihood for a hypothesis you don’t consider
• How do I know what genetic variant I’m looking at, given the read data alone?
- A SNP, an INDEL, an SV, or something else?
• General problem, but acute for medium-sized events and insertions
Too systematic to be machine errors, but the haplotype for Pr{D|H} is unclear
Example 1 Example 2
From reads to alleles: the next frontier
• Can’t calculate a likelihood for a hypothesis you don’t consider
• How do I know what genetic variant I’m looking at, given each locus independently?
- A SNP, an INDEL, an SV, or something else?
• General problem, but acute for medium-sized events as we not only miss the true event but also generate many smaller false events
• Reference bias can be addressed from a haplotype approach
Too systematic to be machine errors, but the haplotype for Pr{D|H} is unclear
Example 1 Example 2
Using local de novo haplotype assembly via de Bruijn graphs
Assembly of large genomes using second-generation sequencing. Schatz. Genome Research. 2010.
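The idea of local haplotype assembly can be sketched with a toy de Bruijn graph: break the reads into k-mers, link overlapping (k-1)-mers, and read candidate haplotypes off the paths between two anchor nodes. This is an illustration of the general technique under simplifying assumptions (error-free reads, no repeated (k-1)-mers, made-up sequences); it is not the Haplotype Caller's algorithm, and all names are hypothetical.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def enumerate_haplotypes(graph, source, sink):
    """Spell out every simple path from source to sink as a sequence."""
    haplotypes = []
    stack = [(source, source, {source})]
    while stack:
        node, seq, seen = stack.pop()
        if node == sink:
            haplotypes.append(seq)
            continue
        for nxt in graph.get(node, ()):
            if nxt not in seen:  # skip cycles
                stack.append((nxt, seq + nxt[-1], seen | {nxt}))
    return sorted(haplotypes)


# Made-up reads from a het site: one haplotype carries a 2 bp deletion ("TA").
reads = ["ATGCCGTAGG", "GCCGTAGGCT", "ATGCCGGGCT", "TGCCGGGCT"]
graph = build_de_bruijn(reads, k=4)
haplotypes = enumerate_haplotypes(graph, source="ATG", sink="GCT")
print(haplotypes)
```

The graph contains a bubble where the two haplotypes diverge and rejoin, so the deletion is recovered as a whole alternative sequence rather than as a scatter of mismatches against the reference. Each discovered haplotype can then serve as a hypothesis H when evaluating Pr{D|H}.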
Example Mullikin het deletion we now call
chr4:336781 TTAAAAAAGTATTAAAAAAGTTCCTTGCATGA/-
Original read data
Discovered haplotype
Example Mullikin het insertion we now call
chr18:14937489 -/CCACTCCAGCCTCTGATGGACTGCAAGCTGGGTCT
Original read data
Discovered haplotype
Haplotype Caller greatly increases sensitivity to larger indel events over the Unified Genotyper

                    Mullikin                                   Mills
Caller              Variant Sensitivity  Genotype Concordance  Variant Sensitivity  Genotype Concordance
                    (strict)             (strict)              (strict)             (strict)
Unified Genotyper   51.9% (40 / 77)      51.9% (40 / 77)       49.0% (97 / 198)     49.0% (97 / 198)
Haplotype Caller    90.9% (70 / 77)      89.6% (69 / 77)       81.8% (162 / 198)    81.8% (162 / 198)

• Input data is NA12878 b37+decoy WGS HiSeq high coverage
• Sites chosen to be very difficult (het) but high confidence in being real (require family transmission)
• Evaluation sets
• Mullikin Fosmids and Mills et al, GR, 2011 (2x hit, double center)
• Large events (> 15 bp), largest is 106bp (which we don’t yet call)
A new BQSR that also recalibrates “indel qualities”
[Figure: two panels, “AAAAA context” and “AATCG context”. x-axis: suffix; y-axis: empirical gap open penalty (25 to 55). Legend series: 20FUK.2 to 20FUK.8 and 20GAV.1 to 20GAV.6.]
There is a significant difference in the empirical probability of starting an insertion or deletion depending on context
other improvements
• “auto-recalibration” mode for organisms without known callsets
• improved covariate models
• simpler command line pipeline with a single tool instead of three.
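An empirical gap open penalty of the kind plotted on this slide is a Phred-scaled probability, Q = -10 log10(p), where p is the observed rate of the event in a given context. The sketch below shows that conversion only; the counts are invented, and the pseudocount smoothing is an assumption made here to avoid taking the log of zero, not a description of what BQSR does internally.

```python
import math


def empirical_quality(n_events, n_observations, pseudocount=1):
    """Phred-scaled empirical quality from event and observation counts.

    The pseudocount is an illustrative smoothing choice (assumption).
    """
    p = (n_events + pseudocount) / (n_observations + 2 * pseudocount)
    return -10.0 * math.log10(p)


# Hypothetical (indel starts, bases observed) per sequence context.
counts = {"AAAAA": (999, 99_998), "AATCG": (9, 99_998)}
penalties = {ctx: empirical_quality(e, n) for ctx, (e, n) in counts.items()}

for ctx, q in penalties.items():
    print(f"{ctx}: Q{q:.0f}")
```

With these invented counts the homopolymer context comes out about 20 Phred units lower than the mixed context, which is the sort of context dependence a single fixed gap open penalty cannot capture.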
[Figure: three rows of panels, each split into Base Substitution, Base Insertion and Base Deletion. Row 1: Empirical Quality Score versus Reported Quality Score. Row 2: Quality Score Accuracy versus Cycle Covariate. Row 3: Quality Score Accuracy versus Context Covariate. Legend: Recalibration (Recalibrated, newRecalibrator); scale: log(nBases).]
[Figure: three rows of panels, each split into Base Substitution, Base Insertion and Base Deletion. Row 1: Number of Observations versus QualityScore Covariate. Row 2: Mean Quality Score versus Cycle Covariate. Row 3: Mean Quality Score versus Context Covariate. Legend: Recalibration (Recalibrated, newRecalibrator); scale: log(nBases).]
Thank you!
Stay up to date with the GSA team through our wiki
• the latest releases of our tools and version changelogs
• tutorials on our best practices for data processing and analysis
• further information on how to use the GATK engine for your own research or to collaborate with us
http://www.broadinstitute.org/gsa/wiki/index.php