Best practices for Variant Calling with Pacific Biosciences data

Mauricio Carneiro, Ph.D.
Mark DePristo, Ph.D.

Genome Sequence and Analysis
Medical and Population [email protected]

Wednesday, February 15, 2012

The Current Pipeline

General best practice data processing and variant calling using the GATK

Our framework for variation discovery

[Figure: the three-phase GATK pipeline]

Phase 1: NGS data processing (typically by lane)
  Input: raw reads
  Steps: mapping, local realignment, duplicate marking, base quality recalibration
  Output: analysis-ready reads

Phase 2: Variant discovery and genotyping (typically multiple samples simultaneously, but can be a single sample alone)
  Input: sample 1 reads ... sample N reads
  Steps: calling of SNPs, indels and structural variation (SV)
  Output: raw variants (raw SNPs, raw indels, raw SVs)

Phase 3: Integrative analysis
  Input: raw variants plus external data (pedigrees, known variation, known genotypes, population structure)
  Steps: variant quality recalibration, genotype refinement
  Output: analysis-ready variants

DePristo, M., Banks, E., Poplin, R. et al. (2011) A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet.


Phase 1: NGS data processing

Mapping and alignment algorithms

Finding the true origin of each read is a computationally demanding first step

[Figure: an enormous pile of short reads from NGS is mapped against regions 1, 2 and 3 of the reference genome. The aligner either detects the correct read origin and flags the reads with high certainty, or detects ambiguity in the origin of the reads and flags them as uncertain.]

For more information see: Li and Homer (2010). A survey of sequence alignment algorithms for next-generation sequencing. Briefings in Bioinformatics.
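The idea of flagging a read as certain or uncertain can be illustrated with a Phred-scaled mapping quality. The sketch below is not the scoring used by any particular aligner: the uniform prior over candidate loci, the cap of 60 and the function names are assumptions made for illustration.

```python
import math

def posterior_wrong(candidate_likelihoods):
    """Probability that the best candidate locus is NOT the true origin,
    assuming a uniform prior over the candidate loci (an assumption)."""
    total = sum(candidate_likelihoods)
    best = max(candidate_likelihoods)
    return 1.0 - best / total

def mapping_quality(p_wrong, cap=60):
    """Phred-scale the probability of a wrong placement; cap is hypothetical."""
    if p_wrong <= 0:
        return cap
    return min(cap, round(-10 * math.log10(p_wrong)))

# One clearly best locus: high certainty.
unique = mapping_quality(posterior_wrong([1.0, 1e-6, 1e-7]))
# Two equally good loci: the origin is ambiguous.
ambiguous = mapping_quality(posterior_wrong([1.0, 1.0]))
print(unique, ambiguous)
```

A read that fits two regions equally well has a 50% chance of being misplaced, which corresponds to a mapping quality of about 3, so downstream tools can down-weight or ignore it.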


Accurate read alignment through multiple sequence local realignment

Effect of MSA on alignments, NA12878, chr1:1,510,530-1,510,589

[Figure: four alignment panels with the sites rs28782535, rs28783181, rs28788974 and rs34877486 labeled: 1,000 Genomes Pilot 2 data, raw MAQ alignments; 1,000 Genomes Pilot 2 data, after MSA; HiSeq data, raw BWA alignments; HiSeq data, after MSA.]

DePristo, M., Banks, E., Poplin, R. et al. (2011) A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet.


Accurate error modeling with base quality score recalibration

[Figure: base quality score recalibration on four platforms: Illumina/GenomeAnalyzer, Roche/454, Life/SOLiD and Illumina/HiSeq 2000. For each platform, original and recalibrated qualities are compared in three kinds of panels: empirical quality vs. reported quality (both 0 to 40); accuracy (empirical minus reported quality, -10 to 10) vs. machine cycle, with first-of-pair and second-of-pair reads labeled; and accuracy (empirical minus reported quality) vs. dinucleotide (AA ... TG). RMSE values legible in the panels: "Original, RMSE = 5.242; Recalibrated, RMSE = 0.196" and "Original, RMSE = 2.556; Recalibrated, RMSE = 0.213". Figure credit: Ryan Poplin.]
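The RMSE values quoted in the recalibration panels measure how far reported qualities sit from empirical ones. A minimal sketch of that comparison follows, assuming an unweighted RMSE over quality bins and hypothetical counts; the slides do not say how the plotted RMSE is weighted.

```python
import math

def empirical_quality(n_observations, n_mismatches):
    """Phred-scaled empirical quality: -10 * log10(mismatch rate)."""
    # Clamp to one mismatch so log10 is defined (an assumption of this sketch).
    rate = max(n_mismatches, 1) / n_observations
    return -10 * math.log10(rate)

def calibration_rmse(bins):
    """Unweighted RMSE between reported and empirical quality.

    bins: list of (reported_quality, n_observations, n_mismatches).
    """
    squared = [(q - empirical_quality(n, m)) ** 2 for q, n, m in bins]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical counts: the machine reports Q20 and Q30, but the bases
# behave like roughly Q15 and Q20.
overconfident = [(10, 1000, 100), (20, 1000, 32), (30, 1000, 10)]
# Hypothetical counts where reported and empirical qualities agree.
well_calibrated = [(10, 1000, 100), (20, 1000, 10), (30, 1000, 1)]

print(calibration_rmse(overconfident), calibration_rmse(well_calibrated))
```

A well-calibrated platform gives an RMSE near zero, which is what the recalibrated panels aim for.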



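SNP and Indel calling is a large-scale Bayesian modeling problem

Bayesian model

•  Inference: what is the genotype G of each sample given read data D for each sample?
•  Calculate via Bayes' rule the probability of each possible G
•  Product expansion assumes reads are independent
•  Relies on a likelihood function to estimate probability of sample data given proposed haplotype

4 SNP calling

4.1 Simple genotype likelihoods for presentations

\[ \Pr\{G \mid D\} = \frac{\Pr\{G\}\,\Pr\{D \mid G\}}{\sum_i \Pr\{G_i\}\,\Pr\{D \mid G_i\}} \quad \text{[Bayes' rule]} \]

\[ \Pr\{D \mid G\} = \prod_j \left( \frac{\Pr\{D_j \mid H_1\}}{2} + \frac{\Pr\{D_j \mid H_2\}}{2} \right) \quad \text{where } G = H_1 H_2 \]

Pr{D|H} is the haploid likelihood function

4.1.1 SNP haploid likelihood

\[ \Pr\{D_j \mid H\} = \Pr\{D_j \mid b\} \quad \text{[single base pileup]} \]

\[ \Pr\{D_j \mid b\} = \begin{cases} 1 - \epsilon_j & D_j = b, \\ \epsilon_j & \text{otherwise.} \end{cases} \]

4.1.2 Indel haploid likelihood

\[ \Pr\{D_j \mid H\} = \sum_{\text{alignments } \pi \text{ of } D_j \text{ to } H} \Pr\{D_j, \pi\} \]

4.2 Genotype likelihoods

\[ \Pr\{D_i \mid GT_i\} = \prod_j \Pr\{D_{i,j} \mid GT_i\} \]

\[ \Pr\{D_{i,j} \mid GT_i = AB\} = \left( \Pr\{D_{i,j} \mid A\} + \Pr\{D_{i,j} \mid B\} \right) / 2 \]

\[ \Pr\{D_{i,j} \mid B\} = \begin{cases} 1 - \epsilon_{i,j} & D_{i,j} = B, \\ \epsilon_{i,j} \cdot \Pr\{B \text{ is true} \mid D_{i,j} \text{ is miscalled}\} & \text{otherwise.} \end{cases} \]

In Bayes' rule, Pr{G} is the prior of the genotype and Pr{D|G} is the likelihood of the genotype. The two-haplotype form G = H1H2, with each haplotype weighted by 1/2, is the diploid assumption.

See http://www.broadinstitute.org/gsa/wiki/index.php/Unified_genotyper for more information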
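The simple SNP model in 4.1 and 4.1.1 can be written out in a few lines. This is a sketch of those formulas only, not the Unified Genotyper implementation: the flat genotype prior, the pileup encoding as (base, Phred quality) pairs and all function names are assumptions.

```python
from itertools import combinations_with_replacement

BASES = "ACGT"

def haploid_likelihood(observed, error, allele):
    """Pr{D_j | b}: 1 - e_j if the read base matches the allele, e_j otherwise."""
    return 1.0 - error if observed == allele else error

def genotype_likelihood(pileup, genotype):
    """Pr{D | G} for G = H1 H2, as a product over independent reads."""
    h1, h2 = genotype
    likelihood = 1.0
    for observed, qual in pileup:
        error = 10 ** (-qual / 10)  # Phred quality to error probability
        likelihood *= (haploid_likelihood(observed, error, h1) / 2
                       + haploid_likelihood(observed, error, h2) / 2)
    return likelihood

def genotype_posteriors(pileup, priors=None):
    """Pr{G | D} for all ten diploid genotypes via Bayes' rule."""
    genotypes = ["".join(g) for g in combinations_with_replacement(BASES, 2)]
    if priors is None:
        # Flat prior over genotypes: an assumption made for this sketch.
        priors = {g: 1.0 / len(genotypes) for g in genotypes}
    joint = {g: priors[g] * genotype_likelihood(pileup, g) for g in genotypes}
    total = sum(joint.values())
    return {g: p / total for g, p in joint.items()}

# Hypothetical pileup: five A reads and five G reads, all at Q30.
pileup = [("A", 30)] * 5 + [("G", 30)] * 5
posteriors = genotype_posteriors(pileup)
best = max(posteriors, key=posteriors.get)
print(best)
```

With deep coverage the plain product underflows, so a practical version would sum log-likelihoods instead.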


SNP genotype likelihoods

•  All diploid genotypes (AA, AC, …, GT, TT) considered at each base

•  Likelihood of genotype computed using only pileup of bases and associated quality scores at given locus

•  Only "good bases" are included: those satisfying minimum base quality, mapping read quality, pair mapping quality, NQS

See http://www.broadinstitute.org/gsa/wiki/index.php/Unified_genotyper for more information
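The first and third bullets can be sketched as a genotype enumeration plus a pileup filter. Only two of the listed criteria (base quality and mapping quality) are modeled here; the thresholds, the `PileupBase` type and the function names are hypothetical, and the pair mapping quality and NQS checks are left out.

```python
from itertools import combinations_with_replacement
from typing import NamedTuple

class PileupBase(NamedTuple):
    base: str
    base_quality: int
    mapping_quality: int

def diploid_genotypes(alleles="ACGT"):
    """All unordered allele pairs: AA, AC, ..., GT, TT."""
    return ["".join(pair) for pair in combinations_with_replacement(alleles, 2)]

def good_bases(pileup, min_base_quality=10, min_mapping_quality=10):
    """Keep only bases passing both (hypothetical) quality thresholds."""
    return [b for b in pileup
            if b.base_quality >= min_base_quality
            and b.mapping_quality >= min_mapping_quality]

pileup = [
    PileupBase("A", 30, 60),
    PileupBase("G", 5, 60),   # dropped: low base quality
    PileupBase("A", 30, 0),   # dropped: low mapping quality
]
print(diploid_genotypes())
print(good_bases(pileup))
```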



Variant Quality Score Recalibration (VQSR): modeling error properties of real polymorphism to determine the probability that novel sites are real

The HapMap3 sites from NA12878 HiSeq calls are used to train the Gaussian mixture model (GMM). Shown here is the 2D plot of strand bias vs. the variant quality / depth for those sites.

Variants are scored based on their fit to the Gaussians. The variants (here just the novels) clearly separate into good and bad clusters.

DePristo, M., Banks, E., Poplin, R. et al. (2011) A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet.
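Scoring a variant by its fit to the Gaussians can be sketched as a log-odds between two mixtures. This is an illustration only: the mixture parameters below are made up rather than trained on HapMap3 sites, the diagonal covariance and the explicit "bad" model are simplifying assumptions, and none of the names come from the GATK.

```python
import math

def log_gaussian(x, mean, var):
    """Log density of an axis-aligned (diagonal covariance) Gaussian."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))

def log_mixture(x, components):
    """Log density of a mixture; components are (weight, mean, var) tuples."""
    terms = [math.log(w) + log_gaussian(x, mean, var)
             for w, mean, var in components]
    top = max(terms)
    return top + math.log(sum(math.exp(t - top) for t in terms))  # log-sum-exp

def log_odds(x, good_model, bad_model):
    """Positive values favor the 'good' cluster, negative the 'bad' one."""
    return log_mixture(x, good_model) - log_mixture(x, bad_model)

# Hypothetical models over (strand bias, variant quality / depth).
good_model = [(0.7, (0.0, 20.0), (1.0, 25.0)), (0.3, (0.5, 30.0), (1.0, 25.0))]
bad_model = [(1.0, (5.0, 5.0), (4.0, 9.0))]

likely_real = log_odds((0.1, 22.0), good_model, bad_model)
likely_artifact = log_odds((5.0, 4.0), good_model, bad_model)
print(likely_real, likely_artifact)
```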


These methods are available in the Genome Analysis Toolkit (GATK)

Genome Analysis Toolkit (GATK)
•  Open-source map/reduce programming framework for developing analysis tools for next-gen sequencing data
•  Easy-to-use, CPU and memory efficient, automatically parallelizing Java engine
http://www.broadinstitute.org/gsa/wiki/

McKenna et al. (2010) The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res.

SAM/BAM format
•  Technology agnostic, binary, indexed, portable and extensible file format for NGS reads
•  Also used in the Broad production pipeline
http://samtools.sourceforge.net/

VCF format
•  Standard and accessible format for storing population variation and individual genotypes
http://vcftools.sourceforge.net/

1000 Genomes GATK tools
•  Most Broad Institute tools for the 1000 Genomes have been developed in the GATK
•  Indel realignment, base quality score recalibration, Unified Genotyper, VQSR, Variant Eval, and many other analysis tools


PacBio Processing Pipeline

How we apply our pipeline to Pacific Biosciences data: a step-by-step tutorial

Our framework for variation discovery

[Figure: the three-phase GATK pipeline shown earlier (Phase 1: NGS data processing; Phase 2: variant discovery and genotyping; Phase 3: integrative analysis), annotated with two notes for PacBio data:]

•  Not evaluated yet on PacBio data due to the small size of the datasets.
•  Currently the GATK cannot perform indel realignment due to the high indel error rate and the long reads of Pacific Biosciences.


PacBio Processing Pipeline

1. We start our processing pipeline with the filtered_subreads.fasta file produced by the PacBio software and turn it into a FASTQ file using SMRT Pipeline scripts provided by PacBio.

2. Mapping and alignment are done using BWA with a heuristic Smith-Waterman algorithm (bwa-sw).

3. We sort the BAM file and add read group and sample information using Picard Tools: SortSam and AddOrReplaceReadGroups.

4. We recalibrate base qualities using the GATK's Base Quality Score Recalibration framework.

[Flow diagram: FASTA → FASTQ → BWA-SW → SAM → Picard → BAM → Base Quality Score Recalibration → Analysis Ready BAM]
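Steps 2 and 3 above can be sketched as command lines assembled in Python. The sketch only builds and prints the commands; it does not run them. File names, the read group values and the per-tool Picard jar names (`SortSam.jar`, `AddOrReplaceReadGroups.jar`) are assumptions for illustration, and the reference is assumed to be already indexed with `bwa index`.

```python
import shlex

def pacbio_commands(fastq, reference, sample, prefix="pacbio"):
    """Return (argv, stdout_file) pairs for steps 2 and 3 of the pipeline.

    stdout_file names the file that should capture standard output, or None.
    """
    sam = f"{prefix}.sam"
    sorted_bam = f"{prefix}.sorted.bam"
    final_bam = f"{prefix}.rg.bam"
    return [
        # Step 2: BWA-SW alignment; bwa writes SAM to standard output.
        (["bwa", "bwasw", reference, fastq], sam),
        # Step 3a: coordinate-sort and convert to BAM with Picard SortSam.
        (["java", "-jar", "SortSam.jar",
          f"INPUT={sam}", f"OUTPUT={sorted_bam}", "SORT_ORDER=coordinate"], None),
        # Step 3b: add read group and sample information (placeholder values).
        (["java", "-jar", "AddOrReplaceReadGroups.jar",
          f"INPUT={sorted_bam}", f"OUTPUT={final_bam}",
          "RGID=pacbio_run1", "RGLB=lib1", "RGPL=PACBIO", "RGPU=unit1",
          f"RGSM={sample}"], None),
    ]

for argv, stdout_file in pacbio_commands(
        "filtered_subreads.fastq", "reference.fasta", "NA12878"):
    line = shlex.join(argv)
    if stdout_file:
        line += f" > {stdout_file}"
    print(line)
```

The resulting BAM is the input to step 4, base quality score recalibration.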


Why do we align with BWA and not BLASR?

• BWA is the standard aligner in the Broad's sequencing platform.

• BLASR is still responsible for generating the filtered sub-reads.

• With recent updates, BLASR-generated BAM files are a reasonable alternative for this step of the pipeline:

- The optional pipeline starts with a BLASR-generated BAM (skipping the BWA and Picard steps).

- Read group information and BQSR are still required steps.

- Works well, but generally gives a smaller yield.

- We anticipate that further development in BLASR-generated BAMs could improve this alternate pipeline in the future.


Strict BLASR filtering reduces yield and eliminates the longer reads

[Figure: two panels, labeled "total mapped coverage: 74,735,274 bp" and "total mapped coverage: 19,562,290 bp"]

Aggressive BLASR clipping turns longer reads into "short" reads.


Introduction to Base Quality Score Recalibration

Recalibration

Sequencers provide estimates of error rate per nucleotide

… but they aren't very accurate

… and they aren't very informative

[Figure: two pairs of panels. Each pair shows a "Reported quality score histogram" (count vs. quality score, 0 to 40) and a "Reported vs. empirical quality scores" scatter plot (empirical quality score vs. reported quality score, both 0 to 40).]


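Recalibration workflow

1. Original BAM file → CountCovariates (dbSNP / known sites necessary) → Covariates table (.csv)
2. Covariates table (.csv) → AnalyzeCovariates → Pre-recalibration analysis plots
3. Original BAM file + covariates table (.csv) → TableRecalibration → Recalibrated BAM file
4. Recalibrated BAM file → CountCovariates → Recalibrated covariates table (.csv)
5. Recalibrated covariates table (.csv) → AnalyzeCovariates → Post-recalibration analysis plots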


Running CountCovariates

java -Xmx4g -jar GenomeAnalysisTK.jar -R reference.fasta -D dbsnp.vcf -I original.bam -T CountCovariates -cov ReadGroupCovariate -cov QualityScoreCovariate -cov DinucCovariate -cov CycleCovariate -recalFile table.recal_data.csv

•  -D dbsnp.vcf: a list of known polymorphic sites is necessary so that these sites do not count against the bases' mismatch rate
•  -cov ...: list of covariates to be used in the recalibration calculation
•  -recalFile table.recal_data.csv: CSV file containing covariate counts

Table recalibration file (table.recal_data.csv):

# Counted Bases 143745620
ReadGroup,QualityScore,Dinuc,Cycle,nObservations,nMismatches,Qempirical
SRR001802,2,AA,-8,165,17,10
SRR001802,2,AA,-2,91,10,10
SRR001802,2,AA,3,5,4,1
SRR001802,2,AA,4,9,4,4
SRR001802,2,AA,7,12,4,5

See http://www.broadinstitute.org/gsa/wiki/index.php/Base_quality_score_recalibration for more information
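The Qempirical column of the table can be reproduced from the two count columns. The sketch below parses the five rows shown on the slide and recomputes the Phred-scaled mismatch rate with plain rounding. This matches these rows; whether the tool applies any smoothing in general is not stated here, and the helper names are our own.

```python
import csv
import math

TABLE = """\
# Counted Bases 143745620
ReadGroup,QualityScore,Dinuc,Cycle,nObservations,nMismatches,Qempirical
SRR001802,2,AA,-8,165,17,10
SRR001802,2,AA,-2,91,10,10
SRR001802,2,AA,3,5,4,1
SRR001802,2,AA,4,9,4,4
SRR001802,2,AA,7,12,4,5
"""

def read_recal_table(text):
    """Parse the CSV body, skipping blank lines and '#' comment lines."""
    lines = [line for line in text.splitlines()
             if line and not line.startswith("#")]
    return list(csv.DictReader(lines))

def phred(n_observations, n_mismatches):
    """Rounded Phred-scaled mismatch rate (assumes n_mismatches > 0)."""
    return round(-10 * math.log10(n_mismatches / n_observations))

rows = read_recal_table(TABLE)
recomputed = [phred(int(r["nObservations"]), int(r["nMismatches"])) for r in rows]
reported = [int(r["Qempirical"]) for r in rows]
print(recomputed, reported)
```

Every row here has reported quality 2 but an empirical quality between 1 and 10, which is the kind of gap recalibration corrects.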


Running TableRecalibration

java -Xmx4g -jar GenomeAnalysisTK.jar -R Homo_sapiens_assembly18.fasta -I original.bam -T TableRecalibration -recalFile table.recal_data.csv -outputBam recal.bam

•  -recalFile table.recal_data.csv: table recalibration file from the CountCovariates step
•  -outputBam recal.bam: the full recalibrated BAM file, a recalibrated copy of the original BAM file

See http://www.broadinstitute.org/gsa/wiki/index.php/Base_quality_score_recalibration for more information


Running AnalyzeCovariates

java -Xmx4g -jar AnalyzeCovariates.jar -outputDir /path/to/output_dir/ -resources resources/ -recalFile table.recal_data.csv

•  AnalyzeCovariates.jar: a separate .jar file distributed with the GATK
•  -outputDir: the directory in which to place the output analysis plots
•  -resources: points to the GATK installation's directory of R scripts which are used for plotting the data
•  -recalFile: table recalibration file from either the before or after CountCovariates step
•  Output: many plots of base quality versus each covariate

See http://www.broadinstitute.org/gsa/wiki/index.php/Base_quality_score_recalibration for more information


The PacBio Processing Pipeline is available for educational purposes (but not supported)

Queue is part of the GATK and is a pipeline manager used internally at the Broad in most analysis projects (see http://www.broadinstitute.org/gsa/wiki/index.php/Queue)

java -Xmx4g -jar Queue.jar -S PacbioProcessingPipeline.scala -i filtered_subreads.fastq -D dbSNP.vcf -R reference.fasta -run

•  -i filtered_subreads.fastq: or blasr.bam with the extra -blasr option


Calling SNPs and indels using PacBio data with the Unified Genotyper

java -Xmx4g -jar GenomeAnalysisTK.jar -T UnifiedGenotyper -I input.recal.bam -R reference.fasta -D dbsnp.vcf -deletions 0.5 -o myCalls.vcf -mbq 10

•  -deletions 0.5: allows sites with 50% deletions to be analyzed
•  -mbq 10: a minimum base quality of 10 calibrates the UG for PacBio data (average base quality is 20)

The ideal deletions and minimum base quality parameters for this specific dataset were determined systematically by measuring sensitivity/specificity to known variant calls in NA12878.
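The effect of the two parameters can be sketched on a toy pileup. This illustrates the idea behind `-deletions 0.5` and `-mbq 10`, not the Unified Genotyper's code: the pileup encoding (with "D" marking a read whose deletion spans the site) and the function name are assumptions.

```python
def callable_site(pileup, max_deletion_fraction=0.5, min_base_quality=10):
    """Decide whether a site is analyzed and which bases are used.

    pileup: list of (base, quality) tuples; base "D" marks a read with a
    deletion spanning the site (an encoding assumed for this sketch).
    Returns (is_callable, usable_bases).
    """
    if not pileup:
        return False, []
    deletions = sum(1 for base, _ in pileup if base == "D")
    # Sites with up to max_deletion_fraction of deletion reads are analyzed.
    if deletions / len(pileup) > max_deletion_fraction:
        return False, []
    usable = [(base, qual) for base, qual in pileup
              if base != "D" and qual >= min_base_quality]
    return True, usable

# Hypothetical site: 4 of 10 reads carry a deletion, one base is below Q10.
site = [("D", 0)] * 4 + [("A", 20)] * 5 + [("G", 5)]
print(callable_site(site))
# Hypothetical site dominated by deletions: skipped.
print(callable_site([("D", 0)] * 6 + [("A", 20)] * 4))
```

With PacBio's high indel error rate many sites have a large fraction of reads with deletions, so a permissive deletion threshold keeps them in the analysis.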


Analyzing PacBio data

More information is available at the AGBT poster session (presentation Thursday, 1:10 - 2:40pm)


A quick look at Pacific Biosciences data

Indels are the primary error mode (all purple markers). Notice the SNP.

[Figure (discovery dataset): alignment view, and a bar chart of error rate (0% to 15%) for insertions, deletions and mismatches]


Long reads and deep coverage on all PacBio datasets

                   discovery   validation   cancer               1000G
average coverage   120x        104x         ~120x per sample     ~500x per sample
number of reads    36,918      305,581      89,934 per sample    256,989 per sample

[Figure: one panel per dataset: discovery dataset, validation dataset, breast cancer dataset, 1000G dataset]


Sequencing bias is a known problem with NGS technologies that PacBio does not share

[Figure: normalized coverage by GC content, contrasted with the GC content of the genome, for E. coli, R. sphaeroides and P. falciparum]

Come to Michael Ross' talk on Tuesday at 7pm for a more thorough exploration of bias in the different sequencing technologies today.


The random error profile of PacBio is much preferred by the GATK Bayesian model to systematic errors

[Figure (phasing dataset): the same genome region in both datasets, one panel labeled SYSTEMATIC ERROR and one labeled RANDOM ERROR]


Long reads with a high indel error rate have a side effect: reference bias

[Figure: "Allele balances for known variants in PacBio", a histogram of the number of heterozygous sites (0 to 80) vs. alternate allele fraction (0.0 to 1.0)]

Current tools are not capable of locally realigning PacBio data, but we anticipate that newer tools will improve this issue.
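The allele balance plotted for each heterozygous site is the fraction of reads supporting the alternate allele. A small sketch with hypothetical read counts follows; the function names are our own.

```python
def alt_allele_fraction(ref_count, alt_count):
    """Fraction of reads at a site that support the alternate allele."""
    total = ref_count + alt_count
    if total == 0:
        raise ValueError("site has no reads")
    return alt_count / total

def mean_allele_balance(sites):
    """Average alternate allele fraction over (ref_count, alt_count) pairs."""
    fractions = [alt_allele_fraction(ref, alt) for ref, alt in sites]
    return sum(fractions) / len(fractions)

# Hypothetical heterozygous sites: an unbiased assay centers on 0.5,
# while reference bias pulls the fraction below 0.5.
unbiased = [(50, 50), (48, 52), (52, 48)]
ref_biased = [(70, 30), (65, 35), (75, 25)]
print(mean_allele_balance(unbiased), mean_allele_balance(ref_biased))
```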


True variation missed by PacBio due to reference bias

[Figure (1000G dataset): alignment view with "C" insertions highlighted]

The alternate allele is hiding inside the insertions due to the low gap open penalty of the aligner.


The Base Quality Score framework does not account for indel errors

PacBio produces Q20 bases on average across datasets

[Figure: reported quality score histograms (number of bases vs. reported quality score, 0 to 50) for the discovery dataset, the validation dataset and the 1000G dataset. One panel title reads "Reported quality score histogram, entropy = 2.178".]


PacBio base qualities are not affected by the length of the read

[Figure (discovery dataset): empirical minus reported quality (-10 to 10) vs. cycle (0 to 2500) for PacBio, with a legend listing SLX GA, 454, SOLiD, Complete Genomics, HiSeq and PacBio. Panel titles, in order: "RMSE_good = 7.196, RMSE_all = 7.211" and "RMSE_good = 0.559, RMSE_all = 0.877".]

Before recalibration:
• Even before recalibration, PacBio reads do not seem to be affected by the length of the read like other technologies.
• The steady straight line breaks after 1250 bp because we have very few reads that go that long (hence the light blue colored dots).

After recalibration:
• Recalibration helps make the straight line more dense and clear.
• The lack of data points still breaks the recalibrated line after 1250 bp.


PacBio variation discovery and validation

Validating hard-to-call sites and a look at variation discovery using PacBio


How can we use PacBio data for human analysis?

• Is PacBio a good platform for follow-up validation today?

• Can we do SNP discovery with PacBio data?

• How does PacBio compare to other technologies?


Data and Definitions

• We have performed a number of experiments at the Broad using PacBio for human data analysis.

- discovery dataset (12/23/2010): 61 amplicons covering 177 kb from regions across chromosome 20 of NA12878 (1000G sample).

- validation dataset (1/20/2011): a set of hard-to-call NA12878 SNPs targeted with 2 kbp amplicons.

- breast cancer dataset (6/17/2011): 24 samples for tumor/normal validation analysis of 15 events against HiSeq, 454 and Sequenom.

- 1000G dataset (8/25/2011): 8 samples resequenced at 250 sites for follow-up validation against Illumina, Sanger and Sequenom.


PacBio as a validation tool

• Follow-up validation is a major unmet need at the Broad and other centers.

• We carried out a follow-up validation assay using the de novo mutations previously validated by the 1000G project.

• Some are real de novo mutations.

• Most are machine artifacts already identified by follow-up validation in 1000G.

• These are hard-to-call sites that are prone to errors and really challenge sequencing technology accuracy.


PacBio demonstrates great performance on hard-to-call sites

PacBio        known true variant site   known false variant site   predictive value
called alt    48                        5                          91%
called ref    0                         67                         100%

HiSeq         known true variant site   known false variant site   predictive value
called alt    48                        35                         58%
called ref    0                         37                         100%

Positive predictive value (PPV), or precision rate, is the proportion of subjects with positive test results who are correctly diagnosed.

Negative predictive value (NPV) is the proportion of subjects with a negative test result who are correctly diagnosed.

Same sites on both tests (validation dataset).
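The predictive values in the two tables follow directly from the four counts of each table. A short sketch that recomputes them is given below; the function and argument names are our own.

```python
def predictive_values(alt_at_true, alt_at_false, ref_at_true, ref_at_false):
    """Return (PPV, NPV) from a 2x2 validation table.

    alt_at_true:  called alt at known true variant sites  (true positives)
    alt_at_false: called alt at known false variant sites (false positives)
    ref_at_true:  called ref at known true variant sites  (false negatives)
    ref_at_false: called ref at known false variant sites (true negatives)
    """
    ppv = alt_at_true / (alt_at_true + alt_at_false)
    npv = ref_at_false / (ref_at_false + ref_at_true)
    return ppv, npv

pacbio_ppv, pacbio_npv = predictive_values(48, 5, 0, 67)
hiseq_ppv, hiseq_npv = predictive_values(48, 35, 0, 37)
print(f"PacBio: PPV {pacbio_ppv:.0%}, NPV {pacbio_npv:.0%}")
print(f"HiSeq:  PPV {hiseq_ppv:.0%}, NPV {hiseq_npv:.0%}")
```

Both platforms call all 48 true variants. The difference is in the false sites: PacBio calls 5 of them as variant, HiSeq calls 35.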


PacBio performs well in "apples to apples" comparison with MiSeq data

PacBio        known true variant site   known false variant site   predictive value
called alt    37                        1                          97%
called ref    1                         59                         98%

MiSeq         known true variant site   known false variant site   predictive value
called alt    38                        5                          88%
called ref    0                         55                         100%

Notes on the discordant sites:
• Site missed due to reference bias.
• Both technologies miscalled this site with the same "wrong" allele that is reported in our gold standard callset, making the truth status of this site questionable (possible Sanger trace error).
• 4 sites missed due to systematic error (probably misalignment).

Same sites on both tests (validation dataset).


1000G project validation experiment

• First we used Sequenom to validate 300 well-behaved SNP sites chosen to be polymorphic in at least 1 out of 8 specific samples from Illumina low-pass data.

- Sequenom is the current standard validation tool at the Broad.

• Sequenom only had data for 250 sites.

• We used PacBio to validate all 300 sites and looked at the agreement between Sequenom and PacBio.


PacBio adds valuable information to Sequenom validation

                PacBio ALT   PacBio REF
Sequenom ALT    218          7
Sequenom REF    8            12

Result from PacBio               No. occurrences   What went wrong
Good sequencing                  1                 Sequenom was wrong
Alt allele placed on insertion   4                 PacBio reference bias
No coverage                      1                 Reads actually didn’t belong at location
Wrong ALT allele called          1                 UG triallelic issue

Visual classification              Result from PacBio
6 look incredibly good             5 ALTs, 1 reference bias
1 bad mapping quality              ALT
1 has nearby deletion (unclear)    Reads actually didn’t belong at location
50 sites not called by Sequenom    Many sites were ALT, others mismapped

(1000G dataset)
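As a rough way to read the Sequenom versus PacBio table, the overall agreement and the number of discordant entries can be computed from its four counts. This is a minimal sketch; the variable names are ours and the talk itself does not report an agreement percentage.

```python
# Counts from the Sequenom vs. PacBio 2x2 table on this slide.
both_alt = 218          # Sequenom ALT, PacBio ALT
seq_alt_pb_ref = 7      # Sequenom ALT, PacBio REF
seq_ref_pb_alt = 8      # Sequenom REF, PacBio ALT
both_ref = 12           # Sequenom REF, PacBio REF

total = both_alt + seq_alt_pb_ref + seq_ref_pb_alt + both_ref
discordant = seq_alt_pb_ref + seq_ref_pb_alt
agreement = (both_alt + both_ref) / total

print(f"{total} entries, {discordant} discordant, agreement {agreement:.1%}")
```

The 7 and 8 discordant entries are the ones broken down by cause in the two follow-up tables (7 by what went wrong in PacBio, 8 by visual classification).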


Pacific Biosciences, Ion Torrent and MiSeq have good potential for validation experiments

               sensitivity   specificity   PPV     NPV
Ion (bwa-sw)   96.2%         100%          100%    54.5%
Ion (tmap)     96.2%         100%          100%    54.5%
MiSeq          98.1%         92.3%         99.6%   70.5%
PacBio         98.1%         100%          100%    68.7%

Low specificity indicates artifactual calls outside the scope of the validation.

Ion Torrent has a low NPV but is good in most other metrics.
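A low NPV alongside high sensitivity and specificity is what one expects when most validated sites are true variants, because the few reference calls are then easily dominated by missed variants. The sketch below shows this dependence using the standard relation between NPV, sensitivity, specificity and prevalence; the prevalence values are illustrative assumptions, not figures from the talk.

```python
def npv_at_prevalence(sensitivity, specificity, prevalence):
    """Negative predictive value implied by test rates and variant prevalence."""
    true_neg = specificity * (1.0 - prevalence)    # reference sites called ref
    false_neg = (1.0 - sensitivity) * prevalence   # variant sites called ref
    return true_neg / (true_neg + false_neg)


# Ion Torrent rates from the table; prevalence values are hypothetical.
for prevalence in (0.50, 0.90, 0.95):
    value = npv_at_prevalence(0.962, 1.0, prevalence)
    print(f"prevalence={prevalence:.2f} -> NPV={value:.3f}")
```

With the same sensitivity and specificity, NPV falls steadily as the fraction of true variant sites in the validation set rises.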


Breast cancer validation experiment (cancer dataset)

            Illumina   Sequenom   PacBio   454
somatic     15         6          12       8
wildtype    0          6          1        0
unknown     0          3          2        7

• High coverage and high specificity to targets.

• Base qualities are severely underestimated.

• PacBio correctly identified a false positive in the original dataset (unknown in Sequenom and 454).


GATK performs very well for SNP discovery with PacBio data

                          MiSeq    HiSeq     PacBio
Gold standard SNP calls   222      225       197
Calls on HapMap           43       43        38
Sensitivity               99.1%    100.0%    87.6%

(discovery dataset)

• Reference bias (17) and lack of coverage (11) were the reasons for missed sites in PacBio.

• MiSeq missing data are due to mismapping/artifact (2) or low coverage (1).


Broad’s somatic mutation caller (muTect) successfully calls PacBio data

• One tumor/normal pair called:

• 6,459 sites examined

• 4,837 sites covered (14x/8x)

• 1 true somatic mutation called (previously validated)

• 0 false positives called

muTect is a GATK-based caller developed by the cancer group at the Broad Institute (https://confluence.broadinstitute.org/display/CGATools/MuTect)


PacBio data performs well with the GATK because...

• The errors are random (despite the high error rate).

• Such a non-systematic error mode is well handled by the GATK SNP calling mathematics.

• Very long reads make mapping very clear.

• Fewer mismappings of paralogous sequences.

• Structural variants are less prone to appear as SNPs.

PacBio’s reference bias is currently the major limiting factor.
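To make the point about random versus systematic errors concrete, here is a minimal sketch of a diploid genotype likelihood calculation. It assumes a flat per-base error rate spread evenly over the three other bases and no genotype prior; it is a simplified textbook model, not the GATK implementation, and both pileups are hypothetical.

```python
import math
from collections import Counter


def base_prob(observed, allele, error_rate):
    """P(observed base | true allele) under a uniform substitution error model."""
    return 1.0 - error_rate if observed == allele else error_rate / 3.0


def genotype_log_likelihoods(pileup, ref, alt, error_rate):
    """log10 likelihood of the pileup under hom-ref, het and hom-alt genotypes."""
    counts = Counter(pileup)
    genotypes = {"hom-ref": (ref, ref), "het": (ref, alt), "hom-alt": (alt, alt)}
    result = {}
    for name, (a1, a2) in genotypes.items():
        total = 0.0
        for base, n in counts.items():
            # Each read samples one of the two chromosomes with equal probability.
            p = 0.5 * base_prob(base, a1, error_rate) + 0.5 * base_prob(base, a2, error_rate)
            total += n * math.log10(p)
        result[name] = total
    return result


def call_genotype(pileup, ref, alt, error_rate):
    likelihoods = genotype_log_likelihoods(pileup, ref, alt, error_rate)
    return max(likelihoods, key=likelihoods.get)


# Hypothetical hom-ref site, 30 reads, 9 erroneous bases in both cases.
random_errors = "A" * 21 + "C" * 3 + "G" * 3 + "T" * 3   # errors scattered
systematic_errors = "A" * 21 + "C" * 9                    # errors pile onto one base

print(call_genotype(random_errors, "A", "C", error_rate=0.15))      # hom-ref
print(call_genotype(systematic_errors, "A", "C", error_rate=0.15))  # het
```

With the same number of wrong bases, scattered errors leave the hom-ref genotype most likely, while errors that all agree on one base look like a heterozygous variant.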


Future of the GATK

What is the GSA team working on right now (that will impact PacBio data analysis)?


From reads to alleles: the first frontier

• Can’t calculate a likelihood for a hypothesis you don’t consider.

• How do I know what genetic variant I’m looking at, given the read data alone?

- A SNP, an INDEL, an SV, or something else?

• General problem, but acute for medium-sized events and insertions.

Too systematic to be machine errors, but the haplotype for Pr{D|H} is unclear.

(Figure panels: Example 1, Example 2)


From reads to alleles: the next frontier

• Can’t calculate a likelihood for a hypothesis you don’t consider.

• How do I know what genetic variant I’m looking at, given each locus independently?

- A SNP, an INDEL, an SV, or something else?

• General problem, but acute for medium-sized events, as we not only miss the true event but also generate many smaller false events.

• Reference bias can be addressed with a haplotype approach.

Too systematic to be machine errors, but the haplotype for Pr{D|H} is unclear.

(Figure panels: Example 1, Example 2)


Using local de novo haplotype assembly via De Bruijn graphs

Assembly of large genomes using second-generation sequencing. Schatz. Genome Research. 2010.
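The idea of local assembly can be sketched with a toy De Bruijn graph: break reads into k-mers, link overlapping (k-1)-mers, and read candidate haplotypes off the paths between two anchor nodes. The sequences, the choice of k = 4 and the function names below are all hypothetical; this is an illustration of the technique, not the Haplotype Caller's code.

```python
from collections import defaultdict


def build_graph(sequences, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in any sequence."""
    graph = defaultdict(set)
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def enumerate_haplotypes(graph, source, sink, max_len):
    """Depth-first walk from source to sink, spelling each path out as a sequence."""
    haplotypes = []
    stack = [(source, source)]
    while stack:
        node, spelled = stack.pop()
        if node == sink:
            haplotypes.append(spelled)
            continue
        if len(spelled) >= max_len:  # guard against cycles and runaway paths
            continue
        for nxt in graph.get(node, ()):
            stack.append((nxt, spelled + nxt[-1]))
    return sorted(haplotypes)


# Toy reads: two support the reference, two support a 4 bp deletion of "TAAG".
reference = "ATGCCGTAAGCTTCAGG"
reads = ["ATGCCGTAAGCT", "GTAAGCTTCAGG", "ATGCCGCTTCA", "CCGCTTCAGG"]

graph = build_graph(reads, k=4)
haplotypes = enumerate_haplotypes(graph, source="ATG", sink="AGG", max_len=20)
print(haplotypes)
```

The graph branches where the reads disagree, so the deletion appears as a second full-length haplotype rather than as a cluster of mismatches against the reference.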


Example Mullikin het deletion we now call

chr4:336781 TTAAAAAAGTATTAAAAAAGTTCCTTGCATGA/-

(Figure panels: Original read data; Discovered haplotype)


Example Mullikin het insertion we now call

chr18:14937489 -/CCACTCCAGCCTCTGATGGACTGCAAGCTGGGTCT

(Figure panels: Original read data; Discovered haplotype)

Haplotype Caller greatly increases sensitivity to larger indel events over the Unified Genotyper

Mullikin
Caller              Variant sensitivity (strict)   Genotype concordance (strict)
Unified Genotyper   51.9% (40 / 77)                51.9% (40 / 77)
Haplotype Caller    90.9% (70 / 77)                89.6% (69 / 77)

Mills
Caller              Variant sensitivity (strict)   Genotype concordance (strict)
Unified Genotyper   49.0% (97 / 198)               49.0% (97 / 198)
Haplotype Caller    81.8% (162 / 198)              81.8% (162 / 198)

• Input data is NA12878 b37+decoy WGS HiSeq high coverage.

• Sites chosen to be very difficult (het) but high confidence in being real (require family transmission).

• Evaluation sets: Mullikin Fosmids and Mills et al, GR, 2011 (2x hit, double center).

• Large events (> 15 bp); largest is 106 bp (which we don’t yet call).


A new BQSR that also recalibrates “indel qualities”

[Figure: two panels, labeled “AAAAA” and “AATCG”. x-axis: context suffix (trinucleotides AAA through TTT); y-axis: empirical gap open penalty (25 to 55); one series each for 20FUK.2 to 20FUK.8 and 20GAV.1 to 20GAV.6.]

There is a significant difference in the empirical probability of starting an insertion or deletion depending on context.
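One way to picture a context-dependent indel quality is to tabulate, per sequence context, how often a gap was opened and convert that rate to a Phred-scaled penalty. The sketch below does this on synthetic counts; the pseudocount smoothing and the counts themselves are our assumptions and are not taken from the BQSR implementation.

```python
import math
from collections import defaultdict


def empirical_qualities(observations, pseudocount=1):
    """Phred-scaled empirical quality per context.

    observations: iterable of (context, is_error) pairs, one per base examined.
    pseudocount: smoothing term (an assumption) so empty cells stay finite.
    """
    errors = defaultdict(int)
    totals = defaultdict(int)
    for context, is_error in observations:
        totals[context] += 1
        errors[context] += int(is_error)
    qualities = {}
    for context, n in totals.items():
        rate = (errors[context] + pseudocount) / (n + 2 * pseudocount)
        qualities[context] = -10.0 * math.log10(rate)
    return qualities


# Synthetic data: gap openings are far more frequent after a homopolymer context.
observations = (
    [("AAAAA", True)] * 10 + [("AAAAA", False)] * 990
    + [("AATCG", True)] * 1 + [("AATCG", False)] * 9999
)
qualities = empirical_qualities(observations)
for context in sorted(qualities):
    print(context, round(qualities[context], 1))
```

A lower empirical quality for a context means an insertion or deletion observed there is more likely to be a machine error, so it should carry less weight as evidence for a real indel.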

other improvements

• “auto-recalibration” mode for organisms without known callsets

• improved covariate models

• simpler command line pipeline with a single tool instead of three.


[Figures: recalibration report plots, each with panels for Base Substitution, Base Insertion and Base Deletion, comparing “Recalibrated” and “newRecalibrator”. Plots shown: empirical quality score vs. reported quality score; quality score accuracy vs. cycle covariate; quality score accuracy vs. context covariate; number of observations vs. QualityScore covariate; mean quality score vs. cycle covariate; mean quality score vs. context covariate. Legend: log(nBases).]


Thank you!

Stay up to date with the GSA team through our wiki:

• the latest releases of our tools and version changelogs

• tutorials on our best practices for data processing and analysis

• further information on how to use the GATK engine for your own research or to collaborate with us

http://www.broadinstitute.org/gsa/wiki/index.php

