Best practices for Variant Calling with Pacific Biosciences data

Mauricio Carneiro, Ph.D.
Mark DePristo, Ph.D.

Genome Sequence and Analysis
Medical and Population Genetics

Wednesday, February 15, 2012
Slides: mauriciocarneiro.github.io/talks/20120215-agbt.pdf


The Current Pipeline

General best practice data processing and variant calling using the GATK


Our framework for variation discovery

[Figure: three-phase framework diagram.]

Phase 1: NGS data processing (typically by lane)
  Input: raw reads
  Steps: Mapping → Local realignment → Duplicate marking → Base quality recalibration
  Output: analysis-ready reads

Phase 2: Variant discovery and genotyping (typically multiple samples simultaneously, but can be a single sample alone)
  Input: Sample 1 reads ... Sample N reads
  Steps: SNPs, Indels, Structural variation (SV)
  Output: raw variants (raw SNPs, raw indels, raw SVs)

Phase 3: Integrative analysis
  Input: raw variants plus external data (pedigrees, known variation, known genotypes, population structure)
  Steps: Variant quality recalibration, Genotype refinement
  Output: analysis-ready variants

DePristo, M., Banks, E., Poplin, R. et al. (2011) A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet.


Mapping and alignment algorithms

Finding the true origin of each read is a computationally demanding first step.

[Figure: an enormous pile of short reads from NGS is mapped to Regions 1, 2 and 3 of the reference genome. The aligner detects the correct origin of reads and flags them with high certainty; it detects ambiguity in the origin of other reads and flags them as uncertain.]

For more information see: Li and Homer (2010). A survey of sequence alignment algorithms for next-generation sequencing. Briefings in Bioinformatics.

(Phase 1: NGS data processing)


Accurate read alignment through multiple sequence local realignment

[Figure: Effect of MSA on alignments, NA12878, chr1:1,510,530-1,510,589. Panels: 1,000 Genomes Pilot 2 data, raw MAQ alignments; 1,000 Genomes Pilot 2 data, after MSA; HiSeq data, raw BWA alignments; HiSeq data, after MSA. Marked sites: rs28782535, rs28783181, rs28788974, rs34877486.]

DePristo, M., Banks, E., Poplin, R. et al. (2011) A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet.

(Phase 1: NGS data processing)


Accurate error modeling with base quality score recalibration

[Figure (Ryan Poplin): base quality accuracy before and after recalibration, one column of panels per sequencing platform. Row 1 plots Empirical Quality against Reported Quality (both 0 to 40). Row 2 plots Accuracy (Empirical − Reported Quality, −10 to 10) against Machine Cycle, with first-of-pair and second-of-pair reads shown separately for paired data. Row 3 plots Accuracy (Empirical − Reported Quality, −10 to 10) against Dinucleotide (axis labels AA, AG, CA, CG, GA, GG, TA, TG). Each panel reports the RMSE of the original and of the recalibrated qualities.]

RMSE, original / recalibrated (columns in the order the platform labels appear on the slide):

Panel               Illumina/GenomeAnalyzer   Roche/454       Life/SOLiD      Illumina/HiSeq 2000
Reported quality    5.242 / 0.196             2.556 / 0.213   1.215 / 0.756   5.634 / 0.135
Machine cycle       2.207 / 0.186             1.784 / 0.136   1.688 / 0.213   2.609 / 0.089
Dinucleotide        2.598 / 0.052             2.169 / 0.135   1.656 / 0.088   2.469 / 0.083

(Phase 1: NGS data processing)

DePristo, M., Banks, E., Poplin, R. et al. (2011) A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet.
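The RMSE figures quoted for these plots summarize how far reported qualities sit from empirical ones. A minimal sketch of that summary follows, assuming an unweighted RMSE over quality bins; the slides do not say whether the GATK plots weight bins by base count, and the bin values below are hypothetical.

```python
import math


def rmse(reported, empirical):
    """Unweighted root-mean-square error between paired quality scores.

    Assumption: every bin counts equally. The plots on the slide may
    weight bins differently.
    """
    if len(reported) != len(empirical) or not reported:
        raise ValueError("need two non-empty sequences of equal length")
    squared = [(e - r) ** 2 for r, e in zip(reported, empirical)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical quality bins, for illustration only.
reported = [10, 20, 30, 40]
empirical_original = [8, 15, 24, 33]        # qualities over-reported
empirical_recalibrated = [10, 20, 29, 40]   # close to the diagonal

rmse_original = rmse(reported, empirical_original)
rmse_recalibrated = rmse(reported, empirical_recalibrated)
print(f"original RMSE = {rmse_original:.3f}")
print(f"recalibrated RMSE = {rmse_recalibrated:.3f}")
```

A well-calibrated dataset lies on the diagonal of the reported vs. empirical plot, which corresponds to an RMSE near zero.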


SNP and Indel calling is a large-scale Bayesian modeling problem

Bayesian model

•  Inference: what is the genotype G of each sample given read data D for each sample?
•  Calculate via Bayes' rule the probability of each possible G
•  Product expansion assumes reads are independent
•  Relies on a likelihood function to estimate probability of sample data given proposed haplotype

4.1 Simple genotype likelihoods

\[ \Pr\{G \mid D\} = \frac{\Pr\{G\}\,\Pr\{D \mid G\}}{\sum_i \Pr\{G_i\}\,\Pr\{D \mid G_i\}} \qquad \text{[Bayes' rule]} \]

Pr{G} is the prior of the genotype; Pr{D|G} is the likelihood of the genotype.

\[ \Pr\{D \mid G\} = \prod_j \left( \frac{\Pr\{D_j \mid H_1\}}{2} + \frac{\Pr\{D_j \mid H_2\}}{2} \right) \quad \text{where } G = H_1 H_2 \qquad \text{[diploid assumption]} \]

Pr{D|H} is the haploid likelihood function.

4.1.1 SNP haploid likelihood

\[ \Pr\{D_j \mid H\} = \Pr\{D_j \mid b\} \qquad \text{[single base pileup]} \]

\[ \Pr\{D_j \mid b\} = \begin{cases} 1 - \epsilon_j & D_j = b, \\ \epsilon_j & \text{otherwise.} \end{cases} \]

4.1.2 Indel haploid likelihood

\[ \Pr\{D_j \mid H\} = \sum_{\text{alignments } \pi \text{ of } D_j \text{ to } H} \Pr\{D_j, \pi\} \]

See http://www.broadinstitute.org/gsa/wiki/index.php/Unified_genotyper for more information


SNP genotype likelihoods

•  All diploid genotypes (AA, AC, …, GT, TT) considered at each base
•  Likelihood of genotype computed using only pileup of bases and associated quality scores at given locus
•  Only "good bases" are included: those satisfying minimum base quality, mapping read quality, pair mapping quality, NQS

4.2 Genotype likelihoods

\[ \Pr\{D_i \mid GT_i\} = \prod_j \Pr\{D_{i,j} \mid GT_i\} \]

\[ \Pr\{D_{i,j} \mid GT_i = AB\} = \frac{\Pr\{D_{i,j} \mid A\} + \Pr\{D_{i,j} \mid B\}}{2} \]

\[ \Pr\{D_{i,j} \mid B\} = \begin{cases} 1 - \epsilon_{i,j} & D_{i,j} = B, \\ \epsilon_{i,j} \cdot \Pr\{B \text{ is true} \mid D_{i,j} \text{ is miscalled}\} & \text{otherwise.} \end{cases} \]

See http://www.broadinstitute.org/gsa/wiki/index.php/Unified_genotyper for more information
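To make the diploid genotype likelihood concrete, here is a small sketch that scores all ten diploid genotypes from a single-locus pileup and normalizes them with Bayes' rule. It is not the Unified Genotyper implementation. Three things are assumptions for illustration: the per-base error probability is taken from the Phred quality, a miscalled base is equally likely to be any of the other three bases (so Pr{B is true | miscalled} = 1/3), and the genotype prior is flat.

```python
import math
from itertools import combinations_with_replacement

BASES = "ACGT"


def phred_to_error(q):
    """Error probability implied by a Phred-scaled base quality."""
    return 10 ** (-q / 10)


def base_likelihood(observed, allele, q):
    """Pr{D_ij | allele}. Assumption: miscalls are uniform over the 3 other bases."""
    eps = phred_to_error(q)
    return 1 - eps if observed == allele else eps / 3


def genotype_log_likelihoods(pileup):
    """log10 Pr{D | G} for every diploid genotype, treating reads as independent.

    pileup is a list of (base, quality) pairs with quality > 0.
    """
    result = {}
    for a, b in combinations_with_replacement(BASES, 2):
        log_l = 0.0
        for base, q in pileup:
            per_read = (base_likelihood(base, a, q) + base_likelihood(base, b, q)) / 2
            log_l += math.log10(per_read)
        result[a + b] = log_l
    return result


def genotype_posteriors(pileup, priors=None):
    """Pr{G | D} via Bayes' rule. Assumption: flat prior unless one is given."""
    log_l = genotype_log_likelihoods(pileup)
    if priors is None:
        priors = {g: 1 / len(log_l) for g in log_l}
    best = max(log_l.values())  # rescale to avoid underflow on deep pileups
    unnormalized = {g: priors[g] * 10 ** (log_l[g] - best) for g in log_l}
    total = sum(unnormalized.values())
    return {g: v / total for g, v in unnormalized.items()}


# Hypothetical pileup: five A and five G observations, all at Q20.
pileup = [("A", 20)] * 5 + [("G", 20)] * 5
posteriors = genotype_posteriors(pileup)
most_likely = max(posteriors, key=posteriors.get)
print(most_likely, round(posteriors[most_likely], 6))
```

Working in log10 space and rescaling by the best genotype keeps the product over reads from underflowing at the deep coverages typical of these datasets.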


Variant Quality Score Recalibration (VQSR): modeling error properties of real polymorphism to determine the probability that novel sites are real

The HapMap3 sites from NA12878 HiSeq calls are used to train the GMM. Shown here is the 2D plot of strand bias vs. the variant quality / depth for those sites.

Variants are scored based on their fit to the Gaussians. The variants (here just the novels) clearly separate into good and bad clusters.

DePristo, M., Banks, E., Poplin, R. et al. (2011) A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet.


These methods are available in the Genome Analysis Toolkit (GATK)

Genome Analysis Toolkit (GATK)
•  Open-source map/reduce programming framework for developing analysis tools for next-gen sequencing data
•  Easy-to-use, CPU and memory efficient, automatically parallelizing Java engine
http://www.broadinstitute.org/gsa/wiki/
McKenna et al. (2010) The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res.

1000 Genomes GATK tools
•  Most Broad Institute tools for the 1000 Genomes have been developed in the GATK
•  Indel realignment, Base quality score recalibration, Unified Genotyper, VQSR, Variant Eval, and many other analysis tools

SAM/BAM format
•  Technology agnostic, binary, indexed, portable and extensible file format for NGS reads
•  Also used in the Broad production pipeline
http://samtools.sourceforge.net/

VCF format
•  Standard and accessible format for storing population variation and individual genotypes
http://vcftools.sourceforge.net/


PacBio Processing Pipeline

How we apply our pipeline to Pacific Biosciences data
A step-by-step tutorial


Our framework for variation discovery

[Figure: the three-phase framework diagram shown earlier (Phase 1: NGS data processing; Phase 2: Variant discovery and genotyping; Phase 3: Integrative analysis), annotated for PacBio data.]

Annotations on the diagram:
- Not evaluated yet on PacBio data due to the small size of the datasets.
- Currently the GATK cannot perform indel realignment due to the high indel error rate and the long reads of Pacific Biosciences.

DePristo, M., Banks, E., Poplin, R. et al. (2011) A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet.


PacBio Processing Pipeline

1. We start our processing pipeline with the filtered_subreads.fasta file produced by PacBio software and turn it into a FASTQ file using SMRT Pipeline scripts provided by PacBio.
2. Mapping and alignment are done using BWA with a heuristic Smith-Waterman algorithm (bwa-sw).
3. We sort the BAM file and add read group and sample information using Picard Tools: SortSam and AddOrReplaceReadGroups.
4. We recalibrate base qualities using the GATK's Base Quality Score Recalibration framework.

[Flow diagram: FASTA → FASTQ → BWA-SW → SAM → Picard → BAM → Base Quality Score Recalibration → Analysis Ready BAM]
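Steps 2 and 3 above can be sketched as a short command plan. This is an illustration, not the authors' script: the intermediate file names, read group values and the `PACBIO` platform tag are hypothetical, and the Picard invocations assume the per-tool jar layout (`SortSam.jar`, `AddOrReplaceReadGroups.jar`) of Picard releases from that period. The sketch only builds and prints the commands; it does not run them.

```python
import shlex


def pacbio_preprocessing_commands(fastq, reference, sample,
                                  read_group="rg1", library="lib1",
                                  platform_unit="unit1"):
    """Build (argv, stdout_file) pairs for mapping, sorting and read groups.

    All output file names and read group values are hypothetical.
    """
    sam = "aligned.sam"
    sorted_bam = "aligned.sorted.bam"
    rg_bam = "aligned.sorted.rg.bam"
    return [
        # Step 2: bwa-sw writes SAM to standard output.
        (["bwa", "bwasw", reference, fastq], sam),
        # Step 3a: coordinate-sort and convert to BAM with Picard SortSam.
        (["java", "-jar", "SortSam.jar",
          f"INPUT={sam}", f"OUTPUT={sorted_bam}", "SORT_ORDER=coordinate"], None),
        # Step 3b: attach read group and sample information.
        (["java", "-jar", "AddOrReplaceReadGroups.jar",
          f"INPUT={sorted_bam}", f"OUTPUT={rg_bam}",
          f"RGID={read_group}", f"RGLB={library}", "RGPL=PACBIO",
          f"RGPU={platform_unit}", f"RGSM={sample}"], None),
    ]


commands = pacbio_preprocessing_commands(
    "filtered_subreads.fastq", "reference.fasta", "NA12878")
for argv, stdout_file in commands:
    line = " ".join(shlex.quote(arg) for arg in argv)
    if stdout_file is not None:
        line += " > " + shlex.quote(stdout_file)
    print(line)
```

The resulting BAM would then go through the recalibration commands covered on the following slides.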


Why do we align with BWA and not BLASR?

• BWA is the standard aligner in the Broad's sequencing platform.

• BLASR is still responsible for generating the filtered sub-reads.

• With recent updates, BLASR-generated BAM files are a reasonable alternative for this step of the pipeline.

- The optional pipeline starts with a BLASR-generated BAM (skipping the BWA and Picard steps).

- Read group information and BQSR are still required steps.

- Works well, but generally with a smaller yield.

- We anticipate that further development of BLASR-generated BAMs could improve this alternate pipeline in the future.


[Figure: two alignment panels compared. Total mapped coverage: 74,735,274 bp in one panel, 19,562,290 bp in the other.]

Strict BLASR filtering reduces yield and eliminates the longer reads.

Aggressive BLASR clipping turns longer reads into "short" reads.


Introduction to Base Quality Score Recalibration

Sequencers provide estimates of error rate per nucleotide
… but they aren't very accurate
… and they aren't very informative

[Figure: two pairs of plots joined by a "Recalibration" arrow. Each pair shows a reported quality score histogram (Count against quality score, 0 to 40) and a plot of reported vs. empirical quality scores (Empirical quality score against Reported quality score, both 0 to 40).]


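Recalibration workflow

[Workflow diagram]
Original BAM file → CountCovariates (dbSNP / known sites necessary) → Covariates table (.csv)
Covariates table (.csv) → AnalyzeCovariates → Pre-recalibration analysis plots
Original BAM file + Covariates table (.csv) → TableRecalibration → Recalibrated BAM file
Recalibrated BAM file → CountCovariates → Recalibrated covariates table (.csv)
Recalibrated covariates table (.csv) → AnalyzeCovariates → Post-recalibration analysis plots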


Running CountCovariates

    java -Xmx4g -jar GenomeAnalysisTK.jar \
        -R reference.fasta \
        -D dbsnp.vcf \
        -I original.bam \
        -T CountCovariates \
        -cov ReadGroupCovariate \
        -cov QualityScoreCovariate \
        -cov DinucCovariate \
        -cov CycleCovariate \
        -recalFile table.recal_data.csv

-D: a list of known polymorphic sites is necessary so that these sites do not count against the base mismatch rate
-cov: list of covariates to be used in the recalibration calculation
-recalFile: CSV file containing covariate counts

Table recalibration file (table.recal_data.csv):

    # Counted Bases 143745620
    ReadGroup,QualityScore,Dinuc,Cycle,nObservations,nMismatches,Qempirical
    SRR001802,2,AA,-8,165,17,10
    SRR001802,2,AA,-2,91,10,10
    SRR001802,2,AA,3,5,4,1
    SRR001802,2,AA,4,9,4,4
    SRR001802,2,AA,7,12,4,5

See http://www.broadinstitute.org/gsa/wiki/index.php/Base_quality_score_recalibration for more information
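The Qempirical column in the table above can be reproduced from the two count columns. The sketch below parses the five example rows and recomputes the empirical quality as a rounded Phred score of the plain mismatch rate. That estimator is an assumption: it matches these five rows, but GATK's actual calculation may add smoothing or caps that these rows do not reveal.

```python
import csv
import math

# The example rows from the slide, copied verbatim.
RECAL_CSV = """# Counted Bases 143745620
ReadGroup,QualityScore,Dinuc,Cycle,nObservations,nMismatches,Qempirical
SRR001802,2,AA,-8,165,17,10
SRR001802,2,AA,-2,91,10,10
SRR001802,2,AA,3,5,4,1
SRR001802,2,AA,4,9,4,4
SRR001802,2,AA,7,12,4,5
"""


def empirical_quality(n_observations, n_mismatches):
    """Phred-scaled mismatch rate, rounded to the nearest integer.

    Assumption: no pseudocounts or quality cap are applied.
    """
    if n_mismatches == 0:
        raise ValueError("mismatch rate of zero has no finite Phred score")
    return round(-10 * math.log10(n_mismatches / n_observations))


def read_rows(text):
    """Parse the covariates table, skipping '#' comment lines."""
    lines = [ln for ln in text.splitlines() if ln and not ln.startswith("#")]
    return list(csv.DictReader(lines))


rows = read_rows(RECAL_CSV)
recomputed = [
    empirical_quality(int(r["nObservations"]), int(r["nMismatches"]))
    for r in rows
]
for row, q in zip(rows, recomputed):
    print(row["Cycle"], row["Qempirical"], q)
```

Every row here carries a reported quality of 2, while the empirical values range from 1 to 10, which is the kind of gap recalibration is meant to close.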


Running TableRecalibration

    java -Xmx4g -jar GenomeAnalysisTK.jar \
        -R Homo_sapiens_assembly18.fasta \
        -I original.bam \
        -T TableRecalibration \
        -recalFile table.recal_data.csv \
        -outputBam recal.bam

-recalFile: table recalibration file from the CountCovariates step
-outputBam: the full recalibrated BAM file, a recalibrated copy of the original BAM file

See http://www.broadinstitute.org/gsa/wiki/index.php/Base_quality_score_recalibration for more information


Running AnalyzeCovariates

    java -Xmx4g -jar AnalyzeCovariates.jar \
        -outputDir /path/to/output_dir/ \
        -resources resources/ \
        -recalFile table.recal_data.csv

AnalyzeCovariates.jar: a separate .jar file distributed with the GATK
-outputDir: the directory in which to place the output analysis plots
-resources: points to the GATK installation's directory of R scripts, which are used for plotting the data
-recalFile: table recalibration file from either the before or after CountCovariates step

Output: many plots of base quality versus each covariate

See http://www.broadinstitute.org/gsa/wiki/index.php/Base_quality_score_recalibration for more information


The Pacbio Processing Pipeline is available for educational purposes

(but not supported)

Queue is part of the GATK and is a pipeline manager used internally at the Broad in most analysis projects

(see http://www.broadinstitute.org/gsa/wiki/index.php/Queue)

java -Xmx4g -jar Queue.jar -S PacbioProcessingPipeline.scala -i filtered_subreads.fastq -D dbSNP.vcf -R reference.fasta -run

or blasr.bam with the extra -blasr option


Calling SNPs and indels using PacBio data with the Unified Genotyper

    java -Xmx4g -jar GenomeAnalysisTK.jar \
        -T UnifiedGenotyper \
        -I input.recal.bam \
        -R reference.fasta \
        -D dbsnp.vcf \
        -deletions 0.5 \
        -o myCalls.vcf \
        -mbq 10

-deletions 0.5: allows sites with 50% deletions to be analyzed
-mbq 10: a minimum base quality of 10 calibrates the UG for PacBio data (average base quality is 20)

The ideal deletions and minimum base quality parameters for this specific dataset were determined systematically by measuring sensitivity/specificity to known variant calls in NA12878.
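The effect of the two thresholds can be illustrated with a toy pileup filter. This is a sketch of the idea, not the Unified Genotyper's code: the pileup representation, the `"D"` deletion marker and the exact boundary behaviour (a site at exactly the threshold is kept) are assumptions.

```python
DELETION = "D"  # hypothetical marker for a read that spans the site with a deletion


def deletion_fraction(pileup):
    """Fraction of reads at the site that carry a deletion."""
    if not pileup:
        return 0.0
    return sum(1 for base, _ in pileup if base == DELETION) / len(pileup)


def callable_bases(pileup, max_deletion_fraction=0.5, min_base_quality=10):
    """Return the bases usable for genotyping, or None if the site is skipped.

    Mirrors the roles of -deletions and -mbq: the site is skipped when too
    many reads carry a deletion, and low-quality bases are dropped otherwise.
    """
    if deletion_fraction(pileup) > max_deletion_fraction:
        return None
    return [(base, q) for base, q in pileup
            if base != DELETION and q >= min_base_quality]


# Hypothetical pileups as (base, quality) pairs.
mostly_bases = [("A", 20), ("A", 8), (DELETION, 0), ("G", 15)]
mostly_deletions = [(DELETION, 0)] * 3 + [("A", 30)]

print(callable_bases(mostly_bases))
print(callable_bases(mostly_deletions))
```

With a high indel error rate, a strict deletion threshold would discard many genuine sites, which is why a permissive value such as 0.5 is useful here.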


Analyzing PacBio data

More information is available at the AGBT poster session (presentation Thursday, 1:10 - 2:40pm)


A quick look at Pacific Biosciences data

Discovery dataset.

[Figure: read alignments.] Indels are the primary error mode (all purple markers). Notice the SNP.

[Figure: bar chart of error rate (axis 0% to 15%) for insertions, deletions and mismatches.]


Long reads and deep coverage on all PacBio datasets

Dataset                  Average coverage     Number of reads
discovery                120x                 36,918
validation               104x                 305,581
cancer (breast cancer)   ~120x per sample     89,934 per sample
1000G                    ~500x per sample     256,989 per sample


Sequencing bias is a known problem with NGS technologies that PacBio does not share

[Figure: normalized coverage by GC content contrasted with GC content of the genome, for E. coli, R. sphaeroides and P. falciparum.]

Come to Michael Ross' talk on Tuesday @ 7pm for a more thorough exploration of bias in the different sequencing technologies today.


The GATK Bayesian model much prefers the random error profile of PacBio to systematic errors

[Figure: the same genome region in two datasets, one panel labeled SYSTEMATIC ERROR and the other RANDOM ERROR. Dataset label: phasing dataset.]


Long reads with a high indel error rate have a side effect: reference bias

[Figure: histogram "Allele balances for known variants in PacBio". x-axis: Alternate allele fraction (0.0 to 1.0); y-axis: Number of heterozygous sites (0 to 80).]

Current tools are not capable of locally realigning PacBio data, but we anticipate that newer tools will improve this issue.
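Reference bias of this kind shows up as alternate allele fractions at known heterozygous sites that fall below the 0.5 expected from an unbiased diploid sample. A minimal sketch of how such an allele balance histogram could be computed follows; the read counts are hypothetical and the ten-bin layout is an arbitrary choice.

```python
def alt_allele_fraction(ref_count, alt_count):
    """Fraction of reads at a site that support the alternate allele."""
    total = ref_count + alt_count
    if total == 0:
        raise ValueError("site has no reads")
    return alt_count / total


def histogram(fractions, n_bins=10):
    """Count fractions in equal-width bins over [0, 1]; 1.0 goes in the last bin."""
    counts = [0] * n_bins
    for f in fractions:
        index = min(int(f * n_bins), n_bins - 1)
        counts[index] += 1
    return counts


# Hypothetical (ref_count, alt_count) pairs at known heterozygous sites.
sites = [(12, 11), (15, 6), (18, 5), (20, 4), (10, 10)]
fractions = [alt_allele_fraction(ref, alt) for ref, alt in sites]
mean_fraction = sum(fractions) / len(fractions)

print(histogram(fractions))
print(f"mean alternate allele fraction = {mean_fraction:.3f}")
```

A mean clearly below 0.5, as in this made-up example, is the signature that the slide's histogram is meant to expose.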


True variation missed by PacBio due to reference bias

[Figure: read alignments from the 1000G dataset, with "C" INSERTIONS marked.]

The alternate allele is hiding inside the insertions due to the low gap open penalty of the aligner.


The Base Quality Score framework does not account for indel errors

PacBio produces Q20 bases on average across datasets

[Figure: three reported quality score histograms, Number of Bases against Reported quality score (0 to 50), one per dataset.]

Dataset       Histogram entropy
discovery     2.178
validation    2.718
1000G         2.769
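The entropy quoted with each histogram measures how spread out the reported quality scores are: a histogram concentrated on a few values has low entropy and carries little information per base. Below is a minimal sketch of a Shannon entropy calculation over histogram counts. The logarithm base is an assumption (base 2 here), since the slides do not state which base the plotting script uses, and the counts are hypothetical.

```python
import math


def shannon_entropy(counts, base=2):
    """Shannon entropy of a histogram given as a list of bin counts.

    Assumption: base-2 logarithm by default; empty bins contribute nothing.
    """
    total = sum(counts)
    if total <= 0:
        raise ValueError("histogram is empty")
    entropy = 0.0
    for count in counts:
        if count > 0:
            p = count / total
            entropy -= p * math.log(p, base)
    return entropy


# Hypothetical histograms of reported quality scores.
uniform = [5, 5, 5, 5]      # four equally common quality values
single_value = [10, 0, 0]   # every base reported at the same quality

print(shannon_entropy(uniform))
print(shannon_entropy(single_value))
```

Comparing the entropy before and after recalibration gives a single number for how much more informative the quality scores have become.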


[Figure: two scatter plots of empirical minus reported quality (y-axis, −10 to 10) against cycle (x-axis, 0 to 2500). First plot: RMSE_good = 7.196, RMSE_all = 7.211. Second plot: RMSE_good = 0.559, RMSE_all = 0.877. Technology labels: SLX GA, 454, SOLiD, Complete Genomics, HiSeq, PacBio.]

PacBio base qualities are not affected by the length of the read

before recalibration:

• Even before recalibration, PacBio reads do not seem to be affected by the length of the read the way other technologies are.

• The steady straight line breaks after 1250 bp because we have very few reads that go that long (hence the light blue colored dots).

after recalibration:

• Recalibration helps make the straight line denser and clearer.

• The lack of data points still breaks the recalibrated line after 1250 bp.

discovery dataset
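The plots on this slide compare empirical and reported quality per cycle and summarize the gap as an RMSE. A minimal sketch of that kind of calculation follows. The input layout, the +1/+2 smoothing of the empirical error rate, and the reading of "RMSE_good" as "only cycles with enough observations" are all assumptions; the slides do not define them.

```python
import math


def empirical_quality(errors, observations):
    """Phred-scaled empirical quality with +1/+2 smoothing (assumed) to avoid log10(0)."""
    return -10 * math.log10((errors + 1) / (observations + 2))


def rmse(cycle_bins, min_observations=0):
    """Root mean squared (empirical - reported) quality over cycle bins.

    cycle_bins: list of (reported_quality, errors, observations), one per cycle.
    Bins with fewer than min_observations observations are skipped.
    """
    squared = []
    for reported_quality, errors, observations in cycle_bins:
        if observations < min_observations:
            continue
        diff = empirical_quality(errors, observations) - reported_quality
        squared.append(diff * diff)
    if not squared:
        return float("nan")
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical bins: a well-covered cycle and a sparsely covered late cycle.
bins = [(20, 9, 998), (30, 0, 8)]
print(rmse(bins))                        # all cycles
print(rmse(bins, min_observations=100))  # well-covered cycles only
```

Sparse late cycles, like those beyond 1250 bp here, are exactly the bins where such an estimate becomes noisy.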

Page 32:

PacBio variation discovery and validation

Validating hard-to-call sites and a look at variation discovery using PacBio

Page 33:

How can we use PacBio data for human analysis?

• Is PacBio a good platform for follow-up validation today?

• Can we do SNP discovery with PacBio data?

• How does PacBio compare to other technologies?

Page 34:

Data and Definitions

• We have performed a number of experiments at the Broad using PacBio for human data analysis.

- discovery dataset (12/23/2010): 61 amplicons covering 177 kb from regions across chromosome 20 of NA12878 (1000G sample).

- validation dataset (1/20/2011): a set of hard-to-call NA12878 SNPs targeted with 2 kbp amplicons.

- breast cancer dataset (6/17/2011): 24 samples for tumor/normal validation analysis of 15 events against HiSeq, 454 and Sequenom.

- 1000G dataset (8/25/2011): 8 samples resequenced at 250 sites for follow-up validation against Illumina, Sanger and Sequenom.

Page 35:

PacBio as a validation tool

• Follow-up validation is a major unmet need at the Broad and other centers.

• We carried out a follow-up validation assay using the de novo mutations previously validated by the 1000G project.

• Some are real de novo mutations

• Most are machine artifacts already identified by follow up validation in 1000G.

• These are hard-to-call sites that are prone to errors and really challenge sequence technology accuracy.

Page 36:

PacBio demonstrates great performance on hard-to-call sites

PacBio        known true variant site   known false variant site   predictive value
called alt    48                        5                          91%
called ref    0                         67                         100%

HiSeq         known true variant site   known false variant site   predictive value
called alt    48                        35                         58%
called ref    0                         37                         100%

Positive predictive value (PPV), or precision rate, is the proportion of subjects with positive test results who are correctly diagnosed.

Negative predictive value (NPV) is the proportion of subjects with a negative test result who are correctly diagnosed.

same sites on both tests

validation dataset
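The predictive values in the PacBio and HiSeq tables on this slide follow directly from the four counts in each table. The short sketch below recomputes them; the function and key names are illustrative, and the counts are copied from the tables.

```python
def positive_predictive_value(alt_at_true_sites, alt_at_false_sites):
    """Fraction of alt calls that fall on known true variant sites."""
    return alt_at_true_sites / (alt_at_true_sites + alt_at_false_sites)


def negative_predictive_value(ref_at_false_sites, ref_at_true_sites):
    """Fraction of ref calls that fall on known false variant sites."""
    return ref_at_false_sites / (ref_at_false_sites + ref_at_true_sites)


# Counts copied from the two tables on this slide.
tables = {
    "PacBio": {"alt_true": 48, "alt_false": 5, "ref_true": 0, "ref_false": 67},
    "HiSeq": {"alt_true": 48, "alt_false": 35, "ref_true": 0, "ref_false": 37},
}

for platform, t in tables.items():
    ppv = positive_predictive_value(t["alt_true"], t["alt_false"])
    npv = negative_predictive_value(t["ref_false"], t["ref_true"])
    print(f"{platform}: PPV = {ppv:.0%}, NPV = {npv:.0%}")
```

This prints a PPV of 91% for PacBio and 58% for HiSeq, with an NPV of 100% for both, matching the tables.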

Page 37:

Pacbio performs well in “apples to apples” comparison with MiSeq data

PacBio        known true variant site   known false variant site   predictive value
called alt    37                        1                          97%
called ref    1                         59                         98%

MiSeq         known true variant site   known false variant site   predictive value
called alt    38                        5                          88%
called ref    0                         55                         100%

Site missed due to reference bias

Both technologies miscalled this site with the same “wrong” allele that is reported in our gold standard callset, making the truth status of this site questionable (possible Sanger trace error)

same sites on both tests

validation dataset

4 sites missed due to systematic error (probably misalignment)

Page 38:

1000G project validation experiment

• First we used Sequenom to validate 300 well-behaving SNP sites chosen to be polymorphic in at least 1 out of 8 specific samples from Illumina low pass data.

- Sequenom is the current standard validation tool at the Broad.

• Sequenom only had data for 250 sites.

• We used PacBio to validate all 300 sites and looked at the agreement between Sequenom and Pacbio.

Page 39:

Pacbio adds valuable information to Sequenom validation

               Pacbio ALT   Pacbio REF
Sequenom ALT   218          7
Sequenom REF   8            12

Result Pacbio                    No. occurrences   what went wrong
good sequencing                  1                 Sequenom was wrong
Alt allele placed on insertion   4                 Pacbio Reference Bias
No coverage                      1                 Reads actually didn’t belong at location
Wrong ALT allele called          1                 UG triallelic issue

Visual classification             Result from Pacbio
6 look incredibly good            5 ALTs, 1 Reference Bias
1 bad mapping quality             ALT
1 has nearby deletion (unclear)   Reads actually didn’t belong at location

50 sites not called by Sequenom: many sites were ALT, others mismapped

1000G dataset
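For the sites where both Sequenom and PacBio produced a call, the agreement rate can be read off the 2x2 table on this slide. The sketch below does that arithmetic; the variable names are illustrative and the four counts are copied from the table (they sum to 245 sites).

```python
# Cell counts copied from the Sequenom vs PacBio table on this slide.
both_alt = 218
sequenom_alt_pacbio_ref = 7
sequenom_ref_pacbio_alt = 8
both_ref = 12


def concordance(agree_alt, agree_ref, disagree_a, disagree_b):
    """Fraction of jointly called sites where the two assays agree."""
    total = agree_alt + agree_ref + disagree_a + disagree_b
    return (agree_alt + agree_ref) / total


rate = concordance(both_alt, both_ref, sequenom_alt_pacbio_ref, sequenom_ref_pacbio_alt)
print(f"Sequenom/PacBio concordance: {rate:.1%}")
```

The 15 discordant sites are the ones broken down in the two follow-up tables on this slide.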

Page 40:

Pacific Biosciences, Ion Torrent and MiSeq have good potential for validation experiments

               sensitivity   specificity   PPV     NPV
Ion (bwa-sw)   96.2%         100%          100%    54.5%
Ion (tmap)     96.2%         100%          100%    54.5%
MiSeq          98.1%         92.3%         99.6%   70.5%
PacBio         98.1%         100%          100%    68.7%

Low specificity indicates artifactual calls outside the scope of the validation

Ion Torrent has a low NPV but is good in most other metrics.

Page 41:

cancer dataset

           Illumina   Sequenom   Pacbio   454
somatic    15         6          12       8
wildtype   0          6          1        0
unknown    0          3          2        7

high coverage and high specificity to targets

breast cancer validation experiment

base qualities are severely under calculated

Pacbio correctly identified a false positive in the original dataset (unknown in Sequenom and 454)

Page 42:

GATK performs very well for SNP discovery with PacBio data

                          MiSeq    HiSeq    PacBio
Gold Standard SNP calls   222      225      197
calls on HapMap           43       43       38
Sensitivity               99.1%    100.0%   87.6%

discovery

• Reference bias (17) and lack of coverage (11) were the reasons for missed sites in Pacbio

• MiSeq missing data are due to mismapping/artifact (2) or low coverage (1).

Page 43:

Broad’s somatic mutation caller (muTect) successfully calls PacBio data

• One tumor/normal pair called:

• 6,459 sites examined

• 4,837 sites covered (14x/8x)

• 1 true somatic mutation called (previously validated)

• 0 False Positives called

muTect is a GATK-based caller developed by the cancer group at the Broad Institute (https://confluence.broadinstitute.org/display/CGATools/MuTect)

Page 44:

PacBio data performs well with the GATK because...

• The errors are random (despite the high error rate).

• Such a non-systematic error mode is well handled by the GATK SNP calling mathematics.

• Very long reads make mapping very clear.

• Fewer mismappings of paralogous sequences.

• Structural variants are less prone to appear as SNPs.

Pacbio’s reference bias is currently the major limiting factor
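The point about random errors can be illustrated with a toy diploid genotype likelihood. If errors are independent across reads, a few scattered alternate bases are much better explained by error than by a variant, while a consistent half of the reads showing the alternate allele is best explained by a het site. The sketch below is a deliberately simplified model, not the GATK's actual implementation: it assumes a single error rate for all bases and that a miscalled base always shows the other allele.

```python
import math


def genotype_log_likelihoods(n_ref, n_alt, error_rate):
    """log10 P(reads | genotype) for hom_ref, het and hom_alt under independent errors."""
    result = {}
    for genotype, alt_fraction in (("hom_ref", 0.0), ("het", 0.5), ("hom_alt", 1.0)):
        # Probability that a single read shows the alt base under this genotype.
        p_alt = alt_fraction * (1 - error_rate) + (1 - alt_fraction) * error_rate
        p_ref = 1 - p_alt
        result[genotype] = n_alt * math.log10(p_alt) + n_ref * math.log10(p_ref)
    return result


def most_likely_genotype(n_ref, n_alt, error_rate):
    """Genotype with the highest likelihood."""
    likelihoods = genotype_log_likelihoods(n_ref, n_alt, error_rate)
    return max(likelihoods, key=likelihoods.get)


# Hypothetical pileups at a 15% per-base error rate.
print(most_likely_genotype(n_ref=18, n_alt=2, error_rate=0.15))   # scattered errors
print(most_likely_genotype(n_ref=10, n_alt=10, error_rate=0.15))  # consistent alt support
```

Systematic errors break the independence assumption in this model, which is why a random error mode is easier to handle than a lower but systematic one.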

Page 45:

Future of the GATK

What is the GSA team working on right now (that will impact PacBio data analysis)

Page 46:

From reads to alleles: the first frontier

• Can’t calculate a likelihood for a hypothesis you don’t consider

• How do I know what genetic variant I’m looking at, given the read data alone?

– A SNP, an INDEL, an SV, or something else?

• General problem, but acute for medium-sized events and insertions

Too systematic to be machine errors, but the haplotype for Pr{D|H} is unclear

Example 1 Example 2

Page 47:

From reads to alleles: the next frontier

• Can’t calculate a likelihood for a hypothesis you don’t consider

• How do I know what genetic variant I’m looking at, given each locus independently?

– A SNP, an INDEL, an SV, or something else?

• General problem, but acute for medium-sized events as we not only miss the true event but also generate many smaller false events

• Reference bias can be addressed from a haplotype approach

Too systematic to be machine errors, but the haplotype for Pr{D|H} is unclear

Example 1 Example 2

Page 48:

Using local de novo haplotype assembly via De Bruijn graphs

Assembly of large genomes using second-generation sequencing. Schatz. Genome Research. 2010.
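As a rough picture of what local assembly with a De Bruijn graph involves, the sketch below breaks reads into k-mers, links each (k-1)-mer to the (k-1)-mers that follow it, and then spells out every path between a chosen start and end node as a candidate haplotype. This is a toy illustration under simplifying assumptions (error-free reads, no cycles, hand-picked start and end nodes, made-up sequences) and is not the Haplotype Caller's implementation.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def enumerate_haplotypes(graph, source, sink, max_length):
    """Spell every path from source to sink (depth-first), up to max_length bases."""
    haplotypes = []
    stack = [(source, source)]
    while stack:
        node, sequence = stack.pop()
        if node == sink:
            haplotypes.append(sequence)
            continue
        if len(sequence) >= max_length:
            continue
        for next_node in graph.get(node, ()):
            stack.append((next_node, sequence + next_node[-1]))
    return sorted(haplotypes)


# Made-up reference haplotype and an alternate haplotype with a 2 bp deletion ("TA").
reference = "ATGCGTACCTGA"
alternate = "ATGCGCCTGA"
reads = [reference[:8], reference[4:], alternate[:7], alternate[3:]]

graph = build_de_bruijn(reads, k=4)
for haplotype in enumerate_haplotypes(graph, source="ATG", sink="TGA", max_length=20):
    print(haplotype)
```

Both the reference and the deletion haplotype come out as full paths, so the deletion becomes an explicit hypothesis H whose likelihood Pr{D|H} can be evaluated, instead of a pile of small mismatches against the reference.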

Page 49:

Example Mullikin het deletion we now call

chr4:336781 TTAAAAAAGTATTAAAAAAGTTCCTTGCATGA/-

Original read data

Discovered haplotype

Page 50:

Example Mullikin het insertion we now call

chr18:14937489 -/CCACTCCAGCCTCTGATGGACTGCAAGCTGGGTCT

Original read data

Discovered haplotype

Page 51:

Haplotype Caller greatly increases sensitivity to larger indel events over the Unified Genotyper

                    Mullikin                                     Mills
Caller              Variant Sensitivity   Genotype Concordance   Variant Sensitivity   Genotype Concordance
                    (strict)              (strict)               (strict)              (strict)
Unified Genotyper   51.9% (40 / 77)       51.9% (40 / 77)        49.0% (97 / 198)      49.0% (97 / 198)
Haplotype Caller    90.9% (70 / 77)       89.6% (69 / 77)        81.8% (162 / 198)     81.8% (162 / 198)

• Input data is NA12878 b37+decoy WGS HiSeq high coverage

• Sites chosen to be very difficult (het) but high confidence in being real (require family transmission)

• Evaluation sets

• Mullikin Fosmids and Mills et al, GR, 2011 (2x hit, double center)

• Large events (> 15 bp), largest is 106 bp (which we don’t yet call)

Page 52:

A new BQSR that also recalibrates “indel qualities”

[Figure: two panels, “AAAAA context” and “AATCG context”, plotting empirical gap open penalty (y-axis, 25 to 55) against the three-base context suffix (x-axis, AAA through TTT). One series per legend entry: 20FUK.2, 20FUK.3, 20FUK.4, 20FUK.5, 20FUK.6, 20FUK.7, 20FUK.8, 20GAV.1, 20GAV.2, 20GAV.3, 20GAV.4, 20GAV.5, 20GAV.6.]

There is a significant difference in the empirical probability of starting an insertion or deletion depending on the context.

other improvements

• “auto-recalibration” mode for organisms without known callsets

• improved covariate models

• simpler command line pipeline with a single tool instead of three.
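The context dependence of indel starts described on this slide can be sketched as a simple tabulation: for each sequence context, count how often an insertion or deletion starts, and express that frequency as a Phred-scaled gap open penalty. The input format, the +1/+2 smoothing, and the example counts below are assumptions made for illustration; this is not the BQSR implementation.

```python
import math
from collections import Counter


def empirical_gap_open_penalties(observations):
    """Phred-scaled empirical gap open penalty per context.

    observations: iterable of (context, indel_started) pairs, one per aligned base.
    """
    totals = Counter()
    starts = Counter()
    for context, indel_started in observations:
        totals[context] += 1
        if indel_started:
            starts[context] += 1
    penalties = {}
    for context, n in totals.items():
        p_start = (starts[context] + 1) / (n + 2)  # smoothed estimate (assumed)
        penalties[context] = -10 * math.log10(p_start)
    return penalties


# Hypothetical counts: indels start far more often in the homopolymer context.
observations = (
    [("AAAAA", True)] * 9
    + [("AAAAA", False)] * 989
    + [("AATCG", False)] * 9998
)
for context, penalty in sorted(empirical_gap_open_penalties(observations).items()):
    print(f"{context}: empirical gap open penalty Q{penalty:.0f}")
```

A recalibrator that uses such a table can assign a lower gap open penalty in contexts like homopolymers, where indel errors are more frequent.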

Page 53:

[Figure: recalibration comparison (legend: Recalibrated, newRecalibrator; point shading: log(nBases)), with Base Substitution, Base Insertion and Base Deletion panels in each row. Row 1: empirical quality score against reported quality score (10 to 50). Row 2: quality score accuracy against the cycle covariate (−100 to 100). Row 3: quality score accuracy against the context covariate (AA through TTT).]

Page 54:

[Figure: recalibration comparison (legend: Recalibrated, newRecalibrator; point shading: log(nBases)), with Base Substitution, Base Insertion and Base Deletion panels in each row. Row 1: number of observations (0 to 1,400,000,000) against the QualityScore covariate (10 to 50). Row 2: mean quality score against the cycle covariate (−100 to 100). Row 3: mean quality score against the context covariate (AA through TTT).]

Page 55:

Thank you!

Stay up to date with the GSA team through our wiki

• the latest releases of our tools and version changelogs

• tutorials on our best practices for data processing and analysis

• further information on how to use the GATK engine for your own research or to collaborate with us

http://www.broadinstitute.org/gsa/wiki/index.php
