Marking duplicates: Removing non-independent observations

Marking duplicates - University of California, Los Angeles · 2017-02-24 · on-GATK Mark Duplicates & Sort (Picard) Var. Calling HC in ERC mode separately per variant type Variant

Mar 09, 2020

dariahiddleston
Transcript
Page 1: Marking duplicates

Removing non-independent observations

Page 2: You are here in the GATK Best Practices workflow for germline variant discovery

[Figure: workflow diagram in three phases, Data Pre-processing >> Variant Discovery >> Callset Refinement.
Data Pre-processing: Raw Reads; Map to Reference (BWA mem) and Mark Duplicates & Sort (Picard), labeled Non-GATK; Indel Realignment; Base Recalibration; Analysis-Ready Reads.
Variant Discovery: Analysis-Ready Reads; Var. Calling (HC in ERC mode); Genotype Likelihoods; Joint Genotyping; Raw Variants (SNPs, Indels).
Callset Refinement: Variant Recalibration (separately per variant type); Genotype Refinement; Variant Annotation; Variant Evaluation (look good? use in project, otherwise troubleshoot); Analysis-Ready Variants (SNPs & Indels).]

Page 3: Why mark duplicates?

[Figure: mapped reads aligned to a reference, before and after "Mark duplicates"; marked bases = sequencing error propagated in duplicates]

•  Duplicates are sets of read pairs that have the same unclipped alignment start and unclipped alignment end
•  They’re suspected to be non-independent measurements of a sequence
   –  Sampled from the exact same template of DNA
   –  Violates assumptions of variant calling
•  What’s more, errors in sample/library prep will get propagated to all the duplicates
   –  Just pick the “best” copy: this mitigates the effects of errors

Page 4: How do duplication events arise?

PCR duplicates

Optical duplicates
Read names have the following form: @identifier:lane:tile:x:y

Figure sources:
http://www.slideshare.net/jandot/next-generation-sequencing-course-part-2-sequence-mapping
http://www.slideshare.net/cosentia/illumina-gaiix-for-high-throughput-sequencing
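
Because the lane, tile and x/y coordinates are encoded in the read name, optical duplicates can be told apart from PCR duplicates by how close two reads sit on the flowcell. The following is a minimal Python sketch assuming the five-field name layout shown on this slide. The function names, the example read names and the 100-pixel threshold are illustrative assumptions, not values taken from the slides.

```python
from typing import NamedTuple


class ReadLocation(NamedTuple):
    identifier: str
    lane: int
    tile: int
    x: int
    y: int


def parse_read_name(name: str) -> ReadLocation:
    """Split '@identifier:lane:tile:x:y' into its five fields."""
    identifier, lane, tile, x, y = name.lstrip("@").split(":")
    return ReadLocation(identifier, int(lane), int(tile), int(x), int(y))


def looks_optical(a: str, b: str, max_pixel_distance: int = 100) -> bool:
    """True if two duplicate reads share identifier, lane and tile and lie
    within max_pixel_distance of each other on both axes."""
    ra, rb = parse_read_name(a), parse_read_name(b)
    return (
        ra.identifier == rb.identifier
        and ra.lane == rb.lane
        and ra.tile == rb.tile
        and abs(ra.x - rb.x) <= max_pixel_distance
        and abs(ra.y - rb.y) <= max_pixel_distance
    )
```

Duplicates that fail this proximity test (different tile, or far apart on the same tile) would be counted as PCR duplicates in this sketch.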

Page 5: Optical and PCR duplication events arise at different rates as a sequencing experiment proceeds

[Figure: panels labeled "PCR duplicates" and "Optical duplicates"]

Page 6: How do we identify duplicate reads?

•  Dupes come from the same input DNA template, so we assume that they have the same start position on the reference
   –  “Where was the first base that was sequenced?”
   –  For paired-end (PE) reads, same start for both ends
•  Identify duplicate sets, then choose a representative read based on base quality scores and other criteria
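
The "choose a representative" step can be sketched in Python as follows. This is a hedged illustration that scores each read by the sum of its base qualities and keeps the highest-scoring one; the quality threshold of 15, the function names and the example data are assumptions made for the sketch, and the "other criteria" mentioned on the slide are not modeled.

```python
def quality_score(base_qualities, min_quality=15):
    """Sum the base qualities at or above min_quality (assumed threshold)."""
    return sum(q for q in base_qualities if q >= min_quality)


def pick_representative(duplicate_set):
    """Given a dict of read name -> list of base qualities, return
    (kept_name, marked_names). Ties keep the read that was listed first."""
    ranked = sorted(
        duplicate_set,
        key=lambda name: quality_score(duplicate_set[name]),
        reverse=True,
    )
    return ranked[0], ranked[1:]


# Hypothetical duplicate set: three reads with the same 5' start.
example = {"r1": [30, 30, 10], "r3": [35, 35, 35], "r5": [20, 20, 20]}
kept, marked = pick_representative(example)
```

Only the reads in `marked` would be tagged as duplicates; the kept read stays available to downstream tools.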

Page 7: But there’s a catch (or two)…

•  BWA sometimes “clips” bases from the ends of the alignment (when the alignment there is poor)
•  Need to use SAM flags + CIGAR string to determine the unclipped 5’ end
•  Fragments mapped to the reverse strand are specified by their 3’ position, instead of 5’ (the SAM position is always the leftmost aligned base)
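
Both catches can be handled with the SAM POS, CIGAR and FLAG fields alone. The Python sketch below is one way to compute the unclipped 5’ position; the function name is hypothetical and the code is an illustration of the idea, not the Picard implementation.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
REVERSE_FLAG = 0x10  # SAM flag bit: read mapped to the reverse strand


def unclipped_five_prime(pos: int, cigar: str, flag: int) -> int:
    """Return the 1-based reference position of the read's unclipped 5' end.

    pos is the SAM POS field, i.e. the leftmost aligned reference base.
    """
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    if not flag & REVERSE_FLAG:
        # Forward strand: step left over any leading clipped bases.
        leading = 0
        for n, op in ops:
            if op not in "SH":
                break
            leading += n
        return pos - leading
    # Reverse strand: the 5' end is the rightmost base, so add the
    # reference-consuming operations plus any trailing clipped bases.
    ref_span = sum(n for n, op in ops if op in "MDN=X")
    trailing = 0
    for n, op in reversed(ops):
        if op not in "SH":
            break
        trailing += n
    return pos + ref_span - 1 + trailing
```

For example, a forward read reported at position 3 with CIGAR `2S5M` has its unclipped 5’ end at position 1, and a reverse read reported at position 1 with CIGAR `5M2H` has its unclipped 5’ end at position 7.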

Page 8: Identify duplicates using orientation + “unclipped” 5’ position

Pos     1  2  3  4  5  6  7  8  9   Strand
Ref     T  A  G  C  C  G  A  T  C
r1      T  A  G  C  C  G  A         forward
r2      T  A  G  C  C  G  A         reverse
r3      T  A  –  C  C  G  A         forward
r4      T  A  G  C  C  H  H         reverse
r5      T  A  G  C  C  G  A  T  C   forward
r6      S  S  G  C  C  G  A         forward
r7            G  C  C  G  A         forward

r3 also shows an extra A between positions 5 and 6 (written "CAG" on the slide).
The Strand column stands in for the slide's colors: blue maps to forward strand, red maps to reverse strand.
S (soft-clipped) and H (hard-clipped) bases are clipped (grey on the slide).
On the slide, the expected 5’ start of each read, given the mapping, is underlined.
What are the duplicate sets?

Page 11: Identify duplicates using orientation + “unclipped” 5’ position

So… what are the duplicate sets?
☞  r1, r3, r5, r6 (forward strand, start at position 1)
☞  r2, r4 (reverse strand, start at position 7)
☞  r7 (forward strand, starts at position 3)
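
The worked example can be reproduced with a few lines of Python that group reads by orientation and unclipped 5’ position. This is a sketch under stated assumptions: the POS values, CIGAR strings and strands below are one reading of the alignment table (for instance, r3 is encoded as `2M1D2M1I2M`), and the helper names are hypothetical.

```python
import re
from collections import defaultdict

CLIPS = "SH"              # soft- and hard-clip operators
REF_CONSUMING = "MDN=X"   # operators that advance along the reference


def clipped(ops):
    """Total length of the clip operations at the start of ops."""
    total = 0
    for length, op in ops:
        if op not in CLIPS:
            break
        total += length
    return total


def five_prime(pos, cigar, strand):
    """Unclipped 5' position of a read (1-based reference coordinate)."""
    ops = [(int(n), op) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)]
    if strand == "+":
        return pos - clipped(ops)
    ref_span = sum(n for n, op in ops if op in REF_CONSUMING)
    return pos + ref_span - 1 + clipped(ops[::-1])


# (name, POS, CIGAR, strand): assumed encoding of reads r1..r7.
READS = [
    ("r1", 1, "7M", "+"),
    ("r2", 1, "7M", "-"),
    ("r3", 1, "2M1D2M1I2M", "+"),
    ("r4", 1, "5M2H", "-"),
    ("r5", 1, "9M", "+"),
    ("r6", 3, "2S5M", "+"),
    ("r7", 3, "5M", "+"),
]


def duplicate_sets(reads):
    """Group read names by (strand, unclipped 5' position)."""
    groups = defaultdict(list)
    for name, pos, cigar, strand in reads:
        groups[(strand, five_prime(pos, cigar, strand))].append(name)
    return dict(groups)
```

Calling `duplicate_sets(READS)` yields three groups: forward reads starting at position 1, reverse reads starting at position 7, and r7 alone at position 3. The example treats the reads as single-end; for paired-end data the key would also include the mate's orientation and 5’ position.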

Page 12: So now we have mapped, sorted, and deduped reads

[Figure: two views of the same reads, "Showing duplicate reads" and "Hiding duplicate reads"]

Page 13: What this means for downstream analysis

•  Duplicate status is indicated in the SAM flag (bit 0x400)
•  Duplicates are not removed, just tagged (unless you request removal)
•  Downstream tools can read the tag and choose to ignore those reads
•  Most GATK tools ignore duplicates by default
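
Since duplicates are only tagged, a downstream script has to test the duplicate bit itself if it wants to skip them. A minimal Python sketch over SAM text lines follows; the function names and the sample records are made up for illustration.

```python
DUPLICATE_FLAG = 0x400  # 1024: "PCR or optical duplicate" bit of the SAM FLAG field


def is_duplicate(flag: int) -> bool:
    """True if the duplicate bit is set in a SAM FLAG value."""
    return bool(flag & DUPLICATE_FLAG)


def drop_duplicates(sam_lines):
    """Yield header lines and alignment lines whose duplicate bit is unset."""
    for line in sam_lines:
        if line.startswith("@"):
            yield line          # header lines pass through unchanged
            continue
        flag = int(line.split("\t")[1])  # FLAG is the second SAM column
        if not is_duplicate(flag):
            yield line


# Hypothetical records: same alignment, the second one tagged as a duplicate.
sample = [
    "@HD\tVN:1.6\tSO:coordinate",
    "readA\t99\tchr1\t100\t60\t7M\t=\t300\t207\tTAGCCGA\tIIIIIII",
    "readB\t1123\tchr1\t100\t60\t7M\t=\t300\t207\tTAGCCGA\tIIIIIII",
]
kept = list(drop_duplicates(sample))
```

Here `readB` carries flag 1123 (99 + 1024), so it is skipped while the header and `readA` are kept.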

Page 14: Use cases where you may NOT want to mark duplicates

•  Amplicon sequencing -> all reads start at same position by design
•  RNAseq allele-specific expression analysis (ASEReadCounter can disable DuplicateFilter)

Page 15: Add-on: Predicting the complexity of a sequencing experiment

Complexity analysis depends on:
•  Estimated library size
•  Return on Investment (ROI) calculations

Page 16: Estimation of library size and duplication in Picard

[Figure: Estimated fraction of duplicates]

Assumptions
●  all reads are drawn from the same Poisson distribution Po(λ)
●  the occurrence of duplication events depends on underlying concentration of inserts in the library
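
One way to turn these assumptions into numbers is the Lander-Waterman relation C/X = 1 - exp(-N/X), where N is the number of read pairs examined, C the number of distinct (non-duplicate) pairs observed and X the unknown library size. The Picard documentation describes its library size estimate in these terms. The Python sketch below solves the relation by bisection; it illustrates the idea under that assumption and is not Picard's code, and the function names are hypothetical.

```python
import math


def estimate_library_size(read_pairs: int, unique_pairs: int) -> float:
    """Solve unique_pairs / X = 1 - exp(-read_pairs / X) for X by bisection."""
    if not 0 < unique_pairs < read_pairs:
        raise ValueError("need 0 < unique_pairs < read_pairs")

    def f(x: float) -> float:
        return unique_pairs / x - 1 + math.exp(-read_pairs / x)

    lo = float(unique_pairs)   # f(lo) > 0: the library holds at least what was seen
    hi = 2.0 * unique_pairs
    while f(hi) > 0:           # widen the bracket until the sign changes
        hi *= 2.0
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if f(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def expected_duplicate_fraction(library_size: float, read_pairs: float) -> float:
    """Expected fraction of duplicates when sequencing read_pairs pairs."""
    unique = library_size * (1 - math.exp(-read_pairs / library_size))
    return 1 - unique / read_pairs
```

With the library size in hand, `expected_duplicate_fraction` can be evaluated at a larger `read_pairs` to see how much new information additional sequencing of the same library would return, which is the kind of ROI calculation mentioned on the previous slide.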

Page 17: Active research to improve library size estimation

•  Rate of duplication varies with insert size
•  Duplication rates also likely vary with GC content

Page 18: You are here in the GATK Best Practices workflow for germline variant discovery

[Figure: GATK Best Practices workflow diagram, same as on Page 2]

Page 19: Further reading

http://www.broadinstitute.org/gatk/guide/best-practices

http://broadinstitute.github.io/picard/