Comprehensive structural and copy-number variant detection with long reads

Aaron M. Wenger, Armin Töpfer, Yuan Li, Luke Hickey
Pacific Biosciences, 1305 O'Brien Drive, Menlo Park, CA 94025

Variant Detection with HiFi Reads

PacBio highly accurate, long reads (HiFi reads) comprehensively detect variants in the human genome, including in difficult repetitive regions.

Figure 1. Accurate PacBio HiFi reads detect variants in difficult-to-map exons of the disease gene STRC (ref. 1).

Precision and recall, as measured against the Genome in a Bottle (GIAB) benchmarks (refs. 2, 3), are high for single nucleotide variants (SNVs), indels, and structural variants (SVs).

Variant type   Precision   Recall
SNVs           99.97%      99.95%
Indels         99.10%      98.86%
SVs            96.26%      94.93%

SNVs
Ref  GCAGGCAGCGACTACGTACGCTAACAGCGATCTCAG
Alt  GCAGGCAGCGACTACGTCCTCTAACAGCGATCTCAG

Indels
Ref  GCAGGCAGCGACTACGTACGCTAACAGCGATCTCAG
Alt  GCAGGCAGCGACTACGT-CGCTAACAGCGATCTCAG

Figure 2. PacBio SNV calling performance for HG002 with 32-fold HiFi coverage (Rowell, poster 1866/W). [Venn diagram labeled GIAB benchmark and DeepVariant PacBio callset; counts: 3,047,837; 1,571; 1,030.]

Figure 3. PacBio indel calling performance for HG002 with 32-fold HiFi coverage. [Venn diagram labeled GIAB benchmark and DeepVariant PacBio callset; counts: 459,481; 5,283; 4,329.]

Figure 4. PacBio SV calling performance for HG002 with 32-fold HiFi coverage. [Venn diagram labeled GIAB benchmark and DeepVariant PacBio callset; counts: 9,196; 489; 357.]

Copy-number variant (CNV) calling with pbsv

Existing long-read variant calling methods rely on de novo assembly or spanning reads to detect variants. These methods are effective for SVs but miss many CNVs that involve long segmental duplications.

"We determined that 57% and 15% of the copy number variable bases within segmental duplications detected by dCGH and Genome STRiP, respectively, were not in contigs resolved by de novo assembly [of long reads]." (Chaisson et al. 2019, ref. 4)

We extended the PacBio SV caller, pbsv (ref. 5), to detect CNVs using a combination of read clipping and depth.

pbsv CNV algorithm
1. Determine genome-wide coverage: median coverage of non-gap positions at mapping quality 60.
2. Identify candidate CNV breakpoints: positions with multiple clipped reads.
3. Evaluate coverage between adjacent candidate breakpoints: calculate z-score vs. Poisson expectation.

Figure 5. PacBio read coverage is Poisson distributed in autosomes for (a) samples from GIAB and (b) the normal sample from a tumor/normal pair (provided by WP Kloosterman). A tumor sample shows CNV regions of reduced and increased coverage. (c) To call CNVs, pbsv identifies candidate CNV breakpoints with multiple clipped reads and then evaluates read depth between adjacent breakpoints compared to the genome-wide typical coverage. [Panels a and b: axes labeled Coverage and 100 bp windows; samples HG001, HG002, HG005, COLO829N, COLO829T. Panel c: tracks HG001 PacBio reads, Segdups, Repeats at GRCh37 chr6:103,727,761-103,773,116 (45 kb), with two candidate breakpoints marked; scale bar 25 kb.]

CNVs in HG001 and COLO829T

Figure 6. pbsv CNV calls in HG001 in (a) segmentally duplicated (segdup) sequence and (b) unique sequence. [Tracks: HG001 PacBio reads, Segdups, Repeats, Genes, HG001 pbsv CNV. Regions shown: GRCh37 chr19:50,579,869-50,659,719 (80 kb) and GRCh37 chr6:220,626-410,470 (189 kb).]

Figure 7. Genomic distribution of pbsv CNV calls. (a, b) Most CNVs in the "healthy" genome HG001 involve segdups. (c, d) The tumor genome COLO829T has many more CNVs than HG001 and most fall outside of segdups. [Axes labeled Copy number and Total CNV length (Mb); series: segdup, unique; samples: HG001, COLO829T (tumor).]

Summary
• PacBio HiFi reads comprehensively detect variants in a human genome.
• The pbsv variant caller identifies CNVs in HiFi reads using read clipping and depth signatures.

References
1. Wenger AM, Peluso P et al. (2019). Accurate circular consensus long-read sequencing improves variant detection and assembly of a human genome. Nat Biotechnol. doi:10.1038/s41587-019-0217-9.
2. Zook JM et al. (2019). An open resource for accurately benchmarking small variant and reference calls. Nat Biotechnol. 37(5):561-566.
3. Zook JM et al. (2019). A robust benchmark for germline structural variant detection. bioRxiv. doi:10.1101/664623. [Preprint]
4. Chaisson MJP et al. (2019). Multi-platform discovery of haplotype-resolved structural variation in human genomes. Nat Commun. 10(1):1784.
5. https://github.com/PacificBiosciences/pbsv

Thank you to David Scherer, Kristin Robertshaw, and Pamela Bentley Mills for poster production support.

For Research Use Only. Not for use in diagnostic procedures. © Copyright 2019 by Pacific Biosciences of California, Inc. All rights reserved. Pacific Biosciences, the Pacific Biosciences logo, PacBio, SMRT, SMRTbell, Iso-Seq, and Sequel are trademarks of Pacific Biosciences. BluePippin and SageELF are trademarks of Sage Science. NGS-go and NGSengine are trademarks of GenDx. Femto Pulse and Fragment Analyzer are trademarks of Agilent Technologies Inc. All other trademarks are the sole property of their respective owners.
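
Illustrative sketch of the pbsv CNV algorithm steps

The three steps of the pbsv CNV algorithm described in this poster (genome-wide median coverage, candidate breakpoints from clipped reads, and a z-score against the Poisson expectation) can be illustrated with a minimal Python sketch. This is not the pbsv implementation. The function names, the per-position input arrays, the thresholds `min_clipped` and `z_cutoff`, the per-position form of the z-score, and the diploid copy-number estimate are all assumptions made for illustration; the depth array is assumed to already count only reads at mapping quality 60.

```python
import math
from statistics import mean, median


def genome_wide_coverage(depth, is_gap):
    """Step 1: median depth over non-gap positions.

    `depth` is assumed to count only reads with mapping quality 60.
    """
    return median(d for d, gap in zip(depth, is_gap) if not gap)


def candidate_breakpoints(clip_counts, min_clipped=2):
    """Step 2: positions where at least `min_clipped` reads are clipped.

    `min_clipped` is a hypothetical threshold for "multiple clipped reads".
    """
    return [pos for pos, n in enumerate(clip_counts) if n >= min_clipped]


def call_cnvs(depth, is_gap, clip_counts, min_clipped=2, z_cutoff=3.0):
    """Step 3: score coverage between adjacent candidate breakpoints.

    Under a Poisson model with mean `lam`, the variance is also `lam`,
    so a per-position z-score is (segment_mean - lam) / sqrt(lam).
    `z_cutoff` and the diploid copy-number estimate are assumptions.
    """
    lam = genome_wide_coverage(depth, is_gap)
    if lam <= 0:
        raise ValueError("genome-wide coverage must be positive")

    calls = []
    breakpoints = candidate_breakpoints(clip_counts, min_clipped)
    for start, end in zip(breakpoints, breakpoints[1:]):
        segment = [
            d for d, gap in zip(depth[start:end], is_gap[start:end]) if not gap
        ]
        if not segment:
            continue  # segment lies entirely in a reference gap
        segment_mean = mean(segment)
        z = (segment_mean - lam) / math.sqrt(lam)
        if abs(z) >= z_cutoff:
            calls.append(
                {
                    "start": start,
                    "end": end,
                    "z": z,
                    "type": "DUP" if z > 0 else "DEL",
                    # Assumes a diploid baseline of two copies.
                    "copy_number": round(2 * segment_mean / lam),
                }
            )
    return calls


# Toy example: 30-fold coverage with a 20-position region at 60-fold,
# flanked by positions where three reads are clipped.
toy_depth = [30] * 40 + [60] * 20 + [30] * 40
toy_gaps = [False] * 100
toy_clips = [0] * 100
toy_clips[40] = 3
toy_clips[60] = 3

for call in call_cnvs(toy_depth, toy_gaps, toy_clips):
    print(call)
```

In this toy input, the only pair of adjacent candidate breakpoints brackets the elevated region, so it is the only segment evaluated. A real caller would work from aligned reads rather than precomputed arrays and would need to handle correlated depth across neighboring positions, which this sketch ignores.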