A Comparison of Genomic Structural Variant Detection using LUMPY and DELLY

Lance Tan¹, Ronak H. Shah², Michael F. Berger²
¹ Newark Academy, Livingston, NJ
² Department of Pathology, Memorial Sloan Kettering Cancer Center, New York, NY

INTRODUCTION

Background
Structural variants (SVs) are deviations from normal chromosomal structure affecting regions approximately 1 kilobase or longer, and they represent one of the largest and most diverse categories of mutations in the human genome. Because cancer is caused by the accumulation of somatic mutations in an individual's genome, structural variants are implicated as a cause of cancer. Recent developments in high-throughput, next-generation sequencing technology have allowed researchers to sequence large, targeted regions of tumor DNA to locate and treat specific mutations; the MSK-IMPACT assay (Memorial Sloan Kettering Integrated Mutation Profiling of Actionable Cancer Targets) and its associated computational pipeline are an example. Despite these advances, accurately and efficiently determining the presence and location of SVs from sequencing data remains cumbersome for several reasons: a wide range of SV sizes (less than one kilobase to tens of megabases), multiple structural variant types and complexity levels, and different types of SV evidence, including paired-end reads (PE), split reads (SR), and read depth (RD). Here, the LUMPY structural variant discovery software is compared with DELLY, its counterpart in the MSK-IMPACT computational pipeline, to determine whether integrating LUMPY into the IMPACT pipeline would be of benefit.

METHODS

Method
LUMPY was used to call structural variants on 122 tumor-normal sample pairs from 8 sequencing runs for which DELLY had already called SVs, and the results of the two callers were compared.
SPEEDSEQ, a framework that bundles together multiple tools, including LUMPY and BWA-MEM (a sequence aligner), was used to align raw sequencing reads and call structural variants with LUMPY. Python and shell scripts were written to process reads and interface with the components of SPEEDSEQ. All processing was done on a computer cluster at MSKCC through the LSF queuing system.

Align and process (SPEEDSEQ ALIGN v0.0.3a)
• Map reads to human genome (BWA-MEM v0.7.8.r455)
• Mark duplicates, extract discordant/split reads (SAMBLASTER v0.1.21)
• Sorting and indexing (Sambamba v0.4.7)
Call SVs (SPEEDSEQ SV)
• LUMPY (v0.2.9) run on pairs of tumor/normal samples
Filter and annotate
• Filter by support, hotspot status, and variant size (custom Python script)
• Annotate breakpoints (iAnnotateSV v0.0.2)
Manually review and compare with DELLY

Figure 2: DELLY (Rausch et al., European Molecular Biology Laboratory, Heidelberg, Germany) is the SV caller currently used in the IMPACT pipeline. Its sequential strategy first calculates SV-containing ranges from paired-end reads and then localizes these ranges using split reads. DELLY contains a modified version of the Gotoh alignment algorithm for split-read identification, unlike LUMPY, which must rely on generalized tools. Figure from: Rausch et al. Bioinformatics 28, no. 18 (September 15, 2012): i333–39.

Figure 3: LUMPY (Layer et al., University of Virginia, Charlottesville, VA) uses a modular framework for detecting structural variants. It accounts for multiple types of evidence in parallel by calculating separate breakpoint ranges from each evidence category and then adding these ranges together. In this study, paired-end reads and split reads were used, while the optional copy number variation and previously known variant inputs were omitted. Figure from: Layer et al. Genome Biology 15, no. 6 (June 26, 2014): R84.
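The "Filter and annotate" step above filters calls by support, hotspot status, and variant size. The sketch below shows one way such a filter could be written. It is not the study's custom script: the thresholds, the field names, the use of combined PE + SR support, and the choice to skip the size check for translocations are all assumptions made for illustration.

```python
from dataclasses import dataclass

# Hypothetical thresholds; the values actually used in the study are not stated.
MIN_SUPPORT = 5           # minimum PE + SR reads outside hotspot regions
MIN_SUPPORT_HOTSPOT = 2   # more lenient minimum inside hotspot regions
MIN_SIZE = 500            # minimum variant length in bases


@dataclass
class SVCall:
    """Illustrative record for one SV call (field names are assumptions)."""
    sv_type: str      # "DEL", "DUP", "INV" or "TRA"
    pe: int           # paired-end reads supporting the call
    sr: int           # split reads supporting the call
    length: int       # variant size in bases (unused for translocations)
    in_hotspot: bool  # True if a breakpoint falls in a hotspot region


def passes_filter(call: SVCall) -> bool:
    """Return True if the call survives the support and size filters."""
    min_support = MIN_SUPPORT_HOTSPOT if call.in_hotspot else MIN_SUPPORT
    if call.pe + call.sr < min_support:
        return False
    # Assumption: translocations have no meaningful length, so the size
    # check is applied only to the other variant types.
    if call.sv_type != "TRA" and call.length < MIN_SIZE:
        return False
    return True


calls = [
    SVCall("DEL", pe=10, sr=8, length=2000, in_hotspot=False),
    SVCall("DEL", pe=1, sr=1, length=2000, in_hotspot=False),
    SVCall("DEL", pe=1, sr=1, length=2000, in_hotspot=True),
]
kept = [c for c in calls if passes_filter(c)]
```

With these made-up calls, the weakly supported deletion is dropped outside a hotspot but kept inside one, which is the effect of applying a more lenient filter to hotspot regions.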
LUMPY calls gain paired-end support but lose split-read support compared to existing DELLY calls.

Paired-end support of known structural variants from LUMPY and DELLY (tumor PE support, reads)

Variant                 LUMPY   DELLY
DEL (ROS1-EZR)            114      78
DUP (RET-NCOA4)            37      34
TRA (EWSR1-FLI1)           17      17
TRA (ROS1-CD74)            10      10
INV (RET-CCDC6)            35      18
INV (DIS3-DAOA)            20      17
TRA (FLI1-EWSR1)            4       4
TRA (WT1-EWSR1) #1         20      19
TRA (WT1-EWSR1) #2         14       7

Split-read support of known structural variants from LUMPY and DELLY (tumor SR support, reads)

Variant                 LUMPY   DELLY
DEL (ROS1-EZR)            115      77
DUP (RET-NCOA4)            40      47
TRA (EWSR1-FLI1)           15      26
TRA (ROS1-CD74)             0      12
INV (RET-CCDC6)            37      43
INV (DIS3-DAOA)            19      24
TRA (FLI1-EWSR1)            2      15
TRA (WT1-EWSR1) #1         26      29
TRA (WT1-EWSR1) #2         13      12

Figure 5: For the nine true structural variant calls made by DELLY, LUMPY tended to detect at least as many paired-end reads as DELLY. This is an advantage of LUMPY over DELLY, since paired-end support increases confidence that a variant is real.

Figure 6: DELLY generally finds more split-read support for the same nine mutations. Since split reads are the only evidence type that localizes a breakpoint to the exact base, this gives DELLY a significant advantage over LUMPY. The difference may arise because DELLY uses a modified version of the Gotoh algorithm to align sequences with k-mers to find split reads, whereas LUMPY must rely on the split-read support found by BWA-MEM.

Figure 1: Categories of genetic structural variation. LUMPY and DELLY both group variants broadly into deletions, inversions, duplications, and translocations. In addition to these categories, multiple structural changes can occur in overlapping regions, creating complex and hard-to-categorize mutations. Figure from: Alkan et al. Nature Reviews Genetics 12, no. 5 (May 2011): 363–76.

CONCLUSION
• While LUMPY detects more paired-end read support than DELLY for the same mutations, it also detects less split-read support.
• LUMPY exhibits a marked bias towards calling deletions and inversions, the majority of which are false positives that distract manual reviewers from significant SVs.
• LUMPY detects the majority of variants that DELLY does. However, DELLY as configured in the IMPACT pipeline detected more variants that LUMPY missed than LUMPY, as configured in this study, detected that DELLY missed.
Overall, replacing DELLY with LUMPY as the structural variant detector in the MSK-IMPACT pipeline would not produce much benefit. Whether a combined approach that includes LUMPY as a supplement to DELLY is more effective remains to be seen.

ACKNOWLEDGEMENTS
I would like to thank Ronak Shah and Dr. Michael Berger for all of their support and instruction in developing this project and seeing it to completion.

Figure 4: (A) A ROS1-EZR deletion, (B) a ROS1-CD74 translocation, (C) a RET-NCOA4 duplication, and (D) a RET-CCDC6 inversion called by LUMPY. All four structural variants are also true positives called by DELLY. The different orientations of paired-end reads allow variants to be categorized into one of these four classes. Like other structural variant callers, LUMPY is imperfect and may not detect all the evidence that supports a mutation, as in (B).

[Figure 7 chart: Common and unique SVs with varying breakpoint difference threshold. x-axis: breakpoint difference threshold (bases), 10 to 100; y-axis: number of SVs; series: SVs found by both, SVs found by LUMPY only, SVs found by DELLY only.]

[Figure 7 diagram: DELLY calls (1120 total) and LUMPY calls (780 total): 1025 found by DELLY only, 95 found by both, 685 found by LUMPY only.]

LUMPY trends towards calling deletions and inversions over duplications and translocations.

Call set                                DEL               INV              DUP            TRA
Unfiltered LUMPY calls                  208711 (90.62%)   16987 (7.38%)    656 (0.28%)    3951 (1.72%)
Filtered LUMPY calls                    2689 (52.63%)     2366 (46.31%)    17 (0.33%)     37 (0.72%)
Filtered LUMPY calls w/o outliers       103 (48.82%)      70 (33.18%)      9 (4.27%)      29 (13.74%)

Figure 8: The breakdown of calls made by LUMPY for each structural variant type before and after filtering.
(A) Distribution of calls before filtering. (B) Distribution after filtering by support. A more lenient filter was applied to mutation calls in hotspot regions than to those in non-hotspot regions. (C) Distribution after filtering and removing outlier samples (all samples with more than 20 post-filter variant calls). These sixteen samples account for a disproportionate 96.2% of post-filter deletions and 97.0% of post-filter inversions.

Figure 7: (A) Comparison of the common and unique calls made by LUMPY and DELLY at various breakpoint difference thresholds. Since LUMPY and DELLY do not necessarily resolve the breakpoints of an SV to the same base, we defined a breakpoint difference threshold: a LUMPY call and a DELLY call whose breakpoints differ by less than this threshold in absolute value were considered equivalent. The ideal threshold would be the smallest one that yields the maximum number of equivalent SVs. Since this ideal could not be determined, a threshold of 10 bases was chosen based on the absolute breakpoint differences observed in true positive calls. (B) The number of common and unique calls at a threshold of 10. Both panels show that, with the pipeline settings currently in use, DELLY calls a considerably larger number of SVs than LUMPY.

RESULTS

Examples of true structural variants found by LUMPY.
[Figure 4 panel labels, each panel showing SR, PE, and all-reads tracks: ROS1/EZR: SR (115), PE (114); CD74/ROS1: SR (0), PE (12); RET/NCOA4: SR (40), PE (37); RET/CCDC6: SR (47), PE (35).]

The current DELLY-based computational pipeline identified more SVs than LUMPY.
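The equivalence rule described in the Figure 7 caption can be sketched in a few lines. This is a minimal illustration, not the script used in the study: the `Call` record, the requirement that variant type and chromosomes agree and that both breakpoints fall within the threshold, and the greedy one-to-one matching are assumptions, and the coordinates in the usage example are invented.

```python
from collections import namedtuple

# Hypothetical call record: variant type plus two breakpoints.
Call = namedtuple("Call", "sv_type chrom1 pos1 chrom2 pos2")


def same_sv(a, b, threshold=10):
    """Treat two calls as equivalent if type and chromosomes agree and
    both breakpoints differ by less than `threshold` bases."""
    return (
        a.sv_type == b.sv_type
        and a.chrom1 == b.chrom1
        and a.chrom2 == b.chrom2
        and abs(a.pos1 - b.pos1) < threshold
        and abs(a.pos2 - b.pos2) < threshold
    )


def compare_callsets(lumpy, delly, threshold=10):
    """Return (found by both, LUMPY only, DELLY only).

    Each DELLY call is matched to at most one LUMPY call (greedy,
    first match wins), which is an assumption of this sketch.
    """
    matched = set()
    both = 0
    for l_call in lumpy:
        for i, d_call in enumerate(delly):
            if i not in matched and same_sv(l_call, d_call, threshold):
                matched.add(i)
                both += 1
                break
    return both, len(lumpy) - both, len(delly) - both


# Invented coordinates, for illustration only.
lumpy_calls = [Call("DEL", "6", 1000, "6", 5000), Call("INV", "10", 200, "10", 900)]
delly_calls = [Call("DEL", "6", 1004, "6", 4997), Call("TRA", "6", 300, "5", 700)]
counts = compare_callsets(lumpy_calls, delly_calls, threshold=10)
```

Repeating `compare_callsets` over thresholds from 10 to 100 would give the kind of curve shown in panel (A); a larger threshold can only keep or increase the number of calls counted as found by both.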