ntEdit: scalable genome sequence polishing

René Warren, Jessica Zhang, Lauren Coombe, Hamid Mohamadi, Inanç Birol
www.bcgsc.ca   [email protected]   http://birol-lab.ca   http://renewarren.ca

Reference
Warren et al. 2019. Bioinformatics. DOI: 10.1093/bioinformatics/btz400

Software
https://github.com/bcgsc/nthits
https://github.com/bcgsc/ntedit

Summary
[Figure: "Scalable solutions for genome assembly" (https://github.com/bcgsc). Labels: read technology (short, linked, long); assembly, correction, scaffolding, gap-filling, polishing; Sealer; Illumina, SMS drafts (Nanopore/PacBio), etc.]

Method
[Figure: workflow. Haploid or diploid DNA source -> sequence reads (NGS reads, e.g. Illumina) -> ntHits (ntCard [2], threshold error kmers) -> Bloom filter of kmers. Draft sequence (SMS, 10xG, Illumina genome assembly, gene sequence, etc.) + Bloom filter -> ntEdit -> edited sequence. Labels: check absence, check presence.]

1. Check kmer
2. Check k kmer subset (S_k) for absence
   If S_k- ≥ k/x: continue with steps 3 to 5
3. Permutate 3'-end base
   Check k kmer* subset (S_k_alt) for presence
   If S_k_alt+ ≥ k/y: apply change to sequence, resume 3
4. Insert 3'-end positions
   Check k kmer subset presence
   If S_k_alt+ ≥ k/y: apply change, resume 1
5. Delete 3'-end positions
   Check k kmer subset presence
   If S_k_alt+ ≥ k/y: apply change, resume 1
* kmers with alternate 3'-end base (k_alt)

Definitions
kmer ........ word of length k
S_k ......... subset of overlapping k kmers
S_k- ........ subset of absent, overlapping k kmers
S_k_alt+ .... subset of present, overlapping, k alternate kmers
x ........... leniency factor 1, test for absence
y ........... leniency factor 2, test for presence

New feature (v1.2.0)
-m option: editing mode 0-2 [default=0]
  0: best substitution, or first supported indel
  1: best substitution, or best indel
  2: best edit overall (exhaustive)

Tuning
Controlled C. elegans sequence data.
[Figure: Effect of Bloom filter FPR. Errors per 100 kbp (indels, mismatches) vs Bloom filter false positive rate (0.0001 to 0.1).]
[Figure: Effect of coverage threshold (-c). Series c1, c2, c3, c4, auto vs coverage (3 to 50). FPR ~ 0.0005.]
[Figure: Sensitivity vs false discovery rate for leniency factors x and y, one panel per x value (x: 4, 5, 6, 8; x: 4, 5, 20, 30; x: 4, 5, 12, 20). Values of y are indicated on the plot. Data sets: E. coli (30x, k25), C. elegans (30x, k35), H. sapiens chr21 (17x, k50).]

Testing (controlled)
Copy:     copy reference (ref); subs. 0.001, indels 0.0001
Simulate: PE100, 300bp frag., err0.1%
Run:      ntHits, ntEdit (FPR~0.0005, 3 hash fn)
Assess:   QUAST [1]

[Figure: # mismatches per 100kbp, # indels per 100kbp, time (min) and peak memory (GB) vs coverage (10 to 50). Tools: Baseline, GATK, Racon, Pilon, ntEdit k=20, ntEdit k=25, ntEdit k=30, ntEdit iterative k=35,30,25.]
[Figure: # mismatches per 100kbp, # indels per 100kbp, time (min) and peak memory (GB) vs coverage (20 to 50). Tools: Baseline, GATK, Racon, Pilon, ntEdit k=25, ntEdit k=30, ntEdit k=35, ntEdit iterative k=40,35,30,25.]
[Figure: errors per 100kbp (indels, mismatches), time (min) and peak memory (GB) vs k (30 to 70). Tools: Baseline, GATK, Racon, Pilon, ntEdit.]

Experimental results

Cacao [8], 400 Mbp genome
Base Nanopore, Canu [12]. Polish w/ 30X PE100 Illumina reads; ntEdit k35, 30, 27, 25, 23, i5 d5 m1.
ntHits 15m, 4GB / ntEdit 5m, <2GB
Assess completeness / accuracy w/ BUSCO [11]: single-copy gene orthologs [embryophyta].
[Figure: BUSCOs (% of 1,440 searched) for Canu, +ntEdit, +pilon, +pilon +ntEdit. Bar values as extracted (bar-to-treatment order not recoverable):
  complete    96.0   95.4   93.1   56.1
  fragmented   1.3    1.6    2.5    7.7
  missing      2.7    3.1    4.4   36.2]

Human*, BUSCO [euarchontoglires] n=6,192, polish w/ 54X Illumina

Baseline SMS*   Tool         Time**   Edits (M)   BUSCO (%)       net
Nanopore [6]    (baseline)                        5,647 (91.2)
                GATK [4]     41h45    0.97        5,654 (91.3)      7
                ntEdit***    2h18     0.95        5,670 (91.6)     23
                Racon [5]    45h54    N/A         5,681 (91.7)     34
PacBio [7]      (baseline)                        5,285 (85.4)
                GATK         42h21    2.66        5,285 (85.4)      0
                ntEdit***    2h10     3.63        5,651 (91.3)    366
                Racon        40h55    N/A         5,670 (91.6)    385

*   Single Molecule Sequencing draft genomes
**  Time for pipeline
*** 15GB RAM

Beluga [9], 2.3 Gbp genome, BUSCO [laurasiatheria] n=6,253
Axolotl [10], 32 Gbp genome, BUSCO [tetrapoda] n=3,950

Genome     Assembly                  #Edits (M)   BUSCO (%)       BUSCO Δ
Beluga     Base 10xG linked reads    N/A          5,670 (90.7)
           +ntEdit k50i1d1           0.2          5,677 (90.8)    Δ7
Axolotl    Base PacBio               N/A          1,248 (31.6)
           +ntEdit k40i3d3           59.0         1,354 (34.3)    Δ106

ntHits 2h, 40GB / ntEdit 5m, 12GB

'Haploidizing' Spruce [13], 20 Gbp genome, Base Illumina

                    White spruce   Interior spruce
Subs. (M)           49.39          47.29
Indels (M)          1.11           1.02
Time, ntHits        3h29m          4h23m
Time, ntEdit        25m            23m
RAM (GB), ntHits    207.8          206.9
RAM (GB), ntEdit    90.2           86.1

ntHits 3h, 210GB / ntEdit 20m, 95GB

Other labels in this panel: 17X Illumina (pseudohap.), < 8X Illumina, 60X Illumina, diploid, rate~0.0023, rate~0.0001.

[Figure: ntHits and ntEdit memory (GB) and time (hours) vs reads (billion) and bases (billions) for the cacao, beluga, human, spruce and axolotl genomes. 48 threads (250bp reads, k40). ntEdit settings: k50i3d3, k40i3d3, k50i1d1.]

References
1  Mikheenko, 2018
2  Mohamadi, 2017
3  Walker, 2014
4  McKenna, 2010
5  Vaser, 2017
6  Jain, 2018
7  Pendleton, 2015
8  Morrissey, 2019
9  Jones, 2017
10 Nowoshilow, 2018
11 Simao, 2015
12 Koren, 2017
13 Warren, 2015
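The kmer checks in the Method steps can be illustrated with a small Python sketch. This is not the ntEdit implementation: it covers only the substitution path (steps 1 to 3), replaces the Bloom filter with an exact Python set, and takes "k kmer subset" to mean every kmer that overlaps the 3'-end base. The function names (kmer_set, overlapping, polish) are hypothetical, and the x and y values are picked from those shown in the tuning plots purely for illustration.

```python
BASES = "ACGT"


def kmer_set(reads, k):
    """Exact stand-in for the Bloom filter: every kmer seen in the reads."""
    return {r[i:i + k] for r in reads for i in range(len(r) - k + 1)}


def overlapping(seq, pos, k):
    """All kmers of seq that contain position pos (at most k of them)."""
    first = max(0, pos - k + 1)
    last = min(pos, len(seq) - k)
    return [seq[s:s + k] for s in range(first, last + 1)]


def polish(draft, solid, k, x=5.0, y=9.0):
    """Substitution-only sketch of steps 1 to 3; returns (sequence, edits)."""
    seq = draft
    edits = []
    i = 0
    while i + k <= len(seq):
        # Step 1: a kmer that is present needs no further checks.
        if seq[i:i + k] in solid:
            i += 1
            continue
        pos = i + k - 1  # 3'-end base of the absent kmer
        # Step 2: count absent kmers among those overlapping the 3'-end base.
        absent = sum(km not in solid for km in overlapping(seq, pos, k))
        if absent >= k / x:
            # Step 3: try each alternate 3'-end base and keep the one whose
            # overlapping kmers have the most support, if it reaches k / y.
            best_base, best_support = None, 0
            for base in BASES:
                if base == seq[pos]:
                    continue
                alt = seq[:pos] + base + seq[pos + 1:]
                support = sum(km in solid for km in overlapping(alt, pos, k))
                if support >= k / y and support > best_support:
                    best_base, best_support = base, support
            if best_base is not None:
                edits.append((pos, seq[pos], best_base))
                seq = seq[:pos] + best_base + seq[pos + 1:]
        i += 1
    return seq, edits


if __name__ == "__main__":
    ref = "ACGTTGCAAGGCTTACCGATGCATTAGC"
    draft = ref[:14] + "G" + ref[15:]  # one substitution at position 14
    solid = kmer_set([ref], 7)         # error-free "reads" for the demo
    print(polish(draft, solid, 7))
```

Because the sketch only inspects the 3'-end base of each absent kmer, an error within the first k-1 bases of the draft is never examined. The insertion and deletion checks (steps 4 and 5) would follow the same presence test against k/y and are left out to keep the example short.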