Using BioNano Maps to Improve an Insect Genome Assembly Sue Brown Oct 23, 2014 1

Using BioNano Maps to Improve an Insect Genome Assembly

Jul 02, 2015


Algorithms and filters used to improve the Tribolium draft Assembly with Physical Maps Based on Imaging Ultra-Long Single DNA Molecules. A video of the webinar is available on the BioNano Genomics website (http://www.bionanogenomics.com/bionano-community/webinars/) as "Using BioNano Maps to Improve an Insect Genome Assembly".
Transcript
Page 1: Using BioNano Maps to Improve an Insect Genome Assembly

Using BioNano Maps to Improve an Insect Genome Assembly

Sue Brown, Oct 23, 2014

Page 2: Using BioNano Maps to Improve an Insect Genome Assembly

Tribolium castaneum, the red flour beetle

Genetic model organism for developmental, physiology and toxicology studies

• Easily cultured
• Short generation time
• Small genome size
• Molecular and visible marker genetic maps
• Genetic tools: balancers, deficiencies
• Genomic libraries: lambda and BAC
• cDNA libraries
• Mutant analysis and RNAi
• Transformation
• 7x Sanger draft genome (Nature, 2008)

Page 3: Using BioNano Maps to Improve an Insect Genome Assembly

Tribolium Genome

• Genome size: 200 Mb
• Cot analysis
• 9 autosomes, X and Y
• Low methylation
• Long period interspersion

Jeff Stuart, Purdue University

Page 4: Using BioNano Maps to Improve an Insect Genome Assembly

Molecular map markers used to anchor scaffolds to Chromosome builds

Few X markers, no Y, variable marker density

Page 5: Using BioNano Maps to Improve an Insect Genome Assembly

Tcas 3.0 Reference Genome stats from NCBI

Input file          N50 (Mb)   Number   Cumulative Length (Mb)
Genome contigs      0.04       8814     160.74
Genome scaffolds    0.98       481      152.53

Unmapped scaffolds: 352

Unmapped contigs: 1884

Page 6: Using BioNano Maps to Improve an Insect Genome Assembly

Algorithms and filters used to improve the Tribolium draft Assembly with Physical Maps Based on Imaging Ultra-Long Single DNA Molecules

Jennifer Shelton 2014

Page 11: Using BioNano Maps to Improve an Insect Genome Assembly

Data formats

• BNX - text file of molecules
• CMAP - text file of consensus maps
• in silico CMAP - from genome FASTA
• BNG CMAP - from assembled molecules
• XMAP - text file of alignment of two CMAPs

[Diagram: BNX molecule 1; in silico CMAP 1 and in silico CMAP 2 aligned to BNG CMAP 1 and BNG CMAP 2]
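To make the idea of an in silico map concrete, here is a minimal sketch of how label positions could be located in a sequence from a genome FASTA. The slides do not name the labeling enzyme, so the motif `GCTCTTC` (the Nt.BspQI recognition site) is an assumption, and the function names are hypothetical; this is not the vendor's conversion tool and it does not write a CMAP file.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    pairs = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(pairs[base] for base in reversed(seq.upper()))


def in_silico_label_positions(sequence, motif="GCTCTTC"):
    """Return sorted 0-based start positions of the motif on either strand.

    The default motif is an assumption, not taken from the slides.
    """
    seq = sequence.upper()
    sites = set()
    for pattern in (motif, reverse_complement(motif)):
        start = seq.find(pattern)
        while start != -1:
            sites.add(start)
            start = seq.find(pattern, start + 1)
    return sorted(sites)
```

The sorted positions, together with the sequence length, are the information an in silico consensus map carries for one scaffold.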

Page 12: Using BioNano Maps to Improve an Insect Genome Assembly

Assembly Pipeline

3) Use sequence reference to adjust molecule stretch for each scan

Assembly workflow:

1) The Irys produces tiff files that are converted into BNX text files.
2) Each chip produces one BNX file for each of two flowcells.
3) BNX files are split by scan and aligned to the sequence reference. Stretch (bases per pixel) is recalculated from the alignment.
4) Quality check graphs are created for each pre-adjusted flowcell BNX.
5) Adjusted flowcell BNXs are merged.
6) The first assemblies are run with a variety of p-value thresholds.
7) The best of the first assemblies (red oval) is chosen and a version of this assembly is produced with a variety of minimum molecule length filters.

[Workflow diagram: 1) The Irys produces tiff files; 2) Each chip produces flowcell BNX files; 3) Scan BNX are adjusted; 4) QC graphs for each flowcell; 5) Merge all flowcells; 6) First assemblies (strict, default and relaxed p-value thresholds); 7) Second assemblies (strict and relaxed minimum molecule length)]
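The stretch adjustment in step 3 can be illustrated with a small calculation. This is only a sketch of the arithmetic, assuming the correction is a simple ratio between the span a set of molecules covers on the reference and the span they appear to cover at the current bases-per-pixel value; the function name and the numbers in the usage note are hypothetical, and the actual pipeline scripts may do this differently.

```python
def rescale_bpp(current_bpp, molecule_span_bp, reference_span_bp):
    """Recalculate bases per pixel for one scan.

    molecule_span_bp is the length of the aligned region as measured on the
    molecules using current_bpp; reference_span_bp is the length of the same
    region on the sequence reference.
    """
    if molecule_span_bp <= 0 or reference_span_bp <= 0:
        raise ValueError("spans must be positive")
    span_in_pixels = molecule_span_bp / current_bpp
    return reference_span_bp / span_in_pixels
```

For example, if molecules in a scan appear 2% too long at an illustrative 500 bpp (1,020,000 bp measured against 1,000,000 bp on the reference), the recalculated value is about 490.2 bpp.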

Page 13: Using BioNano Maps to Improve an Insect Genome Assembly

Assembly Pipeline

In recent datasets, when SNR is low and alignment is good, we see a spike in bases per pixel (bpp) in the first scan, followed by a plateau and then a lower plateau.

[Figure label: First scan in a flow cell]

Page 14: Using BioNano Maps to Improve an Insect Genome Assembly

Assembly Pipeline

5) Use sequence reference to determine assembly noise parameters. Estimated genome size is used to set the p-value threshold.

[Assembly workflow text and diagram as on Page 12]

Page 15: Using BioNano Maps to Improve an Insect Genome Assembly

Assembly Pipeline

6/7) Variants of the starting p-value and default minimum molecule length are explored in nine assemblies.

[Assembly workflow text and diagram as on Page 12]
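One way to enumerate nine assemblies is a 3 x 3 grid of p-value thresholds and minimum molecule lengths. The sketch below is illustrative only: the rule that the default threshold is 1e-5 divided by the genome size in Mb, the tenfold strict and relaxed variants, and the three minimum lengths are all assumptions, not values stated in the slides.

```python
from itertools import product


def assembly_parameter_grid(genome_size_mb, min_lengths_kb=(100, 150, 180)):
    """Return one parameter set per assembly (3 p-values x 3 minimum lengths)."""
    # Assumption: the default p-value threshold scales inversely with genome size.
    default_p = 1e-5 / genome_size_mb
    p_values = {
        "strict": default_p / 10,
        "default": default_p,
        "relaxed": default_p * 10,
    }
    return [
        {"p_label": label, "p_value": p, "min_length_kb": min_len}
        for (label, p), min_len in product(p_values.items(), min_lengths_kb)
    ]
```

Calling `assembly_parameter_grid(200)` for the ~200 Mb Tribolium estimate yields nine dictionaries, each of which could be passed to an assembly run.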

Page 16: Using BioNano Maps to Improve an Insect Genome Assembly

Current Tribolium sequence-based assembly

223 scaffolds from the sequence-based assembly were longer than 20 kb with more than 5 labels and were converted into in silico CMAPs

Input file                   N50 (Mb)   Number of Scaffolds   Cumulative Length (Mb)
Genome FASTA                 1.16       2240                  160.74
in silico CMAP from FASTA    1.20       223                   152.53
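The scaffold filter described on this slide can be sketched as follows. The thresholds (longer than 20 kb, more than 5 labels) come from the slide; the function name and the input layout are hypothetical.

```python
def eligible_for_in_silico_cmap(scaffolds, min_length_bp=20_000, min_labels=5):
    """Return names of scaffolds longer than min_length_bp with more than min_labels labels.

    scaffolds maps a scaffold name to a (length_bp, label_count) tuple.
    """
    return [
        name
        for name, (length_bp, label_count) in scaffolds.items()
        if length_bp > min_length_bp and label_count > min_labels
    ]
```

Scaffolds that fail either condition carry too little label information to align reliably, which is presumably why only 223 of the 2240 scaffolds were converted.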

Page 17: Using BioNano Maps to Improve an Insect Genome Assembly

Assembly Results

BNG assembled molecules had a higher N50 and longer cumulative length than the sequence assembly

The estimated size of the Tribolium genome is ~200 Mb

Input file                                       N50 (Mb)   Number   Cumulative Length (Mb)
Genome FASTA                                     1.16       2240     160.74
in silico CMAP from FASTA                        1.20       223      152.53
CMAP from assembled BNG molecules (BNG CMAP)     1.35       216      200.47
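For readers unfamiliar with the N50 column, the sketch below shows the usual definition: the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the cumulative length. The function name is hypothetical and the scripts used for the slides may compute it differently.

```python
def n50(lengths):
    """Return the N50 of a collection of scaffold or map lengths."""
    if not lengths:
        raise ValueError("at least one length is required")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
```

For lengths 2, 3, 4, 5 and 6 the total is 20; the two longest (6 and 5) already cover 11, so the N50 is 5.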

Page 18: Using BioNano Maps to Improve an Insect Genome Assembly

Simplest XMAP alignment description

[Diagram: two in silico CMAPs from genome FASTA (1 Mb and 1.1 Mb) aligned to two CMAPs from assembled molecules, BNG CMAP 1 and BNG CMAP 2 (1.1 Mb and 1.3 Mb)]

• Breadth of alignment coverage for in silico CMAP: 2.1 Mb
• Total alignment length for in silico CMAP: 2.1 Mb
• Breadth of alignment coverage for BNG CMAP: 2.4 Mb
• Total alignment length for BNG CMAP: 2.4 Mb

Page 19: Using BioNano Maps to Improve an Insect Genome Assembly

Complex XMAP alignment description

[Diagram: one in silico CMAP from genome FASTA (in silico CMAP 1, 1 Mb) aligned to both BNG CMAP 1 and BNG CMAP 2 from assembled molecules (1.1 Mb and 1.3 Mb)]

• Breadth of alignment coverage for in silico CMAP: 1 Mb
• Total alignment length for in silico CMAP: 2 Mb
• Breadth of alignment coverage for BNG CMAP: 2.4 Mb
• Total alignment length for BNG CMAP: 2.4 Mb
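The difference between the two metrics can be sketched with interval arithmetic: total alignment length sums every aligned interval, while breadth counts each covered position on a map only once. The function names and the (start, end) tuple layout are hypothetical; real XMAP parsing is not shown.

```python
def total_alignment_length(intervals):
    """Sum of aligned interval lengths, counting overlaps repeatedly."""
    return sum(end - start for start, end in intervals)


def breadth_of_coverage(intervals):
    """Length of the union of aligned intervals on one map."""
    covered = 0
    current_start = current_end = None
    for start, end in sorted(intervals):
        if current_end is None or start > current_end:
            # Close the previous merged interval before starting a new one.
            if current_end is not None:
                covered += current_end - current_start
            current_start, current_end = start, end
        else:
            current_end = max(current_end, end)
    if current_end is not None:
        covered += current_end - current_start
    return covered
```

In the complex example, the 1 Mb in silico CMAP is covered twice, once by each BNG CMAP, so the two intervals `(0, 1_000_000)` and `(0, 1_000_000)` give a breadth of 1 Mb and a total of 2 Mb.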

Page 20: Using BioNano Maps to Improve an Insect Genome Assembly

Alignment of CMAPs

Breadth of alignment coverage compared to total aligned length can indicate relevant relationships between assemblies.

In this example, differences between "breadth" and "total" length could be due to:

• Genomic duplications in the sample from which the molecules were extracted
• Assembly of alternate haplotypes
• Mis-assembly creating redundant contigs
• Collapsed repeat in the sequence assembly

[Diagram: in silico CMAP 1 from genome FASTA (1 Mb) aligned to both BNG CMAP 1 and BNG CMAP 2 from assembled molecules (1.1 Mb and 1.3 Mb)]

Page 21: Using BioNano Maps to Improve an Insect Genome Assembly

Alignment of BNG assembly to reference genome

Close to 4% of the alignment of the in silico CMAP appears to be redundant.

Overall, 81% of the in silico CMAP aligns to the BNG consensus map.

CMAP name                                        Breadth of alignment coverage (Mb)   Length of total alignment (Mb)   Percent of CMAP aligned
in silico CMAP from FASTA                        124.04                               132.40                           81
CMAP from assembled BNG molecules (BNG CMAP)     131.64                               132.34                           67

Page 22: Using BioNano Maps to Improve an Insect Genome Assembly

Alignment of BNG assembly to reference genome

Typically, where redundant alignments occur, two BNG consensus maps aligned, suggesting that they represent haplotypes, although this has not been verified.

[Figure: ChLG 9 scaffolds (130 131 133 134 132 129 135 127 136 137 139 138 140 141 142 143 144 145) and the ChLG 9 super scaffold, each aligned with BNG consensus maps; min confidence 10]

Figure note: ChLG9 currently (2150 scaffold_133 aligns but is not visible in IrysView?) why is Super_scaffold_65 backwards?

Page 23: Using BioNano Maps to Improve an Insect Genome Assembly

Potential haplotypes where overlapping BNG cmaps align

[Figure: ChLG 9 scaffolds (128 130 131 133 134 132 129 135 127 136 137 139 138 140 141 142 143 144 145) and the ChLG 9 super scaffold, each aligned with BNG consensus maps; min confidence 10]

Page 24: Using BioNano Maps to Improve an Insect Genome Assembly

Alignment of BNG assembly to reference genome was used to super-scaffold the Tribolium scaffolds

Stitch.pl estimates super scaffolds from alignments, made with the BNG RefAligner, between scaffolds and assembled BNG molecules.

• in silico CMAP aligned as reference

[Diagram: in silico CMAPs 1 (+), 2 (+), 3 (-) and 4 (+) aligned to BNG CMAP 1 and BNG CMAP 2]

Page 25: Using BioNano Maps to Improve an Insect Genome Assembly

Alignment of BNG assembly to reference genome was used to super-scaffold the Tribolium scaffolds

Stitch.pl estimates super scaffolds from alignments, made with the BNG RefAligner, between scaffolds and assembled BNG molecules.

• in silico CMAP aligned as reference
• alignment is inverted and used as input for stitch

[Diagram: in silico CMAPs 1 (+), 2 (+), 3 (-) and 4 (+) aligned to BNG CMAP 1 and BNG CMAP 2, before and after inversion]

Page 26: Using BioNano Maps to Improve an Insect Genome Assembly

Alignment of BNG assembly to reference genome was used to super-scaffold the Tribolium scaffolds

Stitch.pl estimates super scaffolds from alignments, made with the BNG RefAligner, between scaffolds and assembled BNG molecules.

• in silico CMAP aligned as reference
• alignment is inverted and used as input for stitch
• alignments are filtered based on alignment length relative to total possible alignment length, and on confidence

[Diagram: in silico CMAPs 1 (+), 2 (+), 3 (-) and 4 (+) aligned to BNG CMAP 1 and BNG CMAP 2 at each of the three stages]

Page 27: Using BioNano Maps to Improve an Insect Genome Assembly

Alignment of BNG assembly to reference genome was used to super-scaffold the Tribolium scaffolds

Stitch.pl checks alignment length against potential alignment lengths to find relevant global rather than local alignments

[Diagram: + in silico CMAP 1 aligned to BNG CMAP 1] Alignment passes because the alignment length is greater than 30% of the potential alignment length.

Page 28: Using BioNano Maps to Improve an Insect Genome Assembly

Alignment of BNG assembly to reference genome was used to super-scaffold the Tribolium scaffolds

Stitch.pl checks alignment length against potential alignment lengths to find relevant global rather than local alignments

[Diagram: + in silico CMAP 2 aligned to BNG CMAP 1] Alignment passes because the alignment length is greater than 30% of the potential alignment length.

Page 29: Using BioNano Maps to Improve an Insect Genome Assembly

Alignment of BNG assembly to reference genome was used to super-scaffold the Tribolium scaffolds

Stitch.pl checks alignment length against potential alignment lengths to find relevant global rather than local alignments

[Diagram: - in silico CMAP 2 aligned to BNG CMAP 2] Alignment passes because the alignment length is greater than 30% of the potential alignment length.

Page 30: Using BioNano Maps to Improve an Insect Genome Assembly

Alignment of BNG assembly to reference genome was used to super-scaffold the Tribolium scaffolds

Stitch.pl checks alignment length against potential alignment lengths to find relevant global rather than local alignments

[Diagram: - in silico CMAP 2 aligned to BNG CMAP 2] Alignment fails because the alignment length is less than 30% of the potential alignment length.

Page 31: Using BioNano Maps to Improve an Insect Genome Assembly

Alignment of BNG assembly to reference genome was used to super-scaffold the Tribolium scaffolds

Stitch.pl checks alignment length against potential alignment lengths to find relevant global rather than local alignments

[Diagram: + in silico CMAP 2 aligned to BNG CMAP 2] Alignment fails because the alignment length is less than 30% of the potential alignment length.

Page 32: Using BioNano Maps to Improve an Insect Genome Assembly

Alignment of BNG assembly to reference genome was used to super-scaffold the Tribolium scaffolds

Stitch.pl checks alignment length against potential alignment lengths to find relevant global rather than local alignments

[Diagram: - in silico CMAP 3 aligned to BNG CMAP 2] Alignment passes because the alignment length is greater than 30% of the potential alignment length.

Page 33: Using BioNano Maps to Improve an Insect Genome Assembly

Alignment of BNG assembly to reference genome was used to super-scaffold the Tribolium scaffolds

Stitch.pl checks alignment length against potential alignment lengths to find relevant global rather than local alignments

[Diagram: - in silico CMAP 3 aligned to BNG CMAP 2] Alignment fails because the alignment length is less than 30% of the potential alignment length.

Page 34: Using BioNano Maps to Improve an Insect Genome Assembly

Alignment of BNG assembly to reference genome was used to super-scaffold the Tribolium scaffolds

Stitch.pl checks alignment length against potential alignment lengths to find relevant global rather than local alignments

[Diagram: + in silico CMAP 4 aligned to BNG CMAP 2] Alignment passes because the alignment length is greater than 30% of the potential alignment length.
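The pass and fail cases on the preceding slides can be sketched as a ratio test. The 30% threshold is from the slides; how stitch.pl actually computes the "potential alignment length" is not shown there, so the geometry below (the longest overlap the two maps could share given where the alignment starts on each) is an assumption, and the function names are hypothetical.

```python
def potential_alignment_length(ref_len, qry_len, ref_start, qry_start, forward=True):
    """Longest overlap two maps could share at this alignment's offset.

    ref_start and qry_start are the positions, on each map, of the same
    aligned label. For a reverse alignment the query runs right to left.
    """
    if forward:
        left = min(ref_start, qry_start)
        right = min(ref_len - ref_start, qry_len - qry_start)
    else:
        left = min(ref_start, qry_len - qry_start)
        right = min(ref_len - ref_start, qry_start)
    return left + right


def passes_length_filter(aligned_len, potential_len, min_fraction=0.30):
    """True when the aligned length is greater than min_fraction of the potential."""
    return aligned_len > min_fraction * potential_len
```

An alignment that covers only a small local stretch of a long possible overlap fails, while one that spans most of the possible overlap passes, which is the global versus local distinction the slides describe.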

Page 35: Using BioNano Maps to Improve an Insect Genome Assembly

Alignment of BNG assembly to reference genome was used to super-scaffold the Tribolium scaffolds

• High quality scaffolding alignments...

[Diagram: in silico CMAPs 1 (+), 2 (+), 3 (-) and 4 (+) aligned to BNG CMAP 1 and BNG CMAP 2]

Page 36: Using BioNano Maps to Improve an Insect Genome Assembly

Alignment of BNG assembly to reference genome was used to super-scaffold the Tribolium scaffolds

• High quality scaffolding alignments...
• ...are filtered for longest and highest confidence alignment for each in silico CMAP

[Diagram: in silico CMAPs 1 (+), 2 (+), 3 (-) and 4 (+) aligned to BNG CMAP 1 and BNG CMAP 2, before and after filtering]

Page 37: Using BioNano Maps to Improve an Insect Genome Assembly

Alignment of BNG assembly to reference genome was used to super-scaffold the Tribolium scaffolds

• High quality scaffolding alignments...
• ...are filtered for longest and highest confidence alignment for each in silico CMAP
• Passing alignments are used to super scaffold

[Diagram: in silico CMAPs 1 (+), 2 (+), 3 (-) and 4 (+) aligned to BNG CMAP 1 and BNG CMAP 2 at each of the three stages]

Page 38: Using BioNano Maps to Improve an Insect Genome Assembly

Alignment of BNG assembly to reference genome was used to super-scaffold the Tribolium scaffolds

• Stitch is iterated and additional super scaffolding alignments are found

[Diagram: in silico CMAPs 1 (+), 2 (+), 3 (-) and 4 (+) aligned to BNG CMAP 1 and BNG CMAP 2]

Page 39: Using BioNano Maps to Improve an Insect Genome Assembly

Alignment of BNG assembly to reference genome was used to super-scaffold the Tribolium scaffolds

• Stitch is iterated and additional super scaffolding alignments are found
• Until all super scaffolds are joined

[Diagram: in silico CMAPs 1 (+), 2 (+), 3 (-) and 4 (+) aligned to BNG CMAP 1 and BNG CMAP 2, before and after joining]

Page 40: Using BioNano Maps to Improve an Insect Genome Assembly

Alignment of BNG assembly to reference genome was used to super-scaffold the Tribolium scaffolds

[Diagram: in silico CMAPs 1 (+), 2 (+), 3 (-) and 4 (+) joined through BNG CMAP 1 and BNG CMAP 2]

If gap length is estimated to be negative, gaps are represented by 100 bp spacers
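The spacer rule can be sketched as a small joining function. The 100 bp spacer for negative estimates is from the slide; representing gaps with the character `N` is an assumption (the usual FASTA convention), and the function name is hypothetical.

```python
def join_with_gap(left_seq, right_seq, estimated_gap_bp, spacer_bp=100):
    """Join two scaffold sequences, padding the gap with N characters.

    A negative estimate means the maps suggest the scaffolds overlap; the
    overlap is not resolved here and a fixed-size spacer is inserted instead.
    """
    gap_bp = estimated_gap_bp if estimated_gap_bp >= 0 else spacer_bp
    return left_seq + "N" * gap_bp + right_seq
```

Using a fixed spacer keeps the two scaffolds ordered and oriented in the super-scaffold without claiming a gap size the data do not support.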

Page 41: Using BioNano Maps to Improve an Insect Genome Assembly

Gap lengths

In the automated stitch.pl Tribolium super-scaffolds, 66 gaps had known lengths and 26 had negative lengths (set to 100 bp)

In the manually edited Tribolium super-scaffolds, 66 gaps had known lengths and 24 had negative lengths (set to 100 bp)

[Figure: Distribution of gap lengths for automated output. x-axis: Gap length (bp), -1500000 to 1000000; y-axis: Count, 0 to 20; negative and positive gap lengths are distinguished]

Page 43: Using BioNano Maps to Improve an Insect Genome Assembly

Negative gap lengths

The longest negative gap length is from a BNG consensus map joining in silico 23 and 136

Is part of scaffold_23 connected to 136? I went with the second alignment (21-26 together and 136-137 together, because it is supported by genetic maps), but we should check these assemblies. In the bottom alignment of 136 you can see that a large section of BNG map 32 (which joins 23 to 136) is a duplicate in the BNG assembly?

[Figure: scaffolds 22 23 129 136 137 aligned with BNG consensus maps]

Page 44: Using BioNano Maps to Improve an Insect Genome Assembly

Negative gap lengths

Because the same region of 136 aligns to another BNG consensus map that aligns to its chromosome linkage group, this alignment was rejected and stitch was re-run

[Figure: scaffolds 22 23 129 136 137 aligned with BNG consensus maps]

Page 45: Using BioNano Maps to Improve an Insect Genome Assembly

Negative gap lengths

Two new super scaffolds were created and the sequence similarity is being evaluated

[Figure: ChLG 9 scaffolds (128 130 131 133 134 132 129 135 127 136 137 139 138 140 141 142 143 144 145) and the ChLG 9 super scaffold, each aligned with BNG consensus maps; min confidence 10]

[Figure: ChLG 2 scaffolds (U 18 14 16 19 20 21 22 23 24 25 26 27 28 30) and the ChLG 2 super scaffold, each aligned with BNG consensus maps; min confidence 10]

Page 46: Using BioNano Maps to Improve an Insect Genome Assembly

Gap lengths

This negative alignment also indicated a potential assembly issue

[Figure: Distribution of gap lengths for automated output. x-axis: Gap length (bp), -1500000 to 1000000; y-axis: Count, 0 to 20; negative and positive gap lengths are distinguished]

Page 47: Using BioNano Maps to Improve an Insect Genome Assembly

Negative gap lengths

This negative gap length is from a BNG consensus map joining in silico 81 and 102 and 103

Half of scaffold_81 aligns with ChLG7

Page 48: Using BioNano Maps to Improve an Insect Genome Assembly

Negative gap lengths

Because the other half of 81 aligns to another BNG consensus map that aligns to its chromosome linkage group, this alignment was rejected and stitch was re-run

The BNG maps suggest a mis-assembly of in silico 81 at the sequence level

Half of scaffold_81 aligns with ChLG7

[Figure: scaffolds 79 80 81 82 83 aligned with BNG consensus maps]

Page 49: Using BioNano Maps to Improve an Insect Genome Assembly

Gap lengths

All extremely small negative gap lengths, < -20,000 bp (shaded), were independently flagged as potential sequence mis-assemblies to be checked at the sequence level

[Figure: Distribution of gap lengths for automated output. x-axis: Gap length (bp), -1500000 to 1000000; y-axis: Count, 0 to 20; negative and positive gap lengths are distinguished]

Page 50: Using BioNano Maps to Improve an Insect Genome Assembly

Gap lengths

All gaps from the shaded regions were also manually rejected and stitch.pl was rerun without them for the current super-scaffolded assembly

We suspect extremely small negative gap sizes may be useful in locating sequence mis-assemblies

stitch.pl version 1.4.5 rejects alignments if negative gap lengths are < -20,000 bp but lists them in the data summary

[Figure: Distribution of gap lengths for automated output. x-axis: Gap length (bp), -1500000 to 1000000; y-axis: Count, 0 to 20; negative and positive gap lengths are distinguished]
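The gap handling described across these slides can be summarized as a three-way triage. The -20,000 bp cutoff and the treatment of other negative gaps as spacers are from the slides; the function name, category names and return layout are hypothetical and do not reproduce the stitch.pl data summary.

```python
def triage_gaps(gap_lengths_bp, reject_below=-20_000):
    """Sort estimated gap lengths into the three outcomes described in the slides."""
    result = {"positive": [], "negative_spacer": [], "rejected": []}
    for gap in gap_lengths_bp:
        if gap < reject_below:
            # Flag as a potential sequence mis-assembly; the join is not made.
            result["rejected"].append(gap)
        elif gap < 0:
            # Small overlap: the join is kept and a fixed spacer is used.
            result["negative_spacer"].append(gap)
        else:
            result["positive"].append(gap)
    return result
```

Keeping the rejected gaps in the returned summary, rather than discarding them, mirrors the point that they remain useful as pointers to regions worth checking at the sequence level.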

Page 51: Using BioNano Maps to Improve an Insect Genome Assembly

Tribolium super-scaffolds

N50 of the super-scaffolded genome was ~4 times greater than the original

Super-scaffolds tend to agree with the Tribolium genetic map

Input file              N50 (Mb)   Number of Scaffolds   Cumulative Length (Mb)
genome FASTA            1.16       2240                  160.74
super-scaffold FASTA    4.46       2150                  165.92

Page 52: Using BioNano Maps to Improve an Insect Genome Assembly

Tribolium super-scaffolds

For Tribolium:

• first minimum percent aligned = 30%
• first minimum confidence = 13
• second minimum percent aligned = 90%
• second minimum confidence = 8

Lower quality alignments were manually selected if the genetic map also supported the order

Complex scaffolds were broken manually for sequence level evaluation

Input file              N50 (Mb)   Number of Scaffolds   Cumulative Length (Mb)
genome FASTA            1.16       2240                  160.74
super-scaffold FASTA    4.46       2150                  165.92
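The four thresholds above can be read as two alternative filter tiers. The sketch below assumes an alignment is kept when it satisfies either pair (modest percent aligned with high confidence, or high percent aligned with lower confidence); the slides list the values but not how they are combined, so that logic and the function name are assumptions.

```python
def passes_stitch_filters(percent_aligned, confidence,
                          first=(30.0, 13.0), second=(90.0, 8.0)):
    """True if the alignment meets either (min percent aligned, min confidence) pair."""
    return any(
        percent_aligned >= min_percent and confidence >= min_confidence
        for min_percent, min_confidence in (first, second)
    )
```

Under this reading, an alignment with 35% aligned and confidence 14 passes the first tier, and one with 95% aligned and confidence 9 passes the second, while 35% aligned with confidence 9 passes neither.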

Page 53: Using BioNano Maps to Improve an Insect Genome Assembly

Tribolium super-scaffolds

ChLG X was reduced from 13 scaffolds to 2, with one scaffold being moved to ChLG 3

From ChLGX, 11 of the previous 13 scaffolds were joined with two unplaced scaffolds (U) into one super scaffold.

[Figure: ChLG X scaffolds (U 3 4 5 6 7 U 8 9 10 11 12 13) and the ChLG X super scaffold, each aligned with BNG consensus maps; min confidence 10]

Page 54: Using BioNano Maps to Improve an Insect Genome Assembly

Tribolium super-scaffolds

The second scaffold from ChLG X aligned to scaffolds from a portion of ChLG 3

[Figure: ChLG 3 scaffolds (32 33 34 35 36 2 37 38 39 40 41 42; 51 U 43 45 44 46; 47 U U 152 48 49 50 52 53 54 U 57 55) and the ChLG 3 super scaffolds, each aligned with BNG consensus maps; min confidence 10]

Page 55: Using BioNano Maps to Improve an Insect Genome Assembly

Tribolium super-scaffolds

Two unplaced scaffolds aligned to ChLG X

[Figure: ChLG X scaffolds (U 3 4 5 6 7 U 8 9 10 11 12 13) and the ChLG X super scaffold, each aligned with BNG consensus maps; min confidence 10]

Page 56: Using BioNano Maps to Improve an Insect Genome Assembly

Tribolium super-scaffolds

4% redundancy in alignment may be from assembly of haplotypes (generally observed as two BNG consensus maps aligning to the same in silico map)

[Figure: ChLG X scaffolds (U 3 4 5 6 7 U 8 9 10 11 12 13) and the ChLG X super scaffold, each aligned with BNG consensus maps; min confidence 10]

Page 57: Using BioNano Maps to Improve an Insect Genome Assembly

Potential haplotypes where overlapping BNG cmaps align

[Figure: ChLG X scaffolds (U 3 4 5 6 7 U 8 9 10 11 12 13) and the ChLG X super scaffold, each aligned with BNG consensus maps; min confidence 10]

Page 58: Using BioNano Maps to Improve an Insect Genome Assembly

Tribolium super-scaffolds

For ChLG 9, 21 scaffolds were reduced to 9

[Figure: ChLG 9 scaffolds (128 130 131 133 134 132 129 135 127 136 137 139 138 140 141 142 143 144 145) and the ChLG 9 super scaffold, each aligned with BNG consensus maps; min confidence 10]

Page 59: Using BioNano Maps to Improve an Insect Genome Assembly

Tribolium super-scaffolds

For ChLG 5, 17 scaffolds were reduced to 4

[Figure: ChLG 5 scaffolds (69 68 70 71 72 73 74 U 75 76 77 78 79 80 81 82 83) and the ChLG 5 super scaffold, each aligned with BNG consensus maps; min confidence 10]

Page 60: Using BioNano Maps to Improve an Insect Genome Assembly

Future directions: Structural Variant (SV)

Use SV-detect pipelines to resize existing gaps in scaffolds and identify mis-assemblies


Page 61: Using BioNano Maps to Improve an Insect Genome Assembly

Acknowledgements

K-INBRE Bioinformatics Core

• Susan Brown - PI
• Nic Herndon - script development
• Nanyan Lu - manual evaluation
• Michelle Coleman - extractions and running the Irys
• Zachary Sliefert - metric summaries

BioNano Genomics

• Ernest Lam - assembly pipeline best practices assistance
• Weiping Wang - assistance with data formats
• Palak Sheth - collaboration to standardize analysis

Script availability

https://github.com/i5K-KINBRE-script-share/Irys-scaffolding

BNG scripts available by request from BNG

Slide availability

http://www.slideshare.net/kstatebioinformatics/using-bionano-maps-to-improve-an-insect-genome-assembly

This project was supported by grants from the National Center for Research Resources (5P20RR016475) and the National Institute of General Medical Sciences (8P20GM103418) from the National Institutes of Health.

Improving the Tribolium draft Assembly with Physical Maps Based on Imaging Ultra-Long Single DNA Molecules

Jennifer M. Shelton (Kansas State University), Michelle Coleman (Kansas State University), Nic Herndon (Kansas State University), Warren Andrews (BioNano Genomics), Weiping Wang (BioNano Genomics), Ernest Lam (BioNano Genomics), Susan J. Brown (Kansas State University)

We use the red flour beetle, a pest of stored grain, as a genetic model organism for developmental studies. As members of the i5k, we join scientists around the globe who are gearing up to sequence 5000 insect genomes to improve human welfare and understand key ecosystem services that insects provide. By investigating insect genomes, we can take a fresh look at how insects transmit some of the most devastating diseases of humans, livestock, and plants on one hand, yet also serve as medical models for cancer, obesity, alcoholism, and neurological disease on the other.

Genome sequencing is becoming very affordable, but genome assembly is still challenging. Most assemblies are essentially drafts of the genome, and even heavily curated reference assemblies contain misassemblies and truncations or gaps in repetitive regions. We are using a form of optical mapping to validate and extend the contigs and scaffolds that constitute a genome assembly. The 7x draft assembly of the genome of the red flour beetle, Tribolium castaneum, is based on paired-end Sanger sequencing of 4-6 kb insert plasmid libraries, scaffolded with paired-end reads from 40 kb fosmid and ~130 kb BAC clones (1). The total assembled length of ~156 Mb represents 75% of the estimated genome (200 Mb) and presumably lacks a significant portion of repetitive DNA. Superscaffolds or chromosome linkage group builds (ChLG 2-10 and X) were constructed by mapping molecular markers from the genetic recombination map to the assembly scaffolds, anchoring greater than 90% of the assembled sequence.

To improve this draft assembly, we constructed physical maps of the T. castaneum genome using the Irys system designed by BioNano Genomics (http://www.bionanogenomics.com/). Consensus maps of the imaged molecules were compared with in silico maps generated from the assembly sequence. Here we report our progress on using these comparisons to validate the assembly in regions where they agree and reanalyze the assembly in regions where they do not. Additional scaffolds have been anchored to the chromosomes, order and orientation of scaffolds have been corrected, and scaffolds have been extended by spanning repetitive regions.

1) Nature 2008 452:949-55.
2) Baylor College of Medicine Human Genome Sequencing Center 2012 https://www.hgsc.bcm.edu/software/
3) Genome Biology 2012 13:R56
4) BMC Bioinformatics 2013, 14(Suppl 7):S6

T. castaneum 3.0: Baylor Sanger 7x draft assembly and molecular genetic map. Length (Mb): 160.466; scaffolds: 2321; scaffold N50 (Mb): 0.98

T. castaneum 5.0: Illumina long distance jump-libraries extended scaffolds using Atlas gap-link (2) and Base Clear gapfiller (3). Length (Mb): 160.745; scaffolds: 2240; scaffold N50 (Mb): 1.16

T. castaneum 5.1: Sequence scaffolds were aligned to BNG maps with IrysView. The alignments were filtered and used to create new scaffolds. Length (Mb): 164.60; scaffolds: 2157; scaffold N50 (Mb): 4.00

Figure 1 Generation of consensus physical maps. a) Ultra long molecules (Mb) were nicked on one strand and labeled with fluorescent nucleotides. b) Individual molecules were imaged on a massively parallel scale in nanochannels printed on silicon chips. c) Consensus maps were assembled from populations of overlapping molecules.

Figure 2 Genome refinements using consensus physical maps

Figure 3 Super scaffolding alignments. T. castaneum assembly before (5.0) and after (5.1) super scaffolding. Assembly contigs (top in green) aligned with BNG consensus maps (bottom in blue) with BNG map molecule coverage in dark blue. Arrows indicate newly placed scaffolds.

Table 1 Super scaffolding of T. castaneum 5.0 with BNG consensus maps. Assembly metrics for T. castaneum assembly 5.0 and for T. castaneum assembly 5.1 after BNG maps were used to super scaffold

Figure 4 Comparative Genomics. Comparison of T. castaneum 5.1 (top in green) with BNG consensus maps for T. freemani (bottom in blue with BNG map molecule coverage in dark blue)
