Exploiting long read sequencing technology to build a substantially improved pig reference genome sequence
Alan Archibald, The Roslin Institute and R(D)SVS, University of Edinburgh

Jan 17, 2017

Transcript
Page 1: Exploiting long read sequencing technology to build a substantially improved pig reference genome sequence

Exploiting long read sequencing technology to build a substantially improved pig reference genome sequence

Alan Archibald

The Roslin Institute and R(D)SVS, University of Edinburgh

Page 2: Exploiting long read sequencing technology to build a substantially improved pig reference genome sequence

Draft reference pig genome sequence

Swine Genome Sequencing Consortium

Page 3: Exploiting long read sequencing technology to build a substantially improved pig reference genome sequence

Hybrid Shotgun Sequencing Strategy

[Flow diagram]
• Minimal set of overlapping BACs selected from physical map
• BAC shotgun reads
• Whole-genome shotgun reads
• Sequence assembly: combine overlapping whole-genome and BAC-derived reads
• Assemble clone sequences to represent chromosomes and annotate using Ensembl automated pipeline

Page 4: Exploiting long read sequencing technology to build a substantially improved pig reference genome sequence

Sscrofa10.2 – chromosome assigned scaffolds only

Chromosomes 1-18, X, Y        Length (bp)
Contigs N50                        80,720
Contigs N90                        13,487
Average contig length              31,604
Largest contig length           1,598,650
Scaffold N50                      637,332
Scaffold N90                      189,449
Average scaffold length           436,176
Largest scaffold length         3,862,550
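The N50 and N90 figures in the table follow the usual contiguity definition: sort the sequences from longest to shortest and report the length at which the running total first reaches 50% (or 90%) of the assembly. A minimal sketch of that calculation is given here; the function name `nx` and the toy lengths are illustrative assumptions, not values or code from the Sscrofa10.2 project.

```python
def nx(lengths, fraction=0.5):
    """Return the Nx length for a list of sequence lengths.

    Sequences are visited from longest to shortest; the result is the length
    of the sequence at which the running total first reaches `fraction` of
    the total assembly length (fraction=0.5 gives N50, 0.9 gives N90).
    """
    if not lengths:
        raise ValueError("no sequence lengths given")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= fraction * total:
            return length


# Hypothetical contig lengths in bp (toy data, not from the assembly).
toy_contigs = [100, 80, 60, 40, 20]
n50 = nx(toy_contigs, 0.5)
n90 = nx(toy_contigs, 0.9)
print("N50:", n50, "N90:", n90)
```

Because N50 is weighted by length, a handful of very long contigs can raise it sharply, which is why it is reported alongside the average and largest lengths.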

Page 5: Exploiting long read sequencing technology to build a substantially improved pig reference genome sequence

BAC Contigs / Fragments

• (Paired) end sequences of subclone libraries
– 768 subclones / BAC
– Average read: 707 bp
• phrap
• Create fragment chains
• Submission to EMBL/GenBank

[Diagram: fragments A to G; unordered fragments (G A C B E F D) separated by runs of N; fragment chain 1 and fragment chain 2; ordered fragments A B C D E F G]

Page 6: Exploiting long read sequencing technology to build a substantially improved pig reference genome sequence

Limitations of Sscrofa10.2
• Missing coverage ~10%
– Poorly captured in unplaced scaffolds
• Local scaffolding issues
– Order & orientation of sequence contigs within BACs not resolved unambiguously
– No BAC clone sequence assigned to > 1 scaffold
• Unresolved redundancy from overlapping BAC clones
• Project memory loss
– e.g. unplaced FPC contigs listed at end of q-arm

Page 7: Exploiting long read sequencing technology to build a substantially improved pig reference genome sequence

http://geval.sanger.ac.uk/PGP_pig_10_2/Info/Index

Page 8: Exploiting long read sequencing technology to build a substantially improved pig reference genome sequence
Page 9: Exploiting long read sequencing technology to build a substantially improved pig reference genome sequence

Sscrofa10.2 – QC

• Illumina PE reads from same pig mapped to Sscrofa10.2
• Looked for indicators of structural variation
– including high/low coverage, incorrect orientation and abnormal insert sizes
• Looked for homozygous variants
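The orientation and insert-size checks listed above can be illustrated with a small sketch. It assumes an inward-facing paired-end library (leftmost read on the '+' strand, rightmost on '-'), a hypothetical record layout of positions and strands, and made-up thresholds of 300 to 700 bp; none of these details come from the talk, and a real analysis would read them from the alignment file.

```python
from collections import Counter


def classify_pair(pos1, strand1, pos2, strand2, min_insert=300, max_insert=700):
    """Classify one read pair whose mates map to the same sequence.

    pos1/pos2 are leftmost mapping positions, strands are '+' or '-'.
    The distance between the two leftmost positions is used as a crude
    stand-in for insert size (it ignores read length).
    """
    # Order the two mates by mapping position.
    (left_pos, left_strand), (right_pos, right_strand) = sorted(
        [(pos1, strand1), (pos2, strand2)]
    )
    if (left_strand, right_strand) != ("+", "-"):
        return "wrong_orientation"
    span = right_pos - left_pos
    if span < min_insert:
        return "insert_too_small"
    if span > max_insert:
        return "insert_too_large"
    return "ok"


# Hypothetical mapped pairs: (pos1, strand1, pos2, strand2).
toy_pairs = [
    (1000, "+", 1450, "-"),    # expected orientation and spacing
    (5000, "+", 5100, "-"),    # mates too close together
    (9000, "+", 30000, "-"),   # mates too far apart
    (12000, "-", 12400, "+"),  # mates point away from each other
    (20000, "+", 20500, "+"),  # both mates on the same strand
]
counts = Counter(classify_pair(*pair) for pair in toy_pairs)
print(dict(counts))
```

Clusters of pairs in the non-"ok" classes at one locus, rather than isolated pairs, are what would point to a likely misassembly.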

Page 10: Exploiting long read sequencing technology to build a substantially improved pig reference genome sequence

Sscrofa10.2 – Chr 1

Page 11: Exploiting long read sequencing technology to build a substantially improved pig reference genome sequence

De novo genome assemblies using Pacific Biosciences long read technology

• TJ Tabasco (Duroc 2-14): Duroc sow
• MARC1423004: Duroc/Landrace/Yorkshire barrow

Page 12: Exploiting long read sequencing technology to build a substantially improved pig reference genome sequence

PacBio – draft WGS assembly
• Duroc 2-14 (same pig as most of Sscrofa10.2)
• 65x genome coverage
• Pacific Biosciences P6 chemistry
• Length cut-off for reads for assembly: 13 kbp
• Coverage of corrected reads for assembly: 19x
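Fold coverage is total sequenced bases divided by genome size, so applying a read-length cut-off lowers the coverage that reaches the assembler. The sketch below shows only that arithmetic, on toy numbers; the function name, read lengths and genome size are assumptions for illustration. The 19x on the slide refers to corrected reads, which this sketch does not model.

```python
def fold_coverage(read_lengths, genome_size, min_length=0):
    """Return fold coverage from reads at least `min_length` bases long."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    kept_bases = sum(n for n in read_lengths if n >= min_length)
    return kept_bases / genome_size


# Hypothetical read lengths (bp) and a toy genome size, not real data.
toy_reads = [5_000, 12_000, 13_000, 20_000, 30_000]
toy_genome_size = 10_000

raw_cov = fold_coverage(toy_reads, toy_genome_size)
filtered_cov = fold_coverage(toy_reads, toy_genome_size, min_length=13_000)
print(f"all reads: {raw_cov:.1f}x, reads >= 13 kbp: {filtered_cov:.1f}x")
```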

Page 13: Exploiting long read sequencing technology to build a substantially improved pig reference genome sequence

Contig QC

Page 14: Exploiting long read sequencing technology to build a substantially improved pig reference genome sequence

Variants

• Homozygous SNPs:
– Sscrofa10.2: 415,056
– PacBio contigs: 34,545

• Homozygous indels:
– Sscrofa10.2: 168,037
– PacBio contigs: 1,729,510

Page 15: Exploiting long read sequencing technology to build a substantially improved pig reference genome sequence

Scaffolding

• Scaffold by mapping contigs to Sscrofa10.2
– using Nucmer
– Assume Sscrofa10.2 gross structure is correct
• Radiation Hybrid and Linkage maps, 60K SNPs
• FPC physical map
• 2.36 Gb ungapped length
• 434 contigs
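Reference-guided scaffolding of this kind amounts to placing each contig at the position of its best alignment and reading off order and orientation per chromosome. The sketch below assumes the best hit per contig has already been extracted from the aligner output into simple tuples; the tuple layout, contig names and coordinates are hypothetical, and Nucmer output is not parsed here.

```python
from collections import defaultdict


def order_contigs(best_hits):
    """Order and orient contigs along each reference chromosome.

    best_hits: iterable of (contig, ref_chrom, ref_start, strand) tuples,
    one per contig. Returns {ref_chrom: [(contig, strand), ...]} sorted by
    reference start position.
    """
    by_chrom = defaultdict(list)
    for contig, chrom, start, strand in best_hits:
        by_chrom[chrom].append((start, contig, strand))
    return {
        chrom: [(contig, strand) for _start, contig, strand in sorted(hits)]
        for chrom, hits in by_chrom.items()
    }


# Hypothetical best alignments (toy data).
toy_hits = [
    ("ctg3", "chr6", 9_000_000, "-"),
    ("ctg1", "chr6", 100_000, "+"),
    ("ctg2", "chr1", 50_000, "+"),
    ("ctg4", "chr6", 4_500_000, "+"),
]
layout = order_contigs(toy_hits)
print(layout["chr6"])
```

This approach inherits any large-scale errors in the reference, which is why the independent maps listed above are useful as a cross-check.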

Page 16: Exploiting long read sequencing technology to build a substantially improved pig reference genome sequence

Chromosome 6

Page 17: Exploiting long read sequencing technology to build a substantially improved pig reference genome sequence

Chromosome 6

Page 18: Exploiting long read sequencing technology to build a substantially improved pig reference genome sequence

Gap Filling

• Gap filling was done using PBJelly
• Further gaps filled using large finished BACs from Sscrofa10.2 assembly
– 7 had large sequenced BAC contigs crossing them
– We sequenced 5 more
• Plus manual placing of some fiddly contigs
• 181 gaps remaining
• N50 increased to 35.8 Mb #35MbCtgClub
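Remaining gaps can be counted directly from the scaffold sequence when gaps are represented as runs of N, which is the common FASTA convention and is assumed here. A minimal sketch with a toy sequence follows; the function name is illustrative.

```python
import re


def find_gaps(sequence, min_run=1):
    """Return (start, end) half-open coordinates of each run of N/n."""
    return [
        (m.start(), m.end())
        for m in re.finditer(r"[Nn]+", sequence)
        if m.end() - m.start() >= min_run
    ]


# Toy scaffold with two gaps (not real sequence).
toy_scaffold = "ACGTNNNNACGTACGTNNACGT"
gaps = find_gaps(toy_scaffold)
gap_bases = sum(end - start for start, end in gaps)
ungapped_length = len(toy_scaffold) - gap_bases
print(len(gaps), "gaps,", ungapped_length, "ungapped bases")
```

The same pass gives both the gap count and the ungapped length, the two figures quoted for the scaffolded assembly.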

Page 19: Exploiting long read sequencing technology to build a substantially improved pig reference genome sequence

Targeted gap closure: CH242-323K10

Page 20: Exploiting long read sequencing technology to build a substantially improved pig reference genome sequence

Targeted gap closure: CH242-284F8

Page 21: Exploiting long read sequencing technology to build a substantially improved pig reference genome sequence

Targeted gap closure: CH242-284F8

Page 22: Exploiting long read sequencing technology to build a substantially improved pig reference genome sequence

Sequencing Additional BACs

• 5 BACs with ends that appear to cross gaps in the assembly
– Sequenced using the MinION and assembled into individual contigs using Canu
– Polished using Pilon

• Mapping of the assembled BAC contigs to the scaffolds showed they could be placed in their expected regions

• Potential to fill 129 more gaps in this way

#porecamp

Page 23: Exploiting long read sequencing technology to build a substantially improved pig reference genome sequence

Error Correction

• Arrow (succeeds Quiver)
– Using PacBio reads to error correct assembled sequence
– Reduced homozygous SNPs from 34,545 to 27,018
– Reduced homozygous indels from 1,729,510 to 1,036,696

• Pilon (currently running)
– Using Illumina mate pair and Illumina paired end libraries
– Can detect and correct SNPs and indels, structural abnormalities, plus potential for gap filling
– Expecting to reduce the remaining false variants
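To put the Arrow figures in proportion, the before and after counts quoted above can be turned into percentage reductions. The counts in this sketch are taken from the slide; the helper name is illustrative.

```python
def percent_reduction(before, after):
    """Return the percentage by which `after` is lower than `before`."""
    if before <= 0:
        raise ValueError("before must be positive")
    return 100.0 * (before - after) / before


# Homozygous variant counts before and after Arrow, as quoted on the slide.
snp_reduction = percent_reduction(34_545, 27_018)
indel_reduction = percent_reduction(1_729_510, 1_036_696)
print(f"homozygous SNPs reduced by {snp_reduction:.1f}%")
print(f"homozygous indels reduced by {indel_reduction:.1f}%")
```

This works out to roughly a fifth of the homozygous SNPs and about 40% of the homozygous indels, leaving indels as the dominant remaining error class.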

Page 24: Exploiting long read sequencing technology to build a substantially improved pig reference genome sequence

Evaluate
• Order and orientation wrt RH map
• Order, orientation, distance between paired ends
– CH242 BAC ends
– Fosmid ends
– Illumina mate pairs (5-7 kbp, 9-11 kbp)
– Illumina paired ends (500-660 bp)

• Gene models
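One simple way to express agreement between assembly order and an independent map such as the RH map is to list each marker's map rank in assembly order and count how often adjacent markers are also in ascending map order. This is an illustrative measure only, with toy ranks; it is not the evaluation method used for the assembly.

```python
def adjacent_order_agreement(map_ranks):
    """Fraction of adjacent marker pairs whose map ranks increase.

    map_ranks: map rank of each marker, listed in assembly order.
    A value of 1.0 means the assembly order fully matches the map order.
    """
    if len(map_ranks) < 2:
        raise ValueError("need at least two markers")
    pairs = list(zip(map_ranks, map_ranks[1:]))
    ascending = sum(1 for a, b in pairs if b > a)
    return ascending / len(pairs)


# Toy example: markers 4-6 sit in an inverted block relative to the map.
toy_ranks = [1, 2, 3, 6, 5, 4, 7, 8]
score = adjacent_order_agreement(toy_ranks)
print(f"{score:.2f} of adjacent pairs agree with the map")
```

A run of consecutive descending pairs, as in the toy data, is the signature of an inverted segment rather than scattered misplacements.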

Page 25: Exploiting long read sequencing technology to build a substantially improved pig reference genome sequence

BAC end sequence alignments – orientation & insert size

Page 26: Exploiting long read sequencing technology to build a substantially improved pig reference genome sequence

BBS4

Page 27: Exploiting long read sequencing technology to build a substantially improved pig reference genome sequence

IGF2

Page 28: Exploiting long read sequencing technology to build a substantially improved pig reference genome sequence

CFTR – ST7

Page 29: Exploiting long read sequencing technology to build a substantially improved pig reference genome sequence

ST7

Page 30: Exploiting long read sequencing technology to build a substantially improved pig reference genome sequence

Sscrofa11 - a new pig reference genome sequence worthy of adoption by the GRC

Alan Archibald

The Roslin Institute and R(D)SVS, University of Edinburgh

Page 31: Exploiting long read sequencing technology to build a substantially improved pig reference genome sequence

Adding pig genome to GRC
• High quality, highly contiguous genome
• Resources for gap closure
– Isogenic BAC library CHORI242, ends sequenced
– Isogenic fosmid library WTSI_1005, ends sequenced
• User communities, incl. SGSC, FAANG
• Funding
– BBSRC strategic funding (The Roslin Institute)
– BBSRC BBR Ensembl
– COST Action CA15112 (FAANG-Europe)

Page 32: Exploiting long read sequencing technology to build a substantially improved pig reference genome sequence

Acknowledgements

• Roslin Institute
– Amanda Warr
– Mick Watson
– David Hume
– Heather Finlayson
– Christine Burkard
– Lel Eory
– Richard Talbot
– John Hickey

• PacBio
– Richard Hall
– Jason Chin
– Harold Lee
– Regina Lam
– Kirsti Kim
– Jim Burrows

• USDA
– Tim Smith
– Derek Bickhart
– Ben Rosen
– Steve Schroeder

• gEVAL
– Will Chow
– Kerstin Howe

• Other
– Sergey Koren
– Chris Warkup
– Swine Genome Sequencing Consortium

MARC BARC

@AlanArchibald51
@FAANGEurope