Exploiting long read sequencing technology to build a substantially improved pig reference genome sequence Alan Archibald The Roslin Institute and R(D)SVS University of Edinburgh
Draft reference pig genome sequence
Swine Genome Sequencing Consortium
Hybrid Shotgun Sequencing Strategy
• Whole-genome shotgun reads
• BAC shotgun reads
– Minimal set of overlapping BACs selected from physical map
• Combine overlapping whole-genome and BAC-derived reads
• Assemble clone sequences to represent chromosomes and annotate using Ensembl automated pipeline
• Sequence assembly
Sscrofa10.2 – chromosome assigned scaffolds only

Chromosomes 1-18, X, Y      Length (bp)
Contigs N50                      80,720
Contigs N90                      13,487
Average contig length            31,604
Largest contig length         1,598,650
Scaffold N50                    637,332
Scaffold N90                    189,449
Average scaffold length         436,176
Largest scaffold length       3,862,550
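The N50 and N90 figures above follow the usual definition: the length L such that contigs (or scaffolds) of length L or longer hold at least 50% (or 90%) of the assembled bases. A minimal sketch of that calculation is below; the function name `nx` and the toy lengths are illustrative assumptions, not Sscrofa10.2 data.

```python
def nx(lengths, fraction):
    """Return the Nx length: the largest L such that sequences of
    length >= L together hold at least `fraction` of all bases."""
    if not lengths:
        raise ValueError("no lengths given")
    target = fraction * sum(lengths)
    running = 0
    # Walk from the longest sequence down, accumulating bases.
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= target:
            return length


# Toy contig lengths (hypothetical, for illustration only).
toy_lengths = [100, 80, 60, 40, 20]
toy_n50 = nx(toy_lengths, 0.5)
toy_n90 = nx(toy_lengths, 0.9)
```

For the toy list the total is 300 bases, so N50 is reached at the second-longest contig and N90 at the fourth.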
BAC Contigs / Fragments
• (Paired) end sequences of subclone libraries
– 768 subclones / BAC
– Average read: 707 bp
• phrap
• Create fragment chains
• Submission to EMBL/GenBank
[Figure: BAC fragments A-G grouped into fragment chain 1 and fragment chain 2, with runs of N separating fragments in the submitted sequence]
Limitations of Sscrofa10.2
• Missing coverage ~10%
– Poorly captured in unplaced scaffolds
• Local scaffolding issues
– Order & orientation of sequence contigs within BACs not resolved unambiguously
– No BAC clone sequence assigned to > 1 scaffold
• Unresolved redundancy from overlapping BAC clones
• Project memory loss
– e.g. unplaced FPC contigs listed at end of q-arm
http://geval.sanger.ac.uk/PGP_pig_10_2/Info/Index
Sscrofa10.2 – QC
• Illumina PE reads from same pig mapped to Sscrofa10.2
• Looked for indicators of structural variation
– including high/low coverage, incorrect orientation and abnormal insert sizes
• Looked for homozygous variants
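One of the indicators listed above, unusually high or low coverage, can be illustrated with a simple window-based check. This is only a sketch under stated assumptions: per-window read depths are taken as already computed, the thresholds (half and double the median depth) are arbitrary illustrative choices, and the function name is hypothetical. It is not the QC pipeline used for Sscrofa10.2.

```python
from statistics import median


def flag_coverage_windows(depths, low_ratio=0.5, high_ratio=2.0):
    """Label each window 'low', 'high' or 'ok' relative to the median depth.

    The ratio thresholds are illustrative assumptions, not values
    taken from the presentation.
    """
    typical = median(depths)
    labels = []
    for depth in depths:
        if depth < low_ratio * typical:
            labels.append("low")    # possible collapsed gap or deletion
        elif depth > high_ratio * typical:
            labels.append("high")   # possible collapsed repeat or duplication
        else:
            labels.append("ok")
    return labels


# Hypothetical mean depths for seven consecutive windows.
window_depths = [30, 31, 29, 5, 30, 95, 28]
window_labels = flag_coverage_windows(window_depths)
```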
Sscrofa10.2 – Chr 1
De novo genome assemblies using Pacific Biosciences long read technology
TJ Tabasco (Duroc 2-14): Duroc sow
MARC1423004: Duroc/Landrace/Yorkshire barrow
PacBio – draft WGS assembly
• Duroc 2-14 (same pig as most of Sscrofa10.2)
• 65x genome coverage
• Pacific Biosciences P6 chemistry
• Length cut-off for reads for assembly 13 kbp
• Coverage of corrected reads for assembly 19x
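A read length cut-off and a coverage figure are linked: keeping only the longest reads until they sum to a chosen multiple of the genome size fixes the shortest read that is kept. The sketch below shows that relationship on toy numbers. It is an assumption about how such a cut-off can be derived in general, not a description of how the 13 kbp value was chosen here, and `length_cutoff` is a hypothetical helper.

```python
def length_cutoff(read_lengths, genome_size, target_coverage):
    """Return the shortest read length kept when the longest reads are
    accumulated until they reach target_coverage x genome_size bases.

    Returns None if all reads together fall short of the target.
    """
    needed = target_coverage * genome_size
    running = 0
    for length in sorted(read_lengths, reverse=True):
        running += length
        if running >= needed:
            return length
    return None


# Toy example (hypothetical): 10 kb "genome", 4x coverage wanted.
toy_reads = [15_000, 14_000, 13_000, 9_000, 5_000, 2_000]
toy_cutoff = length_cutoff(toy_reads, genome_size=10_000, target_coverage=4)
```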
Contig QC
Variants
• Homozygous SNPs:
– Sscrofa10.2: 415,056
– PacBio contigs: 34,545
• Homozygous indels:
– Sscrofa10.2: 168,037
– PacBio contigs: 1,729,510
Scaffolding
• Scaffold by mapping contigs to Sscrofa10.2
– using Nucmer
– Assume Sscrofa10.2 gross structure is correct
• Radiation Hybrid and Linkage maps, 60K SNPs
• FPC physical map
• 2.36 Gb ungapped length
• 434 contigs
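Reference-guided scaffolding of this kind amounts to giving each contig a chromosome, a position and a strand from its best alignment, then sorting. The sketch below assumes the alignments have already been reduced to one best hit per contig (for example, from parsed Nucmer output); the tuple layout and the contig names are hypothetical, and no MUMmer call is made.

```python
from collections import defaultdict


def order_contigs(best_hits):
    """Group contigs by reference chromosome and sort by reference start.

    best_hits: iterable of (contig, ref_chrom, ref_start, strand) tuples,
    one per contig (an assumed, pre-parsed input format).
    Returns {ref_chrom: [(contig, strand), ...]} in reference order.
    """
    by_chrom = defaultdict(list)
    for contig, chrom, start, strand in best_hits:
        by_chrom[chrom].append((start, contig, strand))
    return {
        chrom: [(contig, strand) for _, contig, strand in sorted(hits)]
        for chrom, hits in by_chrom.items()
    }


# Hypothetical best hits for three contigs.
toy_hits = [
    ("ctgB", "6", 5_000_000, "-"),
    ("ctgA", "6", 10_000, "+"),
    ("ctgC", "1", 200, "+"),
]
toy_layout = order_contigs(toy_hits)
```

A "-" strand in the layout means the contig would be reverse-complemented when placed. The approach inherits any misassembly in the reference, which is why the independent maps listed above matter.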
Chromosome 6
Gap Filling
• Gap filling was done using PBJelly
• Further gaps filled using large finished BACs from Sscrofa10.2 assembly
– 7 had large sequenced BAC contigs crossing them
– We sequenced 5 more
• Plus manual placing of some fiddly contigs
• 181 gaps remaining
• N50 increased to 35.8 Mb #35MbCtgClub
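Remaining gaps in a scaffolded assembly are conventionally represented as runs of N, so a gap count can be checked directly from the sequence. A minimal sketch, assuming gaps are encoded as one or more consecutive N characters; the helper name and the toy sequence are illustrative.

```python
import re


def find_gaps(sequence, min_len=1):
    """Return (start, end) half-open coordinates of each run of N
    that is at least min_len bases long."""
    return [
        (m.start(), m.end())
        for m in re.finditer(r"[Nn]+", sequence)
        if m.end() - m.start() >= min_len
    ]


# Toy scaffold (hypothetical) with two gaps.
toy_scaffold = "ACGTNNNNACGTNNACGT"
toy_gaps = find_gaps(toy_scaffold)
```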
Targeted gap closure: CH242-323K10
Targeted gap closure: CH242-284F8
Sequencing Additional BACs
• 5 BACs with ends that appear to cross gaps in the assembly
– Sequenced using the MinION and assembled into individual contigs using Canu
– Polished using Pilon
• Mapping of the assembled BAC contigs to the scaffolds showed they could be placed in their expected regions
• Potential to fill 129 more gaps in this way
#porecamp
Error Correction
• Arrow (succeeds Quiver)
– Using PacBio reads to error correct assembled sequence
– Reduced homozygous SNPs from 34,545 to 27,018
– Reduced homozygous indels from 1,729,510 to 1,036,696
• Pilon (currently running)
– Using Illumina mate pair and Illumina paired end libraries
– Can detect and correct SNPs and indels, structural abnormalities, plus potential for gap filling
– Expecting to reduce the remaining false variants
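The Arrow figures above can be put on a common scale as percentage reductions. The arithmetic below uses only the counts quoted on the slide; the helper name is an illustrative assumption.

```python
def percent_reduction(before, after):
    """Percentage drop from `before` to `after`."""
    return 100.0 * (before - after) / before


# Counts as quoted for the Arrow step.
snp_drop = percent_reduction(34_545, 27_018)          # homozygous SNPs
indel_drop = percent_reduction(1_729_510, 1_036_696)  # homozygous indels
```

This works out to roughly a 22% drop in homozygous SNPs and roughly a 40% drop in homozygous indels, leaving indels as the dominant remaining error class.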
Evaluate
• Order and orientation wrt RH map
• Order, orientation, distance between paired ends
– CH242 BAC ends
– Fosmid ends
– Illumina mate pairs (5-7 kbp, 9-11 kbp)
– Illumina paired ends (500-660 bp)
• Gene models
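The paired-end checks above reduce to two questions per pair: do the two ends face each other, and is the distance between them within the range expected for that library? The sketch below is a simplified illustration. It assumes all pairs have been normalised to forward-reverse orientation, treats the positions as the outer coordinates of the fragment, and takes the Illumina size ranges from the slide; the names `LIBRARIES` and `classify_pair` are hypothetical.

```python
# Expected insert size ranges (bp) for the Illumina libraries listed above.
LIBRARIES = {
    "mate_pair_5_7k": (5_000, 7_000),
    "mate_pair_9_11k": (9_000, 11_000),
    "paired_end": (500, 660),
}


def classify_pair(pos1, strand1, pos2, strand2, min_insert, max_insert):
    """Classify one mapped pair, assuming forward-reverse orientation
    is expected (leftmost end on '+', rightmost end on '-')."""
    if strand1 == strand2:
        return "wrong_orientation"
    left_strand = strand1 if pos1 <= pos2 else strand2
    if left_strand != "+":
        return "wrong_orientation"  # ends point away from each other
    span = abs(pos2 - pos1)
    if span < min_insert:
        return "too_close"
    if span > max_insert:
        return "too_far"
    return "concordant"


# Hypothetical pairs checked against the 500-660 bp paired-end library.
lo, hi = LIBRARIES["paired_end"]
example_ok = classify_pair(1_000, "+", 1_600, "-", lo, hi)
example_far = classify_pair(1_000, "+", 3_000, "-", lo, hi)
example_flip = classify_pair(1_000, "-", 1_600, "+", lo, hi)
```

Clusters of "too_far", "too_close" or "wrong_orientation" pairs at one locus point to a local order, orientation or gap-size problem rather than random mapping noise.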
BAC end sequence alignments – orientation & insert size
BBS4
IGF2
CFTR – ST7
ST7
Sscrofa11 - a new pig reference genome sequence worthy of adoption by the GRC
Alan Archibald
The Roslin Institute and R(D)SVS
University of Edinburgh
Adding pig genome to GRC
• High quality, highly contiguous genome
• Resources for gap closure
– Isogenic BAC library CHORI242, ends sequenced
– Isogenic fosmid library WTSI_1005, ends sequenced
• User communities, incl. SGSC, FAANG
• Funding
– BBSRC strategic funding (The Roslin Institute)
– BBSRC BBR Ensembl
– COST Action CA15112 (FAANG-Europe)
Acknowledgements
• Roslin Institute: Amanda Warr, Mick Watson, David Hume, Heather Finlayson, Christine Burkard, Lel Eory, Richard Talbot, John Hickey
• PacBio: Richard Hall, Jason Chin, Harold Lee, Regina Lam, Kirsti Kim, Jim Burrows
• USDA: Tim Smith, Derek Bickhart, Ben Rosen, Steve Schroeder
• gEVAL: Will Chow, Kerstin Howe
• Other: Sergey Koren, Chris Warkup, Swine Genome Sequencing Consortium
@AlanArchibald51
MARC BARC
@FAANGEurope