Combining PacBio with short read technology for improved de novo genome assembly
Post on 10-May-2015
The best of both worlds
Combining PacBio with short read technology for improved de novo genome assembly
Lex Nederbragt, NSC and CEES
lex.nederbragt@bio.uio.no
This talk
Why does everybody want longer reads?
… for genome assemblies
What is a genome assembly
reads
contigs
scaffolds
Hierarchical structure
Sequence data
Reads
http://www.cbcb.umd.edu/research/assembly_primer.shtml
reads
contigs
scaffolds
original DNA
fragments
original DNA
fragments
Sequenced ends
Contigs
Building contigs
reads
contigs
scaffolds
Repeat copy 1 Repeat copy 2
Collapsed repeat consensus
Contig orientation?
Contig order?
http://www.cbcb.umd.edu/research/assembly_primer.shtml
Mate pairs
Other read type
Repeat copy 1 Repeat copy 2
reads
contigs
scaffolds
mate pair reads: (much) longer fragments
Scaffolds
Ordered, oriented contigs
reads
contigs
scaffolds
contigs
mate pairs
gap size estimate
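The gap size estimate on this slide can be sketched in a few lines. This is a minimal illustration, not the method of any particular scaffolder: it assumes the library insert size is known, that each mate maps to a different contig, and that the distance from each mate to the contig end facing the gap has already been measured. The function names and the example numbers are hypothetical.

```python
import statistics


def estimate_gap(insert_size, dist_to_end_a, dist_to_end_b):
    """Gap implied by one mate pair that links two contigs.

    insert_size: expected length of the sequenced fragment (bp)
    dist_to_end_a: bp from the outer end of mate 1 to the end of
        contig A that faces the gap
    dist_to_end_b: the same for mate 2 on contig B
    """
    return insert_size - dist_to_end_a - dist_to_end_b


def estimate_gap_from_pairs(insert_size, pairs):
    """Median gap estimate over several linking mate pairs."""
    estimates = [estimate_gap(insert_size, a, b) for a, b in pairs]
    return statistics.median(estimates)


# Hypothetical numbers: a 3000 bp library and three linking pairs.
linking_pairs = [(1200, 1500), (1000, 1650), (1300, 1450)]
gap = estimate_gap_from_pairs(3000, linking_pairs)
```

A negative estimate would suggest that the two contigs overlap rather than being separated by a gap.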
What is a genome assembly
reads
contigs
scaffolds
Hierarchical structure
Genome assembly
So, what’s so hard about it?
1) Repeats
reads
contigs
scaffolds
Repeat copy 1 Repeat copy 2
Collapsed repeat consensus
Repeats break up contigs
http://www.cbcb.umd.edu/research/assembly_primer.shtml
2) Heterozygosity
http://commons.wikimedia.org/wiki/File:Chromosome_1.svg
Differences between homologous chromosomes
2) Heterozygosity
[Figure: Contig 1, Polymorphic contig 2 (two copies), Polymorphic contig 3 (two copies), Contig 4]
2) Heterozygosity
http://www.astraean.com/borderwars/wp-content/uploads/2012/04/heterozygoats.jpg and many other sites
3) Many programs to choose from
Zhang et al. (2011) doi:10.1371/journal.pone.0017915.g001
Assembly: challenges
Repeat copy 1 Repeat copy 2
[Figure: Contig 1, Polymorphic contig 2 (two copies), Polymorphic contig 3 (two copies), Contig 4]
Knowing how to use the programs
Heterozygosity
So, why does everybody want longer reads?
http://www.autobizz.com.my/forum/forum/General-Chat/944-The-worlds-longest-car.html
Longer reads?
Repeat copy 1 Repeat copy 2
Long reads can span repeats
[Figure: Contig 1, Polymorphic contig 2 (two copies), Polymorphic contig 3 (two copies), Contig 4]
and heterozygous regions
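Why read length matters for repeats can be made concrete with a small check. The sketch below assumes that a read resolves a repeat only if it covers the whole repeat plus some unique sequence on both sides; the `min_anchor` value of 50 bp is an arbitrary illustrative choice, and all coordinates are made up.

```python
def read_spans_repeat(read_start, read_len, repeat_start, repeat_len,
                      min_anchor=50):
    """True if the read covers the whole repeat plus at least
    min_anchor bp of flanking sequence on each side."""
    read_end = read_start + read_len
    repeat_end = repeat_start + repeat_len
    starts_before = read_start <= repeat_start - min_anchor
    ends_after = read_end >= repeat_end + min_anchor
    return starts_before and ends_after


# A 2000 bp repeat at position 1000 of a hypothetical genome.
short_read_ok = read_spans_repeat(950, 100, 1000, 2000)    # 100 bp read
long_read_ok = read_spans_repeat(500, 5000, 1000, 2000)    # 5000 bp read
```

Under these assumptions the 100 bp read ends inside the repeat and cannot place it, while the 5000 bp read anchors in unique sequence on both sides.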
PacBio to the rescue?
High-throughput sequencing
Library preparation
Large Insert Sizes Single pass
Small Insert Sizes
Multiple passes
Continued generations of reads
High-throughput sequencing
Raw read length
Large Insert Sizes Single pass
Small Insert Sizes
Multiple passes
High-throughput sequencing
Raw reads and subreads
‘Subreads’
Large Insert Sizes Single pass
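The relation between a raw read and its subreads can be illustrated as follows. In a PacBio library the insert is circularised with adapters, so a raw read that goes around the template more than once contains adapter sequence between successive passes over the insert. The sketch below is a deliberately naive picture of that: it assumes an error-free read and exact adapter matches, and the adapter string is a made-up placeholder, not the real adapter sequence. Real software has to find adapters in noisy reads.

```python
# Placeholder adapter for illustration only; not the real sequence.
ADAPTER = "GGGG"


def split_subreads(raw_read, adapter=ADAPTER):
    """Split a raw read into subreads at exact adapter matches."""
    return [piece for piece in raw_read.split(adapter) if piece]


def longest_subread(raw_read, adapter=ADAPTER):
    """Return the longest subread, or an empty string if there is none."""
    return max(split_subreads(raw_read, adapter), key=len, default="")


# Toy raw read: one full pass, the reverse-complement pass, a partial pass.
raw = "ACGTAC" + ADAPTER + "GTACGT" + ADAPTER + "ACG"
subreads = split_subreads(raw)
```

With a large insert and a single pass, the raw read contains no adapter and the only subread is the raw read itself.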
PacBio: uses
Long reads, low quality
Useful for assembly?
85-87% accuracy
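To get a feeling for what 85-87% accuracy means for a single read, the expected number of erroneous bases can be worked out directly. The sketch assumes errors are spread evenly along the read, which is a simplification, and the 3000 bp read length is only an example value.

```python
def expected_errors(read_length, accuracy):
    """Expected number of erroneous bases in a read of read_length bp."""
    return read_length * (1.0 - accuracy)


# Example: a 3000 bp read at the two ends of the quoted accuracy range.
errors_at_85 = round(expected_errors(3000, 0.85))
errors_at_87 = round(expected_errors(3000, 0.87))
```

At these rates a 3000 bp raw read carries several hundred errors, which is why the following slides look at ways to make such reads usable for assembly.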
Solutions for assembly
Solutions for assembly (1)
Designed by Pacific Biosciences
http://www.clker.com/clipart-4245.html
Solutions for assembly (2)
Broad Institute
Need a special recipe for sequencing
Solutions for assembly (3)
PacBioToCA
Error correct with short reads
http://schatzlab.cshl.edu/presentations/2012-01-17.PAG.SMRTassembly.pdf
Celera assembler
PacBioToCA
Koren et al., 2012
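The general idea of correcting a noisy long read with accurate short reads can be sketched as a per-position majority vote. This is a toy illustration of the principle only and is not the PacBioToCA algorithm: it assumes the short reads have already been aligned to the long read without gaps, so it can only fix substitutions, whereas real long-read errors include many insertions and deletions. The function name, the `min_votes` parameter and the example sequences are hypothetical.

```python
from collections import Counter


def correct_by_majority(long_read, aligned_short_reads, min_votes=2):
    """Replace each base of long_read by the most common short-read base
    at that position, if at least min_votes short reads agree on it.

    aligned_short_reads: list of (offset, sequence) tuples, where offset
    is the 0-based position of the short read on the long read.
    """
    votes = [Counter() for _ in long_read]
    for offset, seq in aligned_short_reads:
        for i, base in enumerate(seq):
            pos = offset + i
            if 0 <= pos < len(long_read):
                votes[pos][base] += 1

    corrected = []
    for original, counter in zip(long_read, votes):
        if counter:
            base, count = counter.most_common(1)[0]
            corrected.append(base if count >= min_votes else original)
        else:
            # No short read covers this position: keep the original base.
            corrected.append(original)
    return "".join(corrected)


# Toy example: the long read has one wrong base at position 4.
noisy = "ACGTTCGTAC"
short_reads = [(0, "ACGTAC"), (2, "GTACGT"), (4, "ACGTAC")]
fixed = correct_by_majority(noisy, short_reads)
```

Positions covered by fewer than `min_votes` short reads are left unchanged, so the quality of the result depends on short-read coverage along the whole long read.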
Shameless self-promotion
flxlexblog.wordpress.com
Shameless self-promotion
@lexnederbragt
The Atlantic cod genome project
First draft
Fragmented assembly
- short contigs
- many gap bases
http://en.wikipedia.org
First draft
6467 scaffolds
35% gap bases
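A gap-base percentage like the one on this slide can be computed from scaffold sequences in a few lines. The sketch assumes the common convention that gaps inside scaffolds are written as runs of `N`; the toy sequences are made up and have nothing to do with the cod assembly.

```python
def gap_fraction(scaffolds):
    """Fraction of all scaffold bases that are gap bases (N or n)."""
    total = sum(len(seq) for seq in scaffolds)
    gaps = sum(seq.upper().count("N") for seq in scaffolds)
    return gaps / total if total else 0.0


# Two toy scaffolds: 6 gap bases out of 20 bases in total.
toy_scaffolds = ["ACGTNNNNAC", "nnACGTACGT"]
fraction = gap_fraction(toy_scaffolds)
```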
The causes
Short Tandem Repeats (>20% of gaps)
The causes
[Figure: Contig 1, Polymorphic contig 2 (two copies), Polymorphic contig 3 (two copies), Contig 4]
Heterozygosity?
23 pseudochromosomes
Below 5% gap bases
Longer contigs
The goal
PacBio to the rescue?
Large Insert Sizes
The approach
Libraries
Aim for looooong insert sizes
Large Insert Sizes Single pass
The approach
Sequencing
Sequence with 90 minute movies
10 x coverage in reads of at least 3000 bp
No, we don’t throw this away…
The approach
Error-correction
PacBio results
Fraction of bases at minimum length
4kb insert
10kb insert 1
10kb insert 2
Large library insert size important!
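The quantity plotted on this slide, the fraction of bases found in reads of at least a given length, can be computed from a list of read lengths. The sketch below is one plausible way to do it; the read lengths and thresholds are invented example values, not data from the libraries shown.

```python
def fraction_of_bases_at_min_length(read_lengths, thresholds):
    """For each threshold, the fraction of all sequenced bases that sit
    in reads at least that long."""
    total = sum(read_lengths)
    if total == 0:
        return {t: 0.0 for t in thresholds}
    return {
        t: sum(length for length in read_lengths if length >= t) / total
        for t in thresholds
    }


# Invented read lengths in bp.
lengths = [1000, 2000, 3000, 4000]
curve = fraction_of_bases_at_min_length(lengths, [0, 3000, 5000])
```

Weighting by bases rather than by read count is what makes the long reads of a large-insert library stand out: a few long reads can hold most of the sequence.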
PacBio results
64 SMRT Cells
Large Insert Sizes
2.2 Gigabases in longest subreads
Largest 15 kbp
3.2 Gigabases in raw reads of at least 3 kb
3.8 x coverage
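The coverage figure on this slide follows from simple arithmetic: coverage is the number of sequenced bases divided by the genome size. The sketch below takes the 3.2 G figure as a count of bases and works backwards from the quoted 3.8 x to the genome size this implies. That genome size is derived here from the two slide numbers and is not stated in the talk.

```python
def coverage(total_bases, genome_size):
    """Average number of times each genome position is sequenced."""
    return total_bases / genome_size


def implied_genome_size(total_bases, fold_coverage):
    """Genome size that makes total_bases equal to fold_coverage x."""
    return total_bases / fold_coverage


raw_bases = 3.2e9      # raw reads of at least 3 kb, from the slide
quoted_coverage = 3.8  # from the slide
genome_size = implied_genome_size(raw_bases, quoted_coverage)
check = coverage(raw_bases, genome_size)
```

Under this reading the two numbers correspond to a genome of roughly 840 million bases.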
PacBio results
Mapping to the cod genome
11.4 kbp subread
10.9 kbp subread
10.6 kbp subread
Example 1
232 bp Gap
ACACAC repeat
TGTGTG repeat
Example 1
PacBio reads
Scaffold
Unplaced contig
...ACACAC TGTGTG...
Example 2
TGTGTG repeat
344 bp Gap
Example 2
PacBio reads
Scaffold
Heterozygosity?
...TGTGTG
Example 3
PacBio reads
Scaffold
300 bp misassembly?
Error-correction
http://openclipart.org/
Outlook
Will PacBio solve our problems?
Outlook
Or
Outlook
Will we find the heterozygous regions?
[Figure: Contig 1, Polymorphic contig 2 (two copies), Polymorphic contig 3 (two copies), Contig 4]
Outlook
http://www.pasteur.fr/recherche/unites/Bbi/
en.wikipedia.org
and Martin Malmstrøm