Combining PacBio with short read technology for improved de novo genome assembly

Post on 10-May-2015

The best of both worlds: combining PacBio with short read technology for improved de novo genome assembly

Lex Nederbragt, NSC and CEES

lex.nederbragt@bio.uio.no

This talk

Why does everybody want longer reads?

… for genome assemblies

What is a genome assembly

reads

contigs

scaffolds

Hierarchical structure

Sequence data

Reads

http://www.cbcb.umd.edu/research/assembly_primer.shtml

Figure labels: original DNA, fragments, sequenced ends

Contigs

Building contigs

Figure labels: Repeat copy 1, Repeat copy 2, collapsed repeat consensus

Contig orientation? Contig order?

http://www.cbcb.umd.edu/research/assembly_primer.shtml

Mate pairs

Other read type

Figure labels: Repeat copy 1, Repeat copy 2

Mate pair reads: (much) longer fragments

Scaffolds

Ordered, oriented contigs

contigs

mate pairs

gap size estimate
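
The gap size estimate on this slide can be illustrated with a little arithmetic. The sketch below is not the method of any particular scaffolder: it assumes the mean insert size of the mate pair library is known, and that for each mate pair linking two contigs we know how much of the fragment lies on each contig. The function name and all numbers are made up for illustration.

```python
from statistics import mean

def estimate_gap(insert_size, links):
    """Estimate the gap (in bp) between a left and a right contig.

    insert_size: assumed mean fragment length of the mate pair library.
    links: one (on_left, on_right) tuple per mate pair, giving how many
           bases of the fragment lie on the left contig and on the right
           contig. Whatever is left of the fragment must sit in the gap.
    """
    return mean(insert_size - on_left - on_right for on_left, on_right in links)

# Three hypothetical mate pairs from a 3000 bp library (made-up values).
links = [(1200, 1500), (900, 1850), (1600, 1100)]
gap = estimate_gap(3000, links)
print(f"estimated gap: {gap:.0f} bp")
```

Averaging over many links matters because real insert sizes vary around the library mean; a single mate pair gives only a rough estimate.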

What is a genome assembly

reads

contigs

scaffolds

Hierarchical structure

Genome assembly

So, what’s so hard about it?

1) Repeats

Figure labels: Repeat copy 1, Repeat copy 2, collapsed repeat consensus

Repeats break up contigs

http://www.cbcb.umd.edu/research/assembly_primer.shtml

2) Heterozygosity

http://commons.wikimedia.org/wiki/File:Chromosome_1.svg

Differences between sister chromosomes

Figure labels: Contig 1, Polymorphic contig 2 (x2), Polymorphic contig 3 (x2), Contig 4

http://www.astraean.com/borderwars/wp-content/uploads/2012/04/heterozygoats.jpg and many other sites

3) Many programs to choose from

Zhang et al. (2011) doi:10.1371/journal.pone.0017915.g001

Assembly: challenges

Figure labels: Repeat copy 1, Repeat copy 2; Contig 1, Polymorphic contig 2 (x2), Polymorphic contig 3 (x2), Contig 4

Knowing how to use the programs

Heterozygosity

So, why does everybody want longer reads?

http://www.autobizz.com.my/forum/forum/General-Chat/944-The-worlds-longest-car.html

Longer reads?

Repeat copy 1 Repeat copy 2

Long reads can span repeats

Figure labels: Contig 1, Polymorphic contig 2 (x2), Polymorphic contig 3 (x2), Contig 4

and heterozygous regions

PacBio to the rescue?

High-throughput sequencing

Library preparation

Large insert sizes: single pass

Small insert sizes: multiple passes

Continued generations of reads

High-throughput sequencing

Raw read length

Large insert sizes: single pass

Small insert sizes: multiple passes

High-throughput sequencing

Raw reads and subreads

‘Subreads’

Large insert sizes: single pass

PacBio: uses

Long reads, low quality

Useful for assembly?

85-87% accuracy
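
To put the 85-87% figure in perspective, here is a small back-of-the-envelope sketch. It assumes errors are independent and uniformly spread along the read, which is a simplification and not a claim from the talk; the function names are made up for illustration.

```python
def expected_errors(read_length, accuracy):
    """Expected number of erroneous bases in a read of the given length."""
    return read_length * (1.0 - accuracy)

def error_free_kmer_probability(k, accuracy):
    """Chance that k consecutive bases are all correct (independent errors assumed)."""
    return accuracy ** k

for accuracy in (0.85, 0.87):
    errors = expected_errors(3000, accuracy)
    p20 = error_free_kmer_probability(20, accuracy)
    print(f"accuracy {accuracy:.2f}: ~{errors:.0f} errors per 3000 bp read, "
          f"P(error-free 20-mer) = {p20:.3f}")
```

Under these assumptions only a small fraction of 20-base stretches in a raw read is error-free, which is one way to see why raw reads at this accuracy are hard to assemble directly and why error correction comes up in the following slides.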

Solutions for assembly

Solutions for assembly (1)

Designed by Pacific Biosciences

http://www.clker.com/clipart-4245.html

Solutions for assembly (2)

Broad Institute

Need a special recipe for sequencing

Solutions for assembly (3)

PacBioToCA

Error correct with short reads

http://schatzlab.cshl.edu/presentations/2012-01-17.PAG.SMRTassembly.pdf

Celera assembler

PacBioToCA

Koren et al., 2012

Shameless self-promotion

flxlexblog.wordpress.com

Shameless self-promotion

@lexnederbragt

The Atlantic cod genome project

First draft

Fragmented assembly
- short contigs
- many gap bases

http://en.wikipedia.org

First draft

6467 scaffolds

35% gap bases

The causes

Short Tandem Repeats (>20% of gaps)

The causes

Figure labels: Contig 1, Polymorphic contig 2 (x2), Polymorphic contig 3 (x2), Contig 4

Heterozygosity?

23 pseudochromosomes

Below 5% gap bases

Longer contigs

The goal

PacBio to the rescue?

Large Insert Sizes

The approach

Libraries

Aim for looooong insert sizes

Large Insert Sizes Single pass

The approach

Sequencing

Sequence with 90 minute movies

10 x coverage in reads of at least 3000 bp

No, we don’t throw this away…

The approach

Error-correction

PacBio results

Figure: fraction of bases at minimum length, for three libraries (4 kb insert, 10 kb insert 1, 10 kb insert 2)

Large library insert size important!
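
The "fraction of bases at minimum length" statistic can be computed directly from a list of read lengths. The sketch below shows one way to do it; the function name is hypothetical and the read lengths are made up, not data from the talk.

```python
def fraction_of_bases_at_min_length(read_lengths, min_length):
    """Fraction of all sequenced bases that sit in reads of at least min_length."""
    total = sum(read_lengths)
    if total == 0:
        return 0.0
    long_enough = sum(n for n in read_lengths if n >= min_length)
    return long_enough / total

# Made-up read lengths in bp, for illustration only.
lengths = [500, 1500, 3000, 5000]
curve = {m: fraction_of_bases_at_min_length(lengths, m)
         for m in (0, 1000, 3000, 6000)}
for min_length, fraction in curve.items():
    print(f">= {min_length} bp: {fraction:.2f}")
```

Plotting this fraction against the minimum length for each library gives the kind of curve the slide describes: a library with longer reads keeps a larger fraction of its bases as the length cut-off rises.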

PacBio results

64 SMRT Cells

Large Insert Sizes

2.2 gigabases in longest subreads

Largest: 15 kbp

3.2 gigabases in raw reads of at least 3 kb: 3.8x coverage
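
Coverage is the total number of sequenced bases divided by the genome size. The sketch below reads the slide's 3.2 figure as 3.2 billion bases (an assumption) and works backwards to the genome size implied by 3.8x coverage, then to the amount of data the 10x target would need. The function names are made up for illustration, and the genome size is derived from the slide's numbers, not taken from a reference.

```python
def coverage(total_bases, genome_size):
    """Average number of times each genome position is sequenced."""
    return total_bases / genome_size

def implied_genome_size(total_bases, observed_coverage):
    """Genome size consistent with a given yield and coverage."""
    return total_bases / observed_coverage

raw_bases = 3.2e9  # assumed: 3.2 billion bases in raw reads of at least 3 kb
genome_size = implied_genome_size(raw_bases, 3.8)
bases_for_10x = 10 * genome_size

print(f"implied genome size: {genome_size / 1e6:.0f} Mbp")
print(f"bases needed for 10x: {bases_for_10x / 1e9:.1f} Gbp")
```

On these numbers the 64 SMRT Cells delivered a bit more than a third of the 10x target stated earlier in the talk.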

PacBio results

Mapping to the cod genome

11.4 kbp subread

10.9 kbp subread

10.6 kbp subread

Example 1

232 bp Gap

ACACAC repeat

TGTGTG repeat

PacBio reads

Scaffold

Unplaced contig

...ACACAC TGTGTG...

Example 2

TGTGTG repeat

344 bp Gap

PacBio reads

Scaffold

Heterozygosity?

...TGTGTG

Example 3

PacBio reads

Scaffold

300 bp misassembly?

Error-correction

http://openclipart.org/

Outlook

Will PacBio solve our problems?

Outlook

Or

Outlook

Will we find the heterozygous regions?

Figure labels: Contig 1, Polymorphic contig 2 (x2), Polymorphic contig 3 (x2), Contig 4

Outlook

http://www.pasteur.fr/recherche/unites/Bbi/, en.wikipedia.org and Martin Malmstrøm
