MICHAEL STRÖMBERG Boston College Data Club April 2008
Transcript
Page 1: Design Goals Crash Course: Reference-guided Assembly.

MICHAEL STRÖMBERG, Boston College Data Club

April 2008

Page 2
Page 3

Design Goals

Page 4

Crash Course: Reference-guided Assembly

Page 5

Crash Course: Reference-guided Assembly

Page 6

Crash Course: Reference-guided Assembly

Page 7

Sequencing Technologies

[Figure: sequencing technologies; the only surviving label is "future".]

Page 8

Next-Gen Sequence Lengths

[Figure: sequence length (bp) by sequencing technology, showing max, mean, and min. First panel: Capillary (Sanger) and Roche 454 FLX, axis 0 to 1600 bp. Second panel: Illumina, AB SOLiD, and Helicos, axis 0 to 80 bp.]

Page 9
Page 10

[Figure: Unique Genome Coverage (H. sapiens). x-axis: sequence length, 3 to 33; y-axis: unique genome coverage, 0% to 100%.]
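The idea behind this slide, that longer sequences land in unique places more often, can be tried on a toy sequence. The sketch below is not how the figure was produced: the function name `unique_fraction` and the toy string are made up for illustration, and it ignores the reverse strand and sequencing errors.

```python
from collections import Counter

def unique_fraction(genome: str, k: int) -> float:
    """Fraction of length-k windows whose sequence occurs exactly once."""
    windows = [genome[i:i + k] for i in range(len(genome) - k + 1)]
    if not windows:
        return 0.0
    counts = Counter(windows)
    unique = sum(1 for w in windows if counts[w] == 1)
    return unique / len(windows)

# Hypothetical toy sequence, not real genome data.
toy = "ACGTACGTTTGACCA"
for k in (2, 4, 6):
    print(k, round(unique_fraction(toy, k), 2))
```

On a real genome the same count would be done with a memory-efficient k-mer counter rather than a Python list.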

Page 11

Mixing It Up: Paired-end Reads

[Figure: histogram of read pairs (count, 0 to 1800) by fragment length (bp, 0 to 350).]

Page 12
Page 13

How Does It Work?

Page 14

How Does It Work?

Page 15
Page 16

C. elegans: a case for INDELs

SPEED
100 million Illumina reads
Alignment time: 93 min (17,800 reads/s)
Assembly time: 100 min

INDELS
INDEL validation rate: 89.3% (216)
SNP validation rate: 97.8% (229)
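The alignment rate quoted above can be checked with a little arithmetic. This is only a sanity check using the numbers on the slide; the variable and function names are illustrative. The "100 million" read count is presumably rounded, which would explain the small gap between the recomputed and the reported rate.

```python
reads = 100_000_000      # "100 million Illumina reads" (rounded on the slide)
align_minutes = 93       # alignment time from the slide
reported_rate = 17_800   # reads/s, as reported on the slide

def reads_per_second(n_reads: int, minutes: float) -> float:
    """Average throughput over the whole alignment run."""
    return n_reads / (minutes * 60)

rate = reads_per_second(reads, align_minutes)
# Read count that would give exactly the reported rate in 93 minutes.
implied_reads = reported_rate * align_minutes * 60

print(f"{rate:,.0f} reads/s from the rounded read count")
print(f"{implied_reads:,} reads implied by the reported rate")
```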

Page 17

P. stipitis: Co-assembly

Capillary

454 FLX

454 GS20

Illumina

Page 18

Scaling Up

[Figure: reference sequence length (bp, log scale from 10,000 to 10,000,000,000) by project date (Dec-05 to Jun-08). Projects shown: C. elegans, M. musculus, H. sapiens, P. stipitis, M. musculus mtDNA, H. sapiens CAPON region, D. melanogaster, H. sapiens ENCODE region.]

Page 19

Performance: Aligners

Page 20

Aligners: Feature Set

Feature | MOSAIK | ELAND | MAQ | Newbler | SHRiMP | SOAP
Sequencing Platforms | Illumina, 454, SOLiD, capillary | Illumina | Illumina, SOLiD | 454 | Illumina, SOLiD | Illumina
Alignment Algorithm | Smith-Waterman | Hash-based | Hash-based | FlowMapper | Smith-Waterman | Hash-based
Co-assembly Creation | ?
Gapped Alignments | ?
Paired-end Reads |
Platform Binaries | Windows, Mac, Linux, Sun, iPhone | remaining entries in slide order: Mac, Linux; Linux; Mac, Linux; Mac, Linux

Page 21

Performance: Aligner

Illumina 35 bp (X Chromosome)

program aligned reads/s

MOSAIK 180 - 16,658

ELAND 7,716

SOAP 1,637

MAQ 1,376

SHRIMP 39

[Figure: bar chart of aligned reads/s (0 to 16,000) for MOSAIK (fast), MOSAIK (single), MOSAIK (multi), MOSAIK (all), ELAND, MAQ, SOAP, and SHRIMP.]

Page 22

Performance: Aligner

Roche 454 FLX ~250 bp

program aligned reads/s

Roche 454 Newbler 1,176

MOSAIK 317 - 616

Using P. stipitis (15.4 Mbp) 454 FLX data set. 932,565 reads basecalled by PyroBayes†.

† Quinlan et al. Pyrobayes: an improved base caller for SNP discovery in pyrosequences. Nature Methods (2008)

Page 23

Accuracy: Synthetic Data Sets

1 per 1.3 kb 1 per 7.2 kb

H. sapiens X chromosome

1 million

Page 24
Page 25

Accuracy: Classification

[Figure: bar chart (0% to 100%) of unique reads and non-unique reads for MOSAIK (fast), MOSAIK (single), MOSAIK (multi), MOSAIK (all), ELAND, MAQ, and SOAP.]

Page 26

Accuracy: Unique Read Alignment

[Figure: bar chart (0% to 100%) of reads, INDELs, and SNPs for MOSAIK (fast), MOSAIK (single), MOSAIK (multi), MOSAIK (all), ELAND, MAQ, and SOAP.]

Page 27

Reasons to use MOSAIK?

• FAST
• Accurate
• Multiprocessor (OpenMP)
• Co-assemblies
• Gapped alignments
• Widely used

“One tool, many technologies, many applications”

Page 28

(Near) Future Development

• All technologies
– Pacific Biosciences
– Helicos
• All application areas
– Adapter trimming
– Coverage graphs
• Optimization
• Improved paired-end read support
• File format standardization (SAF & SRF)

Page 29
Page 30

1000 Genomes Project

• Many samples with light coverage (1000 dg)
– 100 samples from 10 populations at 2x coverage
– Find 90% of the 1% frequency variants per population
• Trios with moderate coverage (990 dg)
– 30 trios at 11x coverage
• If you’re looking for SNPs, are your tools and methods robust?
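To get a feel for what 2x versus 11x coverage means at a single site, here is a rough sketch. It assumes read depth follows a Poisson distribution with the stated mean; that model is an assumption introduced here, not something the slides state. The function name `prob_depth_at_least` is illustrative.

```python
from math import exp, factorial

def prob_depth_at_least(min_depth: int, mean_coverage: float) -> float:
    """P(depth >= min_depth) when depth ~ Poisson(mean_coverage)."""
    below = sum(
        exp(-mean_coverage) * mean_coverage ** d / factorial(d)
        for d in range(min_depth)
    )
    return 1.0 - below

# The two coverage levels mentioned on the slide.
for coverage in (2, 11):
    for depth in (1, 3):
        p = prob_depth_at_least(depth, coverage)
        print(f"{coverage}x coverage, depth >= {depth}: {p:.3f}")
```

Under this model, a single lightly covered sample often has too few reads at a site to call a variant on its own, which is one reason robustness matters.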

Page 31

Scaling Up: Disk Footprint

• Current situation: files created by MOSAIK are not optimized for speed or size
– Assembly can take a long time (slow disk speed)
• Hypothetical solution
– Optimize the file formats
– Ditch the built-in index
– Keep data sorted by aligned location
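One reason sorted data can stand in for a built-in index: if records are ordered by aligned location, a binary search finds any region without a separate lookup structure. The sketch below illustrates that idea only. The record layout, the names, and the in-memory list are assumptions; it says nothing about MOSAIK's actual file format.

```python
from bisect import bisect_left

# Hypothetical (reference_start, read_name) records, sorted by start position.
alignments = sorted([(120, "r3"), (15, "r1"), (480, "r4"), (97, "r2"), (121, "r5")])
starts = [start for start, _ in alignments]

def reads_starting_in(lo: int, hi: int) -> list:
    """Return records whose aligned start lies in the half-open range [lo, hi)."""
    return alignments[bisect_left(starts, lo):bisect_left(starts, hi)]

print(reads_starting_in(90, 130))
```

A file-based version would apply the same search to fixed-size blocks on disk, so an assembler could stream reads in reference order.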

Page 32

Scaling Up: Disk Footprint

Page 33

Scaling Up: Memory Footprint

• Current situation: the entire human genome is stored with all associated hash locations
– Optimized hash table ≈ 55 GB RAM
– File-based hash table (BerkeleyDB)
  • User selects how much RAM to use
  • Dreadfully slow performance
  • Large disk footprint ≈ 65 GB file
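To make "stored with all associated hash locations" concrete, here is a minimal sketch of a hash index that maps every fixed-length substring of a reference to the positions where it occurs. It is a plain Python dictionary for illustration, not MOSAIK's optimized table. The names `build_hash_index` and `position_bytes` and the 4-bytes-per-position figure are assumptions.

```python
from collections import defaultdict

def build_hash_index(reference: str, hash_size: int) -> dict:
    """Map each substring of length hash_size to all of its start positions."""
    index = defaultdict(list)
    for pos in range(len(reference) - hash_size + 1):
        index[reference[pos:pos + hash_size]].append(pos)
    return dict(index)

def position_bytes(genome_length: int, bytes_per_position: int = 4) -> int:
    """Lower bound on storage: one position per hash start, ignoring keys and overhead."""
    return genome_length * bytes_per_position

index = build_hash_index("ACGTACGT", 4)
print(index["ACGT"])

# Assumed human genome length of roughly 3.1 billion bp.
print(position_bytes(3_100_000_000) / 1e9, "GB for positions alone")
```

Because every position in the genome is recorded, the position lists alone scale with genome length. Keys and table overhead come on top of that lower bound.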

Page 34

Scaling Up: Memory Footprint

Page 35

Scaling Up: Memory Footprint

[Figure: JumpDB Memory Usage (Human Genome). x-axis: hash size (bp), 9 to 18; y-axis: memory used (GB RAM), 0 to 70; series: JumpDB, MOSAIK hash table.]

[Figure: Alignment Performance with 35bp human reads. Axis: reads/s, 0 to 20; bars: Berkeley (all positions in database), Berkeley (1 position in database), Jump (all positions in file-based database), Mosaik hash table.]

Page 36

Scaling Up: Speed & Sensitivity

• Current situation: as the hash size increases, speed increases but sensitivity decreases

• Hypothetical solution: use small hash sizes and require a clustering of a predefined length.

• Status: Implemented but not tested.
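One possible reading of "small hash sizes plus clustering" is sketched below. Short hashes are looked up individually, and a candidate location is accepted only if adjacent hits line up on the same diagonal and together span a minimum length. This is an interpretation for illustration, not MOSAIK's implementation. The function names, the toy reference, and the thresholds are all assumptions.

```python
from collections import defaultdict

def build_index(reference: str, k: int) -> dict:
    """Map each k-mer of the reference to its start positions."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index

def clustered_candidates(read: str, index: dict, k: int, min_cluster_len: int) -> list:
    """Return reference offsets supported by a run of adjacent k-mer hits
    on one diagonal that spans at least min_cluster_len bases."""
    by_diagonal = defaultdict(list)
    for read_pos in range(len(read) - k + 1):
        for ref_pos in index.get(read[read_pos:read_pos + k], ()):
            by_diagonal[ref_pos - read_pos].append(read_pos)
    candidates = []
    for diagonal, read_positions in by_diagonal.items():
        run_start = prev = read_positions[0]
        best = k  # a single hit spans k bases
        for p in read_positions[1:]:
            if p != prev + 1:  # gap between hits: start a new run
                run_start = p
            prev = p
            best = max(best, prev - run_start + k)
        if best >= min_cluster_len:
            candidates.append(diagonal)
    return sorted(candidates)

# Hypothetical toy reference; the read is an exact copy of positions 5 to 16.
reference = "TTGACCATGCAAGGCTTACGATCGGA"
index = build_index(reference, 4)
print(clustered_candidates(reference[5:17], index, 4, 10))
```

Lowering `min_cluster_len` lets reads with a mismatch through, at the cost of more candidate locations to check. This is the same speed and sensitivity trade-off described above.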

Page 37

BORK! BORK! BORK!

(translated: when will MOSAIK get published?)

Page 38

Acknowledgements

Boston College: Gabor Marth, Derek Barnett, Michele Busby, Weichun Huang, Aaron Quinlan, Chip Stewart

Thomas Seyfried, Mike Kiebish

Washington University School of Medicine: Elaine Mardis, Jarret Glasscock, Vincent Magrini

Agencourt: Douglas Smith, Wei Tao