Design Goals Crash Course: Reference-guided Assembly.

MICHAEL STRÖMBERG

Boston College Data Club

April 2008

Design Goals

Crash Course: Reference-guided Assembly



Sequencing Technologies

[Figure; surviving label: future]

Next-Gen Sequence Lengths

[Figure, panel 1: sequence length (bp, 0 to 1600) by sequencing technology: Capillary (Sanger), Roche 454 FLX. Series: max, mean, min]

[Figure, panel 2: sequence length (bp, 0 to 80) by sequencing technology: Illumina, AB SOLiD, Helicos. Series: max, mean, min]

Unique Genome Coverage (H. sapiens)

[Figure: unique genome coverage (0% to 100%) versus sequence length (3 to 33 bp)]
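The idea behind this plot can be sketched on a toy sequence. This is a minimal illustration, assuming that "unique coverage" means the fraction of start positions whose sequence of a given length occurs exactly once in the reference. The function name `unique_fraction` and the toy string are hypothetical, and the reverse strand is ignored for brevity.

```python
from collections import Counter


def unique_fraction(genome: str, k: int) -> float:
    """Fraction of k-mer start positions whose k-mer occurs exactly once.

    Assumption: forward strand only; a real analysis would also
    consider the reverse complement.
    """
    n = len(genome) - k + 1
    if n <= 0:
        return 0.0
    counts = Counter(genome[i:i + k] for i in range(n))
    unique = sum(1 for i in range(n) if counts[genome[i:i + k]] == 1)
    return unique / n


if __name__ == "__main__":
    toy = "ACGTACGA"  # toy reference, not real data
    for k in (2, 4):
        print(k, unique_fraction(toy, k))
```

On the toy string, longer sequences are unique at more positions than shorter ones, which is the trend the figure plots for the human genome.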

Mixing It Up: Paired-end Reads

[Figure: read pairs (count, 0 to 1800) versus fragment length (0 to 350 bp)]

How Does It Work?


C. elegans: a case for INDELs

SPEED

100 million Illumina reads

Alignment time: 93 min (17,800 reads/s)

Assembly time: 100 min

INDELS

INDEL validation rate: 89.3% (216)

SNP validation rate: 97.8% (229)
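The quoted alignment time and throughput can be cross-checked with simple arithmetic. The helper name `alignment_minutes` is hypothetical; the read count and rate are taken from the slide.

```python
def alignment_minutes(num_reads: int, reads_per_second: float) -> float:
    """Wall-clock minutes needed to align num_reads at a constant rate."""
    return num_reads / reads_per_second / 60.0


if __name__ == "__main__":
    # 100 million reads at 17,800 reads/s
    print(round(alignment_minutes(100_000_000, 17_800), 1))  # 93.6
```

The result, about 93.6 minutes, agrees with the 93 min reported for the C. elegans data set.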

P. stipitis: Co-assembly

Capillary

454 FLX

454 GS20

Illumina

Scaling Up

[Figure: reference sequence length (bp, log scale from 10,000 to 10,000,000,000) versus project date (Dec-05 to Jun-08). Projects shown: C. elegans, M. musculus, H. sapiens, P. stipitis, M. musculus mtDNA, H. sapiens CAPON region, D. melanogaster, H. sapiens ENCODE region]

Performance: Aligners

Aligners: Feature Set

Columns: [unlabeled] | ELAND | MAQ | Newbler | SHRiMP | SOAP

Sequencing Platforms: Illumina, 454, SOLiD, capillary | Illumina | Illumina, SOLiD | 454 | Illumina, SOLiD | Illumina

Alignment Algorithm: Smith-Waterman | Hash-based | Hash-based | FlowMapper | Smith-Waterman | Hash-based

Co-assembly Creation: ?

Gapped Alignments: ?

Paired-end Reads:

Platform Binaries: Windows, Mac, Linux, Sun, iPhone | Mac, Linux | Linux | Mac, Linux | Mac, Linux

Performance: Aligner

Illumina 35 bp (X Chromosome)

program    aligned reads/s
MOSAIK     180 - 16,658
ELAND      7,716
SOAP       1,637
MAQ        1,376
SHRiMP     39

[Figure: aligned reads/s (0 to 16,000) for MOSAIK (fast), MOSAIK (single), MOSAIK (multi), MOSAIK (all), ELAND, MAQ, SOAP, SHRiMP]

Performance: Aligner

Roche 454 FLX ~250 bp

program              aligned reads/s
Roche 454 Newbler    1,176
MOSAIK               317 - 616

Using P. stipitis (15.4 Mbp) 454 FLX data set. 932,565 reads basecalled by PyroBayes†.

† Quinlan et al. Pyrobayes: an improved base caller for SNP discovery in pyrosequences. Nature Methods (2008)

Accuracy: Synthetic Data Sets

1 per 1.3 kb 1 per 7.2 kb

H. sapiens X chromosome

1 million

Accuracy: Classification

[Figure: percentage of reads (0% to 100%) classified as unique or non-unique, for MOSAIK (fast), MOSAIK (single), MOSAIK (multi), MOSAIK (all), ELAND, MAQ, SOAP]

Accuracy: Unique Read Alignment

[Figure: percentage (0% to 100%) for reads, INDELs, and SNPs, for MOSAIK (fast), MOSAIK (single), MOSAIK (multi), MOSAIK (all), ELAND, MAQ, SOAP]

Reasons to use ?

• FAST
• Accurate
• Multiprocessor (OpenMP)
• Co-assemblies
• Gapped alignments
• Widely used

“One tool, many technologies, many applications”

(Near) Future Development

• All technologies
  – Pacific Biosciences
  – Helicos
• All application areas
  – Adapter trimming
  – Coverage graphs
• Optimization
• Improved paired-end read support
• File format standardization (SAF & SRF)

1000 Genomes Project

• Many samples with light coverage (1000 dg)
  – 100 samples from 10 populations at 2x coverage
  – Find 90% of the 1% frequency variants per population
• Trios with moderate coverage (990 dg)
  – 30 trios at 11x coverage
• If you’re looking for SNPs, are your tools and methods robust?

Scaling Up: Disk Footprint

• Current situation: files created by MOSAIK are not optimized for speed or size
  – Assembly can take a long time (slow disk speed)
• Hypothetical solution
  – Optimize the file formats
  – Ditch the built-in index
  – Keep data sorted by aligned location
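One reason sorted data can replace a separate index is that a region query reduces to a binary search. The sketch below is illustrative only and is not MOSAIK's file format: the `Alignment` record, its fields, and `reads_in_region` are hypothetical names, and the records live in an in-memory list rather than on disk.

```python
import bisect
from typing import List, NamedTuple


class Alignment(NamedTuple):
    position: int   # 0-based start on the reference (hypothetical field)
    read_name: str  # hypothetical field


def reads_in_region(sorted_alignments: List[Alignment],
                    start: int, end: int) -> List[Alignment]:
    """Return alignments with start <= position < end.

    Assumes sorted_alignments is sorted by position. The one-element
    tuple (start,) sorts before any record with the same position, so
    bisect_left finds the first record at or after that coordinate.
    """
    lo = bisect.bisect_left(sorted_alignments, (start,))
    hi = bisect.bisect_left(sorted_alignments, (end,))
    return sorted_alignments[lo:hi]


if __name__ == "__main__":
    alignments = sorted([
        Alignment(450, "r5"), Alignment(10, "r1"), Alignment(150, "r3"),
        Alignment(120, "r2"), Alignment(200, "r4"),
    ])
    print([a.read_name for a in reads_in_region(alignments, 100, 200)])
```

With the records kept in reference order, an assembler can also stream through them once, left to right, instead of seeking around the file.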


Scaling Up: Memory Footprint

• Current situation: the entire human genome is stored with all associated hash locations
  – Optimized hash table ≈ 55 GB RAM
  – File-based hash table (BerkeleyDB)
    • User selects how much RAM to use
    • Dreadfully slow performance
    • Large disk footprint ≈ 65 GB file
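To see why the memory footprint grows with the genome, consider a minimal in-memory hash index that records every position at which each fixed-length word occurs. This is a sketch of the general technique, not MOSAIK's optimized table; `build_hash_index` and the toy reference are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def build_hash_index(reference: str, hash_size: int) -> Dict[str, List[int]]:
    """Map each word of length hash_size to all of its start positions."""
    index: Dict[str, List[int]] = defaultdict(list)
    for pos in range(len(reference) - hash_size + 1):
        index[reference[pos:pos + hash_size]].append(pos)
    return index


if __name__ == "__main__":
    index = build_hash_index("ACGTACGT", 4)  # toy reference
    print(index["ACGT"])
    # One stored position per reference start position:
    print(sum(len(positions) for positions in index.values()))
```

Because one position is stored for nearly every base of the reference, the number of stored locations scales with genome length, which is what makes the human genome so demanding.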



JumpDB Memory Usage (Human Genome)

[Figure: memory used (GB RAM, 0 to 70) versus hash size (9 to 18 bp). Series: JumpDB, MOSAIK hash table]

Alignment Performance with 35bp human reads

[Figure: reads/s (0 to 20) for Berkeley (all positions in database), Berkeley (1 position in database), Jump (all positions in file-based database), Mosaik hash table]

Scaling Up: Speed & Sensitivity

• Current situation: as the hash size increases, speed increases but sensitivity decreases

• Hypothetical solution: use small hash sizes and require a cluster of hash hits of a predefined length.

• Status: Implemented but not tested.
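The clustering idea can be sketched as follows. This is one possible reading of the proposal, not MOSAIK's implementation: small words from the read are looked up in the reference, hits on the same diagonal (reference position minus read offset) are chained, and only chains spanning at least a chosen length are kept. All function and parameter names are hypothetical.

```python
from collections import defaultdict
from typing import List, Tuple


def seed_hits(reference: str, read: str, k: int) -> List[Tuple[int, int]]:
    """Return (read_offset, reference_position) pairs for exact k-mer hits."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    hits = []
    for off in range(len(read) - k + 1):
        for pos in index.get(read[off:off + k], ()):
            hits.append((off, pos))
    return hits


def clustered_candidates(reference: str, read: str, k: int,
                         min_cluster_length: int) -> List[Tuple[int, int]]:
    """Return (reference_start, span) for hit clusters of sufficient length."""
    by_diagonal = defaultdict(list)
    for off, pos in seed_hits(reference, read, k):
        by_diagonal[pos - off].append(off)
    candidates = []
    for diagonal, offsets in by_diagonal.items():
        offsets.sort()
        run_start = prev = offsets[0]
        # The trailing None flushes the final run.
        for off in offsets[1:] + [None]:
            if off is not None and off <= prev + k:  # overlapping or adjacent
                prev = off
                continue
            span = prev + k - run_start
            if span >= min_cluster_length:
                candidates.append((diagonal + run_start, span))
            if off is not None:
                run_start = prev = off
    return sorted(candidates)


if __name__ == "__main__":
    reference = "GGGGGACGTTAGCCATGGGGGGACGTGGGGG"  # toy reference
    read = "ACGTTAGCCAT"
    print(clustered_candidates(reference, read, k=4, min_cluster_length=8))
```

In the toy example the isolated 4 bp hit at position 22 is discarded, while the 11 bp cluster at position 5 survives. That is the intended trade: small words keep sensitivity, and the length requirement limits the spurious candidate locations that small words would otherwise produce.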

BORK! BORK! BORK!

(translated: when will MOSAIK get published?)

Acknowledgements

Boston College
Gabor Marth
Derek Barnett
Michele Busby
Weichun Huang
Aaron Quinlan
Chip Stewart

Thomas Seyfried
Mike Kiebish

Washington University School of Medicine
Elaine Mardis
Jarret Glasscock
Vincent Magrini

Agencourt
Douglas Smith
Wei Tao
