Design Goals Crash Course: Reference-guided Assembly.

MICHAEL STRÖMBERG
Boston College Data Club

April 2008

Design Goals

Crash Course: Reference-guided Assembly

Sequencing Technologies

Next-Gen Sequence Lengths

[Figure: minimum, mean, and maximum sequence length by sequencing technology. Panel 1: Capillary (Sanger), Roche 454 FLX. Panel 2: Illumina, AB SOLiD, Helicos.]

Unique Genome Coverage (H. sapiens)

[Figure: unique genome coverage (%) versus sequence length (3 to 33).]

Mixing It Up: Paired-end Reads

[Figure: read pair count versus fragment length (bp), 0 to 350 bp.]

How Does It Work?

C. elegans: a case for INDELs

SPEED
100 million Illumina reads
Alignment time: 93 min (17,800 reads/s)

Assembly time: 100 min
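The quoted throughput can be sanity-checked with a little arithmetic. This is a minimal sketch that assumes the run contained exactly 100 million reads (the slide's count is presumably rounded); the function name is illustrative only.

```python
def reads_per_second(num_reads: int, minutes: float) -> float:
    """Average throughput for a run that processed num_reads in the given minutes."""
    return num_reads / (minutes * 60.0)


# Assumption: exactly 100 million reads, as a round figure.
alignment_rate = reads_per_second(100_000_000, 93)
assembly_rate = reads_per_second(100_000_000, 100)

print(f"alignment: {alignment_rate:,.0f} reads/s")
print(f"assembly:  {assembly_rate:,.0f} reads/s")
```

Under that assumption the alignment rate comes out at roughly 17,900 reads/s, in line with the 17,800 reads/s quoted above.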

INDELS

INDEL validation rate: 89.3% (216)
SNP validation rate: 97.8% (229)

P. stipitis: Co-assembly

[Figure: co-assembly of Capillary, 454 FLX, 454 GS20, and Illumina reads.]

Scaling Up

[Figure: reference sequence length (log scale, 10,000 to 10,000,000,000) versus project date (Dec-05 to Jun-08) for C. elegans, M. musculus, H. sapiens, P. stipitis, M. musculus mtDNA, H. sapiens CAPON region, D. melanogaster, and H. sapiens ENCODE region.]

Performance: Aligners

Aligners: Feature Set

ELAND, MAQ, Newbler, SHRiMP, SOAP

Sequencing Platforms: Illumina, 454, SOLiD, capillary; Illumina; Illumina, SOLiD; 454; Illumina, SOLiD; Illumina

Alignment Algorithm: Smith-Waterman, Hash-based; FlowMapper; Smith-Waterman, Hash-based

Co-assembly Creation

Gapped Alignments ?

Paired-end Reads

Platform Binaries: Windows, Mac, Linux, Sun, iPhone; Mac, Linux; Linux; Mac, Linux; Mac, Linux

Performance: Aligners
Illumina 35 bp (X Chromosome)

program    aligned reads/s
MOSAIK     180 - 16,658
ELAND      7,716
SOAP       1,637
MAQ        1,376
SHRiMP     39

[Figure: aligned reads/s for MOSAIK (fast), MOSAIK (single), MOSAIK (multi), MOSAIK (all), ELAND, MAQ, SOAP, and SHRiMP.]

Performance: Aligners
Roche 454 FLX ~250 bp

program              aligned reads/s
Roche 454 Newbler    1,176
MOSAIK               317 - 616

Using P. stipitis (15.4 Mbp) 454 FLX data set. 932,565 reads basecalled by PyroBayes†.

† Quinlan et al. Pyrobayes: an improved base caller for SNP discovery in pyrosequences. Nature Methods (2008)

Accuracy: Synthetic Data Sets

1 per 1.3 kb 1 per 7.2 kb

H. sapiens X chromosome

1 million

Accuracy: Classification

[Figure: MOSAIK classification of unique reads and non-unique reads.]

Accuracy: Unique Read Alignment

[Figure: accuracy (%) for reads, INDELs, and SNPs with MOSAIK (fast), MOSAIK (single), MOSAIK (multi), MOSAIK (all), ELAND, MAQ, and SOAP.]

Reasons to use MOSAIK?

• FAST
• Accurate
• Multiprocessor (OpenMP)
• Co-assemblies
• Gapped alignments
• Widely used

“One tool, many technologies, many applications”

(Near) Future Development

• All technologies
– Pacific Biosciences
– Helicos
• All application areas
– Adapter trimming
– Coverage graphs
• Optimization
• Improved paired-end read support
• File format standardization (SAF & SRF)

1000 Genomes Project

• Many samples with light coverage (1000 dg)
– 100 samples from 10 populations at 2x coverage
– Find 90% of the 1% frequency variants per population
• Trios with moderate coverage (990 dg)
– 30 trios at 11x coverage

• If you’re looking for SNPs, are your tools and methods robust?

Scaling Up: Disk Footprint

• Current situation: files created by MOSAIK are not optimized for speed or size
– Assembly can take a long time (slow disk speed)
• Hypothetical solution
– Optimize the file formats
– Ditch the built-in index
– Keep data sorted by aligned location
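Why sorting by aligned location can stand in for a built-in index: once records are ordered by (reference, position), a region lookup is just a binary search. The following is a rough sketch of that idea, not MOSAIK's actual file format; the `Alignment` record and its fields are hypothetical.

```python
import bisect
from typing import List, NamedTuple


class Alignment(NamedTuple):
    ref_index: int   # hypothetical: which reference sequence the read aligned to
    position: int    # hypothetical: 0-based leftmost aligned position
    read_name: str


def sort_alignments(alignments: List[Alignment]) -> List[Alignment]:
    """Order records by (ref_index, position, read_name); tuples compare field by field."""
    return sorted(alignments)


def reads_in_region(sorted_alignments: List[Alignment],
                    ref_index: int, start: int, end: int) -> List[Alignment]:
    """Return alignments whose position lies in [start, end) on ref_index."""
    lo = bisect.bisect_left(sorted_alignments, (ref_index, start))
    hi = bisect.bisect_left(sorted_alignments, (ref_index, end))
    return sorted_alignments[lo:hi]


alignments = sort_alignments([
    Alignment(0, 500, "c"),
    Alignment(0, 100, "a"),
    Alignment(1, 50, "d"),
    Alignment(0, 250, "b"),
])
region = reads_in_region(alignments, 0, 100, 500)
print([a.read_name for a in region])
```

An assembler reading such a file can also stream it front to back and see reads in reference order, which avoids random disk access.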


Scaling Up: Memory Footprint

• Current situation: the entire human genome is stored with all associated hash locations
– Optimized hash table ≈ 55 GB RAM
– File-based hash table (BerkeleyDB)
  • User selects how much RAM to use
  • Dreadfully slow performance
  • Large disk footprint ≈ 65 GB file
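To see why the memory footprint grows with the genome, here is a toy version of a hash table that stores every location of every hash in a reference. It is a simplified sketch using a plain Python dictionary, not MOSAIK's optimized table; `build_hash_index` is a hypothetical name.

```python
from collections import defaultdict
from typing import Dict, List


def build_hash_index(reference: str, hash_size: int) -> Dict[str, List[int]]:
    """Map each substring of length hash_size to every position where it occurs."""
    index: Dict[str, List[int]] = defaultdict(list)
    for pos in range(len(reference) - hash_size + 1):
        index[reference[pos:pos + hash_size]].append(pos)
    return dict(index)


index = build_hash_index("ACGTACGTAC", hash_size=4)
stored_positions = sum(len(positions) for positions in index.values())
print(index["ACGT"], stored_positions)
```

Every reference position contributes one stored location, so the number of entries is essentially the genome length regardless of hash size; only the number of distinct keys changes.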

Scaling Up: Memory Footprint

[Figure: JumpDB Memory Usage (Human Genome). Memory used versus hash size (9 to 18 bp) for JumpDB and the MOSAIK hash table.]

[Figure: Alignment Performance with 35bp human reads. Reads/s (0 to 20) for Berkeley (all positions in database), Berkeley (1 position in database), Jump (all positions in file-based database), and the Mosaik hash table.]

Scaling Up: Speed & Sensitivity

• Current situation: as the hash size increases, speed increases but sensitivity decreases

• Hypothetical solution: use small hash sizes and require a clustering of a predefined length.

• Status: Implemented but not tested.
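One way to read the hypothetical solution above: look up small hashes, group the hits by alignment diagonal, and only keep a candidate location if the hits there form a cluster spanning some minimum length. The sketch below is an interpretation under that assumption, not MOSAIK's implementation; all function names and the `min_cluster_length` parameter are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def build_index(reference: str, hash_size: int) -> Dict[str, List[int]]:
    """Map each hash (substring of length hash_size) to its reference positions."""
    index: Dict[str, List[int]] = defaultdict(list)
    for pos in range(len(reference) - hash_size + 1):
        index[reference[pos:pos + hash_size]].append(pos)
    return index


def longest_cluster(read_positions: List[int], hash_size: int) -> int:
    """Longest stretch of read bases covered by overlapping or touching hash hits."""
    run_start = prev = read_positions[0]
    best = hash_size
    for pos in read_positions[1:]:
        if pos - prev > hash_size:  # gap: hits no longer overlap or touch
            run_start = pos
        prev = pos
        best = max(best, prev - run_start + hash_size)
    return best


def candidate_regions(read: str, index: Dict[str, List[int]], hash_size: int,
                      min_cluster_length: int) -> List[Tuple[int, int]]:
    """Return (reference offset, cluster length) for diagonals with a long enough cluster."""
    diagonals: Dict[int, List[int]] = defaultdict(list)
    for read_pos in range(len(read) - hash_size + 1):
        for ref_pos in index.get(read[read_pos:read_pos + hash_size], ()):
            # Hits on the same diagonal share the same ref_pos - read_pos.
            diagonals[ref_pos - read_pos].append(read_pos)
    regions = []
    for diagonal, read_positions in diagonals.items():
        length = longest_cluster(read_positions, hash_size)
        if length >= min_cluster_length:
            regions.append((diagonal, length))
    return regions


reference = "GATTACAGATTTCCGGAAGCTTAGC"
read = reference[5:17]
index = build_index(reference, hash_size=4)
regions = candidate_regions(read, index, hash_size=4, min_cluster_length=10)
print(regions)
```

In this toy example the hash GATT occurs twice in the reference, so a lone hit also lands on a spurious diagonal; the cluster-length requirement discards it while the small hash size keeps the lookup sensitive.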

BORK! BORK! BORK!

(translated: when will MOSAIK get published?)

Acknowledgements

Boston College
Gabor Marth
Derek Barnett
Michele Busby
Weichun Huang
Aaron Quinlan
Chip Stewart
Thomas Seyfried
Mike Kiebish

Washington University School of Medicine
Elaine Mardis
Jarret Glasscock
Vincent Magrini

Agencourt
Douglas Smith
Wei Tao
