James Lindsay*, Hamed Salooti, Alex Zelikovski, Ion Mandoiu*. ACM-BCB 2012. Scaffolding Large Genomes Using Integer Linear Programming. University of Connecticut*, Georgia State University.

Scaffolding Large Genomes Using Integer Linear Programming

Feb 23, 2016

Page 1: Scaffolding Large Genomes Using Integer Linear Programming

James Lindsay*, Hamed Salooti, Alex Zelikovski, Ion Mandoiu*

ACM-BCB 2012

Scaffolding Large Genomes Using Integer Linear Programming

University of Connecticut*, Georgia State University

Page 2: Scaffolding Large Genomes Using Integer Linear Programming

De-novo Assembly Paradigm

the genome → (shotgun sequencing) → short reads → (de novo assembly) → short contigs → (scaffolding) → the scaffolds

Page 3: Scaffolding Large Genomes Using Integer Linear Programming

Why Scaffolding?

• Annotation
• Comparative biology
• Re-sequencing and gap filling
• Structural variation!

Figure: gene XYZ with its 5’ UTR and 3’ UTR, shown with a scaffold and with no scaffold.

Page 4: Scaffolding Large Genomes Using Integer Linear Programming

Why Scaffolding?

• Annotation
• Comparative biology
• Re-sequencing and gap filling
• Structural variation!

Figure labels: gene XYZ, 5’ UTR, 3’ UTR; Sanger Sequencing.

Biologist: There are holes in my genes!

Page 5: Scaffolding Large Genomes Using Integer Linear Programming

Why Scaffolding?

• Annotation
• Comparative biology
• Re-sequencing and gap filling
• Structural variation!

Page 6: Scaffolding Large Genomes Using Integer Linear Programming

Massive Sequencing Projects

• I5k: 5,000 insect and arthropod species
• G10k: 10,000 vertebrate species

Effects of Read Length

• Dog Genome: 7.5x Sanger, N50: 180Kb
• Chicken Genome: 6x Illumina, N50: 12Kb
• Human Genome: 100x Illumina, N50: 24Kb

Fragmented Genomes

Page 7: Scaffolding Large Genomes Using Integer Linear Programming

The Scaffolding Problem

• Given: contigs, paired reads
• Find: orientation, ordering, relative distance
• Goal: recreate true scaffolds

Page 8: Scaffolding Large Genomes Using Integer Linear Programming

Paired Read Construction

Figure labels: paired reads; 2kb; 2kb; 100b, 100b, 10kb.

Paired Read Styles

• Mate Pair: R1 and R2 on the same strand and orientation
• Paired End: R1 and R2 on different strand and orientation

Page 9: Scaffolding Large Genomes Using Integer Linear Programming

Linkage Information

Two contigs are adjacent if:
• A read pair spans the contigs

State (A, B, C, D):
• Depends on orientation of the read
• Order of contigs is arbitrary

Each read pair can be “consistent” with one of the four states.

Figure: Possible States (mate pair). Reads R1 and R2 placed on contig i and contig j (5’ to 3’) in each of the four states A, B, C, D.

Page 10: Scaffolding Large Genomes Using Integer Linear Programming

Scaffolding Graph

Nodes
• Nodes are contigs

Edges
• Adjacent contigs have 4 edges (one for each state)
• Weighted by overlap with repetitive region

Figure: contig i and contig j joined in State A.

$W_{ij}^{A} = \sum_{\text{read pairs}} \left( 1 - \frac{\#\,\text{bp in repeat region}}{\#\,\text{bp in read}} \right)$
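The edge weight on this slide sums, over the read pairs that support a state, one minus the fraction of read bases that fall in a repeat region. A minimal sketch of that sum follows; the `(repeat_bp, read_bp)` tuple encoding and the function name `edge_weight` are assumptions made for illustration, not the authors' data format.

```python
from typing import Iterable, Tuple


def edge_weight(read_pairs: Iterable[Tuple[int, int]]) -> float:
    """Sum of (1 - repeat_bp / read_bp) over the read pairs backing one state.

    Each item is (repeat_bp, read_bp): the bases of the pair that fall in a
    repeat region and the total bases in the pair (hypothetical encoding).
    """
    total = 0.0
    for repeat_bp, read_bp in read_pairs:
        if read_bp <= 0:
            raise ValueError("read_bp must be positive")
        total += 1.0 - repeat_bp / read_bp
    return total


# Three pairs supporting state A: repeat-free, one quarter repeat, all repeat.
pairs_state_a = [(0, 200), (50, 200), (200, 200)]
w_a = edge_weight(pairs_state_a)
```

A pair that lies entirely inside a repeat contributes nothing, so heavily repetitive links carry little weight in the objective.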

Page 11: Scaffolding Large Genomes Using Integer Linear Programming

Integer Linear Program Formulation

Variables
• Contig orientation: $S_i \in \{0,1\}$
• Adjacent contig consistency: $S_{ij} \in \{0,1\}$
• Contig pair state: $A_{ij}, B_{ij}, C_{ij}, D_{ij}$

Objective: maximize weight of consistent pairs

$z = \max \sum_{(i,j) \in E} \left( W_{ij}^{A} A_{ij} + W_{ij}^{B} B_{ij} + W_{ij}^{C} C_{ij} + W_{ij}^{D} D_{ij} \right)$

Page 12: Scaffolding Large Genomes Using Integer Linear Programming

Constraints

Variables
• Contig orientation: $S_i \in \{0,1\}$
• Adjacent contig consistency: $S_{ij} \in \{0,1\}$
• Contig pair state: $A_{ij}, B_{ij}, C_{ij}, D_{ij}$

Pairwise Orientation

$S_{ij} \le S_j + S_i$
$S_{ij} \le 2 - S_i - S_j$
$S_{ij} \ge S_j - S_i$
$S_{ij} \ge S_i - S_j$
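The four pairwise orientation inequalities (S_ij ≤ S_i + S_j, S_ij ≤ 2 − S_i − S_j, S_ij ≥ S_j − S_i, S_ij ≥ S_i − S_j) linearize an exclusive-or: S_ij ends up 1 exactly when the two contigs have different orientations. The short check below enumerates all binary cases to confirm this; the function name `feasible_sij` is introduced here only for illustration.

```python
from itertools import product


def feasible_sij(s_i: int, s_j: int) -> list:
    """Return every S_ij in {0, 1} allowed by the four pairwise inequalities."""
    return [
        s_ij
        for s_ij in (0, 1)
        if s_ij <= s_i + s_j
        and s_ij <= 2 - s_i - s_j
        and s_ij >= s_j - s_i
        and s_ij >= s_i - s_j
    ]


# One entry per orientation pair; each list holds the S_ij values left feasible.
table = {(s_i, s_j): feasible_sij(s_i, s_j) for s_i, s_j in product((0, 1), repeat=2)}
```

Each orientation pair leaves a single feasible value, so S_ij is fully determined by S_i and S_j.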

Page 13: Scaffolding Large Genomes Using Integer Linear Programming

Constraints

Variables
• Contig orientation: $S_i \in \{0,1\}$
• Adjacent contig consistency: $S_{ij} \in \{0,1\}$
• Contig pair state: $A_{ij}, B_{ij}, C_{ij}, D_{ij}$

State Variables

$2 A_{ij} \le (1 - S_i) + (1 - S_j)$
$2 B_{ij} \le (1 - S_i) + S_j$
$2 C_{ij} \le S_i + (1 - S_j)$
$2 D_{ij} \le S_i + S_j$
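Each state constraint has the form 2·X_ij ≤ (sum of two binary terms), so a state variable can be 1 only when both terms equal 1. The sketch below lists, for every orientation pair, which of A, B, C, D the constraints permit; `allowed_states` is a hypothetical helper name.

```python
def allowed_states(s_i: int, s_j: int) -> list:
    """States whose variable may be set to 1 for the given contig orientations."""
    right_hand_sides = {
        "A": (1 - s_i) + (1 - s_j),
        "B": (1 - s_i) + s_j,
        "C": s_i + (1 - s_j),
        "D": s_i + s_j,
    }
    # Setting the variable to 1 requires 2 * 1 <= right-hand side.
    return [state for state, rhs in right_hand_sides.items() if 2 <= rhs]


permitted = {(a, b): allowed_states(a, b) for a in (0, 1) for b in (0, 1)}
```

Exactly one state survives per orientation pair: A for (0, 0), B for (0, 1), C for (1, 0), and D for (1, 1).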

Page 14: Scaffolding Large Genomes Using Integer Linear Programming

Constraints

Variables
• Contig orientation: $S_i \in \{0,1\}$
• Adjacent contig consistency: $S_{ij} \in \{0,1\}$
• Contig pair state: $A_{ij}, B_{ij}, C_{ij}, D_{ij}$

Mutual Exclusivity

$A_{ij} + D_{ij} \le 1 - S_{ij}$
$B_{ij} + C_{ij} \le S_{ij}$
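Taken together, the orientation, state, and mutual exclusivity constraints mean that once every contig orientation S_i is fixed, each contig pair can score only the one state that matches its two orientations. For a toy instance this can be illustrated by brute force instead of an ILP solver. The sketch below is only that: it enumerates orientations, assumes non-negative weights (so taking the permitted state is never worse than taking none), and ignores the cycle constraints. The contig names and weights are made up.

```python
from itertools import product

# State permitted by the constraints for each (S_i, S_j) orientation pair.
STATE_OF = {(0, 0): "A", (0, 1): "B", (1, 0): "C", (1, 1): "D"}


def solve_toy(nodes, weights):
    """Enumerate all orientations; return (best objective, best orientation).

    weights maps an edge (i, j) to {"A": w, "B": w, "C": w, "D": w}.
    """
    best_z, best_s = float("-inf"), None
    for bits in product((0, 1), repeat=len(nodes)):
        s = dict(zip(nodes, bits))
        z = sum(w[STATE_OF[(s[i], s[j])]] for (i, j), w in weights.items())
        if z > best_z:
            best_z, best_s = z, s
    return best_z, best_s


# Hypothetical three-contig instance.
weights = {
    ("c1", "c2"): {"A": 5.0, "B": 1.0, "C": 0.0, "D": 0.0},
    ("c2", "c3"): {"A": 0.0, "B": 4.0, "C": 0.0, "D": 1.0},
    ("c1", "c3"): {"A": 0.0, "B": 3.0, "C": 0.0, "D": 0.0},
}
best_z, best_s = solve_toy(["c1", "c2", "c3"], weights)
```

Enumeration costs 2^n evaluations, which is why a real instance needs an ILP solver together with the decomposition described on the later slides.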

Page 15: Scaffolding Large Genomes Using Integer Linear Programming

Constraints

Forbid 2 Cycles

$B_{ij} + C_{ij} \le S_{ij}$
$A_{ij} + D_{ij} \le 1 - S_{ij}$

Forbid 3 Cycles

[constraint equations not recoverable from the transcript]

*larger cycles are broken at the end

Page 16: Scaffolding Large Genomes Using Integer Linear Programming

Largest Connected Component

Page 17: Scaffolding Large Genomes Using Integer Linear Programming

Graph Decomposition: Articulation Points

Figure labels: articulation point; solve.

MIP, Salmela 2011
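Articulation points are the vertices whose removal disconnects the graph, so they mark where a scaffolding graph can be split into smaller pieces. A minimal sketch of the standard depth-first search that finds them is given below. It is a generic textbook routine, not code from SILP or MIP, and it assumes a simple undirected graph stored as an adjacency dictionary.

```python
def articulation_points(graph):
    """Return the set of articulation points of an undirected simple graph.

    graph maps each node to an iterable of its neighbours.
    """
    disc, low, points = {}, {}, set()
    counter = [0]

    def dfs(u, parent):
        disc[u] = low[u] = counter[0]
        counter[0] += 1
        children = 0
        for v in graph[u]:
            if v == parent:
                continue
            if v in disc:
                # Back edge: u can reach an earlier vertex.
                low[u] = min(low[u], disc[v])
            else:
                children += 1
                dfs(v, u)
                low[u] = min(low[u], low[v])
                # The subtree under v cannot climb above u, so u separates it.
                if parent is not None and low[v] >= disc[u]:
                    points.add(u)
        # A DFS root is an articulation point only with two or more subtrees.
        if parent is None and children > 1:
            points.add(u)

    for node in graph:
        if node not in disc:
            dfs(node, None)
    return points


# Two triangles that share contig "c".
graph = {
    "a": ["b", "c"],
    "b": ["a", "c"],
    "c": ["a", "b", "d", "e"],
    "d": ["c", "e"],
    "e": ["c", "d"],
}
cut_vertices = articulation_points(graph)
```

The recursion depth grows with the graph, so a graph with many thousands of contigs would need an iterative version of the same search.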

Page 18: Scaffolding Large Genomes Using Integer Linear Programming

Largest Biconnected Component

Page 19: Scaffolding Large Genomes Using Integer Linear Programming

Non-Serial Dynamic Programming

A technique that exploits the sparsity of the scaffolding graph by computing the solution in stages, incorporating the results from previous stages.

Roughly inspired by (Neumaier, 06).

Page 20: Scaffolding Large Genomes Using Integer Linear Programming

Non-Serial Dynamic Programming

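Figure: a 2-cut, with the four sign combinations for the two cut contigs and their values:

• (+, +): $z_A$
• (+, −): $z_B$
• (−, +): $z_C$
• (−, −): $z_D$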

Page 21: Scaffolding Large Genomes Using Integer Linear Programming

Non-Serial Dynamic Programming

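Figure: the 2-cut with the four sign combinations (+, +), (+, −), (−, +), (−, −) and their values $z_A$, $z_B$, $z_C$, $z_D$.

Objective Modification: $z_A$, $z_B$, $z_C$, $z_D$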
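One reading of these NSDP slides is as follows: the part of the graph behind a 2-cut is solved four times, once per orientation combination of the two cut contigs, and the four optima z_A, z_B, z_C, z_D are then added to the objective as weights on the cut pair, so the interior contigs can be dropped. The sketch below checks that idea on a toy graph, using brute force in place of an ILP solver. The function names, contig names, and weights are all assumptions, and the objective is the simplified one-state-per-orientation-pair form implied by the state constraints.

```python
from itertools import product

STATE_OF = {(0, 0): "A", (0, 1): "B", (1, 0): "C", (1, 1): "D"}


def objective(s, weights):
    """Total weight of the state selected on every edge by orientations s."""
    return sum(w[STATE_OF[(s[i], s[j])]] for (i, j), w in weights.items())


def brute_force(nodes, weights, fixed=None):
    """Best objective over all orientations of nodes, holding `fixed` constant."""
    fixed = fixed or {}
    free = [n for n in nodes if n not in fixed]
    best = float("-inf")
    for bits in product((0, 1), repeat=len(free)):
        s = dict(fixed)
        s.update(zip(free, bits))
        best = max(best, objective(s, weights))
    return best


def eliminate_through_2cut(weights, cut, interior):
    """Solve the part behind the 2-cut four times and fold it into the cut pair."""
    u, v = cut
    inside = {e: w for e, w in weights.items() if e[0] in interior or e[1] in interior}
    reduced = {e: dict(w) for e, w in weights.items() if e not in inside}
    edge = reduced.setdefault((u, v), {"A": 0, "B": 0, "C": 0, "D": 0})
    for (s_u, s_v), state in STATE_OF.items():
        z = brute_force([u, v] + sorted(interior), inside, fixed={u: s_u, v: s_v})
        edge[state] += z  # objective modification: add z_A..z_D to the cut pair
    return reduced


# Hypothetical graph: c4 and c5 hang off the rest only through the pair (c2, c3).
weights = {
    ("c1", "c2"): {"A": 3, "B": 0, "C": 1, "D": 0},
    ("c1", "c3"): {"A": 0, "B": 2, "C": 0, "D": 1},
    ("c2", "c4"): {"A": 0, "B": 4, "C": 0, "D": 2},
    ("c4", "c5"): {"A": 1, "B": 0, "C": 3, "D": 0},
    ("c5", "c3"): {"A": 2, "B": 0, "C": 0, "D": 5},
}
full = brute_force(["c1", "c2", "c3", "c4", "c5"], weights)
reduced = eliminate_through_2cut(weights, ("c2", "c3"), {"c4", "c5"})
small = brute_force(["c1", "c2", "c3"], reduced)
```

Here `small` equals `full`: the reduced three-contig problem has the same optimum as the original five-contig one, which is the property that lets the solution be computed in stages.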

Page 22: Scaffolding Large Genomes Using Integer Linear Programming

SPQR-tree Based Implementation

• SPQR-tree efficiently finds 2 cuts (Tarjan, 73)

• DFS of SPQR-tree is used to schedule elimination order for NSDP

Page 23: Scaffolding Large Genomes Using Integer Linear Programming

Post Processing ILP Solution

The ILP solution:
• May have cycles
• Is not a total ordering

For each connected component:

Figure: the ILP solution on contigs A, B, C, D, E, F, and the bipartite graph built from an outgoing and an incoming copy of each contig.

Bipartite matching objectives:
• Max weight
• Max cardinality
• Max cardinality / Max weight
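The bipartite step can be pictured as giving every contig an "outgoing" copy and an "incoming" copy, then matching outgoing copies to incoming ones so that each contig keeps at most one successor and at most one predecessor. The sketch below covers only the max cardinality objective, using the classic augmenting-path (Kuhn) algorithm. It is a generic illustration with made-up links, not the authors' implementation, and it does not handle the max weight variants.

```python
def max_cardinality_matching(successors):
    """Match each contig to at most one successor (augmenting-path algorithm).

    successors maps a contig (its outgoing copy) to candidate next contigs
    (their incoming copies). Returns {contig: chosen next contig}.
    """
    pred_of = {}  # incoming copy -> outgoing copy currently matched to it

    def try_assign(u, seen):
        for v in successors.get(u, ()):
            if v in seen:
                continue
            seen.add(v)
            # Take v if it is free, or if its current partner can move elsewhere.
            if v not in pred_of or try_assign(pred_of[v], seen):
                pred_of[v] = u
                return True
        return False

    for u in successors:
        try_assign(u, set())
    return {u: v for v, u in pred_of.items()}


# Hypothetical links left in an ILP solution for contigs A..F.
links = {"A": ["C", "B"], "B": ["C"], "C": ["D"], "D": ["E", "F"]}
chain = max_cardinality_matching(links)
```

In this example A first grabs C, then is rerouted to B so that B can keep C, which yields the path A, B, C, D, E. A matching can still close a cycle, so cycles left over would need to be broken afterwards, as an earlier slide notes.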

Page 24: Scaffolding Large Genomes Using Integer Linear Programming

GAGE Framework

Genome                    Size (Mb)   # reads
Staphylococcus aureus     2.9         3,494,070
Rhodobacter sphaeroides   4.6         2,050,868
Human Chr14               107         22,669,408

Assembled using: ABySS, Allpaths-LG, Bambus2, CABOG, MSR-CA, SGA, SOAPdenovo, Velvet

Scaffolded using: SILP (our method), Opera, MIP, Bambus2

Page 25: Scaffolding Large Genomes Using Integer Linear Programming

Testing Metrics

TPN50: break each scaffold at its incorrect edges, then find the N50. N50 is the largest length L such that the pieces of length at least L contain at least 50% of the total bases.
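A small sketch of the TPN50 idea follows: split each scaffold wherever a join is marked incorrect, then take the N50 of the resulting pieces. The data layout (a scaffold as a list of contig lengths, incorrect joins as `(scaffold index, position)` pairs) is an assumption for illustration, and gap sizes are ignored.

```python
def n50(lengths):
    """Largest length L such that pieces of length >= L hold at least half the bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0


def tp_n50(scaffolds, incorrect):
    """N50 after cutting every scaffold at its incorrect joins.

    scaffolds: list of lists of contig lengths, in scaffold order.
    incorrect: set of (scaffold_index, k), meaning the join between contig k
    and contig k + 1 of that scaffold is wrong (hypothetical encoding).
    """
    pieces = []
    for idx, contig_lengths in enumerate(scaffolds):
        current = 0
        for k, length in enumerate(contig_lengths):
            current += length
            # Close the piece at an incorrect join or at the scaffold end.
            if (idx, k) in incorrect or k == len(contig_lengths) - 1:
                pieces.append(current)
                current = 0
    return n50(pieces)


scaffolds = [[100, 200, 300], [400]]
plain = tp_n50(scaffolds, set())          # no joins marked incorrect
penalised = tp_n50(scaffolds, {(0, 1)})   # join between contigs 1 and 2 is wrong
```

With no incorrect joins the pieces are 600 and 400 bases, so the N50 is 600. Cutting the first scaffold gives pieces of 300, 300 and 400, and the value drops to 300.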

Binary Classification: given n contigs in a scaffold, how many of the n−1 adjacencies can you predict?

• PPV
• Sensitivity
• MCC
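The three classification measures can be computed from counts of true and false positive and negative adjacencies using their standard definitions. The sketch below does so; how a predicted adjacency is judged true or false is left to the caller and is not specified here.

```python
import math


def classification_metrics(tp, fp, tn, fn):
    """Return (PPV, sensitivity, MCC) from confusion-matrix counts."""
    ppv = tp / (tp + fp) if tp + fp else 0.0
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally reported as 0 when any marginal total is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return ppv, sensitivity, mcc


# Made-up counts: 8 correct joins predicted, 2 wrong joins, 5 true joins missed.
ppv, sensitivity, mcc = classification_metrics(tp=8, fp=2, tn=85, fn=5)
```

PPV penalises wrong joins and sensitivity penalises missed ones. MCC combines all four counts into a single value between −1 and 1.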

Page 26: Scaffolding Large Genomes Using Integer Linear Programming

Results

Figure: Scaffolding TPN50. TPN50 (bp) by genome (staph, rhodo, chr14) for silp, opera, mip, bambus2; y-axis from 0 to 450,000.

Page 27: Scaffolding Large Genomes Using Integer Linear Programming

Results

Figure: PPV. PPV by genome (staph, rhodo, chr14) for silp, opera, mip, bambus2; y-axis from 0% to 120%.

Page 28: Scaffolding Large Genomes Using Integer Linear Programming

Results

Figure: Sensitivity. Sensitivity by genome (staph, rhodo, chr14) for silp, opera, mip, bambus2; y-axis from 0% to 80%.

Page 29: Scaffolding Large Genomes Using Integer Linear Programming

Results

Figure: Matthews Correlation Coefficient. MCC by genome (staph, rhodo, chr14) for silp, opera, mip, bambus2; y-axis from 0% to 90%.

Page 30: Scaffolding Large Genomes Using Integer Linear Programming

Conclusions

Success
• ILP solves scaffolding problem!
• NSDP works

Improvements
• Include SOAPdenovo, Allpaths-LG scaffolds in comparison
• Look at parameter effects
• Practical considerations (read style, multi-libraries, merge ctgs)

Future Work
• Where else can I apply NSDP?
• Scaffold before assembly … promising
• Structural Variation??

Page 31: Scaffolding Large Genomes Using Integer Linear Programming

Questions?