Scaffolding Problems Gao Song 2010/04/27

Gao Song 2010/04/27. Outline Concepts Problem definition Non-error Case Edge-error Case Disconnected Components Simulated Data Future Work.

Dec 17, 2015

Page 1: Scaffolding Problems
Gao Song
2010/04/27

Page 2: Outline
Concepts
Problem definition
Non-error Case
Edge-error Case
Disconnected Components
Simulated Data
Future Work

Page 3: Concepts
Contig:
Edge (PET): library size
Scaffolding: a sequence of contigs
Happy Edge:
  Real distance <= expected distance
  Orientations of both contigs are correct
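As a rough illustration of the happy-edge definition above, the following Python sketch checks one edge against a candidate layout. The `Edge` fields, the `is_happy` name, and the convention that an edge runs from contig `a` to contig `b` with a maximum allowed gap are assumptions made for this example; the slides do not specify a data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    a: str            # contig expected to come first (assumed convention)
    b: str            # contig expected to come second
    max_gap: int      # expected distance bound, e.g. derived from the library size
    a_forward: bool   # orientation the PET evidence implies for contig a
    b_forward: bool   # orientation the PET evidence implies for contig b


def is_happy(edge, start, length, forward):
    """Return True if the edge is happy in the given layout.

    start[c] and length[c] give the position and length of contig c;
    forward[c] is True when c is placed in forward orientation.
    """
    # Second condition on the slide: both orientations must be correct.
    if forward[edge.a] != edge.a_forward or forward[edge.b] != edge.b_forward:
        return False
    # First condition on the slide: real distance <= expected distance.
    # Assumption: a negative gap (b starting before a ends) counts as unhappy.
    real_gap = start[edge.b] - (start[edge.a] + length[edge.a])
    return 0 <= real_gap <= edge.max_gap
```

An edge that fails either condition would be counted as unhappy when a scaffold is scored against the bound p.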

Page 4: Problem Definition
Version 1: Given a set of contigs and a set of edges, find a scaffold which has at most p unhappy edges
Version 2: Given a set of contigs and a set of edges, find a scaffold which has at most p unhappy edges and is also the optimal solution

Page 5: Non-error Case
Connected graph
Partial Layout:
Dangling Edge: an edge with only one end in the partial layout
Active region: the sequence from the first contig having dangling edges to the end of the partial layout; its length is less than the library size
Domain of a partial layout: all nodes in the partial layout
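The three definitions above can be sketched in a few lines of Python. This is only an illustration: representing a partial layout as a list of contig names and an edge as a pair of names is an assumption, as are the function names.

```python
def dangling_edges(layout, edges):
    """Edges with exactly one end inside the partial layout."""
    placed = set(layout)
    return {e for e in edges if (e[0] in placed) != (e[1] in placed)}


def active_region(layout, edges):
    """Suffix of the layout starting at the first contig with a dangling edge."""
    touched = {contig for e in dangling_edges(layout, edges) for contig in e}
    for i, contig in enumerate(layout):
        if contig in touched:
            return layout[i:]
    return []  # no dangling edges, so the active region is empty


def domain(layout):
    """All nodes in the partial layout, ignoring their order."""
    return frozenset(layout)
```

For the layout A, B, C with edges A-B, B-D and C-E, the dangling edges are B-D and C-E, so the active region is B, C and the domain is {A, B, C}.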

Page 6: Non-error Case
Theorem: if two partial layouts l1 and l2 have the same active region and dangling set, then
  (1) they have the same domain
  (2) both or neither of them can extend to a solution
Proof:

Page 7: Procedure
Find the unassigned nodes
Select the nearest node as the next assigned node
Update the current partial layout
  Remove all dangling edges incident to the new node
  Add the new dangling edges of the new node
  Remove contigs from the active region

Page 8: Main Procedure
Find all nodes which have no ancestors and select one to start
From an active region, get all unassigned nodes, and update the partial layout
Remember all visited partial layouts
If the dangling edge set is empty, output the results
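The main procedure above might look roughly like the following Python sketch for the non-error case. It is a simplified model, not the author's implementation: edges are assumed to be directed triples `(a, b, max_gap)` meaning a precedes b, contigs are assumed to be placed back to back with no gap between neighbours, and orientations are ignored. Visited partial layouts are remembered by their (active region, dangling set) pair, which relies on the theorem stated earlier and on the graph being connected.

```python
def find_scaffold(lengths, edges):
    """lengths: contig -> length; edges: list of (a, b, max_gap) triples."""
    preds = {c: [] for c in lengths}
    for a, b, max_gap in edges:
        preds[b].append((a, max_gap))
    visited = set()  # (active region, dangling set) pairs already explored

    def key_of(layout):
        placed = set(layout)
        dangling = frozenset((a, b) for a, b, _ in edges
                             if (a in placed) != (b in placed))
        touched = {c for pair in dangling for c in pair}
        first = next((i for i, c in enumerate(layout) if c in touched),
                     len(layout))
        return tuple(layout[first:]), dangling

    def search(layout, end):
        key = key_of(layout)
        if not key[1]:            # dangling edge set is empty: output result
            return layout
        if key in visited:        # equivalent layout already failed
            return None
        visited.add(key)
        cursor = end[layout[-1]]  # coordinate where the next contig starts
        for c in lengths:
            if c in end:
                continue          # already assigned
            if any(a not in end for a, _ in preds[c]):
                continue          # an ancestor is still unassigned
            if any(cursor - end[a] > gap for a, gap in preds[c]):
                continue          # placing c here would make an edge unhappy
            end[c] = cursor + lengths[c]
            result = search(layout + [c], end)
            if result is not None:
                return result
            del end[c]            # backtrack
        return None

    for start in lengths:
        if not preds[start]:      # a node with no ancestors
            result = search([start], {start: lengths[start]})
            if result is not None:
                return result
    return None
```

With lengths A=100, B=200, C=150 and edges (A, B, 50), (B, C, 50), (A, C, 300), the sketch returns the order A, B, C; tightening the A to C bound to 50 leaves no valid scaffold.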

Page 9: Time and space complexity
Two possibilities
  k vertices in active region: one possible next node
  Fewer than k vertices in active region: n possible next nodes
Complexity
  O(n^k) * O(1)
  O(n^(k-1)) * O(n)
Total time complexity: O(n^k)
Total space complexity: store all visited partial orders
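The total can be checked by adding the two cases. The worked equation below assumes that the exponents on the slide were flattened during extraction (so that "nk" stands for n to the power k), with n the number of contigs and k the maximum number of vertices in an active region.

```latex
\underbrace{O(n^{k}) \cdot O(1)}_{\text{$k$ vertices in the active region}}
\;+\;
\underbrace{O(n^{k-1}) \cdot O(n)}_{\text{fewer than $k$ vertices}}
\;=\; O(n^{k}) + O(n^{k})
\;=\; O(n^{k})
```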

Page 10: Introduce Edge Error
Types of edge error
  Chimeric PETs: mapping error
  Misassembled contigs
Solution
  Filtering: filter chimeric PETs
    Select x% of PETs
    Shuffle them to get chimeric PETs
    Cluster them to find threshold
  Local threshold
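One possible reading of the filtering steps above is sketched below in Python. The details are assumptions, since the slide gives only the outline: a PET is modelled as a pair of contig names, "shuffling" is taken to mean re-pairing the two ends at random, "clustering" is taken to mean counting simulated PETs per contig pair, and the threshold is the largest such count. The local threshold variant is not covered.

```python
import random
from collections import Counter


def chimeric_threshold(pets, fraction, seed=0):
    """Estimate how many PETs a purely chimeric edge could collect."""
    rng = random.Random(seed)
    # Select x% of the PETs (at least one).
    sample = rng.sample(pets, max(1, int(len(pets) * fraction)))
    # Shuffle one end to break the true pairing and simulate chimeric PETs.
    left = [a for a, _ in sample]
    right = [b for _, b in sample]
    rng.shuffle(right)
    # Cluster the simulated PETs by the contig pair they connect.
    clusters = Counter((a, b) for a, b in zip(left, right) if a != b)
    return max(clusters.values(), default=0)


def filter_edges(pets, threshold):
    """Keep contig pairs supported by more PETs than the threshold."""
    support = Counter(pets)
    return {pair for pair, count in support.items() if count > threshold}
```

A contig pair then survives filtering only if it has more supporting PETs than random re-pairing produced in the simulation.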


Page 11: Introduce Edge Error
There are p unhappy edges in the final scaffolding
Partial layout
  Dangling edges: real dangling edges; wrong edges

Page 12: Equivalence Class
Active region, set of dangling edges, count of current wrong edges
Same domain
Assumption: the partial order is a connected graph
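In code, the equivalence class above could serve as a lookup key for partial layouts that have already been explored. The sketch below is illustrative only; the function names and the tuple layout of the key are assumptions.

```python
def equivalence_key(active_region, dangling_edges, wrong_count):
    """Key identifying a partial layout's equivalence class.

    The active region is ordered, the dangling edges are an unordered set,
    and wrong_count is the number of edges already marked wrong.
    """
    return (tuple(active_region), frozenset(dangling_edges), wrong_count)


def already_visited(visited, active_region, dangling_edges, wrong_count):
    """Record the class in `visited`; return True if it was seen before."""
    key = equivalence_key(active_region, dangling_edges, wrong_count)
    if key in visited:
        return True
    visited.add(key)
    return False
```

Two layouts with the same active region and dangling edges but different wrong-edge counts get different keys, so both are explored.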

Page 13: Get Unassigned Nodes
Sort the unassigned nodes
Properties of nodes:
  Steps to reach this node
  Distance to the end of active region
  Unhappy edges introduced due to this node

Page 14: Sort Unassigned Nodes
Breadth-first search
  Select the smallest possible distance: > threshold
Sort nodes:
  Fewer than 5 steps: compare by distance; same distance: compare by unhappy edges
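The ordering above could be expressed as a sort key. The Python sketch below shows one interpretation, labelled as such: nodes reachable in fewer than 5 breadth-first steps come first, then candidates are ordered by distance to the end of the active region, with ties broken by the number of unhappy edges they introduce. The `Candidate` record and both function names are hypothetical, and the distance threshold mentioned on the slide is left out because its exact role is not stated.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    steps: int      # breadth-first steps needed to reach this node
    distance: int   # distance to the end of the active region
    unhappy: int    # unhappy edges introduced by placing this node


def bfs_steps(adjacency, sources):
    """Number of breadth-first steps from the source nodes to each node."""
    steps = {s: 0 for s in sources}
    queue = deque(sources)
    while queue:
        u = queue.popleft()
        for v in adjacency.get(u, ()):
            if v not in steps:
                steps[v] = steps[u] + 1
                queue.append(v)
    return steps


def sort_candidates(candidates, max_steps=5):
    """Order candidates: near nodes first, then distance, then unhappy edges."""
    return sorted(
        candidates,
        key=lambda c: (c.steps >= max_steps, c.distance, c.unhappy),
    )
```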

Page 15: Update Partial Layout
Check if all incident dangling edges that are not marked wrong are happy
  If yes, just remove all those edges and add the new node
  If no, check if setting all unhappy edges as omitted will result in a disconnected graph
    If no, just add the new node and remove the dangling edges
    If yes, discard the current partial layout, to avoid inserting a disconnected component into the sequence
Add new dangling edges
Remove all dangling edges which are not happy; check connectedness
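The connectedness check in the update step can be sketched as a graph traversal over the edges that remain after the unhappy ones are omitted. This Python example treats edges as undirected pairs of contig names; the function names are assumptions.

```python
def is_connected(nodes, edges):
    """True if the undirected graph on `nodes` with `edges` is connected."""
    nodes = set(nodes)
    if not nodes:
        return True
    adjacency = {n: set() for n in nodes}
    for a, b in edges:
        if a in nodes and b in nodes:
            adjacency[a].add(b)
            adjacency[b].add(a)
    start = next(iter(nodes))
    seen = {start}
    stack = [start]
    while stack:
        u = stack.pop()
        for v in adjacency[u]:
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen == nodes


def can_omit(nodes, edges, unhappy):
    """True if omitting the unhappy edges leaves the graph connected."""
    remaining = [e for e in edges if e not in unhappy]
    return is_connected(nodes, remaining)
```

When `can_omit` returns False, the update step on the slide discards the current partial layout rather than marking the edges as wrong.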

Page 17: Main Procedure
If the active region is empty
  The current connected component is finished
  Check if the dangling edge set is empty
    If yes, output the result
    If no, use the dangling edges to find a new node and start another scaffolding

Page 18: Disconnected Components
First find all the connected components and sort them according to the number of nodes
From the first component, find a solution, which omits p_1 edges
For the i-th component, if there is no solution that omits at most p - sum(p_1, ..., p_(i-1)) edges, remember all the stop points, return to the (i-1)-th component, and see if it can find a solution which omits fewer than p_(i-1) edges. If yes, continue from the stop point of the i-th component.
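The first step above, together with the edge budget left for the i-th component, could be sketched in Python as follows. Sorting the largest component first is an assumption, since the slide does not state the sort direction; the function names are also hypothetical.

```python
def connected_components(nodes, edges):
    """Connected components of an undirected graph, largest first."""
    adjacency = {n: set() for n in nodes}
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    seen = set()
    components = []
    for start in nodes:
        if start in seen:
            continue
        component = {start}
        seen.add(start)
        stack = [start]
        while stack:
            u = stack.pop()
            for v in adjacency[u]:
                if v not in seen:
                    seen.add(v)
                    component.add(v)
                    stack.append(v)
        components.append(component)
    # Assumed direction: largest component first.
    return sorted(components, key=len, reverse=True)


def remaining_budget(p, omitted_so_far):
    """Edges the next component may omit: p - sum(p_1, ..., p_(i-1))."""
    return p - sum(omitted_so_far)
```

With a total budget of p = 5 and earlier components omitting 2 and 1 edges, the next component may omit at most 2.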

Page 19:
If the i-th component finishes the whole search and finds more than one solution, remember only the solution with minimum p_i. Later, whenever the search comes to this component, use this solution as part of the partial results.

Page 20: Optimal Solution
Branch and Bound
  P' edges

Page 21: Simulated Data Result
Node Num: 1522 nodes
Contig length: 600 - 10,000

Wrong edges   p     Time (ms)
0             0     2765
1             1     2984
2             2     4984
3             3     6562
4             4     7000
5             5     7328
6             6     7281
7             7     7343
8             8     7406
9             9     51813
10            10    216984

Page 22: Future Work
Find the optimal solution
Wrong contigs
Repeats
How to deal with large p
Find a good way to sort the unassigned nodes

Page 23: Thank you