
Assembly concepts and methods - schatzlab.cshl.edu

Genome sequence assembly

Assembly concepts and methods

Mihai Pop & Michael Schatz

Center for Bioinformatics and Computational Biology

University of Maryland

August 13, 2006

Outline

• Shotgun sequencing overview
• Shotgun sequencing statistics
• Theoretical Foundations
• Assembly algorithms
• Scaffolding

A Genome Sequencing Project

• Random sequencing
– Library construction
– Colony picking
– Template preparation
– Sequencing reactions
– Base calling
– Sequence files
• Genome Assembly
– Celera Assembler
– Genome scaffold
– Ordered contig set
– Gap closure, sequence editing
– Re-assembly
– ONE ASSEMBLY!
– Combinatorial PCR, POMP
• Annotation
– Gene finding
– Homology searches
– Initial role assignments
– Metabolic pathways
– Gene families
– Comparative genomics
– Transcriptional/translational regulatory elements
– Repetitive sequences
• Data Release
– Publication
– www.tigr.org
• Sample tracking

Building a library

• Break DNA into random fragments (8-10x coverage)

Actual situation

Building a library

• Break DNA into random fragments (8-10x coverage)
• Sequence the ends of the fragments
– Amplify the fragments in a vector
– Sequence 800-1000 (500-700) bases at each end of the fragment

Assembling the fragments

• Break DNA into random fragments (8-10x coverage)
• Sequence the ends of the fragments
• Assemble the sequenced ends

Forward-reverse constraints

• The sequenced ends are facing towards each other
• The distance between the two fragments is known (within certain experimental error)

[Figure: a clone insert with forward (F) and reverse (R) end reads, and the ways F and R can fall in contigs I and II]
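The two constraints above can be checked mechanically for any pair of mated reads. The following is a minimal sketch, assuming each read is given as a hypothetical `(start, end, strand)` tuple on a shared coordinate axis, and that "within experimental error" means within a chosen number of standard deviations of the library's mean insert size. The function name and the default of three standard deviations are illustrative choices, not taken from the slides.

```python
def mates_consistent(read_a, read_b, insert_mean, insert_sd, n_sd=3.0):
    """Check the forward-reverse constraint for one mate pair.

    Each read is (start, end, strand) with start < end and strand "+" or "-".
    All names and the n_sd threshold are assumptions for illustration.
    """
    # Order the two reads by their left coordinate.
    left, right = sorted([read_a, read_b], key=lambda r: r[0])
    # Facing each other: the left read points right, the right read points left.
    facing = left[2] == "+" and right[2] == "-"
    # The outer span of the pair should match the insert size.
    span = right[1] - left[0]
    return facing and abs(span - insert_mean) <= n_sd * insert_sd


print(mates_consistent((0, 800, "+"), (3200, 4000, "-"), 4000, 400))  # satisfied
print(mates_consistent((0, 800, "+"), (3200, 4000, "+"), 4000, 400))  # wrong orientation
```

Pairs that fail either test are the ones a scaffolder or a mis-assembly detector would want to look at more closely.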

Building Scaffolds

• Break DNA into random fragments (8-10x coverage)
• Sequence the ends of the fragments
• Assemble the sequenced ends
• Build scaffolds

Assembly gaps

sequencing gap - we know the order and orientation of the contigs and have at least one clone spanning the gap

physical gap - no information known about the adjacent contigs, nor about the DNA spanning the gap

Sequencing gaps

Physical gaps

Finishing the project

• Break DNA into random fragments (8-10x coverage)
• Sequence the ends of the fragments
• Assemble the sequenced ends
• Build scaffolds
• Close gaps

Unifying view of assembly

Contigs

Scaffolding

Shotgun sequencing statistics

Typical contig coverage

[Figure: coverage depth (1 to 6) along a contig, plotted above the reads that make up the contig]

Imagine raindrops on a sidewalk

Lander-Waterman statistics

L = read length
T = minimum overlap
G = genome size
N = number of reads
c = coverage (NL / G)
σ = 1 – T/L

$E(\#\text{islands}) = N e^{-c\sigma}$

$E(\text{island size}) = L\left(\frac{e^{c\sigma} - 1}{c} + 1 - \sigma\right)$

contig = island with 2 or more reads

Example

Genome size: 1 Mbp   Read length: 600   Detectable overlap: 40

c    N         #islands    bases not in contigs    #contigs    bases not in any read
1    1,667     655         367,806                 614         698
3    5,000     304         49,787                  250         121
5    8,334     78          6,735                   57          20
8    13,334    7           335                     5           1
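The N and #islands values of this example follow directly from the Lander-Waterman definitions (N = cG/L, σ = 1 – T/L, E(#islands) = N·e^(–cσ)). The sketch below recomputes them for the stated parameters. The function name is a hypothetical choice, and the island-size formula is written with the factor L applied to the whole bracket, as in the Lander-Waterman result.

```python
import math


def lander_waterman(G, L, T, c):
    """Expected number of reads, islands, and island size at coverage c.

    G = genome size, L = read length, T = minimum detectable overlap.
    """
    N = c * G / L                      # number of reads needed for coverage c
    sigma = 1 - T / L                  # fraction of a read that must not overlap
    islands = N * math.exp(-c * sigma)
    island_size = L * ((math.exp(c * sigma) - 1) / c + 1 - sigma)
    return N, islands, island_size


for c in (1, 3, 5, 8):
    N, islands, size = lander_waterman(G=1_000_000, L=600, T=40, c=c)
    print(f"c={c}  N={N:8.0f}  E(#islands)={islands:6.1f}  E(island size)={size:10.0f}")
```

The computed island counts come out near 655, 304, 78, and 7.6 for c = 1, 3, 5, 8, which agrees with the example up to rounding.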

Read coverage vs. Clone coverage

[Figure: a 4 kbp clone insert with a 1 kbp read at each end]

Read coverage = 8x

Clone (insert) coverage = ? 16x

BAC-end 2x coverage implies 100x coverage by BACs

(1 BAC clone = approx. 100kbp)
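Both numbers on this slide follow from one ratio: each insert contributes two reads, so clone coverage equals read coverage scaled by insert length over twice the read length. A minimal sketch, assuming 1 kbp reads for the BAC ends as well (the slide does not state the BAC-end read length), with a hypothetical function name:

```python
def clone_coverage(read_coverage, insert_len, read_len, reads_per_insert=2):
    """Coverage by whole inserts implied by a given coverage by end reads."""
    # Number of inserts = read bases / (reads_per_insert * read_len);
    # each insert then covers insert_len bases of the genome.
    return read_coverage * insert_len / (reads_per_insert * read_len)


print(clone_coverage(8, 4_000, 1_000))    # 4 kbp inserts, 8x read coverage
print(clone_coverage(2, 100_000, 1_000))  # 100 kbp BACs, 2x BAC-end coverage (read length assumed)
```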

Theoretical Foundations

Shortest Common Superstring

Given: S = {s1, …, sn}

Problem: Find minimal superstring of S

s1 = CACCC
s2 = CCGGGTGC
s3 = CCACC

s1,s2,s3 = CACCCGGGTGCCACC (15)
s1,s3,s2 = CACCCACCGGGTGC (14)
s2,s1,s3 = CCGGGTGCACCCACC (15)
s2,s3,s1 = CCGGGTGCCACCC (13)
s3,s1,s2 = CCACCCGGGTGC (12)
s3,s2,s1 = CCACCGGGTGCACCC (15)

NP-Complete by reduction from VERTEX-COVER and later DIRECTED-HAMILTONIAN-PATH
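For three strings the six orderings can simply be enumerated. The sketch below is an illustrative brute force, not a practical method: it merges the strings of each permutation with maximal suffix-prefix overlaps and keeps the shortest result. The function names are hypothetical, and the approach assumes no input string is contained in another.

```python
from itertools import permutations


def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def merge_in_order(strings):
    """Superstring obtained by overlapping the strings in the given order."""
    result = strings[0]
    for s in strings[1:]:
        result += s[overlap(result, s):]
    return result


def shortest_superstring(strings):
    """Try every ordering; exponential in the number of strings."""
    return min((merge_in_order(p) for p in permutations(strings)), key=len)


s1, s2, s3 = "CACCC", "CCGGGTGC", "CCACC"
for order in permutations([s1, s2, s3]):
    merged = merge_in_order(order)
    print(merged, len(merged))
print("shortest:", shortest_superstring([s1, s2, s3]))
```

The enumeration grows factorially with the number of strings, which is why the slides that follow turn to approximation.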

RECONSTRUCT

Given: F = {f1, …, fn}, error rate ε

Problem: Find minimal sequence S over F such that for all fi in F, there is a substring B of S such that:

$\min(ed(f_i, B),\ ed(f_i^c, B)) \le \varepsilon\,|f_i|$

where ed is the edit distance and $f_i^c$ is the reverse complement of $f_i$.

f1c = GGGTG
f2c = GCACCCGG
f3c = GGTGG

ed(ACGTA, ACGGTA) = 1
ed(ACGGGTA, ACGGTA) = 1
ed(ACGCTA, ACGGTA) = 1

Also NP-complete: Take instance of SUPERSTRING, expand strings to force the original orientation, set ε = 0, and attempt to solve with RECONSTRUCT.
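The ingredients of the RECONSTRUCT condition are small enough to compute directly. The sketch below uses the standard dynamic-programming edit distance (unit cost for insertion, deletion, and substitution) and a reverse-complement helper. The function names, including `fits`, are hypothetical; only the condition itself comes from the slide.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # match or substitute
        prev = cur
    return prev[-1]


def reverse_complement(seq):
    """Reverse complement of a DNA string over A, C, G, T."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def fits(fragment, B, eps):
    """True if B explains fragment (in either orientation) within error rate eps."""
    d = min(edit_distance(fragment, B),
            edit_distance(reverse_complement(fragment), B))
    return d <= eps * len(fragment)


print(edit_distance("ACGTA", "ACGGTA"))   # one insertion
print(reverse_complement("CCGGGTGC"))     # f2c on the slide
print(fits("GGGTG", "CACCC", eps=0.0))    # matches only as a reverse complement
```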

Overlap Graph

Go = (V,E,o)

V = {s1, s2, s3}, E = {(si, sj)}

o(si,sj) = |v|, where si = uv and sj = vw

[Figure: overlap graph on s1 = CACCC, s2 = CCGGGTGC, s3 = CCACC, with edge weights 2, 2, 4, 1, 1, 2]

The overlap graph, Go, encodes the amount of overlap between all pairs of strings.
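For the three example strings the edge weights can be computed by taking, for every ordered pair, the longest suffix of one string that is a prefix of the other. A minimal sketch follows, with hypothetical function names and the graph stored as a plain dictionary keyed by ordered pairs.

```python
def overlap(a, b):
    """o(a, b): length of the longest v with a = uv and b = vw."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def overlap_graph(strings):
    """Map each ordered pair of distinct string names to its overlap length."""
    return {(i, j): overlap(a, b)
            for i, a in strings.items()
            for j, b in strings.items()
            if i != j}


reads = {"s1": "CACCC", "s2": "CCGGGTGC", "s3": "CCACC"}
for (i, j), w in sorted(overlap_graph(reads).items()):
    print(f"{i} -> {j}: {w}")
```

The six weights that come out (2, 2, 1, 1, 4, 2) are the same multiset of labels that appears on the slide's figure.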

Paths through graphs and assembly

• Hamiltonian circuit: visit each node (city) exactly once, returning to the start

[Figure: a graph on nodes A through I, and the same nodes laid out in order along the genome]

Greedy Approximation

[Figure: overlap graph Go = (V,E,o) on s1 = CACCC, s2 = CCGGGTGC, s3 = CCACC, showing edges of weight 4 and 2]

GREEDY(S) ≤ 2.5 OPT(S)

Runtime $O(n^2 l^2)$

SUPERSTRING is MAX SNP-hard, so this is one of the best approximation algorithms possible.

Greedy Assembly

Build a rough map of fragment overlaps

1. Pick the largest scoring overlap
2. Merge the two fragments
3. Repeat until no more merges can be done

• TIGR Assembler
• phrap
• gap
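The three steps above can be sketched on exact strings. This is an illustration of the greedy idea only, assuming error-free reads in a single orientation and scoring an overlap by its length. It is not how TIGR Assembler, phrap, or gap are implemented, and the function names and `min_overlap` parameter are hypothetical.

```python
def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(fragments, min_overlap=1):
    """Repeatedly merge the pair with the largest overlap; return the contigs."""
    contigs = list(fragments)
    while len(contigs) > 1:
        # Step 1: pick the largest scoring overlap among all ordered pairs.
        best = (0, None, None)
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b)
                    if k > best[0]:
                        best = (k, i, j)
        k, i, j = best
        # Step 3: stop when no acceptable merge is left.
        if k < min_overlap:
            break
        # Step 2: merge the two fragments.
        merged = contigs[i] + contigs[j][k:]
        contigs = [c for idx, c in enumerate(contigs) if idx not in (i, j)]
        contigs.append(merged)
    return contigs


print(greedy_assemble(["CACCC", "CCGGGTGC", "CCACC"]))
```

On the three example strings the greedy choice (overlap 4 first, then overlap 2) happens to reach the 12-base optimum. In general it only guarantees the approximation bound quoted earlier.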

Overlap-layout-consensus

Main entity: read
Relationship between reads: overlap

[Figure: reads 1 through 9 linked by overlaps, then laid out in order, then aligned to call a consensus]

ACCTGA
ACCTGA
AGCTGA
ACCAGA
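The consensus step can be illustrated with a column-wise majority vote. The sketch assumes the 24 bases on this slide are four 6-base rows, one consensus (ACCTGA) and three aligned reads (ACCTGA, AGCTGA, ACCAGA). That split is an interpretation of the slide rather than something it states, and real consensus callers also weigh base qualities and handle gaps.

```python
from collections import Counter


def consensus(aligned_reads):
    """Majority base in each column of equal-length, gap-free aligned reads."""
    return "".join(Counter(column).most_common(1)[0][0]
                   for column in zip(*aligned_reads))


reads = ["ACCTGA", "AGCTGA", "ACCAGA"]  # assumed read rows from the slide
print(consensus(reads))
```

Each read differs from the others at one position, yet every column still has a two-to-one majority, so the vote recovers ACCTGA.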

Repeats!

[Figure: True Layout of Reads, regions 1, 2, 3 with repeat copies R1 and R2 between them; Greedy Reconstruction, R1 and R2 merged into a single copy R1 + R2]

Mis-assembled repeats

[Figure: mis-assemblies caused by repeat copies I, II, III, IV: collapsed tandem, excision, rearrangement]

Modern Assembly

Try to detect presence of repeats by

1. Unusual depth of coverage (arrival rate)
2. Mate Pair information
3. Forks in overlap graph

[Figure: greedy reconstruction with R1 + R2 collapsed between regions 1, 2, 3]
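The first signal, unusual depth of coverage, lends itself to a very small sketch: a contig that absorbed reads from several repeat copies shows roughly a multiple of the average depth. The code below flags contigs whose depth exceeds a fixed multiple of the overall mean. The input layout, the function name, and the 1.8 threshold are all assumptions for illustration; production assemblers use a proper statistical test on the read arrival rate rather than a fixed cutoff.

```python
def flag_possible_repeats(contigs, factor=1.8):
    """Return names of contigs whose read depth is unusually high.

    contigs maps a name to (length_bp, total_read_bases); both the layout
    and the threshold `factor` are hypothetical.
    """
    total_len = sum(length for length, _ in contigs.values())
    total_bases = sum(bases for _, bases in contigs.values())
    mean_depth = total_bases / total_len
    return sorted(name for name, (length, bases) in contigs.items()
                  if bases / length > factor * mean_depth)


example = {
    "contig1": (10_000, 80_000),   # about 8x
    "contig2": (12_000, 96_000),   # about 8x
    "contig3": (1_000, 24_000),    # about 24x, likely several collapsed copies
}
print(flag_possible_repeats(example))
```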

SCAFFOLDING

Scaffolding

• Given a set of non-overlapping contigs, order and orient them along a chromosome

[Figure: contigs I, II, III, IV before and after ordering and orientation]

Clone-mates

[Figure: a clone insert with forward (F) and reverse (R) end reads, and the ways F and R can fall in contigs I and II]

Scaffolder output

Sequencing gaps

Physical gaps

• order and orientation of contigs
• size of gaps between contigs
• linking evidence: mate-pairs spanning gaps

Problems with the data

• Incorrect sizing of inserts
– cut from gel – sizing is subjective
– error increases with size
• Chimeras (ends belong to different inserts)
– biological reasons (esp. for large sized inserts)
– sample tracking (human error)

• Software must handle a certain error rate.

Theoretical abstraction

• Given a set of entities (reads/contigs) and constraints between them (overlaps/mate pairs) provide a linear/circular embedding that preserves most constraints.

Graph representation

• Nodes: contigs
• Directed edges: constraints on relative placement of contigs – relative order and relative orientation
• Embedding: order (coordinate along chromosome) and orientation (strand sampled)

Challenges

• Orientation – node coloring problem (forward/reverse)
– feasibility – no cycles with odd number of “reversal” edges
– optimality – remove minimum number of edges such that a solution exists (NP-hard)
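The feasibility half of the orientation problem is a two-coloring check and can be done with a breadth-first traversal: propagate orientations along links, flipping across "reversal" edges, and report failure when a contig would need both colors (which happens exactly when some cycle has an odd number of reversal edges). The sketch below covers feasibility only, not the NP-hard edge-removal problem. The link format `(a, b, reversal)` and the function name are hypothetical.

```python
from collections import deque


def orient(contigs, links):
    """Assign True (forward) / False (reverse) to each contig, or None if infeasible.

    links: iterable of (a, b, reversal); reversal=True means a and b must
    receive opposite orientations.
    """
    adjacent = {c: [] for c in contigs}
    for a, b, reversal in links:
        adjacent[a].append((b, reversal))
        adjacent[b].append((a, reversal))

    orientation = {}
    for start in contigs:
        if start in orientation:
            continue
        orientation[start] = True          # arbitrary choice per component
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v, reversal in adjacent[u]:
                wanted = orientation[u] != reversal   # flip across reversal edges
                if v not in orientation:
                    orientation[v] = wanted
                    queue.append(v)
                elif orientation[v] != wanted:
                    return None            # odd number of reversals on a cycle
    return orientation


print(orient(["I", "II", "III"],
             [("I", "II", True), ("II", "III", True), ("I", "III", False)]))
print(orient(["I", "II", "III"],
             [("I", "II", True), ("II", "III", False), ("I", "III", False)]))
```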

Challenges

• Ordering – generate a linear embedding
– feasibility – lengths of parallel DAG paths are consistent
– optimality – remove minimum number of edges such that DAG is feasible (NP-hard)
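The ordering feasibility condition can be sketched as follows: walk the DAG in topological order, give each contig the coordinate implied by every incoming edge, and require those implied coordinates to agree within a tolerance. This is a simplified illustration under stated assumptions: edges are hypothetical `(u, v, distance)` triples giving the offset between contig starts, every source contig is placed at coordinate 0, and the tolerance is a single fixed number rather than a per-link error estimate.

```python
from collections import deque


def place_contigs(edges, tolerance=0.0):
    """Return {contig: coordinate}, or None if parallel paths disagree.

    edges: iterable of (u, v, d) meaning start(v) - start(u) should be about d.
    """
    successors, indegree = {}, {}
    for u, v, d in edges:
        successors.setdefault(u, []).append((v, d))
        successors.setdefault(v, [])
        indegree[v] = indegree.get(v, 0) + 1
        indegree.setdefault(u, 0)

    implied = {v: [] for v in indegree}       # coordinates implied by incoming edges
    queue = deque(v for v, n in indegree.items() if n == 0)
    position = {}
    while queue:                              # Kahn's topological traversal
        u = queue.popleft()
        candidates = implied[u] or [0.0]      # assumption: sources sit at 0
        if max(candidates) - min(candidates) > tolerance:
            return None                       # parallel paths have inconsistent lengths
        position[u] = sum(candidates) / len(candidates)
        for v, d in successors[u]:
            implied[v].append(position[u] + d)
            indegree[v] -= 1
            if indegree[v] == 0:
                queue.append(v)
    # If some node was never reached, the graph contained a cycle.
    return position if len(position) == len(indegree) else None


print(place_contigs([("A", "B", 100), ("B", "C", 200), ("A", "C", 300)]))
print(place_contigs([("A", "B", 100), ("B", "C", 200), ("A", "C", 500)]))
```

In the first call the two paths from A to C both have length 300, so an embedding exists. In the second they disagree by 200, so at least one link would have to be removed.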

The real world

• Use of scaffolds
– Analysis – longest unambiguous sub-graphs
– Finishing – present all “reliable” relationships between contigs
• Sources of error
– mis-assemblies
– sizing errors (increases with library size)
– chimeras

Ambiguous scaffold

[Figure: contigs I, II, III linked ambiguously, with the alternative linear layouts the links allow]

Repeats vs. Haplotypes

[Figure: graph of contigs 1 through 9 with links labeled 92%, 95%, 87%, 83%, 90%]

Hierarchical scaffolding

1. For each contig pair, consolidate all linking data into a single relationship – 2 correct links required

2. Use most reliable links to build scaffolds

3. Repeatedly build super-scaffolds based on less reliable linking data
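The three steps can be sketched in a heavily simplified form. The code below bundles mate-pair links per contig pair, discards bundles with fewer than two links, and then accepts bundles from most to least supported, skipping any join that would give a contig a third neighbor or close a loop. It ignores orientation, gap sizes, and link types, uses link count as a stand-in for reliability, and is not the Bambus implementation. All names are hypothetical.

```python
from collections import Counter


def hierarchical_scaffold(links, min_links=2):
    """Return accepted joins as (contig_a, contig_b, n_links), strongest first.

    links: iterable of (contig_a, contig_b) pairs, one per mate-pair link.
    """
    # Step 1: consolidate all links between a contig pair into one bundle.
    bundles = Counter(tuple(sorted(pair)) for pair in links)
    reliable = sorted(((pair, n) for pair, n in bundles.items() if n >= min_links),
                      key=lambda item: (-item[1], item[0]))

    scaffold_of = {}     # contig -> scaffold label
    degree = Counter()   # neighbors already assigned to each contig
    joins = []
    # Steps 2 and 3: most reliable bundles first, then progressively weaker ones.
    for (a, b), n in reliable:
        sa = scaffold_of.setdefault(a, a)
        sb = scaffold_of.setdefault(b, b)
        if sa == sb or degree[a] >= 2 or degree[b] >= 2:
            continue     # would close a loop or branch the linear scaffold
        for contig in list(scaffold_of):
            if scaffold_of[contig] == sb:
                scaffold_of[contig] = sa
        degree[a] += 1
        degree[b] += 1
        joins.append((a, b, n))
    return joins


links = [("A", "B")] * 5 + [("B", "C")] * 3 + [("A", "C")] * 2 + [("C", "D")]
print(hierarchical_scaffold(links))
```

In the example, the A-C bundle is skipped because A and C are already in the same scaffold, and the single C-D link never forms a bundle.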

Linking information

• Overlaps

• Mate-pair links

• Similarity links

• Physical markers

• Gene synteny

reference genome

physical map

BAMBUS (bamboo)

Best effort Attempt Multiple Branches allowed

Order, Orient

References

Review of assembly: Pop, M. 2004. Shotgun sequence assembly. In Advances in Computers, vol. 60. Elsevier, pp. 193-247.

TIGR Assembler: Sutton, G.G., et al. 1995. TIGR Assembler: A New Tool for Assembling Large Shotgun Sequencing Projects. Genome Science and Technology 1: 9-19.

Celera Assembler: Myers, E.W., et al. 2000. A whole-genome assembly of Drosophila. Science 287: 2196-2204.

Arachne: Batzoglou, S., et al. 2002. ARACHNE: a whole-genome shotgun assembler. Genome Res 12: 177-189.

Arachne: Jaffe, D.B., et al. 2003. Whole-genome sequence assembly for Mammalian genomes: arachne 2. Genome Res 13: 91-96.

phrap: Green, P. 1994. PHRAP documentation: ALGORITHMS. http://www.phrap.org.

Euler: Pevzner, P., et al. 2001. Fragment assembly with double-barreled data. Bioinformatics 17: S225-S233.

CAP3: Huang, X. and Madan, A. 1999. CAP3: A DNA Sequence Assembly Program. Genome Research 9: 868-877.

BAMBUS: Pop, M., et al. 2004. Hierarchical scaffolding with Bambus. Genome Research 14(1): 149-159.