Sequencing and Sequence Assembly: overview of the genome sequencing process
Presented by NIE, Lan, CSE497, Feb. 24, 2004

Dec 22, 2015



Introduction

Q: What does it mean to sequence?
A: To sequence a DNA molecule is to obtain the string of bases that it contains. The resulting string is also known as a read.

Q: How do we sequence?
A: Recall the Sanger sequencing technology mentioned in Chapter 1.


Introduction

Sanger Sequencing

Cut DNA at each base: A, C, G, T

A fragment's migration distance is inversely proportional to its size

Run the gel and read off the sequence: TCGCGATAGCTGTGCTA


Introduction

Limitation

The size of a DNA fragment that can be read in this way is about 700 bp

Problem

Most genomes are enormous (e.g. about 3 × 10^9 base pairs in the case of human), so they cannot be sequenced directly. Sequencing at this scale is called large-scale sequencing


Introduction

Solution

• Break the DNA into small fragments randomly
• Sequence the readable fragments directly
• Assemble the fragments together to reconstruct the original DNA
• Scaffolder: handle the remaining gaps

Solving a one-dimensional jigsaw puzzle with millions of pieces (without the box)!


1. Break

2. Sequence

3. Assemble

4. Scaffolder

5. Conclusion


Break

DNA can be cut into pieces by mechanical means


Issues in Break

How?

• Coverage

Taken together, the fragments provide an 8X oversampling of the genome

• Random

Libraries with piece sizes of 2, 4, 6, 10, 12, and 40 kbp were produced

• Clone

Obtain several copies of the original genome and of the fragments
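
As a rough sanity check on the 8X figure, the link between coverage, read length, and read count can be sketched as below. The 1 Mbp genome length is a made-up illustration, the 700 bp read length is the Sanger limit quoted earlier, and the uncovered-fraction estimate assumes reads land uniformly at random (a Poisson idealization that the slides do not state).

```python
import math


def reads_needed(genome_length: int, read_length: int, coverage: float) -> int:
    """Smallest N with N * read_length / genome_length >= coverage."""
    return math.ceil(coverage * genome_length / read_length)


def expected_uncovered_fraction(coverage: float) -> float:
    """Assumption: uniformly random reads, so a base is missed by every
    read with probability of roughly e^(-coverage)."""
    return math.exp(-coverage)


# Hypothetical 1 Mbp genome, 700 bp reads, 8X oversampling
print(reads_needed(1_000_000, 700, 8))
print(expected_uncovered_fraction(8))
```

Even at 8X some bases are expected to stay uncovered under this idealization, which is one reason gaps between contigs appear later in the process.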



Sequence

[Figure: clone -> directed sequencing (gel) -> read, e.g. GTCCAGCCT]

Q: Can we read the fragment from both ends?



3. Assemble

A Simple Example

Fragments: ACCGT, CGTGC, TTAC

Overlap: the suffix of one fragment is the same as the prefix of another.

Assemble: align multiple fragments into a single continuous sequence based on fragment overlap.

--ACCGT
----CGTGC
TTAC
---------
TTACCGTGC


3. Assemble

[Figure: fragments sampled from the original (target) sequence are assembled into contig1 and contig2, which are separated by a gap]


A simple model

The simplest, naive approximation of DNA assembly corresponds to the Shortest Common Superstring (SCS) problem: given a set of strings s1, ..., sn, find the shortest string s such that each si appears as a substring of s.

--ACCGT
----CGTGC
TTAC
---------
TTACCGTGC
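
To make the SCS definition concrete, here is a minimal brute-force sketch that tries every ordering of the fragments and merges neighbours on their longest exact overlap. The function names are illustrative, the search is exponential in the number of fragments, and it is only meant for toy inputs such as the three fragments above.

```python
from itertools import permutations


def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of a that equals a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def shortest_superstring(frags):
    """Brute force: merge every ordering, keep the shortest result."""
    # Fragments contained in another fragment add nothing to the superstring.
    frags = [f for f in set(frags) if not any(f != g and f in g for g in frags)]

    def merge(order):
        s = order[0]
        for f in order[1:]:
            s += f[overlap(s, f):]
        return s

    return min((merge(p) for p in permutations(frags)), key=len)


print(shortest_superstring(["ACCGT", "CGTGC", "TTAC"]))
```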


(1) Overlap step

Create an overlap graph in which every node is a fragment and every edge indicates an overlap

(2) Layout step

Determine which overlaps will be used in the final assembly: find an optimal spanning forest on the overlap graph


Overlap step

Finding overlaps: compare each fragment with every other fragment to check whether the end of one overlaps the beginning of the other.

We say 'a overlaps b' when a suffix of a equals a prefix of b


Overlap step

Overlap graph: a directed, weighted graph G(V, E, w)
V: the set of fragments
E: the set of directed edges, each indicating an overlap between two fragments
An edge <a, b, w> means that a overlaps b with weight w, i.e. suffix(a, w) = prefix(b, w)


Example

W = AGTATTGGCAATC
Z = AATCGATG
U = ATGCAAACCT
X = CCTTTTGG
Y = TTGGCAATCA
S = AATCAGG

[Figure: overlap graph on the nodes w, z, x, u, s, y; edge weights shown: 4, 3, 3, 9, 4, 5]
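
The overlap graph for these six fragments can be rebuilt with a short sketch. It uses exact suffix/prefix matching and keeps only edges of weight 3 or more; that threshold is an assumption chosen here to drop trivial one-base overlaps, and the function names are illustrative.

```python
def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of a that equals a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def overlap_graph(frags: dict, min_len: int = 3) -> dict:
    """Map each directed edge (a, b) to its weight w, where
    suffix(a, w) == prefix(b, w) and w >= min_len (assumed cutoff)."""
    edges = {}
    for na, a in frags.items():
        for nb, b in frags.items():
            if na != nb:
                w = overlap(a, b)
                if w >= min_len:
                    edges[(na, nb)] = w
    return edges


frags = {
    "W": "AGTATTGGCAATC", "Z": "AATCGATG", "U": "ATGCAAACCT",
    "X": "CCTTTTGG", "Y": "TTGGCAATCA", "S": "AATCAGG",
}
for (a, b), w in sorted(overlap_graph(frags).items()):
    print(f"{a} -> {b}: {w}")
```

Comparing all pairs is quadratic in the number of fragments, which is fine for an example of this size.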


Layout step

Looking for the shortest common superstring is the same as looking for a path of maximum weight

Use a greedy algorithm that selects the edge with the best weight at every step.

Each selected edge is checked against the rule below. If it passes the check, the edge is accepted; otherwise it is omitted

Rule: for both nodes on the edge, indegree and outdegree must stay <= 1, and the graph must stay acyclic


In the end the fragments are merged together. In graph terms, the result is a forest of Hamiltonian paths (paths through the graph that contain each node at most once); each path corresponds to a contig
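
The greedy layout and its acceptance rule can be sketched as follows. This is one possible reading of the rule, not the presenter's code: edges are tried in order of decreasing weight, an edge is dropped if it would give a node a second successor or predecessor, and it is also dropped if it would close a cycle. The minimum overlap of 3 and all names are assumptions made for illustration.

```python
def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of a that equals a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_layout(frags: dict, min_len: int = 3):
    """Return a list of (path, contig) pairs, one per path in the forest."""
    edges = sorted(
        ((overlap(a, b), na, nb)
         for na, a in frags.items()
         for nb, b in frags.items() if na != nb),
        reverse=True,
    )
    succ, pred = {}, {}              # chosen successor / predecessor per node
    for w, a, b in edges:
        if w < min_len:
            break
        if a in succ or b in pred:   # outdegree(a) or indegree(b) would exceed 1
            continue
        end = b
        while end in succ:           # walk to the end of b's current path
            end = succ[end]
        if end == a:                 # accepting a -> b would close a cycle
            continue
        succ[a], pred[b] = b, a

    contigs = []
    for start in frags:
        if start in pred:
            continue                 # not the head of a path
        path, seq, node = [start], frags[start], start
        while node in succ:
            nxt = succ[node]
            seq += frags[nxt][overlap(frags[node], frags[nxt]):]
            path.append(nxt)
            node = nxt
        contigs.append((path, seq))
    return contigs


frags = {
    "W": "AGTATTGGCAATC", "Z": "AATCGATG", "U": "ATGCAAACCT",
    "X": "CCTTTTGG", "Y": "TTGGCAATCA", "S": "AATCAGG",
}
for path, contig in greedy_layout(frags):
    print("->".join(path), contig)
```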


Example

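[Figure: the overlap graph from the previous example (W, Z, U, X, Y, S; edge weights 4, 3, 3, 9, 4, 5), with the edges selected by the greedy algorithm]

W->Y->S
AGTATTGGCAATC
----TTGGCAATCA
---------AATCAGG
----------------
AGTATTGGCAATCAGG

Z->U->X
AATCGATG
-----ATGCAAACCT
------------CCTTTTGG
--------------------
AATCGATGCAAACCTTTTGG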


The greedy algorithm is neither optimal nor complete, and it can introduce gaps

[Figure: overlap graph on the fragments GCC, ATGC, and TGCAT with edge weights 2, 2, 3]

The simple model cannot correctly model the assembly problem because of complications in real problem instances
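
The three fragments on this slide are enough to see the greedy choice fail. The sketch below uses a merge-based variant of the greedy rule (repeatedly merge the pair with the largest exact overlap and stop when no overlap is left) and compares it with a brute-force search over all orderings. Both functions are illustrative stand-ins rather than the presenter's implementation.

```python
from itertools import permutations


def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of a that equals a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_contigs(frags):
    """Merge the best-overlapping pair until no overlap remains."""
    contigs = list(frags)
    while len(contigs) > 1:
        k, a, b = max((overlap(a, b), a, b)
                      for a, b in permutations(contigs, 2))
        if k == 0:
            break                    # leftover contigs are separated by gaps
        contigs.remove(a)
        contigs.remove(b)
        contigs.append(a + b[k:])
    return contigs


def shortest_superstring(frags):
    """Brute force over all orderings (toy inputs only)."""
    def merge(order):
        s = order[0]
        for f in order[1:]:
            s += f[overlap(s, f):]
        return s
    return min((merge(p) for p in permutations(frags)), key=len)


frags = ["GCC", "ATGC", "TGCAT"]
print(greedy_contigs(frags))        # greedy takes the weight-3 edge first
print(shortest_superstring(frags))  # the optimum uses the two weight-2 edges
```

Greedy commits to ATGC -> TGCAT, after which neither weight-2 edge is admissible, so GCC is left as a separate contig.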


Complications in Assembly

Sequencing errors. Most sequencers have around 1% error in the best case.

Unknown orientation. Could have sequenced either strand.

Bias in the reads. Not all regions of the sequence will be covered equally.

Repeats. There is much repetitive sequence, especially in human and higher plants


Sequencing Errors

Fragments contain 3 kinds of errors: insertion, deletion, substitution

Frequency: substitutions occur at a rate of 0.5-2%; insertions and deletions occur roughly 10 times less frequently

http://compbio.uchsc.edu/Hunter_lab/Hunter/bioi7711/lecture6.ppt


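Problems with the simple model - Errors

x: ACCGT
y: CGTGC
z: TTAC
u: TACCGT

[Figure: overlap graph on x, y, u, z with exact-overlap edge weights 3, 3, 5, 2, shown before and after sequencing errors are introduced]

--ACCGT
----CGTGC
TTAC
-TACCGT
---------
TTACCGTGC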


Problems with the simple model - Errors

Solution

Allow a bounded number of mismatches between overlapping fragments: approximate overlaps

Criterion: minimum overlap length (40 bp), error rate (less than 6% mismatches)

How?

Use semi-global alignment to find the best match between the suffix of one sequence and the prefix of another.


Semi-global alignment

Scoring system: 1 for a match, -1 for a mismatch, -2 for a gap

Initialize the first row and the first column to zero, so that gaps at either end of the alignment are not penalized

The recurrence is the same as in global alignment

Search the last column for the highest score and obtain the alignment by tracing back to the starting point (overlap of x over y). The overlap of y over x corresponds to the maximum in the last row

[Figure: alignment matrix for x and y with the first row and the first column filled with zeros]


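Example: x = ACCGT (columns), y = CGATGC (rows), filled with the scoring system 1 / -1 / -2

         A    C    C    G    T
    0    0    0    0    0    0
C   0   -1    1    1   -1   -1
G   0   -1   -1    0    2    0
A   0    1   -1   -2    0    1
T   0   -1    0   -2   -2    1
G   0   -1   -2   -1   -1   -1
C   0   -1    0   -1   -2   -2

Overlap x->y (best score in the last column: 1)

X: ACCG-T--
Y: --CGATGC

Overlap y->x (best score in the last row: 0)

Y: CGATGC---
X: ----ACCGT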
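
The matrix for this example can be recomputed with a short sketch of the semi-global recurrence. It follows the scoring stated on the previous slide (1, -1, -2), puts x along the columns and y along the rows, and only reports scores; the traceback that recovers the alignment itself is left out. Function names are illustrative.

```python
def semiglobal_matrix(x: str, y: str, match=1, mismatch=-1, gap=-2):
    """Fill the alignment matrix with row 0 and column 0 fixed at 0.
    Rows follow y, columns follow x."""
    m = [[0] * (len(x) + 1) for _ in range(len(y) + 1)]
    for i in range(1, len(y) + 1):
        for j in range(1, len(x) + 1):
            s = match if y[i - 1] == x[j - 1] else mismatch
            m[i][j] = max(m[i - 1][j - 1] + s,   # align y[i-1] with x[j-1]
                          m[i - 1][j] + gap,     # gap in x
                          m[i][j - 1] + gap)     # gap in y
    return m


def overlap_scores(x: str, y: str):
    """Best score for x over y (last column) and y over x (last row)."""
    m = semiglobal_matrix(x, y)
    x_over_y = max(row[-1] for row in m)   # all of x consumed
    y_over_x = max(m[-1])                  # all of y consumed
    return x_over_y, y_over_x


for row in semiglobal_matrix("ACCGT", "CGATGC"):
    print(row)
print(overlap_scores("ACCGT", "CGATGC"))
```

In an assembler these scores would then be filtered with a criterion such as a minimum overlap length and a maximum mismatch rate.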


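Problems with the simple model - Errors

Error-free fragments:

x: ACCGT
y: CGTGC
z: TTAC
u: TACCGT

--ACCGT
----CGTGC
TTAC
-TACCGT
---------
TTACCGTGC

Fragments with sequencing errors:

--ACCG-T
----CGATGC
TT-C
-TAGCGT
---------
TTACCGTGC

Criterion

1. Score > -3

2. Mismatches < 2

[Figure: overlap graph on x, y, u, z, weighted first by exact overlap length and then by semi-global alignment score]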


Problems with the simple model - Unknown orientation

Unknown orientation: fragments can be read from either of the two DNA strands.

Solution: try all possible combinations of orientations

Fragments as read:

CACGT
ACGT
ACTACG
GTACT

After reverse-complementing the last two fragments:

CACGT
ACGT
CGTAGT
AGTAC

CACGT
-ACGT
--CGTAGT
-----AGTAC
----------
CACGTAGTAC
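
A small sketch of what "try all combinations" means in practice: each fragment is considered either as read or as its reverse complement, which gives 2^n candidate orientation assignments for n fragments. The helper names are illustrative, and a real assembler would prune this search rather than enumerate it.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Sequence of the opposite strand, read in its own 5'->3' direction."""
    return seq.translate(COMPLEMENT)[::-1]


def orientations(frags):
    """Yield every way of choosing 'as read' or 'reverse complement' per fragment."""
    yield from product(*[(f, reverse_complement(f)) for f in frags])


reads = ["CACGT", "ACGT", "ACTACG", "GTACT"]
print(reverse_complement("ACTACG"), reverse_complement("GTACT"))
print(sum(1 for _ in orientations(reads)))  # number of candidate assignments
```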


Problems with the simple model - Repeat

Repeats can be characterized by length, copy number, and fidelity between copies

– Human T-cell receptor: 5 copies of a 4 kb gene with ~3% variation
– ALUs: ~300 bp with 5-15% variation, clustering to make up 50-60% of many human sequence regions
– Microsatellites: 3-6 bp units with thousands of repeats in centromeric and telomeric regions, 1-2% variation

gepard.bioinformatik.uni-saarland.de/html/BioinformatikIIIWS0304-Dateien/ V3-Assembly.ppt


Problems with the simple model - Repeat2

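[Figure: rearrangement. The original sequence contains three copies (X1, X2, X3) of a repeat X separating the unique regions A, B, C, and D. The fragments fit together in more than one way, and the assembler's consensus places the unique regions in a different order than the original.]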


Problems with the simple model - Repeat3

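[Figure: overcollapsing. The original sequence contains two copies (X1, X2) of a repeat X together with the unique regions A, B, and C. The assembly collapses both copies into a single X and reports two contigs separated by a gap.]

The shortest string is not always the best!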


Problems with the simple model - Lack of coverage

Not all regions of the sequence will be covered equally

[Figure: target DNA with an uncovered area]

Solution

• Do more sampling to increase the coverage level
• Use scaffolder technology



4. Scaffolder

Scaffold: given a set of non-overlapping contigs, order and orient them to reconstruct the original DNA

How? Is there any relationship that can be built between different contigs?

[Figure: the contigs A, X, B, C drawn twice, the second time with B and C marked as B' and C']


4. Scaffolder -Mate Pairs

Mate pairs:

• The two sequenced ends face towards each other
• The distance between the two fragments is known (insert size – fragment size)
• Mate pairs are extremely valuable during the scaffolding step

[Figure: a mate pair]


4. Scaffolder -Method

• The scaffolder retrieves the original mate pairs whose two reads lie in different contigs

• It uses the link information of these pairs (distance, orientation) to orient the contigs and estimate the gap sizes; this is called a "walk"
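
How a mate pair constrains a gap can be illustrated with simple arithmetic. The sketch below assumes the forward read lies in contig 1, the reverse read lies in contig 2, contig 1 precedes contig 2, and the insert size measures the outer distance between the two read ends. All parameter names and the coordinate convention are hypothetical, and orientation handling is left out.

```python
from statistics import mean


def gap_from_link(insert_size: int, contig1_len: int,
                  read1_start: int, read2_end: int) -> int:
    """Gap size implied by one mate pair spanning contig 1 -> contig 2.

    read1_start: 0-based start of the forward read within contig 1
    read2_end:   position just past the reverse read's last base in contig 2
    The insert covers the tail of contig 1, the gap, and the head of contig 2.
    """
    tail_in_contig1 = contig1_len - read1_start
    return insert_size - tail_in_contig1 - read2_end


def estimate_gap(links, contig1_len: int) -> float:
    """Average the estimate over several (insert_size, read1_start, read2_end) links."""
    return mean(gap_from_link(i, contig1_len, s, e) for i, s, e in links)


# Two made-up links from a 2 kbp library across the same pair of contigs
links = [(2000, 800, 300), (2000, 900, 410)]
print(estimate_gap(links, contig1_len=1000))
```

Averaging over several links is a simple way to smooth out the variation in real insert sizes; a negative estimate would suggest that the two contigs actually overlap.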


4 Scaffolder -Example

[Figure: Contig 1 and Contig 2 separated by a gap]


4. Scaffolder

Graph representation
Nodes: contigs
Directed edges: constraints on the relative placement of contigs (relative order and relative orientation)

http://jbpc.mbl.edu/jbpc/GenomesMedia/10_14POP.PPT



5. Conclusion

The whole genome sequencing process: Break -> Sequence -> Assemble -> Scaffolder

A simple model: use the overlap graph to construct the shortest common superstring. However, it cannot correctly model the assembly problem


Conclusion - Repeat

• Repeat detection
– pre-assembly: find fragments that belong to repeats, either statistically (most existing assemblers) or with a repeat database (RepeatMasker)
– during assembly: detect "tangles" indicative of repeats (Pevzner, Tang, Waterman 2001)
– post-assembly: find repetitive regions and potential mis-assemblies (Reputer, RepeatMasker)

• Repeat resolution
– find DNA fragments belonging to the repeat
– determine the correct tiling across the repeat