Biological Motivation for Fragment Assembly Rhys Price Jones Anne R. Haake

Jan 14, 2016


Anoush

Transcript
Page 1: Biological Motivation for Fragment Assembly

Biological Motivation for Fragment Assembly

Rhys Price Jones

Anne R. Haake

Page 2: Biological Motivation for Fragment Assembly

What is fragment assembly?

• The reconstruction of the contiguous chromosomal DNA sequence from short, experimentally-generated fragments.

• The sequence reassembly process must realign the short fragments, in the correct order, and then generate a consensus sequence.

Page 3: Biological Motivation for Fragment Assembly

A Simple Case

• Suppose target sequence is known to be about 10 bp

• Sequenced fragments are:

ACCGT
CGTGC
TTAC
TACCGT

Page 4: Biological Motivation for Fragment Assembly

--ACCGT--
----CGTGC
TTAC-----
-TACCGT--
_________
TTACCGTGC

Overlaps between fragments and the estimated length of the target sequence guide the assembly
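The overlap-guided assembly above can be sketched in code. The following is a minimal greedy heuristic (merge the pair with the largest suffix-prefix overlap, repeat), not necessarily the method the slides have in mind. It assumes error-free fragments that all read in the same direction, and the function names `overlap` and `greedy_assemble` are hypothetical.

```python
def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(fragments):
    """Greedy sketch: repeatedly merge the two fragments with the largest overlap.

    Assumes a non-empty list of error-free fragments in a single orientation.
    """
    # A fragment wholly contained in another adds no new sequence, so drop it.
    frags = [f for f in fragments
             if not any(f != g and f in g for g in fragments)]
    while len(frags) > 1:
        best_k, best_i, best_j = -1, 0, 1
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    k = overlap(a, b)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        # Join the best pair, writing the shared bases only once.
        merged = frags[best_i] + frags[best_j][best_k:]
        frags = [f for idx, f in enumerate(frags) if idx not in (best_i, best_j)]
        frags.append(merged)
    return frags[0]


fragments = ["ACCGT", "CGTGC", "TTAC", "TACCGT"]
consensus = greedy_assemble(fragments)
print(consensus, len(consensus))
```

On these four fragments the sketch recovers the 9-base consensus from the slide, which agrees with the estimated target length of about 10 bp. Greedy merging is only a heuristic and can be misled by repeats or sequencing errors.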

Page 5: Biological Motivation for Fragment Assembly

Why is fragment assembly important?

• We need to have reliable, complete genomic sequences of human and other model organisms

• base-pair sequence is the most basic piece of DNA information (gene structure and function described by sequence)

Page 6: Biological Motivation for Fragment Assembly

Why fragment the DNA in the first place?

• The human genome is large: ~3 × 10^9 base pairs long

• Sequencers can generate sequences only approx. 500-600 bp long at a time
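The two figures above imply a rough lower bound on the number of reads. The sketch below just divides total bases required by read length. It assumes reads of uniform length placed with no wasted overlap, which real shotgun data never achieves, so treat the result as a floor. The function name `reads_needed` and the `coverage` parameter are illustrative, not from the slides.

```python
import math


def reads_needed(genome_bp, read_bp, coverage=1):
    """Reads whose combined length equals coverage x genome size (idealized)."""
    return math.ceil(coverage * genome_bp / read_bp)


GENOME_BP = 3_000_000_000  # ~3 x 10^9 bp, the figure quoted on the slide

for read_bp in (500, 600):  # read lengths quoted on the slide
    print(read_bp, reads_needed(GENOME_BP, read_bp))
```

Even at single coverage this is millions of reads, which is why the fragments must be reassembled computationally.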

Page 7: Biological Motivation for Fragment Assembly

Solutions?

• Directed Sequencing: use custom primers to sequence sequentially from genomic DNA. This is a slow and expensive process.

• Shotgun Sequencing: DNA is extracted, fragmented (e.g. sheared), cloned, sequenced from both ends of clone, reassembled, and finished (gaps are closed)

Page 8: Biological Motivation for Fragment Assembly

Solutions?

• Cloning of fragments is accomplished using different vectors, chosen according to the size of the fragments (inserts into the vector).

• Large fragments: YACs (~1 Mb), BACs (100-200 kb)

• Intermediate: Cosmids, Lambda

• Small: Plasmids, M13

Page 9: Biological Motivation for Fragment Assembly

Human Genome Project vs Celera

• HGP: initially used “tiling set” of large clones that cover genome

• ends of the tiling set clones sequenced to allow ordering/mapping to the chromosome

• individual clones subjected to shotgun sequencing

• the sequences from the clones (shotgun fragments) then reassembled

Page 10: Biological Motivation for Fragment Assembly

Celera: Whole Genome Sequencing

• Celera (which won the race) took a whole-genome shotgun sequencing strategy

• cloned all of the fragmented human genome into 3 different sized clone libraries

• sequenced both ends of each clone

• reassembly

• advances in automated sequencing speed and accuracy were key to the success of the Celera approach

Page 11: Biological Motivation for Fragment Assembly

Another Reason Fragment Assembly is Important:

• Assembly and/or clustering sets of expressed sequence tags (ESTs)

• The problem is that ESTs are partial sequences and may span more than one exon (intron sequences, present in the genomic sequence, have been spliced out)

• Identity of the ESTs and assignment to genes is aided by finding overlap with other ESTs.

Page 12: Biological Motivation for Fragment Assembly

Biological issues present some challenges for algorithm development

• DNA sequencing data is imperfect

• Every base in the DNA should be covered several times (at least twice; once in each direction) to minimize the effects of random errors

• Base-calling errors can occur (base calling is the determination of base identity from the DNA sequencer trace); the quality of traces is not always high. Capillary tube sequencing has reduced the errors caused by lane bleed-through in slab gel sequencing
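To illustrate why covering each base several times helps, here is a minimal sketch of a per-column majority vote over an already-aligned layout. The layout reuses the simple case from earlier with one miscalled base injected by hand (an invented error, for illustration only). The function name `consensus` and the use of `-` for uncovered positions and `N` for zero coverage are assumptions of this sketch.

```python
from collections import Counter


def consensus(layout):
    """Majority base per column; '-' marks positions a read does not cover."""
    width = max(len(row) for row in layout)
    out = []
    for col in range(width):
        bases = [row[col] for row in layout
                 if col < len(row) and row[col] != "-"]
        if not bases:
            out.append("N")  # no read covers this column
        else:
            # Ties go to whichever base was seen first in the layout.
            out.append(Counter(bases).most_common(1)[0][0])
    return "".join(out)


layout = [
    "--ACCGT--",
    "----AGTGC",  # hypothetical miscall: first base should be C
    "TTAC-----",
    "-TACCGT--",
]
print(consensus(layout))
```

The miscalled column is covered by three reads, so the two correct calls outvote the error. Columns covered by a single read have no such protection.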

Page 13: Biological Motivation for Fragment Assembly

• Base-calling software (e.g. Phred) attempts to assign a base to each position in the sequence, along with a quality value

• The quality of the sequence tends to degrade at the ends.

• Vector sequence also contaminates the ends of reads.

• NHGRI standard: 99.99% accuracy before submission of sequence to GenBank.
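Phred-style quality values relate to error probability by Q = -10 log10(P), so 99.99% accuracy (an error rate of 1 in 10,000) corresponds to Q40. The sketch below shows that conversion, plus a deliberately naive trim of low-quality read ends. The trimming rule, the threshold `min_q=20`, and the example read and quality values are assumptions for illustration, not Phred's own procedure.

```python
import math


def error_prob_to_phred(p):
    """Phred-scaled quality for a per-base error probability p."""
    return -10 * math.log10(p)


def phred_to_error_prob(q):
    """Per-base error probability for a Phred-scaled quality q."""
    return 10 ** (-q / 10)


def trim_low_quality_ends(seq, quals, min_q=20):
    """Naive trim: drop bases from each end until one has quality >= min_q."""
    start = 0
    while start < len(seq) and quals[start] < min_q:
        start += 1
    end = len(seq)
    while end > start and quals[end - 1] < min_q:
        end -= 1
    return seq[start:end]


print(round(error_prob_to_phred(1e-4)))  # quality matching 99.99% accuracy

# Hypothetical read whose quality degrades at both ends.
read = "GGACCGTGCTA"
quals = [5, 8, 30, 35, 40, 40, 38, 30, 25, 9, 4]
print(trim_low_quality_ends(read, quals))
```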

Page 14: Biological Motivation for Fragment Assembly

A big issue:

• Human genome contains many repeats

• Highly repetitive: not transcribed, role unknown, present in millions of copies. Satellite (5-50 bp), Minisatellite (12-100 bp), Microsatellite (2-6 bp)

• Moderately repetitive: some are transcribed, present in up to hundreds of thousands of copies

– larger repeats with high copy number:

• telomeres, SINE (e.g. Alu), LINE, tRNAs, rRNAs
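A small sketch of why repeats are a problem for assembly: a read that falls entirely inside a repeat matches several places in the genome, so overlaps alone cannot say where it belongs. The toy genome below is invented for illustration (two copies of an AC microsatellite separated by unique sequence), and the helper `occurrences` is a hypothetical name.

```python
def occurrences(read, genome):
    """All start positions where read occurs in genome (overlaps allowed)."""
    hits = []
    start = genome.find(read)
    while start != -1:
        hits.append(start)
        start = genome.find(read, start + 1)
    return hits


# Invented toy genome: unique flank, AC repeat, unique spacer, AC repeat, unique flank.
TOY_GENOME = "GATTC" + "ACACACAC" + "TGGCA" + "ACACACAC" + "CCGTA"

print(occurrences("ACACA", TOY_GENOME))  # read from inside the repeat
print(occurrences("CTGGC", TOY_GENOME))  # read spanning unique sequence
```

The repeat-derived read has multiple candidate positions while the read spanning unique sequence has one. A read is only placed unambiguously when it reaches far enough into unique flanking sequence.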

Page 15: Biological Motivation for Fragment Assembly

Another issue:

• Orientation of the fragments is unknown

• Is the input fragment or its reverse complement a substring of the consensus?

Input fragment    Placement in layout
CACGT             CACGT--------
ACGT              -ACGT--------
ACTACG            --CGTAGT-----
GTACT             -----AGTAC---
ACTGA             --------ACTGA
CTGA              ---------CTGA
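The orientation question can be made concrete with a short sketch that tests each fragment, and its reverse complement, against a candidate consensus. The consensus string used here is read off the columns of the layout above rather than stated on the slide, so treat it as an assumption. The function names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Complement each base, then reverse the string."""
    return seq.translate(COMPLEMENT)[::-1]


def orientations(fragment, consensus):
    """Orientations in which the fragment occurs as a substring of consensus."""
    found = []
    if fragment in consensus:
        found.append("forward")
    if reverse_complement(fragment) in consensus:
        found.append("reverse")
    return found


# Assumed consensus, derived by reading the layout columns left to right.
CONSENSUS = "CACGTAGTACTGA"

for frag in ["CACGT", "ACGT", "ACTACG", "GTACT", "ACTGA", "CTGA"]:
    print(frag, reverse_complement(frag), orientations(frag, CONSENSUS))
```

Some short fragments fit in both orientations (for example, ACGT is its own reverse complement), so an exact-substring test alone does not always settle the orientation. An assembler has to choose the orientations jointly, using the overlaps between fragments.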

Page 16: Biological Motivation for Fragment Assembly

Yet another issue:

• Chimeras (mixed or heterogeneous DNA) may be introduced during the cloning process

• DNA from non-contiguous regions of the chromosome may be introduced as well as host DNA (for example, when growing plasmids in E. coli, the E. coli chromosomal DNA often contaminates clones)

Page 17: Biological Motivation for Fragment Assembly

General Considerations:

• The algorithms used to generate the consensus sequence must take the biological issues into account.

• Need to consider prior biological information when analyzing a program’s assembly output.

– e.g. known chromosomal sites or DNA fingerprinting data may be inconsistent with the program’s assembly output.
