Biological Motivation for Fragment Assembly Rhys Price Jones Anne R. Haake
Jan 16, 2016
What is fragment assembly?
• The reconstruction of the contiguous chromosomal DNA sequence from short, experimentally generated fragments, i.e. sequence reassembly
• The sequence reassembly process must realign the short fragments, in the correct order, and then generate a consensus sequence.
A Simple Case
• Suppose target sequence is known to be about 10 bp
• Sequenced fragments are:
ACCGT   CGTGC   TTAC   TACCGT

--ACCGT--
----CGTGC
TTAC-----
-TACCGT--
_________
TTACCGTGC
Overlaps between fragments and the estimated length of the target sequence guide the assembly
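To illustrate how overlaps can guide the assembly, here is a minimal sketch of a greedy merge applied to the four fragments above. Greedy merging is only one simple heuristic, chosen here for illustration; the slides do not prescribe it, and the function names `overlap` and `greedy_assemble` are hypothetical. The sketch assumes error-free fragments that all come from the same strand.

```python
from itertools import permutations

def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_assemble(fragments):
    """Repeatedly merge the pair with the largest suffix-prefix overlap."""
    # Drop fragments wholly contained in another fragment (e.g. ACCGT in TACCGT).
    frags = [f for f in fragments
             if not any(f != g and f in g for g in fragments)]
    while len(frags) > 1:
        # Pick the ordered pair (a, b) with the longest overlap; ties go to the first found.
        k, a, b = max(((overlap(a, b), a, b) for a, b in permutations(frags, 2)),
                      key=lambda t: t[0])
        frags.remove(a)
        frags.remove(b)
        frags.append(a + b[k:])  # if k == 0 this is plain concatenation
    return frags[0]

print(greedy_assemble(["ACCGT", "CGTGC", "TTAC", "TACCGT"]))  # TTACCGTGC
```

The result is 9 bp long, consistent with the estimated target length of about 10 bp. On real data a greedy strategy can be misled by repeats and sequencing errors, so it should be read as a toy model only.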
Why is fragment assembly important?
• We need reliable, complete genomic sequences for humans and model organisms
• Base-pair sequence is the most basic piece of DNA information (gene structure and function described by sequence)
Why fragment the DNA in the first place?
• Human genome is large: ~3 × 10^9 base pairs long
• Sequencers can generate sequences only approx. 500-600 bp long at a time
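A back-of-the-envelope calculation shows the scale these two numbers imply. The sketch below assumes a read length of 550 bp (the midpoint of the range above) and treats reads as if they tiled the genome perfectly; the name `reads_needed` and the `coverage` parameter are introduced here for illustration only.

```python
import math

GENOME_BP = 3e9   # approximate human genome size, from the slide
READ_BP = 550     # assumed: midpoint of the 500-600 bp read length

def reads_needed(coverage: float,
                 genome_bp: float = GENOME_BP,
                 read_bp: float = READ_BP) -> int:
    """Number of reads whose combined length equals coverage * genome size."""
    return math.ceil(coverage * genome_bp / read_bp)

print(reads_needed(1))    # reads for a single pass over the genome
print(reads_needed(10))   # reads if every base is covered ten times on average
```

Even a single idealized pass needs roughly 5.5 million reads, which is why the genome has to be broken into fragments and reassembled computationally.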
Solutions?
• Directed Sequencing: use custom primers to sequentially sequence from genomic DNA. This is a slow and expensive process.
• Shotgun Sequencing: DNA is extracted, fragmented (e.g. sheared), cloned, sequenced from both ends of each clone, reassembled, and finished (gaps are closed)
Solutions?
• Cloning of fragments is accomplished using different vectors, chosen according to the size of the fragments (inserts into the vector).
• Large fragments: YACs (~1 Mb), BACs (100-200 kb)
• Intermediate: Cosmids, Lambda
• Small: Plasmids, M13
Genome Sequencing Strategies
• Human Genome Project: map-based strategy
  – initially used a “tiling set” of large clones that cover the genome
  – ends of the tiling set clones sequenced to allow ordering/mapping to the chromosome
  – individual clones subjected to shotgun sequencing
  – the sequences from the clones (shotgun fragments) then reassembled
• Celera: whole genome sequence strategy
  – shotgun sequencing
Celera: Whole Genome Sequencing
• Celera (which raced the public project to a draft human sequence) took a whole genome sequence strategy
• cloned all of the fragmented human genome into 3 different sized clone libraries
• sequenced both ends of each clone
• reassembly
• advances in automated sequencing speed and accuracy were key to the success of the Celera approach
Another Reason Fragment Assembly is Important:
• Assembling and/or clustering sets of expressed sequence tags (ESTs)
• The problem is that ESTs are partial sequences and may span more than one exon (intron sequences, present in the genomic sequence, have been spliced out)
• Identity of the ESTs and assignment to genes is aided by finding overlap with other ESTs.
Experimental issues present some challenges for algorithm development
• DNA sequencing data is imperfect
• Every base in the DNA should be covered several times (at least twice; once in each direction) to minimize effects of random errors
• Base calling (determining the base identity from the DNA sequencer trace) errors can occur; the quality of traces is not always high. Capillary tube sequencing has reduced the errors caused by lane bleed-through in slab gel sequencing
• Basecalling software (e.g. Phred) attempts to assign a base to each position in the sequence, along with a quality value
• The quality of the sequence tends to degrade at the ends.
• Vector sequence also contaminates the ends.
• NHGRI standard: 99.99% accuracy before submission of sequence to GenBank.
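The quality values reported by Phred relate to error probability through the standard formula Q = -10 log10(P), so the 99.99% accuracy target corresponds to Q = 40. The sketch below shows that conversion, plus a deliberately simple way to clip low-quality read ends. The trimming rule and its threshold `min_q=20` are assumptions made for illustration; they are not the procedure used by Phred, Phrap, or any other specific program.

```python
import math

def error_prob_to_phred(p: float) -> float:
    """Phred quality score for a per-base error probability p."""
    return -10 * math.log10(p)

def phred_to_error_prob(q: float) -> float:
    """Per-base error probability for a Phred quality score q."""
    return 10 ** (-q / 10)

def trim_low_quality_ends(bases: str, quals: list, min_q: int = 20) -> str:
    """Clip bases from both ends until a base with quality >= min_q is reached."""
    start, end = 0, len(quals)
    while start < end and quals[start] < min_q:
        start += 1
    while end > start and quals[end - 1] < min_q:
        end -= 1
    return bases[start:end]

print(round(error_prob_to_phred(1 - 0.9999)))                       # 40
print(trim_low_quality_ends("ACGTAC", [5, 10, 30, 40, 25, 8]))      # GTA
```

Vector contamination at the ends would need a separate step, such as comparing read ends against the known vector sequence, which is not shown here.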
SeqMan: contig assembler and trace viewer. Can align against a reference sequence
http://www.dnastar.com/images2/r13a_lg.gif
A big issue:
• Human genome contains repetitive sequences
  – Highly repetitive: not transcribed, role unknown, present in millions of copies, e.g. Satellite (5-50 bp)
  – Moderately repetitive: some are transcribed, present in up to hundreds of thousands of copies
    • Tandem repeats, e.g. Minisatellite (12-100 bp), Microsatellite (2-6 bp), telomeres
    • Interspersed repeats: larger repeats with high copy number, e.g. SINE (Alu), LINE, tRNAs, rRNAs
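One way to see why repeats are a problem for assembly: a fragment that falls entirely inside a repeat matches several places in the target, so overlaps alone cannot say where it belongs. The sketch below counts substrings of a given length that occur more than once in a toy sequence containing a 2 bp microsatellite-like repeat. The sequence and the function name `repeated_kmers` are made up for illustration.

```python
from collections import Counter

def repeated_kmers(seq: str, k: int) -> dict:
    """Return every length-k substring that occurs more than once, with its count."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    return {kmer: n for kmer, n in counts.items() if n > 1}

# Toy target: four copies of the unit "AC" followed by unique sequence.
print(repeated_kmers("ACACACACGT", 4))  # {'ACAC': 3, 'CACA': 2}
```

A 4 bp fragment reading ACAC could have come from any of three positions in this toy target, which is the ambiguity that repeat-rich genomes create at much larger scale.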
Another issue:
• Orientation of the fragments is unknown
• Is the input fragment or its reverse complement a substring of the consensus?

Input fragment   Aligned (oriented) fragment
CACGT            CACGT--------
ACGT             -ACGT--------
ACTACG           --CGTAGT-----
GTACT            -----AGTAC---
ACTGA            --------ACTGA
CTGA             ---------CTGA
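The orientation question can be sketched in code: for each fragment, test both the fragment and its reverse complement against a candidate consensus. The consensus `CACGTAGTACTGA` used below is read off column by column from the aligned fragments in the example above; it is not stated on the slide. The function names `reverse_complement` and `placements` are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq: str) -> str:
    """Complement each base, then reverse the string."""
    return seq.translate(_COMPLEMENT)[::-1]

def placements(fragment: str, consensus: str) -> list:
    """All (orientation, 0-based offset) pairs where the fragment fits exactly."""
    hits = []
    for label, seq in (("forward", fragment),
                       ("reverse", reverse_complement(fragment))):
        start = consensus.find(seq)
        while start != -1:
            hits.append((label, start))
            start = consensus.find(seq, start + 1)
    return hits

consensus = "CACGTAGTACTGA"  # assumed: derived from the alignment above
for frag in ["CACGT", "ACGT", "ACTACG", "GTACT", "ACTGA", "CTGA"]:
    print(frag, placements(frag, consensus))
```

Some fragments fit in more than one way: GTACT matches both as given and as its reverse complement AGTAC, and ACGT is its own reverse complement. An exact-match test like this one therefore cannot always settle orientation on its own, and a real assembler must also tolerate sequencing errors.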
Yet another issue:
• Chimeras (mixed or heterogeneous DNA) may be introduced during the cloning process
• DNA from non-contiguous regions of the chromosome may be introduced as well as host DNA (for example, when growing plasmids in E. coli, the E. coli chromosomal DNA often contaminates clones)
General Considerations:
• The algorithms used to generate the consensus sequence must take the biological issues into account (although some don’t!).
• Need to consider prior biological information when analyzing a program’s assembly output.
  – e.g. known chromosomal sites or DNA fingerprinting data may be inconsistent with the program’s assembly output.
Fragment Assembly Programs
• GeneSkipper
• Phred/Phrap/Consed