De-novo Assembly Day 4

Jan 23, 2016

keitha

Transcript
Page 1: De-novo Assembly

De-novo Assembly

Day 4

Page 2: De-novo Assembly

The Challenge

• Given many (millions or billions) of reads, produce a linear (or perhaps circular) genome

• Issues:
  – Coverage
  – Errors in reads
  – Reads vary from very short (35bp) to quite long (800bp), and are double-stranded
  – Non-uniqueness of solution
  – Running time and memory

Page 3: De-novo Assembly

In an ideal world…

• Reads have no error

• Reads are long enough and each appears once

• Each read given in the same orientation (all 5’ to 3’, for example)

• Contiguous reads have enough overlap so that they can easily be assembled
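
Real reads come from either strand, so the "same orientation" assumption is usually restored in software by treating a sequence and its reverse complement as equivalent. A minimal Python sketch of that idea follows; the function names are illustrative and not taken from any particular assembler.

```python
# Illustrative helpers; the names are hypothetical, not from a specific tool.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(seq):
    """Return the opposite strand, read 5' to 3'."""
    return seq.translate(_COMPLEMENT)[::-1]

def canonical(seq):
    """Pick one representative for a sequence and its reverse complement."""
    return min(seq, reverse_complement(seq))

print(reverse_complement("ACATAG"))  # CTATGT
print(canonical("CTATGT"))           # ACATAG
```

Using the lexicographically smaller of the two strands is one common convention; any consistent choice would serve the same purpose.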

Page 4: De-novo Assembly

Shotgun DNA Sequencing

[Figure: DNA target sample, shear & size, end reads / mate pairs; fragment sizes shown: 550bp and 10,000bp]

• Not all sequencing technologies produce mate-pairs

• Different error models

• Different read lengths

Page 5: De-novo Assembly

Terminology

• Reads are what you start with (35bp-800bp)

• Fragmented assemblies produce contigs

• Contigs can be put together into scaffolds

Page 6: De-novo Assembly

Reference-based vs de novo assembly

• Comparative assembly means you have a “reference” genome
  – Same OR related species
  – Can get weird in bacteria and viruses due to recombination

• De novo assembly means you do everything from scratch

Page 7: De-novo Assembly

Comparative Assembly

Much easier than de novo!

Basic idea:

• Take the reads and map them onto the reference (allowing for small mismatches)

• Collect all overlapping reads, produce a multiple sequence alignment, and produce a consensus sequence
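
The two steps above can be sketched in a few lines of Python. This is a toy under strong assumptions: reads are placed by brute force at the ungapped position with the fewest mismatches, and the "multiple sequence alignment" is reduced to a per-column majority vote. Real read mappers use indexes and allow gaps; the function names and the `max_mismatches` parameter are made up for illustration.

```python
from collections import Counter

def best_position(read, reference, max_mismatches=2):
    """Return the start of the placement with fewest mismatches, or None."""
    best, best_mm = None, max_mismatches + 1
    for start in range(len(reference) - len(read) + 1):
        window = reference[start:start + len(read)]
        mm = sum(a != b for a, b in zip(read, window))
        if mm < best_mm:
            best, best_mm = start, mm
    return best

def consensus(reads, reference, max_mismatches=2):
    """Pile reads up on the reference and take a majority vote per column."""
    columns = [Counter() for _ in reference]
    for read in reads:
        start = best_position(read, reference, max_mismatches)
        if start is None:
            continue  # read does not map closely enough; skip it
        for offset, base in enumerate(read):
            columns[start + offset][base] += 1
    # Positions that no read covers are reported as 'N'.
    return "".join(c.most_common(1)[0][0] if c else "N" for c in columns)

reference = "ACATAGGATTCAC"
# Reads from a sample that differs from the reference at one base (A -> C);
# the last read additionally carries a sequencing error in its final base.
reads = ["ACATCGG", "CATCGGA", "TCGGATT", "GATTCAC", "ACATCGC"]
print(consensus(reads, reference))  # ACATCGGATTCAC
```

The true difference is supported by every read that covers it, while the isolated error is outvoted, which is why the consensus can be accurate even when individual reads are not.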

Page 8: De-novo Assembly

Comparative Assembly

• Fast

• Short reads can map to several places (especially if they have errors)

• Needs close reference genome

• Repeats are problematic

• Can be highly accurate even when reads have errors

Page 9: De-novo Assembly

De Novo assembly

• Much easier to do with long reads

• Need very good coverage

• Generally produces fragmented assemblies

• Necessary when you don’t have a closely related (and correctly assembled) reference genome

Page 10: De-novo Assembly

Conceptual steps in de novo assembly

1. Find reads that overlap by a specified number of bases (the k-mer size)

2. Merge overlapping, “good” reads into longer contigs

3. Link contigs to form scaffolds using paired-end information

Diagrams from Serafim Batzoglou, Stanford
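
Steps 1 and 2 can be illustrated with a small greedy sketch in Python. It assumes error-free reads in a single orientation and leaves out step 3 (scaffolding), which needs paired-end data. The helper names and the `min_length` parameter are hypothetical.

```python
def overlap(a, b, min_length):
    """Length of the longest suffix of a that is a prefix of b, or 0 if shorter than min_length."""
    for length in range(min(len(a), len(b)), min_length - 1, -1):
        if a.endswith(b[:length]):
            return length
    return 0

def greedy_merge(reads, min_length):
    """Repeatedly merge the pair of sequences with the largest overlap."""
    contigs = list(reads)
    while True:
        best = (0, None, None)
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    olen = overlap(a, b, min_length)
                    if olen > best[0]:
                        best = (olen, i, j)
        olen, i, j = best
        if olen == 0:
            return contigs  # nothing left that overlaps enough
        merged = contigs[i] + contigs[j][olen:]
        contigs = [c for idx, c in enumerate(contigs) if idx not in (i, j)]
        contigs.append(merged)

reads = ["ACATAGG", "TAGGATT", "GATTCAC"]
print(greedy_merge(reads, min_length=3))  # ['ACATAGGATTCAC']
```

Note that every round compares all pairs of sequences, so the cost grows quickly with the number of reads.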

Page 11: De-novo Assembly

De Novo Assembly paradigms

• Overlap-layout-consensus methods

• k-mer graph (especially useful for assembly from short reads)
  – aka de Bruijn graph

Page 12: De-novo Assembly

de Bruijn graph has k-mers as nodes connected by reads; assembly involves finding an Eulerian path through the graph

Diagram from Michael Schatz, Cold Spring Harbor

Page 13: De-novo Assembly

de Bruijn graph

• Vertices are k-mers that appear in some read, and edges are defined by an overlap of k-1 nucleotides

• Small values of k produce small graphs

• Does not require all-pairs overlap calculation!

• But: loss of information about reads can lead to “chimeric” contigs, and incorrect assemblies

• Also produces fragmented assemblies (even shorter contigs)
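
A minimal Python sketch of the construction described above: vertices are k-mers, and an edge joins two k-mers that sit next to each other in a read (and therefore share k-1 bases). Drawing edges only between k-mers that are adjacent in some read is an assumption of this sketch, and the function name is illustrative.

```python
def de_bruijn(reads, k):
    """Map each k-mer to the set of k-mers that directly follow it in some read."""
    graph = {}
    for read in reads:
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        for kmer in kmers:
            graph.setdefault(kmer, set())  # every k-mer is a vertex
        for left, right in zip(kmers, kmers[1:]):
            graph[left].add(right)         # left and right overlap by k-1 bases
    return graph

reads = ["ACATAGG", "TAGGATT", "GATTCAC"]
graph = de_bruijn(reads, k=4)
for kmer in sorted(graph):
    print(kmer, "->", sorted(graph[kmer]))
```

Each read is scanned once and never compared against another read, which is the sense in which no all-pairs overlap calculation is needed. Reads that share a k-mer, such as `TAGG` here, are joined implicitly because they land on the same vertex.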

Page 14: De-novo Assembly

de Bruijn Graph use

• Create the de Bruijn graph for the following string, using k=5
  – ACATAGGATTCAC

• Find the Eulerian path

• Is the Eulerian path unique?

• Reconstruct the sequence from this path
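
One way to check an answer to this exercise is the Python sketch below. It builds the edges between consecutive 5-mers, walks them with Hierholzer's algorithm (a standard method for Eulerian paths, chosen here for illustration), and spells the sequence back out. The function names are hypothetical, and `eulerian_path` assumes that an Eulerian path exists.

```python
from collections import defaultdict

def kmer_edges(sequence, k):
    """Pairs of consecutive k-mers; each pair overlaps by k-1 bases."""
    kmers = [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
    return list(zip(kmers, kmers[1:]))

def eulerian_path(edges):
    """Hierholzer's algorithm; assumes an Eulerian path exists."""
    out_edges = defaultdict(list)
    balance = defaultdict(int)  # out-degree minus in-degree
    for u, v in edges:
        out_edges[u].append(v)
        balance[u] += 1
        balance[v] -= 1
    # Start at the vertex with one extra outgoing edge, if there is one.
    start = next((v for v, b in balance.items() if b == 1), edges[0][0])
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if out_edges[node]:
            stack.append(out_edges[node].pop())
        else:
            path.append(stack.pop())
    return path[::-1]

def spell(path):
    """First k-mer, then the last base of every following k-mer."""
    return path[0] + "".join(kmer[-1] for kmer in path[1:])

edges = kmer_edges("ACATAGGATTCAC", k=5)
path = eulerian_path(edges)
print(len(path))    # 9
print(spell(path))  # ACATAGGATTCAC
```

For this string all nine 5-mers are distinct and every vertex has at most one outgoing edge, so the walk has no choices to make.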

Page 15: De-novo Assembly

But…

• Because of
  – Errors in reads
  – Repeats
  – Insufficient coverage
  de Bruijn graphs generally don’t have Eulerian paths/circuits

• This means the first step doesn’t completely assemble the genome
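
A small Python sketch of how a single read error can break the Eulerian property. A directed graph has an Eulerian path only if at most one vertex has one more outgoing than incoming edge, at most one has one more incoming than outgoing, and all others are balanced. The sketch counts unbalanced vertices over distinct edges; the reads and the function name are made up for illustration.

```python
from collections import Counter

def degree_imbalance(reads, k):
    """Vertices whose out-degree differs from their in-degree (distinct edges only)."""
    edges = set()
    for read in reads:
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        edges.update(zip(kmers, kmers[1:]))
    balance = Counter()
    for u, v in edges:
        balance[u] += 1
        balance[v] -= 1
    return {kmer: b for kmer, b in balance.items() if b != 0}

clean = ["ACATAGG", "TAGGATT", "GATTCAC"]
noisy = clean + ["TAGGCTT"]  # copy of the second read with one wrong base

print(len(degree_imbalance(clean, k=4)))  # 2
print(len(degree_imbalance(noisy, k=4)))  # 4
```

With the clean reads only the two ends of the sequence are unbalanced, so an Eulerian path exists. The erroneous read adds a dead-end branch at `TAGG`, leaving four unbalanced vertices and no Eulerian path.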

Page 16: De-novo Assembly

Velvet and SOAPdenovo are two leading assemblers

• Velvet is from EMBL-EBI
  – http://www.ebi.ac.uk/~zerbino/velvet

• SOAPdenovo is from BGI
  – http://soap.genomics.org.cn/soapdenovo.html

• k-mer size is adjustable parameter
  – Typically it is adjusted to maximize N50 length of scaffolds or contigs
  – N50 length is central measure of distribution weighted by lengths

Page 17: De-novo Assembly

NOTE: QC step slightly different

1. You still do standard checks with something like FASTQC

2. Read trimming is different
  – Algorithms/tools can deal with different read lengths
  – Finding overlaps gives longer contigs
  – So we never want to sacrifice good reads

3. Solution:
  1. Remove sequencing adapters
  2. Trim individual reads as needed

• http://www.usadellab.org/cms/index.php?page=trimmomatic
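
To make the two-step solution concrete, here is a toy Python sketch: first cut off an adapter found at the 3' end, then drop low-quality trailing bases from that read only. This is not how Trimmomatic is implemented. The function names, the thresholds, the Phred+33 encoding, and the example read are all assumptions for illustration.

```python
def remove_adapter(seq, qual, adapter, min_match=5):
    """Cut the read where the adapter, or a prefix of it at the 3' end, begins."""
    pos = seq.find(adapter)
    if pos == -1:
        # Look for a partial adapter hanging off the end of the read.
        for length in range(min(len(adapter), len(seq)) - 1, min_match - 1, -1):
            if seq.endswith(adapter[:length]):
                pos = len(seq) - length
                break
    if pos == -1:
        return seq, qual
    return seq[:pos], qual[:pos]

def trim_3prime(seq, qual, min_quality=20, offset=33):
    """Drop trailing bases whose Phred quality is below min_quality."""
    end = len(seq)
    while end > 0 and ord(qual[end - 1]) - offset < min_quality:
        end -= 1
    return seq[:end], qual[:end]

adapter = "AGATCGGAAGAGC"  # adapter prefix used for this demo (assumption)
seq = "ACATAGGATTCACAGATCGGA"
qual = "IIIIIIIIIII##IIIIIIII"  # 'I' = Q40, '#' = Q2 in Phred+33

seq, qual = remove_adapter(seq, qual, adapter)
seq, qual = trim_3prime(seq, qual)
print(seq)  # ACATAGGATTC
```

Reads that contain no adapter and no low-quality tail pass through unchanged, which matches the goal of never sacrificing good reads.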

Page 18: De-novo Assembly

Three useful measures for optimization of k-mer length

• Mean length: the usual average of the lengths

• Median length: the length for which half the sequences are shorter & half are longer

• N50 length: the length that splits the total bases in half, after the lengths are ordered

• Example values for distribution of contig lengths:– Mean length: 627– Median length: 200– N50 length: 1,718

• We’ll look at using N50 in the practical
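
The three measures can be computed as in the Python sketch below. The contig lengths are made up for illustration and are not the ones behind the example values above. N50 is taken here as the length of the contig at which the running total, counted from the longest contig down, first reaches half of all bases.

```python
from statistics import mean, median

def n50(lengths):
    """Length at which the cumulative sum, longest first, reaches half the total."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("no contig lengths given")

lengths = [100, 150, 200, 200, 250, 900, 1700]  # made-up contig lengths
print("mean:  ", mean(lengths))
print("median:", median(lengths))
print("N50:   ", n50(lengths))
```

Because N50 weights each contig by its length, a few long contigs pull it well above the mean and median, which is the same pattern as in the example values on this slide.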

Page 19: De-novo Assembly

Practical prep

• Download and install
  – TRIMMOMATIC: http://www.usadellab.org/cms/index.php?page=trimmomatic
  – VELVET: http://www.ebi.ac.uk/~zerbino/velvet/
    • Use make 'MAXKMERLENGTH=70' when compiling

• Grab practical dataset from course page