Pamela Ferretti, Laboratory of Computational Metagenomics, Centre for Integrative Biology, University of Trento, Italy: Microbial Genome Assembly
Post on 15-Jan-2016
1
Pamela Ferretti
Laboratory of Computational Metagenomics
Centre for Integrative Biology
University of Trento
Italy
Microbial Genome Assembly
2
Outline-summary
1. QUICK INTRODUCTION
2. GENOME ASSEMBLY
3. ASSEMBLY STRATEGIES
4. CASE STUDY
3
DNA packaging
4
DNA packaging
5
Outline-summary
1. QUICK INTRODUCTION
2. GENOME ASSEMBLY
3. ASSEMBLY STRATEGIES
4. CASE STUDY
6
Next Generation Sequencing
[Figure: example sequencing reads (TCTTATTGTGACC, TAGGCTAGCTTAG, GCAATGCAGTAAC, TCCAGCTAGGTTC) shown with a longer sequence, ACGTAGGCTAGCGTTAGCGA ........ CTGCAT C]
7
Genome Assembly
1. GENOME SEQUENCING
2. PRELIMINARY ANALYSIS
3. ASSEMBLY
4. ADVANCED BIOINFORMATIC ANALYSIS
OVERLAPPING SEQUENCE ALIGNMENT
On the feasibility of sequence assembly
Sequencing the human genome with shotgun sequencing + assembly is the only feasible strategy
Weber, James L., and Eugene W. Myers. "Human whole-genome shotgun sequencing." Genome Research 7.5 (1997): 401-409.
Computational assembly of shotgun sequencing data is simply unfeasible, and a bad idea anyway
Green, Philip. "Against a whole-genome shotgun." Genome Research 7.5 (1997): 410-417.
They were both right! (…well, Weber and Myers were a bit more right from the practical viewpoint…)
9
Outline-summary
1. QUICK INTRODUCTION
2. GENOME ASSEMBLY
3. ASSEMBLY STRATEGIES
4. CASE STUDY
10
Genome assembly strategies
Greedy approach → SSAKE
De Bruijn graph (DBG) → Velvet, SOAPdenovo
Overlap Layout Consensus (OLC) → MIRA
Mixed approaches → MaSuRCA
11
Genome assembly strategies
DE BRUIJN GRAPH APPROACH (DBG)
Velvet, SOAPdenovo2
Nodes = overlapping sequences of reads of uniform length
Edges = k-mers (unique subsequences within reads)
EULERIAN PATH
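The DBG idea on this slide can be sketched in a few lines of Python. This is a toy illustration, not how Velvet or SOAPdenovo2 are implemented: it assumes error-free reads, no repeats, a single strand, and the common convention that nodes are (k-1)-mers and each distinct k-mer is one edge. The function names and the example reads are made up for the illustration.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Build a graph whose nodes are (k-1)-mers and whose edges are k-mers."""
    kmers = set()  # each distinct k-mer is stored only once
    for read in reads:
        for i in range(len(read) - k + 1):
            kmers.add(read[i:i + k])
    graph = defaultdict(list)
    for kmer in sorted(kmers):  # sorted only to keep the result deterministic
        graph[kmer[:-1]].append(kmer[1:])  # edge: prefix -> suffix
    return graph


def eulerian_path(graph):
    """Hierholzer's algorithm; assumes the graph actually has an Eulerian path."""
    in_degree = defaultdict(int)
    for targets in graph.values():
        for target in targets:
            in_degree[target] += 1
    # Start at the node with one more outgoing than incoming edge, if any.
    start = next(
        (n for n in sorted(graph) if len(graph[n]) - in_degree[n] == 1),
        min(graph),
    )
    remaining = {node: list(targets) for node, targets in graph.items()}
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if remaining.get(node):
            stack.append(remaining[node].pop())  # follow an unused edge
        else:
            path.append(stack.pop())  # dead end: node is final in the path
    return path[::-1]


def spell(path):
    """Turn a path of overlapping (k-1)-mers back into a sequence."""
    return path[0] + "".join(node[-1] for node in path[1:])


# Hypothetical error-free reads sampled from ACGTAGGCTA.
reads = ["ACGTAG", "GTAGGC", "AGGCTA"]
contig = spell(eulerian_path(de_bruijn_graph(reads, k=4)))
print(contig)
```

Note that the reads are never compared with each other pairwise: the graph is built from k-mers alone, and the assembly is a path that uses every edge once.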
12
Genome assembly strategies
OVERLAP LAYOUT CONSENSUS (OLC)
MIRA
Nodes = reads
Edges = overlap between reads
1. OVERLAP
2. LAYOUT
3. CONSENSUS
HAMILTONIAN PATH
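The three OLC steps can be sketched as follows. This is a toy illustration under strong assumptions, not MIRA's algorithm: reads are error-free, overlaps are exact suffix-prefix matches, the layout is found by brute force over all read orderings (which is why the Hamiltonian path formulation does not scale), and the "consensus" is a plain merge rather than a multiple alignment with base voting. All names and reads are made up.

```python
from itertools import permutations


def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that is a prefix of b (0 if too short)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def layout(reads, min_len=3):
    """Brute-force search for a path that visits every read exactly once."""
    best_order, best_overlaps, best_score = None, None, -1
    for order in permutations(reads):
        overlaps = [overlap(a, b, min_len) for a, b in zip(order, order[1:])]
        # Every consecutive pair must overlap; prefer the largest total overlap.
        if all(overlaps) and sum(overlaps) > best_score:
            best_order, best_overlaps, best_score = order, overlaps, sum(overlaps)
    return best_order, best_overlaps


def consensus(order, overlaps):
    """Merge the ordered reads, skipping the part already covered by the overlap."""
    return order[0] + "".join(read[n:] for read, n in zip(order[1:], overlaps))


# Hypothetical error-free reads sampled from ACGTAGGCTAGC.
reads = ["GGCTAGC", "ACGTAGG", "TAGGCTA"]
order, overlaps = layout(reads)
contig = consensus(order, overlaps)
print(order, overlaps, contig)
```

The overlap step compares every pair of reads, which is the time-consuming part for large read sets.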
13
Genome assembly strategies
14
Genome assembly strategies
DBG
ADVANTAGES
• Very sensitive to repeats
• K-mers stored just once
• Eulerian cycle
• Never explicitly computes pairwise overlaps
DISADVANTAGES
• Sensitive to sequencing errors (new k-mers)
• Large memory requirements

OLC
ADVANTAGES
• Modular algorithmic design
• Flexibility and robustness
DISADVANTAGES
• Hamiltonian cycle
• Overlap stage is time-consuming
• Genome-size limitations
15
Genome assembly strategies
Greedy approach → SSAKE
De Bruijn graph (DBG) → Velvet, SOAPdenovo
Overlap Layout Consensus (OLC) → MIRA
Mixed approaches → MaSuRCA
16
Genome Assemblers
• Average coverage
• Number of contigs
• Number of contigs > 1 kb
• N50 contig size
• Fraction of reads assembled
• Total consensus (in nt)
• Number of scaffolds
• N50 scaffold size
Ion Torrent PGM → MIRA 3.9
Illumina → MaSuRCA
MIRA 3.9 also produced good-quality results, but it has a longer execution time and becomes unstable with large numbers of short reads
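Several of the metrics used to compare assemblers can be computed from the contig lengths alone. The sketch below assumes the usual N50 definition (the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the total assembly); the contig lengths are invented for the example.

```python
def n50(lengths):
    """Contig length at which the longest contigs reach half of the assembly."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0  # empty assembly


def contig_stats(lengths, min_size=1000):
    """A few of the summary metrics listed on the slide."""
    return {
        "number_of_contigs": len(lengths),
        "number_of_contigs_gt_1kb": sum(1 for n in lengths if n > min_size),
        "n50_contig_size": n50(lengths),
        "total_consensus_nt": sum(lengths),
    }


# Hypothetical contig lengths, in nucleotides.
stats = contig_stats([8000, 5000, 3000, 900, 600])
print(stats)
```

The same `n50` function applies to scaffold lengths for the N50 scaffold size.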
17
Outline-summary
1. QUICK INTRODUCTION
2. GENOME ASSEMBLY
3. ASSEMBLY STRATEGIES
4. CASE STUDY
18
Mycobacteria Assembly: Case Study
Responsible for many animal and human diseases
• M. tuberculosis and M. leprae (TM)
• M. fortuitum (NTM) outbreak (nail salon, 2002)
• M. chelonae (NTM) outbreak (face lifts, 2004)
Illumina HiSeq sequencing (NGS Facility – CIBIO/UNITN)
Twenty mycobacterial strains from 20 different Mycobacteria species
→ MaSuRCA
Novel mycobacteria detection clinical tests
19
Raw data quality assessment and pre-processing
Fastq-mcf tool removes:
• poor quality ends of reads
• Ns, duplicates and sequencing adapters
• reads that are too short
Reduction up to 73%
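The kind of filtering listed on this slide can be illustrated with a small Python sketch. It is not fastq-mcf and does not reproduce its algorithm or options: it assumes Phred+33 quality strings, uses made-up thresholds (`min_q`, `min_len`), treats only exact sequence duplicates as duplicates, and leaves adapter removal out entirely.

```python
def trim_ends(seq, qual, min_q=20, offset=33):
    """Trim bases below min_q from both ends (Phred+33 encoding assumed)."""
    start, end = 0, len(seq)
    while end > start and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    while start < end and ord(qual[start]) - offset < min_q:
        start += 1
    return seq[start:end], qual[start:end]


def clean_reads(records, min_q=20, min_len=5):
    """Keep reads that survive trimming, contain no N, and are not duplicates."""
    seen = set()
    kept = []
    for name, seq, qual in records:
        seq, qual = trim_ends(seq, qual, min_q)
        if len(seq) < min_len or "N" in seq or seq in seen:
            continue
        seen.add(seq)
        kept.append((name, seq, qual))
    return kept


# Hypothetical reads as (name, sequence, quality) tuples; 'I' is Q40, '#' is Q2.
records = [
    ("r1", "ACGTACGTAC", "IIIIIIII##"),  # poor-quality 3' end gets trimmed
    ("r2", "ACGNACGTAC", "IIIIIIIIII"),  # contains an N
    ("r3", "ACGTACGT", "IIIIIIII"),      # duplicate of trimmed r1
    ("r4", "ACGTA", "II###"),            # too short after trimming
]
kept = clean_reads(records)
reduction = 1 - len(kept) / len(records)
print(kept, reduction)
```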
20
Assembly parameters setting
K-mers: strings of a particular length k, which are shorter than entire reads
Best empirical k-mer length: 91 bases
High coverage
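A short sketch of what choosing k means in practice. A read of length L contains L - k + 1 k-mers, so a long k leaves few k-mers per read, and the k-mer coverage is lower than the base coverage by the factor (L - k + 1) / L (the relation given in the Velvet manual). This is one plausible reading of why a long k goes together with high coverage. The read length of 100 and the base coverage of 100x below are assumptions for the example; the slides do not state them.

```python
def kmers(read, k):
    """All k-mers of a read, in order; a read of length L yields L - k + 1."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def kmer_coverage(base_coverage, read_length, k):
    """K-mer coverage implied by a given base coverage and read length."""
    return base_coverage * (read_length - k + 1) / read_length


print(kmers("ACGTAC", 4))

# Assumed values: 100-base reads and 100x base coverage (not from the slides).
per_read = len(kmers("A" * 100, 91))
effective = kmer_coverage(base_coverage=100, read_length=100, k=91)
print(per_read, effective)
```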
21
MaSuRCA results of Mycobacteria
Abnormal GC content
Genome size too large
22
Examples of environmental contamination
GC content based quality analysis
Staphylococcus epidermidis
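A GC-content check of this kind can be sketched as below. The expected GC fraction and the tolerance are assumptions chosen for the example: mycobacterial genomes are GC-rich (roughly 0.65), while Staphylococcus epidermidis is GC-poor (roughly 0.32), so contigs far from the expected value are candidates for contamination. The contig names and sequences are invented and far shorter than real contigs.

```python
def gc_content(seq):
    """Fraction of G and C among the A, C, G, T bases of a sequence."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    acgt = gc + seq.count("A") + seq.count("T")
    return gc / acgt if acgt else 0.0


def flag_contigs(contigs, expected_gc, tolerance=0.15):
    """Names of contigs whose GC content is far from the expected value."""
    return [
        name
        for name, seq in contigs.items()
        if abs(gc_content(seq) - expected_gc) > tolerance
    ]


# Hypothetical toy contigs.
contigs = {
    "contig_1": "GCGCCGATGCAC",  # GC-rich, as expected for Mycobacteria
    "contig_2": "ATTATAGCATTA",  # GC-poor, possible contaminant
}
suspects = flag_contigs(contigs, expected_gc=0.65)
print(suspects)
```

A flagged contig is only a lead: confirming the contaminant species still needs a sequence search against reference genomes.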
Thanks
http://gcat.davidson.edu/phast/#methods