Metagenomics workflow - Kursused · 2014. 12. 12. · Data Usually sequenced using next generation sequencing methods Contains reads from thousands of organisms Publicly available

Metagenomics workflow

Fanny-Dhelia Pajuste
Supervisor: Balaji Rajashekar

12.12.2014

Metagenomics

● Studying genomic sequences directly from environmental samples

● Samples contain sequences of thousands of different organisms

● Can be used for:
  ○ personalized medicine
  ○ environmental studies
  ○ agriculture, etc.

Metagenomics Tasks

Identifying:
● organisms (species, strains, ...)
● the abundance of organisms
● genes
● functions

Data

● Usually sequenced using next generation sequencing methods

● Contains reads from thousands of organisms
● Publicly available data from:
  ○ MG-RAST (over 140 000 metagenomes)
  ○ IMG
  ○ EBI

Data for This Project

● From MG-RAST
● Metagenome of the human oral cavity under healthy and diseased conditions
● Eight samples
● Different oral health status

Workflow and Methods

Data Preprocessing

Reads are filtered based on:
● quality
● length
● ambiguous bases
● (replication, i.e. duplicated reads)
The data can also be purged of reads from some species
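The filtering criteria above can be illustrated with a minimal Python sketch. The `Read` type, the function names and all thresholds (mean Phred quality 20, minimum length 50, no ambiguous bases) are assumptions chosen for illustration; they are not the settings of MG-RAST or of any particular tool.

```python
from typing import Iterable, Iterator, List, NamedTuple


class Read(NamedTuple):
    name: str
    seq: str
    quals: List[int]  # Phred quality scores, one per base


def passes_filters(read: Read, min_mean_qual: float = 20.0,
                   min_len: int = 50, max_ambiguous: int = 0) -> bool:
    """Apply the length, ambiguous-base and quality criteria (hypothetical thresholds)."""
    if not read.quals or len(read.seq) < min_len:
        return False
    if read.seq.upper().count("N") > max_ambiguous:
        return False
    return sum(read.quals) / len(read.quals) >= min_mean_qual


def filter_reads(reads: Iterable[Read], **thresholds) -> Iterator[Read]:
    """Yield reads that pass all filters, dropping exact duplicate sequences."""
    seen = set()
    for read in reads:
        if not passes_filters(read, **thresholds):
            continue
        if read.seq in seen:  # replication: keep only the first copy
            continue
        seen.add(read.seq)
        yield read
```

Only exact duplicates are removed here; real dereplication steps may also treat near-identical reads as replicates.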

Assembly

● Assembling reads into larger DNA sequences (contigs and scaffolds)
● Uses de Bruijn or overlap graphs
● Depends on the type of reads
● Might need manual inspection (errors)
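To make the de Bruijn graph idea concrete, here is a minimal Python sketch under simplifying assumptions: error-free reads, a single strand, and a greedy walk that stops at the first branch. The function names are hypothetical, and real assemblers also deal with sequencing errors, reverse complements and coverage.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set


def de_bruijn_graph(reads: Iterable[str], k: int) -> Dict[str, Set[str]]:
    """Nodes are (k-1)-mers; each k-mer in a read adds an edge prefix -> suffix."""
    graph: Dict[str, Set[str]] = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def extend_unambiguous(graph: Dict[str, Set[str]], start: str) -> str:
    """Extend a contig from `start` while the current node has exactly one successor."""
    contig, node = start, start
    visited = {start}
    while len(graph.get(node, ())) == 1:
        (nxt,) = graph[node]
        if nxt in visited:  # stop on cycles, which arise from repeats
            break
        contig += nxt[-1]
        visited.add(nxt)
        node = nxt
    return contig
```

For example, the two overlapping reads "ATGGCG" and "GGCGTC" with k = 4 are joined into the single contig "ATGGCGTC" when the walk starts at "ATG".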

http://genome.jgi-psf.org/help/scaffolds.html


Assembly: Isolate vs Metagenomes
● Assuming a uniform coverage depth across a genome is used for:
  ○ Identifying repeat regions
  ○ Estimating the size of a genome
● In metagenomes the coverage depth differs between organisms (relative abundance)
● Repeat regions in a single genome vs between multiple genomes
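The role of the uniform-coverage assumption can be shown with the standard relation between coverage depth, read count, read length and genome size (depth = reads × read length / genome size). The Python sketch below uses hypothetical function names and made-up numbers.

```python
def coverage_depth(n_reads: int, read_len: int, genome_size: int) -> float:
    """Expected average depth if reads are spread uniformly over the genome."""
    return n_reads * read_len / genome_size


def estimate_genome_size(n_reads: int, read_len: int, depth: float) -> float:
    """Invert the same relation: size = total sequenced bases / observed depth."""
    return n_reads * read_len / depth
```

With 1,000,000 reads of 100 bp and an observed depth of 20, the estimate is 5,000,000 bp. In a metagenome the depth differs per organism, so neither this estimate nor depth-based repeat detection carries over directly.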

Assembly: Isolate vs Metagenomes

● Sequencing errors
  ○ Introduce false overlaps
  ○ Disrupt true overlaps
● Isolate genomes: error correction using the consensus sequence

Assembly: Methods

● IDBA-UD
● MetaVelvet
● SOAPdenovo
● MetaSim (a sequencing simulator rather than an assembler)
● Omega

Gene Calling
● Prediction of genes: identifying protein or RNA sequences coded on the DNA present in the sample
● Data used:
  ○ Initial reads
  ○ Assembled contigs
  ○ Both

Gene Calling: Approaches
● Evidence based:
  ○ The metagenome is searched for genes similar to already known ones (homology searches)
● Ab initio:
  ○ Without previous knowledge
  ○ Relying on intrinsic features of the DNA
  ○ Can use genes found by evidence-based methods as a training set
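As a very reduced illustration of relying on intrinsic features of the DNA, the Python sketch below scans all six reading frames for open reading frames (ATG to the next in-frame stop codon). This is not how GLIMMER or MetaGene work internally; those tools use statistical models on top of such signals. The function names and the `min_codons` threshold are assumptions.

```python
from typing import Iterator, Tuple

STOP_CODONS = {"TAA", "TAG", "TGA"}


def reverse_complement(seq: str) -> str:
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_orfs(seq: str, min_codons: int = 30) -> Iterator[Tuple[str, int, int, str]]:
    """Yield (strand, start, end, orf) for ATG...stop ORFs in all six frames.

    Coordinates are 0-based, end-exclusive, on the scanned strand.
    ORFs running off the end of the sequence are not reported.
    """
    seq = seq.upper()
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i
                elif start is not None and codon in STOP_CODONS:
                    if (i + 3 - start) // 3 >= min_codons:
                        yield strand, start, i + 3, s[start:i + 3]
                    start = None
```

On short metagenomic reads many genes are truncated at the read ends, so a practical gene caller would also need to report such partial ORFs.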

Gene Calling: Methods
● BLAST
● CRITICA
● Orpheus
● GLIMMER
● MetaGene

Function Calling
● Identifying the functions of the organisms in a sample
  ○ What enables the organisms to have certain effects
  ○ Identify the functional relations between samples
● Goal: to know the coding and functional capacity of most of the species present in the sample

Function Calling: Approaches
● Homology based
  ○ Compare predicted query proteins to databases of known sequences
  ○ A query protein might have no match in the database
  ○ Computationally expensive
● Motif based
  ○ Finds proteins with the same or similar function but different sequences
● + Genomic neighborhood information

Function Calling: Methods
● BLAST
● HMMER

Classification
● Also called binning
● Identifying the organisms present
● Approaches:
  ○ Assembly based
  ○ Marker genes
  ○ Supervised methods
  ○ Unsupervised methods
● Different resolution: species, strains, etc.

Classification: Methods
● BLAST
● MEGAN
● PhymmPL
● Naive Bayes Classifier
● Kraken

MEGAN
● MEtaGenome ANalyser
● BLAST: to search reads against a database
● NCBI taxonomy: to assign a taxon ID to each sequence
● Each read is assigned to the LCA (lowest common ancestor) of the set of taxa it matches
● Bottleneck: comparison of sequences
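The LCA assignment can be sketched in a few lines of Python. The taxonomy here is a toy parent map: the chain root > Bacteria > Bacteroidetes > Prevotella > Prevotella melaninogenica is abbreviated from the real lineage, and "Prevotella sp. X" is a hypothetical sibling added for illustration. This shows only the LCA step, not MEGAN's actual implementation or its score thresholds.

```python
from typing import Dict, Iterable, List


def lineage(taxon: str, parent: Dict[str, str]) -> List[str]:
    """Path from the root down to `taxon`; the root is its own parent."""
    path = [taxon]
    while parent[path[-1]] != path[-1]:
        path.append(parent[path[-1]])
    return path[::-1]


def lowest_common_ancestor(taxa: Iterable[str], parent: Dict[str, str]) -> str:
    """Deepest node shared by the lineages of all given taxa."""
    lineages = [lineage(t, parent) for t in taxa]
    lca = lineages[0][0]
    for level in zip(*lineages):  # walk down from the root in parallel
        if len(set(level)) != 1:
            break
        lca = level[0]
    return lca


# Toy taxonomy (abbreviated; "Prevotella sp. X" is hypothetical).
TOY_PARENT = {
    "root": "root",
    "Bacteria": "root",
    "Bacteroidetes": "Bacteria",
    "Firmicutes": "Bacteria",
    "Prevotella": "Bacteroidetes",
    "Prevotella melaninogenica": "Prevotella",
    "Prevotella sp. X": "Prevotella",
}
```

A read whose hits are "Prevotella melaninogenica" and "Prevotella sp. X" is assigned to "Prevotella"; if one hit were in "Firmicutes" instead, the read would only be assigned to "Bacteria".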

Kraken
● Exact matching of k-mers to a database
● Each k-mer is mapped to the LCA of the genomes that contain it
● Classification tree: the taxa hit and their ancestors, weighted by the number of k-mers mapped to each
● The maximum-weight root-to-leaf path is calculated
● The leaf of that path is used as the classification
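The root-to-leaf scoring described above can be sketched as follows. This is an illustration of the idea, not Kraken's code: the taxonomy IDs in `TOY_PARENT` are hypothetical, and ties are broken arbitrarily here, while Kraken resolves ties in its own way. K-mers without a database hit (taxon 0) carry no weight in the tree.

```python
from typing import Dict, Optional


def classify(kmer_hits: Dict[int, int], parent: Dict[int, int]) -> Optional[int]:
    """Return the taxon at the end of the heaviest root-to-leaf path.

    kmer_hits maps taxon ID -> number of k-mers whose LCA is that taxon.
    parent maps taxon ID -> parent taxon ID (the root is its own parent).
    """
    def path_score(taxon: int) -> int:
        # Sum the k-mer counts of the taxon and all of its ancestors.
        score, node = 0, taxon
        while True:
            score += kmer_hits.get(node, 0)
            if parent[node] == node:
                return score
            node = parent[node]

    # Taxon 0 (no hit) is not in the taxonomy and is therefore ignored.
    candidates = [t for t in kmer_hits if t in parent]
    return max(candidates, key=path_score) if candidates else None


# Hypothetical toy taxonomy: 1 is the root; 11 and 12 are siblings under 10.
TOY_PARENT = {1: 1, 2: 1, 10: 2, 11: 10, 12: 10, 20: 2}
```

With hits `{0: 7, 2: 1, 10: 4, 11: 3, 12: 5, 20: 6}`, the path to taxon 12 scores 1 + 4 + 5 = 10, beating taxon 11 (8) and taxon 20 (7), so the sequence is classified as 12 even though taxon 20 has the most direct hits.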

Kraken

● Standard database: 150 GB
● MiniKraken: 4 GB
● Takes FASTA/FASTQ input
● Classifies every sequence from the input file

Kraken: Output
● Output has five columns:
  ○ C/U: classified or unclassified
  ○ Sequence ID from the input file header
  ○ Taxonomy ID for the classification (0 if unclassified)
  ○ Length of the sequence in bp
  ○ List of LCA mappings: "562:13 561:4 A:31 0:1 562:3"

Kraken: Output
U GF8803K01A0000 0 506 0:476

C GF8803K01A000R 553174 496 0:216 553174:1 0:83 553174:1 0:165

U GF8803K01A001D 0 458 0:428

C GF8803K01A001U 649638 533 0:257 95818:1 0:82 649638:1 0:1 649638:1 0:10 2:1 0:39 541000:1 0:109

U GF8803K01A0028 0 297 0:267

U GF8803K01A003Q 0 481 0:451

U GF8803K01A004I 0 134 0:104

C GF8803K01A004M 767031 485 0:39 767031:1 0:56 767031:1 0:25 767031:1 0:7 767031:1 0:19 767031:1 0:24 767031:1 0:67 767031:1 0:22 767031:1 0:76 767031:1 0:5 767031:1 0:52 767031:1 0:2 767031:1 0:18 767031:1 0:29 767031:1

U GF8803K01A0058 0 512 0:482
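Output lines like the ones above can be read with a short parser. The sketch below is an assumption-laden helper, not part of Kraken: the names `KrakenRecord` and `parse_kraken_line` are hypothetical, and the first four fields are split on any whitespace so that both tab-delimited files and the space-separated lines shown here are accepted.

```python
from collections import Counter
from typing import Dict, NamedTuple


class KrakenRecord(NamedTuple):
    classified: bool
    seq_id: str
    tax_id: int
    length: int
    kmer_hits: Dict[str, int]  # taxon ("0" = no hit, "A" = ambiguous) -> k-mer count


def parse_kraken_line(line: str) -> KrakenRecord:
    """Parse one five-column Kraken output line."""
    status, seq_id, tax_id, length, mappings = line.strip().split(None, 4)
    hits: Counter = Counter()
    for pair in mappings.split():
        taxon, count = pair.split(":")
        hits[taxon] += int(count)  # merge repeated runs of the same taxon
    return KrakenRecord(status == "C", seq_id, int(tax_id), int(length), dict(hits))
```

For the second line shown above (GF8803K01A000R), the mappings add up to 464 unmatched k-mers and 2 k-mers for taxon 553174. The total of 466 equals 496 - 31 + 1, which is what a 496 bp sequence yields with k = 31.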

Kraken: Report 1
71.90 244099 244099 U 0 unclassified

28.10 95404 10 - 1 root

28.08 95317 11 - 131567 cellular organisms

28.07 95296 509 D 2 Bacteria

12.65 42959 2 - 68336 Bacteroidetes/Chlorobi group

12.65 42951 29 P 976 Bacteroidetes

12.54 42563 0 C 200643 Bacteroidia

12.54 42563 291 O 171549 Bacteroidales

12.05 40898 0 F 171552 Prevotellaceae

12.05 40898 1402 G 838 Prevotella

6.62 22485 0 S 28132 Prevotella melaninogenica
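Each report line has six columns: percentage of reads in the clade, number of reads in the clade, number of reads assigned directly to the taxon, rank code (U, D, P, C, O, F, G, S, or - for unranked), NCBI taxonomy ID and scientific name. A minimal Python parser, with the hypothetical names `ReportRow` and `parse_report_line`, could look like this:

```python
from typing import NamedTuple


class ReportRow(NamedTuple):
    percent: float      # percentage of all reads in this clade
    clade_reads: int    # reads assigned to this taxon or any descendant
    direct_reads: int   # reads assigned to exactly this taxon
    rank: str           # rank code, "-" if unranked
    tax_id: int
    name: str


def parse_report_line(line: str) -> ReportRow:
    """Parse one report line; the name may contain spaces, so split at most 5 times."""
    pct, clade, direct, rank, tax_id, name = line.strip().split(None, 5)
    return ReportRow(float(pct), int(clade), int(direct), rank, int(tax_id), name.strip())
```

In the report above, the unclassified reads (244099) and the reads under root (95404) sum to 339503, and 244099 / 339503 gives the 71.90 % shown in the first row.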

Kraken: Results

Conclusion
● Metagenomic data contains fragments of the DNA sequences of a great number of organisms in an environmental sample
● Metagenomics is an important field: it can be used for medicine, environmental studies, etc.
● It can be used to identify organisms, genes or functions

Thank you!
