Metagenomics workflow Fanny-Dhelia Pajuste Supervisor: Balaji Rajashekar 12.12.2014

Feb 19, 2021

Transcript
  • Metagenomics workflow

    Fanny-Dhelia Pajuste
    Supervisor: Balaji Rajashekar

    12.12.2014

  • Metagenomics

    ● Studying genomic sequences directly from environmental samples

    ● Samples contain sequences of thousands of different organisms

    ● Can be used for:
        ○ personalized medicine
        ○ environmental studies
        ○ agriculture, etc.

  • Metagenomics Tasks

    Identifying:
    ● organisms (species, strains, ...)
    ● the abundance of organisms
    ● genes
    ● functions

  • Data

    ● Usually sequenced using next-generation sequencing methods

    ● Contains reads from thousands of organisms

    ● Publicly available data from:
        ○ MG-RAST (over 140 000 metagenomes)
        ○ IMG
        ○ EBI

  • Data for This Project

    ● From MG-RAST
    ● Metagenome of the human oral cavity under healthy and diseased conditions
    ● Eight samples
    ● Different oral health status

  • Workflow and Methods

  • Data Preprocessing

    Reads are filtered based on:
    ● quality
    ● length
    ● ambiguous bases
    ● (replication, i.e. duplicate reads)

    Reads from certain species can also be removed
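The filtering criteria above can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline used in the project: the function names, the thresholds, and the Phred+33 quality encoding are all assumptions.

```python
def mean_quality(quality_string, offset=33):
    """Mean Phred score of a FASTQ quality string (Phred+33 encoding assumed)."""
    return sum(ord(ch) - offset for ch in quality_string) / len(quality_string)


def filter_reads(reads, min_length=50, min_mean_quality=20, max_ambiguous=0):
    """Yield (read_id, sequence, quality) tuples that pass every filter.

    All thresholds are illustrative defaults, not values from the slides.
    """
    seen = set()  # sequences already emitted, used to drop exact duplicates
    for read_id, sequence, quality in reads:
        if len(sequence) < min_length:
            continue  # too short
        if sequence.upper().count("N") > max_ambiguous:
            continue  # too many ambiguous bases
        if mean_quality(quality) < min_mean_quality:
            continue  # low average base quality
        if sequence in seen:
            continue  # exact duplicate of an earlier read
        seen.add(sequence)
        yield read_id, sequence, quality


# Hypothetical toy reads: only "r1" survives all four filters.
toy_reads = [
    ("r1", "ACGT" * 3, "I" * 12),
    ("r2", "ACGT", "I" * 4),        # too short
    ("r3", "ACGN" * 3, "I" * 12),   # ambiguous bases
    ("r4", "ACGT" * 3, "#" * 12),   # low quality
    ("r5", "ACGT" * 3, "I" * 12),   # duplicate of r1
]
kept_ids = [read_id for read_id, _, _ in filter_reads(toy_reads, min_length=10)]
```

Real preprocessing tools usually also trim low-quality read ends rather than only discarding whole reads; that step is left out here.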

  • Assembly

    ● Assembling to larger DNA sequences (contigs and scaffolds)

    ● Uses de Bruijn or overlap graphs
    ● The choice depends on the type of reads
    ● Might need manual inspection (errors)

    http://genome.jgi-psf.org/help/scaffolds.html

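To make the de Bruijn graph idea concrete, here is a toy sketch: every k-mer in a read becomes an edge between its (k-1)-mer prefix and suffix, and a contig is read off by following unambiguous edges. The function names are hypothetical, and real assemblers additionally handle reverse complements, sequencing errors, and coverage, none of which is modeled here.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def extend_contig(graph, start):
    """Follow edges from `start` while there is exactly one unvisited successor."""
    contig, node, visited = start, start, {start}
    while True:
        successors = graph.get(node, set())
        if len(successors) != 1:
            break  # dead end or branch: stop extending
        (next_node,) = successors
        if next_node in visited:
            break  # cycle (e.g. a repeat): stop extending
        contig += next_node[-1]
        visited.add(next_node)
        node = next_node
    return contig


# Two overlapping toy reads are merged into one contig.
graph = de_bruijn_graph(["ACGTC", "GTCAG"], k=3)
contig = extend_contig(graph, "AC")  # "ACGTCAG"
```

The walk stops at the first branch, which is exactly where repeats and sequencing errors make manual inspection necessary.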

  • Assembly: Isolate vs Metagenomes

    ● Isolate genomes: a uniform coverage depth across the genome is assumed, which helps with:
        ○ Identifying repeat regions
        ○ Estimating the size of the genome

    ● Metagenomes: coverage depth differs between genomes (relative abundance)

    ● Repeat regions occur within a single genome (isolates) vs between multiple genomes (metagenomes)
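The uniform-coverage assumption for isolates can be illustrated with two small calculations: genome size is roughly the total number of sequenced bases divided by the coverage depth, and contigs with much higher coverage than typical are repeat candidates. The function names, the `factor` threshold, and the numbers are assumptions chosen for illustration.

```python
from statistics import median


def estimate_genome_size(total_sequenced_bases, coverage_depth):
    """Genome size is approximately total bases / depth, if depth is uniform."""
    return total_sequenced_bases / coverage_depth


def flag_repeat_candidates(contig_coverage, factor=1.8):
    """Names of contigs whose coverage is at least `factor` times the median."""
    typical = median(contig_coverage.values())
    return sorted(
        name for name, coverage in contig_coverage.items()
        if coverage >= factor * typical
    )


# Hypothetical isolate: 150 Mbp of reads at 30x depth suggests a 5 Mbp genome.
size = estimate_genome_size(150_000_000, 30)
repeats = flag_repeat_candidates({"c1": 30, "c2": 31, "c3": 29, "c4": 62})
```

In a metagenome neither calculation is reliable, because a contig with double coverage may simply come from an organism that is twice as abundant.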

  • Assembly: Isolate vs Metagenomes

    ● Sequencing errors:
        ○ Introduce false overlaps
        ○ Disrupt true overlaps

    ● Error correction using the consensus sequence (for isolate genomes)
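Consensus-based error correction can be sketched as a per-column majority vote over reads that are already aligned to each other. This is a simplified illustration with a hypothetical function name; it assumes equal-length aligned reads with "-" marking positions a read does not cover.

```python
from collections import Counter


def consensus(aligned_reads):
    """Majority base per alignment column; 'N' where no read covers the column."""
    result = []
    for column in zip(*aligned_reads):
        counts = Counter(base for base in column if base != "-")
        result.append(counts.most_common(1)[0][0] if counts else "N")
    return "".join(result)


# The single mismatch in the second read (A instead of T) is outvoted.
corrected = consensus(["ACGTAC", "ACGAAC", "ACGTAC"])  # "ACGTAC"
```

The vote only works when all reads at a position come from the same genome, which holds for isolates but not for a mixture of closely related strains.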

  • Assembly: Methods

    ● IDBA-UD
    ● MetaVelvet
    ● SOAPdenovo
    ● MetaSim
    ● Omega

  • Gene Calling

    ● Prediction of genes: identifying protein or RNA sequences coded on the DNA present in the sample

    ● Data used:
        ○ Initial reads
        ○ Assembled contigs
        ○ Both

  • Gene Calling: Approaches

    ● Evidence based:
        ○ The metagenome is searched for genes similar to ones that are already known (homology searches)

    ● Ab initio:
        ○ Without previous knowledge
        ○ Relies on internal features of the DNA
        ○ Can use genes found by evidence-based methods as a training set
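The simplest "internal feature" an ab initio method can use is an open reading frame: a start codon followed in-frame by a stop codon. The sketch below scans the forward strand only; it is a deliberately naive illustration with hypothetical names and thresholds, not how any of the gene callers named in these slides work internally.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(sequence, min_codons=30):
    """Return (start, end, frame) for ATG...stop ORFs on the forward strand.

    `end` is exclusive and includes the stop codon; `min_codons` counts the
    codons before the stop and is an illustrative threshold.
    """
    sequence = sequence.upper()
    orfs = []
    for frame in range(3):
        start = None
        for pos in range(frame, len(sequence) - 2, 3):
            codon = sequence[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos  # first start codon opens a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (pos - start) // 3 >= min_codons:
                    orfs.append((start, pos + 3, frame))
                start = None  # look for the next ORF in this frame
    return orfs


# Hypothetical toy sequence with one short ORF in frame 2.
toy = "CC" + "ATG" + "GCT" * 4 + "TAA" + "GG"
orfs = find_orfs(toy, min_codons=3)  # [(2, 20, 2)]
```

On short metagenomic reads many genes are truncated at one or both ends, so a practical gene caller cannot require both a start and a stop codon the way this sketch does.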

  • Gene Calling: Methods

    ● BLAST
    ● CRITICA
    ● Orpheus
    ● GLIMMER
    ● MetaGene

  • Function Calling

    ● Identifying the functions of the organisms in a sample
        ○ What enables the organisms to have certain effects
        ○ Identify the functional relations between samples

    ● We should know the coding and functional capacity of most of the species present in the sample

  • Function Calling: Approaches

    ● Homology based
        ○ Compare predicted query proteins to known sequence databases
        ○ The query might not be present in the database
        ○ Computationally expensive

    ● Motif based
        ○ Same/similar function, but different sequences

    ● + Genomic neighborhood information
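A motif-based search can be illustrated with a regular expression. The pattern below is the PROSITE ATP/GTP-binding site motif A (P-loop), `[AG]-x(4)-G-K-[ST]`, translated to Python regex syntax; the two protein fragments are made-up examples, and the function name is hypothetical.

```python
import re

# PROSITE pattern [AG]-x(4)-G-K-[ST] written as a Python regular expression.
P_LOOP = re.compile(r"[AG].{4}GK[ST]")


def find_motif(protein, pattern=P_LOOP):
    """Return (position, matched_text) for every motif occurrence."""
    return [(match.start(), match.group()) for match in pattern.finditer(protein)]


# Two hypothetical fragments with different sequences but the same motif.
hits_first = find_motif("MKRLGAPGSGKSTLA")   # [(4, "GAPGSGKS")]
hits_second = find_motif("AQWERGKTVV")       # [(0, "AQWERGKT")]
```

Both fragments match even though they share little overall sequence, which is the case that a pure homology search can miss. Profile methods such as HMMER generalize this idea from a fixed pattern to a probabilistic model.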

  • Function Calling: Methods

    ● BLAST
    ● HMMER

  • Classification

    ● Also called binning
    ● Identifying the organisms present
    ● Approaches:
        ○ Assembly based
        ○ Marker genes
        ○ Supervised methods
        ○ Unsupervised methods

    ● Different levels of accuracy: species, strains, etc.

  • Classification: Methods

    ● BLAST
    ● MEGAN
    ● PhymmPL
    ● Naive Bayes Classifier
    ● Kraken

  • MEGAN

    ● MEtaGenome ANalyser
    ● BLAST - to search reads against a database
    ● NCBI taxonomy - to assign a taxon ID to each sequence
    ● Each read is assigned to the LCA (lowest common ancestor) of the set of taxa it matched
    ● Bottleneck - comparison of sequences
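The LCA assignment can be sketched with a child-to-parent map. The toy taxonomy below is a simplified excerpt (intermediate groups omitted) and the function names are hypothetical; MEGAN itself works on the full NCBI taxonomy and applies additional score filters that are not shown.

```python
# Simplified toy taxonomy: child -> parent.
PARENT = {
    "Prevotella melaninogenica": "Prevotella",
    "Prevotella": "Prevotellaceae",
    "Prevotellaceae": "Bacteroidales",
    "Bacteroidales": "Bacteroidia",
    "Bacteroidia": "Bacteroidetes",
    "Bacteroidetes": "Bacteria",
    "Firmicutes": "Bacteria",
    "Bacteria": "root",
}


def lineage(taxon, parent):
    """Path from a taxon up to the root, starting with the taxon itself."""
    path = [taxon]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    return path


def lowest_common_ancestor(taxa, parent):
    """Deepest taxon shared by the lineages of all given taxa."""
    taxa = list(taxa)
    common = lineage(taxa[0], parent)  # ordered from leaf to root
    for taxon in taxa[1:]:
        others = set(lineage(taxon, parent))
        common = [node for node in common if node in others]
    return common[0]


same_genus = lowest_common_ancestor(
    ["Prevotella melaninogenica", "Prevotella"], PARENT)        # "Prevotella"
different_phyla = lowest_common_ancestor(
    ["Prevotella melaninogenica", "Firmicutes"], PARENT)        # "Bacteria"
```

A read whose hits span distant taxa is therefore placed high in the tree, which trades specificity for fewer false assignments.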

  • MEGAN

  • Kraken

    ● Exact matching of k-mers to a database
    ● Each k-mer is mapped to the LCA of the genomes that contain it
    ● Classification tree - the taxa that were hit and their ancestors, weighted by the number of k-mers mapped to each
    ● Root-to-leaf paths are scored and the maximal one is selected
    ● The leaf of that path is used as the classification
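The classification step can be sketched as follows: count k-mer hits per taxon, score each hit taxon by summing the hits along its path to the root, and pick the best-scoring one. The database, taxonomy, and function name are toy assumptions; the real tool uses much longer k-mers (31 by default), canonical k-mers, and a specific tie-breaking rule, none of which is reproduced here.

```python
from collections import Counter


def classify_read(read, kmer_to_taxon, parent, k):
    """Return the taxon with the heaviest root-to-leaf path, or None."""
    hits = Counter()
    for i in range(len(read) - k + 1):
        taxon = kmer_to_taxon.get(read[i:i + k])
        if taxon is not None:
            hits[taxon] += 1
    if not hits:
        return None  # no k-mer found in the database: unclassified

    def path_score(taxon):
        # Sum the k-mer counts of the taxon and all of its ancestors.
        score = 0
        while taxon is not None:
            score += hits[taxon]
            taxon = parent.get(taxon)
        return score

    # A hit descendant always outscores its ancestors, so the winner is a leaf
    # of the classification tree. Ties are broken arbitrarily in this sketch.
    return max(hits, key=path_score)


# Hypothetical toy taxonomy and 3-mer database.
toy_parent = {"species_A": "genus", "species_B": "genus", "genus": "root"}
toy_db = {"ACG": "species_A", "CGT": "genus", "GTT": "species_B", "TTA": "species_A"}
label = classify_read("ACGTTA", toy_db, toy_parent, k=3)  # "species_A"
```

In the toy read, species_A scores 3 (two own k-mers plus one at genus level) against 2 for species_B, so the k-mer shared at genus level supports both candidates without deciding between them.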

  • Kraken

  • Kraken

    ● Standard database: 150 GB
    ● MiniKraken - 4 GB
    ● Takes fasta/fastq
    ● Classifies every sequence from the input file

  • Kraken: Output

    ● Output has five columns:
        ○ C/U - classified or unclassified
        ○ Sequence ID from input file header
        ○ Taxonomy ID for classification
        ○ Length of the sequence in bp
        ○ List of LCA mappings: "562:13 561:4 A:31 0:1 562:3"

  • Kraken: Output

    U GF8803K01A0000 0 506 0:476

    C GF8803K01A000R 553174 496 0:216 553174:1 0:83 553174:1 0:165

    U GF8803K01A001D 0 458 0:428

    C GF8803K01A001U 649638 533 0:257 95818:1 0:82 649638:1 0:1 649638:1 0:10 2:1 0:39 541000:1 0:109

    U GF8803K01A0028 0 297 0:267

    U GF8803K01A003Q 0 481 0:451

    U GF8803K01A004I 0 134 0:104

    C GF8803K01A004M 767031 485 0:39 767031:1 0:56 767031:1 0:25 767031:1 0:7 767031:1 0:19 767031:1 0:24 767031:1 0:67 767031:1 0:22 767031:1 0:76 767031:1 0:5 767031:1 0:52 767031:1 0:2 767031:1 0:18 767031:1 0:29 767031:1

    U GF8803K01A0058 0 512 0:482
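Rows in this five-column format can be parsed with a short helper. The record type and function name are hypothetical; the parser splits on any whitespace so it works both for the tab-separated file and for the space-flattened rows shown on the slide.

```python
from typing import NamedTuple


class KrakenRecord(NamedTuple):
    classified: bool
    sequence_id: str
    taxonomy_id: int
    length: int
    lca_mappings: list  # (taxon, k-mer count) pairs; taxon is "A" for ambiguous k-mers


def parse_kraken_line(line):
    """Parse one line of per-read Kraken output into a KrakenRecord."""
    status, sequence_id, taxonomy_id, length, mappings = line.split(None, 4)
    pairs = []
    for item in mappings.split():
        taxon, count = item.split(":")
        pairs.append((taxon, int(count)))  # taxon stays a string because of "A"
    return KrakenRecord(status == "C", sequence_id, int(taxonomy_id), int(length), pairs)


record = parse_kraken_line(
    "C GF8803K01A000R 553174 496 0:216 553174:1 0:83 553174:1 0:165")
kmers_seen = sum(count for _, count in record.lca_mappings)  # 466
```

The k-mer counts add up to 466, which is 496 - 31 + 1: a 496 bp read contains exactly that many 31-mers, matching the default k-mer length. Only 2 of them hit taxon 553174, so this classification rests on very little evidence.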

  • Kraken: Report 1

    71.90 244099 244099 U 0 unclassified

    28.10 95404 10 - 1 root

    28.08 95317 11 - 131567 cellular organisms

    28.07 95296 509 D 2 Bacteria

    12.65 42959 2 - 68336 Bacteroidetes/Chlorobi group

    12.65 42951 29 P 976 Bacteroidetes

    12.54 42563 0 C 200643 Bacteroidia

    12.54 42563 291 O 171549 Bacteroidales

    12.05 40898 0 F 171552 Prevotellaceae

    12.05 40898 1402 G 838 Prevotella

    6.62 22485 0 S 28132 Prevotella melaninogenica
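The six columns of this report are: percentage of reads in the clade, number of reads in the clade, number of reads assigned directly to the taxon, rank code, NCBI taxonomy ID, and scientific name. A minimal parser, with a hypothetical function name and dictionary keys, could look like this:

```python
def parse_report_line(line):
    """Parse one row of the Kraken report into a dictionary."""
    percent, clade_reads, direct_reads, rank, taxid, name = line.split(None, 5)
    return {
        "percent": float(percent),
        "clade_reads": int(clade_reads),
        "direct_reads": int(direct_reads),
        "rank": rank,
        "taxid": int(taxid),
        "name": name.strip(),  # names are indented by depth in the real file
    }


rows = [parse_report_line(line) for line in (
    "71.90 244099 244099 U 0 unclassified",
    "28.10 95404 10 - 1 root",
    "6.62 22485 0 S 28132 Prevotella melaninogenica",
)]
# Unclassified reads plus reads under the root give the sample total.
total_reads = rows[0]["clade_reads"] + rows[1]["clade_reads"]  # 339503
```

Dividing each clade count by the total of 339503 reads reproduces the percentage column, for example 22485 / 339503 is about 6.62 % for Prevotella melaninogenica.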

  • Kraken: Report 2

    d__Viruses 77

    d__Bacteria 95296

    d__Archaea 10

    d__Bacteria|p__Cyanobacteria 20

    d__Bacteria|p__Proteobacteria 1346

    d__Bacteria|p__Firmicutes 39094

    d__Bacteria|p__Deinococcus-Thermus 6

    d__Bacteria|p__Candidatus_Saccharibacteria 54

    d__Bacteria|p__Cloacimonetes 3

    d__Bacteria|p__Fusobacteria 4686

    d__Bacteria|p__Verrucomicrobia 3
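In this second report each line holds a "|"-separated lineage with rank prefixes (`d__`, `p__`, ...) followed by a read count. The sketch below parses such lines and computes each phylum's share of its domain; the function names are hypothetical and only a few of the rows above are used.

```python
def parse_lineage_line(line):
    """Split 'd__X|p__Y 123' into ({'d': 'X', 'p': 'Y'}, 123)."""
    lineage, count = line.rsplit(None, 1)
    ranks = dict(level.split("__", 1) for level in lineage.split("|"))
    return ranks, int(count)


def phylum_fractions(lines, domain="Bacteria"):
    """Fraction of the domain's reads that fall into each listed phylum."""
    domain_total = None
    phyla = {}
    for line in lines:
        ranks, count = parse_lineage_line(line)
        if ranks.get("d") != domain:
            continue
        if set(ranks) == {"d"}:
            domain_total = count          # the domain-level row
        elif set(ranks) == {"d", "p"}:
            phyla[ranks["p"]] = count     # a phylum-level row
    if not domain_total:
        raise ValueError(f"no domain-level row found for {domain!r}")
    return {name: count / domain_total for name, count in phyla.items()}


fractions = phylum_fractions([
    "d__Viruses 77",
    "d__Bacteria 95296",
    "d__Bacteria|p__Proteobacteria 1346",
    "d__Bacteria|p__Firmicutes 39094",
    "d__Bacteria|p__Fusobacteria 4686",
])
```

With these rows Firmicutes account for 39094 / 95296, roughly 41 % of the bacterial reads.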

  • Kraken: Results

  • Conclusion

    ● Metagenomic data contains fragments of DNA sequences of a great number of organisms in an environmental sample

    ● Metagenomics is an important field - can be used for medicine, environmental studies, etc.

    ● Can be used to identify organisms, genes or functions

  • Thank you!