Introduction to the CGE servers

Introduction to the CGE servers. Center for Genomic Epidemiology Aim: To provide the scientific foundation for future internet-based solutions, where.

Dec 23, 2015


Introduction to the CGE servers


Center for Genomic Epidemiology

Aim:

• To provide the scientific foundation for future internet-based solutions, where a central database will enable simplification of total genome sequence information and comparison to all other sequenced isolates including spatial-temporal analysis.

• To develop algorithms for rapid analyses of whole genome DNA-sequences, tools for analyses and extraction of information from the sequence data and internet/web-interfaces for using the tools in the global scientific and medical community.


Tools for species identification

URL paths are relative to cge.cbs.dtu.dk/services/

• SpeciesFinder: Species identification using 16S rRNA. URL path: SpeciesFinder. Status: Online. Publication: Feb 2014, PMID: 24574292

• KmerFinder: Species identification using overlapping 16mers. URL path: KmerFinder. Status: Online. Publication: Jan 2014, PMID: 24172157

• TaxonomyFinder: Taxonomy identification using functional protein domains. URL path: TaxonomyFinder. Publication: PMID: 24574292 + Oksana's PhD thesis

• Reads2Type: Species identification on client computer. URL path: Reads2Type. Status: Online. Publication: Feb 2014, PMID: 24574292


Benchmarking of Methods for Bacterial Species Identification

PMID: 24574292


Training data

• 1,647 completed / almost completed genomes downloaded from NCBI in 2011 (1,009 different species)

Evaluation data

NCBI draft genomes

• 695 isolates from species that overlap with training set (151 species)

SRA draft genomes

• 10,407 sets of short reads from Illumina (168 species)

• 10,407 draft genomes from Illumina data (168 species)


16S rRNA

• 16S rRNA sequencing has dominated molecular taxonomy of prokaryotes for more than 30 years (Fox et al, Int. J. Syst. Bacteriol., 1977)

• Tremendous amounts of 16S rRNA sequence data are available in databases

Concerns:

• Low resolution

• Some genomes contain several copies of the 16S rRNA gene with inter-gene variation

• The 16S rRNA gene represents only about 0.1% of the coding part of a microbial genome


Reference database

• 16S rRNA genes are isolated from genomes in training data using RNAmmer (Lagesen, NAR, 2007).

Method

• Input genomes are BLASTed against 16S rRNA genes in reference database.

• Best hit is selected based on a combination of coverage, % identity, bitscore, number of mismatches and number of gaps in the alignments.
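The best-hit selection can be sketched as a ranking over the alignment statistics named above. The slide does not say how the criteria are combined, so the lexicographic ordering, the `Hit` record, and the example values below are all assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """One BLAST alignment against a reference 16S rRNA gene (hypothetical record)."""
    subject: str       # reference gene identifier (made up for this example)
    coverage: float    # fraction of the reference gene covered
    identity: float    # percent identity
    bitscore: float
    mismatches: int
    gaps: int


def best_hit(hits):
    """Return the top-ranked hit, or None when there are no hits.

    ASSUMPTION: criteria are compared in the order coverage, identity,
    bitscore (higher is better), then mismatches and gaps (lower is better).
    """
    if not hits:
        return None
    return max(
        hits,
        key=lambda h: (h.coverage, h.identity, h.bitscore, -h.mismatches, -h.gaps),
    )


hits = [
    Hit("ref_A", 1.00, 99.2, 2700.0, 12, 0),
    Hit("ref_B", 1.00, 99.8, 2750.0, 3, 0),
    Hit("ref_C", 0.80, 100.0, 2200.0, 0, 0),
]
print(best_hit(hits).subject)
```

With this ordering a full-length hit at 99.8% identity outranks a perfect but partial hit; a weighted score would be an equally plausible reading of "a combination".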

CGE implementation of 16S species identification

SpeciesFinder


KmerFinder

• Genomes in training data are chopped into 16mers:

ATGACGTATGATTGATGACGTAGTAGTCC

• Immune system inspired downsampling

• Only 16mers with specific prefix are kept

MHC-I

9mer
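A minimal sketch of the chopping and prefix-based downsampling, applied to the example sequence from this slide. The function name and the prefix `ATGA` are assumptions: the slide only says that a "specific prefix" is used, and `ATGA` is chosen here because the example 16mers in the deck start with it.

```python
def prefixed_kmers(sequence, k=16, prefix="ATGA"):
    """Collect the unique overlapping k-mers of `sequence` that start with `prefix`.

    ASSUMPTION: the prefix value is illustrative, not the one used by KmerFinder.
    """
    kmers = set()
    for start in range(len(sequence) - k + 1):
        # Downsampling step: keep only k-mers beginning with the chosen prefix.
        if sequence.startswith(prefix, start):
            kmers.add(sequence[start:start + k])
    return kmers


genome = "ATGACGTATGATTGATGACGTAGTAGTCC"
kept = prefixed_kmers(genome)
print(sorted(kept))
```

Of the 14 overlapping 16mers in this 29 bp sequence, only the two that begin with the prefix are kept, which is the point of the downsampling: a much smaller database with the same sampling rule applied to every genome.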


Unknown isolate

Unique 16mers:

ATGAATGTGTGAGTGA

ATGACTGTGCCCCTGA

ATGAAAAAAAAAAAA

16mer database

ATGAATGTGTGAGTGA: CP001921 (Acinetobacter baumannii), CP000521 (Acinetobacter baumannii), CP002522 (Acinetobacter baumannii)

ATGACTGTGCCCCTGA: CP001921 (Acinetobacter baumannii), CP002301 (Buchnera aphidicola)

Species Match No. of Kmer hits

Acinetobacter baumannii CP001921 2

Acinetobacter baumannii CP000521 1

Acinetobacter baumannii CP002521 1

Buchnera aphidicola CP002301 1
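The lookup illustrated on this slide can be sketched as counting, for each reference genome, how many of the isolate's 16mers it shares. The dictionary layout and function name are assumptions; the k-mers and accession lists are copied from the 16mer database listing on the slide.

```python
from collections import Counter

# 16mer database: k-mer -> accessions of the training genomes containing it.
kmer_db = {
    "ATGAATGTGTGAGTGA": ["CP001921", "CP000521", "CP002522"],
    "ATGACTGTGCCCCTGA": ["CP001921", "CP002301"],
}

# Unique k-mers of the unknown isolate, as shown on the slide.
query_kmers = {"ATGAATGTGTGAGTGA", "ATGACTGTGCCCCTGA", "ATGAAAAAAAAAAAA"}


def count_hits(query, database):
    """Count shared k-mers per reference accession."""
    hits = Counter()
    for kmer in query:
        # k-mers absent from the database simply contribute nothing.
        for accession in database.get(kmer, ()):
            hits[accession] += 1
    return hits


hits = count_hits(query_kmers, kmer_db)
for accession, n in hits.most_common():
    print(accession, n)
```

The genome with the most hits (here CP001921, with 2) is reported as the closest match.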


KmerFinder is very robust – it only needs one 16mer!

Desulfovibrio piger GOR1 SRR097356

>NODE 4 length 92 cov 23.119566
TAGGACGTGGAATATGGCAAGAAAACTGAAAATCATGGAAAATGAGAAACATCCACTTGACGACTTGAAAAATGACGAAATCACTAAAAAACGTGAAAAATGAGAAATGC
>NODE 15 length 82 cov 2.792683
AGCGAAAAATGTCATAACAACGATCACGACCGATAACCATCTTTGGTCCAAACTTACTCACGCAGCAGGCGTATAACTCGCGCATACCAGCTTTGGGCAT

N50 = 110

Total no. of bp: 210

Prediction

Species Match No. of Kmer hits

Flavobacterium psychrophilum AM398681 1
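For readers unfamiliar with the N50 statistic quoted for this assembly, a minimal sketch of how it is computed follows. The contig lengths 110 and 100 are an assumption inferred from the two contigs shown; they are consistent with the reported total of 210 bp.

```python
def n50(contig_lengths):
    """Length of the contig at which the running total, taken from the
    longest contig downwards, first reaches half of the assembly size."""
    lengths = sorted(contig_lengths, reverse=True)
    half = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half:
            return length
    raise ValueError("no contigs given")


# ASSUMPTION: two contigs of 110 bp and 100 bp (total 210 bp).
print(n50([110, 100]))
```

For these two contigs the longest one alone already exceeds half of 210 bp, so the N50 equals its length, 110, matching the value on the slide.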


TaxonomyFinder


Reads2Type

• Reads2Type pushes analysis to user, server provides 50-mer database

• SuffixTree: efficient data structure for string matching

• Narrow Down Approach:

– Reads2Type compares 50-mers of combined marker genes against raw reads

– Shared Probes vs Unique Probe

• Definition: Quick & dirty taxonomy identification of single isolates

• 50-mers of marker gene DB

– 16S rRNA: Training data genomes RNAmmer (other)

– ITS: Training data (Mycobacterium)

– GyrB: Training data (Enterobacteriaceae)

– Resulting database ~5 MB
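One way to picture the narrow-down idea is as repeated set intersection: each probe found in the reads restricts the candidate species to those carrying that probe, so shared probes shrink the set and a unique probe pins it to one species. This is only a sketch under stated assumptions; the probe sequences, species names, and function are hypothetical, probes are shortened from 50-mers for readability, and plain substring search stands in for the suffix tree mentioned above.

```python
def narrow_down(reads, probe_db):
    """Return the species still compatible with the probes seen in the reads.

    probe_db maps each probe sequence to the set of species that carry it
    (a shared probe maps to several species, a unique probe to one).
    """
    candidates = set().union(*probe_db.values())
    for probe, species in probe_db.items():
        if len(candidates) <= 1:
            break  # nothing left to narrow down
        # ASSUMPTION: substring search instead of a suffix tree.
        if any(probe in read for read in reads):
            candidates &= species
    return candidates


# Hypothetical 8-mer probes standing in for 50-mers.
probe_db = {
    "ACGTACGT": {"species_A", "species_B"},  # shared probe
    "TTGACCAA": {"species_A"},               # unique probe
    "GGGCCCTT": {"species_C"},               # unique probe, not in the reads
}
reads = ["CCACGTACGTAA", "GATTGACCAAGT"]
print(narrow_down(reads, probe_db))
```

Because the probe database is small (about 5 MB according to the slide), this kind of matching can run on the client, which is what allows the analysis to be pushed to the user.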


rMLST

CGE implementation

• For each genome in the training data the 53 ribosomal genes were extracted.

• Genomes in evaluation sets were aligned using BLAT to each gene collection (only hits with at least 95% identity and 95% coverage were considered as a potential match).

• The closest match of the training genomes was selected based on a combination of coverage, % identity, bitscore, number of mismatches and number of gaps in the alignments across all genes.

Jolley KA, Bliss CM, Bennett JS, Bratcher HB, Brehony C, Colles FM, Wimalarathna H, Harrison OB, Sheppard SK, Cody AJ, Maiden MC. Ribosomal multilocus sequence typing: universal characterization of bacteria from domain to strain. Microbiology. 2012 Apr;158(Pt 4):1005-15.
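The 95% identity / 95% coverage cut-off described for the CGE rMLST implementation can be sketched as a simple filter over alignment hits. The record layout, gene and genome names, and values below are hypothetical; only the two thresholds come from the slide.

```python
def potential_matches(hits, min_identity=95.0, min_coverage=95.0):
    """Keep only hits that meet both thresholds (values in percent)."""
    return [
        hit for hit in hits
        if hit["identity"] >= min_identity and hit["coverage"] >= min_coverage
    ]


# Hypothetical BLAT hits of one evaluation genome against ribosomal gene collections.
hits = [
    {"gene": "gene_1", "genome": "train_A", "identity": 99.1, "coverage": 100.0},
    {"gene": "gene_1", "genome": "train_B", "identity": 94.9, "coverage": 100.0},
    {"gene": "gene_2", "genome": "train_A", "identity": 97.0, "coverage": 90.0},
]
kept = potential_matches(hits)
print([(hit["gene"], hit["genome"]) for hit in kept])
```

Only the first hit survives: the second fails on identity and the third on coverage. The surviving hits would then be ranked across all genes to pick the closest training genome.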


Results

(16S rRNA)


Overlap in predictions


Isolates in the NCBIdrafts set for which all four methods predict the species to be different from the annotated one. * NZAEPO00000000 has been re-annotated as S. oralis since we downloaded the data.


Speed

Method Estimated speed (mm:ss)

16S 00:13*

KmerFinder 00:09*

TaxonomyFinder 11:33*

rMLST 00:45*

Reads2Type 00:55**

* Estimation based on draft genomes

** Estimation based on short reads


Summary of taxonomy benchmark study

• KmerFinder had the highest accuracy and was the fastest method.

• SpeciesFinder (16S rRNA-based) had the lowest accuracy.

• Methods that only sample genomic loci (16S, Reads2Type, rMLST) had difficulties distinguishing species that only recently diverged, especially when the main difference is a plasmid.


Tools for further typing

URL paths are relative to https://cge.cbs.dtu.dk/services/

• MLST: Multilocus sequence typing. URL path: MLST. Publication: Apr 2012, PMID: 22238442

• PlasmidFinder: Identification of plasmids in Enterobacteriaceae. URL path: PlasmidFinder. Publication: Apr 2014, PMID: 24777092

• pMLST: pMLST of plasmids in Enterobacteriaceae. URL path: pMLST. Publication: Apr 2014, PMID: 24777092


Multilocus Sequence Typing (MLST)

First developed in 1998 for Neisseria meningitidis (Maiden et al. PNAS 1998. 95:3140-3145)

The nucleotide sequences of internal regions of approx. 7 housekeeping genes are determined by PCR followed by Sanger sequencing

Different alleles are each assigned an arbitrary number

The unique combination of alleles is the sequence type (ST)
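The relation between allele numbers and sequence type can be sketched as a lookup of the ordered allele combination in a profile table. The locus names, allele numbers, and ST numbers below are hypothetical placeholders, not taken from any real MLST scheme.

```python
# Hypothetical 7-locus scheme; real schemes use named housekeeping genes.
LOCI = ("locus1", "locus2", "locus3", "locus4", "locus5", "locus6", "locus7")

# Profile table: allele combination (in LOCI order) -> sequence type.
PROFILES = {
    (1, 1, 1, 1, 1, 1, 1): 1,
    (1, 1, 2, 1, 1, 3, 1): 2,
}


def sequence_type(alleles):
    """Return the ST for a dict of locus -> allele number, or None if the
    combination is not in the profile table (a novel combination)."""
    key = tuple(alleles[locus] for locus in LOCI)
    return PROFILES.get(key)


isolate = {"locus1": 1, "locus2": 1, "locus3": 2, "locus4": 1,
           "locus5": 1, "locus6": 3, "locus7": 1}
print(sequence_type(isolate))
```

A single differing allele gives a different key, so two isolates share an ST only when all loci carry the same allele numbers.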


Using WGS data for MLST


www.cbs.dtu.dk/services/MLST

• Assembled genome

• 454 – single end reads

• 454 – paired end reads

• Illumina – single end reads

• Illumina – paired end reads

• Ion Torrent

• SOLiD – single end reads

• SOLiD – mate pair reads

Acinetobacter baumannii #1, Acinetobacter baumannii #2, Arcobacter, Borrelia burgdorferi, Bacillus cereus, Brachyspira hyodysenteriae, Bifidobacterium, Brachyspira intermedia, Bordetella, Burkholderia pseudomallei, Brachyspira, Burkholderia cepacia complex, Campylobacter jejuni, Clostridium botulinum, Clostridium difficile #1, Clostridium difficile #2, Campylobacter helveticus, Campylobacter insulaenigrae, Clostridium septicum, C. diphtheriae, Campylobacter fetus, Chlamydiales,

Campylobacter lari, Cronobacter, C. upsaliensis, Escherichia coli #1, Escherichia coli #2, Enterococcus faecalis, Enterococcus faecium, F. psychrophilum, Haemophilus influenzae, Haemophilus parasuis, Helicobacter pylori, Klebsiella pneumoniae, Lactobacillus casei, Lactococcus lactis, Leptospira, Listeria, Listeria monocytogenes, Moraxella catarrhalis, Mannheimia haemolytica, Neisseria, P. gingivalis, P. acnes,

Pseudomonas aeruginosa, Pasteurella multocida, Pasteurella multocida, Staphylococcus aureus, Streptococcus agalactiae, Salmonella enterica, Staphylococcus epidermidis, S. maltophilia, Streptococcus pneumoniae, Streptococcus oralis, S. zooepidemicus, Streptococcus pyogenes, Streptococcus suis, Streptococcus thermophilus, Streptomyces, Streptococcus uberis, Vibrio parahaemolyticus, Vibrio vulnificus, Wolbachia, Xylella fastidiosa, Y. pseudotuberculosis


Extended Output


Extended Output

aro: WARNING, Identity: 100%, HSP/Length: 349/498, Gaps: 0, aro_122 is the best match for aro
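The fields of such an extended-output line can be pulled apart with a regular expression, which makes the HSP-to-allele-length ratio explicit. The pattern below is an assumption written against this one example line, not a documented output format, and reading the WARNING as a consequence of the partial alignment is likewise only a plausible interpretation.

```python
import re

line = ("aro: WARNING, Identity: 100%, HSP/Length: 349/498, Gaps: 0, "
        "aro_122 is the best match for aro")

# ASSUMPTION: field order and wording are as in the single example line above.
pattern = re.compile(
    r"(?P<locus>\w+): (?P<status>\w+), "
    r"Identity: (?P<identity>[\d.]+)%, "
    r"HSP/Length: (?P<hsp>\d+)/(?P<length>\d+), "
    r"Gaps: (?P<gaps>\d+), "
    r"(?P<allele>\w+) is the best match"
)

match = pattern.match(line)
coverage = int(match["hsp"]) / int(match["length"])
print(f"{match['allele']}: identity {match['identity']}%, "
      f"{coverage:.1%} of the allele length aligned")
```

Here the alignment is identical over its whole extent but spans only 349 of the 498 allele positions, roughly 70%, so the best match is reported without full-length support.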


What is the MLST web-service used for?


PlasmidFinder and pMLST

The PlasmidFinder database contains replicons, not entire plasmids.


Tools for phenotyping

URL paths are relative to https://cge.cbs.dtu.dk/services/

• ResFinder: Identification of acquired antibiotic resistance genes. URL path: ResFinder. Publication: Nov 2012, PMID: 22782487

• VirulenceFinder: Identification of virulence genes in E. coli (and S. aureus and Enterococcus). URL path: VirulenceFinder. Publication: E. coli published Feb 2014, PMID: 24574290

• MyDbFinder: Identification of genes from the user's own database. URL path: MyDbFinder. Publication: will be published in book chapter

• PathogenFinder: Prediction of pathogenic potential. URL path: PathogenFinder. Publication: Oct 2013, PMID: 24204795


ResFinder

Figure labels (workflow diagram): input (NGS: Illumina, Ion Torrent, 454, ...; Sanger; FASTA), assembly pipeline, ResFinder (BLAST), resistance gene profile (list of genes, accession numbers, theoretical resistance phenotype).


200 isolates from 4 different species (Salmonella Typhimurium, Escherichia coli, Enterococcus faecalis and Enterococcus faecium)

ResFinder, 98% ID, 60% length coverage

Phenotypic tests, 3,051 in total

• 482 Resistant

• 2,569 Susceptible

=> 99.74% of the results were in agreement between ResFinder and the phenotypic tests

23 discrepancies -> 16, typically in relation to spectinomycin in E. coli


Alternatives to ResFinder


Unpublished or uncategorized

URL paths are relative to https://cge.cbs.dtu.dk/services/

• PanFunPro: Groups homologous proteins based on functional domain content. URL path: PanFunPro. Status: Online. Publication: F1000Research 2013, 2:265

• SerotypeFinder: Identification of serotypes. URL path: SerotypeFinder-1.0. Status: Online. Publication: not yet published

• Restriction-ModificationFinder: Identification of RM system genes. URL path: Restriction-ModificationFinder. Status: Online. Publication: will only be published in book chapter

• HostPhinder: Prediction of the host of a bacteriophage. URL path: HostPhinder. Status: Online, but under development. Publication: not yet published

• MetaVirFinder: Identification of viruses in metagenomic data. URL path: MetaVirFinder. Status: Online, but under development. Publication: not yet published

• MGmapper: Identifies the content of metagenomic samples. URL path: MGmapper. Status: Online, but under development. Publication: not yet published


Tools for phylogeny

URL paths are relative to cge.cbs.dtu.dk/services

• SnpTree: Creation of phylogenetic trees based on SNPs. URL path: snpTree. Status: Online. Publication: Dec 2012, PMID: 23281601

• CSIPhylogeny: Creation of phylogenetic trees based on SNPs. URL path: CSIPhylogeny. Status: Online. Publication: Planned

• NDtree: Creation of phylogenetic trees. URL path: NDtree. Status: Online. Publication: Feb 2014, PMID: 24505344


Web-service usage


Type of data uploaded to MLST web-service

• 454, single reads

• 454, paired-end

• Ion torrent

• Illumina, single reads

• Illumina, paired-end reads

• Assembled draft genomes
