1/42
Identifying Bacteriophages in Metagenomic Data Sets
RCAM 2017
Vanessa Jurtz
Technical University of Denmark
October 17, 2017
2/42
Contents
- Phages and why they matter
- MetaPhinder - Identifying phages
- Further characterization of sequences
- Phage cocktail data sets
3/42
Phages and why they matter
“Bacteria rule the world and phages rule bacteria” - Mya Breitbart
- most abundant organisms in the biosphere (estimated 10^31)
- outnumber bacteria 10:1
- kill up to 50% of bacteria produced every day
- impact biogeochemical cycling of key elements such as carbon, nitrogen and phosphorus
4/42
Phages and why they matter
Figure adapted from http://upload.wikimedia.org/wikipedia/commons/e/e7/Phage
5/42
Phages and why they matter
- 13 families based on morphology and nucleic acid composition (ICTV)
- phage genomes: single- or double-stranded DNA or RNA
- most phages sequenced to date belong to the order Caudovirales (tailed dsDNA phages)
6/42
Phages and why they matter
Figure adapted from Ceyssens et al. 2010, Introduction to Bacteriophage Biology and Diversity; ASM Press
7/42
Phages and why they matter
Phages have greatly impacted our understanding of biology!
- central dogma of molecular biology: DNA → RNA → proteins
- first genomes to be sequenced: phage MS2 (ssRNA) in 1976 and phage φX174 (ssDNA) in 1977
- phage typing, phage display, CRISPR-Cas
Figure adapted from https://www.quora.com/How-are-scientists-able-to-identify-specific-bacteria
8/42
Phages and why they matter
- Phages were discovered by F. Twort (1915) and F. d’Herelle (1917)
- F. d’Herelle was the first to apply phages for therapeutic purposes
- G. Eliava founded the Eliava Institute in Tbilisi, Georgia in 1923
Photos: Felix d’Herelle, Frederick Twort, George Eliava
9/42
Phages and why they matter - Phage therapy
- antibiotics are easy to produce and store
- antibiotic resistance causes problems → post-antibiotic era (WHO 2014)
- phages are specific to certain bacteria
- safety concerns (integrases, virulence factors)
- difficult licensing in Western countries
- complex and dose-independent pharmacokinetics
Photo: Eliava Institute
10/42
Challenges in phage identification and characterization
- small phage genome size
- phages contribute around 2-5% of total DNA in a metagenomic sample*
- few fully sequenced phage genomes in public databases (< 6000)
- little annotation in general (protein function, host etc.)
* https://www.ncbi.nlm.nih.gov/pubmed/22864264
11/42
MetaPhinder
identifying phage sequences in metagenomic samples by database comparison
12/42
MetaPhinder - similar methods
MetaPhinder’s only aim is to identify phage contigs, so the method itself remains very simple.
13/42
MetaPhinder
14/42
MetaPhinder
mosaic genomes:
genomic rearrangement:
15/42
MetaPhinder
ANI = average nucleotide identity
N = number of hits
id = blastn identity
al = alignment length
mcov = merged coverage
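A minimal sketch of how these quantities could combine into a per-contig score. The definitions above do not spell out the formula, so the form used here (alignment-length-weighted mean identity multiplied by merged coverage) is an assumption, and the `Hit` type and function names are hypothetical rather than MetaPhinder's own code. Merged coverage is computed by merging overlapping hit intervals on the contig, so a region hit by several database phages counts only once.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Hit:
    """One blastn hit of the query contig (hypothetical type for illustration)."""
    identity: float  # id: blastn identity as a fraction in [0, 1]
    q_start: int     # 1-based inclusive start on the contig
    q_end: int       # 1-based inclusive end on the contig

    @property
    def al(self) -> int:
        # Alignment length, approximated by the span on the query (gaps ignored).
        return self.q_end - self.q_start + 1


def merged_coverage(hits: List[Hit], contig_length: int) -> float:
    """mcov: fraction of contig positions covered by at least one hit."""
    covered = 0
    cur_start, cur_end = None, None
    for h in sorted(hits, key=lambda hit: hit.q_start):
        if cur_end is None or h.q_start > cur_end + 1:
            # Close the previous merged interval and start a new one.
            if cur_end is not None:
                covered += cur_end - cur_start + 1
            cur_start, cur_end = h.q_start, h.q_end
        else:
            # Overlapping or adjacent hit: extend the current interval.
            cur_end = max(cur_end, h.q_end)
    if cur_end is not None:
        covered += cur_end - cur_start + 1
    return covered / contig_length


def percent_ani(hits: List[Hit], contig_length: int) -> float:
    """Assumed form: 100 * (sum(id_i * al_i) / sum(al_i)) * mcov."""
    if not hits:
        return 0.0
    total_al = sum(h.al for h in hits)
    ani = sum(h.identity * h.al for h in hits) / total_al
    return 100.0 * ani * merged_coverage(hits, contig_length)


# Two overlapping hits on a 200 bp contig (made-up numbers).
example_hits = [Hit(0.9, 1, 100), Hit(0.8, 51, 150)]
print(round(percent_ani(example_hits, 200), 2))
```

Multiplying by merged coverage keeps a short, near-perfect hit on a long contig from producing a high score by itself.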
16/42
MetaPhinder
- Can we find a %ANI threshold to classify a contig as phage?
- Which method should be used for database comparison?
17/42
MetaPhinder
Which method should be used for database comparison?
Figure: AUC (0.7 to 1.0) by contig length bin (0.5-5, 5-25, 25-50, 50-100, 100+ kbp) for blastn, KmerFinder and tBLASTx
18/42
MetaPhinder
Can we find a %ANI threshold to classify a contig as phage?
Figure, panel A: ROC curve (true positive rate vs. false positive rate), AUC: 0.969
Figure, panel B: true positive rate and false positive rate vs. threshold [%ANI]
threshold = 1.7 %ANI
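How a candidate threshold can be evaluated, as a minimal sketch: given %ANI scores for contigs with known labels, the true and false positive rates at a cutoff follow directly, and the AUC equals the probability that a randomly chosen phage contig scores higher than a randomly chosen non-phage contig. The scores below are made up, the function names are hypothetical, and treating "score > threshold" as a phage call is an assumption.

```python
from typing import List, Tuple


def rates_at_threshold(scores: List[float], is_phage: List[bool],
                       threshold: float) -> Tuple[float, float]:
    """Return (true positive rate, false positive rate) for calls score > threshold."""
    tp = sum(1 for s, y in zip(scores, is_phage) if y and s > threshold)
    fp = sum(1 for s, y in zip(scores, is_phage) if not y and s > threshold)
    pos = sum(1 for y in is_phage if y)
    neg = len(is_phage) - pos
    return tp / pos, fp / neg


def auc(scores: List[float], is_phage: List[bool]) -> float:
    """Area under the ROC curve via pairwise comparison (ties count one half)."""
    pos = [s for s, y in zip(scores, is_phage) if y]
    neg = [s for s, y in zip(scores, is_phage) if not y]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data: three phage contigs followed by three bacterial contigs.
toy_scores = [45.0, 12.3, 1.0, 0.0, 2.5, 0.4]
toy_labels = [True, True, True, False, False, False]
tpr, fpr = rates_at_threshold(toy_scores, toy_labels, threshold=1.7)
print(tpr, fpr, auc(toy_scores, toy_labels))
```

The pairwise AUC is quadratic in the number of contigs; for large benchmarks a rank-based computation gives the same value more cheaply.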
19/42
MetaPhinder
20/42
MetaPhinder
Predicting prophage data sets:
21/42
MetaPhinder
Practical experience on a data set of sewage samples from all over the world:
→ %ANI threshold is too low
→ development of MetaPhinder version 2
22/42
MetaPhinder
No threshold specification!
- no need to redefine threshold if database is updated
- contig selection left at discretion of user
23/42
MetaPhinder
24/42
MetaPhinder
min. 10%ANI and %ANI > bacterial coverage
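Applied as a post-filter on per-contig results, the criterion above could look like the following minimal sketch. It assumes that %ANI and bacterial coverage are both percentages on the same 0-100 scale; the function name, the tuple layout and the contig names are hypothetical and do not reflect MetaPhinder's actual output format.

```python
from typing import List, Tuple


def is_phage_candidate(percent_ani: float, bacterial_coverage: float,
                       min_ani: float = 10.0) -> bool:
    """Keep a contig if %ANI is at least min_ani and exceeds its bacterial coverage."""
    return percent_ani >= min_ani and percent_ani > bacterial_coverage


# (contig name, %ANI against phage DB, coverage by bacterial DB); made-up values.
results: List[Tuple[str, float, float]] = [
    ("contig_1", 35.0, 5.0),   # kept: well above 10 %ANI and above bacterial coverage
    ("contig_2", 4.0, 0.0),    # dropped: below the 10 %ANI minimum
    ("contig_3", 20.0, 60.0),  # dropped: looks more bacterial than phage
]
selected = [name for name, ani, bac in results if is_phage_candidate(ani, bac)]
print(selected)
```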
25/42
MetaPhinder
MetaPhinder limitations:
- small size of phage database
- no discovery of completely new phages possible
- removal of prophage kmers from bacterial DB incomplete (due to incomplete annotation)
What about prophages?
- MetaPhinder is not designed for prophage annotation
- use specialized software: PHASTER, VirSorter, PhiSpy etc.
26/42
Further characterization of sequences
27/42
VirulenceFinder
Searches for virulence genes of Listeria, S. aureus, E. coli and Enterococcus using blastn.
Webservice: https://cge.cbs.dtu.dk/services/VirulenceFinder/
28/42
ResFinder
ResFinder identifies acquired antimicrobial resistance genes.
Webservice: https://cge.cbs.dtu.dk/services/ResFinder/
29/42
VirulenceFinder and ResFinder results
30/42
HostPhinder
Julia Villaroel (PhD student, DTU)
HostPhinder identifies the bacterial host of a query phage genome based on its genomic similarity to a database of phage genomes with known host.
Webservice: https://cge.cbs.dtu.dk/services/HostPhinder/
31/42
HostPhinder
- kmer-based comparison to database
- calculate coverage
- use scoring criterion where normalized coverages of database hits with the same host are summed
- correct predictions: genus 81%, species 74%
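The scoring idea in the list above can be sketched as follows. This is an illustration only: normalizing each hit's coverage by the best hit's coverage is an assumption, and all names are hypothetical rather than HostPhinder's actual implementation.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def predict_host(hits: List[Tuple[str, float]]) -> Tuple[str, Dict[str, float]]:
    """Predict a host from (host of database phage, coverage of query) pairs.

    Coverages are normalized by the best coverage (an assumed normalization),
    then summed per host; the host with the highest sum is returned.
    """
    if not hits:
        raise ValueError("no database hits")
    best = max(cov for _, cov in hits)
    if best <= 0:
        raise ValueError("all coverages are zero")
    scores: Dict[str, float] = defaultdict(float)
    for host, cov in hits:
        scores[host] += cov / best
    winner = max(scores, key=scores.get)
    return winner, dict(scores)


# Made-up hits: the single best hit is an Escherichia phage, but two
# Salmonella phages together accumulate a higher summed score.
toy_hits = [("Escherichia", 0.8), ("Salmonella", 0.5), ("Salmonella", 0.4)]
host, host_scores = predict_host(toy_hits)
print(host, host_scores)
```

Summing over hits lets several moderately similar database phages with the same host outweigh one slightly better hit with a different host, which is the point of the criterion as opposed to taking the top hit alone.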
32/42
HostPhinder
HostPhinder can only predict hosts that are part of the database!
33/42
Phage Cocktail data sets
Phage cocktail: a phage solution for medical application consisting of several different phage species
34/42
Phage Cocktail data sets
Henrike Zschach (PhD student, DTU)
Julia Villaroel (PhD student, DTU)
INTESTI cocktail:
- active against E. coli, Enterococcus, Proteus, P. aeruginosa, Shigella, Salmonella, Staphylococcus
- in use since 1937 (regularly updated, every 6 months)
- against intestinal infections
- analysis in 2015/2016
PYO cocktail:
- active against Staphylococcus, Streptococcus, Proteus, E. coli, P. aeruginosa
- against skin or wound infections
- analysis in 2017
35/42
Phage Cocktail data sets
INTESTI PYO
36/42
Phage Cocktail data sets
INTESTI PYO
- predicted hosts correspond well with advertised specificity
- no harmful genes discovered
37/42
Phage Cocktail data sets
PYO cocktail: which DB phage is most similar to a given bin?
→ reverse engineer MetaPhinder!
38/42
Conclusion
- MetaPhinder compares contigs to a phage database
- new version also compares sequences to a bacterial database
- flexibility: users can create their own database
- small number of sequenced phages in public databases
- phage therapy provides an alternative to antibiotics, so a better understanding of phages is important
39/42
Acknowledgments
Morten Nielsen (Professor, DTU)
Henrike Zschach (PhD student, DTU)
Julia Villaroel (PhD student, DTU)
Mette Voldby Larsen (CEO, GoSeqIt)
Ole Lund (Professor, DTU)
Frank Møller Aarestrup (Professor, DTU)
40/42
e-value
Figure: AUC (0.7 to 1.0) by contig length bin (0.5-5, 5-25, 25-50, 50-100, 100+ kbp) for blastn %ANI with e-value cutoffs 0.05, 1 and 1e-10
41/42
KmerFinder vs. blastn
Figure: scatter plot of blastn %ANI (0 to 100) against KmerFinder q_cov (0.0 to 1.0), with reference curves %ANI = 100 · q_cov^(1/16) and %ANI = 100 · q_cov
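One way to read the two reference curves in the KmerFinder vs. blastn plot: if mismatches are independent and each base matches with probability p, a k-mer of the query is found in the database sequence only when all k bases match, so the expected k-mer coverage is about p^k and, conversely, p is about q_cov^(1/k). The exponent 1/16 suggests k = 16, which is treated as an assumption here; the function names are hypothetical.

```python
def expected_kmer_coverage(identity: float, k: int = 16) -> float:
    """Expected fraction of shared k-mers for a per-base identity in [0, 1]."""
    return identity ** k


def percent_ani_from_kmer_coverage(q_cov: float, k: int = 16) -> float:
    """Invert the model: %ANI = 100 * q_cov^(1/k)."""
    return 100.0 * q_cov ** (1.0 / k)


# At 95% identity only a minority of 16-mers is expected to match exactly,
# which is why k-mer coverage drops much faster than alignment identity.
q_cov = expected_kmer_coverage(0.95)
print(round(q_cov, 3), round(percent_ani_from_kmer_coverage(q_cov), 2))
```

Under this model the linear curve %ANI = 100 · q_cov corresponds to k = 1, so the two curves bracket how strongly exact k-mer matching penalizes scattered mismatches.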
42/42
Top hit ANI vs. ANI all
Figure: scatter plot of %ANI over all hits against %ANI of the top hit (both axes 0 to 100), with marginal density plots (0.00 to 0.03)