Faculty of Medicine and Health Sciences
Genomic insights into the emergence and spread of ‘high-risk’
Klebsiella pneumoniae and Acinetobacter baumannii clones
Thesis submitted for the degree of doctor in Medical Sciences at the
University of Antwerp to be defended by
Mattia PALMIERI
Supervisors:
Prof. Herman Goossens
Prof. Alex van Belkum
Dr. Pieter Moons
Antwerp, 2020
Genomische inzichten in het ontstaan en de verspreiding van
“hoog-risico” Klebsiella pneumoniae en Acinetobacter baumannii
klonen
List of abbreviations
List of figures
While antibiotics still represent the major antibacterial agents for the treatment of bacterial
infections, an increasing number of bacteria is becoming (multi-drug) resistant (MDR), complicating
the treatment of infections. Carbapenems are highly effective antibiotics commonly used for the
treatment of severe bacterial infections of MDR bacteria, which are resistant to first-line antibiotics.
Of major concern, carbapenem resistance is on the rise, and in some countries it is so high that other
drugs, usually reserved as last options, are widely used. As an example, colistin, an old drug that was
essentially unused due to its toxicity, is now commonly used in some countries, and resistance
toward this antibiotic is on the rise.
Of the several pathogens associated with MDR, carbapenem-resistant K. pneumoniae and A.
baumannii represent major concerns. Both pathogens frequently cause outbreaks of infections, while
strains which are resistant to all available antibiotics are emerging. Concerning K. pneumoniae, a
novel kind of superbug has been emerging recently. While MDR K. pneumoniae clones causing
hospital outbreaks and hypervirulent, drug susceptible clones causing severe community-acquired
infections were two separate concerns, strains that showed convergence of the two traits are
emerging. Acquisition of hypervirulence and resistance genes has been observed in MDR and
hypervirulent clones, respectively, especially in Asia. Tracking the emergence and evolution of such
novel clones, which cause severe infections with limited treatment options, is fundamental.
The decreasing cost of Whole Genome Sequencing (WGS) is allowing its increased implementation in
bacterial diagnosis. However, there is still a lack of surveillance investigations for last-line resistance
mechanisms and for convergence of resistance and hypervirulence traits. Moreover, while the
phenotype prediction from the genomic data showed encouraging results, the understanding of the
genetic resistance mechanisms of some drugs, such as colistin, is still limited, and novel in silico tools
for the phenotype prediction are needed.
We employed WGS and bioinformatics, together with phenotypic techniques, to address different
problems: i) to decipher the colistin resistance mechanisms and the genomic epidemiology of clinical
isolates of K. pneumoniae and A. baumannii from countries where carbapenem resistance is very high
and colistin represents a life-saving agent; ii) to explore the longitudinal population dynamics of K.
pneumoniae in a major Chinese hospital, focusing on the simultaneous carriage of resistance and
hypervirulence genes; iii) to predict the phenotype of K. pneumoniae strains from their genomes; and
iv) to study a novel carbapenemase-encoding gene obtained from environmental bacteria.
Summary (translated from the Dutch "Samenvatting")
Although antibiotics are the most important antibacterial agents for the treatment of bacterial
infections, an increasing number of bacterial species is becoming (multi-drug) resistant (MDR), which
complicates the treatment of infections. Carbapenems are highly effective antibiotics that are often
used to treat severe infections caused by MDR bacteria that have proved resistant to first-line
antibiotics. Worryingly, carbapenem resistance is increasing, and in some countries it is so high that
other drugs, which are usually used only as a last option, are used on a large scale. Colistin, an old
drug that was mostly not used because of its toxicity, is now commonly used in some countries, and
resistance to this antibiotic is increasing.
Among the various MDR pathogens, carbapenem-resistant Klebsiella pneumoniae and Acinetobacter
baumannii are clinically important examples. Both pathogens frequently cause outbreaks of
infections, while strains that are resistant to all available antibiotics are emerging. In the case of
K. pneumoniae, a new kind of superbug has recently been observed. Whereas MDR and hypervirulence
were normally observed separately in K. pneumoniae clones, clones have now been identified that show
convergence of these two traits. Acquisition of hypervirulence and resistance genes has been seen
especially in Asia. Tracking the emergence and evolution of such novel clones, which cause severe
infections with limited treatment options, is of fundamental importance.
The decreasing cost of Whole Genome Sequencing (WGS) makes it possible to accelerate its
implementation in routine bacterial diagnostics of infectious diseases. However, there is still a lack
of surveillance of existing and new resistance mechanisms and of the convergence of resistance and
hypervirulence traits. Moreover, although phenotype prediction from genomic data has shown
encouraging results, the understanding of resistance mechanisms for some drugs, such as colistin, is
still limited, and new bioinformatic in silico tools for phenotype prediction are needed.
In my thesis I used WGS and bioinformatics, together with phenotypic techniques, to address several
problems. First, I investigated colistin resistance mechanisms and the genomic epidemiology of
clinical isolates of K. pneumoniae and A. baumannii from countries where carbapenem resistance is
extremely high. Second, I studied the longitudinal population dynamics of K. pneumoniae in a large
Chinese hospital, with emphasis on the analysis of local and international spread of resistance and
hypervirulence genes. I analysed and developed methods to predict the phenotype of K. pneumoniae
strains from their genomes. Finally, I studied a novel carbapenemase-encoding gene that was found in
environmental bacteria. The results of these investigations are summarised in this thesis.
List of abbreviations
Abbreviations Full description
ACL adaptive cluster lasso
AMR antimicrobial resistance
AST antimicrobial susceptibility testing
AUC area under the curve
bACC balanced accuracy
CC clonal complex
cDBG compacted De Bruijn Graph
CG clonal group
cKp classical K. pneumoniae
colR/ColR Colistin resistant
ColS colistin susceptible
cps Capsular polysaccharide
CRAB carbapenem resistant A. baumannii
CRKP/CR-Kp carbapenem-resistant K. pneumoniae
dNTP deoxyribonucleotide triphosphate
ESBL extended spectrum β-lactamase
GI gastro-intestinal
GWAS genome-wide association studies
HAI hospital acquired infection
hvKp hyper-virulent K. pneumoniae
IC international clone
ICU intensive care unit
IS insertion sequence
KPC Klebsiella pneumoniae carbapenemases
L-Ara4N L-aminoarabinose
LD linkage disequilibrium
LPS lipopolysaccharide
MAF minor allele frequency
MALDI-TOF MS matrix-assisted laser desorption/ionization–time of flight mass spectrometry
MBL metallo-β-lactamase
MDR multidrug-resistant
MIC minimum inhibitory concentration
ML machine learning
MLST multi-locus sequence typing
NGS Next-Generation Sequencing
NS non-susceptible
OCL outer core locus
ONT Oxford Nanopore Technologies
PBS phosphate-buffered saline
pEtN phosphoethanolamine
PFGE pulsed-field gel electrophoresis
ROC Receiver Operating Characteristic
S susceptible
SMRT single-molecule real-time
SNP single nucleotide polymorphism
UTI urinary tract infection
VNTR variable-number tandem repeat
WGS whole genome sequencing
WHO World Health Organization
ZMW zero-mode waveguide
List of figures
Figure 1. Antibiotic resistance strategies in bacteria. From Erik Gullberg, 2014.
Figure 2. Predicted global deaths due to antimicrobial-resistant infections every year, compared to
other major diseases. From O’Neill, 2014.
Figure 3. WHO priority pathogens list for R&D of new antibiotics. *Enterobacteriaceae include: K.
pneumoniae, E. coli, Enterobacter spp., Serratia spp., Proteus spp., Providencia spp. and
Morganella spp. From Tacconelli et al., 2018.
Table 1. β-lactamases types, including some examples of clinically relevant enzymes.
Figure 4. Regulation pathways of LPS modifications in Klebsiella pneumoniae. From Poirel et al.,
2017.
Figure 5. Four well-characterized virulence factors in classical and hypervirulent K. pneumoniae
strains. From Paczosa and Mecsas, 2016.
Figure 6. Schematic representation of A. baumannii colistin resistance mechanisms. From Trebosc
et al., 2019.
Figure 7. A schematic representation of the hypothetical workflow after adoption of WGS, with low
complexity and an expected turnaround time within one day. Adapted from Didelot et al., 2012.
Figure 8. Overview of the three generations of sequencing technologies, with examples of the
major sequencing platforms. From Loman and Pallen, 2015.
Preface
In this preface, an overview of the contents of each chapter in this thesis is provided, the chapters
that are included as publications are listed, and the direct contributions of the author of this thesis
to each chapter are described.
Chapter 1: General introduction and aims
This is an original overview of the background, key concepts and objectives of this thesis.
Chapter 2: Genomic epidemiology of carbapenem- and colistin-resistant Klebsiella pneumoniae
isolates from Serbia: predominance of ST101 strains carrying a novel OXA-48 plasmid
This chapter is an original work that resulted in a publication in Frontiers in Microbiology (DOI:
10.3389/fmicb.2020.00294). I was first author and the main contributor of the work presented in this
publication.
The nature and extent of the thesis author contributions to this chapter are detailed below:
• I contributed to the design of this published study and interpretation with Prof. Alex van Belkum,
Prof. Marco Maria D’Andrea and Prof. Gian Maria Rossolini.
• I performed all wet lab experiments, including antimicrobial susceptibility testing, MALDI-TOF MS
and DNA extraction.
• I performed library preparations for Nanopore long-read sequencing under supervision by and
assistance from Franck Tarendeau (bioMérieux Grenoble).
• I conducted all epidemiological, phylogenetic, and genomic analysis with Prof. Marco Maria
D’Andrea.
• I was responsible for the planning, drafting, editing, and submission of the manuscript, though all
co-authors also edited the manuscript.
Chapter 3: Abundance of colistin-resistant, OXA-23- and ArmA-producing Acinetobacter baumannii
belonging to International Clone 2 in Greece
This chapter is an original work that resulted in a publication in Frontiers in Microbiology (DOI:
10.3389/fmicb.2020.00668). I was first author and the main contributor of the work presented in this
publication.
The nature and extent of my contributions to this chapter are detailed below:
• I contributed to the design of this published study and interpretation with Prof. Alex van Belkum,
Prof. Marco Maria D’Andrea and Prof. Gian Maria Rossolini. Dr. Nikos Legakis was responsible for the
collection, initial characterization and shipment of the strains. I verified some of the strain
characteristics for reasons of quality control.
• I performed MALDI-TOF MS under supervision by and assistance from Nadine Perrot.
• I performed all wet lab experiments, including antimicrobial susceptibility testing and DNA
extraction.
• I conducted all epidemiological, phylogenetic, and genomic analysis with input from Prof. Marco
Maria D’Andrea.
• I was responsible for the planning, drafting, editing, and submission of the manuscript, though all
co-authors also edited the manuscript.
Chapter 4: Genomic evolution and local epidemiology of Klebsiella pneumoniae from the Beijing
Hospital 301 over a fifteen-year period: dissemination of known and novel high-risk clones
This chapter is an original work that resulted in an in-progress manuscript, soon to be submitted for
publication. I was first author and the main contributor of the work presented in this manuscript.
The nature and extent of my contributions to this chapter are detailed below:
• I conducted all epidemiological, phylogenetic, and genomic analysis together with Dr. Kelly L. Wyres.
• I wrote the first draft of the manuscript and consolidated the editing suggestions made by the co-
authors.
Chapter 5: Interpreting k-mer based signatures for antibiotic resistance prediction
This chapter is an original work that resulted in a submitted manuscript, under revision at the time of
submission of this thesis. I was second author.
The nature and extent of my contributions to this chapter are detailed below:
• I contributed to the design of this submitted study and performed data interpretation with
Dr. Pierre Mahé, Dr. Magali Jaillard and Prof. Alex van Belkum.
• I built the K. pneumoniae database used to test the machine learning algorithm.
• I contributed to the analysis of the data.
• I contributed to the initial writing and editing of the manuscript.
Chapter 6: PFM-like, a novel family of subclass B2 metallo β-lactamase from Pseudomonas
synxantha belonging to the Pseudomonas fluorescens complex
This chapter is an original work that resulted in a publication in Antimicrobial Agents and
Chemotherapy (DOI: 10.1128/AAC.01700-19). I was second author and the main contributor of the
experimental work presented in this publication.
The nature and extent of my contributions to this chapter are detailed below:
• I performed most of the wet lab experiments, including antimicrobial susceptibility testing, gene
cloning, enzyme purification and kinetic analysis of hydrolysis.
• I conducted all bioinformatics analyses.
• I wrote the first draft of the manuscript.
Chapter 7: Summary and future perspectives
This is an original summary of the implication and significance of the work presented in this thesis,
together with a brief general discussion and the future perspectives.
CHAPTER 1: General introduction and aims
1.1 The antimicrobial resistance crisis
The discovery of antibiotics in the early phase of the previous century was one of the most important
developments in medicine and a milestone in the history of modern human society. Before the
introduction of antibiotics, infectious diseases were a major cause of mortality through systemic
infections, sepsis resulting from wound infections, pneumonia and common infections
surrounding childbirth. In the absence of antibiotics, routine clinical practices such as organ
transplants, surgery and cancer chemotherapy would be impossible 1.
As soon as antibiotics were introduced in clinical practice, clinically-relevant antibiotic resistant
bacterial strains were described. These strains emerged due to their ability to rapidly evolve via both
vertical and horizontal inheritance 2.
Moreover, antibiotics have been used inappropriately, in particular outside healthcare settings and
especially in low-income countries. The misuse and overuse of antibiotics has not only been a
problem in human clinical settings, but is also frequent in agriculture, aquaculture and
animal farming. Alarmingly, these drugs are largely used for disease prophylaxis and as growth promoters 3.
This situation has led to selection and propagation of antibiotic resistant strains in many
environments, turning them into reservoirs that contribute to storage, transmission and selection of
new superbugs. Consequently, some infections previously easily manageable are now difficult or
impossible to treat 4. Infections caused by a pathogen resistant to the drug of treatment generally
have a poorer clinical outcome (possibly even death) and are also linked to a greater overall
consumption of healthcare resources, when compared to infections caused by antibiotic-susceptible
organisms 1.
Members of a bacterial species can all be naturally resistant to a specific drug (intrinsic resistance) or
(the) resistance trait(s) can be acquired by susceptible microorganisms (acquired resistance). On a
genetic level, resistance may arise i) endogenously, through random chromosomal point mutations,
often when sub-therapeutic concentrations of antibiotics increase mutability and specifically select
for resistant strains, or ii) exogenously, through horizontal gene transfer, when foreign DNA is
mobilized via conjugative plasmids (conjugation), bacteriophages (transduction), transposons,
insertion sequences and naked DNA (transformation), eventually leading to the recombination of
acquired DNA into the chromosome 2. Concerning the endogenous mechanisms, the process toward
high level resistance is usually stepwise. The antibiotic selection pressure enriches for bacterial
cells with an initial mutation that allows their enhanced survival, followed by subsequent additional
mutations that
confer increased resistance levels during further antibiotic therapy. Though mutation frequencies can
be as low as 10⁻⁸, this is offset by the huge numbers of cells in bacterial colonies 5. Concerning
exogenous mechanisms, the major genetic elements associated with resistance genes are plasmids.
These are nearly ideal carriers for acquisition and dissemination of resistance genes, followed by
transposons, which can move genes between plasmids or chromosomes, and integrons, which can
ease the recruitment and expression of resistance determinants. These elements are widely present
among both Gram-negative and Gram-positive bacterial species and play a crucial role for
dissemination of resistance determinants 6.
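The remark above, that a mutation frequency as low as 10⁻⁸ is offset by large cell numbers, can be made concrete with a short calculation. The following is a minimal sketch, not part of the original work: it assumes that each cell independently carries a given resistance mutation with probability equal to the mutation frequency, and it uses a Poisson approximation for the chance of finding at least one mutant. The population sizes are hypothetical values chosen only for illustration.

```python
import math


def expected_mutants(population_size: float, mutation_frequency: float) -> float:
    """Expected number of cells carrying a given resistance mutation (N * mu)."""
    return population_size * mutation_frequency


def prob_at_least_one_mutant(population_size: float, mutation_frequency: float) -> float:
    """Poisson approximation: P(at least one mutant) = 1 - exp(-N * mu)."""
    return -math.expm1(-population_size * mutation_frequency)


MUTATION_FREQUENCY = 1e-8  # lower bound quoted in the text

# Hypothetical population sizes, for illustration only.
for n_cells in (1e6, 1e8, 1e10):
    expected = expected_mutants(n_cells, MUTATION_FREQUENCY)
    probability = prob_at_least_one_mutant(n_cells, MUTATION_FREQUENCY)
    print(f"N = {n_cells:.0e}: expected mutants = {expected:g}, P(>=1 mutant) = {probability:.3f}")
```

Under these assumptions, a pre-existing mutant becomes likely once the population size approaches the reciprocal of the mutation frequency, which is the sense in which large bacterial populations offset rare mutations.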
From a biochemical point of view, five major mechanisms of resistance can occur in bacteria: i)
decreased antibiotic uptake associated with reduction of membrane permeability (e.g. resistance to
tetracyclines and quinolones); ii) enzymatic inhibition/inactivation of the antibiotic (e.g. resistance to
β-lactams by β-lactamases); iii) rapid efflux of the antibiotic from the cell (e.g. resistance to
tetracyclines and macrolides); iv) target alterations: modification of the cellular structure (receptor)
that the antibiotics target (e.g. resistance to oxacillin and methicillin through acquisition of the mecA
gene, mutations in DNA gyrase resulting in resistance to several fluoroquinolones); and v) acquisition of
one or more alternative metabolic pathways to supplement those inhibited by antibiotics (e.g.
resistance to sulfonamides) 7 (Figure 1). These resistance mechanisms can be present together in
different combinations in a single bacterial cell, potentially allowing high level resistance to multiple
antibiotic compounds simultaneously 8.
Figure 1. Antibiotic resistance strategies in bacteria 9
Ever-growing levels of antimicrobial resistance (AMR) threaten the health benefits achieved with
antibiotics, and this phenomenon is recognised as a global crisis 10. With an estimated 50,000 deaths
across the US and Europe every year attributable to AMR, urgent international action needs to be
taken to preserve the efficacy of modern antibiotic treatments.
Without proactive solutions to prevent the continued escalation of antibiotic resistance, it is
estimated that by 2050 approximately 10 million people will die annually of antimicrobial-resistant
infections, which is more than the cumulative number of people dying today from any other type of
disease 1 (Figure 2).
Figure 2. Predicted global deaths due to antimicrobial-resistant infections every year, compared to other major diseases 1
1.2 The ESKAPE pathogens
The ESKAPE pathogens (Enterococcus faecium, Staphylococcus aureus, Klebsiella pneumoniae,
Acinetobacter baumannii, Pseudomonas aeruginosa and Enterobacter species), although not the only
worrisome pathogens, have been labelled as requiring special attention since they are responsible
for the majority of hospital acquired infections (HAIs), concurrently showing a high prevalence of
AMR 11. The World Health Organization (WHO) has also recently listed twelve bacterial species
against which new antibiotics are urgently needed 12. They describe three categories of pathogens,
namely critical, high and medium priority, according to the urgency of need for new antibiotics
(Figure 3). Carbapenem-resistant A. baumannii and P. aeruginosa, along with extended spectrum
β-lactamase (ESBL)-producing or carbapenem-resistant Enterobacteriaceae (including K. pneumoniae),
were listed in the critical priority list of pathogens.
Figure 3. WHO priority pathogens list for R&D of new antibiotics. *Enterobacteriaceae include: K. pneumoniae, E. coli, Enterobacter spp., Serratia spp., Proteus spp., Providencia spp. and Morganella spp. 12
1.2.1 Klebsiella pneumoniae
K. pneumoniae, belonging to the Enterobacteriaceae family, was first isolated in the late 19th century
and was initially known as Friedlaender’s bacterium 13. From a clinical point of view, the species K.
pneumoniae is the most important member of the genus Klebsiella, which also includes other
clinically relevant species such as K. oxytoca and, even if to a lesser extent, K. rhinoscleromatis and K.
ozaenae 14. Klebsiella spp. are Gram-negative, encapsulated, non-motile bacteria that are able to
readily colonize human mucosal surfaces, including the gastro-intestinal (GI) tract and oropharynx,
even if this colonization appears benign 15. From these sites, this opportunistic pathogen can gain
entry to other tissues where it can cause severe infections in humans. Major diseases include urinary
tract infections, lower respiratory tract infections, intraabdominal infections and bloodstream
infections. Other diseases, such as meningitis and wound infections, are less common 16.
As the best known genus member, K. pneumoniae is a common opportunistic, mostly nosocomial
pathogen, accounting for about one third of all Gram-negative HAIs 17. It is also an important
cause of serious community onset infections such as necrotizing pneumonia, pyogenic liver abscesses
and endogenous endophthalmitis 14.
In healthcare settings, K. pneumoniae infections commonly occur among patients who already suffer
from serious underlying clinical conditions, often together with a state of general immunodeficiency.
Risk factors for K. pneumoniae infections include extremes of age, presence of malignancy, diabetes,
chronic liver disease, recent solid-organ transplantation, and chronic dialysis 18. Other risk factors for
nosocomial infections by K. pneumoniae are treatment with corticosteroids, chemotherapy, organ
transplantation, or other treatments or conditions resulting in neutropenia 19.
Over the last few decades, there has been a concerning rise in the acquisition of resistance to a wide
range of antibiotic classes by “classical” K. pneumoniae strains 20. Consequently, simple infections
such as UTIs have become hard to treat, while more serious infections such as pneumonia and
bacteremia have become increasingly life-threatening 21.
Since the mid-1980s, a novel type of community-acquired invasive K. pneumoniae infection, primarily
in the form of pyogenic liver abscesses, has emerged, mostly in Asian countries 22. K. pneumoniae
strains causing these invasive infections are defined as being hyper-virulent and express a distinct
hyper-mucoviscous phenotype when grown on agar plates 23.
Very recently, strains with a hyper-virulent phenotype have been found to carry antimicrobial
resistance genes including carbapenemases 24 but also mechanisms of resistance against last resort
antibiotics such as colistin 25, creating an alarming scenario given the lack of novel approaches to
treat this kind of superbug.
1.2.1.1 Antimicrobial resistance in K. pneumoniae: the β-lactamases
K. pneumoniae can produce various enzymes that hydrolyze the four-membered ring of β-lactams
and inactivate them. These enzymes include ESBLs, oxacillinases, carbapenemases (including metallo-
and serine-β-lactamases), among others (Table 1). Genes encoding such enzymes are generally
present on plasmids which K. pneumoniae seems to readily acquire. Such plasmids often carry other
genes conferring resistance to other antibiotic classes including aminoglycosides, chloramphenicol,
sulfonamides, trimethoprim, and tetracyclines. Thus, bacteria containing these plasmids are often
multidrug-resistant (MDR) 26.
Type | Ambler class | Features | Enzymes
Narrow-spectrum β-lactamases | A | Hydrolyze penicillins | TEM-1, TEM-2, SHV-1
Extended-spectrum β-lactamases | A | Hydrolyze narrow and extended-spectrum β-lactams | SHV-2, CTX-M-15, VEB-1, PER-1
Serine carbapenemases | A | Hydrolyze carbapenems | KPC-2, KPC-3, IMI-1
Metallo β-lactamases | B | Hydrolyze carbapenems | NDM-1, VIM-1, IMP-1
Cephalosporinases | C | Hydrolyze cephamycins and some oxyimino β-lactams | AmpC, CMY-2, FOX-1
OXA-type enzymes | D | Hydrolyze carbapenems | OXA-48, OXA-232
Table 1. β-lactamase types, including some examples of clinically relevant enzymes.
Two major types of antibiotic resistance have been commonly described in K. pneumoniae, both
involving the production of β-lactamases. The first mechanism, initially described in the late 1980s
concomitantly in Europe 27 and in the US 28, is the production of variants of the SHV-1 or TEM-1
β-lactamases, in which the substitution of only one or two amino acids led to the appearance of
variants that have been termed ESBLs. ESBLs are chromosomally or plasmid-encoded enzymes that
mediate resistance to penicillins, extended-spectrum (third generation) cephalosporins (e.g.
ceftazidime, cefotaxime, and ceftriaxone) and monobactams (e.g. aztreonam), but do not affect
cephamycins (e.g. cefoxitin and cefotetan) or carbapenems (e.g. meropenem and imipenem) 29. The
early SHV and TEM variants have been largely replaced by the CTX-M family of ESBLs, which were
identified in the early 1990s in Western Europe and South America and are currently the most common
type of ESBL in enteric bacteria 30.
The second major mechanism of resistance is the expression of carbapenemases, which renders K.
pneumoniae resistant to all β-lactams, including the carbapenems. Carbapenemases can be classified
on the basis of their amino acid sequence into different molecular classes: class A (e.g. IMI-, SME-,
KPC-type enzymes), class B (of which the main representatives in clinical isolates are the NDM-, IMP-
and VIM-types) and class D β-lactamases (e.g. OXA-48-types, OXA-232-types) 31.
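The molecular classification described above lends itself to a simple lookup. The sketch below is illustrative only: it covers just the carbapenemase families named in this section, and the function name `ambler_class` and the name-parsing rule are assumptions made for the example. The active-site annotation (serine for classes A and D, metallo for class B) follows the serine versus metallo distinction mentioned in the text.

```python
from typing import Optional

# Families named in the text above, mapped to their Ambler class (not exhaustive).
CARBAPENEMASE_CLASS = {
    "IMI": "A", "SME": "A", "KPC": "A",
    "NDM": "B", "IMP": "B", "VIM": "B",
    "OXA-48": "D", "OXA-232": "D",
}

# Class A and D enzymes are serine β-lactamases; class B enzymes are metallo-β-lactamases.
ACTIVE_SITE = {"A": "serine", "B": "metallo", "D": "serine"}


def ambler_class(enzyme: str) -> Optional[str]:
    """Return the Ambler class for a name such as 'KPC-3', or None if the family is not listed."""
    name = enzyme.strip().upper()
    # OXA enzymes are listed as specific types, so try the full name first.
    if name in CARBAPENEMASE_CLASS:
        return CARBAPENEMASE_CLASS[name]
    # Otherwise fall back to the family prefix before the variant number.
    family = name.split("-")[0]
    return CARBAPENEMASE_CLASS.get(family)


for enzyme in ("KPC-2", "NDM-1", "OXA-48", "CTX-M-15"):
    cls = ambler_class(enzyme)
    site = ACTIVE_SITE.get(cls, "not listed as a carbapenemase here")
    print(f"{enzyme}: class {cls} ({site})")
```

Enzymes outside the listed families, such as the ESBL CTX-M-15, return `None` here because the table only encodes the carbapenemases named in this paragraph.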
Klebsiella pneumoniae carbapenemases (KPCs) represent the clinically most relevant mechanism of
acquired antimicrobial resistance observed in K. pneumoniae during recent years. This is due to their
very wide range of activity against several β-lactam families, including penicillins, older and newer
cephalosporins, aztreonam and carbapenems 32.
Several different KPC variants (KPC-2 to KPC-22) have been described, although KPC-2 and KPC-3 are
the most widespread. KPCs are mostly plasmid-encoded enzymes and bacteria carrying these
plasmids are often susceptible to only a few antibiotics such as colistin, aminoglycosides, and
tigecycline.
1.2.1.2 Antimicrobial resistance in K. pneumoniae: colistin resistance
Polymyxins have represented the major antimicrobial therapeutic option against carbapenem-resistant
K. pneumoniae infections over the last decades. Indeed, polymyxin E (colistin) is considered a “last
resort” antimicrobial for the treatment of MDR K. pneumoniae infections, essentially the only drug
that will reach adequate serum levels and exceed the minimum inhibitory concentration (MIC)
of the infecting strain 33.
Consequently, the increasing prevalence of colistin-resistant K. pneumoniae is a major concern,
considering the scarcity of the alternative treatment options and the high mortality rate associated
with carbapenem- and colistin-resistant K. pneumoniae infections 34.
The target of colistin is the outer membrane of Gram-negative bacteria. An electrostatic interaction
occurs between the positively charged colistin molecule on the one side and the phosphate groups of
the negatively charged lipid A on the other side. Divalent cations (Ca²⁺ and Mg²⁺) are consequently
displaced from the negatively charged phosphate groups of membrane lipids 35. Then, the
lipopolysaccharide (LPS) is destabilized, the permeability of the bacterial membrane is increased, and
cytoplasmic leakage ultimately causes cell death 36. Even though LPS is the initial target, the exact
colistin mode of action is still uncertain 37.
Similar to what is observed in bacteria that are naturally resistant to colistin, LPS modification via
addition of cationic groups, i.e. L-aminoarabinose (L-Ara4N) and phosphoethanolamine (pEtN), is
responsible for colistin resistance in K. pneumoniae. A large panel of genes and operons is involved in
qualitative modification of the LPS (Figure 4). The pmrCAB operon encodes the pEtN
phosphotransferase PmrC, the response regulator PmrA, and the sensor kinase protein PmrB. The
pEtN phosphotransferase PmrC adds a pEtN group to the LPS. Environmental stimuli such as ferric
iron (Fe³⁺), aluminium (Al³⁺), and low pH (e.g., pH 5.5) activate PmrB through its periplasmic domain.
The sensor kinase PmrB in turn activates PmrA by phosphorylation. Finally, PmrA activates the
transcription of the pmrCAB operon itself, and also of the pmrHFIJKLM operon and the pmrE gene,
which are also involved in LPS modifications. Specific PmrA/B mutations are responsible for
constitutive activation of the PmrAB two-component system, and have been described as being
responsible for colistin resistance in K. pneumoniae 38.
The pmrHFIJKLM operon encodes seven proteins, and together with the pmrE gene product they are
responsible for the synthesis of L-Ara4N and its coupling to lipid A. The phoPQ operon encodes
the regulator protein PhoP and the sensor protein kinase PhoQ. In a similar way to PmrB, PhoQ
senses environmental stimuli such as low magnesium (Mg²⁺) and low pH (e.g., pH 5.5), which mediate
PhoQ activation through its periplasmic domain. PhoQ in turn activates PhoP by phosphorylation.
Finally, PhoP activates the transcription of the pmrHFIJKLM operon, mediating the addition of
L-Ara4N to the LPS. PhoP can also activate the PmrA protein, either directly or indirectly via the PmrD
connector protein, causing the LPS modification via pEtN addition. Several mutations in the phoP/Q
genes are responsible for constitutive activation of the PhoPQ two-component system and
consequently colistin resistance in K. pneumoniae 38.
MgrB is a small transmembrane protein that acts as a negative regulator of the PhoPQ two-
component system. Inactivation of the mgrB gene leads to overexpression of the phoPQ operon and
consequently colistin resistance. Several missense mutations resulting in amino acid substitutions
and nonsense mutations leading to a truncated MgrB protein have been observed. Insertional
inactivation caused by different insertion sequences (IS), belonging to several families and inserted at
different locations within the mgrB gene, is often responsible for colistin resistance in K. pneumoniae
39,40.
The crrAB operon encodes the regulatory protein CrrA and the sensor protein kinase CrrB, which
regulate the pmrAB expression. Inactivation of the crrB gene leads to overexpression of the pmrAB
operon, finally resulting in colistin resistance 41.
Finally, the plasmid-mediated mcr-1 gene is responsible for horizontal transfer of colistin resistance.
It was initially described in E. coli and K. pneumoniae isolates from Chinese patients between 2011
and 2014 42. The encoded MCR-1 protein is a pEtN transferase, and its acquisition results in the
addition of pEtN to lipid A, similarly to the chromosomal mutations mentioned above. Following mcr-
1, several other variants, up to mcr-9, have been described 43–50.
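As an illustration only, the regulatory relationships described above can be sketched as a small directed graph and queried for the LPS modification that each alteration is expected to trigger. The node names follow the text. The graph encoding, the function names, and the treatment of mgrB and crrB inactivation as simple de-repression of their target regulators are simplifying assumptions made for this sketch, not a model taken from the cited literature.

```python
from collections import deque

# Activation edges taken from the description in the text.
# Assumption: every edge is treated as an unweighted "activates" relation.
ACTIVATES = {
    "PmrB": ["PmrA"],
    "PmrA": ["pmrCAB", "pmrHFIJKLM", "pmrE"],
    "PhoQ": ["PhoP"],
    "PhoP": ["pmrHFIJKLM", "PmrA", "PmrD"],
    "PmrD": ["PmrA"],
    "pmrCAB": ["pEtN addition"],
    "pmrHFIJKLM": ["L-Ara4N addition"],
    "pmrE": ["L-Ara4N addition"],
    "mcr-1": ["pEtN addition"],
}

# Genes whose inactivation leads to overexpression of the listed regulators.
# Assumption: loss of the gene is modelled as activation of these nodes.
DEREPRESSED_BY_LOSS = {
    "mgrB": ["PhoP", "PhoQ"],
    "crrB": ["PmrA", "PmrB"],
}

LPS_MODIFICATIONS = {"pEtN addition", "L-Ara4N addition"}


def downstream(start_nodes):
    """Return every node reachable from start_nodes (breadth-first search)."""
    seen = set(start_nodes)
    queue = deque(start_nodes)
    while queue:
        node = queue.popleft()
        for target in ACTIVATES.get(node, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen


def modifications_from_activation(node):
    """LPS modifications expected when `node` is constitutively active."""
    return downstream([node]) & LPS_MODIFICATIONS


def modifications_from_loss(gene):
    """LPS modifications expected when `gene` is inactivated."""
    return downstream(DEREPRESSED_BY_LOSS[gene]) & LPS_MODIFICATIONS
```

In this toy encoding, acquisition of mcr-1 reaches only the pEtN branch, whereas constitutive PhoQ activation or mgrB inactivation reaches both the pEtN and the L-Ara4N branches, which mirrors the cross-talk between the PhoPQ and PmrAB systems described in the text.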
Figure 4. Regulation pathways of LPS modifications in Klebsiella pneumoniae 37
1.2.1.3 Hyper-virulent K. pneumoniae
Despite rendering bacterial infections more difficult to treat, MDR does not enhance the virulence of
K. pneumoniae strains. However, starting from the 1980s, K. pneumoniae strains with the ability to
cause severe infections in apparently healthy individuals emerged. These strains are defined as
hyper-virulent K. pneumoniae (hvKp) compared to classical K. pneumoniae (cKp) strains as they are
able to infect both healthy and immunocompromised individuals, with resulting infections which are
generally invasive.
Infections were first described in Taiwan and are common on the Asian Pacific Rim. However, new
cases have recently been reported on a more global scale. In contrast to the infections caused by cKp,
most hvKp infections originate in the community 51. While pyogenic liver abscess represents the
major disease, hvKp strains can also cause pneumonia and lung abscesses, among others 52.
Bacteremia is frequent among hvKp-infected patients and is correlated with a significantly poorer
prognosis 53.
Several virulence factors were reported and studied in hvKp strains. Capsule is a polysaccharide
matrix that overlays the cell and it is fundamental for K. pneumoniae virulence. hvKp strains are
characterized by hyper-capsulation which consists of an extensive mucoviscous exopolysaccharide
coating that is thicker and more robust than that of the typical capsule. This hyper-capsule
contributes significantly to the pathogenicity of hvKp 20.
Most hvKp are associated with only two of the 130 reported capsular serotypes, K1 and K2, that were
shown to be particularly anti-phagocytic and serum resistant 20,54. hvKp are also associated with
several other key virulence factors (Figure 5): the rmpA and rmpA2 genes that upregulate capsule
expression thereby aiding the formation of a hyper-capsule which is linked to the hyper-mucoviscous
phenotype; the colibactin genotoxin that induces eukaryotic cell death and promotes bacterial
transfer from the intestines into the blood; the yersiniabactin, aerobactin and salmochelin
siderophores that enhance survival in the blood by promoting iron scavenging 20. Yersiniabactin
synthesis is encoded by the ybt locus that is generally mobilized by an integrative, conjugative
element termed ICEKp. Its prevalence is about 40% in K. pneumoniae and it is frequently acquired
and lost from MDR clones 55. Conversely, the salmochelin (iro), aerobactin (iuc) and rmpA/rmpA2 loci
are usually co-harbored by a virulence plasmid 56. The prevalence of that virulence plasmid is less
than 10% in the K. pneumoniae population, and until recently it was rarely reported among cKp
strains 57.
hvKp strains are generally susceptible to most antimicrobials. However, the last few years have seen
an increasing number of reports of ‘convergent’ K. pneumoniae strains that are both hyper-virulent
(carrying the iuc aerobactin locus, which is recognized as the single most important feature of hvKp
strains 58) and ESBL/carbapenemase producers. The majority of these reports represent sporadic
isolations, but in 2017 Gu and colleagues reported a fatal outbreak in a Chinese hospital caused by a
hyper-virulent carbapenemase-producing K. pneumoniae isolate 59.
Figure 5. Four well-characterized virulence factors in classical and hypervirulent K. pneumoniae strains 20
1.2.2 Acinetobacter baumannii
Acinetobacter baumannii is a Gram-negative coccobacillus recognized as an important opportunistic
human pathogen causing infections of the urinary tract, skin, bloodstream, and soft tissues 60. The
majority of A. baumannii infections occur among critically ill patients in the intensive care unit (ICU)
setting, accounting for as much as 20% of infections in ICUs worldwide 61. MDR phenotypes due to
the acquisition of antibiotic resistance mechanisms represent a major factor of the success of A.
baumannii in hospital environments. Antibiotic modifying enzymes, decreased permeability to
antibiotic molecules, and active efflux pumps are among the major AMR mechanisms. Apart from its
multidrug resistance, the success of A. baumannii can also be attributed to its ability to survive in the
hospital environment 62. Examples of the challenges that A. baumannii faces as an opportunistic
human pathogen include the survival at low temperatures, the exposure to antiseptics and
desiccating agents and the rapid changes of environmental and nutritional conditions when
transferred into the human body from the hospital environment. Therefore, A. baumannii needs to
sense and adapt to these changes in an efficient and prompt manner. A. baumannii also has the
ability to colonize the skin of patients or healthy individuals without causing any apparent illness.
However, transmission of such colonizing bacteria to a susceptible patient can result in immediate
infection.
1.2.2.1 Multidrug-Resistant A. baumannii
The major mechanism of β-lactam resistance in A. baumannii is enzymatic degradation by β-
lactamases. A. baumannii strains are characterized by chromosomally encoded AmpC
cephalosporinases, which are also known as Acinetobacter-derived cephalosporinases (ADCs). The
overexpression of such enzymes in A. baumannii is regulated by the presence of an upstream
insertion sequence (IS) element, the major representative being ISAba1. The presence of this
element correlates with resistance to extended-spectrum cephalosporins due to the increased ADC
production. Cefepime and carbapenems are not hydrolyzed by these enzymes.
ESBLs of the VEB-, PER-, TEM- and CTX-M-type have also been reported in A. baumannii. However,
the assessment of their prevalence is hindered by difficulties with laboratory detection in the
presence of ADCs 60.
The β-lactamases with carbapenemase activity are of major concern and include the serine
oxacillinases (Ambler class D OXA type) and the metallo-β-lactamases (MBLs) (Ambler class B).
The second intrinsic β-lactamase produced by A. baumannii is an oxacillinase, represented by the
OXA-51/69 variants. The OXA-51-like-encoding genes are chromosomally located in A. baumannii and
the carbapenemase activities of OXA-51/69 enzymes have been studied in detail 63,64. However, the
level of expression of the corresponding genes is quite low in most cases, resulting in a minor impact
on β-lactam susceptibility 65.
Identification of a carbapenem-hydrolyzing oxacillinase-encoding gene was first reported in A.
baumannii in 1995 and named blaOXA-23. This enzyme type now represents the major carbapenem
resistance determinant in A. baumannii on a global scale. Two other acquired OXA-type genes giving
rise to the production of proteins with carbapenemase activity have been reported, the blaOXA-24-like
and the blaOXA-58-like carbapenemase genes 65.
IS elements play an important role in oxacillinase-mediated carbapenem resistance in A. baumannii.
These elements provide two major functions. First, they encode a transposase, allowing the
mobilization of the carbapenemase-encoding gene. Second, they can contain promoter regions that
lead to overexpression of downstream genes. IS elements have been frequently described upstream
of blaOXA-23 and blaOXA-58 genes, but they may also promote carbapenem resistance in association with
intrinsic genes such as blaOXA-51. Some IS elements, in particular ISAba1, are relatively unique to A.
baumannii 60.
Aminoglycoside resistance in A. baumannii is mediated by acetyltransferase-, nucleotidyltransferase-,
and phosphotransferase-encoding genes. More alarmingly, 16S rRNA methylation is becoming
common in A. baumannii due to the expression of the armA gene. This resistance mechanism
protects the 30S ribosomal subunit from aminoglycoside binding, conferring high-level resistance to
all clinically useful aminoglycosides, including gentamicin, tobramycin, and amikacin 66.
The major fluoroquinolone resistance mechanism depends on modifications of DNA gyrase or
topoisomerase IV through mutations in the gyrA and parC genes. Such mutations modify the
fluoroquinolone’s target binding site 60.
1.2.2.2 Colistin resistance in A. baumannii
The main mechanism of colistin resistance in A. baumannii corresponds to the addition of cationic
groups to the LPS (Figure 6). Colistin resistance may also be the consequence of a complete loss of
LPS production. However, LPS loss is associated with growth defects and decreased virulence, and for
these reasons very few clinical isolates are LPS deficient 67.
Colistin resistance has been linked to mutations in the two-component transcriptional regulator
genes pmrA/B and consequent pmrC overexpression in most instances. The pEtN phosphotransferase
PmrC adds a pEtN group to the lipid A of the lipopolysaccharide, lowering the net negative charge of
the cell membrane, thus impacting the binding of colistin and preventing the cell membrane leakage.
The complete loss of LPS is caused by alterations of the lipid A biosynthesis genes, namely the lpxA,
lpxC, and lpxD genes. Mutations identified in those genes were either substitutions, truncations,
frameshifts, or insertional inactivation by the insertion sequence ISAba11 37.
Colistin resistance may also result from the overexpression of eptA, a pmrC homolog. This is
mediated by insertional inactivation of a gene encoding an H-NS family transcriptional regulator 68 or
by integration of insertion sequence elements upstream of the eptA gene itself 69–71.
Figure 6. Schematic representation of A. baumannii colistin resistance mechanisms 69.
1.3 Whole Genome Sequencing (WGS): a disruptive diagnostic tool
The current methods of clinical microbiology diagnostics consist mainly of conventional culturing of
clinical samples on different agar plates, followed by antimicrobial susceptibility testing (AST) and
further characterization on a case-by-case basis. The major steps in processing a sample are isolating
a pathogen, determining its species, testing antimicrobial susceptibility and virulence and, in specific
settings, intra-species typing for epidemiological purposes. The first three steps are crucial for the
treatment and management of an infected patient, while the last step is valuable for identifying
outbreaks and improving surveillance. Depending on the pathogen, this practice usually takes one
to two days for culturing, an additional one to two days for species identification and susceptibility
testing, and several days for typing 72. While the species identification and AST can be performed
significantly faster, for example by employing MALDI-TOF MS and rapid disk diffusion after 4-6 hours
of culture 73,74, the overall diagnostic process, including typing, remains complex, time-consuming
and difficult to automate 72.
Several methods for rapid diagnostic testing have been developed and evaluated. Molecular
methods, such as PCR, microarray, and nucleic acid sequencing, have been widely adopted in the
clinical laboratory. These methods are able to identify microorganisms, genes and genetic
polymorphisms with high sensitivity and specificity through detection of specific nucleic acid targets.
Regardless of methodology, molecular diagnostics have the capability to reduce the time to results
and provide more accurate diagnosis. Despite these clear advantages, molecular diagnostic methods
are still expensive, and AST is limited to the detection of few resistance markers 75.
WGS has all the essentials to dramatically revolutionize bacterial diagnosis and surveillance by
replacing current time-consuming and labour-intensive techniques with a single and rapid diagnostic
test (Figure 7). Over the past two decades, huge progress was made in the field of high-throughput
sequencing technologies, and nowadays sequencing the full genome of a bacterial pathogen is
no longer considered challenging or particularly expensive. As a result, WGS is regarded as
the obvious and inevitable future of diagnostics in multiple reviews and opinion articles 72,75–79.
Figure 7. A schematic representation of the hypothetical workflow after adoption of WGS, with low complexity and an expected turnaround time within one day (Adapted from 72).
However, WGS diagnostics is still not widely adopted in clinical microbiology, which may seem in
contrast with the number of applications for which WGS has huge potential, and which are already
widely used in academic research 80.
Some major applications of WGS in diagnosing infectious diseases include:
i) Strain identification and typing. WGS data can be exploited to obtain information concerning the
bacterial species and subtype. WGS can also allow the phylogenetic placement of a given sequence
relative to an existing set of isolates for which the complete genome sequence is also known. WGS-
based strain identification offers a greater resolution compared to current genetic marker-based
approaches such as multi-locus sequence typing (MLST), pulsed-field gel electrophoresis (PFGE), and
variable-number tandem repeat (VNTR) profiling. The greater resolution offered by WGS is also of
major significance for bacteria with large accessory genomes. While the core genome contains the
essential housekeeping genes which are present in all members of a lineage, the accessory genome is
defined as the genome fraction containing nonessential genes. In K. pneumoniae and A. baumannii
most of the relevant genes, like those encoding resistance or virulence, are located in the
accessory genome.
ii) Phenotype prediction. WGS data provide a rich resource that can be exploited to predict the
pathogen’s phenotype. The major bacterial traits of clinical relevance are AMR and virulence, but
may also include other traits such as the ability to form biofilms or survival in the environment.
Concerning AMR prediction, several databases and bioinformatics tools were developed to detect
known genes and mutations associated with a resistance phenotype 81. More recently, the use of
machine learning (ML) techniques was assessed for the antimicrobial susceptibility prediction
without any previous knowledge of the actual AMR determinants involved 82. In general, ML
algorithms work by finding the relevant features in a complex data set that enable strong and reliable
prediction 83. ML algorithms are used to select the genomic features that are relevant to a given
antibiotic susceptibility profile. These relevant genomic features are then used as a phenotype
“classifier” for unknown genomes and as a source for identifying important genomic regions. From a
practical point of view, the counts of overlapping k-mers (subsequences of length ‘k’ contained
within a biological sequence) are computed and combined with the clinical laboratory generated
phenotypic data for each antibiotic to form one large matrix containing both the k-mers and
antibiotics as features. Different algorithms (boosting algorithms, penalized regression models,
decision trees, random forest, neural networks or set cover machines) are then used to build a
predictive model 82.
iii) Tracking outbreaks and identifying sources of recurrent infections. WGS can identify isolates
which are part of an outbreak and, by combining epidemiological data with phylogenetic information,
detect putative transmission events between patients or between patients and the environment.
WGS was successfully employed to reconstruct outbreaks within hospitals and the community
caused by pathogens belonging to several species, including carbapenem-resistant K. pneumoniae 84–
86 and A. baumannii 87. A recent review summarizes the major bioinformatics tools for outbreak
investigations 88.
iv) Improved surveillance. Molecular surveillance and real-time tracking of bacterial disease are
among the major promises of WGS implementation. In order to achieve this, the genomes sequenced
each year together with their metadata (e.g. sampling date, geographic location, isolation host) need
to be shared and methodically archived in an exploitable form. With such data, surveillance
initiatives have the capability to identify the likely geographic origin of emerging bacteria and AMR
genes, to group seemingly unrelated cases into outbreaks, and to clearly identify the emergence of
new clones. In a hospital environment, surveillance can help to detect cross-transmission events
between the hospital and the community and to improve antimicrobial stewardship; on a wider scale,
it can anticipate worldwide emerging trends, consequently enabling anticipatory policy decisions.
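The k-mer based feature construction described under phenotype prediction (item ii) can be sketched in a few lines. This is a minimal illustration, assuming toy sequences and phenotype labels invented for the example; the function names are hypothetical and no particular published pipeline is reproduced. Real genomes would be read from assembly files and k would typically be much larger.

```python
from collections import Counter


def count_kmers(sequence, k):
    """Count overlapping k-mers (substrings of length k) in a DNA sequence."""
    sequence = sequence.upper()
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def build_feature_matrix(genomes, k):
    """Return the sorted k-mer vocabulary and one row of counts per genome."""
    counts = {name: count_kmers(seq, k) for name, seq in genomes.items()}
    vocabulary = sorted(set().union(*counts.values()))
    rows = {name: [c[kmer] for kmer in vocabulary] for name, c in counts.items()}
    return vocabulary, rows


# Hypothetical toy "genomes" and laboratory phenotypes, for illustration only.
genomes = {"isolate_A": "ACGTACGT", "isolate_B": "ACGTTTGT"}
phenotypes = {"isolate_A": "resistant", "isolate_B": "susceptible"}

vocabulary, rows = build_feature_matrix(genomes, k=3)
for name, row in rows.items():
    # Each row of k-mer counts is paired with the measured phenotype;
    # a classifier of the kinds listed in the text would be trained on such pairs.
    print(name, dict(zip(vocabulary, row)), phenotypes[name])
```

Each row of counts, paired with its measured phenotype, is the kind of input on which the model families named in the text could then be trained. The training step itself is left out of this sketch.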
Despite the WGS potential, there are some major bottlenecks to its implementation as a routine
clinical microbiology diagnostic tool. Major limitations include: the cost of performing WGS, which is
still high but keeps falling; a lack of clinical microbiologists with bioinformatics skills; a lack of the
necessary computational infrastructure in most medical settings; the incompleteness of reference
microbial genomics databases required for AMR and virulence determinants detection; and the lack
of standardized, effective and easy-to-use bioinformatics protocols 75,80.
1.3.1 Different WGS platforms
From 2005, novel sequencing technologies emerged under the name of second (or next) generation
sequencing platforms, as opposed to the automated Sanger method, which is a first-generation
technology (Figure 8). Three major technologies, Illumina, SOLiD and 454, were employed to
generate bacterial genomes. From 2011, Illumina displaced the other competitors, and nowadays it
represents the major sequencing platform 89.
Illumina sequencing is based on the sequencing-by-synthesis principle to elucidate the sequence of
DNA. Briefly, DNA polymerases catalyse the incorporation of fluorescently labelled deoxyribonucleotide
triphosphates (dNTPs) into a growing strand complementary to the DNA template during successive cycles of DNA synthesis. During
each cycle, at the point of incorporation, the nucleotides are identified by fluorophore excitation.
This process takes place across millions of fragments in a massively parallel fashion. The size of the
Illumina reads (the fragments of DNA that are sequenced by the instrument) is up to 300 bases. With
appropriate multiplexing, the ordinary coverage for a bacterial genome sequence project is between
30 and 100 reads per base. Illumina read accuracy is typically around 99.9%, although
systematic biases related to GC-rich regions and some specific DNA motifs exist 90. Illumina has
developed several instruments ranging from low-throughput benchtop machines (MiniSeq, MiSeq) to
ultra-high-throughput instruments (HiSeq, NovaSeq). Illumina sequencing is considered short-read
sequencing. Such reads are too short to span repeat elements such as transposons
and insertion sequences, which usually mobilize resistance and virulence determinants.
Consequently, short-read genome assemblies are fragmented and can consist of up to hundreds of
DNA fragments, called contigs. Sequencing technologies producing longer reads can cover such
repeats allowing the complete assembly of bacterial genomes.
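The coverage range quoted above follows from a simple relation: mean coverage is roughly the number of reads multiplied by the read length, divided by the genome size. The sketch below applies it under stated assumptions. The 5.5 Mb genome size (an approximate figure for K. pneumoniae) and the 150-base read length are illustrative values chosen here, not parameters taken from this thesis, and the function names are hypothetical.

```python
import math


def mean_coverage(n_reads, read_length, genome_size):
    """Approximate mean coverage (reads per base) of a sequencing run."""
    return n_reads * read_length / genome_size


def reads_needed(target_coverage, read_length, genome_size):
    """Smallest number of reads that reaches the target mean coverage."""
    return math.ceil(target_coverage * genome_size / read_length)


GENOME_SIZE = 5_500_000  # bases; assumed approximate K. pneumoniae genome size
READ_LENGTH = 150        # bases; assumed, within the 300-base Illumina maximum

for target in (30, 100):
    n = reads_needed(target, READ_LENGTH, GENOME_SIZE)
    print(f"{target}x coverage needs about {n:,} reads of {READ_LENGTH} bases")
```

The relation ignores uneven coverage, such as the GC-related biases mentioned above, so in practice a safety margin over the computed read count is usually planned.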
In 2011, the first single-molecule, third generation long-read sequencing technology was released by
Pacific Biosciences (PacBio), while in 2014 Oxford Nanopore Technologies (ONT) released the MinION
instrument. PacBio’s single-molecule real-time (SMRT) sequencing is also based on the sequencing-by-synthesis
principle, as it detects sequence information during the replication process of the target
DNA molecule. The method is based on the optical observation of the polymerase-mediated
synthesis in real time. A zero-mode waveguide (ZMW), a hole less than half the wavelength of light,
limits fluorescent excitation to only a single polymerase together with its template. Consequently,
only fluorescently labelled nucleotides integrated into the growing DNA chain emit signals of
sufficient duration to be read 91.
SMRT sequencers (RSII, Sequel and Sequel II) have fast run times, typically less than three hours, and
the long reads produced can be longer than 80 Kb. The raw base-called error rate has decreased over
recent years, and is now reduced to < 1% 92. However, the high cost per base compared
with Illumina technologies and the massive cost of a PacBio sequencer represent major obstacles for
the implementation of this technology in the clinical microbiology laboratory 93.
The ONT sequencing principle is based on the passage of single-stranded DNA through a nanopore over which
a voltage is continuously applied. The current through the nanopore changes depending on which
base is passing through it. Such changes can be processed and translated to obtain the sequence of
the DNA molecule that passes through the pore 94. The MinION, the main ONT device, is a small
and portable sequencer that can be used outside of traditional laboratories. Its throughput is up to
30 Gb per run, and it can produce reads longer than 200 Kb. The raw base-called error rate is claimed
to have been reduced to < 5% for nanopore sequences 95. An important feature of the MinION
sequencer is that the output can be analysed during its generation. This allows strain identification
within 30 minutes and prediction of the antibiotic resistance profile within 10 hours after the start of
a run 89.
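The per-base error rates quoted for the three platforms can be compared on the Phred quality scale commonly used for sequencing data, where Q = -10 log10(p) and p is the probability of an incorrect base call. The sketch below converts the figures given in the text. The PacBio and ONT values are the upper bounds stated above (< 1% and < 5%), so the corresponding Q values are lower bounds; the function name is hypothetical.

```python
import math


def phred_quality(error_rate):
    """Convert a per-base error probability into a Phred quality score."""
    return -10 * math.log10(error_rate)


# Error rates as quoted in the text; PacBio and ONT values are upper bounds.
platform_error_rates = {
    "Illumina (about 99.9% accuracy)": 0.001,
    "PacBio SMRT (< 1% raw error)": 0.01,
    "ONT MinION (< 5% raw error)": 0.05,
}

for platform, rate in platform_error_rates.items():
    print(f"{platform}: Q{phred_quality(rate):.1f}")
```

Because the scale is logarithmic, the step from a 1% to a 0.1% error rate corresponds to a gain of ten Q units.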
Figure 8. Overview of the three generations of sequencing technologies, with examples of the major sequencing platforms 96.
1.4 Aims
Antimicrobial resistance is a severe threat to public health worldwide, leading to growing costs,
treatment failure, morbidity and mortality. Nowadays, the antibiotic resistance level of bacterial
strains can be assessed by simple, mostly culture-based clinical AST methods. Although the classic
tests are reliable, they require extensive manual laboratory work and results are normally obtained
after several days only. WGS is a high-throughput DNA sequencing strategy that can produce a large
amount of data in a single reaction. WGS could potentially reduce the turnaround time for laboratory
results and allow clinically actionable information to be obtained sooner than traditional laboratory
diagnostic tests. However, translating genomic information to AST results is challenging. Moreover,
WGS allows for high-resolution epidemiologic investigations, fundamental to track the spread and
the evolution of novel ‘high-risk’ clones.
This research project focuses on the use of WGS in order to study collections of MDR strains obtained
from countries with high AMR rates. The general aim is to study the AMR mechanisms at the
genomic level, with particular focus on last-line drugs, such as colistin, and to perform
epidemiological investigations about the nosocomial spread focusing mainly on clinical A. baumannii
and K. pneumoniae strains.
The research was part of an initiative to define new diagnostic routing in infectious disease under the
name of ND4ID (Novel Diagnostics for Infectious Diseases). This project received funding from the
European Union’s Horizon 2020 research and innovation program under the Marie Sklodowska-Curie
grant agreement No 675412.
The specific aims of this thesis are:
1. To investigate the genetic mechanisms of colistin resistance in K. pneumoniae (CHAPTER 2)
and A. baumannii (CHAPTER 3) from two countries facing high AMR levels. Resistance
mechanism analysis of other antimicrobials, plasmid analysis and genomic epidemiology
investigations were also performed.
2. To study the population of K. pneumoniae isolates collected over a 15-year period in the
Beijing hospital H301 (CHAPTER 4). WGS was employed to decipher the genomic
epidemiology, the AMR and virulence determinants, as well as the emergence of novel ‘high-
risk’ clones, characterized by hyper-virulence and MDR.
3. To build and evaluate a machine learning algorithm for the prediction of antimicrobial
susceptibilities from genomic data (CHAPTER 5), and to test the algorithm’s performance for the
phenotype prediction of K. pneumoniae genomes.
4. To perform classical molecular and enzymology techniques for the cloning, expression and
enzymatic activity testing of a novel carbapenemase. WGS was employed to detect the
putative determinant of carbapenem resistance and its genetic environment and to perform
phylogenetic analysis (CHAPTER 6).
1.5 References
1. O’Neill J. Review on Antimicrobial Resistance. Antimicrobial Resistance: Tackling a Crisis for the
Health and Wealth of Nations, 2014. 2014; 4.
2. Davies J, Davies D. Origins and evolution of antibiotic resistance. Microbiol Mol Biol Rev 2010; 74:
417–33.
3. Aarestrup FM, Wegener HC, Collignon P. Resistance in bacteria of the food chain: Epidemiology
and control strategies. Expert Rev Anti Infect Ther 2008; 6: 733–50.
4. Rice LB. The clinical consequences of antimicrobial resistance. Curr Opin Microbiol 2009; 12: 476–
81.
5. Drlica K, Perlin DS. Antibiotic Resistance: Understanding and Responding to an Emerging Crisis.
Emerg Infect Dis 2011; 17: 1984–1984.
6. Partridge SR, Kwong SM, Firth N, Jensen SO. Mobile Genetic Elements Associated with
96. Loman NJ, Pallen MJ. Twenty years of bacterial genome sequencing. Nat Rev Microbiol 2015; 13:
787–94.
CHAPTER 2: Genomic epidemiology of carbapenem- and colistin-resistant
Klebsiella pneumoniae isolates from Serbia: predominance of
ST101 strains carrying a novel OXA-48 plasmid
Mattia Palmieri1, Marco Maria D’Andrea2,3, Andreu Coello Pelegrin1, Caroline Mirande4, Snezana
Brkic5, Ivana Cirkovic6, Herman Goossens7, Gian Maria Rossolini8,9, Alex van Belkum1
1bioMérieux, Data Analytics Unit, La Balme Les Grottes, France.
2Department of Biology, University of “Tor Vergata”, Rome, Italy.
3Department of Medical Biotechnologies, University of Siena, Siena, Italy.
4bioMérieux, R&D Microbiology, La Balme Les Grottes, France.
5Institute for Laboratory Diagnostics Konzilijum, Belgrade, Serbia.
6Institute of Microbiology and Immunology, Faculty of Medicine, University of Belgrade, Serbia.
7Laboratory of Medical Microbiology, Vaccine and Infectious Disease Institute, University of Antwerp,
Belgium.
8Microbiology and Virology Unit, Florence Careggi University Hospital, Florence, Italy.
9Department of Experimental and Clinical Medicine, University of Florence, Florence, Italy
Published in Frontiers in Microbiology, 21 February 2020, doi: 10.3389/fmicb.2020.00294
2.1 Abstract
Klebsiella pneumoniae is a major cause of severe healthcare-associated infections and often shows
MDR phenotypes. Carbapenem resistance is frequent, and colistin represents a key molecule to treat
infections caused by such isolates. Here we evaluated the antimicrobial resistance mechanisms and
the genomic epidemiology of clinical K. pneumoniae isolates from Serbia. Consecutive non-replicate
K. pneumoniae clinical isolates (n=2,298) were collected from seven hospitals located in five Serbian
cities and tested for carbapenem resistance by disk diffusion. Isolates resistant to at least one
carbapenem (n=426) were further tested for colistin resistance with Etest or Vitek2. Broth
microdilution (BMD) was performed to confirm the colistin resistance phenotype, and colistin-
resistant isolates (N=45, 10.6%) were characterized by Vitek2 and whole genome sequencing. Three
different clonal groups (CGs) were observed: CG101 (ST101, N=38), CG258 (ST437, N=4; ST340, N=1;
ST258, N=1) and CG17 (ST336, N=1). mcr genes, encoding for acquired colistin resistance, were not
observed, while all the genomes presented mutations previously associated with colistin resistance.
In particular, all strains had a mutated MgrB, with MgrBC28S being the prevalent mutation and
associated with ST101. Isolates belonging to ST101 harbored the carbapenemase OXA-48, which is
generally encoded by an IncL/M plasmid that was not detected in our isolates. MinION sequencing
was performed on a representative ST101 strain, and the obtained long reads were assembled
together with the Illumina high quality reads to decipher the blaOXA-48 genetic background. The blaOXA-
48 gene was located in a novel IncFIA-IncR hybrid plasmid, also containing the extended spectrum β-
lactamase-encoding gene blaCTX-M-15 and several other antimicrobial resistance genes. Non-ST101
isolates presented different MgrB alterations (C28S, C28Y, K2*, K3*, Q30*, adenine deletion leading
to frameshift and premature termination, IS5-mediated inactivation) and expressed different
carbapenemases: OXA-48 (ST437 and ST336), NDM-1 (ST437 and ST340) and KPC-2 (ST258). Our
study reports the clonal expansion of the newly emerging ST101 clone in Serbia. This high-risk clone
appears adept at acquiring resistance, and efforts should be made to contain the spread of this
clone.
2.2 Introduction
Klebsiella pneumoniae has emerged as one of the most challenging antibiotic-resistant pathogens,
since it can cause a variety of infections, including pneumonia and bloodstream infections, and
exhibits a remarkable propensity to acquire antimicrobial resistance (AMR) traits. In particular,
carbapenem-resistant K. pneumoniae (CRKP) are challenging pathogens due to the limited treatment
options, high mortality rates, and potential for rapid dissemination in health care settings (Paczosa
and Mecsas, 2016).
Treatment options for CRKP infections are usually limited to aminoglycosides, tigecycline, fosfomycin
and colistin. Novel β-lactam/β-lactamase inhibitor combinations, such as ceftazidime-avibactam and
meropenem-vaborbactam, have represented a major breakthrough for treatment of some CRKP (e.g.
those producing KPC-type and OXA-48-like enzymes), but unfortunately they do not cover strains
producing metallo-carbapenemases (Bassetti et al., 2018). Colistin, despite its nephrotoxicity and
neurotoxicity, remains a key component of some anti-CRKP regimens (Karaiskos et al., 2017).
Colistin resistance (colR) is mainly mediated by modifications of the lipid A moiety of the bacterial
lipopolysaccharide (LPS) by addition of positively charged 4-amino-4-deoxy-L-arabinose (L-Ara4N)
and/or phosphoethanolamine (pEtN) residues. A large panel of genes and operons is involved in
modifications of the LPS, and mutations conferring colistin resistance have mainly been observed in
mgrB, phoP/phoQ, pmrA/pmrB, and crrB genes (Cheng et al., 2010; Cannatelli et al., 2013, 2014a;
Wright et al., 2015). Recently, several plasmid-mediated colistin resistance genes, named mcr,
encoding pEtN transferases, have also been reported in E. coli and other members of
Enterobacterales, including K. pneumoniae (Sun et al., 2018).
Global dissemination of CRKP is mainly caused by the spread of a few successful clones. Major
representatives of these high-risk clonal lineages include the Clonal Group (CG) 11, CG15, CG307,
CG17, CG37, CG101 and CG147 strains. CG258 strains, and in particular those of ST258, are major
players in the worldwide spread of KPC-type carbapenemases, and are responsible for 68% of the
CRKP outbreaks (Navon-Venezia et al., 2017). CG101 strains harbor different clinically-relevant
resistance determinants, such as carbapenemases of the KPC, OXA-48, VIM and NDM types, and
virulence genes, such as an integrative conjugative element carrying the yersiniabactin siderophore
(ICEKp3), the fimbriae cluster (mrkABCDFHIJ), the ferric uptake system (kfuABC), a capsular K type
K17, and an O antigen type of O1 (Roe et al., 2019). These features, together with the ability to
produce biofilm, are likely major factors in the ecological success of CG101 strains. Indeed, spreading
of this clone is on the rise (Navon-Venezia et al., 2017).
Multidrug resistance (MDR) prevalence in clinical isolates of K. pneumoniae, including resistance to
third-generation cephalosporins, fluoroquinolones and aminoglycosides, may be as high as 50% in
Southern Europe, and even higher proportions have been observed in Eastern Europe. In Serbia, in
2016, MDR K. pneumoniae accounted for 63% of all K. pneumoniae infections in humans, of which
35% were also carbapenem resistant (WHO Regional Office for Europe, 2017). Previous studies
reported that NDM-1 was the main K. pneumoniae-associated carbapenemase observed in Serbia in
the period 2013-2014 followed by OXA-48, while KPC was only sporadically reported (Grundmann et
al., 2017; Trudic et al., 2017). Novović et al. performed a molecular epidemiology study of
carbapenem- and colistin-resistant strains from Serbia, showing prevalence of CG258 and CG101
strains, producing NDM-1 and OXA-48 carbapenemases, respectively. However, the proportion of
colistin resistance among those isolates was not reported, and the mechanisms of colistin resistance
of those isolates were not elucidated (Novović et al., 2017).
In this study, we used whole genome sequencing (WGS) to study the genomic epidemiology and
antimicrobial resistance mechanisms of colR K. pneumoniae isolates from Serbia, including some
representatives of the previously mentioned collection as a reference to study the dynamic changes of
population structure (Novović et al., 2017).
2.3 Materials and methods
Bacterial isolates and susceptibility testing. In the period between November 2013 and May 2017, K.
pneumoniae isolates were obtained from routine microbiological cultures of clinical samples (e.g.
urine, blood, skin, bronchial aspirate) from seven Serbian medical centers distributed in five Serbian
cities (Niš, Novi Sad, Belgrade, Kraljevo and Subotica). Bacteria were not isolated by the authors but
provided by the respective medical centers. Therefore, an ethics approval was not required as per
institutional and national guidelines and regulations. Information about patients’ antimicrobial
treatment was not available. Identification at the species level was performed by MALDI-TOF MS
(Vitek MS, bioMérieux, Marcy l’Etoile, France), and carbapenem susceptibility was determined by
disk diffusion and interpreted according to the EUCAST breakpoints (EUCAST, 2019). Isolates non-
susceptible to at least one carbapenem (ertapenem, meropenem and imipenem) were tested for
colistin resistance by Vitek2 or Etest (bioMérieux, Marcy l’Etoile, France) according to the manufacturer’s
instructions (note that the warning by EUCAST about colistin susceptibility testing was only issued in
July 2016, and for this reason the above methods were used for colistin susceptibility testing of the
isolates collected in this study). Antimicrobial susceptibility testing of the colR isolates was
performed using the Vitek2 automated system, and results were interpreted according to EUCAST
breakpoints (EUCAST, 2019). Colistin minimum inhibitory concentrations (MICs) were confirmed
using the broth microdilution method performed according to the CLSI guidelines (CLSI, 2019) and
interpreted by using the EUCAST breakpoints (EUCAST, 2019). For carbapenems (ertapenem,
imipenem and meropenem), MICs were obtained by using Etests (bioMérieux, Marcy l’Etoile, France).
Of note, 25 colR isolates were from the previously described collection by Novović et al., and were
included in this study for comparative purposes.
Mass spectrometry analysis of lipid A. Preparations of lipid A were obtained as previously described
(Kocsis et al., 2017). An aliquot of 0.7 µL of each preparation was spotted on a matrix-assisted laser
desorption/ionization–time of flight mass spectrometry (MALDI-TOF MS) sample plate, mixed with an
isovolume of norharmane matrix (Sigma-Aldrich, St Louis, Missouri) and then air-dried. Samples were
analyzed with a Vitek MS instrument (bioMérieux, Marcy l’Étoile, France) in the negative-ion mode.
DNA extraction and Whole Genome Sequencing. Genomic DNA was extracted with the DNeasy
UltraClean kit (Qiagen, Hilden, Germany), quantified by using the Qubit fluorometer (Thermo Fisher
Scientific, USA) and quality checked by using the 260/280 ratio absorbance parameter as determined
by the DS-11 FX+ instrument (DeNovix, Wilmington, USA). Sequencing was performed using a
NextSeq platform (Illumina, Inc., San Diego, USA) and a 2x150 bp paired-end approach. Raw data
from paired-end sequencing were quality checked with the FastQC tool (v.0.11.6) and assembled
with SPAdes (v.3.10.1) (Bankevich et al., 2012). One representative strain (KB-2017-139) was also
sequenced with the MinION sequencer (ONT, Oxford, UK) using an R9.5.1 flow cell and the protocol
1D Genomic DNA by Ligation (SQK-LSK109). Illumina and Nanopore raw data from KB-2017-139 were
assembled with a hybrid approach using Unicycler (Wick et al., 2017). Whole genome sequencing
data of the 45 clinical isolates have been deposited under BioProject PRJNA449293
(www.ncbi.nlm.nih.gov/bioproject/PRJNA449293). The complete sequence of the plasmid
pSRB_OXA-48 obtained by Illumina and Nanopore sequencing was deposited on GenBank under
accession number MN218814.
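The read processing steps described above could be scripted roughly as follows. This is a minimal sketch and not the commands actually run for this study: the sample and file names are hypothetical, and only the basic input and output options of FastQC (-o), SPAdes (-1, -2, -o) and Unicycler (-1, -2, -l, -o) are used, with every other setting left at its default.

```python
import os
import shutil
import subprocess


def build_commands(sample, r1, r2, long_reads=None):
    """Return command lines for read QC, short-read assembly and optional hybrid assembly."""
    cmds = [
        # Quality check of the paired-end Illumina reads.
        ["fastqc", r1, r2, "-o", f"{sample}_qc"],
        # Short-read assembly.
        ["spades.py", "-1", r1, "-2", r2, "-o", f"{sample}_spades"],
    ]
    if long_reads:
        # Hybrid assembly when Nanopore reads are available for the isolate.
        cmds.append(
            ["unicycler", "-1", r1, "-2", r2, "-l", long_reads, "-o", f"{sample}_unicycler"]
        )
    return cmds


def run(cmds):
    """Run each command in order, skipping tools that are not installed."""
    for cmd in cmds:
        if shutil.which(cmd[0]) is None:
            print("skipping (not installed):", " ".join(cmd))
            continue
        if cmd[0] == "fastqc":
            # FastQC expects its output directory to exist already.
            os.makedirs(cmd[-1], exist_ok=True)
        subprocess.run(cmd, check=True)


# Hypothetical file names for the isolate sequenced on both platforms.
commands = build_commands(
    "KB-2017-139",
    "KB-2017-139_R1.fastq.gz",
    "KB-2017-139_R2.fastq.gz",
    long_reads="KB-2017-139_nanopore.fastq.gz",
)
```

Calling run(commands) would execute the three steps in order on a machine where the tools are installed; the function is deliberately not called here.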
Bioinformatics analysis. MLST was performed in silico by using the tool mlst
(https://github.com/tseemann/mlst) and the Pasteur database (https://bigsdb.pasteur.fr/). BLAST+
(2.7.1) was used to detect mutations in genes potentially involved in colistin resistance (mgrB,
pmrA/B, phoP/Q, crrA/B), and only mutations leading to amino acid variations were considered. For
the characterization of colistin resistance mechanisms, strains of CG258, ST101 and ST336 were
compared to colistin susceptible reference strains of the same CG, i.e. NJST258_2 (accession no.
NZ_CP006918.1), BA33875 (NEWA00000000) and MGH-78578 (NC_009648.1), respectively.
Phylogenetic relatedness was investigated with the parsnp tool (v1.2) (Treangen et al., 2014) by using
default parameters and the strain NTUH-K2044 (accession no. NC_012731.1) as reference. The
phylogenetic tree obtained was visualized with the online tool iTOL (Letunic and Bork, 2016). The
ABRicate tool (https://github.com/tseemann/abricate) was used to detect acquired antimicrobial
resistance genes using the ResFinder database (Zankari et al., 2012), while plasmid replicons were
predicted by PlasmidFinder (Carattoli et al., 2014). Kaptive was used for the capsular type detection
(Wyres et al., 2016). Comparative analysis of plasmids was performed with BLAST Ring Image
Generator (Alikhan et al., 2011) and Easyfig (Sullivan et al., 2011).
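The rule that only mutations leading to amino acid variations were considered can be illustrated with a small sketch. It assumes that the coding sequences of the reference and the isolate have the same length (substitutions only) and uses a short made-up sequence, not the real mgrB gene; the function names are hypothetical and this is not the BLAST+-based procedure used in the study.

```python
# Standard genetic code, built from the canonical TCAG codon ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(cds):
    """Translate a coding sequence codon by codon, ignoring a trailing partial codon."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    return "".join(CODON_TABLE[cds[i:i + 3]] for i in range(0, usable, 3))


def protein_changes(ref_cds, alt_cds):
    """List amino acid changes such as 'C3S'; silent nucleotide changes yield nothing."""
    changes = []
    pairs = zip(translate(ref_cds), translate(alt_cds))
    for pos, (ref_aa, alt_aa) in enumerate(pairs, start=1):
        if ref_aa != alt_aa:
            changes.append(f"{ref_aa}{pos}{alt_aa}")
            if alt_aa == "*":
                # Everything after a premature stop codon is not translated.
                break
    return changes


# Toy coding sequence (hypothetical): Met-Lys-Cys-Leu-stop.
reference = "ATGAAATGTCTGTAA"
missense = "ATGAAATCTCTGTAA"   # TGT -> TCT changes Cys to Ser
silent = "ATGAAATGTCTATAA"     # CTG -> CTA keeps Leu
nonsense = "ATGTAATGTCTGTAA"   # AAA -> TAA introduces a stop codon
```

Insertions and deletions shift the reading frame and would need an alignment step first, which this sketch leaves out.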
For the comparative genomic analysis of ST101 isolates, on 31 October 2018 all the K. pneumoniae
genomes available on NCBI (N=5,820) were downloaded with the ncbi-genome-download tool
(https://github.com/kblin/ncbi-genome-download). MLST was performed and all ST101 (N=195)
(Table S2) together with ST101 strains from this study were used for phylogenetic investigations by
using parsnp and the closed ST101 chromosome from Kp_Goe_121641 (accession no.
NZ_CP018735.1) as reference.
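Selecting the ST101 genomes from a large download amounts to filtering a typing table by sequence type. The sketch below assumes a tab-separated table whose first three columns are file name, scheme and ST, which is the layout the mlst tool prints as far as we know; the file names and the scheme label are placeholders.

```python
import csv
import io


def select_st(mlst_table, wanted_st):
    """Return the file names whose third column (ST) matches the wanted sequence type."""
    reader = csv.reader(io.StringIO(mlst_table), delimiter="\t")
    return [row[0] for row in reader if len(row) >= 3 and row[2] == str(wanted_st)]


# Hypothetical typing output; "-" marks a genome with no assigned ST.
example_table = (
    "genome_A.fasta\tklebsiella\t101\n"
    "genome_B.fasta\tklebsiella\t258\n"
    "genome_C.fasta\tklebsiella\t101\n"
    "genome_D.fasta\tklebsiella\t-\n"
)

st101_genomes = select_st(example_table, 101)
```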
2.4 Results
K. pneumoniae isolates and antimicrobial susceptibilities. In the period between November 2013
and May 2017, a total of 2,298 clinical isolates of K. pneumoniae were isolated from patients
admitted to seven medical settings located in five Serbian cities. Among those, 426 isolates (18.5%)
were non-susceptible to at least one carbapenem by disk diffusion, and were tested for colistin
resistance. A total of 45 strains (10.6%) out of this subset showed a colistin resistant phenotype. At
the time of the collection, colistin susceptibility testing was routinely performed with the Vitek2
instrument or Etest, although these methods had several limitations (Tan and Ng, 2007). Thus, the
number of colR isolates may be underestimated.
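The proportions quoted above follow directly from the isolate counts, as the short check below shows. The overall colistin resistance proportion among all K. pneumoniae isolates is not stated in the text; it is derived here only for illustration.

```python
def percent(part, whole, digits=1):
    """Percentage of `part` in `whole`, rounded to the given number of decimals."""
    return round(100 * part / whole, digits)


total_isolates = 2298              # all clinical K. pneumoniae isolates
carbapenem_nonsusceptible = 426    # non-susceptible to at least one carbapenem
colistin_resistant = 45            # colR among the carbapenem non-susceptible subset

pct_nonsusceptible = percent(carbapenem_nonsusceptible, total_isolates)
pct_colr_of_subset = percent(colistin_resistant, carbapenem_nonsusceptible)
pct_colr_overall = percent(colistin_resistant, total_isolates)  # derived, not reported
```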
All the strains were confirmed as colistin resistant by the broth microdilution method (considering
the EUCAST susceptibility breakpoint of 2 mg/L) with MICs that ranged between 8 and 32 mg/L
(Table S1). Etest results for carbapenems showed that all the strains were resistant to ertapenem,
while meropenem and imipenem had susceptibility rates of 93.3% and 91.1%, respectively. Vitek2
results showed that none of the fluoroquinolones, penicillins combined with β-lactamase inhibitors
and cephalosporins (including cefoxitin and the 4th generation cephalosporin cefepime) were
effective against the 45 colR isolates. Conversely, amikacin (86% susceptibility) and
trimethoprim/sulfamethoxazole (78% susceptibility) were the most active agents together with
imipenem and meropenem (Table S1).
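The colistin interpretation can be written as a one-line rule. The sketch assumes that the 2 mg/L breakpoint quoted above means susceptible at MIC ≤ 2 mg/L and resistant above it; the isolate labels and the susceptible control value are hypothetical.

```python
COLISTIN_BREAKPOINT_MG_L = 2.0  # susceptibility breakpoint quoted in the text


def interpret_colistin(mic_mg_l, bp=COLISTIN_BREAKPOINT_MG_L):
    """Return 'S' when the MIC is at or below the breakpoint, 'R' otherwise."""
    return "R" if mic_mg_l > bp else "S"


# Hypothetical MICs spanning the 8-32 mg/L range reported, plus a susceptible control.
example_mics = {"isolate_1": 8, "isolate_2": 16, "isolate_3": 32, "control": 0.5}
calls = {name: interpret_colistin(mic) for name, mic in example_mics.items()}
```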
Genomic epidemiology. Genome sequence data were used to investigate the population structure of
the colR K. pneumoniae strains circulating in Serbia. Five different STs were detected among the
investigated collection (ST101, ST437, ST258, ST336 and ST340), with the majority of strains
belonging to ST101 (N=38) or CG258 (ST258, N=1; ST340, N=1 and ST437, N=4) (Figure 1). The
remaining strain belonged to CG17 and was typed as ST336. Isolates of ST101 were closely related to
each other (single nucleotide polymorphism (SNP) variation: 5–893, mean 107, median 61), with only
two of them (i.e. KV-2017-142 and KV-2017-143) having more than 200 SNPs when compared to
other ST101 isolates and to each other. The ST101 isolates were detected in all the cities involved in
this study, except Niš, thus demonstrating the endemicity at the national level of this clone.
Moreover, isolates did not cluster clearly by hospital of origin, suggesting
inter-hospital cross-infection.
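Summary statistics such as the SNP range, mean and median can be derived from pairwise comparisons of aligned core-genome sequences. The sketch below is a simplified stand-in for illustration, not the parsnp workflow used here: it assumes equal-length aligned sequences, ignores positions with ambiguous bases, and uses made-up toy sequences and isolate names.

```python
from itertools import combinations
from statistics import mean, median


def snp_distance(seq_a, seq_b):
    """Count positions where two aligned sequences differ, ignoring non-ACGT symbols."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(
        1
        for a, b in zip(seq_a, seq_b)
        if a != b and a in "ACGT" and b in "ACGT"
    )


def summarize(alignment):
    """Return min, max, mean and median of all pairwise SNP distances."""
    distances = [
        snp_distance(alignment[x], alignment[y])
        for x, y in combinations(sorted(alignment), 2)
    ]
    return {
        "min": min(distances),
        "max": max(distances),
        "mean": mean(distances),
        "median": median(distances),
    }


# Toy core alignment (hypothetical isolates and sequences).
toy_alignment = {
    "iso1": "ACGTACGTAC",
    "iso2": "ACGTACGTAT",
    "iso3": "ACGAACGTAT",
    "iso4": "TCGAACGTNT",
}
summary = summarize(toy_alignment)
```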
Figure 1. Phylogenetic tree of the colR K. pneumoniae isolates from Serbia. For each isolate, the medical setting (CN, Clinical center of Niš, Niš; CV, Clinical center of Vojvodina, Novi Sad; KB, Konzilijum, Belgrade; DM, University hospital center “Dr Dragiša Mišovic-Dedinje”, Belgrade; KV, The General hospital “Studenica”, Kraljevo; GZ, The Institute of Public health of Belgrade, Belgrade; SU, General Hospital Subotica, Subotica), the year of isolation and the sample number are reported. Colored nodes indicate MLST, while the presence/absence of ESBLs, carbapenemases, resistance genes (black) and plasmid replicons is indicated by filled boxes.
The genomes of the ST101 Serbian isolates were compared with 195 ST101 genomes available in the
NCBI databases, and their phylogenetic relation is shown in Figure 2. Strains from our study (red
lines) cluster together in the tree in a well-defined branch containing other strains from Serbia,
Slovenia, Turkey and Greece. Overall, the number of SNPs among all analyzed ST101 isolates ranged
between 1 and 1,547 (mean 195, median 135), and two major lineages within this group can be
observed. The majority of SNPs separating these two lineages fell in the cps gene cluster, and this
was consistent with the previous observations that strains of ST101 are characterized by two
different K-loci, KL17 and KL106, associated with wzi alleles 137 and 29, respectively (Roe et al.,
2019). While KL17 is prevalent among ST101 strains, KL106 is less frequent but, interestingly, it is the
second most abundant capsular variant of CG258 (Wyres et al., 2015), reinforcing the hypothesis that
capsular exchange in K. pneumoniae is a common event (Chen et al., 2014; Bowers et al., 2015).
All non-ST101 isolates (excluding KB-2015-119) were part of a single monophyletic subclade within
the CG258 (Bowers et al., 2015) and produced different carbapenemases or were carbapenemase
negative (Figure 1), while the remaining isolate of ST336 was an OXA-48 producer and harbored the
KL25 capsular type.
Figure 2. Phylogenetic tree of the ST101 K. pneumoniae isolates from this study (red lines) in comparison to ST101 isolates retrieved from NCBI (black lines). The two types of capsular polysaccharides (KL17 and KL106) are indicated by colored ranges. Two datasets are also present, indicating the type of carbapenemase (inner circle) and the country of origin (outer circle).
Colistin resistance mechanisms. No mcr genes were observed in the genomes of the colR isolates.
Conversely, all of them showed alterations in the PhoP/PhoQ regulator mgrB gene. These alterations
were mainly SNPs, with the majority of ST101 isolates from this study characterized by the mutation
MgrBC28S (N=37; 97.4%). Although different substitutions of the cysteine amino acid at position 28
have already been described (e.g. MgrBC28F and MgrBC28Y), and their role in colistin resistance has
been experimentally demonstrated (Cannatelli et al., 2014b; Olaitan et al., 2014; Cheng et al., 2015;
Wright et al., 2015), MgrBC28S is described here for the first time. This cysteine residue has been previously
shown to be involved in a key disulfide bond relevant to MgrB function (Lippa and Goulian, 2012),
thus its substitution by serine or by any other amino acid is expected to interfere with the ability to
repress PhoQ, leading to the overexpression of the pmrHFIJKLM operon and to a colistin resistance
phenotype. The isolate CN-2013-099, belonging to ST340, displayed the previously studied MgrBC28Y
substitution (Cheng et al., 2015). Different mutations leading to premature stop codons were MgrBK2*
in the ST101 isolate KV-2017-143, first described here, MgrBK3* in the ST437 isolate GZ-2017-145
(Nordmann et al., 2016) and MgrBQ30* in the ST336 strain KB-2015-119 (Nordmann et al., 2016). The
ST258 isolate was characterized by an insertion sequence of the family IS5 which interrupted the
mgrB gene at nucleotide 75. Disruption of the mgrB gene by insertion sequences has been shown as
a common mechanism of colistin resistance in KPC harboring strains (Cannatelli et al., 2014b). Three
ST437 strains were characterized by an adenine deletion within the polyadenine region present from
nucleotide 4 to 9 in mgrB, resulting in a frameshift mutation. Collectively, the results of these
analyses demonstrated that all colistin resistant strains investigated in this study were characterized
by genetic alterations in the mgrB gene.
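The mgrB alterations described above can be tallied to check that they account for every isolate. The counts below are transcribed from the text of this section; the grouping into alteration types is our own labelling for illustration.

```python
from collections import Counter

# (sequence type, alteration, type of alteration, number of isolates)
mgrb_alterations = [
    ("ST101", "C28S", "missense", 37),
    ("ST101", "K2*", "nonsense", 1),
    ("ST340", "C28Y", "missense", 1),
    ("ST437", "K3*", "nonsense", 1),
    ("ST437", "adenine deletion in the polyadenine region", "frameshift", 3),
    ("ST336", "Q30*", "nonsense", 1),
    ("ST258", "IS5 insertion at nucleotide 75", "insertion sequence", 1),
]

per_st = Counter()
per_type = Counter()
for st, _alteration, kind, n in mgrb_alterations:
    per_st[st] += n
    per_type[kind] += n

total = sum(per_st.values())
```

The totals match the 45 colR isolates and the per-ST counts given in the genomic epidemiology paragraph (38 ST101, 4 ST437, and one each of ST258, ST336 and ST340).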
Other genetic alterations potentially involved in colistin resistance were: PmrAE57G (KB-2015-119,
ST336), PmrBT157P (CV-2015-105, ST101) and PhoQV446G (DM-2017-135, ST258). Among these, only
PmrBT157P was previously reported, and its role in reducing colistin susceptibility was demonstrated
(Jayol et al., 2014). Accordingly, the ST101 isolate CV-2015-105 having PmrBT157P together with
MgrBC28S, showed a colistin MIC 1- to 2-fold higher than isogenic strains carrying only MgrBC28S.
Mass spectrometry of lipid A was performed on a subset of isolates representative of the different
alterations potentially involved in colistin resistance. Compared to the colistin susceptible reference
ATCC11296 strain, colR isolates showed an additional peak at 1,971 m/z resulting from the addition
of a 4-amino-4-deoxy-L-arabinose moiety (131 m/z) to lipid A (peak at 1,840 m/z), as previously
reported (Leung et al., 2017) (results not shown). This supports the role of the observed mutations in
the overexpression of the pmrHFIJKLM operon and consequent lipid A modification, leading to
reduced colistin interactions. Moreover, no addition of pEtN moieties to lipid A was observed,
consistent with the absence of mcr-like genes (Liu et al., 2017).
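The peak assignment reduces to simple mass arithmetic: the native lipid A peak plus the mass of the added moiety. In the sketch below the 1,840 m/z native peak and the 131 m/z L-Ara4N shift come from the text, whereas the 123 m/z shift for pEtN and the 1 m/z tolerance are assumptions added for illustration.

```python
NATIVE_LIPID_A_MZ = 1840  # native lipid A peak reported in the text

MODIFICATION_SHIFTS = {
    "L-Ara4N": 131,  # from the text
    "pEtN": 123,     # assumption: commonly cited shift, not stated in this section
}


def annotate_peak(observed_mz, native_mz=NATIVE_LIPID_A_MZ, tolerance=1.0):
    """Assign a peak to native lipid A or to a singly modified species, if within tolerance."""
    if abs(observed_mz - native_mz) <= tolerance:
        return "unmodified lipid A"
    for name, shift in MODIFICATION_SHIFTS.items():
        if abs(observed_mz - (native_mz + shift)) <= tolerance:
            return f"lipid A + {name}"
    return "unassigned"


larabinose_peak = NATIVE_LIPID_A_MZ + MODIFICATION_SHIFTS["L-Ara4N"]
```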
Of note, our findings concerning MgrB alterations differ from those previously reported by Novović
et al., as they did not detect significant MgrB alterations for most of the isolates. This underlines the
importance of using well characterized colistin susceptible reference isolates, as the one used in the
mentioned study was not characterized with reference methods for colistin susceptibility testing
(Mirovic et al., 2012).
Other antibiotic resistance mechanisms. All strains were positive for an ESBL-encoding gene, with
blaCTX-M-15 harbored by all strains except the only ST258, which carried a blaSHV-12 gene. Analysis of the
ompK35 gene, encoding a major outer membrane protein, showed that all non-ST258 strains had
deletions leading to frameshift and premature stop codons, while the ompK36 gene was intact in all
the genomes. Outer membrane impermeability most likely explains resistance to cefoxitin (a
cephamycin) and to ertapenem for those isolates negative for a carbapenemase encoding gene
(Ardanuy et al., 1998). Two ST437 and the ST336 isolate harbored the 16S rRNA methylase gene
armA, which confers high level resistance to aminoglycosides. Several other antimicrobial resistance
genes were observed for the following antimicrobial classes: aminoglycosides (presence of aac- ,
blaTEM-1A, dfrA14), and ii) a fragment identical to the IncL/M plasmid pKp_Goe_641-2 (CP018736.1)
carrying the blaOXA-48 gene (Figure 3). Both these plasmids have been described in K. pneumoniae
strain Kp_Goe_121641 (accession no. NZ_CP018735.1), isolated from a refugee from North Africa
hospitalized in Germany, in 2013. The latter strain belongs to ST101 and has a median of 142 SNPs
(min 134, max 601) compared to the Serbian ST101 isolates from this study. Collectively, these results
suggest that pSRB_OXA-48 likely originated by recombination events between two plasmids within
an ST101 strain related to Kp_Goe_121641. In order to elucidate the recombination mechanisms at
the origin of pSRB_OXA-48, we compared this plasmid to pKp_Goe_641-1 and to pRA35
(LN864821.1), an IncL/M plasmid similar to pKp_Goe_641-2 but with an intact structure of the
transposon Tn6237 carrying blaOXA-48 (Beyrouthy et al., 2014) (Figure 3). A detailed analysis showed
that pSRB_OXA-48 contained a copy of Tn6237 which was disrupted by an IS26 composite transposon
of 73.7 Kbp sharing similarity with pKp_Goe_641-1. This hypothesis was corroborated by the
presence of 8-bp target site duplication sequences (5’-GCGAATAA-3’) flanking the composite
transposon regions (Figure 4). The results of read mapping performed against pSRB_OXA-48 using
Illumina short-reads from the other ST101/OXA-48 strains were consistent with the presence of a
pSRB_OXA-48-related plasmid in all the ST101/OXA-48 isolates. Non-ST101 OXA-48 strains (ST336
KB-2015-119 and ST437 GZ-2017-145) had the IncL/M replicon, while lacking the IncFIA and IncR
replicons, suggesting that the blaOXA-48 gene was located in a classic IncL/M plasmid and not in a
pSRB_OXA-48-like plasmid (Figure 1).
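Target site duplications appear as identical short direct repeats immediately flanking an inserted element. The check below is a minimal sketch assuming the element boundaries are already known: only the 8-bp repeat 5’-GCGAATAA-3’ comes from the text, while the backbone and element sequences are short made-up placeholders.

```python
def flanking_tsd(seq, start, end, k=8):
    """Return the k-mer if the k bases before `start` equal the k bases after `end`, else None.

    `start` and `end` are 0-based coordinates of the inserted element (end exclusive).
    """
    left = seq[max(0, start - k):start]
    right = seq[end:end + k]
    if len(left) == k and left == right:
        return left
    return None


TSD = "GCGAATAA"                    # target site duplication reported in the text
LEFT = "TTGACCAT"                   # hypothetical backbone sequence
ELEMENT = "GGCACTGTTGCAAAGTTAGCG"   # hypothetical stand-in for the composite transposon
RIGHT = "CCATGGTA"                  # hypothetical backbone sequence

with_tsd = LEFT + TSD + ELEMENT + TSD + RIGHT
without_tsd = LEFT + TSD + ELEMENT + "GCGTATAA" + RIGHT

element_start = len(LEFT) + len(TSD)
element_end = element_start + len(ELEMENT)

found = flanking_tsd(with_tsd, element_start, element_end)
not_found = flanking_tsd(without_tsd, element_start, element_end)
```

A matching pair of flanking repeats is consistent with insertion by transposition, whereas their absence would point to another mechanism such as homologous recombination.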
Figure 3. BLAST ring image generator output of the OXA-48 plasmid pSRB_OXA-48 from the ST101 isolate KB-2017-139 (violet) against the two major plasmids from the ST101 isolate Kp_Goe_121641 (pKp_Goe_641-1, in red and pKp_Goe_641-2 in green). Only identities >95% are indicated. Antimicrobial resistance genes are indicated in red, plasmid replicons in blue and all other genes in black.
Figure 4. Comparison of plasmids pSRB_OXA-48, pKp_Goe_641-1 and pRA35. Antimicrobial resistance genes, plasmid replicons and mobile elements are also indicated. TSD: target site duplication.
2.5 Discussion
This study exploited WGS to characterize a collection of colR CRKP isolates obtained from seven
medical settings and five Serbian cities over a nearly four-year period. Results showed that all the
isolates presented alterations in the PhoP/PhoQ regulator MgrB, confirming its major role in colR in K.
pneumoniae. Lipid A alterations associated with colR were also studied with MALDI-TOF MS. The
analysis revealed the addition of a 4-amino-4-deoxy-L-arabinose moiety to lipid A, but no addition of
pEtN moieties, for all isolates tested. These results support the role of the MgrB mutations in colistin
resistance, and also confirm the absence of mcr-like genes.
The predominant ST observed was ST101, an emerging high-risk clone detected worldwide and
associated with different carbapenemases and high mortality (Navon-Venezia et al., 2017; Can et al.,
2018). In a recent European survey of CRKP isolates, including 244 hospitals in 32 countries, four
major clonal lineages accounted for roughly 70% of the carbapenemase-producing isolates, including
ST11, ST15, ST101, ST258/512 and their derivatives (David et al., 2019). The first ST101 strain from Serbia
was isolated in 2013, and coproduced the OXA-48 and the NDM-1 carbapenemases (Seiffert et al.,
2014). Most of the colR ST101 from this study were carbapenemase-producers, and OXA-48 was the
only carbapenemase expressed. ST101/OXA-48 has been frequently reported, and in an 11-year
epidemiology study of OXA-48 producers among European and North African countries, a quarter of
the OXA-48 K. pneumoniae isolates belonged to ST101 (Potron et al., 2013). Outbreaks of
ST101/OXA-48 were also described, with reports from Spain (Pitart et al., 2011; Cubero et al., 2015),
Algeria (Loucif et al., 2016), Czech Republic (Skálová et al., 2016) and Greece (Avgoulea et al., 2018).
The challenging phenotypic detection of OXA-48 carbapenemases and the rapid horizontal transfer
of OXA-48-encoding plasmids favor hospital outbreaks linked to patient transfer (Skálová et al., 2016)
and draw attention to the need for continuous and meticulous surveillance, as well as timely
investigation.
The blaOXA-48 gene spread is mainly related to the dissemination of a single ~62-kb IncL/M-like
conjugative plasmid that does not carry additional resistance determinants (Poirel et al., 2012).
Conversely, ST101/OXA-48 isolates from this study carried a novel hybrid plasmid (pSRB_OXA-48)
with replicons IncR and IncFIA and encoding OXA-48, the CTX-M-15 ESBL and several other
antimicrobial resistance genes. Such plasmids confer an MDR phenotype which limits the use of most
β-lactams, including carbapenems. In fact, even if most isolates (91%) were susceptible to imipenem,
carbapenems proved ineffective in an in vivo murine model (Wiskirchen et al.,
2014). Moreover, there have been a number of case reports and series describing treatment failures
with carbapenem-containing regimens in the treatment of OXA-48-producing bacterial infections
(Stewart et al., 2018). Ceftazidime-avibactam may represent an effective alternative against such
isolates, as previously reported (Kazmierczak et al., 2018).
Similarities among the Serbian ST101 strains, supported by the limited number of SNPs observed and
the presence of the same alteration in the mgrB gene, suggest a clonal expansion of this clone among
Serbian medical settings. This observation underscores the need to strengthen contact precautions
for patients diagnosed with or suspected of having CRKP infections to limit the diffusion of colR CRKP
of ST101.
Of note, colR ST101 strains have recently been associated with high mortality rates. Indeed, a
prospective cohort study showed that among colR isolates, ST101 was found to be a significant
independent predictor of patient mortality, with a 30-day patient mortality of 72% (Can et al., 2018).
In conclusion, this work represents the first genomic investigation of colistin resistance in K.
pneumoniae isolates from Serbia. The major role of MgrB mutations in colistin resistance in K.
pneumoniae, observed in strains of CG258, is here confirmed for those of ST101. We also report the
full sequence of a novel plasmid, pSRB_OXA-48, conferring an MDR phenotype and encoding the
ESBL CTX-M-15 and the carbapenemase OXA-48.
2.6 References
Alikhan, N.-F., Petty, N. K., Ben Zakour, N. L., and Beatson, S. A. (2011). BLAST Ring Image Generator
Figure 8. Features of the major CGs observed among the 200 randomly collected strains. The prevalence of MDR vs MDS (A) and the types of ESBLs (B), carbapenemases (C) and capsular types (D) observed within the major CGs.
4.3.3 Antimicrobial resistance determinants.
More than half of the strains (N=110, 55%) harboured an ESBL-encoding gene, with 13 strains
harbouring more than one gene with up to four genes per strain. The most common ESBLs observed
were of the CTX-M type, with CTX-M-14 (N=35), CTX-M-3 (N=26) and CTX-M-15 (N=22) being the
most prevalent. CG307 strains had the highest prevalence of ESBLs, with all strains encoding
either CTX-M-15 or CTX-M-14.
Four different carbapenemase-encoding genes were observed, blaKPC-2 (N=10), blaIMP-4 (N=2), blaOXA-48
(N=2) and blaIMP-30 (N=1). Strains belonging to ST11 carried most of the blaKPC-2 genes (90%), while the
remaining gene was found in an ST37 strain. The blaIMP-4 genes were observed in a hypervirulent
ST23 strain and in an ST337 strain. Two ST147 strains had either blaOXA-48 or blaIMP-30, and an ST383
strain had blaOXA-48.
Mutations in ompK genes were observed in 42 strains (21%) and consisted of insertions and deletions
leading to premature termination of OmpK35, which in a few cases (N=9) were combined with
simultaneous ompK36 alterations. Such porin deficiencies were mainly observed within CG258, with
22 mutated strains out of 28 (78.6%). No porin alterations were observed for hypervirulent CG23
and CG65 strains.
Genes encoding 16S rRNA methyltransferase, associated with high-level aminoglycoside resistance,
were observed, with 13 strains harbouring armA, 11 harbouring rmtB genes and 2 strains harbouring
both armA and rmtB. Such genes were mainly observed in strains belonging to ST11 (N=9) and ST15
(N=4).
Several chromosomal mutations associated with known fluoroquinolone resistance were observed,
the most common being ParC80I (N=57), GyrA83I (N=47) and GyrA83F (N=11). Overall, 65 strains (32.5%)
had at least one ParC or GyrA mutation, the most common combination being GyrA83I-ParC80I (N=37),
and all 65 strains had high ciprofloxacin MIC (≥4 mg/L). Concerning the acquired fluoroquinolone
resistance mechanisms, QnrS1 (N=65), Aac(6')-Ib-cr (N=61) and QnrB4 (N=33) were the most
prevalent. Overall, 150 strains had at least one mechanism of fluoroquinolone resistance.
Genes encoding resistance to trimethoprim (dfrA) and sulfonamides (sul) were observed in 138
strains, with 100 carrying both genes and showing trimethoprim/sulfamethoxazole resistance.
Acquired mechanisms of colistin resistance were also observed. The mcr-1.1 gene was observed in
the K. pneumoniae ST231 strain K089 isolated in 2015. The gene was carried by a plasmid with
replicon IncX4 and identical to plasmid pMCR_WCHEC1618 (accession no. KY463454.1) obtained
from an E. coli strain from China in 2015 (Zhao et al. 2017). Strain K089 also encoded the ESBL CTX-
M-27, as well as fluoroquinolone, trimethoprim and sulfonamide resistance mechanisms. Two mcr-
9.1 genes were detected in K. quasipneumoniae subsp. quasipneumoniae K7029 and K7030 strains
both belonging to ST1681 and collected in 2005. Unfortunately, relying only on Illumina short
reads, we were unable to determine the genetic context of the mcr-9.1 genes.
4.3.4 Hypervirulent K loci and acquired virulence genes.
K. pneumoniae capsule is a major virulence factor, and the capsule synthesis locus has considerable
genetic diversity between clonal groups (DeLeo et al. 2014; Wyres et al. 2015; Holt et al. 2015; Wyres
et al. 2016b). The hypervirulence-associated KL1 and KL2 represented the two most common capsular
polysaccharides within our collection. KL2 was associated with CG14 and CG65 strains (N=9 each),
and three more strains belonging to ST380, ST86 and ST25. KL1 was strictly linked to ST23 in K.
pneumoniae sensu stricto (N=14). KL1 was also observed in an ST367 K. quasipneumoniae subsp.
similipneumoniae, in a novel ST two locus variant of ST367 belonging to K. quasipneumoniae subsp.
similipneumoniae, and in a novel ST (single locus variant of ST527) belonging to Klebsiella variicola.
Siderophore gene acquisition was recently recognised as an important contributor to severe K.
pneumoniae invasive disease (Holt et al. 2015; Lam, Wick, et al. 2018). Lam et al. reported that the
ybt locus was present in 40.0% of the CG258, 87.8% of the hyper-virulent CG23, and was identified in
32.2% of the wider population. In our collection, yersiniabactin-encoding genes were observed in 61
strains (30.5%), and were located in eight different ICEKp chromosomally integrated mobile elements
and one plasmid. The major mobile elements were ICEKp10 (N=22) and ICEKp3 (N=17). While
ICEKp10 was linked to hypervirulent clones (CG23, N=14; CG65, N=5), ICEKp3 was mostly associated
with CG258 (N=9) and other non-hypervirulent clones. We observed ybt genes in 57.1% and 100% of
CG258 and CG23 strains, respectively, which is higher than previously reported (Lam, Wick, et al.
2018).
Plasmid-related iuc, iro, clb, rmpA and rmpA2 genes were also observed (iuc, 17%; iro, 16.5%; rmpA,
16%; rmpA2, 15%; clb, 11%), mostly associated with CG23 and CG65 (Figure 9). Because of its crucial
role in hypervirulence, aerobactin (iuc) positivity was considered a defining genetic trait for hvKP
(Russo et al. 2014). iuc1 was the most prevalent iuc lineage (N=32), and was linked to CG23 (N=14),
CG65 (N=8) and six other less represented CGs, including ‘classic’ clones and a K.
quasipneumoniae subsp. similipneumoniae strain. iuc1 is usually located within the KpVP-1 virulence
plasmid (Lam, Wyres, et al. 2018) together with the previously mentioned virulence genes. We found
iuc1 together with iro1 (N=28), clb2 (N=14), clb3 (N=4), rmpA (N=28) and rmpA2 (N=29). Other iuc
lineages observed were iuc2, which is associated with KpVP-2 (Lam, Wyres, et al. 2018) and observed in
an ST380 strain, and iuc5, observed in an ST107 strain.
Figure 9. Percentages of virulence genes within the major CGs.
4.3.5 Comparative genomics of CG258 strains: cps diversity and hypervirulence
Figure 10. Phylogenetic analysis of CG258 strains, including 48 strains from this study and 18 strains from previous studies (Gu et al. 2017; Dong, Zhang, et al. 2018; Zhou et al. 2020). The fatal outbreak clone reported in China in 2017 (Gu et al. 2017) is highlighted on the tree. Aerobactin and salmochelin are not shown in the legend as they were of the type iuc1 and iro1 only, respectively. Chromosomal regions characterized by high SNP density are reported on the right and their locations are shown compared to the reference GD4 genome (CP025951). Red blocks indicate predicted recombinations occurring on an internal branch, which are therefore shared by multiple isolates through common descent. Blue blocks represent recombinations that occur on terminal branches, which are unique to individual isolates.
Considering all 299 genomes, we ended up with 48 non-duplicated CG258 genomes (ST11, N=40; ST11-
1LV, N=3; ST395, ST437, ST1264, ST340, ST1326, N=1 each). The rapid evolution within CG258 was
emphasized by the number of different capsular polysaccharides detected (N=17), of which 11 were
detected in ST11 only, and by the high evolutionary rate (~15 SNPs/genome/year) detected in
previous studies (Wyres et al. 2015; Zhou et al. 2020).
Figure 10 shows the phylogenetic relations of the 48 strains together with other ST11 strains
sequenced in previous studies. Two major clades were formed, with clade 1 consisting of ST11-KL47
and ST11-KL64 only, and clade 2 consisting of six different STs and 15 different cps types. Average
core SNP difference between clade 1 strains was 23, ranging from 0 to 60. Consistent with previous
studies, the major CG258 clone was ST11-KL47-KPC-2, which was similar to strains recently described
in China and causing outbreaks, including the fatal one that caused 5 deaths in 2017 (Gu et al. 2017;
Dong, Zhang, et al. 2018; Zhou et al. 2020). All strains from this clade harboured blaKPC-2 and carried
the ybt9 locus on an ICEKp3 element. Two of our ST11-KL47 strains were CR-hvKp and carried blaKPC-2
plus a pLVPK-like plasmid containing iuc1 and a truncated rmpA2. Retrospective studies have shown
that ST11-KL47 CR-hvKp emerged before 2015 and has since become detectable in different Asian
countries, including China, Hong Kong and India, suggesting that CR-hvKp may undergo worldwide
dissemination in the near future (Shankar et al. 2016; Wong et al. 2017; Du et al. 2018).
Clade 2 strains had 47 core SNPs on average, ranging from 0 to 123 (median 45). Recent studies
revealed the emergence and predominance of a novel ST11 clone, harbouring KL64, KPC-2 and the
hypervirulence plasmid in some instances (Zhou et al. 2020; Yang et al. 2020). Genomic analysis
revealed that this clade originated from ST11-KL47 after recombination of the cps genes around 2011
(Zhou et al. 2020). Of note, ST11-KL64 strains from this study did not cluster in clade 1 together with
previously reported ST11-KL64 strains, but they were located within clade 2. Analysis of
recombination sites revealed that such strains had two major regions of recombination, the cps
genes and the ICEKpnHS11286-1 region. Conversely, ST11-KL64 strains described by Zhou et al. only
showed recombination within the cps biosynthesis genes. Such findings suggest a different
evolutionary origin of ST11-KL64 strains from this study compared to the emerging clone described
by Zhou et al. The three ST11-KL64 strains in our collection were isolated in 2006 and 2007; they
lacked the blaKPC-2 gene and the ybt locus, which is normally present in the ICEKpnHS11286-1
recombinant region. Strain ST11-KL64 K7069, isolated in 2007, carried a pLVPK-like plasmid
containing iuc1 and a truncated rmpA2 and also co-harboured blaCTX-M-3, armA and several other AMR
genes (Table S1). Only three strains out of the 28 composing the lower clade harboured blaKPC-2. Also,
the prevalence of yersiniabactin-encoding genes was lower compared to that of clade 1, with twelve
strains carrying either ybt9, ybt10, ybt13 or ybt14.
4.3.6 Phylogenetic analysis of the hypervirulent CG23
Figure 11. Comparative genomics of CG23 strains from the present study. STs are indicated by coloured tips, with yellow and green indicating ST23 and ST1265, respectively. All strains also contained the cps KL1, ybt1, clb2 and a truncated rmpA2. *replicons IncFIB(K) and IncHI1B of the pLVPK-like plasmid were observed in all strains.
A total of 19 non-duplicate CG23 strains were sequenced over the study period (Figure 11). All
belonged to ST23, except strain K7159, which belonged to ST1265. The average core SNP distance was
186, ranging from 49 to 288 (median 188). All genomes contained the KL1 capsular locus, the
chromosomally encoded ybt1 embedded in ICEKp10 and the colibactin locus clb2. The hypervirulent
plasmid with IncFIB(K) and IncHI1B replicons was observed in all strains, containing iuc1, iro1, rmpA
and rmpA2 in most instances (Figure 11).
Strain K7159 (ST1265) shared six MLST alleles with ST23, differing only at phoE, which is of type
9 in ST23 and type 10 in ST1265. ST1265 was first described in Beijing in 2010, associated
with KL1 cps type, rmpA and a negative string test (Liu et al. 2014). Recombination analysis revealed
that strain K7159 had a ~750 Kbp recombinant region which also contained the phoE gene. Genomic
comparison revealed that this region likely originated from an ST35 genome (Figure 12).
Figure 12. Whole genome alignment of ST1265 in comparison to ST23 and ST35 genomes. The SGH10 chromosome was used as reference for the alignment. Pink lines indicate SNPs identified with the Harvest suite. The MLST gene phoE position is indicated, as well as the ~750 Kb region of divergence of ST1265 strains originating from ST35 genomes.
Strain K7159 was nearly identical to strain 11420 (GCA_009497755.1) isolated in Beijing in 2014 (Li et
al. 2020). Strain 11420 consists of a chromosome of 5,438,591 bp, a pLVPK-like plasmid of
229,796 bp and a KPC-2 plasmid of 81,180 bp, containing the replicon IncN without additional
AMR genes. Read mapping analysis showed that our ST1265 genome also contained two plasmids
with identical organization and 99.9% nucleotide identity compared to plasmids from strain 11420.
Three additional cases of genomic convergence of MDR and hypervirulence were observed. Strains
K931 and K862 both carried a ~50 Kbp IncN plasmid similar to pIMP-HZ1 (KU886034.1) described in
IMP-4-producing Enterobacteriaceae from China (Wang et al. 2017). While K862 carried a plasmid
identical to pIMP-HZ1, the IncN plasmid from strain K931 had blaCTX-M-3 and blaTEM-1 replacing the
blaIMP-4 gene. Strain K7046 had a plasmid identical to pCTX-M-3 (AF550415) described in C. freundii in
Poland (Gołȩbiewski et al. 2007). It is a ~90 Kbp IncL/M plasmid carrying the blaCTX-M-3, armA, blaTEM-1,
aac(3)-IId, mph(E), msr(E), sul1, aadA2 and dfrA12 genes.
4.3.7 Global comparison of ST383: an emerging high-risk clone
Figure 13. Phylogenetic tree of ST383 genomes from this study in comparison with publicly available ST383 genomes. Coloured leaves indicate different capsular polysaccharides, where yellow is for KL30 and green for KL15.
We investigated the strains belonging to ST383 in depth, as we found several of them to be CR-hvKp.
ST383 is an emerging clone that was first observed in Greek hospitals during 2009-2010; strains
belonging to this clone co-harboured the blaVIM-4, blaKPC-2 and blaCMY-4 β-lactamase genes (Papagiannitsis
et al. 2010). Figure 13 shows the phylogenetic relatedness of our ST383 strains together with publicly
available ST383 genomes. Only ten genomes were available, with most of them originating from
Greece. Strain KpvST383_NDM_OXA-48 from the UK had a complete genome and it was used as
reference for the phylogeny (Turton et al. 2019). Genomic relatedness showed strains from Europe
clustering together, the strain from the UK positioned apart from the rest of the tree, and the
Chinese strains from this study clustering together. Overall, an average of 158 core SNPs was
observed (min: 4, max: 627, median: 157), which decreases to 53 (min: 4, max: 182, median: 40) if we
only consider the strains from China. Two different K loci were observed, with the strain from
Belgium carrying KL15 and all other strains carrying KL30. Gubbins analysis revealed that the capsular
polysaccharide genes represented the major recombinant region. A second recombination concerned
a ~12 Kbp region consisting of mercury resistance genes and several transposases. No other major
recombination events were observed. Several carbapenemase-encoding genes were observed,
comprising the major clinically relevant KPC, OXA-48, NDM and VIM types, with two strains co-
harbouring two different carbapenemase genes. All strains from China carried the blaOXA-48 gene and
had an IncL/M plasmid replicon. The ESBL-encoding gene blaCTX-M-14 was observed in all strains from
China, and strain K57 additionally carried blaCTX-M-55.
Concerning virulence factors, yersiniabactin-encoding genes were not observed. Conversely, the
hypervirulent pLVPK-like plasmid was observed in some strains from China and in the strain from the
UK. Although it was not possible to fully reconstruct the hv plasmid sequences from our short-read
sequence data, we detected iuc1 on a contig that matches a 45 Kb region of pLVPK and also carries
rmpA and rmpA2.
Strains belonging to ST383 and carrying OXA-48 plasmids were previously reported, with reports
from the UK (Dimou et al. 2012) and from China (Guo et al. 2016). In the latter study, Guo et al.
reported an outbreak caused by ST383 strains carrying a 70 Kb IncL/M OXA-48 plasmid. ST383 strains
carrying hypervirulence genes were also reported from the UK, carrying the iuc and rmpA/A2 genes
together with carbapenemase-encoding genes of type blaOXA-48, sometimes in combination with
blaNDM (Turton et al. 2017, 2019).
4.3.8 Simultaneous carriage of acquired AMR and hypervirulence genes.
We detected eleven examples of genomic convergence of hypervirulence, indicated by the presence
of the aerobactin locus (iuc), and MDR, indicated by the presence of either an ESBL- or a
carbapenemase-encoding gene, in our 200 randomly selected strains (5.5%), spanning eight different
STs. Similarly, in a recent study from South and Southeast Asia aiming at studying the population
structure of bloodstream infection isolates, the prevalence of convergent strains was 7.3%, with
seven different STs observed (Wyres et al. 2020). By considering our complete collection of genomes
after exclusion of duplicates, we identified 25 cases of genomic MDR-hv convergence (Table S2).
The occurrence of such convergent strains is on the rise, with 80% of them being detected in the
period 2012-2016. Among the convergent strains, the major ST reported was ST383, with 6 cases,
followed by ST11 and ST23 (3 cases each), ST29 (two cases) and eleven other STs with only one case.
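The convergence criterion used above (presence of an aerobactin locus together with at least one ESBL- or carbapenemase-encoding gene) can be written as a simple rule. The sketch below is illustrative only: the gene sets are restricted to genes named in this chapter, and the function name, input format and example genomes are assumptions rather than the pipeline used in this study.

```python
# Illustrative gene sets, restricted to genes named in this chapter.
ESBL_GENES = {"blaCTX-M-3", "blaCTX-M-14", "blaCTX-M-15", "blaCTX-M-55"}
CARBAPENEMASE_GENES = {"blaKPC-2", "blaIMP-4", "blaOXA-48", "blaVIM-4"}


def is_convergent(genes):
    """Return True if a genome carries an aerobactin (iuc) locus together
    with at least one ESBL- or carbapenemase-encoding gene."""
    genes = set(genes)
    has_iuc = any(g.startswith("iuc") for g in genes)  # iuc1, iuc3, iuc5, ...
    has_mdr = bool(genes & (ESBL_GENES | CARBAPENEMASE_GENES))
    return has_iuc and has_mdr


# Hypothetical gene inventories, for illustration only.
examples = {
    "genome_A": ["iuc1", "rmpA2", "blaKPC-2"],   # hv marker + carbapenemase
    "genome_B": ["iuc1", "iro1", "rmpA"],        # hv markers only
    "genome_C": ["blaCTX-M-15", "ybt9"],         # ESBL only
}
for name, genes in examples.items():
    print(name, is_convergent(genes))
```

Only the first hypothetical genome satisfies both conditions; a real analysis would rely on curated AMR and virulence databases rather than hard-coded sets.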
Most cases of convergence (N=21) were characterized by the presence of a pLVPK-like plasmid. Such
a plasmid is common within hypervirulent clones such as CG23 and CG65, and we observed more
than 80% of its sequence within our CG23 and CG65 convergent strains. Conversely, variable portions
of the virulence plasmid were observed in normally non-hypervirulent clones (Table S2).
Aerobactin loci detected were of type 1 (N=21), 3 (N=3) and 5 (N=1). Most of the iuc1 convergent
strains belonged to ST383, ST23 and ST11 and were previously described. In some cases we were
able to detect the genetic background of the hv and MDR genes. The K. quasipneumoniae subsp.
similipneumoniae strain K898 belonged to ST367 and had a hypervirulent capsule of the KL1 type. It
carried a blaCTX-M-15 gene on an IncFII plasmid together with blaTEM-1. This IncFII plasmid is ~95 Kbp and
is identical to pL22-5 (CP031262.1) obtained from an ST367 strain from Beijing. The pLVPK-like plasmid was
characterized by the presence of the replicon IncFIB(K) and by the virulence genes iuc1, iro1, rmpA
and a truncated rmpA2. Strain K7058 belonged to ST65 and carried a pLVPK-like plasmid plus a ~70
Kb IncFII plasmid harbouring blaCTX-M-15 and no other AMR genes.
Three strains carried iuc3, which was associated with IncFIB(K) and IncFII plasmids similar to NCTC11676
(NZ_UGMR01000002.1). Two of those strains also carried iro3 and an ICEKp1 element containing
ybt2 and rmpA. All three strains carried multiple ESBL-encoding genes, and strain K7156 additionally
harboured a blaIMP-4 carbapenemase-encoding gene.
The strain K7146 belonged to ST107 and carried iuc5 together with iro5, which have been previously
detected in E. coli plasmids such as p3PCN033 (CP006635.1). Read mapping revealed that our ST107
strain contained a plasmid with 90% coverage and 99.5% identity compared to p3PCN033, including
the plasmid replicons IncFIB, IncFIC and IncQ1 and several AMR genes (aph(3')-Ia, aph(6)-Id, aph(3'')-
Ib, sul2, oqxA/B, dfrA17, blaTEM-1B, tet(B)). K7146 also carried the ESBL-encoding gene blaCTX-M-3 on a
plasmid with replicons IncN and IncU, also containing additional AMR genes (aac(6')-Ib-cr, ARR-3,
qnrS1, catA1, mph(A), dfrA14).
4.4 Conclusions
This study aimed to investigate the longitudinal population of K. pneumoniae clinical isolates from
the Hospital 301 (People's Liberation Army General Hospital) in Beijing, China. The major focus was
directed towards the investigation of ‘high-risk’ clones, those characterized by the simultaneous
carriage of AMR and hypervirulence genes and potentially able to cause serious infections with
limited treatment options. A major limitation was that the sample size was small, especially if we
consider that it was spread over a long time frame. While some sporadic clones may have been
missed from our collection, the major K. pneumoniae clones, as described in previous reports from
China (Zhang et al. 2016; Van Dorp et al. 2019; Yang et al. 2020; Zhou et al. 2020), were observed.
While we did not get a complete picture of the complex K. pneumoniae population, we were able to
detect the major AMR and virulence determinants and, in some cases, their genetic environment. We
detected three major high-risk clones, characterized by ESBL and/or carbapenemase production or
hypervirulence, with some strains exhibiting both features simultaneously. Strains belonging to
CG258, the globally dominant clinical K. pneumoniae clone, were the most represented and showed
high diversity. However, one clone, ST11-KL47, represented the majority of strains, and was highly
associated with KPC-2 and several virulence factors. CG23 still remains the dominant hvKp clone.
While it is usually susceptible to multiple antibiotics, we found some strains harbouring MDR
plasmids encoding ESBLs and carbapenemases. Moreover, we found a strain belonging to the
recently described ST1265 and we showed that it is a hybrid strain originating from an ST23 and an
ST35. The simultaneous carriage of the cps KL1, the hypervirulence plasmid and a KPC-2 plasmid
underscores the importance of tracking the spread of this novel clone. We also reported the
emergence of a recently described high-risk clone, ST383. In contrast to strains belonging to CG258,
which are usually associated with KPC-2, ST383 strains seem to readily acquire carbapenemases of
different types, sometimes harbouring two different types. Moreover, we found several ST383
strains carrying the hypervirulent plasmid. The combination of carbapenem resistance and
hypervirulence significantly reduces the antimicrobial options for treating the life-threatening
infections caused by such strains and therefore represents a major urgent challenge for clinical
treatment, infection control and public health (Chen & Kreiswirth 2017).
4.5 References
Argimón S et al. 2016. Microreact: visualizing and sharing data for genomic epidemiology and
Zhang Y et al. 2018. Epidemiology of carbapenem-resistant Enterobacteriaceae infections: Report
from the China CRE Network. Antimicrob. Agents Chemother. 62. doi: 10.1128/AAC.01882-17.
Zhang Y et al. 2019. Evolution of hypervirulence in carbapenem-resistant Klebsiella pneumoniae in
China: a multicentre, molecular epidemiological analysis. J. Antimicrob. Chemother. doi:
10.1093/jac/dkz446.
Zhao F, Feng Y, Lü X, McNally A, Zong Z. 2017. Remarkable diversity of Escherichia coli carrying mcr-1
from hospital sewage with the identification of two new mcr-1 variants. Front. Microbiol. 8:2094. doi:
10.3389/fmicb.2017.02094.
Zhou K et al. 2020. Novel subclone of carbapenem-resistant Klebsiella pneumoniae sequence type 11
with enhanced virulence and transmissibility, China. Emerg. Infect. Dis. 26:289–297. doi:
10.3201/eid2602.190594.
CHAPTER 5 : Interpreting k-mer based signatures for antibiotic
resistance prediction
Magali Jaillard1, Mattia Palmieri1, Alex van Belkum1 and Pierre Mahé1
1bioMérieux, Marcy l’Etoile, France
Submitted to GigaScience
5.1 Abstract
Background. Recent years witnessed the development of several k-mer-based approaches aiming to
predict phenotypic traits of bacteria based on their whole-genome sequences. While often
convincing in terms of predictive performance, the underlying models are in general not
straightforward to interpret, the interplay between the actual genetic determinant and its translation
as k-mers being generally hard to decipher.
Results. We propose a simple and computationally efficient strategy allowing one to cope with the
high correlation inherent to k-mer-based representations in supervised machine learning models,
leading to concise and easily interpretable signatures. We demonstrate the benefit of this approach
on the task of predicting the antibiotic resistance profile of a Klebsiella pneumoniae strain from its
genome, where our method leads to signatures defined as weighted linear combinations of genetic
elements that can easily be identified as genuine antibiotic resistance determinants, with
state-of-the-art predictive performance.
Conclusions. By enhancing the interpretability of genomic k-mer-based antibiotic resistance
prediction models, our approach improves their clinical utility, hence will facilitate their adoption in
routine diagnostics by clinicians and microbiologists. While antibiotic resistance was the motivating
application, the method is generic and can be transposed to any other bacterial trait.
5.2 Introduction
Antimicrobial resistance (AMR) is a global healthcare problem and rapid diagnostics are needed to
select the right treatment, to follow the route to cure and to monitor and prevent community- and
hospital-acquired outbreaks of infections. Next-Generation Sequencing (NGS) is a disruptive
technology which is, potentially, able to supplant or even replace the current plethora of diagnostic
tests with a single, most probably well-affordable and faster solution. Inferring the antibiotic
resistance profile from a bacterial genome is challenging. However, good results have been obtained
for several species [1-7], including Klebsiella pneumoniae [8]. Su et al. [9] discussed the challenges of
NGS-based antibiotic susceptibility testing (AST) and provided a comprehensive review of the current
state of the art in this field.
Early approaches relied on the detection of known resistance markers to claim resistance, a strategy
sometimes referred to as direct association analysis [10]. While effective when the genetic bases of
antibiotic resistance are well known, which is the case for instance for most antibiotic resistance
mechanisms in the highly clonal species M. tuberculosis [11, 12] and Salmonella typhi [13], this
approach suffers from several limitations. First and foremost, it intrinsically relies on prior knowledge
of the precise nature of the resistance determinants, which may not be available for all species and
drugs. Secondly, it is not able to account for the fact that these markers can have different levels of
predictive power [14, 15], that they can act in a multi-factorial fashion through epistasis [16, 17], or
that resistance can result from the accumulation of several different mutations [18, 19]. Last but not
least, it is hazardous to predict susceptibility when no marker is detected, since the resistance marker
may be novel and databases incomplete. This issue is increasingly addressed from the supervised
machine learning (ML) standpoint: given a set of genomes with associated reference phenotypes
(provided by phenotypic AST methods [20]), one seeks a prediction rule allowing one to infer the
resistance or susceptibility of a novel strain from genomic features. Even for M. tuberculosis, where
the antibiotic resistance knowledge is probably among the most thorough and complete, recent
studies showed that performance of direct association strategies can still be significantly improved
by ML models [10, 17].
A great variety of ML strategies have been explored, taking into account several parameters. First,
regarding the nature of the genomic features considered: supervised ML models can indeed operate
from known markers like the ones involved in direct association strategies, offering the possibility to
discover more complex and multivariate marker combinations better predicting resistance
phenotypes [3, 10, 17], or directly using the raw sequences represented as k-mers [4, 8, 21-23]. The
latter approach offers several advantages: it does not require prior knowledge about the underlying
resistance mechanisms, allows various types of genomic determinants to be captured (including the
acquisition of genes or point mutations), and does not require aligning the genomes to a common
reference, which may be hard to define for some species, especially the less clonal ones [24, 25].
Second, regarding the type of ML algorithm: boosting algorithms [4, 8, 21], penalized regression
models [10, 17, 23], decision trees [26], random forests [10, 27], neural networks [17] and set cover
machines [22, 26] have already been successfully deployed in this context. While each algorithm has
its own merits and shortcomings, several studies reported comparable global performance for
various algorithms, with specific variations by drug and microbial species [10, 17, 28]. Finally,
different kinds of antibiotic susceptibility information can be considered: either discrete, when the
objective is to distinguish susceptible from resistant (or non-susceptible) strains [10, 17, 21, 22], or
continuous, when one seeks to predict the minimum inhibitory concentration (MIC) of the
antimicrobial agent itself [3, 4, 8].
A critical challenge for the adoption of such predictive ML models by clinicians and microbiologists
resides in their level of interpretability and, ultimately, clinical action-driving ability. While the notion
of interpretability is somewhat ill-defined, a natural requirement for the end-user would be to
achieve the prediction from a limited number of genomic features that can be easily and
unambiguously interpreted as actual genetic determinants [25, 26]. This challenge is particularly
important using k-mer-based representations, for several reasons. Firstly, k-mers covering conserved
genomic regions are redundant and can be easily detected and filtered [29], but they define groups
of equivalent k-mers which are not always straightforward to interpret as genomic determinants
[21-23, 26]. Secondly, k-mers may not be specific to a given genomic region, hence may be hard to
annotate. This is especially the case for short k-mers, e.g., when k = 8 or k = 10 [4, 8]. Last but not
least, the k-mer-based representation of genomes intrinsically leads to very high-dimensional feature
spaces, with strongly correlated variables. Using k = 31 for instance, and depending on the bacterial
species considered, it is common to end up working with $10^5$ to $10^6$ (non-redundant) k-mers, many of
which are observed in almost the same sets of genomes, hence bringing almost the same
information regarding the studied phenotype.
We propose to rely on the adaptive cluster lasso (ACL) [30], an extension of Bühlmann et al. [31]
tailored to the high-dimension setting by means of a prior screening of variables. We implemented in
an R package a simple and efficient ACL-inspired strategy able to cope with the very high dimension
and strong correlations of k-mer-based representations, leading to sparse and interpretable genomic
signatures. This approach compared favorably to the standard lasso in a systematic validation study
focusing on K. pneumoniae. It provided a comparable level of performance while offering better
interpretability of the genomic determinants involved in the models. We could identify known and
potentially novel resistance determinants from the corresponding k-mer signatures, which allowed us to
extract meaningful scientific insights.
5.3 Methods
5.3.1 Datasets
Training dataset. We gathered the assembled genomes, provided as contigs, of 1665 strains previously used to
develop MIC prediction models for K. pneumoniae [8]. This set of genomes defines our training
dataset. We focused on the 10 clinically most relevant antibiotics listed in Table 1, which belong to
seven different antibiotic classes. The reference MICs were cast into resistant, susceptible and
intermediate according to the Clinical and Laboratory Standards Institute (CLSI) breakpoints. The
intermediate and resistant strains were finally merged into a common category, to define a binary
classification problem aiming to distinguish susceptible (S) from non-susceptible (NS) strains. Table 1
provides the number of S/NS phenotypes available for each selected drug.
Table 1. Dataset constitution. This table provides the number of susceptible (S) and non-susceptible (NS) strains available in the training and test dataset for the various antibiotics considered. piper.tazo stands for piperacillin/tazobactam. Note that a limited number of susceptible strains is available in the test dataset for aztreonam, and to a lesser extent cefepime and meropenem.
k-merization of the training dataset. The k-merization was computed from the contigs of all training
genomes, using the DBGWAS software [25], with a k-mer size of 31 and filtering out patterns with a
minor allele frequency (MAF) below 1%. DBGWAS allows for the deduplication of the strictly
equivalent k-mers by compacting overlapping non-branching paths of k-mers into unitigs, thanks to
the use of a compacted De Bruijn Graph (cDBG) (Figure 1 A). DBGWAS stores the profiles of
presence/absence of each unitig in the training genomes in a matrix $V$ such that $V_{i,j} = 1$ if the $j$-th unitig
is present in the $i$-th input genome and $V_{i,j} = 0$ otherwise (Figure 1, B1). Each column vector $V_{\cdot,j}$ is then
transformed according to its allele frequency: if its allele frequency exceeds 0.5, meaning that the unitig is
observed in more than 50% of the panel genomes, it is inverted as $V_{i,j} \leftarrow |1 - V_{i,j}|$ so that its MAF
corresponds to its average value. This transformation renders identical two originally complementary
vectors. Keeping only the unique patterns then leads to an optimal reduction of the number of
features, without modifying the intrinsic statistical signal (Figure 1 B2). These unique, MAF-filtered
patterns define the final variant matrix $X$, where $X_{i,j} = 1$ if the $j$-th pattern is found in the $i$-th genome,
and 0 otherwise. This process is described in detail in Jaillard et al. [25]. The DBGWAS files
describing the cDBG are kept for the further interpretation of the genomic signatures, allowing one to
visualize the unitigs of the selected patterns within their genomic environment.
In practice we carry out this k-merization process for each antibiotic separately, processing solely the
strains that have been phenotypically tested. The output of this k-merization step is a sparse variant
matrix X with, for instance in the case of the cefoxitin antibiotic, N = 1643 rows for the N cefoxitin-
phenotyped strains of the training panel and p = 1,234,397 columns representing the p distinct
patterns of presence/absence retained by DBGWAS. The matrix X is binary as DBGWAS only encodes
the presence or absence in the genomes. It is sparse as only around 13% of the values are not null.
Figure 1. K-merization of the training genomes. Illustration of the DBGWAS process of k-merization and variant matrix construction. Refer to Jaillard et al. [25] for further details.
Test dataset. To validate the predictive performance of the models, we built an independent test
dataset involving 634 strains, including 114 strains from our bioMérieux collection (NCBI Bioproject
PRJNA449293 and PRJNA597427) and 520 strains from the PATRIC database (https://www.patricbrc.org/).
These strains were mostly from the USA, the UK, Serbia, Greece and other European countries
and the MICs were obtained with either agar dilution, broth microdilution or Vitek 2 (bioMérieux,
Marcy l’Étoile, France) (see Supplementary Section S1). Table 1 provides the number of S/NS
phenotypes available in the test dataset.
5.3.2 Coping with highly correlated genomic features.
Logistic regression is a widely used generalized linear model addressing binary classification problems.
In our case, it consists of building a linear function defined for a strain represented by a vector
$x \in \{0, 1\}^p$ as:
$$f(x) = \beta_0 + \sum_{j=1}^{p} \beta_j x_j$$
where $p$ corresponds to the number of distinct patterns identified by DBGWAS, and $x$ encodes their
presence/absence in the strain genome. To estimate the model coefficients and simultaneously
select a limited number of patterns from a training panel of $n$ strains, one can rely on the L1 or lasso
penalty and consider the following optimization problem:
$$\min_{\beta_0, \beta} \; \frac{1}{n} \sum_{i=1}^{n} L\big(y_i, f(X_{i,\cdot})\big) + \lambda \lVert \beta \rVert_1 \qquad (1)$$
where $y_i = 0$ if the $i$-th strain, stored in the $i$-th row of the training matrix $X$, is susceptible and 1
otherwise. The function $L$ is the logistic loss function, which quantifies the discrepancy between the
true phenotypes $y_i$ of the strains and the predictions $f(X_{i,\cdot})$ obtained by the model. The $\lambda$ parameter
achieves a trade-off between this empirical error and the lasso regularization term, and is usually
optimized by cross-validation.
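To make the trade-off controlled by λ concrete, the lasso-penalized logistic objective described above can be evaluated directly for a given set of coefficients. The sketch below is a plain-Python illustration, not the solver used in this work; the function names are hypothetical, and averaging the loss over the n strains is an assumed scaling convention (the one used by glmnet), which only rescales λ.

```python
import math


def linear_score(x, beta0, beta):
    """Linear function f(x) = beta0 + sum_j beta_j * x_j."""
    return beta0 + sum(b * xj for b, xj in zip(beta, x))


def logistic_loss(y, score):
    """Negative log-likelihood of label y in {0, 1}: log(1 + e^score) - y * score."""
    softplus = max(score, 0.0) + math.log1p(math.exp(-abs(score)))  # stable form
    return softplus - y * score


def lasso_objective(X, y, beta0, beta, lam):
    """Average logistic loss plus the L1 penalty on the coefficients.

    The intercept beta0 is left unpenalized, as is customary.
    """
    n = len(X)
    empirical_error = sum(
        logistic_loss(yi, linear_score(xi, beta0, beta)) for xi, yi in zip(X, y)
    ) / n
    return empirical_error + lam * sum(abs(b) for b in beta)


# Toy data: 4 strains x 2 binary patterns, y = 1 for non-susceptible strains.
X = [[1, 0], [1, 1], [0, 0], [0, 1]]
y = [1, 1, 0, 0]
print(lasso_objective(X, y, beta0=0.0, beta=[0.0, 0.0], lam=0.1))   # null model
print(lasso_objective(X, y, beta0=-2.0, beta=[4.0, 0.0], lam=0.1))  # uses pattern 1
```

In this toy example the first pattern separates the two classes, so giving it a non-zero weight lowers the empirical error by more than the penalty it incurs; increasing `lam` shifts the balance back towards the null model.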
The feature selection ability of the lasso penalty is notoriously unstable in the presence of strong
correlation between features. This is particularly the case using k-mer based representations, making
it difficult to derive meaningful interpretations from the features selected by the model, and their
associated coefficients. We propose a simple and efficient three-step strategy to identify sparse and
interpretable genomic signatures.
Screening step. In this step, we screen features. For this purpose, we first fit a standard lasso-
penalized regression model on the original feature matrix X for several values of the regularization
parameter λ, and extract the set of features that are selected at some point on this regularization
path. Formally, letting $(\lambda_1, ..., \lambda_m)$ be the $m$ values of the considered grid of $\lambda$, and $B$ the $p \times m$ matrix
containing the model coefficients obtained by Equation 1, we define the set $a$ of active features as:
$$a = \{\, j \in \{1, ..., p\} : \exists\, k \in \{1, ..., m\} \text{ such that } B_{j,k} \neq 0 \,\}$$
and let $p_a = |a|$ be their number. Since the lasso cannot select more features than observations, we
typically end up with $p_a$ in the order of $N$ (i.e., $10^3$ in our case). We then extract the features which
are strongly correlated to the active ones from the entire feature matrix. For this purpose, we
compute a $p_a \times p$ matrix $G$ containing the pairwise correlations between the $p_a$ active features
identified beforehand and the $p$ original ones. Formally, $G_{i,j} = \mathrm{cor}(X_{\cdot,a_i}, X_{\cdot,j})$, where cor is the standard
Pearson correlation between vectors of MAF patterns across the genomes, which is a classical criterion
to quantify linkage disequilibrium (LD) between genomic features [32]. Since we rely on binary
variables encoding the presence/absence of features in the genomes, $G_{i,j}$ quantifies the extent to
which features $a_i$ and $j$ co-occur in the genomes. As $p_a$ is typically much smaller than $p$ (in the orders
of $10^3$ versus $10^6$ in our case), computing this matrix is much easier than computing the entire $p \times p$
correlation matrix. Finally, we extract the set $e$ of features that are strongly correlated to at least one
active feature as:
$$e = \{\, j \in \{1, ..., p\} : \exists\, i \in \{1, ..., p_a\} \text{ such that } G_{i,j} \geq s_1 \,\}$$
where the hyperparameter $s_1$ controls the minimum level of correlation required, and is referred to
as the screening threshold. This operation defines a set of $p_e = |e|$ features, called the set of
extended features. Obviously, we have $p_a \leq p_e \leq p$. In our context, we typically end up with a few
thousand extended features, hence $p_a < p_e \ll p$.
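The extension of the active set by correlation can be illustrated on a toy matrix. The sketch below assumes the active set has already been obtained from a lasso regularization path and only shows the correlation-based extension; the function names are hypothetical, and the use of the signed (rather than absolute) correlation is an assumption.

```python
import math


def pearson(u, v):
    """Pearson correlation between two equal-length numeric vectors.
    Returns 0.0 if either vector is constant (correlation undefined)."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    if su == 0 or sv == 0:
        return 0.0
    return cov / (su * sv)


def extended_features(X, active, s1=0.95):
    """Indices of features whose correlation with at least one active
    feature is >= s1. Active features are always included."""
    columns = list(zip(*X))                # one tuple per feature
    extended = set(active)
    for j, col in enumerate(columns):
        if any(pearson(columns[i], col) >= s1 for i in active):
            extended.add(j)
    return sorted(extended)


# Toy matrix: 6 genomes x 4 features; feature 1 duplicates feature 0.
X = [
    [1, 1, 0, 1],
    [1, 1, 0, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
print(extended_features(X, active=[0]))           # strict threshold
print(extended_features(X, active=[0], s1=0.7))   # looser threshold
```

Only the rows of the correlation matrix corresponding to active features are ever computed, which is what keeps the cost proportional to the number of active features times p rather than p squared.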
Clustering step. While the screening step identifies a limited number of features deemed sufficiently
correlated to the features identified by a standard lasso, the second step aims to explicitly define
groups, or clusters, of strongly correlated variables. We rely for this purpose on a bottom-up
agglomerative clustering procedure, as suggested by Bühlmann et al. [31]. More precisely, we first
define a $p_e \times p_e$ distance matrix $D$ between extended features, defined as $D_{i,j} = |1 - \mathrm{cor}(X_{\cdot,e_i}, X_{\cdot,e_j})|$.
This matrix is then used to carry out a hierarchical clustering, implemented in R by the hclust function,
using a minimum linkage criterion. The resulting dendrogram is finally cut at a height of $1 - s_2$, where
the second hyperparameter $s_2$, called the clustering threshold, controls the level of within-cluster
correlation.
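Assuming that the minimum linkage criterion corresponds to single linkage, cutting the dendrogram at a height of 1 − s2 is equivalent to taking the connected components of the graph that links two features whenever their distance is at most 1 − s2. The sketch below uses this equivalence with a small union-find structure instead of calling hclust; it is an illustration in Python with hypothetical function names, not the R implementation described here.

```python
import math


def pearson(u, v):
    """Pearson correlation; returns 0.0 if either vector is constant."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    return 0.0 if su == 0 or sv == 0 else cov / (su * sv)


def single_linkage_clusters(columns, s2=0.95):
    """Group features (given as columns) so that two features share a cluster
    when they are joined by a chain of pairs with distance
    |1 - cor| <= 1 - s2, i.e. a single-linkage dendrogram cut at 1 - s2."""
    n = len(columns)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if abs(1 - pearson(columns[i], columns[j])) <= 1 - s2:
                parent[find(j)] = find(i)  # merge the two clusters

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


# Toy extended features, one tuple per feature across 6 genomes.
columns = [
    (1, 1, 0, 0, 0, 0),
    (1, 1, 0, 0, 0, 0),
    (0, 0, 1, 1, 0, 0),
    (1, 1, 1, 0, 0, 0),
]
print(single_linkage_clusters(columns))            # s2 = 0.95
print(single_linkage_clusters(columns, s2=0.7))    # looser threshold
```

The pairwise loop is quadratic in the number of extended features, which remains affordable for the few thousand features retained after screening.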
Learning step. Finally, we summarize each identified cluster as a new composite variable, defined as
the average of the original variables defining the cluster, and carry out a standard lasso at the cluster
level. Since in our case the original variables encode the presence/absence of a given DBGWAS
pattern in the genomes, these composite variables correspond to the proportion of patterns involved
in a cluster that are present/absent in the genomes. Figure 2 summarizes this three-step method.
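The construction of the composite variables amounts to averaging, for each genome, the columns belonging to each cluster. A minimal sketch follows, with hypothetical names and toy data; the resulting cluster-level matrix would then be passed to a standard lasso solver.

```python
def composite_features(X, clusters):
    """Build the cluster-level matrix: for each genome (row of X) and each
    cluster (list of column indices), take the mean of the member columns,
    i.e. the proportion of the cluster's patterns present in that genome."""
    return [
        [sum(row[j] for j in cluster) / len(cluster) for cluster in clusters]
        for row in X
    ]


# Toy data: 3 genomes x 4 patterns, grouped into 3 clusters.
X = [
    [1, 1, 0, 1],
    [1, 0, 1, 0],
    [0, 0, 0, 1],
]
clusters = [[0, 1], [2], [3]]
for row in composite_features(X, clusters):
    print(row)
```

A cluster of perfectly correlated patterns yields values of exactly 0 or 1, while intermediate values indicate genomes carrying only part of the cluster.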
Figure 2. Three-step process. Illustration of the proposed three-step procedure.
5.3.3 Model selection
Our approach involves three hyperparameters that must be optimized for each antibiotic: the
screening and clustering thresholds s1 and s2 used to build the clusters of correlated variables, and
the regularization parameter λ involved in the final cluster-level lasso model. We relied on the
glmnet software [33] to fit the lasso models involved in both the screening and learning steps. We
used the default heuristic proposed by the software to define the grids of candidate values for the
regularization parameters. The screening and clustering thresholds were both systematically set to
0.95 based on preliminary experiments (see Supplementary Section S2), and we relied on a 10-fold
cross-validation procedure to optimize the regularization parameter involved in the final cluster-level
lasso model, as we now describe.
We first split the training dataset into ten folds, stratified by sequence type and phenotype. For each
of the ten folds, nine tenths of the dataset were used to screen variables and identify clusters. The final
cluster-level lasso model was then fit and applied to the held-out strains, for each candidate value of
the regularization parameter. Our model selection strategy aimed to simultaneously maximize its
sensitivity and specificity, respectively defined as the fractions of correctly classified non-susceptible
and susceptible strains. For this purpose, a Receiver Operating Characteristic (ROC) curve was built
for each candidate regularization parameter after completion of the cross-validation procedure, and
the point closest to the optimal one (defined by a true positive rate of 1 and a false positive rate of 0)
was used to define the optimal sensitivity/specificity trade-off. Following Hicks et al. [28], we refer to
the average of the optimal sensitivity and specificity as balanced accuracy (bACC). Finally, we
selected the sparsest model whose balanced accuracy was within one point of the maximum, in
order to reduce the risk of overfitting. In practice, this cross-validation procedure was repeated three
times and the selection was based on average balanced accuracy values obtained across the three
repetitions. Supplementary Figure S4 illustrates this model selection strategy.
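The selection rule described above can be sketched in Python as follows. The function names and toy values are hypothetical, and interpreting "one point" as 0.01 on a 0-1 scale is an assumption.

```python
from math import hypot


def best_operating_point(roc_points):
    """Return the (fpr, tpr) point closest to the ideal corner (fpr=0, tpr=1)."""
    return min(roc_points, key=lambda p: hypot(p[0], 1.0 - p[1]))


def balanced_accuracy(fpr, tpr):
    """Average of sensitivity (tpr) and specificity (1 - fpr)."""
    return (tpr + (1.0 - fpr)) / 2.0


def select_sparsest(candidates, tolerance=0.01):
    """Smallest support whose bACC is within `tolerance` of the best bACC.

    candidates: list of (support_size, bacc) pairs, one per candidate lambda.
    ASSUMPTION: "one point" is read as 0.01 on a 0-1 scale.
    """
    best = max(bacc for _, bacc in candidates)
    eligible = [c for c in candidates if c[1] >= best - tolerance]
    return min(eligible, key=lambda c: c[0])


# Toy ROC curve for one candidate lambda.
roc = [(0.0, 0.0), (0.05, 0.80), (0.10, 0.92), (0.30, 0.97), (1.0, 1.0)]
fpr, tpr = best_operating_point(roc)
bacc = balanced_accuracy(fpr, tpr)

# Toy (support size, average bACC) pairs along the regularization path.
candidates = [(5, 0.880), (12, 0.905), (30, 0.912), (80, 0.910)]
chosen = select_sparsest(candidates)
print(fpr, tpr, round(bacc, 3), chosen)
```

Here the model with 12 features is preferred to the one with 30 features, since its balanced accuracy is less than one point below the maximum.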
5.3.4 Interpretation of the predictive signature
We use the DBGWAS software to interpret the genomic signatures, based on the cDBG built during
the k-merization step. The unitigs defining the patterns involved in the final model are visualized
within their neighborhood in the cDBG, which represents their genomic environment and hence provides
insight into the type of variant involved, typically a plasmid-based acquired gene versus a local
mutation (single nucleotide polymorphism (SNP) or indel) in a chromosomal region.
5.3.5 Evaluation of the computational requirements
We evaluate the computational requirements of the standard lasso and cluster-lasso procedures by
measuring the time and memory required to compute a regularization path involving 100 values of
the regularization parameter. For the standard lasso, this simply amounts to calling the glmnet
function of the glmnet R package, using the variant matrix provided by DBGWAS. For the cluster-
lasso procedure, this amounts to:
i. making the same call to glmnet to identify the set of active variables,
ii. computing the pa x p correlation matrix G in order to identify the set of extended features,
iii. building the clusters of correlated variables,
iv. making a second call to glmnet, using the variant matrix defined at the cluster-level.
This procedure is repeated five times for each drug, using a single Xeon E5-2690-V3 CPU.
5.4 Results
5.4.1 Cross-validation results
Table 2 provides the results obtained in terms of cross-validation performance and support size of
the models. The predictive performance is measured by the area under the ROC curve (AUC) and
balanced accuracy. Additional performance indicators are provided in Supplementary Table S1. The
support size of a model is defined as the number of features it involves, which correspond to
individual DBGWAS patterns for the lasso and to clusters of DBGWAS patterns for the cluster-lasso.
We also report the overall number of unitigs involved, which is only slightly higher than the number
of features for the lasso, since a pattern can be shared by several unitigs in total LD. In contrast, this overall number is
markedly higher for the cluster-lasso strategy, because of the pattern clustering.
Table 2. Cross-validation results. This table summarizes the cross-validation results obtained by the lasso and cluster-lasso strategies for the 10 antibiotics, in terms of balanced accuracy (bACC), AUC, support size, overall number of unitigs involved and maximal number of unitigs associated to a single pattern or cluster (between brackets).
Both strategies show similar performance in terms of both balanced accuracy and AUC, confirming
that taking into account, or not, the correlation between features has a limited impact in terms of
predictive performance. We also note that the model support is often slightly smaller with cluster-
lasso (for 8 drugs out of 10), suggesting that several features selected separately with the lasso
ended up merged in a single cluster by the cluster-lasso. As expected, the overall number of unitigs
involved in a cluster-lasso model is significantly larger. Interestingly, it is not evenly distributed across
its features. In the meropenem model, for instance, 159 out of the 164 unitigs defining the model
features are associated to a single feature, suggesting that it corresponds to the presence of a gene,
as confirmed in the interpretation analysis depicted in the next section.
Finally, Figure 3 provides a graphical representation of the lasso and cluster-lasso signatures obtained
for ceftazidime, which are of moderate complexity. The heatmap shows the correlation between the
patterns involved in one signature and/or the other, and highlights the 8 major clusters identified by
the cluster-lasso strategy (clusters including more than 10 patterns). While all the patterns defining a
cluster have by construction a similar level of predictive power, the lasso model usually selected a
single one of them. There is an exception for the 3rd cluster, shown in green in the zoomed area of
Figure 3, where two patterns were selected as distinct features of the lasso model.
By explicitly reconstructing and providing these clusters of correlated features to the learning
algorithm, the cluster-lasso strategy leads to a more meaningful characterization of the genetic
determinants involved, as we describe below.
Figure 3. Correlation within features selected in the signatures. This heatmap shows the correlation matrix built from the features selected by the lasso and the cluster-lasso (identified by the orange and blue bars shown above the heatmap, respectively), for ceftazidime. The corresponding values of model coefficients are represented by green bars. The 8 major clusters (involving more than 10 patterns) of the cluster-lasso signatures are identified by a dedicated color ranging from red to grey. A zoom of the top left side of the figure allows a better reading of the colored bars for the major clusters 1, 3, 7 and 8.
5.4.2 Model interpretation
We focus on two drugs to illustrate the improved interpretability offered by cluster-lasso signatures:
meropenem, where the interpretation is straightforward, and cefoxitin, whose signature is among
those of highest support. Additional results obtained for the remaining drugs are deferred to
Supplementary Materials, Section S5.
As shown in Table 2, the lasso and cluster-lasso meropenem models involve 8 and 3 features,
respectively. As shown in Figure 4(B), each lasso feature corresponds to a single unitig, while the
cluster-lasso signature involves a large cluster of unitigs (159 out of the 164 involved). Figure 4(A)
shows the magnitude of the model coefficients. It reveals that the cluster-lasso signature is
essentially driven by a single prominent feature, while 4 to 5 features of the lasso signature have a
non-negligible weight. The major feature of the cluster-lasso signature corresponds to the large
cluster of correlated patterns, and the DBGWAS visualization (Figure 4(C)) shows that the
corresponding unitigs are organized as a long linear path in the cDBG. This suggests that this cluster
corresponds to an entire gene. The annotation provided by DBGWAS shows the gene to be the Class
A beta-lactamase blaKPC. The DBGWAS visualization obtained for the lasso signature indicates that 3
of the 8 features – features 1, 2 and 4 – are also co-located in a region of the cDBG annotated as
blaKPC. Taken alone, the fact that the lasso selected these specific unitigs within the blaKPC gene would suggest that the
resistance determinants involved are SNPs or indels. While the gene-level annotation is the same as
that obtained with the cluster-lasso, the interpretation of the signature in terms of genetic variants is
therefore radically different. A closer look at the lasso signature reveals that the 3 blaKPC features are
actually strongly correlated: they are often observed together. Unsurprisingly, they belong to the
largest cluster involved in the cluster-lasso signature, and interestingly, their cumulative weight is
approximately equal to that of the cluster-lasso feature (3.4 instead of 3.3). By explicitly detecting
that these features are correlated, and merging them into a single feature, together with additional
correlated features not even involved in the lasso signature, the cluster-lasso leads to a more
meaningful interpretation of the underlying prediction model, in two aspects. Firstly, it captures the
true nature of the genomic determinant involved: the presence of the blaKPC gene, as opposed to
mutations within the gene. Secondly, it assesses the overall contribution of the gene presence in the
decision rule, while, in the lasso signature, this contribution is shared by several distinct yet
correlated features.
Figure 4. Interpretation of the meropenem signatures. This figure provides a detailed comparison of the lasso (left) and cluster-lasso (right) signatures. A) Absolute value of the coefficients of the models. B) Number of unitigs involved in the features of the models. C) Visualization of the first subgraph obtained by DBGWAS for each signature. Nodes of the graphs correspond to unitigs of the cDBG built by DBGWAS from the training panel of genomes, as illustrated in Figure 1 and detailed in [25]. Colors indicate which unitigs of the graphs in panel C are related to which features of the models in panels A and B.
Likewise, Figure 5 presents the DBGWAS analysis of the lasso and cluster-lasso signatures obtained
for cefoxitin. We focused on the first two subgraphs provided by the software, which represent the
two genomic neighbourhoods of the most important patterns, or clusters of patterns, involved in the
models. The subgraphs are indeed ordered according to the maximal absolute value of model
coefficients among all patterns or clusters involved in the subgraph. While DBGWAS identifies the
same resistance genes in both methods (the porin gene ompK36 and blaKPC), the nature of the
underlying resistance determinants cannot be deduced from the lasso signature. The ompK36-
annotated subgraph obtained for the cluster-lasso signature (top-right panel of Figure 5) involves 2
clusters gathering 9 unitigs (clusters 1 and 3), and presents a topology attributable to a local
polymorphism: a complex bubble, with a fork separating susceptible (blue) and resistant (red) strains,
as described in [25]. The corresponding lasso subgraph, shown on the top-left panel, includes 4
patterns (patterns 1, 2, 32 and 56), each with its own model coefficient value, represented by
4 shades of colors ranging from blue to red. These distinct model coefficient values can lead to wrong
conclusions regarding the individual importance of the corresponding unitig sequences. Indeed,
aligning these unitigs with annotated ompK36 sequences reveals that features 2 and 56 both
represent the wild type, while features 1 and 32 align to the insertion of two amino acids in the L3
loop, as described in Novais et al. [34] (Supplementary Figure S6). The second lasso subgraph
(bottom-left panel of Figure 5) includes a single feature of the signature (shown in purple),
surrounded by seven nodes (shown in grey), among which two are annotated as blaKPC. The node of
the signature is however not annotated itself, hence the subgraph could be interpreted as a local
polymorphism in the promoter region of the blaKPC gene. The cluster-lasso subgraph shown on the
bottom-right panel reveals however that this unitig was selected by the lasso among hundreds of
highly correlated unitigs. They all belong to cluster 2, which includes the complete blaKPC gene (shown
between brackets) and plasmid sequences in strong LD.
Figure 5. DBGWAS visualizations for the interpretation of the cefoxitin signatures. This figure presents the first two subgraphs obtained by DBGWAS for the lasso and cluster-lasso signatures. The DBGWAS subgraphs are ordered by decreasing maximal absolute value of model coefficient among all patterns/clusters involved in the subgraph. Likewise, pattern and cluster identifiers are ordered by decreasing absolute value of model coefficient, meaning for instance that pattern/cluster #1 has a greater weight in the model than pattern/cluster #2. The nodes (unitigs) belonging to patterns/clusters of the signatures are colored by the value of their model coefficients (from blue to red, indicating negative and positive values, respectively). The grey nodes/unitigs, not involved in the models, represent their genomic neighbourhood. The nodes for which an annotation related to antibiotic resistance was found are surrounded by a black circle. Bold brackets are used on the bottom right subgraph to highlight these black-circled nodes. This particular subgraph gathers 7 clusters, whose identifiers are reported on the picture. Cluster 2 is the largest one, and includes the blaKPC-annotated nodes. The dashed arrow shows which node of the cluster-lasso blaKPC subgraph corresponds to the one selected by the lasso.
Performance on the test set
Table 3 shows the predictive performance obtained on the test set by the lasso and cluster-lasso
signatures, as well as the models defined by Nguyen et al. [8], in terms of sensitivity, specificity and
balanced accuracy.
Table 3. Test set results. This table summarizes the results obtained on the test dataset by the lasso, cluster-lasso and Nguyen et al. [8] models for the 10 antibiotics, in terms of sensitivity, specificity and balanced accuracy (bACC). The MICs predicted by the Nguyen et al. [8] models were converted into S/NS categorical phenotypes according to the CLSI breakpoints.
We first noted that the lasso and cluster-lasso strategies reached a similar level of balanced accuracy
for most drugs, although they did not always achieve the same trade-off in terms of sensitivity and
specificity. We noted however that the confidence intervals of the corresponding sensitivities and
specificities largely overlapped for all drugs but ceftazidime (Figure 6 and Supplementary Figure S8),
indicating that they were not significantly different between lasso and cluster-lasso, except for one
drug.
Figure 6. Test set results. This figure represents the ROC curves obtained for cefepime, cefoxitin, ceftazidime and meropenem by the lasso (red) and cluster-lasso (blue) signatures, as well as their associated sensitivities / specificities and that of the Nguyen et al. [8] models, with their 95% confidence intervals.
We also noted that the models proposed by Nguyen et al. [8] usually achieved a lower level of
balanced accuracy. This was the case for all drugs but cefepime, imipenem and meropenem, where
the performance remained comparable. Apart from these three drugs, the loss ranged from 6.6
points for piperacillin-tazobactam to 23.6 points for aztreonam. Strikingly, these models usually
achieved a much lower level of specificity than the lasso and cluster-lasso ones. This was especially
the case for ceftazidime, piperacillin-tazobactam, tetracycline and aztreonam, where the specificity
fell below 50%. In the latter case, every single strain was actually classified as resistant, hence the
specificity was null. As can be seen from Figure 6 and Supplementary Figure S8, however, the
confidence intervals of their sensitivities and specificities often overlapped with the ROC curves of
the lasso and cluster-lasso models. Note, however, that these models were trained to predict MICs, which
we subsequently cast into S/NS categories according to the CLSI breakpoints. While this strategy may
not be optimal to evaluate the ability of these models to accurately predict MICs, we noted that the
agreement between reference and predicted MICs was much lower on this dataset than reported
in the original publication (see Supplementary Table S3).
We often observed a serious drop between the predictive performance estimated by cross-validation
and that observed for the test set: more than 5 points of balanced accuracy for 6 drugs out of 10, and
roughly 10 points or more for amikacin, cefoxitin, imipenem and meropenem (13.4, 10.2, 10.9 and 9.9
points, respectively). This suggested that the training dataset taken from Nguyen et al. [8] could not
account for the entire diversity displayed by K. pneumoniae. A simple resistome-based analysis done
using the kleborate software revealed indeed that the prevalence of well-known resistance genes
was sometimes very different in the two panels. This is illustrated in Figure 7 for amikacin and
imipenem, which suffered from the highest performance drop. Redesigning the training and test
datasets by shuffling the original ones in order to obtain a homogeneous split fixed this
generalization issue (Supplementary Section S9). This illustrates that while machine learning models
can indeed succeed in learning accurate prediction rules, they fail to generalize when the dataset
they are trained on does not account for the overall diversity of the bacterial species.
Figure 7. Resistome analysis. This figure compares the training and test panels of genomes in terms of predictive performance and resistome constitution for the drugs amikacin (top) and imipenem (bottom). Left: predictive performance in terms of sensitivity, specificity, bACC and AUC estimated by cross-validation on the training set and measured on the test set, using the lasso signatures. Right: comparison of the resistome constitutions. Each kleborate resistance marker is represented by its prevalence in the resistant strains of the training (x-axis) and test (y-axis) panels.
Finally, Table 3 and Supplementary Figure S9 show an uneven level of prediction performance
among the ten antibiotics considered. The best performances were obtained for ciprofloxacin and
ceftazidime, with an AUC around 95% using either the original or the redesigned datasets
(Supplementary Figure S9). The poorest performances were obtained for two beta-lactams: cefepime,
a 4th-generation cephalosporin, and the monobactam aztreonam. This may be due to a reduced
penetrance of their genetic determinants, as described in human genetics [35], because more
complex resistance mechanisms are involved, including efflux pumps, gene regulation, or plasmid
copy number [36-38].
5.4.3 Computational requirements
Figure 8 indicates that while the duration of the cluster-lasso was on average about three times
longer than the lasso (571 vs 180 seconds), it took only about 10 minutes to obtain an entire
regularization path defined at the cluster-level. Optimizing the regularization parameter using our
cross-validation process therefore took approximately 5 hours on a single CPU. We noted that while
the time required by the lasso was relatively homogeneous across drugs, it was more variable for the
cluster-lasso. This variability was due to the fact that the lasso used in the first step identified a
variable number of active features, which directly impacted the time required to screen the
remaining ones. This is illustrated in Supplementary Figure S7.
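The order of magnitude of the cross-validation cost reported above can be checked with a short calculation. This sketch assumes that one cluster-level regularization path is computed per fold and per repetition (10 folds, 3 repetitions) at the average cost of 571 seconds, and ignores any additional overhead.

```python
folds = 10              # 10-fold cross-validation
repetitions = 3         # procedure repeated three times
seconds_per_path = 571  # average cluster-lasso time per regularization path

total_hours = folds * repetitions * seconds_per_path / 3600
print(round(total_hours, 2))
```

Under these assumptions the total is a little under 5 hours, consistent with the approximate figure given above.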
Figure 8. Time and memory requirements. The boxplots represent the variability of the time (panel A) and maximum memory (panel B) required to generate a lasso or cluster-lasso regularization path for the ten antibiotics.
In terms of memory, we noted that the cluster-lasso procedure led to an overhead of about 2 GB
with respect to the lasso, which was related to the computation of the correlation matrix G. In
practice, we limited this overhead by computing this matrix by slices, considering subsets of p’ =
10,000 features and computing pa × p’ matrices instead of the entire pa × p matrix at once. Altogether,
this led to a computationally efficient procedure, making it possible to identify cluster-level signatures in a few
hours, with a limited memory footprint. We note that it could be straightforwardly parallelized, using
several CPUs to compute the various slices of the correlation matrix G.
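The slicing strategy can be sketched as follows in Python. This is an illustration under stated assumptions rather than the implementation used in this study: the function names are hypothetical, the set of active features is assumed to be non-empty, and features are retained when their signed correlation with an active feature is at least s1. Only one pa × p’ block is held in memory at a time.

```python
from math import sqrt


def standardize(col):
    """Center a column and scale it to unit norm (all zeros if constant)."""
    n = len(col)
    mean = sum(col) / n
    centered = [v - mean for v in col]
    norm = sqrt(sum(c * c for c in centered))
    return [c / norm for c in centered] if norm > 0 else [0.0] * n


def correlation_slices(columns, active, slice_size):
    """Yield (start, block), where block[a][k] = cor(active[a], feature start + k)."""
    z_active = [standardize(columns[i]) for i in active]
    for start in range(0, len(columns), slice_size):
        z_slice = [standardize(c) for c in columns[start:start + slice_size]]
        block = [
            [sum(u * v for u, v in zip(za, zc)) for zc in z_slice]
            for za in z_active
        ]
        yield start, block


def screen_by_slices(columns, active, s1, slice_size):
    """Extended features, computed without forming the full pa x p matrix."""
    extended = []
    for start, block in correlation_slices(columns, active, slice_size):
        for k in range(len(block[0])):
            if any(row[k] >= s1 for row in block):
                extended.append(start + k)
    return extended


columns = [  # toy presence/absence features, stored column-wise
    [1, 1, 1, 0, 0, 0],
    [1, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 1, 0],
    [0, 0, 0, 1, 1, 1],
]
sliced = screen_by_slices(columns, active=[0], s1=0.95, slice_size=2)
print(sliced)
```

Since each slice is processed independently, the loop over slices is the natural place to distribute the work over several CPUs.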
5.5 Discussion
Representing bacterial genomes using k-mers leads to very high-dimensional representations with
strong correlation structures. This may hinder a meaningful interpretation of predictive models built
by sparse ML strategies like lasso-penalized regressions [39] or decision-tree-based algorithms [40],
which are known to be unstable in this case: when some features are strongly correlated, they tend
to pick one, or a few, of them arbitrarily [41]. This instability may not be an issue in terms of
predictive performance: as long as one feature among a group of correlated ones appears in the
model, the prediction may be unchanged. It may however have a severe impact in terms of
interpretability, as the features selected by the model may provide an incomplete or erroneous
characterization of the causal resistance determinant.
We propose a simple and computationally efficient strategy to cope with the strong correlation
structures inherent to k-mer-based representations, and build sparse and meaningful genomic
signatures. In a systematic study on thousands of strains of K. pneumoniae, our
approach compared favorably to the state of the art, providing a comparable level of
performance while offering greater interpretability of the genomic features involved in the models.
On this challenging, genetically flexible bacterial species with significant accessory genome
components, this new approach allowed us to extract meaningful scientific insights from the identified
signatures, as further detailed in Section S5 of the Supplementary Materials.
Central to our approach is a three-step strategy, where a sparse ML algorithm is first used to screen
features in a generic manner, which are then extended to clusters of strongly correlated features,
ultimately considered as candidate features to be included in the final antibiotic resistance prediction
model. While we here relied on lasso-penalized logistic regression for both the screening and final
learning stages, this principle is generic and could readily be transposed to other sparse ML
algorithms, like xgboost [4, 8] or set cover machines [26]. Likewise, it could straightforwardly be
extended to handle MICs or other phenotypic traits, as well as other types of genomic features (e.g.,
relying on SNPs instead of k-mers).
Several alternative strategies could be considered to handle correlations between k-mers. Most
related to our approach are the elastic-net and the group-lasso strategies, which also rely on logistic
regression – and more generally on generalized linear models – but with alternative regularization
penalties. The elastic-net penalty combines the lasso and the ridge penalties, which leads to sparse
models with a grouping mechanism: correlated features tend to be selected together [42]. This
approach was recently shown to be efficient in the context of bacterial genome-wide association
studies (GWAS), providing increased statistical power for the identification of genotype-phenotype
associations and accurate prediction rules [43]. As we demonstrate in Supplementary Section S10,
however, it remains limited in its ability to provide interpretable predictive signatures, for several
reasons. First, while it has the effect of stabilizing the lasso solution and of simultaneously activating
groups of correlated features, these groups are not defined explicitly, which intrinsically makes the
interpretation of the model difficult. Moreover, while the parameter controlling the trade-off
between the lasso and ridge penalties had a direct impact on the number of selected features, it had
little impact on the predictive performance of the model, thereby making it difficult to optimize
objectively. Finally, we empirically observed that it led to a partial and heterogeneous reconstruction
of the genomic features obtained by the cluster-lasso: a significant fraction of the cluster members
were not selected by the elastic-net, and the individual weights associated to the selected ones
greatly varied, although their level of predictive power was comparable.
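For reference, the elastic-net penalized logistic regression discussed above can be written as follows. The parametrization shown is the one used by the glmnet software, which is an assumption here since the exact form is not stated in the text; $\ell(\beta)$ denotes the log-likelihood of the logistic model, $n$ the number of strains, and the intercept is omitted for brevity.

```latex
\hat{\beta} = \arg\min_{\beta} \; -\frac{1}{n}\,\ell(\beta)
  + \lambda \left( \alpha \lVert \beta \rVert_1
  + \frac{1-\alpha}{2} \lVert \beta \rVert_2^2 \right),
  \qquad \alpha \in [0, 1]
```

Setting $\alpha = 1$ recovers the lasso and $\alpha = 0$ the ridge penalty. The parameter $\alpha$ is the trade-off parameter between the two penalties referred to in the paragraph above.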
The group-lasso penalty leverages predefined groups of features, ensuring that all features of a given
group are either active or inactive simultaneously [44]. This strategy was for instance considered in
human GWAS, using groups of SNPs defined spatially to account for their LD [45]. Transposing this
idea to bacterial genomes is challenging since no such prior information is available to define groups,
as LD can be genome-wide [29]. A solution could be to identify clusters of correlated k-mers using
agglomerating strategies [31], but this is hard to carry out in practice on the high-dimensional datasets
involving 10^5 to 10^6 features encountered in our application. Our approach can therefore be seen as a
simple and efficient strategy to approximate such a group-lasso process in very high-dimensional
settings. Instead of collapsing groups of correlated features into composite variables, a natural
extension of our method would however be to rely on a group-lasso penalized regression defined at
the cluster level. Each feature would then be granted its own weight, which could better
reflect its individual predictive power. However, we empirically observed that the variability of the weights within a
cluster was very small, as shown in Supplementary Figure S13, which indicates that
keeping the features separate or averaging them is essentially equivalent. In practice, we find it
easier to explicitly collapse each cluster to a single composite variable to interpret the model
parameters.
On the practical side, our method involves two hyper-parameters, besides the regularization
parameter, to identify active variables and to build the final model. Although these so-called
screening and clustering thresholds did not have a strong influence in this study (Supplementary
Section S2), they may be cumbersome to optimize in practice for other applications. A natural
extension to our method would be to consider re-sampling strategies in the clustering step, in order
to identify stable clusters, whose constitution would be robust to the precise definition of the
clustering threshold [46]. Alternatively, one could rely on tree-guided lasso penalization to leverage
the entire dendrogram during the final learning step, which would then simultaneously identify
clusters and learn the prediction model [47].
Regarding AMR prediction, our study conducted on K. pneumoniae confirms several observations made
recently, namely that k-mer-based approaches can learn sparse prediction rules without any prior
information, that predictions are more accurate with R/S models than with MICs, and that the level of
predictive performance can vary by antibiotic [26, 28]. Importantly, our study involved a novel panel
of 634 K. pneumoniae strains for the validation of the prediction models and suggested that the
problem is more challenging than reported in Nguyen et al. [8]. The figures reported in that study
were indeed probably optimistic because the genome panel considered did not account for the
overall genomic diversity of the K. pneumoniae species (Supplementary Section S1). The 634
additional strains with genomes and phenotypes considered in this study will help to learn more
accurate and generalizable prediction models.
Finally, the ML methods developed in this study are available in a generic R package that can be
easily transposed to other applications, not necessarily involving k-mers nor AMR phenotypes. On
the challenging dataset considered in this study, involving more than a thousand strains for more
than a million genomic features, the computational requirements remained limited and the
signatures could be identified in a few hours on a standard workstation. Coupled with the enriched
level of interpretability they offer, we believe our approach will help define prediction models
amenable to routine diagnostics.
5.6 References
1. Gordon NC, Price JR, Cole K, Everitt R, Morgan M, Finney F, et al. Prediction of Staphylococcus
aureus Antimicrobial Resistance by Whole-Genome Sequencing. Journal of Clinical Microbiology
2014;52(4):1182–1191.
2. Walker TM, Kohl TA, Omar SV, Hedge J, Elias CDO, Bradley P, et al. Whole-genome sequencing for
prediction of Mycobacterium tuberculosis drug susceptibility and resistance: a retrospective cohort
study. The Lancet Infectious Diseases 2015;15:1193–1202.
3. Eyre DW, De Silva D, Cole K, Peters J, Cole MJ, Grad YH, et al. WGS to predict antibiotic MICs for
Neisseria gonorrhoeae. The Journal of Antimicrobial Chemotherapy 2017;72(7):1937–1947.
4. Nguyen M, Long SW, McDermott PF, Olsen RJ, Olson R, Stevens RL, et al. Using Machine Learning
To Predict Antimicrobial MICs and Associated Genomic Features for Nontyphoidal Salmonella.
Journal of Clinical Microbiology 2019;57(2).
5. Tyson GH, McDermott PF, Li C, Chen Y, Tadesse DA, Mukherjee S, et al. WGS accurately predicts
antimicrobial resistance in Escherichia coli. Journal of Antimicrobial Chemotherapy 2015;70(10).
6. Moradigaravand D, Palm M, Farewell A, Mustonen V, Warringer J, Parts L. Prediction of antibiotic
resistance in Escherichia coli from large-scale pan-genome data. PLOS Computational Biology
2018;14(12):1–17.
7. Deng X, Memari N, Teatero S, Athey T, Isabel M, Mazzulli T, et al. Whole-genome Sequencing for
Surveillance of Invasive Pneumococcal Diseases in Ontario, Canada: Rapid Prediction of Genotype,
Antibiotic Resistance and Characterization of Emerging Serotype 22F. Frontiers in Microbiology
2016;7:2099.
8. Nguyen M, Brettin T, Long SW, Musser JM, Olsen RJ, Olson R, et al. Developing an in silico
minimum inhibitory concentration panel test for Klebsiella pneumoniae. Scientific reports
2018;8(1):421.
9. Su M, Satola SW, Read TD. Genome-Based Prediction of Bacterial Antibiotic Resistance. Journal of
Clinical Microbiology 2019;57(3).
10. Yang Y, Niehaus KE, Walker TM, Iqbal Z, Walker AS, Wilson DJ, et al. Machine Learning for
Classifying Tuberculosis Drug-Resistance from DNA Sequencing Data. Bioinformatics 2017;p. btx801.
11. Coll F, McNerney R, Preston MD, Guerra-Assunção JA, Warry A, Hill-Cawthorne G, et al. Rapid
determination of anti-tuberculosis drug resistance from whole-genome sequences. Genome
Medicine 2015;7(1):51.
12. Bradley P, Gordon NC, Walker TM, Dunn L, Heys S, Huang B, et al. Rapid antibiotic-resistance
predictions from genome sequence data for Staphylococcus aureus and Mycobacterium tuberculosis.
Nature Communications 2015;6:10063.
13. Tanmoy AM, Westeel E, De Bruyne K, Goris J, Rajoharison A, Sajib MS, et al. Salmonella enterica
Serovar Typhi in Bangladesh: exploration of genomic diversity and antimicrobial resistance. mBio
2018;9(6):e02112–18.
14. Miotto P, Tessema B, Tagliani E, Chindelevitch L, Starks AM, Emerson C, et al. A standardised
method for interpreting the association between mutations and phenotypic drug resistance in
Mycobacterium tuberculosis. European Respiratory Journal 2017;50(6).
15. Mahé P, El Azami M, Barlas P, Tournoud M. A large scale evaluation of TBProfiler and Mykrobe for
antibiotic resistance prediction in Mycobacterium tuberculosis. PeerJ 2019 May;7:e6857.
16. Gygli SM, Borrell S, Trauner A, Gagneux S. Antimicrobial resistance in Mycobacterium tuberculosis:
mechanistic and evolutionary perspectives. FEMS Microbiology Reviews 2017 03;41(3):354–373.
17. Chen ML, Doddi A, Royer J, Freschi L, Schito M, Ezewudo M, et al. Beyond multidrug resistance:
Leveraging rare variants with machine and statistical learning models in Mycobacterium tuberculosis
resistance prediction. EBioMedicine 2019.
18. Palomino JC, Martin A. Drug resistance mechanisms in Mycobacterium tuberculosis. Antibiotics
2014;3:317–340.
19. Palmer AC, Kishony R. Understanding, predicting and manipulating the genotypic evolution of
with β-lactamase Sfh-1 from Serratia fonticola. The blaPFM-1 gene was chromosomally located and
likely acquired. Variants of PFM-1 sharing 90% to 92% amino acid identity were identified in bacterial
species belonging to the Pseudomonas fluorescens complex, including Pseudomonas libanensis (PFM-
2) and Pseudomonas fluorescens (PFM-3), highlighting that these species constitute reservoirs of
PFM-like encoding genes.
6.2 Main text
Metallo-β-lactamases (MBLs) are zinc-dependent enzymes that can catalyze the hydrolysis of
virtually all β-lactam antibiotics (including carbapenems) except for monobactams and that are
resistant to the β-lactamase inhibitors clavulanate, tazobactam, and avibactam (1). They constitute a
highly diverse family of enzymes and can be categorized into three subclasses, namely, B1, B2, and
B3 (2). The subclass B1 enzymes are the most clinically important since they comprise MBLs such as
IMP-1, NDM-1, SPM-1, KHM-1, VIM-1, and VIM-2 (3), widely identified in Enterobacteriaceae,
Acinetobacter spp., and Pseudomonas spp. Subclass B2 includes CphA (4, 5), ImiS (6, 7), and AsbM1
(8), which are intrinsic enzymes in Aeromonas spp., and Sfh-I (9) from the occasionally pathogenic
species Serratia fonticola. These carbapenemases are monozinc enzymes that usually show much
higher hydrolysis rates against carbapenem substrates than against other β-lactams (9).
Production of MBLs in the Pseudomonas genus is frequently observed, with acquired MBL-encoding
genes (blaIMP, blaVIM, blaSPM) being reported worldwide mainly in Pseudomonas aeruginosa and, to a
lesser extent, in Pseudomonas fluorescens (10, 11). In addition, intrinsic MBL genes encoding subclass
B3 POM-1-like and PAM-1-like enzymes have been identified in Pseudomonas otitidis and
Pseudomonas alcaligenes, respectively (12–14).
P. fluorescens and related species belonging to the same complex are rarely associated with infections
in human medicine (15). Nevertheless, P. fluorescens can cause bloodstream infections in humans,
and most reported cases have been iatrogenic (16). Few studies have focused on the β-lactamase
gene content of the P. fluorescens complex. While P. fluorescens possesses a chromosomally located
and inducible Ambler class C β-lactamase gene (17), the acquired but chromosomally located blaBIC-1
gene encoding an Ambler class A carbapenemase was previously identified as a source of
carbapenem resistance in P. fluorescens isolates recovered from the Seine River, Paris (18).
Here, we analyzed a carbapenem-resistant Pseudomonas sp. isolate that had been recovered during
a survey aimed at studying the spread of multidrug-resistant Gram-negative organisms among food
varieties and food-producing animals in Switzerland in 2018. Isolate MCP-106 was recovered from
chicken meat after an 18-h preenrichment in LB broth and subsequent selection on ChromID
CarbaSmart (bioMérieux, La Balme-les-Grottes, France). Carbapenemase production was tested
using the Rapid Carba NP test (19). Matrix-assisted laser desorption ionization–time of flight (MALDI-
TOF) analysis assigned the strain to the Pseudomonas synxantha species, and that assignment was
further confirmed by analysis of the rpoB and rpoD gene sequences (Fig. 1). P. synxantha, which
belongs to the P. fluorescens complex (20), is an environmental species that reduces and
accumulates the heavy metal chromium (21, 22) and that is pathogenic to nematode eggs; it may
therefore be used as a nematicidal agent (23). Susceptibility testing performed for β-lactams by disk
diffusion showed that P. synxantha strain MCP-106 was resistant to amino- and carboxypenicillins,
broad-spectrum cephalosporins, aztreonam, and carbapenems. Whole-genome sequencing was
performed using an Illumina MiSeq platform (2 × 150-bp paired ends) to assess the genetic
determinants of carbapenem resistance. The obtained reads were trimmed using Trimmomatic 0.36,
assembled with SPAdes version 3.11.1 (24), and annotated with Prokka version 1.12. TBLASTN
analysis of the DNA contigs using VIM as a reference revealed a chromosomally located MBL protein
that was named PFM-1 (Pseudomonas fluorescens metallo-β-lactamase). PFM-1 (encoded by the
blaPFM-1 gene) consisted of a β-lactamase with 253 amino acids and a relative molecular mass of 28.5
kDa.
FIG 1. Dendrogram constructed using the seven genes from the multilocus sequence typing (MLST) analysis in comparison with representative genes from other Pseudomonas species, in particular the most closely related ones, Pseudomonas fluorescens and Pseudomonas synxantha. The alignment used for the tree calculation was performed with the Clustal Omega program.
A BLASTN analysis against the NCBI database revealed the presence of a blaPFM-like gene (named
blaPFM-2, with PFM-2 sharing 92% amino acid identity with PFM-1) in Pseudomonas libanensis strain
CIP105460 (GenBank accession no. GCA_001439685.1) (25), which actually belongs to the
Pseudomonas fluorescens complex. In addition, genes encoding PFM-like products were
identified in the genomes of a single P. fluorescens strain (WP_050516231.1) and two Pseudomonas
brenneri strains, sharing 90% amino acid identity with PFM-1 (WP_128593843.1 and OAE14554.1).
Furthermore, a gene encoding a more distantly related enzyme (75% amino acid identity) was found
in the genome of a Pseudomonas chlororaphis strain (WP_038635452.1). However, no other blaPFM-
like gene was identified in any other P. fluorescens genomes (or in any genomes of species belonging
to the same complex), despite numerous genomes of strains belonging to the P. fluorescens complex
(n = 145) having been fully sequenced. We then screened 10 P. fluorescens strains from our
laboratory collection, all of which had been recovered from human, animal, or environmental
samples. A PCR-based approach using primer pair PFM-1-Fw (5’-GTTACGCCTGATGGACTTTG-3’) and
PFM-1-Rv (5’-CTTAGAAGCATGTCAGTGCG-3’) for blaPFM-1 and primer pair PFM-2-Fw (5’-
CTGATCAGAAAATGTGGGGC-3’) and PFM-2-Rw (5’-GACACGCCGTGTTTCTATATC-3’) for blaPFM-2 was
employed. A single strain gave a positive result, and Sanger sequencing identified a blaPFM-like gene
(blaPFM-3) encoding a protein sharing 91% amino acid identity with PFM-1. The blaPFM-3 gene was
identified from P. fluorescens PF1, an isolate recovered from a water sample from the Seine River in
Paris, France, and also producing the Ambler class A carbapenemase BIC-1 (18). PFM-2 and PFM-3
differed by five amino acids.
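As a quick sanity check on the primers listed above, the following sketch computes the length, the reverse complement, and a rough melting temperature for each oligonucleotide. It assumes the Wallace rule (2 °C per A/T, 4 °C per G/C), which is only a crude estimate for short primers and is not stated in the original protocol; the function names are illustrative.

```python
# Primer sequences as given in the text (5' to 3').
PRIMERS = {
    "PFM-1-Fw": "GTTACGCCTGATGGACTTTG",
    "PFM-1-Rv": "CTTAGAAGCATGTCAGTGCG",
    "PFM-2-Fw": "CTGATCAGAAAATGTGGGGC",
    "PFM-2-Rw": "GACACGCCGTGTTTCTATATC",
}

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an unambiguous DNA sequence."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def wallace_tm(seq: str) -> int:
    """Rough melting temperature in degrees C (Wallace rule, an assumption here)."""
    seq = seq.upper()
    at = seq.count("A") + seq.count("T")
    gc = seq.count("G") + seq.count("C")
    return 2 * at + 4 * gc


if __name__ == "__main__":
    for name, seq in PRIMERS.items():
        print(name, len(seq), reverse_complement(seq), wallace_tm(seq))
```

Primers within a pair would normally be expected to have similar estimated melting temperatures; any large gap reported by such a check would be a reason to re-examine the annealing conditions rather than a definitive finding.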
Pairwise alignment of the PFM-like amino acid sequences with those of other MBLs
revealed that these newly identified enzymes were most closely related to the subclass B2 MBL
enzymes. PFM-1 shares 71% amino acid identity with Sfh-1, originally identified in Serratia fonticola
strain UTAD54 (9), and 53% identity with CphA-1 from Aeromonas hydrophila (26). It shares very low
identity with subclass B1 MBLs such as NDM-1 (17%) and VIM-1 and IMP-1 (22%) (Fig. 2). Protein
alignments of the β-lactamase PFM-1 with representative subclass B2 MBLs revealed the presence of
conserved amino acid residues known to be involved in binding to zinc of class B β-lactamase (BBL)
(27) (Fig. 3). The motif Asn-Tyr-His-Thr-Asp (positions 116 to 120 [BBL nomenclature]), being a
distinctive feature of subclass B2 MBLs and presumably involved in the coordination of the two zinc
ions found in the active site of these enzymes, was identified in PFM-like enzymes. Amino acids
Asp-120, Cys-221, and His-263, presumably involved in the binding of the second zinc ion in subclass
B2 MBLs, were also conserved in the PFM-like proteins.
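The identity percentages and motif check described above can be illustrated with a minimal sketch. It assumes two sequences that have already been aligned (gaps written as "-") and counts identity over ungapped columns only; other conventions for the denominator exist, and the one used by the authors is not stated. The sequences are toy examples, not the real PFM or Sfh-1 sequences.

```python
# Asn-Tyr-His-Thr-Asp written in one-letter amino acid code.
B2_MOTIF = "NYHTD"


def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent identity over columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = matches = 0
    for res_a, res_b in zip(aligned_a.upper(), aligned_b.upper()):
        if res_a == gap or res_b == gap:
            continue  # skip gapped columns (one of several possible conventions)
        compared += 1
        if res_a == res_b:
            matches += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * matches / compared


def has_b2_motif(protein: str) -> bool:
    """True if the ungapped sequence contains the NYHTD motif."""
    return B2_MOTIF in protein.upper().replace("-", "")


if __name__ == "__main__":
    # Hypothetical aligned fragments, for illustration only.
    toy_a = "MKNYHTDSLG-A"
    toy_b = "MRNYHTDSIGQA"
    print(round(percent_identity(toy_a, toy_b), 1), has_b2_motif(toy_a))
```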
FIG 2. Dendrogram of PFM-1, PFM-2, and PFM-3 in comparison with representative class B β-lactamases subjected to neighbor-joining analysis. The alignment used for the tree calculation was performed with the Clustal Omega program. Numbers in parentheses indicate percentages of amino acid identity with PFM-1. The β-lactamases used for the comparisons (GenBank accession numbers) were Sfh-1 (NZ_AUZV01000091.1), CphA-1 (X57102), ImiS (Y10415), ImiH (AJ548797), VIM-1 (AJ278514), IMP-1 (EF027105), NDM-1 (KJ018857), POM-1 (EU315252), and PAM-1 (AB858498).
FIG 3. Alignment of the amino acid sequences of subclass B2 MBLs. Residues conserved in the enzymes are indicated by asterisks; colons indicate conservation between groups with strongly similar properties; dots indicate conservation between groups with weakly similar properties. The BBL numbering scheme (in bold) is used for residues conserved in MBLs.
In order to gain insight into the β-lactam resistance phenotype conferred by the corresponding
proteins, the blaPFM-1, blaPFM-2, and blaPFM-3 genes of P. synxantha strain MCP-106, P. libanensis strain
CIP105460, and P. fluorescens PF1 were cloned into plasmid pTOPO (Invitrogen, Illkirch, France) and
expressed in Escherichia coli. Cloning experiments were performed using the pCR-blunt TOPO cloning
kit (Invitrogen, Illkirch, France) after amplification of the genes with primers PFM-1-Fw and PFM-1-Rv
for blaPFM-1 and with primers PFM-2-Fw and PFM-2-Rw for blaPFM-2 and blaPFM-3. The resulting
recombinant plasmids were transformed into chemically competent E. coli TOP10 cells. Once
expressed in E. coli TOP10, the different PFM variants conferred similar resistance phenotypes:
susceptibility to carbapenems was reduced (Table 1), but paradoxically there was no effect on
the other β-lactams tested, such as amoxicillin, ticarcillin, cefoxitin, cefotaxime, and ceftazidime (data
not shown). MICs of carbapenems were determined by Etest and showed values for the PFM-3-
producing recombinant strain that were higher than those obtained with the PFM-1-producing and
PFM-2-producing recombinant strains, particularly for imipenem (Table 1).
Table 1. MICs of carbapenems for E. coli TOP10 recipient strain with and without the blaPFM genes and for Pseudomonas isolates.
aCIP105460 was originally described by Dabboussi et al. (25).
bPF1 was originally described by Girlich et al. (18).
cClavulanic acid was used at a concentration of 2 µg/ml.
dTazobactam was used at a concentration of 4 µg/ml.
Purification of the PFM-1 enzyme was performed using a four-liter LB broth culture of E. coli TOP10
(pTOPO-blaPFM-1) recombinant strain supplemented with kanamycin (50 µg/ml) and incubated for 24
h at 37°C under shaking conditions. The bacterial culture was centrifuged, and the pellet was
resuspended in Tris-HCl buffer (50 mM Tris-HCl, 100 µM ZnCl2, pH 8.5) and sonicated using a Vibra-
Cell 75186 sonicator (Thermo Fisher Scientific). After filtration using a 0.22-µm pore size
nitrocellulose filter, the crude extract was loaded in a Q-Sepharose column connected to an
ÄKTAprime chromatography system (GE Healthcare, Glattbrugg, Switzerland) and eluted with a linear
NaCl gradient. The presence of the β-lactamase was monitored using the Rapid Carba NP test (19),
and the fractions showing the highest β-lactamase activity were pooled and dialyzed against 100 mM
phosphate buffer (pH 7.0), prior to 10-fold concentration performed with a Vivaspin 20 concentrator
(GE Healthcare). The purified β-lactamase extract was immediately used for enzymatic
determinations.
The protein concentrations were measured using Bradford reagent (Sigma-Aldrich, Buchs,
Switzerland), and the purity of the enzyme was estimated by SDS-PAGE analysis (GenScript, NJ, USA).
The purity of PFM-1 was estimated to be >95%, with a single dominant band visible on the
SDS-polyacrylamide gel. Kinetic measurements were performed at room temperature in
phosphate-buffered saline (PBS) (0.1 M, pH 7) supplemented with ZnSO4 (5 µM), using a UV/visible
Ultrospec 2100 Pro spectrophotometer (Amersham Biosciences, Buckinghamshire, United Kingdom).
This kinetic analysis confirmed that PFM-1 hydrolyzed carbapenems; however, the catalytic efficiency
was slightly lower than that seen with the previously described subclass B2 MBLs (Table 2). In
contrast, hydrolysis of other β-lactam substrates such as benzylpenicillin or cefotaxime was not
detected (kcat value < 0.01 s⁻¹). This study therefore characterized a novel family of subclass B2
MBLs with substantial carbapenemase activity. Compared to other subclass B2 MBLs, PFM-1
hydrolysis is limited to carbapenems, and the catalytic efficiency is lower.
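For reference, the kinetic parameters discussed above are conventionally defined through the Michaelis-Menten rate law. The text does not restate the model, so the form below is the standard one and is assumed here; $[E]_0$ denotes total enzyme concentration and $[S]$ the substrate concentration.

```latex
v_0 = \frac{k_{\mathrm{cat}}\,[E]_0\,[S]}{K_m + [S]},
\qquad
\text{catalytic efficiency} = \frac{k_{\mathrm{cat}}}{K_m},
\qquad
v_0 \approx \frac{k_{\mathrm{cat}}}{K_m}\,[E]_0\,[S] \quad \text{when } [S] \ll K_m
```

Under this definition, a lower catalytic efficiency can result from a lower $k_{\mathrm{cat}}$, a higher $K_m$, or both.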
Table 2. Kinetic parameters of purified β-lactamase PFM-1 and comparison with other B2 MBLs. Kinetic data are displayed for Sfh-1, CphA, and AsbM1 as reported previously by Fonseca et al. (30), Vanhove et al. (31), and Yang and Bush (8), respectively. ImiS kinetic values are presented for imipenem and meropenem as reported by Sharma et al. (32) and Crawford et al. (6), respectively. NR, not reported.
The G+C contents of blaPFM-1 (50%) and blaPFM-2/-3 (52%) differed from the G+C content
expected for Pseudomonas genes (ca. 60%); in addition, the fact that no other blaPFM-like genes
were identified in several fully sequenced genomes of P. fluorescens strains available in the GenBank
databases further suggests a non-Pseudomonas origin. However, no obvious genetic element that
could have been involved in the acquisition of that gene was observed in its nearby genetic environment.
Similarly, no mobile genetic elements were identified in their upstream vicinity by analyzing
the genes showing significant identities with blaPFM-1 in the GenBank database. It may be speculated
that those genes have been acquired by transformation since P. fluorescens strains, as with many
other Gram-negative nonfermenters, are spontaneously transformable at high frequency (28).
However, a discrepancy was always noticed between all of the putative MBL-encoding genes
(including blaPFM-1) and the surrounding chromosomal sequences in terms of GC content (ca. 50%
versus ca. 60%), suggesting a foreign origin (data not shown).
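The G+C comparison between a gene and its chromosomal surroundings can be sketched as a sliding-window calculation. This is an illustrative sketch only: the window size, step, and toy sequence are assumptions and do not come from the MCP-106 genome or from the authors' analysis.

```python
def gc_percent(seq: str) -> float:
    """Return the G+C content of a DNA sequence as a percentage."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    gc = sum(1 for base in seq if base in "GC")
    return 100.0 * gc / len(seq)


def windowed_gc(seq: str, window: int, step: int) -> list[tuple[int, float]]:
    """G+C percentage for each full window, paired with the window start."""
    return [
        (start, gc_percent(seq[start:start + window]))
        for start in range(0, len(seq) - window + 1, step)
    ]


if __name__ == "__main__":
    # Toy sequence (hypothetical): two GC-rich flanks around a region of
    # lower G+C content, mimicking a gene of foreign origin.
    flank = "GCGCCGGCAT" * 4
    insert = "GATCATGCAT" * 4
    for start, gc in windowed_gc(flank + insert + flank, window=40, step=40):
        print(start, round(gc, 1))
```

A window whose G+C content departs markedly from that of its neighbours is consistent with, but does not prove, horizontal acquisition.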
This work underlines that P. fluorescens-like species may possess class B β-lactamase genes that are,
however, not systematically present in their genomes. Although strains belonging to the P.
fluorescens complex are rarely involved in human infections, they are widely disseminated in the
environment and parts of the human microbiota and can also be found in chicken meat (16). Those
bacterial species may therefore constitute reservoirs of antimicrobial resistance genes (29).
6.3 Data availability
The sequences of PFM-1, PFM-2, and PFM-3 have been deposited in the NCBI database under GenBank
accession numbers MN065826 (PFM-1), MN080496 (PFM-2), and MN080497 (PFM-3). The sequence
of the whole genome of P. synxantha strain MCP-106 has been deposited under GenBank accession
number VSRO00000000.1, BioProject accession no. PRJNA561277, and BioSample accession no.
SAMN12612925.
6.4 References
1. Jeon J, Lee JH, Lee JJ, Park KS, Karim AM, Lee CR, Jeong BC, Lee SH. 2015. Structural basis for
carbapenem-hydrolyzing mechanisms of carbapenemases conferring antibiotic resistance. Int J Mol