High-throughput DNA Sequencing in Microbial Ecology: Methods and Applications
Luisa Warchavchik Hugerth
KTH Royal Institute of Technology
School of Biotechnology
Stockholm 2016
Luisa Warchavchik Hugerth (2016): High-throughput DNA Sequencing in Microbial Ecology: methods and applications. Division of Gene Technology, School of Biotechnology, KTH Royal Institute of Technology, Stockholm, Sweden
Summary
Microorganisms play central roles in planet Earth’s geochemical cycles, in food production, and in health and disease of humans and livestock. In spite of this, most microbial life forms remain unknown and unnamed, their ecological importance and potential technological applications beyond the realm of speculation. This is due both to the magnitude of microbial diversity and to technological limitations. Of the many advances that have enabled microbiology to reach new depth and breadth in the past decade, one of the most important is affordable high-throughput DNA sequencing. This technology plays a central role in each paper in this thesis.
Papers I and II are focused on developing methods to survey microbial diversity based on marker gene amplification and sequencing. In Paper I we proposed a computational strategy to design primers with the highest coverage among a given set of sequences and applied it to drastically improve one of the most commonly used primer pairs for ecological surveys of prokaryotes. In Paper II this strategy was applied to a eukaryotic marker gene. Despite their importance in the food chain, eukaryotic microbes are much more seldom surveyed than bacteria. Paper II aimed at making this domain of life more amenable to high-throughput surveys.
In Paper III, the primers designed in Papers I and II were applied to water samples collected up to twice weekly from 2011 to 2013 at an offshore station in the Baltic proper, the Linnaeus Microbial Observatory. In addition to tracking microbial communities over these three years, we created predictive models for hundreds of microbial populations, based on their co-occurrence with other populations and environmental factors.
In Paper IV we explored the entire metagenomic diversity in the Linnaeus Microbial Observatory. We used computational tools developed in our group to construct draft genomes of abundant bacteria and archaea and described their phylogeny, seasonal dynamics and potential physiology. We were also able to establish that, rather than being a mixture of genomes from fresh and saline water, the Baltic Sea plankton community is composed of brackish specialists which diverged from other aquatic microorganisms thousands of years before the formation of the Baltic itself.
Keywords: Baltic Sea; Microbial ecology; Metagenomics; Bacterioplankton
Sammanfattning (summary in Swedish, translated)
Microorganisms play central roles in the geochemical cycles of sea and land, in food production, and in the health and disease of humans and animals. Despite this, most microorganisms remain unknown, and their ecological roles and potential technological uses can only be speculated upon. This is due both to the enormous microbial diversity and to technological limitations. Of the many advances that have allowed microbial ecology research to progress rapidly over the past ten years, perhaps the most important is large-scale DNA sequencing. This technology plays a central role in every paper in this thesis.
Papers I and II focus on method development for surveying microbial diversity through PCR amplification and sequencing of marker genes. In Paper I we present software that computes optimal primer sequences for amplifying as many target sequences as possible, and we use the method to considerably improve one of the most widely used PCR primer pairs for surveying the diversity of prokaryotes (bacteria and archaea). In Paper II we apply this strategy to a eukaryotic marker gene. Despite their important roles in the food chain, eukaryotic microbes are even less explored than bacteria. Paper II aims to make the eukaryotic domain of the tree of life more accessible to large-scale surveys.
In Paper III, the primers developed in Papers I and II are used on water samples taken up to twice a week from 2011 to 2013 at a sampling station in the central Baltic Sea, the Linnaeus Microbial Observatory. In addition to following the seasonal dynamics of microbial populations over these three years, we created predictive models for hundreds of microbes, based on their co-occurrence with other microbes and environmental parameters.
In Paper IV we explored the entire metagenomic diversity at the Linnaeus Microbial Observatory station. We used software developed in our group to reconstruct, from metagenomic data, the genomes of abundant bacteria and archaea, and described their phylogeny, seasonal dynamics and physiological potential. We could also establish that, rather than being a mixture of freshwater and marine microorganisms, the plankton of the Baltic Sea are brackish-water specialists that diverged from other aquatic organisms long before the Baltic Sea was formed.
“Life on earth is such a good story you
cannot afford to miss the beginning...
Beneath our superficial differences we are
all of us walking communities of bacteria.
The world shimmers, a pointillist landscape
made of tiny living beings.”
- Lynn Margulis
List of publications
This thesis is based upon the following papers, which are referred to in the text by their corresponding Roman numerals. All papers are included at the end of the thesis.
Paper I: Hugerth LW, Wefer HA, Lundin S, Jakobsson HE, Lindberg M, Rodin S, Engstrand L, Andersson AF (2014) DegePrime, a program for degenerate primer design for broad-taxonomic-range PCR in microbial ecology studies. Appl Environ Microbiol 80 (16): 5116-23.
Paper II: Hugerth LW, Muller EE, Hu YO, Lebrun LA, Roume H, Lundin D, Wilmes P, Andersson AF (2014) Systematic design of 18S rRNA gene primers for determining eukaryotic diversity in microbial consortia. PloS One 9 (4): e95567.
Paper III: Hugerth LW, Lindh MV, Sjöqvist C, Bunse C, Legrand C, Pinhassi J, Andersson AF. Seasonal dynamics and interactions among Baltic Sea prokaryotic and eukaryotic plankton assemblages. Manuscript
Paper IV: Hugerth LW, Larsson J, Alneberg J, Lindh MV, Legrand C, Pinhassi J (2015) Metagenome-assembled genomes uncover a global brackish microbiome. Genome Biol 16:279.
Table of Contents
i. The importance of microbes on global and regional scales
ii. An overview of techniques for microbial community characterization
iii. High-throughput DNA sequencing in the determination of microbial community composition
iv. Data processing and statistical analyses of microbiomics
v. Network construction and community dynamics
vi. Metagenomic surveys
vii. Genome reconstruction from metagenomic data
viii. Summary and perspectives
ix. References
x. Acknowledgements
i. The importance of microbes on global and regional scales
Planet Earth is an intrinsically unstable system. Continents rise
and fall, oceans open up and are squeezed out again, new land is
colonized by life only to be ravaged by fire, and innumerable molecules
are constantly breaking down and reassembling everywhere, from deep
in the ocean to the upper reaches of the atmosphere. This instability is
crucial for the existence of life and is, to a large extent, mediated by life.
It is the smallest of life forms that carry out the crucial role of
converting gaseous nitrogen into biologically available forms, as well as
carrying out about half of the world’s photosynthesis [1]. Single-cell
organisms also outnumber larger life forms by several orders of
magnitude, in terms of total cell count, and above all in the combined
number of biochemical pathways available to them [2]. While
microorganisms in soils are crucial for sustaining land plants and
agricultural activity, in global terms it is the world’s oceans that host the
bulk of geochemical transformations. About half of the global flux of
vital elements such as carbon, nitrogen, phosphorus, iron and sulfur is
mediated through marine microbes [3, 4]. Understanding the life cycle
and metabolism of marine microorganisms is therefore a crucial step in
understanding, modelling and managing global planetary cycles as a
whole.
The life of microbes in an aquatic system takes place along an
ever-mixing water column. Along this water column, levels of light,
oxygen and nutrients vary with depth, forming various niches where
different metabolic strategies have greatest success: photosynthesis and
other light-based strategies can only be successful in depths reached by
light, aerobic metabolism requires dissolved oxygen, sulfate reduction
cannot happen in the presence of oxygen and reactive nitrogen oxides
etc. In addition to this macro-scale diversity, on the size scale of
microbial life every spoonful of marine water is in itself highly patchy
[5]. Particles produced as macroorganisms feed, shed cells, defecate, lay
eggs and die function as transient oases for heterotrophic bacteria in an
otherwise very poor environment. Microorganisms themselves interact
with each other. These interactions can be direct, such as the predation
of bacteria by protists, or bacterial biofilm formation. They can also
happen indirectly, such as through competition for the same resources
or through leaching of nutrients that can be taken up by other
microorganisms [6].
Besides their role in global geochemical cycles, marine bacteria
also play important economic and social roles at the local and regional
scale. By fixing carbon, cyanobacteria and phytoplankton are the bases
of marine food webs, thereby sustaining fish stocks. On the other hand,
blooms of neurotoxic cyanobacteria negatively impact fish stocks, as
well as water quality and tourism in affected areas. Conversely, human
activities affect the chemistry of the water column, such as through high
phosphorus and nitrogen runoff from agricultural land or wastewater
treatment facilities promoting fast bacterial growth and oxygen
depletion, a process known as eutrophication [7, 8]. Further, selective
fishing can propagate down the foodweb all the way to zooplankton and
possibly bacterioplankton [9]. On the other hand, a well-managed
marine environment can support various established or nascent
industries, from tourism to fisheries to algal farming.
The Baltic Sea is an environment where the interplay between
human activity, water chemistry and bacterial life is particularly
prominent. Due to its narrow connection with the open ocean, water
retention time in the Baltic Sea is around 50 years [10]. Agricultural and
industrial activities in the surrounding countries have greatly affected
the nitrogen and phosphorus loadings in the area, although phosphorus
is now much more carefully managed [11]. These nutrient influxes have
led to a large increase in area and duration of total anoxia in the bottom
of the Baltic Sea [12], thereby affecting geochemistry both in the water
column and in the sediment below. Increased global temperatures have
led to a shorter ice-cover period over the water, as well as to an increase
in the influx of high nutrient-load freshwater each spring [13]. Some of
these impacts have quick and visible effects, such as anoxic zones,
larger harmful algal blooms and the collapse of fish stocks. Other effects
of a changing water chemistry might take much more time to be
noticeable at the human scale, and the link between cause and effect
might not be readily appreciated. Nevertheless, the fact
remains that human activities are having deep impacts on a delicately
balanced system whose pieces and links are still largely unknown.
ii. An overview of techniques for microbial community
characterization
While humans have been selectively breeding bacteria and fungi
for food fermentation for several centuries, the first observations of
microbial organisms were made in the 1670s by Antony van
Leeuwenhoek, who first observed microbes (“animalcules”) in saliva, and
the first purposeful and successful isolation of bacteria for scientific
purposes was attained by Robert Koch and Julius Petri in the 1870s.
Both direct observation and culturing remain invaluable techniques to
this day, although both have limitations.
Culturing is the gold standard of microbial characterization, as it
provides large amounts of cells from a clonal population, and allows any
number of functional tests on bacterial biochemistry, physiology and
genetics to be performed. It was however evident even to Koch that
different bacteria grow best in different settings, and by the early
1900’s it was accepted that the vast majority of bacteria could not be
cultivated with standard techniques, a phenomenon later dubbed “the
great plate count anomaly” [14]. Therefore, most of what is known today
about bacterial physiology stems from a very small subset of easily
culturable bacteria of medical or veterinary importance which grow well
in the presence of high nutrient loads [15].
Reasons for recalcitrance to culturing are many. Firstly, in the
absence of knowledge of the specific growth requirements of an
organism, trial-and-error is not a feasible way to determine them:
“When it is not clear what facet of the environment is not being properly
replicated (nutrients, pH, osmotic conditions, temperature, or many
more), attempting to vary all of these conditions at once results in a
multidimensional matrix of possibilities that cannot be exhaustively
addressed with reasonable time and effort.” - Stewart, 2012 [16]
In particular, many organisms have rather narrow windows of
growth, becoming dormant when any of a number of micro- or
macronutrients is lacking or in excess, or when physical conditions
deviate by a few percentage points from their optima [15]. These
microbes might survive in the environment in boom-and-bust cycles,
growing rapidly when conditions are optimal but remaining dormant for
extended periods of time [17, 18]. Conversely, other microbes sustain
long-term continuous growth at a pace so slow as to be nearly
indistinguishable, in a lab setting, from a failure to thrive [15, 19]. In
addition to not providing the required growth substrates at appropriate
rates, laboratory settings might easily generate toxic conditions, such as
oxidative stress [20, 21], which in the environment are either absent or
mitigated by other strains. Finally, organisms might fail to grow due to
missing certain pathways, which can be mitigated by adding
intermediates to the medium [22], or be dependent on scavenging
molecules such as siderophores produced by other members of their
community [23].
Even today, despite the development of high-throughput dilution-
to-extinction culturing techniques [24–27], culture chambers that mimic
natural environments [19, 28, 29] and co-culturing approaches [30–32],
isolating and culturing bacteria is a complex and time-consuming
endeavour.
An alternative to culturing is to perform microscopy directly on
environmental samples. High-resolution microscopy techniques such as
electron microscopy, confocal microscopy and photoswitchable
fluorophores allow a number of specific biological questions to be
addressed directly from images of live or fixed bacteria (reviewed in
[33]). However, regardless of technology, with observation alone it can
be extremely hard to achieve a reasonable functional or taxonomic
resolution for the diversity of microbes typically found in an
environmental sample. It takes years of training as a taxonomist to excel
in the visual identification of microbes, even ones with as much
morphological diversity as protists; and even then there are strong
observer effects (reviewed in [34, 35]).
To move beyond the difficulties of culturing and the limitations of
microscopy, microbial ecologists moved increasingly towards molecular
fingerprinting. Starting in 1977, Carl Woese and colleagues established
the suitability of the small subunit (SSU) ribosomal RNA (rRNA) gene
for inferring phylogenetic relationships between prokaryotic
organisms, a property later verified to also apply to eukaryotes [36–38].
Norman Pace and colleagues soon started applying the same technique
to natural communities [39, 40]. Together with the ribosomal internal
transcribed spacer (ITS), this is still the most commonly used gene for
community phylogenetic composition analysis (community
fingerprinting). The advantages of using SSU rRNA for community
fingerprinting are many: i. This gene is universal in all cellular life
forms. ii. It is a highly conserved gene, serving to a large degree as a
reliable molecular chronometer. iii. It is seldom, if ever, transferred
horizontally. iv. It possesses both conserved and variable regions, so
that the conserved regions can be targeted by molecular approaches and
the variable ones be used as identifying markers. A handful of other
genes, such as the large subunit (LSU) rRNA, share these properties, but
the length of ~1,500 bp of the bacterial SSU rRNA made it amenable to
early molecular techniques, and the impressive body of knowledge that
has since accumulated with this gene as a basis makes a switch to other
markers very impractical, except in certain sub-fields such as mycology,
where ITS and LSU are still widely used.
The 1990s saw the first high-throughput environmental
fingerprinting approaches, also sometimes referred to as microbiomics.
It was the decade of techniques such as denaturing gradient gel
electrophoresis (DGGE, [41]), terminal restriction fragment length
polymorphism (T-RFLP, [42]) and automated ribosomal intergenic spacer
analysis (ARISA, [43]), all of which are based on the characteristic travel
distance of polymerase chain reaction (PCR) amplified DNA fragments
(amplicons) in an electrophoretic chamber. These banding patterns can
be used directly to compare broad changes in taxonomic composition of
samples in different conditions. Even though, in each of these
techniques, different organisms might give rise to identical bands, each
band is treated as an operational taxonomic unit (OTU). To assign a
tentative phylogeny to each OTU, high abundance tags can be
selected for sequencing, or clone libraries be generated directly from
the environment, which will most likely contain the most highly
abundant tags.
At around the same time, microarrays emerged as sequence-
based alternatives to fingerprinting. The downside of microarrays is that
identification is restricted to sequences previously known and printed
onto the array [44]. While this limits their application as general
environmental survey tools, microarrays can still be valuable in
focused clinical, industrial or environmental monitoring settings [45–
47]. Since microarrays can cover various regions of the genome, they
are also useful for distinguishing between closely related species or
strains [48–50].
iii. High-throughput DNA sequencing in
the determination of microbial community composition
Assigning taxonomy to every OTU in a broad environmental
survey requires sequencing each amplified DNA fragment. While for
most environments this is still not achievable with current technologies,
the first steps in this direction came in the early 2000s, with the rise of
high-throughput DNA sequencing. In 2006, the first study was published
using 454 pyrosequencing for assessing microbial communities, a
survey of the microbial diversity in a marine water community [51]. This
study, while sequencing relatively shallowly (6,505-22,994
sequences/sample), already presented two of the main characteristics of
sequencing-based microbiomics that came to be seen as standards in
the field: rarefaction curves very far from reaching saturation, which
indicated a much larger microbial diversity than previously suspected,
and a highly uneven community, with 3-4 orders of magnitude of
difference in abundance between the least and most abundant tags.
These previously unknown low abundance organisms were dubbed in
the paper the “rare biosphere”, a term still in use and whose biological
relevance is much discussed [52].
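The two observations above, unsaturated rarefaction curves and a highly uneven community, can be illustrated with a minimal sketch. The community below is invented for illustration and is not data from [51]; the function name `rarefaction_curve` and its parameters are likewise assumptions. The sketch subsamples reads without replacement at increasing depths and reports the mean number of distinct tags observed.

```python
import random


def rarefaction_curve(tag_counts, depths, n_reps=20, seed=0):
    """Mean number of distinct tags seen when subsampling reads without replacement."""
    rng = random.Random(seed)
    # Expand the count table into one entry per sequenced read.
    reads = [tag for tag, n in tag_counts.items() for _ in range(n)]
    curve = []
    for depth in depths:
        depth = min(depth, len(reads))  # cannot sample more reads than exist
        richness = [len(set(rng.sample(reads, depth))) for _ in range(n_reps)]
        curve.append((depth, sum(richness) / n_reps))
    return curve


# Hypothetical, highly uneven community: three dominant tags and a long
# tail of 400 tags seen only once (a stand-in for the "rare biosphere").
community = {"tag_a": 5000, "tag_b": 1000, "tag_c": 100}
community.update({f"rare_{i}": 1 for i in range(400)})

curve = rarefaction_curve(community, depths=[100, 1000, 3000, 6500])
for depth, richness in curve:
    print(depth, round(richness, 1))
```

In a community of this shape the observed richness keeps climbing up to the full sequencing depth, because most tags are singletons; a curve that flattens early would instead suggest that the sample has been sequenced close to saturation.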
As more and more research groups started using high-
throughput gene tag sequencing, first with 454 pyrosequencing and
later with Illumina technology, it also became increasingly clear that
these methods, while less biased than some of their predecessors, do
produce a considerable number of artefacts, which can be very hard to
detect and separate from true biological signal.
The first source of bias and artefacts in microbiomics is sampling
itself. Solid samples such as soil can have extreme short-distance
heterogeneity [53]. The amount of material used for extraction and the
definition of the sample (e.g., whether subsamples are homogenised in
bulk or kept separately) have to be suited to the research question at
hand. As
for aquatic samples, long term studies must contend with the issue of
the flowing and mixing of water masses. A stationary sampling, fixed to
geographical coordinates, faces the issue that changes observed in the
microbial community can be a true change within a community or a
replacement of one community by another as the water flows. As an
alternative to the stationary Eulerian sampling, it is possible to follow a
water mass using a buoy and collect samples around it, a strategy
termed Lagrangian sampling. This approach, however, cannot be
extended for longer than a few weeks, after which the water mass is
mixed beyond the point where it can be meaningfully considered
coherent with the initial sample. The temporal dimension is crucial
regardless of the sampling strategy, since the frequency of sampling
should be (but often isn’t) commensurate with the rate of the biological
process of interest.
The next source of bias and artefacts is the DNA extraction
method. Extraction relates to sampling, since different methods require
and tolerate different amounts of starting material. The physico-
chemical characteristics of the environment and of the biological
material in it will in turn interact with the extraction method, producing
a more or less efficient disruption of cell walls and membranes and
removal of contaminants. A failure to appropriately disrupt certain types
of cell wall will cause those organisms to be underestimated in the
community profile. A failure to remove contaminants such as other
biomolecules and organic acids will inhibit the DNA amplification step,
leading to amplification biases and eventually even sample loss [54–57].
Finally, for samples of low microbial density, such as patient blood
samples, minute amounts of DNA or cellular contamination in any
reagent or piece of equipment used in extraction will generate spurious
reads [58].
It should be noted that RNA can also be extracted and
analysed as complementary DNA (cDNA). DNA is a more stable
molecule than RNA, so community signatures are less likely to
experience radical change at DNA level during sample collection [56,
59]. On the other hand, different organisms, especially eukaryotes, can
have an enormous range of copies of the rRNA gene in their genomes,
which hinders a simple correlation between gene copies and cell
numbers [60]. The number of rRNA copies per cell, however, is largely
independent of the number of gene copies in prokaryotic cells and is
instead correlated with the level of activity in cells, which in turn is
often inversely correlated with cell abundance in a system [61–63].
These differences can result in very different community profiles for
cDNA and DNA analyses, especially in deep water layers, which at the
DNA level are more affected by sinking dead and dying cells [63, 64].
After nucleic acid extraction (and cDNA synthesis when studying
RNA), the region of interest must be amplified and prepared for
sequencing. This is almost always achieved through polymerase chain
reaction (PCR), a method which is sturdy and cost-effective, but may
introduce large biases to the sample [65]. PCR depends on primers,
short DNA molecules (usually 15-30 bp) of defined sequence that bind to
the ends of the DNA target region on the template strands and allow a
DNA polymerase to synthesise a new DNA strand complementary to the
template downstream of the primer’s 3’ end. By flanking the region of
interest with two primers, its copy number is doubled at every reaction
cycle, hence the term “chain reaction”. The amplified region is called
an amplicon, but is often referred to as a “tag”, in reference to its role in
identifying microbial taxa.
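As a rough illustration of why per-cycle doubling can turn small differences into large biases, the sketch below uses the common textbook approximation in which a template with per-cycle efficiency E grows by a factor of (1 + E) per cycle, so that E = 1 corresponds to perfect doubling. The efficiency values and the function name `amplicon_copies` are assumptions chosen for illustration, not measurements from this work.

```python
def amplicon_copies(start_copies, cycles, efficiency=1.0):
    """Copies after `cycles` PCR cycles; efficiency 1.0 means perfect doubling."""
    return start_copies * (1.0 + efficiency) ** cycles


# Two templates at equal starting abundance. The second is assumed to
# amplify slightly less efficiently, e.g. because of a primer mismatch.
perfect = amplicon_copies(1000, cycles=30, efficiency=1.0)
mismatched = amplicon_copies(1000, cycles=30, efficiency=0.9)

# Fraction of the final amplicon pool that comes from the mismatched template.
share = mismatched / (perfect + mismatched)
print(f"fold amplification at perfect doubling: {perfect / 1000:.3g}")
print(f"share of the mismatched template in the pool: {share:.3f}")
```

Under these assumptions, two templates that started at a 50:50 ratio end up far from parity after 30 cycles, which is one reason why amplicon abundances are usually read as relative rather than absolute measures.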
The dependence on primer binding means that a DNA template
which does not present complementarity to the primers will not be
amplified and its corresponding organism will be a false negative in the
microbiome profile. The odds of failing to amplify a tag corresponding to
a given taxon decrease if a mixture of primers with base-level variations
is used (degenerate primers). On the other hand, this increases the odds
of amplifying other DNA regions, creating artificial diversity. Therefore,
the exact sequence of the PCR primer must be considered in terms of
the community at hand and the acceptability of different biases in the
resulting amplicon pool. This was the issue addressed in PAPER I and
PAPER II.
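The trade-off described above can be made concrete with a minimal sketch of how a degenerate primer written in IUPAC ambiguity codes expands into a mixture of plain oligonucleotides. The primer and binding sites are invented examples, and the helper names `degeneracy`, `expand` and `matches` are hypothetical.

```python
from itertools import product

# Standard IUPAC nucleotide ambiguity codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def degeneracy(primer):
    """Number of distinct oligonucleotides in the degenerate primer mixture."""
    n = 1
    for code in primer:
        n *= len(IUPAC[code])
    return n


def expand(primer):
    """List every non-degenerate oligonucleotide encoded by the primer."""
    return ["".join(bases) for bases in product(*(IUPAC[c] for c in primer))]


def matches(primer, site):
    """True if the binding site is one of the oligonucleotides in the mixture."""
    return len(primer) == len(site) and all(
        base in IUPAC[code] for code, base in zip(primer, site)
    )


primer = "ACGYTM"  # hypothetical 6-mer with two ambiguous positions
print(degeneracy(primer))
print(expand(primer))
print(matches(primer, "ACGTTC"), matches(primer, "ACGGTC"))
```

Each additional ambiguous position multiplies the number of oligonucleotides in the mixture, which widens coverage but dilutes every individual oligonucleotide and raises the chance of off-target binding.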
In PAPER I we describe a heuristic, which we named DegePrime (Fig. 1),
that attempts to find the primer sequences that match the largest
number of sequences in a database, given a multiple sequence alignment,
a length l and a maximum degree of degeneracy d.
We used this algorithm to improve on a commonly used microbiomics
primer pair to extend it from domain bacteria to domain archaea,
proposed an alternative primer site of interest for short read
technologies and showed that the community profiles produced using
the resulting primer pairs are comparable to a high degree to profiles
obtained through PCR-free approaches. In the same paper, we
presented an experimental procedure for producing amplicons ready for
multiplex sequencing in Illumina sequencers using a nested PCR
approach (Fig. 2).
Figure 1: an overview of the DegePrime heuristic
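The general idea of such a search can be sketched as follows. This is a simplified greedy illustration written for this text under stated assumptions, not the published DegePrime implementation: for each gap-free window of length l in a toy alignment, it starts from the most common oligomer and keeps merging in further oligomers for as long as coverage improves and the degeneracy stays at or below d. All names (`greedy_primer`, `best_window`, the toy alignment) are hypothetical.

```python
from collections import Counter


def degeneracy(position_sets):
    """Product of the number of bases allowed at each primer position."""
    n = 1
    for bases in position_sets:
        n *= len(bases)
    return n


def covered(position_sets, oligo):
    """True if every base of the oligo is allowed at its position."""
    return all(base in allowed for base, allowed in zip(oligo, position_sets))


def coverage(position_sets, counts):
    """Number of sequences whose window oligo is matched by the primer."""
    return sum(n for oligo, n in counts.items() if covered(position_sets, oligo))


def greedy_primer(oligos, max_degeneracy):
    """Greedily merge window oligos into one degenerate primer (list of base sets)."""
    counts = Counter(oligos)
    # Start from the most frequent oligo in the window.
    primer = [{base} for base in counts.most_common(1)[0][0]]
    while True:
        best, best_cov = None, coverage(primer, counts)
        for oligo in counts:
            if covered(primer, oligo):
                continue
            merged = [allowed | {base} for allowed, base in zip(primer, oligo)]
            if degeneracy(merged) > max_degeneracy:
                continue
            cov = coverage(merged, counts)
            if cov > best_cov:
                best, best_cov = merged, cov
        if best is None:  # no admissible merge improves coverage
            return primer, best_cov
        primer = best


def best_window(alignment, length, max_degeneracy):
    """Return (coverage, start, primer) for the best gap-free window."""
    results = []
    for start in range(len(alignment[0]) - length + 1):
        oligos = [seq[start:start + length] for seq in alignment]
        if any("-" in oligo for oligo in oligos):
            continue  # simplification: skip windows containing gaps
        primer, cov = greedy_primer(oligos, max_degeneracy)
        results.append((cov, start, primer))
    # Highest coverage wins; ties go to the leftmost window.
    return max(results, key=lambda r: (r[0], -r[1]))


# Toy alignment of five sequences (invented for illustration).
alignment = ["ACGTACGA", "ACGTACGA", "ACGTATGA", "ACCTACGT", "TCGTACGC"]
cov, start, primer = best_window(alignment, length=4, max_degeneracy=2)
print(cov, start, primer)
```

With a maximum degeneracy of 2, only one ambiguous position with two bases is allowed, so the sketch favours windows where a single variable position accounts for all of the diversity among the aligned sequences.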
In PAPER II the technologies developed in PAPER I were used to
develop a primer pair for microbiomics of eukaryotes. Microbial
eukaryotes have fundamental roles in both aquatic and terrestrial
ecosystems. In marine systems, eukaryotic phytoplankton bloom in early
spring, when a combination of increased sunlight and waters rich in
nutrients from winter mixing gives them optimal growth conditions.
These phytoplankton cells, both during their lives and after death by
starvation or lysis, exude a wide variety of organic compounds that feed
bacterial heterotrophs, replenishing a community starved after the
winter months [66]. Bacteria are in turn predated by zooplankton,
reincorporating the carbon and nutrients into the broader food web, a
process known as the microbial loop. In addition to this, carbon fixed by
eukaryotic phytoplankton can enter the trophic cascade directly through
predation by zooplankton or be entirely removed from circulation as
recalcitrant organic matter sinks to the bottom of the ocean, a process
known as microbial carbon sink. Fungi in soil have long been recognised
for their important role in symbiosis with plants and as decomposers of
recalcitrant organic matter. Other small eukaryotes in soil, however,
have been largely neglected and, in general, high-throughput
microbiology often neglects the eukaryotic component of communities
[67], thereby failing to account for crucial parts of the trophic cascade
and element cycling. The goal of PAPER II was to make the eukaryotic
fraction as easily accessible as the prokaryotic.
Even with the best possible primer pairs and PCR procedures,
sequencing itself introduces errors to the data. Three main kinds of
errors can arise from sequencing: substitutions (a base is read in place
of another), insertions (a base is read more times than it was actually
present) and deletions (a base is skipped). Each sequencing platform
has its characteristic error profile and associated suite of tools for
handling it. Error characteristics of the Illumina platform, used for
every paper in this work, have been thoroughly investigated elsewhere
[65].
Figure 2: a strategy for generating Illumina-ready amplicons based on 2 PCR reactions
Specifically in microbiomics, after the initial data filtering a
standard procedure is to “pick OTUs”. OTU stands for “operational
taxonomic unit” and is a term used in ecological research when exact
taxonomic assignment is impossible or irrelevant. In the specific case of
sequencing, OTUs are defined by clustering sequences according to
similarity. This step is meant to eliminate erroneous sequences, which
should deviate from a true sequence by only a few bases, but also to
reduce sequence diversity into true biological diversity. Since small
variations in tag composition can be observed within a single species
and even within different operons of the same gene within a single cell,
it is assumed that tags differing by only a small percentage of their
bases represent functionally equivalent cells. This is not always true, as
in the well documented case of Escherichia spp. and Shigella spp.,
which despite having clearly distinct natural histories harbour the exact
same sequence along the full length of their 16S rRNA gene [68].
OTU picking procedures can in very broad terms be divided into
hierarchical clustering (based on single-linkage, average-linkage or
complete-linkage) and heuristic approaches, where the most important
of the latter is the Usearch/Uparse approach. In single-linkage, a
sequence is placed in a cluster if it has a similarity above a threshold to
at least one other sequence in the cluster. This procedure tends to form
very large clusters with a lot of heterogeneity and is rarely used, except
occasionally for very rapidly evolving genes such as the fungal ITS region
[69]. Complete linkage, on the other hand, requires that all sequences in
a cluster have similarity above threshold to all others. This method
therefore produces more OTUs than all others, and tends to
overestimate measures of community richness, especially for noisy data.
In average-linkage, finally, the average similarity between a sequence
and all others in the same cluster has to be above the threshold. Since it
is computationally very demanding to run an all-against-all comparison
on datasets of millions of reads, as done in hierarchical clustering,
Usearch approximates complete- or average-linkage approaches by only
comparing each sequence to a “centroid” sequence within each cluster
[70]. By selecting a distance cutoff between this centroid sequence and
the candidate sequences that is half of the maximum distance
acceptable between any two sequences within a cluster, a complete-linkage
clustering is approximated. The main disadvantage of this approach is
that the order in which sequences are handled affects the final result.
Therefore, sequences are generally sorted by decreasing abundance
before clustering, since abundant sequences are less likely to be
artifacts. From the description of these methods, it is clear that the
nominal similarity threshold of a cluster can imply a wide range of de
facto distance between sequences in a cluster, depending on the
approach used, a fact that is often glossed over when discussing
microbiomics.
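The greedy, centroid-based strategy described above can be sketched as follows. This is only a minimal illustration and not the Usearch implementation: it assumes pre-aligned reads of equal length, measures identity as the fraction of matching positions, and uses hypothetical function names (identity, greedy_centroid_clustering).

```python
def identity(a, b):
    """Fraction of matching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("this toy example assumes pre-aligned, equal-length reads")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_centroid_clustering(abundances, threshold=0.97):
    """Cluster sequences greedily around centroids.

    abundances: dict mapping sequence -> read count.
    Returns a dict mapping each centroid to the list of its member sequences.
    """
    clusters = {}
    # Abundant sequences are handled first, so they become the centroids.
    ordered = sorted(abundances, key=lambda s: (-abundances[s], s))
    for seq in ordered:
        best, best_id = None, 0.0
        for centroid in clusters:
            ident = identity(seq, centroid)
            if ident >= threshold and ident > best_id:
                best, best_id = centroid, ident
        if best is None:
            clusters[seq] = [seq]        # seq founds a new cluster
        else:
            clusters[best].append(seq)   # seq joins its closest centroid
    return clusters


# Toy data: two abundant sequences and one rare variant of each.
reads = {"AAAAAAAAAA": 50, "AAAAAAAAAT": 3, "CCCCCCCCCC": 20, "CCCCCCCCCG": 1}
clusters = greedy_centroid_clustering(reads, threshold=0.85)
```

Since each sequence is only compared to the centroids that exist when it is processed, a different abundance ranking can yield different centroids and therefore different clusters, which is the order dependence discussed above.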
The most popular software packages for OTU picking are
Usearch [70], Mothur [71] and Qiime [72], which runs Uclust in the
background [73]. However, these and other approaches suffer from OTU
instability, that is, the fact that the same sequence might be assigned to
different OTUs depending on the community context [74, 75]. In
general, clusters are selected at 97% similarity [76]. This has several
problems, starting with the fact that 97% similarity over the full length
of a ~1,200 bp gene doesn’t translate directly to 97% similarity over any
given region of the gene [77]. Further, since the methods used are often
heuristic, a 3% distance doesn’t mean exactly the same thing
throughout all packages. Finally, the 97% similarity cutoff is to a large
degree arbitrary, since different taxa might have much less of a distance
between their tags and still represent functionally distinct clades [76,
78]. For the less developed fields of eukaryotic microbiomics (and
increasingly for bacteria as well) higher degrees of similarity are often
used [79, 80]. This stems from an understanding of the different way
taxonomy is applied to eukaryotes as compared to prokaryotes, ie clades
with more morphological variety tend to be assigned a finer grained
classification [81]. The appropriateness of any method is ultimately
dependent on the research question being addressed. OTU-picking at
97% similarity has both been shown to recapitulate natural history well,
when assessed from a global perspective, and to harbour extreme
heterogeneity, when studied at a narrower scale [82, 83]. This issue is
far from being resolved, as the very concept of species is the subject of
much controversy between and within different branches of
microbiology [76].
In an attempt to advance the methodological aspect of OTU
picking, several approaches have been recently published which attempt
to produce biologically meaningful OTUs independently of a predefined
level of similarity. Each of them has a different approach to separating
the noise introduced by PCR and sequencing from true biological
variety. DADA2 (Divisive Amplicon Denoising Algorithm 2) uses the
quality scores of bases, produced by the sequencing platform, to
calculate a substitution error model for the sequencing run at hand. It
then uses this error model to “correct” reads, that is, assign low
frequency reads to higher frequency reads from which they could be
derived by substitution with high probability [84]. Cluster-free filtering
recognises the denoising step in DADA2 as the best available, but its
own error model doesn’t take base quality into account. Instead, it
discards reads with a high probability of error and sorts through the
remaining variation by considering the temporal dynamics of OTUs
[85]. The idea in this case is that while erroneous sequences will have
similar dynamics to their corresponding correct sequence, true
biological variants will react differently to different stimuli. Minimum-
entropy decomposition, in contrast, does not use information across
samples, but neither does it treat each position in a multiple alignment
as equally significant. Instead, it calculates the Shannon entropy for
each position in an alignment and uses positions that are peaks in
entropy to split sequences into smaller clusters. The procedure
continues until no cluster exists which has a significant entropy peak
and contains a given number of sequence reads. Empirically, this
approach has been shown to reveal community dynamics that would
have been obfuscated by 97% OTU clustering [86]. An interesting
computing-power saving approach, which however does not give
differential treatment to more informative positions, has been presented
as Swarm v2, in which the possible 1-base substitution variants of each
sequence are pre-computed, converting a quadratic order all-sequence
comparison into a linear hash comparison [87]. In a small comparison
study performed in-house, however, we came to the conclusion that all
these methods with widely different first principles produce largely
similar ecological results, for instance in terms of alpha and beta
diversity or class-level taxonomic composition, as simply clustering to
97% using Uparse. However, it was also clear that the problem of OTU
instability is not entirely by-passed by any of these heuristics.
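The entropy criterion at the core of minimum-entropy decomposition can be illustrated with a short sketch. It computes the Shannon entropy of each alignment column and performs a single split on the highest-entropy column. This is a simplified illustration under the assumption of equal-length aligned reads; the function names are hypothetical, and the published method additionally applies abundance thresholds and iterates until no cluster has a significant entropy peak.

```python
import math
from collections import Counter


def column_entropies(aligned_reads):
    """Shannon entropy (in bits) of each column of equal-length aligned reads."""
    n = len(aligned_reads)
    entropies = []
    for column in zip(*aligned_reads):
        counts = Counter(column)
        h = -sum((c / n) * math.log2(c / n) for c in counts.values())
        entropies.append(h)
    return entropies


def split_on_peak(aligned_reads):
    """Split reads by the base they carry at the highest-entropy column."""
    entropies = column_entropies(aligned_reads)
    peak = max(range(len(entropies)), key=entropies.__getitem__)
    groups = {}
    for read in aligned_reads:
        groups.setdefault(read[peak], []).append(read)
    return peak, groups


# Toy alignment: the reads differ only at the third position.
alignment = ["ACGT"] * 4 + ["ACTT"] * 4
peak, groups = split_on_peak(alignment)
```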
After the OTUs in a study are determined, it is crucial to assign a
taxonomic classification to them, so that the OTU dynamics across
samples can be interpreted in light of what is known about these taxa
from previous studies, and more broadly to allow comparison across
microbiomics studies. Unfortunately, there is also no consensus in the
research community about how to assign taxonomy to OTUs. Certain
workflows, such as QIIME [72] and Mothur [71], include the
classification step. Other software packages are dedicated exclusively to
it. For instance, the Ribosomal Database Project (RDP) Classifier uses a
naïve Bayesian approach to classify sequences based on exact matches of
8-letter words, and performs bootstrapping to give probability estimates
of the correctness of the assignment [88]. Another popular approach to
sequence classification is the Silva Incremental Aligner (SINA) [89].
SINA uses an initial k-mer based search similar to that of the RDP, but
then uses the subset of the reference sequences matched best by the
k-mer search to construct a directed acyclic graph representing the
composite of all unique selected sequences and calculates an exact
alignment between the query sequence and the reference candidates.
Finally, the sequence taxonomy is assigned as the least common
ancestor of the top-scoring alignments. These approaches generally
perform much more poorly for eukaryotes than prokaryotes, due both to
more incomplete databases and to a more elaborate taxonomy.
Therefore, databases and placement strategies for eukaryotic microbes
are still being developed [90, 91]. For well-studied environments of
limited diversity, placing OTUs directly over a phylogenetic tree is a
good strategy for assigning least common ancestor taxonomy to OTUs of
interest, but this approach is computationally demanding and doesn’t
scale well for large datasets with high taxonomic diversity [92].
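The final step of a least-common-ancestor assignment can be sketched as below. This is a minimal illustration that assumes each top-scoring reference hit comes with a lineage listed from domain down to the finest rank; it is not the SINA implementation, and the function name is hypothetical.

```python
def least_common_ancestor(lineages):
    """Return the longest rank-by-rank prefix shared by all lineages.

    lineages: list of lists, each ordered from domain to the finest rank.
    """
    if not lineages:
        return []
    shared = []
    for ranks in zip(*lineages):
        if len(set(ranks)) != 1:   # the hits disagree at this rank
            break
        shared.append(ranks[0])
    return shared


# Two hits that agree down to family level but differ in genus.
hits = [
    ["Bacteria", "Proteobacteria", "Gammaproteobacteria",
     "Enterobacterales", "Enterobacteriaceae", "Escherichia"],
    ["Bacteria", "Proteobacteria", "Gammaproteobacteria",
     "Enterobacterales", "Enterobacteriaceae", "Shigella"],
]
assignment = least_common_ancestor(hits)
```

In this toy case the query would be classified only to family level, since the two best hits cannot be told apart at genus level.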
Any combination of methods and algorithms chosen to profile
community microbiomes has its own intrinsic and unavoidable
biases. Which method produces the results closest to the underlying
community is difficult to assess and depends on the specific community
under study, but being aware of the biases produced by each method is
crucial both to method selection and to data interpretation and
comparison across studies.
iv. Data processing and statistical analyses of microbiomics
Multisample microbiomics data is generally summarised as a
table of read counts per OTU per sample. These tables are generally
very sparse, especially for OTUs belonging to the rare biosphere. The
interpretation of these counts of 0 is not straightforward, since they may
represent the true absence of an OTU or its presence under the
detection limit. Each of the various steps of processing, from DNA
extraction through library preparation and sequencing, gives a different
yield for each sample, meaning that the detection limit is neither shared
by all samples nor easily estimated. While this can be bypassed by
normalizing counts per sample, that means that observations are no
longer independent, since an increase in the relative abundance of an
OTU induces a perceived reduction in all others.
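The loss of independence caused by per-sample normalisation can be shown with a toy calculation. The counts below are invented for illustration only, and the function name is hypothetical.

```python
def relative_abundance(counts):
    """Convert a list of read counts into proportions that sum to 1."""
    total = sum(counts)
    return [c / total for c in counts]


# Only the first OTU changes in absolute terms between the two samples.
before = relative_abundance([100, 100, 100])
after = relative_abundance([400, 100, 100])
```

Although the second and third OTUs have the same counts in both samples, their relative abundance drops from one third to one sixth once the first OTU increases.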
Given these caveats, the choice of metric when comparing
samples or OTUs through clustering or ordination analyses can have
important consequences. In addition to true distance metrics, it is
common in microbial ecology to use correlation coefficients, such as
Pearson’s product moment and Spearman’s rank, or measures of
dissimilarity, such as Bray-Curtis’. The appropriate metric for a study
might depend on the size of the effect of interest and on the depth of
sampling. Semi-quantitative measures such as Spearman’s require a
larger effect size, while Euclidean distances often require a very large
sample [93]. Bray-Curtis can be appropriate for datasets with many
zeroes, but may also lack sensitivity [94].
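As an illustration of why Bray-Curtis copes well with sparse tables, the sketch below computes it directly from two count vectors. It is a minimal example with a hypothetical function name; in practice counts are usually normalised or transformed first.

```python
def bray_curtis(u, v):
    """Bray-Curtis dissimilarity between two count vectors of equal length."""
    if len(u) != len(v):
        raise ValueError("samples must cover the same OTUs in the same order")
    total = sum(u) + sum(v)
    if total == 0:
        raise ValueError("both samples are empty")
    return sum(abs(a - b) for a, b in zip(u, v)) / total


# OTUs absent from both samples (double zeroes) do not change the result.
d_dense = bray_curtis([6, 4], [4, 6])
d_sparse = bray_curtis([6, 4, 0, 0, 0], [4, 6, 0, 0, 0])
```

The value ranges from 0 for identical samples to 1 for samples sharing no OTUs.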
An alternative to OTU-based distances is to use phylogenetic
distances. While these approaches also require several non-trivial
choices, such as the underlying phylogenetic tree and the placement of
OTUs on it, phylogenetic distances are still more biologically
meaningful, not least because phylogenetic relatedness is often
associated with trait conservation [95]. As is the case for OTU-based
metrics, using a quantitative or qualitative approach to community
comparison can lead to very different results [96]. This can be
ameliorated through an appropriate weighting procedure, such as
generalised Unifrac [97].
Regardless of the metric chosen, exploratory methods are often
used for an initial assessment of how different communities cluster or
distribute themselves along a gradient. One of the oldest and most
common exploratory methods is the principal component analysis, or
PCA. In it, variables are treated as axes in a Euclidean
multidimensional space and the first principal component is by
definition placed on the direction representing the largest variation of
the data. The second component is placed in the direction orthogonal to
this that explains the largest amount of the remaining variance and so
on. The first two or three components often explain a large amount of
the variation, allowing a visual inspection of the distance between
samples in two- or three-dimensional space. Furthermore, the
percentage of the variation explained by each axis indicates whether
there are dominant drivers present or not. To avoid the constraint of the
Euclidean distance, principal coordinate analysis (PCoA) can be used
together with any dissimilarity matrix. Another conceptually similar
strategy is correspondence analysis (CA), where rather than maximising
the percentage of variance explained by each axis, the correspondence
between rows and columns in the matrix is optimised. Unlike these
techniques, in multidimensional scaling (MDS), the number of
dimensions to which the dataset should be reduced is chosen a priori
and the algorithm finds the distribution of objects in the lower-
dimension space that best corresponds to their distances in the full
dimension, while also calculating a stress function representing the
amount of the distortion between spaces. Non-metric MDS, or NMDS, is
an extension of this technique using ranks of distances rather than their
value.
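The definition of the first principal component can be made concrete with a small sketch that finds the direction of largest variance by power iteration on the covariance matrix. This is a didactic illustration in plain Python with a hypothetical function name; it assumes the data are not constant and that the starting vector is not orthogonal to the first component. Real analyses would rely on an established numerical library.

```python
def first_principal_component(rows, iterations=200):
    """Return (unit direction, fraction of variance explained) for PC1.

    rows: list of samples, each a list of numeric variables.
    """
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centred = [[r[j] - means[j] for j in range(p)] for r in rows]
    # Sample covariance matrix (p x p).
    cov = [[sum(c[i] * c[j] for c in centred) / (n - 1) for j in range(p)]
           for i in range(p)]
    # Power iteration converges to the eigenvector with the largest eigenvalue.
    v = [1.0] * p
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    eigenvalue = sum(v[i] * sum(cov[i][j] * v[j] for j in range(p))
                     for i in range(p))
    total_variance = sum(cov[i][i] for i in range(p))
    return v, eigenvalue / total_variance


# Toy data lying exactly on the line y = 2x: one axis explains everything.
direction, explained = first_principal_component(
    [[0, 0], [1, 2], [2, 4], [3, 6]])
```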
It is often not clear what the main driver of the community over
the gradient is, or even how many overlapping gradients there are.
These are the cases where exploratory methods are most needed, but
also where the biases of each method most affect the biological
interpretation of results. For instance, the horseshoe effect, where
sparse matrices driven by a single dominant gradient assume an arch-
like pattern when submitted to PCA or CA, may mask other, more subtle
gradients, and detrending techniques to eliminate this effect often erase
true patterns [93]. In datasets with many overlapping gradients, an
NMDS will often produce a clearer overview of the data distribution
than methods that don’t limit the number of dimensions [98]. It is
therefore advisable to try a variety of different approaches and
retain not only those which explain the largest proportion of the
variation in the dataset, but also those that propose underlying
biological mechanisms amenable to further investigation.
It is also possible to test the effect of specific parameters on the
community composition, using constrained techniques or discriminant
analysis. These techniques assess the proportion of the variation in the
data which can be explained by specific parameters or combinations of
them. Some of these methods test how well one matrix can be explained
by another symmetrically, and are useful for instance to see whether the
bacterial community in one filter fraction corresponds to that in another
from the same sampling event, or to assess correlations between the
prokaryotic and eukaryotic fractions of a sample. Similarly to PCA,
canonical correlation analysis (CCorA) tries to find the linear
combinations of variables in two datasets that provide the maximum
correlation between them. In Procrustes analysis, the same set of
objects (eg samples) placed on different spaces (eg biological domains
or metabolites) are translated, rotated and scaled to minimise the sum
of squared distances between each pair of corresponding
objects.
When it is clear which are the explanatory variables and which
are the response variables, as is typically the case when comparing
environmental parameters and microbial communities, the methods
discussed above can be constrained accordingly. Redundancy analysis
(RDA) is a version of PCA extended to sets of variables and constrained
to the explanatory variables. Likewise, canonical correspondence
analysis (CCA) is CA constrained to the explanatory variables. If one
variable overwhelms the effect of all others, as can be the case in
intervention studies where all treated samples are clustered together
and apart from the non-treated, a principal responses curve (PRC) can
be used [99]. This approach is also useful if an overlap of many
potentially interacting gradients makes the visual interpretation of an
RDA or CCA plot impossible.
Another family of analyses aims at assessing the significance of
the correlation between two matrices. This includes tests such as
analysis of similarities (ANOSIM), and Mantel’s test. The latter
calculates the correlation between corresponding positions in two
matrices and assesses significance by permutation. ANOSIM, similarly
to NMDS, is based on ranks of object distances, and compares the ranks
of distances of objects within classes with those between classes.
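The permutation logic of Mantel's test can be sketched as follows. This is a minimal illustration using Pearson correlation on the upper triangles of two square distance matrices; the function names, the number of permutations and the fixed random seed are assumptions made for the example.

```python
import random


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def upper_triangle(matrix, order):
    """Flatten the upper triangle of a square matrix after reordering objects."""
    n = len(order)
    return [matrix[order[i]][order[j]]
            for i in range(n) for j in range(i + 1, n)]


def mantel(d1, d2, permutations=999, seed=0):
    """Mantel statistic and one-sided permutation p-value."""
    rng = random.Random(seed)
    identity_order = list(range(len(d1)))
    x = upper_triangle(d1, identity_order)
    observed = pearson(x, upper_triangle(d2, identity_order))
    hits = 1  # the observed arrangement counts as one permutation
    for _ in range(permutations):
        order = identity_order[:]
        rng.shuffle(order)  # permute rows and columns of d2 together
        if pearson(x, upper_triangle(d2, order)) >= observed:
            hits += 1
    return observed, hits / (permutations + 1)


# Toy example: the second matrix is simply a rescaled copy of the first.
points = [0, 1, 2, 3, 4, 5]
d1 = [[abs(a - b) for b in points] for a in points]
d2 = [[2 * abs(a - b) for b in points] for a in points]
r, p = mantel(d1, d2)
```

Rows and columns are permuted jointly, so the dependence between distances that share an object is preserved under the null model.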
None of the strategies discussed here can distinguish correlation
from causation, except perhaps in intervention studies. More
importantly, clusters and gradients produced along artificial axes do not
necessarily correspond to any underlying biological effect. From a
mathematical perspective, variables of different type (eg metabolomics
versus microbiomics) will often have different variance-to-mean
characteristics, which requires appropriate data transformation [98].
New methods for testing hypotheses based on high-throughput data are
still being developed, and understanding their strengths as well as their
assumptions is a crucial and challenging issue for microbial ecologists.
v. Network construction and community dynamics
Given the overall design and caveats of a microbiomics study, the
sampling strategy will determine which biological questions can be
addressed. A comparison of the microbiome of diseased patients and
healthy controls can show which taxa differentially colonise these
subjects, but will give no insight into whether the altered microbiome is
promoting disease or if the disease state selects different taxa. In an
experiment with animal models or directly on the environment, it is
possible to see how the microbiome reacts to the intervention. In some
cases, it is possible to alter the microbiome (through antibiotics,
prebiotics and/or the introduction of foreign microbial cells) and observe
the impact on environment or host phenotype. However, there is
mounting evidence that within a given ecosystem, interactions between
taxa play a more important role in driving community dynamics than
environmental forcing [18, 100].
The most basic approach to hypothesising interactions between
microbial populations is through pairwise relationships, either as
presence/absence (“checkerboard patterns”) or through quantitative
measures. The latter generally rely on measures of correlation such as
Spearman’s and Pearson’s, while the hypergeometric distribution is
appropriate for binary data [101, 102]. In either case, the underlying
hypothesis is that, if there is an interaction between two species, and
given similar environments with similar resources, these two species
will co-occur more often than expected by chance if their interaction is
beneficial (mutualism or commensalism) and less often than expected by
chance if their interaction is detrimental (competition or amensalism).
However, two of the most important types of interactions in natural
systems, predation and parasitism, are beneficial to one of the parties
(the predator or parasite) and detrimental to the other (the prey or
host), complicating the ecological interpretation of co-occurrence
patterns. Furthermore, given the intricacies of microbial metabolism, it
is seldom clear if a species is excluded from a niche due to negative
interactions with other organisms or due to unmeasured variables in the
environment. Nevertheless, mapping pairwise correlations can be a
useful first step in developing a network hypothesis. Since in the typical
case thousands of correlations and anticorrelations will be tested, any
association has to be tested for significance and
subjected to multiple testing correction. This is done by randomising the
interaction network and calculating the distribution of scores. It is
however still not clear what the correct randomisation procedure is for
this type of data [94].
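For presence/absence data, the hypergeometric test and a subsequent multiple testing correction can be sketched as below. This is an illustration only: the text does not prescribe a particular correction, so the Benjamini-Hochberg procedure is used here as an assumption, and both function names are hypothetical. The sketch also does not replace the network randomisation discussed above.

```python
from math import comb


def cooccurrence_pvalue(present_a, present_b):
    """P(at least the observed number of shared samples) under a
    hypergeometric null model.

    present_a, present_b: lists of booleans, one entry per sample.
    """
    n = len(present_a)
    a = sum(present_a)
    b = sum(present_b)
    both = sum(x and y for x, y in zip(present_a, present_b))
    tail = sum(comb(a, k) * comb(n - a, b - k)
               for k in range(both, min(a, b) + 1))
    return tail / comb(n, b)


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Two taxa found in exactly the same 5 of 10 samples.
taxon_a = [True] * 5 + [False] * 5
taxon_b = [True] * 5 + [False] * 5
p_shared = cooccurrence_pvalue(taxon_a, taxon_b)
adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
```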
In addition to being a rich representation of interactions between
particular nodes, properties of the network itself can contain
information about the system. For instance, microbial networks are
generally modular, scale-free and have short average path length [94].
How these mathematical properties translate into biological properties
is still open to debate. It is not clear, for instance, whether a node with a
high out-degree represents a keystone clade whose demise would
severely perturb the entirety of the system, or whether the levels of
redundancy and plasticity in biological systems are enough to replace
these hubs without much propagation of perturbation. In the case of
bacteria in particular, not only does the community present a certain
level of plasticity, but single species and even individual cells can
dramatically alter their life strategy in response to disturbances,
decoupling to a large extent a community's taxonomic composition from
its functional profile [103]. Network properties also interact with
community characteristics such as richness and evenness, and often
have opposite effects in the resulting resistance and resilience of the
community to perturbation, so that broad natural laws of community
stability might be impossible to obtain [103].
A powerful approach to gain insight into the internal mechanisms
of a natural microbial community is sampling time-series with
appropriate intervals and length and using techniques such as local
similarity analysis [104, 105] or auto- and cross-correlation [18, 106,
107]. If a system has an intrinsic periodicity, such as annual cycles, a
few full cycles should be included in the study to separate recurring
patterns, random fluctuations and system drift (time decay), as was
done for the Western English Channel and in Lake Mendota [18, 108].
Also important is to consider that different processes might take place
at different rates, corresponding to one or more sampling intervals, as
seen in the San Pedro Ocean Time-Series, or, conversely, that
associations that are significant in the short term are irrelevant in
longer time-spans, as described for different stations on the coast of
California [105, 109].
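A lagged cross-correlation between two time-series can be sketched as follows. This is a simplified illustration that assumes regularly spaced samples and equal-length series, which is rarely exactly true for field data; the function names are hypothetical, and it is not the local similarity analysis cited above.

```python
def pearson(x, y):
    """Pearson product-moment correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def lagged_correlation(x, y, lag):
    """Correlation between x at time t and y at time t + lag (lag >= 0)."""
    if lag < 0 or lag > len(x) - 3:
        raise ValueError("lag must leave at least three paired observations")
    return pearson(x[:len(x) - lag], y[lag:])


def best_lag(x, y, max_lag):
    """Lag in 0..max_lag with the strongest positive correlation."""
    return max(range(max_lag + 1),
               key=lambda lag: lagged_correlation(x, y, lag))


# Toy series: y follows x with a delay of one sampling interval.
x = [0, 1, 0, -1, 0, 1, 0, -1, 0, 1]
y = [-1] + x[:-1]
lag = best_lag(x, y, max_lag=3)
```

In this example the two series are almost uncorrelated at lag 0, but perfectly correlated when y is shifted by one interval.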
Strong seasonal recurrence has been reported in several sites,
with the rate of interannual decay declining with the length of the time-
series [18, 64]. Seasonal patterns are believed to be formed by a
combination of external forcing (light, temperature, wind and currents)
and intrinsic community mechanisms, such as trade-offs between
resistance to predation and growth efficiency. Temporal decay is a
phenomenon whereby events which are close in time are more related
than those that are further apart. This can happen due to random
chance events which accumulate over time or due to undetected
changes in underlying conditions. Due to seasonality, the relevant
comparisons for most microbial assemblages are those between samples
one year apart, ie, in the same season. Stochastic factors mean that
there is significant loss of signal from one year to the next. However,
environmental and biological constraints maintain community variation
within certain boundaries. Therefore, calculating inter-annual decay
over several years tends to give lower yearly rates than would be
calculated on shorter time-series. In addition to recurrent and linear
(time-decay) patterns, dramatic but rare events can occasionally also be
observed in long time-series, as otherwise rare taxa bloom in response
to changing environmental conditions, which has been reported in the
Western English Channel, in the Bermudas and in the central Baltic Sea
[18, 110, 111].
In PAPER III, the methods developed in PAPER I and PAPER II
were applied to a 3-year long time-series of samples collected from the
Linnaeus Microbial Observatory, 10 km east of Öland. These samples
were collected during the entire ice-free period, with sampling intervals
of 4-15 days between samples. An initial analysis, focusing on the 2011
bacterial fraction, was published by Lindh et al. [111]. Using 454
pyrosequencing and clustering at 97% identity, 3079 OTUs were formed,
the vast majority of which were rare (<0.1% abundance) and infrequent.
Considering the sequencing depth obtained by 454, of <10,000 reads
per sample, it is expected that rare populations will also be infrequent.
Amongst the abundant OTUs (>1%), some had clear seasonal patterns;
some had less clear peaks, which are either stochastic, driven by
unmeasured environmental parameters or by hydrodynamics; and 8
were abundant in more than half of samples. Interestingly, these 8
OTUs, which together account for >50% of the total number of
sequencing reads, span three phyla, including three proteobacterial
classes. Concentrations of nitrate, ammonium, silicate and phosphate,
as well as dinoflagellate biovolume, correlated significantly with
community composition. Community richness was significantly
correlated to chlorophyll a concentration, indicating that eukaryotic
phytoplankton play an important role in sustaining a diverse
heterotrophic bacterial community. The communities as a whole formed
significantly distinct clusters by season, with gradual changes in
abundance within season, but sharp changes in composition on seasonal
transitions.
This study was extended in PAPER III by sequencing prokaryotic
tags from three years (2011-2013) as well as eukaryotic tags for the
latter two. Some of the observations made in the Lindh study [111] were
confirmed, such as the temporal clusters of 16S tags and a fuzzier, but
still present, clustering of 18S tags. However, a degree of uncertainty
and inconsistency between years was also observed. Despite the deeper
sequencing obtained with Illumina technology, most OTUs are rare and
infrequent, with only 13% of them present in half the samples or more.
The dynamics of individual OTUs are fairly distinct from year to year,
with their maxima often occurring in different seasons (40% of OTUs)
and in a different period in relation to algal blooms (60% of OTUs). This
unpredictability extends to correlations between OTUs, with only 30-
40% of strong links (Spearman’s r >= 0.7) being predicted by more than
one of the yearly time-series. The dynamics of eukaryotic tags are less
predictable, since few strong links are observed between them, or
between them and bacterial tags. Accordingly, no combination of the
environmental factors measured could explain more than 33% of the
community variation for any of the microbial subsets or seasons
analysed.
Despite this large degree of uncertainty, more than half of the
most frequent OTUs could be modeled with a fair degree of precision.
Using random forests, predictive models were generated for each OTU
based on other OTUs, environmental factors or both sources of data.
309 bacterial tags and 120 eukaryotic tags could be modelled and present
R² > 0.5 when comparing predicted data to actual measurements as
well as statistical significance compared to randomised data. Further,
for 83 bacterial tags, models could be found to predict their abundance
in one time point based on the previous one, 61 could be predicted two
time points ahead and several others (20-40 per interval) could be
predicted at longer time-spans. This has important implications for
environmental monitoring, since it implies that OTUs of interest can
have their relative abundance predicted at least two weeks in advance,
shifting surveillance from a responsive to a predictive and proactive
paradigm.
vi. Metagenomic surveys
While microbiomics can provide an inventory of microbial taxa in
an environment and even propose links between taxa or between these
and environmental parameters, it cannot do a comprehensive functional
profiling of the environment of interest. Assessing the metabolic
potential of a complex environment requires sequencing a
representative set of functional genes in it; this is achievable through
shotgun metagenomics.
The term metagenomics was coined by Handelsman and
colleagues in 1998, when they described “the collective genomes of soil
microflora, which we term the metagenome of the soil” and proposed
discovering new pathways for the synthesis of natural products by
transforming E. coli with genes from soil microbes [112]. The current
understanding of metagenomics, that is, the sequencing of the full gene
inventory of an environment, appeared first in 2004 with the release of
two papers describing whole DNA sequencing from environmental
samples. In one, near complete genomes from a simple acid mine
drainage biofilm were described [113], while the other focused on
genomic fragments from an oligotrophic marine environment, the
Sargasso Sea [114].
While all the considerations above on sample collection and DNA
extraction also apply to metagenomics, the issues of primer selection
and PCR biases are circumvented. The computational demands of
metagenomic analysis are, however, much greater. Some metagenomic
analyses, such as limited phylogenetic profiling, can be done directly
based on short sequencing reads [115], but most of the information in
the genome can only be analysed based on nearly full-length genes or
operons. In fact, the closer to a full chromosome that can be obtained,
the more information on genetic architecture, physiology and evolution
can be obtained. The process of converting short sequencing reads into
much longer contiguous stretches, or contigs, is called assembly and is
still an active research area for isolate genomes as well as for
metagenomes [116].
A naïve approach to genome assembly is to use the ends of reads
that overlap other reads to piece them together. This strategy, named
overlap-layout-consensus (OLC), has been highly successful on the
relatively long sequencing reads produced by technologies such as
Sanger sequencing. However, the large number of short reads (~50-300
bp) produced by Illumina and other massively parallel sequencing
technologies is unsuitable for OLC, as the process would be
simultaneously computationally intractable and extremely error prone.
Instead, most modern assemblers are based on de Bruijn graphs. A de
Bruijn graph is a directed graph which represents overlaps between
short sequences such that each node is a sequence and each edge
represents an overlap between them. For genome assembly, the length
of the short sequence, or k-mer, is typically between 21 and 91, and
each edge joins two k-mers offset by a single base, that is, overlapping
by k-1 bases (so that each edge corresponds to a k+1-mer). Since there are
far fewer unique k-mers of a given length than unique sequencing reads,
the de Bruijn graph structure greatly reduces the amount of memory
required for the assembly, and the search space and hence computation
time. With the graph in place, the problem of assembly is reduced to a
problem of finding the shortest path connecting all nodes in the graph.
If there were no loops in the graph and sequencing quality was perfect,
this problem would be trivial. However, sequencing errors and unequal
sequencing coverage create dead ends in the path which must be
resolved. In addition, there is natural biological repetition in the genome, so
the formation of loops is inevitable. To solve this, most assemblers rely
on heuristics based on the depth of sequencing in each path. Most
errors will be rare, so low coverage paths (“bubbles”) may be discarded.
True repetitive regions will have particularly high coverage, and can
therefore either be present in the assembly multiple times or be left as a
separate contig, in case they cannot safely be connected to the rest of
the assembly.
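The construction of a de Bruijn graph from reads can be sketched as below. This is a toy illustration with hypothetical function names: it ignores reverse complements, sequencing errors and coverage, and the contig walk only checks the number of successors of each node, so it is far simpler than what real assemblers do.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Map each k-mer to the set of k-mers that follow it in some read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k):
            # Consecutive k-mers are offset by one base (overlap of k-1).
            graph[read[i:i + k]].add(read[i + 1:i + k + 1])
    return graph


def walk_unambiguous(graph, start):
    """Extend a contig from `start` while each node has exactly one successor."""
    contig, node, seen = start, start, {start}
    while len(graph.get(node, ())) == 1:
        (nxt,) = graph[node]
        if nxt in seen:  # stop at loops instead of cycling forever
            break
        contig += nxt[-1]
        seen.add(nxt)
        node = nxt
    return contig


# Three overlapping toy reads drawn from the sequence ATGGCGTAC.
reads = ["ATGGCG", "GGCGTA", "CGTAC"]
graph = de_bruijn_graph(reads, k=3)
contig = walk_unambiguous(graph, "ATG")
```

A node with more than one successor marks a branch in the graph, which is where the coverage-based heuristics described above come into play.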
In the case of metagenome assembly, all the issues described
above still hold, with the aggravation that many genes and genetic
elements will be found with small variations in several genomes. In this
case, a low coverage path is not necessarily a sign of a sequencing error
to be collapsed into a contig of higher coverage; rather, it may represent
true biological variation. The use of coverage to differentiate between
different possible paths in a de Bruijn graph is therefore even more
crucial for metagenomic assembly than for single genomes.
Metagenomic assemblies are often highly fragmented, that is,
many more short contigs are generated than the number of full length
chromosomes and plasmids in the biological sample, due to repeated
regions between genomes and genomes of low abundance in the
community. One way to improve this is to combine the strengths of
different k-mer lengths. While short k-mers will generally have higher
coverage, they are less likely to be able to resolve loops in the de Bruijn
graph, which results in a long total length of the assembly, but spread
over several short contigs. Long k-mers have more capacity to bridge
over repeats and resolve loops but, having lower coverage, they will
generally not result in an equally long total assembly since many k-mers
will be missing. Therefore, several assembly strategies exist that
attempt to combine the strengths of short and long k-mers through
procedures such as scaffolding, reassembly with an OLC method or
built-in progressive assembly. Scaffolding does not directly extend
contigs, but places them in the correct order and direction while
estimating the size of assembly gaps (unassembled regions) between
them. This is generally done by taking advantage of mate-pair reads
with long insert sizes that bypass the maximum length of the DNA
fragment that can be sequenced. While in principle useful, this strategy
requires special laboratory procedures and is in practice rarely used in
metagenomic sequencing. On the other hand, producing contigs by de
Bruijn graph assembly and then clipping them into long substrings
(~500-1000 bp) reduces the problem of short-read assembly into that of
long-read assembly, which is tractable by OLC. This approach has been
shown to significantly improve the length of contigs produced with a
very small increase in the rate of misassembly [117–120]. Finally, some
assemblers, such as Megahit and IDBA-UD, use progressively increasing
k-mers in their assemblies, while retaining contigs obtained at previous
iterations [121, 122]. They also implement different thresholds for
removing erroneous k-mers at each iteration, which is a great advantage
when assembling data with highly uneven coverage, as is inevitably the
case for metagenomic datasets.
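The trade-off between short and long k-mers can be illustrated with a toy example. The sketch below is not taken from any of the assemblers cited; the function names and the 16 bp sequence are assumptions made for illustration. It builds the de Bruijn graph of a single error-free read for a given k and lists the nodes with more than one successor, which is where a contig would have to be broken.

```python
from collections import defaultdict


def de_bruijn_edges(reads, k):
    """Map each (k-1)-mer node to the set of (k-1)-mer nodes that follow it."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def branching_nodes(graph):
    """Nodes with more than one successor, i.e. unresolved forks in the graph."""
    return sorted(node for node, successors in graph.items() if len(successors) > 1)


# Hypothetical toy sequence: the repeat "ACGT" occurs twice, in different contexts.
genome = "TTACGTGGCCACGTAA"
reads = [genome]  # one error-free read, to keep the example small

print(branching_nodes(de_bruijn_edges(reads, 4)))  # ['CGT']
print(branching_nodes(de_bruijn_edges(reads, 6)))  # []
```

With k = 4 the node CGT, which lies inside the repeated ACGT, has two possible successors, whereas with k = 6 every node has a single successor. The same 16 bp read yields 13 4-mers but only 11 6-mers, which is the coverage cost of the longer k.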
To extract biologically relevant information from assembled
contigs, the next crucial step is usually gene prediction. Predicted genes
are used first and foremost for inferring possible metabolic pathways
present in the community. Although the presence of a gene does not
imply that it is being transcribed and much less that it is enzymatically
active, it is a reasonable assumption that communities presenting a
certain gene should at least have the potential for realising that activity
under appropriate conditions, such as in the presence of substrates and
the absence of inhibitors. While gene prediction and gene functional
assignment are not trivial problems, available software is sufficiently
accurate to allow good estimates of metabolic potential from complex
communities [123]. The quantitative information contained in the
sequencing runs is lost during assembly. Therefore, quantifying the
relative copy number of genes and genomes in a metagenome
requires mapping reads back to the assembly. This quantitative
information can then be related to quantitative measures of
environmental parameters.
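As a minimal sketch of how mapped reads can be turned into quantitative information, the example below computes mean coverage per contig and normalises it to a relative abundance. The contig names, lengths and read counts are hypothetical, and a fixed read length is assumed; in practice these numbers would come from a read mapper's output.

```python
def mean_coverage(mapped_reads, read_length, contig_length):
    """Average number of times each base of the contig is covered by a read."""
    return mapped_reads * read_length / contig_length


def relative_abundance(coverages):
    """Scale coverages so that they sum to 1 across all contigs."""
    total = sum(coverages.values())
    return {name: cov / total for name, cov in coverages.items()}


# Hypothetical values: name -> (number of mapped reads, contig length in bp)
contigs = {"contig_a": (500, 10_000), "contig_b": (300, 2_000)}
READ_LENGTH = 100  # assumed fixed read length in bp

coverages = {
    name: mean_coverage(n_reads, READ_LENGTH, length)
    for name, (n_reads, length) in contigs.items()
}
abundances = relative_abundance(coverages)
```

Although more reads map to contig_a, the shorter contig_b has the higher coverage (15x against 5x), which is why read counts need to be normalised by sequence length before being related to environmental parameters.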
Conserved gene families can also be used to infer the
phylogenetic origin of contigs, and therefore of the underlying
community [124, 125]. Other approaches for the phylogenetic
placement of contigs are based on k-mer profiles, alignment of full
contigs to a database or combinations of these approaches. A
fundamental distinction to be made between these approaches is that
marker-gene based strategies will necessarily only classify the subset of
all contigs which contains these genes, while an alignment or k-mer
based approach can be applied to all contigs and even individual reads.
A comprehensive comparison of available software was performed by
Peabody and colleagues [126]. By performing in silico and in vitro clade
exclusion experiments, that is, removing from the database the target
taxon, they found that several methods tend to grossly overestimate the
community diversity. The precision and sensitivity of every method fall,
as expected, with increasing levels of clade exclusion and increase with
read length. When the query sequences are present in the database,
k-mer based methods outperform those based on marker genes, since they
can analyse all the sequencing information and not only predicted
genes. In the more realistic case where metagenomic reads come from
sequences not already known in databases, however, marker-gene-based
predictions outperform k-mer based approaches by far. This also
highlights the importance of good assembly and gene calling
procedures, to maximise the number of contigs containing a correctly
predicted marker gene.
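The two performance measures mentioned above can be sketched as follows. The definitions used here are an assumption for illustration (precision as the fraction of classified queries that are correct, sensitivity as the fraction of all queries that are correctly classified) and may differ in detail from those used in the cited comparison; the query names and taxa are hypothetical.

```python
def precision_sensitivity(assignments, truth):
    """assignments: query -> predicted taxon, or None if left unclassified.
    truth: query -> true taxon."""
    classified = [q for q, taxon in assignments.items() if taxon is not None]
    correct = sum(1 for q in classified if assignments[q] == truth[q])
    precision = correct / len(classified) if classified else 0.0
    sensitivity = correct / len(truth) if truth else 0.0
    return precision, sensitivity


# Hypothetical example: four reads, one misclassified and one unclassified.
truth = {"r1": "SAR11", "r2": "SAR11", "r3": "acIV", "r4": "LD19"}
assignments = {"r1": "SAR11", "r2": "acIV", "r3": None, "r4": "LD19"}

precision, sensitivity = precision_sensitivity(assignments, truth)
```

Here two of the three classified reads are correct, giving a precision of 2/3, while only two of the four reads in total are correctly classified, giving a sensitivity of 0.5. A method that classifies fewer reads can therefore gain precision at the expense of sensitivity.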
Global oceans were among the first environments to be explored
through metagenomics and are still among the most researched [114,
127–129]. These studies have uncovered much phylogenetic novelty at
all taxonomic levels. Functional novelty was also discovered, such as
novel proteorhodopsin families, novel pigment gene clusters and
ammonia oxidation capacity in Archaea [114, 130]. Crucially, with
metagenomics, hypotheses initially postulated from microbiomics
studies can be assessed at the functional gene level. For instance,
Venter and colleagues confirmed in the Sargasso Sea that
bacterioplankton contigs cluster into defined species, not a continuum
of diversity [114]. Later studies deepened this observation by showing
that even cosmopolitan clades such as SAR11 display microdiversity and
geographical structuring [131]. From a biogeography perspective,
Dupont, Larsson and colleagues confirmed the observation that
the bacterioplankton functional gene profile of the Baltic Sea follows its salinity
gradient, with a typically freshwater profile in the low salinity North
becoming increasingly marine towards the mesohaline southwest [129].
Metagenomic data has the further advantage that, because it is broad
and unbiased by primer choice, it can be combined across studies to
present a more complete picture of a biological phenomenon or
reexamined by other research groups to answer questions not even
considered by those which produced the data.
vii. Genome reconstruction from metagenomic data
Although a great deal can be learnt about a bacterial community
based on gene prediction, in practice, metabolism takes place in
structured pathways, mostly in and on cells, and not as a loose
collection of genes. Therefore, much more information can be obtained
when contigs are put together into the context of a single organism.
While full genomes are the ultimate goal of assembly, this is in practice
not feasible, and other sources of information must be used to bin
contigs into fragmented draft genomes (metagenome-assembled
genomes, or MAGs). Sequence-based parameters, such as GC-content
and tetranucleotide frequencies, have been successfully used to produce
MAGs in some cases, but these approaches can generally only
discriminate down to the genus level [132–134]. By including
information on coverage of contigs across multiple related samples it is
possible to obtain a finer level of resolution, at species and sometimes
strain levels [135–137]. For the sake of reproducibility and cross-study
comparison, this process should be fully automated. A few tools have
been put forth that can do that [138–141], although binning genomes
from specific organisms of interest might still require semi-manual
approaches [142].
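The sequence-based and coverage-based signals described above can be combined into one feature vector per contig, which a clustering method can then group into bins. The sketch below only illustrates the feature construction and does not reproduce any of the cited tools; the function names are assumptions, and, unlike most real implementations, it does not merge reverse-complementary tetramers.

```python
from collections import Counter
from itertools import product

# All 256 possible tetranucleotides, in a fixed order.
TETRAMERS = ["".join(p) for p in product("ACGT", repeat=4)]


def gc_content(seq):
    """Fraction of unambiguous bases that are G or C."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    return (seq.count("G") + seq.count("C")) / acgt if acgt else 0.0


def tetranucleotide_frequencies(seq):
    """Relative frequency of each tetramer, ignoring windows with ambiguous bases."""
    seq = seq.upper()
    counts = Counter(seq[i:i + 4] for i in range(len(seq) - 3))
    total = sum(counts[t] for t in TETRAMERS)
    return [counts[t] / total if total else 0.0 for t in TETRAMERS]


def binning_features(seq, coverages):
    """Composition profile followed by the contig's coverage in each sample."""
    return tetranucleotide_frequencies(seq) + list(coverages)


# Hypothetical contig with coverage values from three samples.
features = binning_features("ACGTACGT", [12.0, 3.5, 0.0])
```

Contigs from the same genome are expected to have similar composition and to co-vary in coverage across samples, so they should lie close together in this feature space.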
New clades are increasingly being proposed based on MAG data
[137, 142–146]. In addition to providing new genomes from ecologically
important clades, the MAG approach can be used to close outstanding
gaps in the tree of life. By strategically sampling from environments
known to contain phylogenetic novelty and sometimes using semi-
manual approaches, deeply diverging clades have been found and
characterised. These include 35 bacterial phyla from a proposed
candidate phyla radiation [137], several genomes from previously
proposed candidate phyla [145] and an archaeal clade proposed to be a
sister clade to that which gave rise to all eukaryotes [142].
In PAPER IV, the automated contig binning tool CONCOCT [138]
was used to generate high-quality MAGs from water samples from the
Linnaeus Microbial Observatory’s 2012 sampling. The degree of
completeness and purity of bins was estimated based on 36 universal
prokaryotic single-copy genes. The bins considered for
further analyses were those that presented at least 30 of these genes,
with no more than two of them in multiple copies. Due to these very stringent
selection criteria, only 89 bins were selected, corresponding to 29
bacterial and 1 archaeal species, each of which was termed BACL, for
BAltic CLuster. With the exception of BACL8, none of these species had
been genome sequenced before. In a few cases, these were the
first published genomes for bacterial lineages known from 16S
sequencing to be highly abundant in freshwater environments (acIV of
Actinobacteria and LD19 of Verrucomicrobia). The presented genomes
complement the previous 16S data with information on these organisms'
metabolic potentials and may guide future efforts in isolating them.
Mapping sequencing reads from previous metagenomic surveys of lakes
and oceans from around the globe to each of the genomes assembled
gave their spatial distribution, showing that lineages are differentially
abundant in different salinity levels. Actinobacteria, for instance, are
more closely related to freshwater organisms, while Bacteroidetes and
Alphaproteobacteria are closer to their marine counterparts.
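The bin selection rule described above (at least 30 of the 36 single-copy genes present, no more than two of them in multiple copies) can be written as a simple filter. This is a sketch of the rule as stated in the text, not the code used in Paper IV; the gene identifiers and the helper that builds toy bins are hypothetical.

```python
def passes_quality_filter(scg_copies, min_present=30, max_multicopy=2):
    """scg_copies maps each single-copy gene to its copy number in one bin."""
    present = sum(1 for n in scg_copies.values() if n >= 1)
    multicopy = sum(1 for n in scg_copies.values() if n >= 2)
    return present >= min_present and multicopy <= max_multicopy


# Hypothetical identifiers standing in for the 36 single-copy genes.
GENES = [f"scg{i:02d}" for i in range(1, 37)]


def make_bin(missing=0, duplicated=0):
    """Build a toy bin with the given numbers of missing and duplicated genes."""
    copies = {gene: 1 for gene in GENES}
    for gene in GENES[:missing]:
        copies[gene] = 0
    for gene in GENES[missing:missing + duplicated]:
        copies[gene] = 2
    return copies


print(passes_quality_filter(make_bin(missing=4, duplicated=1)))   # True
print(passes_quality_filter(make_bin(missing=0, duplicated=5)))   # False
print(passes_quality_filter(make_bin(missing=10, duplicated=0)))  # False
```

The first threshold guards against incomplete bins and the second against bins that merge contigs from more than one genome, since a gene expected once per genome should rarely appear twice in a pure bin.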
Reads from previous surveys were also mapped to the mass of
unbinned contigs that comprised the rest of the LMO community. This
revealed a clear separation of aquatic environments based on salinity.
While it was already known that there is a strong divide between fresh
and marine environments [147], previous studies suggested that
brackish environments were comprised of a mixture of typically fresh
and typically marine OTUs [129, 148]. In this work, we could show that
this community is in fact comprised of brackish water specialists, an
observation later confirmed in a survey of the Caspian Sea [149].
viii. Summary and perspectives
Life on Earth was exclusively microbial for most of its history,
and is still predominantly so. Scientists have been striving to catalogue,
understand and manage this wealth of life for almost 250 years, and yet
have been severely limited by available techniques. Historically, while general
ecology has been based on direct observation combined with
mathematical modelling, breakthroughs in microbial ecology have been
coupled to technological advance. With recent advances in technologies
such as microfluidics and high-throughput DNA sequencing, as well as
the steady growth of computational methods and processing capacity,
the pace of advance in microbial ecology has greatly increased. In
addition to the approaches already discussed in this work,
microbiologists can now use metatranscriptomics, metaproteomics,
single-cell genome sequencing, flow cytometry, cell sorting, high-
throughput image analysis and nanoSIMS (nanoscale mass
spectrometry), together providing a wide array of complementary
techniques for assessing microbial phylogeny and activity in bulk as well
as at the single-cell level.
While the work of mapping and modelling microbial life on Earth
will remain an open field of basic scientific inquiry, it is important to also
consider the potential medical and technological applications of these
studies. From alternative fuel sources to environmental
decontamination, antibiotic resistance to allergy and autoimmunity
prevention and treatment, many of the biggest challenges of our times
may soon find their answers in the myriad strategies microorganisms
adopt to survive, compete, cooperate and thrive on Earth. It is therefore
crucial that the full potential, as well as the caveats and biases, of
established and nascent microbiology approaches are understood. It is
my hope that the work presented in this doctoral thesis will play a
part in this endeavour.
x. References
1. Falkowski P: Ocean Science: The power of plankton. Nature 2012,
483:S17–20.
2. Whitman WB, Coleman DC, Wiebe WJ: Prokaryotes: the unseen majority.
Proc Natl Acad Sci U S A 1998, 95:6578–6583.
3. Falkowski PG, Fenchel T, Delong EF: The microbial engines that drive
Earth’s biogeochemical cycles. Science 2008, 320:1034–1039.
4. Zehr JP, Kudela RM: Nitrogen cycle of the open ocean: from genes to
ecosystems. Ann Rev Mar Sci 2011, 3:197–225.
5. Stocker R: Marine microbes see a sea of gradients. Science 2012,
338:628–633.
6. Azam F, Malfatti F: Microbial structuring of marine ecosystems. Nat Rev
Microbiol 2007, 5:782–791.
7. Glibert PM, Heil CA, Hollander D, Revilla M, Hoare A, Alexander J, Murasko
S: Evidence for dissolved organic nitrogen and phosphorus uptake
during a cyanobacterial bloom in Florida Bay. Mar Ecol Prog Ser 2004,
280:73–83.
8. Conley DJ, Paerl HW, Howarth RW, Boesch DF, Seitzinger SP, Havens KE,
Lancelot C, Likens GE: Ecology. Controlling eutrophication: nitrogen and
phosphorus. Science 2009, 323:1014–1015.
9. Casini M, Hjelm J, Molinero J-C, Lövgren J, Cardinale M, Bartolino V, Belgrano
A, Kornilovs G: Trophic cascades promote threshold-like shifts in pelagic
marine ecosystems. Proc Natl Acad Sci U S A 2009, 106:197–202.
10. Leppäranta M, Myrberg K: Physical Oceanography of the Baltic Sea.
Chichester: Praxis Publishing Ltd; 2009.
11. Elmgren R, Blenckner T, Andersson A: Baltic Sea management:
Successes and failures. Ambio 2015, 44 Suppl 3:335–344.
12. Carstensen J, Andersen JH, Gustafsson BG, Conley DJ: Deoxygenation of
the Baltic Sea during the last century. Proc Natl Acad Sci U S A 2014,
111:5628–5633.
13. Rutgersson A, Jaagus J, Schenk F, Stendel M: Observed changes and
variability of atmospheric parameters in the Baltic Sea region during the
last 200 years. Clim Res 2014, 61:177–190.
14. Staley JT, Konopka A: Measurement of in Situ Activities of Nonphotosynthetic
Microorganisms in Aquatic and Terrestrial Habitats. Annu Rev Microbiol
1985, 39:321–346.
15. Lagier J-C, Edouard S, Pagnier I, Mediannikov O, Drancourt M, Raoult D:
Current and past strategies for bacterial culture in clinical microbiology.
Clin Microbiol Rev 2015, 28:208–236.
16. Stewart EJ: Growing unculturable bacteria. J Bacteriol 2012, 194:4151–
4160.
17. Iluz D, Dishon G, Capuzzo E, Meeder E, Astoreca R, Montecino V, Znachor P,
Ediger D, Marra J: Short-term variability in primary productivity during a
wind-driven diatom bloom in the Gulf of Eilat (Aqaba). Aquat Microb Ecol
2009, 56:205–215.
18. Gilbert JA, Steele JA, Caporaso JG, Steinbrück L, Reeder J, Temperton B,
Huse S, McHardy AC, Knight R, Joint I, Somerfield P, Fuhrman JA, Field D:
Defining seasonal marine microbial community dynamics. ISME J 2012,
6:298–308.
19. Zengler K, Toledo G, Rappe M, Elkins J, Mathur EJ, Short JM, Keller M:
Cultivating the uncultured. Proc Natl Acad Sci U S A 2002, 99:15681–15686.
20. Morris JJ, Johnson ZI, Szul MJ, Keller M, Zinser ER: Dependence of the
cyanobacterium Prochlorococcus on hydrogen peroxide scavenging
microbes for growth at the ocean’s surface. PLoS One 2011, 6:e16805.
21. Tanaka T, Kawasaki K, Daimon S, Kitagawa W, Yamamoto K, Tamaki H,
Tanaka M, Nakatsu CH, Kamagata Y: A hidden pitfall in the preparation of
agar media undermines microorganism cultivability. Appl Environ
Microbiol 2014, 80:7659–7666.
22. Nye KJ, Fallon D, Gee B, Messer S, Warren RE, Andrews N: A comparison
of blood agar supplemented with NAD with plain blood agar and
chocolated blood agar in the isolation of Streptococcus pneumoniae and
Haemophilus influenzae from sputum. Bacterial Methods Evaluation
Group. J Med Microbiol 1999, 48:1111–1114.
23. D’Onofrio A, Crawford JM, Stewart EJ, Witt K, Gavrish E, Epstein S, Clardy J,
Lewis K: Siderophores from neighboring organisms promote the growth
of uncultured bacteria. Chem Biol 2010, 17:254–264.
24. Aakra A, Utåker JB, Nes IF, Bakken LR: An evaluated improvement of the
extinction dilution method for isolation of ammonia-oxidizing bacteria. J
Microbiol Methods 1999, 39:23–31.
25. Rappé MS, Connon SA, Vergin KL, Giovannoni SJ: Cultivation of the
ubiquitous SAR11 marine bacterioplankton clade. Nature 2002, 418:630–
633.
26. Aoi Y, Kinoshita T, Hata T, Ohta H, Obokata H, Tsuneda S: Hollow-fiber
membrane chamber as a device for in situ environmental cultivation.
Appl Environ Microbiol 2009, 75:3826–3833.
27. Liu W, Kim HJ, Lucchetta EM, Du W, Ismagilov RF: Isolation, incubation,
and parallel functional testing and identification by FISH of rare
microbial single-copy cells from multi-species mixtures using the
combination of chemistrode and stochastic confinement. Lab Chip 2009,
9:2153–2162.
28. Nichols D, Cahoon N, Trakhtenberg EM, Pham L, Mehta A, Belanger A,
Kanigan T, Lewis K, Epstein SS: Use of ichip for high-throughput in situ
cultivation of “uncultivable” microbial species. Appl Environ Microbiol
2010, 76:2445–2450.
29. Sizova MV, Hohmann T, Hazen A, Paster BJ, Halem SR, Murphy CM, Panikov
NS, Epstein SS: New approaches for isolation of previously uncultivated
oral bacteria. Appl Environ Microbiol 2012, 78:194–203.
30. Kaeberlein T, Lewis K, Epstein SS: Isolating “uncultivable”
microorganisms in pure culture in a simulated natural environment.
Science 2002, 296:1127–1129.
31. Tanaka Y, Hanada S, Manome A, Tsuchida T, Kurane R, Nakamura K,
Kamagata Y: Catellibacterium nectariphilum gen. nov., sp. nov., which
requires a diffusible compound from a strain related to the genus
Sphingomonas for vigorous growth. Int J Syst Evol Microbiol 2004, 54(Pt
3):955–959.
32. Morris JJ, Kirkegaard R, Szul MJ, Johnson ZI, Zinser ER: Facilitation of
Robust Growth of Prochlorococcus Colonies and Dilute Liquid Cultures
by “Helper” Heterotrophic Bacteria. Appl Environ Microbiol 2008, 74:4530–
4534.
33. Coltharp C, Xiao J: Superresolution microscopy for microbiology. Cell
Microbiol 2012, 14:1808–1818.
34. Moreira D, López-García P: The molecular ecology of microbial
eukaryotes unveils a hidden world. Trends Microbiol 2002, 10:31–
35. Silva PC: Historical review of attempts to decrease subjectivity in
species identification, with particular regard to algae. Protist 2008,
159:153–161.
36. Woese CR, Fox GE: Phylogenetic structure of the prokaryotic domain:
the primary kingdoms. Proc Natl Acad Sci U S A 1977, 74:5088–5090.
37. Woese CR, Stackebrandt E, Macke TJ, Fox GE: A phylogenetic definition
of the major eubacterial taxa. Syst Appl Microbiol 1985, 6:143–151.
38. Woese CR: Bacterial evolution. Microbiol Rev 1987, 51:221–271.
39. Stahl DA, Lane DJ, Olsen GJ, Pace NR: Characterization of a Yellowstone
hot spring microbial community by 5S rRNA sequences. Appl Environ
Microbiol 1985, 49:1379–1384.
40. Pace NR, Stahl DA, Lane DJ, Olsen GJ: Analyzing natural microbial
populations by rRNA sequences. ASM News 1985, 51:4–12.
41. Muyzer G, de Waal EC, Uitterlinden AG: Profiling of complex microbial
populations by denaturing gradient gel electrophoresis analysis of
polymerase chain reaction-amplified genes coding for 16S rRNA. Appl
Environ Microbiol 1993, 59:695–700.
42. Liu WT, Marsh TL, Cheng H, Forney LJ: Characterization of microbial
diversity by determining terminal restriction fragment length
polymorphisms of genes encoding 16S rRNA. Appl Environ Microbiol 1997,
63:4516–4522.
43. Fisher MM, Triplett EW: Automated approach for ribosomal intergenic
spacer analysis of microbial diversity and its application to freshwater
bacterial communities. Appl Environ Microbiol 1999, 65:4630–4636.
44. Ehrenreich A: DNA microarray technology for the microbiologist: an
overview. Appl Microbiol Biotechnol 2006, 73:255–273.
45. Humbert JF, Quiblier C, Gugger M: Molecular approaches for monitoring
potentially toxic marine and freshwater phytoplankton species. Anal
Bioanal Chem 2010, 397:1723–1732.
46. Ricke SC, Khatiwara A, Kwon YM: Application of microarray analysis of
foodborne Salmonella in poultry production: a review. Poult Sci 2013,
92:2243–2250.
47. Zumla A, Al-Tawfiq JA, Enne VI, Kidd M, Drosten C, Breuer J, Muller MA, Hui
D, Maeurer M, Bates M, Mwaba P, Al-Hakeem R, Gray G, Gautret P, Al-Rabeeah
AA, Memish ZA, Gant V: Rapid point of care diagnostic tests for viral and
bacterial respiratory tract infections--needs, advances, and future
prospects. Lancet Infect Dis 2014, 14:1123–1135.
48. Lehner A, Loy A, Behr T, Gaenge H, Ludwig W, Wagner M, Schleifer K-H:
Oligonucleotide microarray for identification of Enterococcus species.
FEMS Microbiol Lett 2005, 246:133–142.
49. Singh DV, Mohapatra H: Application of DNA-based methods in typing
Vibrio cholerae strains. Future Microbiol 2008, 3:87–96.
50. Narihiro T, Sekiguchi Y: Oligonucleotide primers, probes and molecular
methods for the environmental monitoring of methanogenic archaea.
Microb Biotechnol 2011, 4:585–602.
51. Sogin ML, Morrison HG, Huber JA, Mark Welch D, Huse SM, Neal PR,
Arrieta JM, Herndl GJ: Microbial diversity in the deep sea and the
underexplored “rare biosphere.” Proc Natl Acad Sci U S A 2006, 103:12115–
12120.
52. Lynch MDJ, Neufeld JD: Ecology and exploration of the rare biosphere.
Nat Rev Microbiol 2015, 13:217–229.
53. Certini G, Campbell CD, Edwards AC: Rock fragments in soil support a
different microbial community from the fine earth. Soil Biol Biochem 2004,
36:1119–1128.
54. Walker AW, Martin JC, Scott P, Parkhill J, Flint HJ, Scott KP: 16S rRNA
gene-based profiling of the human infant gut microbiota is strongly
influenced by sample processing and PCR primer choice. Microbiome
2015, 3:440.
55. Gorzelak MA, Gill SK, Tasnim N, Ahmadi-Vand Z, Jay M, Gibson DL:
Methods for Improving Human Gut Microbiome Data by Reducing
Variability through Sample Processing and Storage of Stool. 2015.
56. Reck M, Tomasch J, Deng Z, Jarek M, Husemann P, Wagner-Döbler I: Stool
metatranscriptomics: A technical guideline for mRNA stabilisation and
isolation. BMC Genomics 2015, 16:804.
57. Weiss S, Amir A, Hyde ER, Metcalf JL, Song SJ, Knight R: Tracking down
the sources of experimental contamination in microbiome studies.
Genome Biol 2014, 15:1704.
58. Salter SJ, Cox MJ, Turek EM, Calus ST, Cookson WO, Moffatt MF, Turner P,
Parkhill J, Loman NJ, Walker AW: Reagent and laboratory contamination can
critically impact sequence-based microbiome analyses. BMC Biol 2014,
12:1–12.
59. Lim YW, Haynes M, Furlan M, Robertson CE, Harris JK, Rohwer F:
Purifying the Impure: Sequencing Metagenomes and
Metatranscriptomes from Complex Animal-associated Samples. J Vis Exp
2014.
60. Gong J, Dong J, Liu X, Massana R: Extremely High Copy Numbers and
Polymorphisms of the rDNA Operon Estimated from Single Cell Analysis
of Oligotrich and Peritrich Ciliates. Protist 2013, 164:369–379.
61. Jones SE, Lennon JT: Dormancy contributes to the maintenance of
microbial diversity. Proc Natl Acad Sci U S A 2010, 107:5881–5886.
62. Campbell BJ, Yu L, Heidelberg JF, Kirchman DL: Activity of abundant and
rare bacteria in a coastal ocean. Proc Natl Acad Sci U S A 2011, 108:12776–
12781.
63. Zhang Y, Zhao Z, Dai M, Jiao N, Herndl GJ: Drivers shaping the diversity
and biogeography of total and active bacterial communities in the South
China Sea. Mol Ecol 2014, 23:2260–2274.
64. Cram JA, Chow C-ET, Sachdeva R, Needham DM, Parada AE, Steele JA,
Fuhrman JA: Seasonal and interannual variability of the marine
bacterioplankton community throughout the water column over ten
years. ISME J 2015, 9:563–580.
65. Schirmer M, Ijaz UZ, D’Amore R, Hall N, Sloan WT, Quince C: Insight into
biases and sequencing errors for amplicon sequencing with the Illumina
MiSeq platform. Nucleic Acids Res 2015, 43:e37.
66. Buchan A, LeCleir GR, Gulvik CA, González JM: Master recyclers: features
and functions of bacteria associated with phytoplankton blooms. Nat Rev
Microbiol 2014, 12:686–698.
67. Caron DA, Worden AZ, Countway PD, Demir E, Heidelberg KB: Protists are
microbes too: a perspective. ISME J 2009, 3:4–12.
68. Zuo G, Xu Z, Hao B: Shigella strains are not clones of Escherichia coli
but sister species in the genus Escherichia. Genomics Proteomics
Bioinformatics 2013, 11:61–65.
69. Lindahl BD, Nilsson RH, Tedersoo L, Abarenkov K, Carlsen T, Kjøller R,
Kõljalg U, Pennanen T, Rosendahl S, Stenlid J, Kauserud H: Fungal community
analysis by high-throughput sequencing of amplified markers--a user’s
guide. New Phytol 2013, 199:288–299.
70. Edgar RC: UPARSE: highly accurate OTU sequences from microbial
amplicon reads. Nat Methods 2013, 10:996–998.
71. Schloss PD, Westcott SL, Ryabin T, Hall JR, Hartmann M, Hollister EB,
Lesniewski RA, Oakley BB, Parks DH, Robinson CJ, Sahl JW, Stres B, Thallinger
GG, Van Horn DJ, Weber CF: Introducing Mothur: Open-source, platform-independent,
community-supported software for describing and
comparing microbial communities. Appl Environ Microbiol 2009, 75:7537–
7541.
72. Caporaso JG, Kuczynski J, Stombaugh J, Bittinger K, Bushman FD, Costello
EK, Fierer N, Peña AG, Goodrich JK, Gordon JI, Huttley GA, Kelley ST, Knights D,
Koenig JE, Ley RE, Lozupone CA, McDonald D, Muegge BD, Pirrung M, Reeder J,
Sevinsky JR, Turnbaugh PJ, Walters WA, Widmann J, Yatsunenko T, Zaneveld J,
Knight R: QIIME allows analysis of high-throughput community
sequencing data. Nat Methods 2010, 7:335–336.
73. Edgar RC: Search and clustering orders of magnitude faster than
BLAST. Bioinformatics 2010, 26:2460–2461.
74. He Y, Caporaso JG, Jiang X-T, Sheng H-F, Huse SM, Rideout JR, Edgar RC,
Kopylova E, Walters WA, Knight R, Zhou H-W: Stability of operational
taxonomic units: an important but neglected property for analyzing
microbial diversity. Microbiome 2015, 3:20.
75. Schmidt TSB, Matias Rodrigues JF, von Mering C: Limits to robustness
and reproducibility in the demarcation of operational taxonomic units.
Environ Microbiol 2015, 17:1689–1706.
76. Gevers D, Cohan FM, Lawrence JG, Spratt BG, Coenye T, Feil EJ,
Stackebrandt E, Van de Peer Y, Vandamme P, Thompson FL, Swings J: Re-
evaluating prokaryotic species. Nat Rev Microbiol 2005, 3:733–739.
77. Schloss PD: The effects of alignment quality, distance calculation
method, sequence filtering, and region on the analysis of 16S rRNA
gene-based studies. PLoS Comput Biol 2010, 6:e1000844.
78. Fox GE, Wisotzkey JD, Jurtshuk P Jr: How close is close: 16S rRNA
sequence identity may not be sufficient to guarantee species identity. Int
J Syst Bacteriol 1992, 42:166–170.
79. Not F, del Campo J, Balagué V, de Vargas C, Massana R: New insights into
the diversity of marine picoeukaryotes. PLoS One 2009, 4:e7143.
80. Stoeck T, Bass D, Nebel M, Christen R, Jones MDM, Breiner H-W, Richards TA: Multiple
marker parallel tag environmental DNA sequencing reveals a highly
complex eukaryotic community in marine anoxic water. Mol Ecol 2010,
19(Sup. 1):21–31.
81. Ciccarelli FD, Doerks T, von Mering C, Creevey CJ, Snel B, Bork P: Toward
automatic reconstruction of a highly resolved tree of life. Science 2006,
311:1283–1287.
82. Koeppel AF, Wu M: Surprisingly extensive mixed phylogenetic and
ecological signals among bacterial Operational Taxonomic Units. Nucleic
Acids Res 2013, 41:5175–5188.
83. Schmidt TSB, Matias Rodrigues JF, von Mering C: Ecological consistency
of SSU rRNA-based operational taxonomic units at a global scale. PLoS
Comput Biol 2014, 10:e1003594.
84. Callahan BJ, McMurdie PJ, Rosen MJ, Han AW, Johnson AJ, Holmes SP:
DADA2: High resolution sample inference from amplicon data. bioRxiv
2015:024034.
85. Tikhonov M, Leach RW, Wingreen NS: Interpreting 16S metagenomic
data without clustering to achieve sub-OTU resolution. ISME J 2015, 9:68–
80.
86. Murat Eren A, Morrison HG, Lescault PJ, Reveillaud J, Vineis JH, Sogin ML:
Minimum entropy decomposition: Unsupervised oligotyping for sensitive
partitioning of high-throughput marker gene sequences. ISME J 2014,
9:968–979.
87. Mahé F, Rognes T, Quince C, de Vargas C, Dunthorn M: Swarm v2: highly-
scalable and high-resolution amplicon clustering. PeerJ 2015, 3:e1420.
88. Wang Q, Garrity GM, Tiedje JM, Cole JR: Naive Bayesian classifier for
rapid assignment of rRNA sequences into the new bacterial taxonomy.
Appl Environ Microbiol 2007, 73:5261–5267.
89. Pruesse E, Peplies J, Glöckner FO: SINA: accurate high-throughput
multiple sequence alignment of ribosomal RNA genes. Bioinformatics
2012, 28:1823–1829.
90. Guillou L, Bachar D, Audic S, Bass D, Berney C, Bittner L, Boutte C, Burgaud
G, de Vargas C, Decelle J, Del Campo J, Dolan JR, Dunthorn M, Edvardsen B,
Holzmann M, Kooistra WHCF, Lara E, Le Bescot N, Logares R, Mahé F, Massana
R, Montresor M, Morard R, Not F, Pawlowski J, Probert I, Sauvadet A-L, Siano R,
Stoeck T, Vaulot D, et al.: The Protist Ribosomal Reference database (PR2):
a catalog of unicellular eukaryote small sub-unit rRNA sequences with
curated taxonomy. Nucleic Acids Res 2013, 41(Database issue):D597–604.
91. Hu YOO, Karlson B, Charvet S, Andersson AF: Diversity of Pico- to
Mesoplankton Along the 2000 km Salinity Gradient of the Baltic Sea.
bioRxiv 2015:035485.
92. Matsen FA, Kodner RB, Armbrust EV: pplacer: linear time maximum-
likelihood and Bayesian phylogenetic placement of sequences onto a
fixed reference tree. BMC Bioinformatics 2010, 11:538.
93. Kuczynski J, Liu Z, Lozupone C, McDonald D, Fierer N, Knight R: Microbial
community resemblance methods differ in their ability to detect
biologically relevant patterns. Nat Methods 2010, 7:813–819.
94. Faust K, Raes J: Microbial interactions: from networks to models. Nat
Rev Microbiol 2012, 10:538–550.
95. Martiny JBH, Jones SE, Lennon JT, Martiny AC: Microbiomes in light of
traits: A phylogenetic perspective. Science 2015, 350:aac9323.
96. Lozupone CA, Hamady M, Kelley ST, Knight R: Quantitative and
qualitative beta diversity measures lead to different insights into factors
that structure microbial communities. Appl Environ Microbiol 2007,
73:1576–1585.
97. Chen J, Bittinger K, Charlson ES, Hoffmann C, Lewis J, Wu GD, Collman RG,
Bushman FD, Li H: Associating microbiome composition with
environmental covariates using generalized UniFrac distances.
Bioinformatics 2012, 28:2106–2113.
98. Paliy O, Shankar V: Application of multivariate statistical techniques in
microbial ecology. Mol Ecol 2016.
99. van den Brink PJ, den Besten PJ, bij de Vaate A, ter Braak CJF: Principal
response curves technique for the analysis of multivariate biomonitoring
time series. Environ Monit Assess 2009, 152:271–281.
100. Lima-Mendez G, Faust K, Henry N, Decelle J, Colin S, Carcillo F, Chaffron S,
Ignacio-Espinosa JC, Roux S, Vincent F, Bittner L, Darzi Y, Wang J, Audic S,
Berline L, Bontempi G, Cabello AM, Coppola L, Cornejo-Castillo FM, d’Ovidio F,
De Meester L, Ferrera I, Garet-Delmas M-J, Guidi L, Lara E, Pesant S, Royo-
Llonch M, Salazar G, Sánchez P, Sebastian M, et al.: Ocean plankton.
Determinants of community structure in the global plankton
interactome. Science 2015, 348:1262073.
101. Chaffron S, Rehrauer H, Pernthaler J, von Mering C: A global network of
coexisting microbes from environmental and whole-genome sequence
data. Genome Res 2010, 20:947–959.
102. Freilich S, Kreimer A, Meilijson I, Gophna U, Sharan R, Ruppin E: The
large-scale organization of the bacterial network of ecological co-
occurrence interactions. Nucleic Acids Res 2010, 38:3857–3868.
103. Shade A, Peter H, Allison SD, Baho D, Berga M, Buergmann H, Huber DH,
Langenheder S, Lennon JT, Martiny JBH, Matulich KL, Schmidt TM, Handelsman
J: Fundamentals of Microbial Community Resistance and Resilience.
Front Microbiol 2012, 3.
104. Ruan Q, Dutta D, Schwalbach MS, Steele JA, Fuhrman JA, Sun F: Local
similarity analysis reveals unique associations among marine
bacterioplankton species and environmental factors. Bioinformatics 2006,
22:2532–2538.
105. Steele JA, Countway PD, Xia L, Vigil PD, Beman JM, Kim DY, Chow C-ET,
Sachdeva R, Jones AC, Schwalbach MS, Rose JM, Hewson I, Patel A, Sun F,
Caron DA, Fuhrman JA: Marine bacterial, archaeal and protistan
association networks reveal ecological linkages. ISME J 2011, 5:1414–
1425.
106. Fuhrman JA, Hewson I, Schwalbach MS, Steele JA, Brown MV, Naeem S:
Annually reoccurring bacterial communities are predictable from ocean
conditions. Proc Natl Acad Sci U S A 2006, 103:13104–13109.
107. David LA, Materna AC, Friedman J, Campos-Baptista MI, Blackburn MC,
Perrotta A, Erdman SE, Alm EJ: Host lifestyle affects human microbiota on
daily timescales. Genome Biol 2014, 15:R89.
108. Kara EL, Hanson PC, Hu YH, Winslow L, McMahon KD: A decade of
seasonal dynamics and co-occurrences within freshwater
bacterioplankton communities from eutrophic Lake Mendota, WI, USA.
ISME J 2013, 7:680–684.
109. Needham DM, Chow C-ET, Cram JA, Sachdeva R, Parada A, Fuhrman JA:
Short-term observations of marine bacterial and viral communities:
patterns, connections and resilience. ISME J 2013, 7:1274–1285.
110. Vergin KL, Done B, Carlson CA, Giovannoni SJ: Spatiotemporal
distributions of rare bacterioplankton populations indicate adaptive
strategies in the oligotrophic ocean. Aquat Microb Ecol 2013, 71:1–13.
111. Lindh MV, Sjöstedt J, Andersson AF, Baltar F, Hugerth LW, Lundin D,
Muthusamy S, Legrand C, Pinhassi J: Disentangling seasonal
bacterioplankton population dynamics by high frequency sampling.
Environ Microbiol 2015:2459–2476.
112. Handelsman J, Rondon MR, Brady SF, Clardy J, Goodman RM: Molecular
biological access to the chemistry of unknown soil microbes: a new
frontier for natural products. Chem Biol 1998, 5:R245–9.
113. Tyson GW, Chapman J, Hugenholtz P, Allen EE, Ram RJ, Richardson PM,
Solovyev VV, Rubin EM, Rokhsar DS, Banfield JF: Community structure and
metabolism through reconstruction of microbial genomes from the
environment. Nature 2004, 428:37–43.
114. Venter JC, Remington K, Heidelberg JF, Halpern AL, Rusch D, Eisen JA, Wu
D, Paulsen I, Nelson KE, Nelson W, Fouts DE, Levy S, Knap AH, Lomas MW,
Nealson K, White O, Peterson J, Hoffman J, Parsons R, Baden-Tillson H,
Pfannkoch C, Rogers Y, Smith HO: Environmental Genome Shotgun
Sequencing of the Sargasso Sea. Science 2004, 304:66–74.
115. Kopylova E, Noé L, Touzet H: SortMeRNA: fast and accurate filtering of
ribosomal RNAs in metatranscriptomic data. Bioinformatics 2012, 28:3211–
3217.
116. Ekblom R, Wolf JBW: A field guide to whole-genome sequencing,
assembly and annotation. Evol Appl 2014, 7:1026–1042.
117. Luo C, Tsementzi D, Kyrpides NC, Konstantinidis KT: Individual genome
assembly from complex community short-read metagenomic datasets.
ISME J 2012, 6:898–901.
118. Luo C, Tsementzi D, Kyrpides N, Read T, Konstantinidis KT: Direct
comparisons of Illumina vs. Roche 454 sequencing technologies on the
same microbial community DNA sample. PLoS One 2012, 7:e30087.
119. Deng X, Naccache SN, Ng T, Federman S, Li L, Chiu CY, Delwart EL: An
ensemble strategy that significantly improves de novo assembly of
microbial genomes from metagenomic next-generation sequencing data.
Nucleic Acids Res 2015, 43:e46.
120. Hugerth L, Larsson J, Alneberg J, Lindh M, Legrand C, Pinhassi J,
Andersson A: Metagenome-assembled genomes uncover a global brackish
microbiome. Genome Biology 2015:279.
121. Peng Y, Leung HCM, Yiu SM, Chin FYL: IDBA-UD: a de novo assembler
for single-cell and metagenomic sequencing data with highly uneven
depth. Bioinformatics 2012, 28:1420–1428.
122. Li D, Liu C-M, Luo R, Sadakane K, Lam T-W: MEGAHIT: an ultra-fast
single-node solution for large and complex metagenomics assembly via
succinct de Bruijn graph. Bioinformatics 2015, 31:1674–1676.
123. Seemann T: Prokka: rapid prokaryotic genome annotation.
Bioinformatics 2014, 30:2068–2069.
124. Segata N, Börnigen D, Morgan XC, Huttenhower C: PhyloPhlAn is a new
method for improved phylogenetic and taxonomic placement of
microbes. Nat Commun 2013, 4:2304.
125. Darling AE, Jospin G, Lowe E, Matsen FA 4th, Bik HM, Eisen JA: PhyloSift:
phylogenetic analysis of genomes and metagenomes. PeerJ 2014, 2:e243.
126. Peabody MA, Van Rossum T, Lo R, Brinkman FSL: Evaluation of shotgun
metagenomics sequence classification methods using in silico and in
vitro simulated communities. BMC Bioinformatics 2015, 16:363.
127. Rusch DB, Halpern AL, Sutton G, Heidelberg KB, Williamson S, Yooseph S,
Wu D, Eisen JA, Hoffman JM, Remington K, Beeson K, Tran B, Smith H, Baden-
Tillson H, Stewart C, Thorpe J, Freeman J, Andrews-Pfannkoch C, Venter JE, Li
K, Kravitz S, Heidelberg JF, Utterback T, Rogers Y-H, Falcón LI, Souza V, Bonilla-
Rosso G, Eguiarte LE, Karl DM, Sathyendranath S, et al.: The Sorcerer II
Global Ocean Sampling expedition: northwest Atlantic through eastern
tropical Pacific. PLoS Biol 2007, 5:DOI: 10.1371/journal.pbio.0050077.
128. Yooseph S, Nealson KH, Rusch DB, McCrow JP, Dupont CL, Kim M, Johnson
J, Montgomery R, Ferriera S, Beeson K, Williamson SJ, Tovchigrechko A, Allen
AE, Zeigler LA, Sutton G, Eisenstadt E, Rogers Y, Friedman R, Frazier M, Venter
JC: Genomic and functional adaptation in surface ocean planktonic
prokaryotes. Nature 2010, 468:60–66.
129. Dupont CL, Larsson J, Yooseph S, Ininbergs K, Goll J, Asplund-Samuelsson J,
McCrow JP, Celepli N, Allen LZ, Ekman M, Lucas AJ, Hagström Å, Thiagarajan
M, Brindefalk B, Richter AR, Andersson AF, Tenney A, Lundin D, Tovchigrechko
A, Nylander J, Brami D, Badger JH, Allen AE, Rusch DB, Hoffman J, Norrby E,
Friedman R, Pinhassi J, Venter JC, Bergman B: Functional Tradeoffs Underpin
Salinity-Driven Divergence in Microbial Community Composition. PLoS
One 2014:DOI: 10.1371/journal.pone.0089549.
130. Larsson J, Celepli N, Ininbergs K, Dupont CL, Yooseph S, Bergman B,
Ekman M: Picocyanobacteria containing a novel pigment gene cluster
dominate the brackish water Baltic Sea. ISME J 2014, 8:1892–1903.
131. Brown MV, Lauro FM, DeMaere MZ, Muir L, Wilkins D, Thomas T, Riddle
MJ, Fuhrman JA, Andrews-Pfannkoch C, Hoffman JM, McQuaid JB, Allen A,
Rintoul SR, Cavicchioli R: Global biogeography of SAR11 marine bacteria.
Mol Syst Biol 2012, 8:595.
132. Abe T, Sugawara H, Kinouchi M, Kanaya S, Ikemura T: Novel
phylogenetic studies of genomic sequence fragments derived from
uncultured microbe mixtures in environmental and clinical samples.
DNA Res 2005, 12:281–290.
133. Chatterji S, Yamazaki I, Bai Z, Eisen JA: CompostBin: A DNA
Composition-Based Algorithm for Binning Environmental Shotgun
Reads. In Research in Computational Molecular Biology. Springer Berlin
Heidelberg; 2008:17–28. [Lecture Notes in Computer Science]
134. Dick GJ, Andersson AF, Baker BJ, Simmons SL, Thomas BC, Yelton AP,
Banfield JF: Community-wide analysis of microbial genome sequence
signatures. Genome Biol 2009, 10:R85.
135. Sharon I, Morowitz MJ, Thomas BC, Costello EK, Relman DA, Banfield JF:
Time series community genomics analysis reveals rapid shifts in
bacterial species, strains, and phage during infant gut colonization.
Genome Res 2012, 23:111–120.
136. Albertsen M, Hugenholtz P, Skarshewski A, Nielsen KL, Tyson GW, Nielsen
PH: Genome sequences of rare, uncultured bacteria obtained by
differential coverage binning of multiple metagenomes. Nat Biotechnol
2013, 31:533–538.
137. Brown CT, Hug LA, Thomas BC, Sharon I, Castelle CJ, Singh A, Wilkins MJ,
Wrighton KC, Williams KH, Banfield JF: Unusual biology across a group
comprising more than 15% of domain Bacteria. Nature 2015, 523:208–211.
138. Alneberg J, Bjarnason BS, de Bruijn I, Schirmer M, Quick J, Ijaz UZ, Lahti L,
Loman NJ, Andersson AF, Quince C: Binning metagenomic contigs by
coverage and composition. Nat Methods 2014, 11:1144–1146.
139. Imelfort M, Parks D, Woodcroft BJ, Dennis P, Hugenholtz P, Tyson GW:
GroopM: an automated tool for the recovery of population genomes from
related metagenomes. PeerJ 2014:e603.
140. Nielsen HB, Almeida M, Juncker AS, Rasmussen S, Li J, Sunagawa S,
Plichta DR, Gautier L, Pedersen AG, Le Chatelier E, Pelletier E, Bonde I, Nielsen
T, Manichanh C, Arumugam M, Batto J, dos Santos M, Blom N, Borruel N,
Burgdorf KS, Boumezbeur F, Casellas F, Doré J, Dworzynski P, Guarner F,
Hansen T, Hildebrand F, Kaas RS, Kennedy S, Kristiansen K, et al.:
Identification and assembly of genomes and genetic elements in complex
metagenomic samples without using reference genomes. Nat Biotechnol
2014, 32:822–828.
141. Cleary B, Brito IL, Huang K, Gevers D, Shea T, Young S, Alm EJ: Detection
of low-abundance bacterial strains in metagenomic datasets by
eigengenome partitioning. Nat Biotechnol 2015, 33:1053–1060.
142. Spang A, Saw JH, Jørgensen SL, Zaremba-Niedzwiedzka K, Martijn J, Lind
AE, van Eijk R, Schleper C, Guy L, Ettema TJG: Complex archaea that bridge
the gap between prokaryotes and eukaryotes. Nature 2015, 521:173–179.
143. Ghai R, Mizuno CM, Picazo A, Camacho A, Rodriguez-Valera F: Key roles
for freshwater Actinobacteria revealed by deep metagenomic
sequencing. Mol Ecol 2014, 23:6073–6090.
144. Mizuno CM, Rodriguez-Valera F, Ghai R: Genomes of planktonic
Acidimicrobiales: widening horizons for marine Actinobacteria by
metagenomics. MBio 2015, 6.
145. Luef B, Frischkorn KR, Wrighton KC, Holman H-YN, Birarda G, Thomas BC,
Singh A, Williams KH, Siegerist CE, Tringe SG, Downing KH, Comolli LR,
Banfield JF: Diverse uncultivated ultra-small bacterial cells in
groundwater. Nat Commun 2015, 6:6372.
146. Hug LA, Baker BJ, Anantharaman K, Brown CT, Probst AJ, Castelle CJ,
Butterfield CN, Hernsdorf AW, Amano Y, Ise K, Suzuki Y, Dudek N, Relman DA,
Finstad KM, Amundson R, Thomas BC, Banfield JF: A new view of the tree of
life. Nature Microbiology 2016:16048.
147. Logares R, Bråte J, Bertilsson S, Clasen JL, Shalchian-Tabrizi K, Rengefors
K: Infrequent marine–freshwater transitions in the microbial world.
Trends Microbiol 2009, 17:414–422.
148. Herlemann DPR, Labrenz M, Jürgens K, Bertilsson S, Waniek JJ, Andersson
AF: Transitions in bacterial communities along the 2000 km salinity
gradient of the Baltic Sea. ISME J 2011, 5:1571–1579.
149. Mehrshad M, Amoozegar MA, Ghai R, Shahzadeh Fazeli SA, Rodriguez-
Valera F: Genome reconstruction from metagenomic datasets reveals
novel microbes in the brackish waters of the Caspian Sea. Appl Environ
Microbiol 2016.
x. Acknowledgements
First of all, my most heartfelt thanks to Anders, who has been a
wonderful supervisor ever since my master’s. I really wasn’t planning on
staying more than 4 months under your wings, but 5 years later, here I
am. I stayed on because of you. You never lost your temper or
discouraged me even when I was losing patience with myself for all
those silly mistakes. I definitely had no interest in the Baltic Sea before
starting our projects together. Sometimes I claim I still don’t, but that’s
probably a lie.
A large round of thank-yous to everyone that has passed through
the Environmental Genomics lab these past 5 years. To Daniel who was
around to answer my first silly questions about Perl and is still around
for equally silly questions about phylogeny placement. To Ino, who was
so much fun to work with and gave whole new levels of meaning to the
word “Riiiiiiight”. To Johannes, who has come to my rescue so many
times with coke and code. To John, who rescued the Baltic Genomes
when we needed him, and ruined Hermann’s for me. To Yue, who made
sure I wasn’t all alone at the lab bench anymore, and who’s been a great
travel companion. To Hugo, who has been an endless source of
collaboration and friendship. To Nelson, who’s always repaid the
simplest of favours with limitless kindness. To Conny, for never
forgetting that protists are microbes too! And to Jürg, Olov and Kajsa,
for good advice, fika and the odd bit of office gossip.
Another important round of thank-yous goes to the Kalmar gang.
Jarone, my reluctant co-supervisor from whom I learn so much each
time we meet. To Markus for the dedicated work in the LMO time-
series and for your endless emails pushing my R-skills a bit further, and
for the tweets that always make me laugh. To Carina for taking over the
time-series and always keeping me in the loop, and also for the great
times we’ve had at conferences. You’ve also convinced certain people to
attend SAME in Uppsala, and for this I’ll always be deeply grateful. And
to Åke, who claims to be retired but is always around to discuss science
in his soothing tone and make everyone see the bigger picture.
For the friends and colleagues in Alfa 3, Linda, Elin, Amanda,
Pelin, Lumi, Jimmie, Guille, Phil, Erik, Mickan, Fredrik, Sanja,
Anders, Britta, Stefania, Kostas, Kim, Maja, Annelie, Nemo, Ema,
Simon, Carlos, Francesco, Mau, you are all too many to mention, but
you’ve been great company and a pleasure to share a lab with. Extra
special thanks to Kicki, who not only keeps us all on our toes to make
sure the lab works, but also found time, time and again, to train me in new
protocols when I needed it. Thank you, Peter, for the last minute help!
And thanks to Valle, who could always be drawn to the lunch room with
a simple “I need to talk math” or “I need to whine”, and is still around
via email to do the same.
For the Alfa 6 crowd, Ani, Walter, Oxana, Per, Axel, Özge.
Thanks for the cake and the music, the political clashes and the history
discussions.
Arne, Erik Lindahl, Afshin, Lars Arvestad, Thijs Ettema: you
have all at some point said something which ended up in my
words_to_never_forget.txt. The fact that you yourselves probably don’t
even know what it was just makes it all the more meaningful.
Tere, Hari, Jojo, Mille, Sus, Johanna: you took the long haul.
Thanks for the tjejfika, the MF parties, the birthdays, the house-
warmings, the football, the movie nights and all those unicorns! I'm
moving away, but don't think I'm leaving you.
To PromenadorQuestern och med Baletten Paletten for
being a constant reminder that happiness is an active choice. JAAAAA!
Get psyched! You are all so lovely, and I don't have room for so many names,
but everyone can have a hug!
And finally, thanks to my parents, my sister, my grandparents, my
priminhos and my migas, who’ve kept me company from a distance and
made sure I always knew I came from somewhere and had somewhere
to return to.