
Eisen Lecture for Ian Korf genomics course

Jan 29, 2018


Jonathan Eisen
Transcript
Page 1: Eisen Lecture for Ian Korf genomics course

DNA based Studies of Microbial Diversity

Jonathan A. Eisen

University of California, Davis


Monday, January 28, 13

Page 2: Eisen Lecture for Ian Korf genomics course

Sequencing and Microbes

• Four major “Eras” in the use of sequencing for microbial diversity studies

• Each area represented by the Eras is being revolutionized by new sequencing methods


Page 3: Eisen Lecture for Ian Korf genomics course

Era I: rRNA Tree of Life


Page 4: Eisen Lecture for Ian Korf genomics course


Ernst Haeckel 1866

www.mblwhoilibrary.org

Plantae, Protista, Animalia


Page 6: Eisen Lecture for Ian Korf genomics course

Woese


Page 10: Eisen Lecture for Ian Korf genomics course

Woese and Fox

• Abstract: A phylogenetic analysis based upon ribosomal RNA sequence characterization reveals that living systems represent one of three aboriginal lines of descent: (i) the eubacteria, comprising all typical bacteria; (ii) the archaebacteria, containing methanogenic bacteria; and (iii) the urkaryotes, now represented in the cytoplasmic component of eukaryotic cells.


Page 11: Eisen Lecture for Ian Korf genomics course

Woese and Fox

• Propose “three aboriginal lines of descent”: Eubacteria, Archaebacteria, Urkaryotes

Page 12: Eisen Lecture for Ian Korf genomics course

Woese 1987 - rRNA

Microbiological Reviews 51:221

Woese

Page 13: Eisen Lecture for Ian Korf genomics course

• Appearance of microbes not informative (enough)

• rRNA Tree of Life identified two major groups of organisms w/o nuclei

• rRNA powerful for many reasons, though not perfect


Barton, Eisen et al. “Evolution”, CSHL Press. 2007.

Based on tree from Pace 1997 Science 276:734-740


Page 14: Eisen Lecture for Ian Korf genomics course

Tree of Life

• Three main kinds of organisms: Bacteria, Archaea, Eukaryotes

• Viruses not alive, but some call them microbes

• Many misclassifications occurred before the use of molecular methods


Page 15: Eisen Lecture for Ian Korf genomics course

The Tree of Life 2006

adapted from Baldauf, et al., in Assembling the Tree of Life, 2004

Page 17: Eisen Lecture for Ian Korf genomics course

Era II: rRNA in the Environment


Page 18: Eisen Lecture for Ian Korf genomics course

Plant/Animal Field Studies


Page 25: Eisen Lecture for Ian Korf genomics course

Microbial Field Studies


Page 32: Eisen Lecture for Ian Korf genomics course

Culturing Microbes


Page 33: Eisen Lecture for Ian Korf genomics course

Great Plate Count Anomaly

Culturing count <<<< Microscopy count

Problem because appearance not effective for “who is out there?” or “what are they doing?”

Solution? DNA

Page 40: Eisen Lecture for Ian Korf genomics course

Collect from environment

Analysis of uncultured microbes


Page 41: Eisen Lecture for Ian Korf genomics course

PCR and phylogenetic analysis of rRNA genes

DNA extraction → PCR → Sequence rRNA genes → Sequence alignment = Data matrix → Phylogenetic tree

PCR: Makes lots of copies of the rRNA genes in sample

rRNA1 5’...TACAGTATAGGTGGAGCTAGCGATCGATCGA... 3’

[Figure: data matrix and phylogenetic tree with E. coli, Humans, Yeast and rRNA1]

Page 42: Eisen Lecture for Ian Korf genomics course

PCR and phylogenetic analysis of rRNA genes

DNA extraction → PCR → Sequence rRNA genes → Sequence alignment = Data matrix → Phylogenetic tree

PCR: Makes lots of copies of the rRNA genes in sample

rRNA1 5’...ACACACATAGGTGGAGCTAGCGATCGATCGA... 3’

rRNA2 5’...TACAGTATAGGTGGAGCTAGCGATCGATCGA... 3’

Yeast T A C A G T

[Figure: data matrix and phylogenetic tree with E. coli, Humans, Yeast, rRNA1 and rRNA2]

Page 43: Eisen Lecture for Ian Korf genomics course

PCR and phylogenetic analysis of rRNA genes

DNA extraction → PCR → Sequence rRNA genes → Sequence alignment = Data matrix → Phylogenetic tree

PCR: Makes lots of copies of the rRNA genes in sample

rRNA1 5’...ACACACATAGGTGGAGCTAGCGATCGATCGA... 3’

rRNA2 5’...TACAGTATAGGTGGAGCTAGCGACGATCGA... 3’

rRNA3 5’...ACGGCAAAATAGGTGGATTCTAGCGATATAGA... 3’

rRNA4 5’...ACGGCCCGATAGGTGGATTCTAGCGCCATAGA... 3’

rRNA3 C A C T G T

rRNA4 C A C A G T

Yeast T A C A G T

[Figure: data matrix and phylogenetic tree with E. coli, Humans, Yeast and rRNA1 to rRNA4]
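The "sequence alignment = data matrix" step on this slide can be sketched in a few lines of Python. This is a minimal illustration only: it assumes the three aligned rows that are legible on the slide (rRNA3, rRNA4, Yeast) and uses a plain count of mismatched columns as the distance. The function names are hypothetical, and real analyses work from full-length alignments with proper tree-building methods rather than this shortcut.

```python
from itertools import combinations

# Aligned rows copied from the slide's data matrix (six columns each).
alignment = {
    "rRNA3": "CACTGT",
    "rRNA4": "CACAGT",
    "Yeast": "TACAGT",
}


def mismatch_count(a: str, b: str) -> int:
    """Number of alignment columns at which two rows differ."""
    if len(a) != len(b):
        raise ValueError("aligned rows must have equal length")
    return sum(x != y for x, y in zip(a, b))


def distance_matrix(rows: dict) -> dict:
    """Pairwise mismatch counts for every pair of rows, keyed by name pair."""
    return {
        (p, q): mismatch_count(rows[p], rows[q])
        for p, q in combinations(sorted(rows), 2)
    }


def closest_row(query: str, rows: dict) -> str:
    """Name of the row with the fewest mismatches to the query row."""
    others = [name for name in rows if name != query]
    return min(others, key=lambda name: mismatch_count(rows[query], rows[name]))


if __name__ == "__main__":
    for pair, d in distance_matrix(alignment).items():
        print(pair, d)
    print("closest to rRNA3:", closest_row("rRNA3", alignment))
```

A distance table like this is the simplest input from which a tree can be built; an environmental sequence is then placed near whichever known sequence it differs from least.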

Page 44: Eisen Lecture for Ian Korf genomics course

PCR

PCR and phylogenetic analysis of rRNA genes

Page 45: Eisen Lecture for Ian Korf genomics course

Major phyla of bacteria & archaea (as of 2002)

No cultures

Some cultures

Page 46: Eisen Lecture for Ian Korf genomics course

The Hidden Majority

Richness estimates

Bohannan and Hughes 2003; Hugenholtz 2002

Page 47: Eisen Lecture for Ian Korf genomics course

Human microbiome example


Page 48: Eisen Lecture for Ian Korf genomics course

Censored

Censored

A: Human biogeography


Page 49: Eisen Lecture for Ian Korf genomics course

A: Human biogeography


Page 50: Eisen Lecture for Ian Korf genomics course

A: Human biogeography

Fig. S13

Body sites: Glans penis, Hair, Labia minora, Axilla (L, R), Ext. auditory canal (L, R), Volar forearm (L, R), Palmar index finger (L, R), Popliteal fossa (L, R), Naris (L, R), Plantar foot (L, R), Lat. pinna (L, R), Palm (L, R), Oral cavity, Umbilicus, External nose, Gut, Forehead, Dorsal tongue

Taxa legend: Acinetobacter, Actinomycetales, Actinomycineae, Alistipes, Anaerococcus, Bacteroidales, Bacteroides, Bifidobacteriales, Branhamella, Campylobacter, Capnocytophaga, Carnobacteriaceae 1, Carnobacteriaceae 2, Clostridiales, Coriobacterineae, Corynebacterineae, Faecalibacterium, Finegoldia, Fusobacterium, Gemella, Lachnospiraceae, Lachnospiraceae (inc. sed.), Lactobacillus, Leptotrichia, Micrococcineae, Neisseria, Oribacterium, Parabacteroides, Pasteurella, Pasteurellaceae, Peptoniphilus, Prevotella, Prevotellaceae, Propionibacterineae, Ruminococcaceae, Staphylococcus, Streptococcus, Veillonella, Other

Page 51: Eisen Lecture for Ian Korf genomics course

Vertebrate Microbiomes

Worlds within worlds: evolution of the vertebrate gut microbiota

Ruth E. Ley*‡¶, Catherine A. Lozupone*§¶, Micah Hamady||, Rob Knight§ and Jeffrey I. Gordon*

Abstract | In this Analysis we use published 16S ribosomal RNA gene sequences to compare the bacterial assemblages that are associated with humans and other mammals, metazoa and free-living microbial communities that span a range of environments. The composition of the vertebrate gut microbiota is influenced by diet, host morphology and phylogeny, and in this respect the human gut bacterial community is typical of an omnivorous primate. However, the vertebrate gut microbiota is different from free-living communities that are not associated with animal body habitats. We propose that the recently initiated international Human Microbiome Project should strive to include a broad representation of humans, as well as other mammalian and environmental samples, as comparative analyses of microbiotas and their microbiomes are a powerful way to explore the evolutionary history of the biosphere.

Diverse microorganisms and microbial communities are a feature of modern life on the Earth, and have probably been necessary for the evolution of life as we know it1.

Microorganisms formed spatially organized communities as early as 3.25 billion years ago, when some left their mark in the fossil record2. Today, microbial life is found in diverse communities all over the biosphere. The high level of novelty that is necessary for microorganisms to develop a diversity of cell lineages and inhabit a vast range of habitats probably required that whole communities exchange innovations1. Comparative studies of microbial communities are starting to reveal which environmental features, such as biogeography, salinity or redox potential, have important effects on the organization of microbial diversity3–6. These types of analyses are now being extended to the microbial communities that populate a globally ubiquitous but ephemeral habitat: the body surfaces of animals, including those of humans.

Multicellular eukaryotes have existed for at least one-quarter of the Earth’s history, or 1.2 billion years7. Thus, an already long history of interaction between multicellular life-forms and microbial communities preceded, and probably shaped, the evolution of vertebrates. The legacy of ancient associations between hosts and their epibiotic microbial communities is evident in the present-day effects that the gut microbiota exerts on host biology, which range from the structure and functions of the gut and the innate and adaptive immune systems, to host energy metabolism8–11. Host responses to microbial colonization are evolutionarily conserved among diverse vertebrates, including zebrafish, mice and humans12. The underlying factors that dictate our interactions with our microbial partners therefore provide some of the foundations of our Homo sapiens genome.

If microbial communities are, and have always been, so intricately associated with their vertebrate hosts, then how specialized are body-associated microbial lineages to vertebrates and how distinct are they from those that populate the non-living environments of the biosphere? In this Analysis, we place our human gut microbiota in the context of many other diverse microbiotas, from our close relatives the primates, to more distantly related mammals, other metazoans and ‘free-living’ microbial communities. This evolutionary ecology perspective helps put the recently initiated international Human Microbiome Project (see Further information)13 in the context of the biosphere within which humans and their microorganisms have evolved.

Diet and the evolution of modern humans

Food is central to the evolution of H. sapiens. During the first half of the evolution of our lineage, Australopithecus species split from prehistoric apes and persisted from ~4.4 Mya (million years ago) until ~2.5 Mya14. This early split has been associated with a dietary shift to seeds and soft fruits, based on comparisons of australopithecine

Microbiota: The complete set of microbial lineages that live in a particular environment.

*Center for Genome Sciences, Washington University School of Medicine, St Louis, Missouri 63108, USA. ‡Department of Microbiology, Cornell University, Ithaca, New York 14850, USA. §Department of Chemistry and Biochemistry, University of Colorado, Boulder, Colorado 80309, USA. ||Department of Computer Science, University of Colorado, Boulder, Colorado 80309, USA. ¶These authors contributed equally to this work. Correspondence to J.I.G. e-mail: [email protected]

776 | OCTOBER 2008 | VOLUME 6 www.nature.com/reviews/micro

ANALYSIS

Genera that cross the divide. Another way to visualize the vertebrate gut–environment dichotomy is by using a network diagram that displays, in addition to the clustering of hosts with similar microbiotas, the bacterial genera they share. In this representation of the data, the vertebrate gut samples are more connected to one another than to the environmental samples (FIG. 4a,b). As in the UniFrac-based analysis, the non-gut human samples also occupy an intermediate position between the free-living and the gut communities. FIGURE 5 shows the phylogenetic classification of operational taxonomic units (OTUs) that are shared between samples: among humans, an overwhelming number of these are from the Firmicutes, with a smaller number from the Bacteroidetes. By contrast, the free-living communities share OTUs from a wider range of phyla. Samples from the guts of obese humans cluster away from the samples of healthy subjects, and most of their shared OTUs are found in the Firmicutes. This observation is consistent with the finding that samples from obese individuals have a higher number of OTUs from Firmicutes than samples from lean subjects31.

Bacterial genera that inhabit both the vertebrate gut-associated microbiotas and the free-living communities can be considered to be cosmopolitan. As the analyses discussed above mainly determine the dominant members of a microbiota, these genera are presumed to grow and subsist in the gut environment (autochthonous members) rather than simply passing through as transient members of the gut microbial community (allochthonous members). Among these cosmopolitan groups was the Pseudomonadaceae family of the gammaproteobacteria class. This family contained OTUs from both the vertebrate gut and free-living communities in saline and non-saline habitats. Members of the Enterobacteriales order (also from the gammaproteobacteria) were detected in the vertebrate gut, termite gut and other invertebrates, as well as in a surface soil sample and anoxic saline water. Staphylococcaceae family members (from the phylum Firmicutes and class Bacilli) were common in the vertebrate gut samples, but were also detected in soil and cultures derived from freshwater and saline habitats. Finally, members of the Fusobacterium genus were detected in salt-water sediments, in addition to the vertebrate gut. The cosmopolitan distribution of these organisms might have made them particularly important for introducing novel functions during evolution of the gut microbiota, as they could bring new useful genes from the global microbiome into the gut microbiome through horizontal gene transfer. However, it should be noted that some OTUs that are common in humans

[Figure 3 plot, Nature Reviews | Microbiology. Y axis: 16S ribosomal RNA sequences (%), 0 to 100. Legend: Bacteroidetes (red), Firmicutes (blue). Sample categories: Vertebrate gut; Termite gut; Salt-water surface; Salt water; Subsurface, anoxic or sediment; Other human; Non-saline cultured; Insects or earthworms; Soils or freshwater sediments; Mixed water]

Figure 3 | Relative abundance of phyla in samples. Bar graph showing the proportion of sequences from each sample that could be classified at the phylum level. The colour codes for the dominant Firmicutes and Bacteroidetes phyla are shown. For a complete description of the colour codes see Supplementary information S2 (figure). ‘Other humans’ refers to body habitats other than the gut; for example, the mouth, ear, skin, vagina and vulva (see Supplementary information S1 (table)).

Figure 4 | Network analysis of bacterial communities from animal-associated and free-living communities. The panel on the left includes a schematic key that illustrates features of the network analysis and genera keys for panels a and b. Labels are sample nodes. Rounded squares represent operational taxonomic units (OTUs) shared by two or more samples (shown in grey in panels a and b), whereas diamonds represent the set of OTUs that are unique to a sample. Network diagrams are colour coded according to habitat.
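The distinction the Figure 4 caption draws between OTUs shared by two or more samples and OTUs unique to one sample can be sketched with plain set operations. This is an illustrative sketch only: the sample names and OTU identifiers below are hypothetical, not data from the paper, and the function name is an assumption rather than part of any published tool.

```python
from collections import Counter

# Hypothetical samples and OTU identifiers, for illustration only.
samples = {
    "human_gut": {"otu1", "otu2", "otu3"},
    "mouse_gut": {"otu2", "otu3", "otu4"},
    "soil": {"otu5", "otu6"},
}


def split_shared_unique(sample_otus: dict) -> tuple:
    """Return (shared, unique).

    shared: OTUs present in two or more samples (the caption's rounded squares).
    unique: per-sample set of OTUs found in that sample only (the diamonds).
    """
    counts = Counter(otu for otus in sample_otus.values() for otu in otus)
    shared = {otu for otu, n in counts.items() if n >= 2}
    unique = {
        name: {otu for otu in otus if counts[otu] == 1}
        for name, otus in sample_otus.items()
    }
    return shared, unique


if __name__ == "__main__":
    shared, unique = split_shared_unique(samples)
    print("shared:", sorted(shared))
    for name, otus in unique.items():
        print(name, "unique:", sorted(otus))
```

In a network drawing, each shared OTU would become a node linked to every sample that contains it, which is why samples with many OTUs in common end up clustered together.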

782 | OCTOBER 2008 | VOLUME 6 www.nature.com/reviews/micro

Page 52: Eisen Lecture for Ian Korf genomics course


Page 53: Eisen Lecture for Ian Korf genomics course

The Built Environment

ORIGINAL ARTICLE

Architectural design influences the diversity and structure of the built environment microbiome

Steven W Kembel1, Evan Jones1, Jeff Kline1,2, Dale Northcutt1,2, Jason Stenson1,2, Ann M Womack1, Brendan JM Bohannan1, G Z Brown1,2 and Jessica L Green1,3

1Biology and the Built Environment Center, Institute of Ecology and Evolution, Department of Biology, University of Oregon, Eugene, OR, USA; 2Energy Studies in Buildings Laboratory, Department of Architecture, University of Oregon, Eugene, OR, USA and 3Santa Fe Institute, Santa Fe, NM, USA

Buildings are complex ecosystems that house trillions of microorganisms interacting with each other, with humans and with their environment. Understanding the ecological and evolutionary processes that determine the diversity and composition of the built environment microbiome, the community of microorganisms that live indoors, is important for understanding the relationship between building design, biodiversity and human health. In this study, we used high-throughput sequencing of the bacterial 16S rRNA gene to quantify relationships between building attributes and airborne bacterial communities at a health-care facility. We quantified airborne bacterial community structure and environmental conditions in patient rooms exposed to mechanical or window ventilation and in outdoor air. The phylogenetic diversity of airborne bacterial communities was lower indoors than outdoors, and mechanically ventilated rooms contained less diverse microbial communities than did window-ventilated rooms. Bacterial communities in indoor environments contained many taxa that are absent or rare outdoors, including taxa closely related to potential human pathogens. Building attributes, specifically the source of ventilation air, airflow rates, relative humidity and temperature, were correlated with the diversity and composition of indoor bacterial communities. The relative abundance of bacteria closely related to human pathogens was higher indoors than outdoors, and higher in rooms with lower airflow rates and lower relative humidity. The observed relationship between building design and airborne bacterial diversity suggests that we can manage indoor environments, altering through building design and operation the community of microbial species that potentially colonize the human microbiome during our time indoors.

The ISME Journal advance online publication, 26 January 2012; doi:10.1038/ismej.2011.211

Subject Category: microbial population and community ecology

Keywords: aeromicrobiology; bacteria; built environment microbiome; community ecology; dispersal; environmental filtering

Introduction

Humans spend up to 90% of their lives indoors (Klepeis et al., 2001). Consequently, the way we design and operate the indoor environment has a profound impact on our health (Guenther and Vittori, 2008). One step toward better understanding of how building design impacts human health is to study buildings as ecosystems. Built environments are complex ecosystems that contain numerous organisms including trillions of microorganisms (Rintala et al., 2008; Tringe et al., 2008; Amend et al., 2010). The collection of microbial life that exists indoors, the built environment microbiome, includes human pathogens and commensals interacting with each other and with their environment (Eames et al., 2009). There have been few attempts to comprehensively survey the built environment microbiome (Rintala et al., 2008; Tringe et al., 2008; Amend et al., 2010), with most studies focused on measures of total bioaerosol concentrations or the abundance of culturable or pathogenic strains (Berglund et al., 1992; Toivola et al., 2002; Mentese et al., 2009), rather than a more comprehensive measure of microbial diversity in indoor spaces. For this reason, the factors that determine the diversity and composition of the built environment microbiome are poorly understood. However, the situation is changing. The development of culture-independent, high-throughput molecular sequencing approaches has transformed the study of microbial diversity in a variety of environments, as demonstrated by the recent explosion of research on the microbial ecology of aquatic and terrestrial ecosystems (Nemergut et al., 2011)

Received 23 October 2011; revised 13 December 2011; accepted 13 December 2011

Correspondence: SW Kembel, Biology and the Built Environment Center, Institute of Ecology and Evolution, Department of Biology, University of Oregon, Eugene, OR 97405, USA. E-mail: [email protected]

The ISME Journal (2012), 1–11. © 2012 International Society for Microbial Ecology. All rights reserved 1751-7362/12

www.nature.com/ismej

Microbial Biogeography of Public Restroom Surfaces

Gilberto E. Flores1, Scott T. Bates1, Dan Knights2, Christian L. Lauber1, Jesse Stombaugh3, Rob Knight3,4, Noah Fierer1,5*

1 Cooperative Institute for Research in Environmental Science, University of Colorado, Boulder, Colorado, United States of America, 2 Department of Computer Science, University of Colorado, Boulder, Colorado, United States of America, 3 Department of Chemistry and Biochemistry, University of Colorado, Boulder, Colorado, United States of America, 4 Howard Hughes Medical Institute, University of Colorado, Boulder, Colorado, United States of America, 5 Department of Ecology and Evolutionary Biology, University of Colorado, Boulder, Colorado, United States of America

Abstract

We spend the majority of our lives indoors where we are constantly exposed to bacteria residing on surfaces. However, the diversity of these surface-associated communities is largely unknown. We explored the biogeographical patterns exhibited by bacteria across ten surfaces within each of twelve public restrooms. Using high-throughput barcoded pyrosequencing of the 16S rRNA gene, we identified 19 bacterial phyla across all surfaces. Most sequences belonged to four phyla: Actinobacteria, Bacteroidetes, Firmicutes and Proteobacteria. The communities clustered into three general categories: those found on surfaces associated with toilets, those on the restroom floor, and those found on surfaces routinely touched with hands. On toilet surfaces, gut-associated taxa were more prevalent, suggesting fecal contamination of these surfaces. Floor surfaces were the most diverse of all communities and contained several taxa commonly found in soils. Skin-associated bacteria, especially the Propionibacteriaceae, dominated surfaces routinely touched with our hands. Certain taxa were more common in female than in male restrooms as vagina-associated Lactobacillaceae were widely distributed in female restrooms, likely from urine contamination. Use of the SourceTracker algorithm confirmed many of our taxonomic observations as human skin was the primary source of bacteria on restroom surfaces. Overall, these results demonstrate that restroom surfaces host relatively diverse microbial communities dominated by human-associated bacteria with clear linkages between communities on or in different body sites and those communities found on restroom surfaces. More generally, this work is relevant to the public health field as we show that human-associated microbes are commonly found on restroom surfaces suggesting that bacterial pathogens could readily be transmitted between individuals by the touching of surfaces. Furthermore, we demonstrate that we can use high-throughput analyses of bacterial communities to determine sources of bacteria on indoor surfaces, an approach which could be used to track pathogen transmission and test the efficacy of hygiene practices.

Citation: Flores GE, Bates ST, Knights D, Lauber CL, Stombaugh J, et al. (2011) Microbial Biogeography of Public Restroom Surfaces. PLoS ONE 6(11): e28132. doi:10.1371/journal.pone.0028132

Editor: Mark R. Liles, Auburn University, United States of America

Received September 12, 2011; Accepted November 1, 2011; Published November 23, 2011

Copyright: © 2011 Flores et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Funding: This work was supported with funding from the Alfred P. Sloan Foundation and their Indoor Environment program, and in part by the National Institutes of Health and the Howard Hughes Medical Institute. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing Interests: The authors have declared that no competing interests exist.

* E-mail: [email protected]

Introduction

More than ever, individuals across the globe spend a large portion of their lives indoors, yet relatively little is known about the microbial diversity of indoor environments. Of the studies that have examined microorganisms associated with indoor environments, most have relied upon cultivation-based techniques to detect organisms residing on a variety of household surfaces [1–5]. Not surprisingly, these studies have identified surfaces in kitchens and restrooms as being hot spots of bacterial contamination. Because several pathogenic bacteria are known to survive on surfaces for extended periods of time [6–8], these studies are of obvious importance in preventing the spread of human disease. However, it is now widely recognized that the majority of microorganisms cannot be readily cultivated [9] and thus, the overall diversity of microorganisms associated with indoor environments remains largely unknown. Recent use of cultivation-independent techniques based on cloning and sequencing of the 16S rRNA gene have helped to better describe these communities and revealed a greater diversity of bacteria on indoor surfaces than captured using cultivation-based techniques [10–13]. Most of the organisms identified in these studies are related to human commensals suggesting that the organisms are not actively growing on the surfaces but rather were deposited directly (i.e. touching) or indirectly (e.g. shedding of skin cells) by humans. Despite these efforts, we still have an incomplete understanding of bacterial communities associated with indoor environments because limitations of traditional 16S rRNA gene cloning and sequencing techniques have made replicate sampling and in-depth characterizations of the communities prohibitive. With the advent of high-throughput sequencing techniques, we can now investigate indoor microbial communities at an unprecedented depth and begin to understand the relationship between humans, microbes and the built environment.

In order to begin to comprehensively describe the microbial diversity of indoor environments, we characterized the bacterial communities found on ten surfaces in twelve public restrooms (six male and six female) in Colorado, USA using barcoded

PLoS ONE | www.plosone.org 1 November 2011 | Volume 6 | Issue 11 | e28132

the stall in), they were likely dispersed manually after women used the toilet. Coupling these observations with those of the distribution of gut-associated bacteria indicate that routine use of toilets results in the dispersal of urine- and fecal-associated bacteria throughout the restroom. While these results are not unexpected, they do highlight the importance of hand-hygiene when using public restrooms since these surfaces could also be potential vehicles for the transmission of human pathogens. Unfortunately, previous studies have documented that college students (who are likely the most frequent users of the studied restrooms) are not always the most diligent of hand-washers [42,43].

Results of SourceTracker analysis support the taxonomic patterns highlighted above, indicating that human skin was the primary source of bacteria on all public restroom surfaces examined, while the human gut was an important source on or around the toilet, and urine was an important source in women’s restrooms (Figure 4, Table S4). Contrary to expectations (see above), soil was not identified by the SourceTracker algorithm as being a major source of bacteria on any of the surfaces, including floors (Figure 4). Although the floor samples contained family-level taxa that are common in soil, the SourceTracker algorithm probably underestimates the relative importance of sources, like

Figure 3. Cartoon illustrations of the relative abundance of discriminating taxa on public restroom surfaces. Light blue indicates lowabundance while dark blue indicates high abundance of taxa. (A) Although skin-associated taxa (Propionibacteriaceae, Corynebacteriaceae,Staphylococcaceae and Streptococcaceae) were abundant on all surfaces, they were relatively more abundant on surfaces routinely touched withhands. (B) Gut-associated taxa (Clostridiales, Clostridiales group XI, Ruminococcaceae, Lachnospiraceae, Prevotellaceae and Bacteroidaceae) were mostabundant on toilet surfaces. (C) Although soil-associated taxa (Rhodobacteraceae, Rhizobiales, Microbacteriaceae and Nocardioidaceae) were in lowabundance on all restroom surfaces, they were relatively more abundant on the floor of the restrooms we surveyed. Figure not drawn to scale.doi:10.1371/journal.pone.0028132.g003

Figure 4. Results of SourceTracker analysis showing the average contributions of different sources to the surface-associated bacterial communities in twelve public restrooms. The “unknown” source is not shown but would bring the total of each sample up to 100%. doi:10.1371/journal.pone.0028132.g004

Bacteria of Public Restrooms

PLoS ONE | www.plosone.org 5 November 2011 | Volume 6 | Issue 11 | e28132

high diversity of floor communities is likely due to the frequency of contact with the bottom of shoes, which would track in a diversity of microorganisms from a variety of sources including soil, which is known to be a highly-diverse microbial habitat [27,39]. Indeed, bacteria commonly associated with soil (e.g. Rhodobacteraceae, Rhizobiales, Microbacteriaceae and Nocardioidaceae) were, on average, more abundant on floor surfaces (Figure 3C, Table S2). Interestingly, some of the toilet flush handles harbored bacterial communities similar to those found on the floor (Figure 2, Figure 3C), suggesting that some users of these toilets may operate the handle with a foot (a practice well known to germaphobes and those who have had the misfortune of using restrooms that are less than sanitary).

While the overall community level comparisons between the communities found on the surfaces in male and female restrooms were not statistically significant (Table S3), there were gender-related differences in the relative abundances of specific taxa on some surfaces (Figure 1B, Table S2). Most notably, Lactobacillaceae were clearly more abundant on certain surfaces within female restrooms than male restrooms (Figure 1B). Some species of this family are the most common, and often most abundant, bacteria found in the vagina of healthy reproductive age women [40,41] and are relatively less abundant in male urine [28,29]. Our analysis of female urine samples collected as part of a previous study [26] (Figure 1A), found that Lactobacillaceae were dominant in urine, therefore implying that surfaces in the restrooms where Lactobacillaceae were observed were contaminated with urine. Other studies have demonstrated a similar phenomenon, with vagina-associated bacteria having also been observed in airplane restrooms [11] and a child day care facility [10]. As we found that Lactobacillaceae were most abundant on toilet surfaces and those touched by hands after using the toilet (with the exception of

Figure 2. Relationship between bacterial communities associated with ten public restroom surfaces. Communities were clustered using PCoA of the unweighted UniFrac distance matrix. Each point represents a single sample. Note that the floor (triangles) and toilet (asterisks) surfaces form clusters distinct from surfaces touched with hands. doi:10.1371/journal.pone.0028132.g002

Table 1. Results of pairwise comparisons for unweighted UniFrac distances of bacterial communities associated with various surfaces of public restrooms on the University of Colorado campus using the ANOSIM test in Primer v6.

                     Door in  Door out  Stall in  Stall out  Faucet handle  Soap dispenser  Toilet flush handle  Toilet seat  Toilet floor
Door in
Door out             -0.139
Stall in             0.149    -0.053
Stall out            -0.074   -0.083    -0.037
Faucet handle        -0.062   -0.011    -0.092    -0.040
Soap dispenser       -0.020   0.014     -0.060    -0.001     0.070
Toilet flush handle  0.376*   0.405*    0.221     0.350*     0.172*         0.470*
Toilet seat          0.742*   0.672*    0.457*    0.586*     0.401*         0.653*          0.187*
Toilet floor         0.995*   0.988*    0.993*    0.961*     0.758*         0.998*          0.577*               0.950*
Sink floor           1.000*   0.995*    1.000*    0.974*     0.770*         1.000*          0.655*               0.982*       -0.033

The R-statistic is shown for each comparison with asterisks denoting comparisons that were statistically significant at P ≤ 0.01. doi:10.1371/journal.pone.0028132.t001
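As a rough illustration of what the R-statistic in Table 1 measures, the sketch below computes ANOSIM's R from a small distance matrix using the standard rank-based definition, R = (mean between-group rank - mean within-group rank) / (M/2), where M is the number of sample pairs. It is not the Primer v6 implementation used in the paper; the function names and the toy distances are assumptions for illustration only, and the permutation test that yields the P value is omitted.

```python
from itertools import combinations


def average_ranks(values):
    """Rank values from 1..n, giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def anosim_r(dist, groups):
    """ANOSIM R for a square distance matrix and one group label per sample."""
    pairs = list(combinations(range(len(groups)), 2))
    ranks = average_ranks([dist[i][j] for i, j in pairs])
    within = [r for r, (i, j) in zip(ranks, pairs) if groups[i] == groups[j]]
    between = [r for r, (i, j) in zip(ranks, pairs) if groups[i] != groups[j]]
    mean_within = sum(within) / len(within)
    mean_between = sum(between) / len(between)
    return (mean_between - mean_within) / (len(pairs) / 2)


# Hypothetical distances: two "floor" and two "door" samples.
toy_dist = [
    [0.0, 0.1, 0.8, 0.9],
    [0.1, 0.0, 0.7, 0.6],
    [0.8, 0.7, 0.0, 0.2],
    [0.9, 0.6, 0.2, 0.0],
]
toy_groups = ["floor", "floor", "door", "door"]
print(anosim_r(toy_dist, toy_groups))
```

R near 1 means samples within a surface type are more similar to each other than to samples from the other type, as for the floor versus door comparisons in Table 1; values near 0 (or slightly negative) mean no detectable separation.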


10 FEBRUARY 2012 VOL 335 SCIENCE www.sciencemag.org 650

NEWSFOCUS

CREDITS (TOP TO BOTTOM): (PHOTO) COURTESY GILBERTO FLORES; (CHART) G. E. FLORES ET AL., PLOS ONE 6, 11 (2011); PHOTO BY SISIRA GORTHALA

In just that short time, the microbes had begun to take on a “signature” of outside air (more types from plants and soil), and 2 hours after the windows were shut again, the proportion of microbes from the human body increased back to previous levels.

The study, which appeared online 26 January in The ISME Journal, found that mechanically ventilated rooms had lower microbial diversity than ones with open windows. The availability of fresh air translated into lower proportions of microbes associated with the human body, and consequently, fewer potential pathogens. Although this result suggests that having natural airflow may be healthier, Green says answering that question requires clinical data; she’s hoping to convince a hospital to participate in a study to see if the incidence of hospital-acquired infections is associated with a room’s microbial community.

For his part, Peccia, who is also a Sloan grantee, is merging microbiology and the physics of aerosols to look more closely at how the movement of air affects microbes. Peccia says his group is building on work by air-quality engineers and scientists, but “we want to add biology to the equation.”

Bacteria in air behave like other particles; their size dictates how they disperse or settle. Humans in a room not only shed microbes from their skin and mouths, but they also drum up microbial material from the floor as they move around. But to quantify those contributions, Peccia’s team has had to develop new methods to collect airborne bacteria and extract their DNA, as the microbes are much less abundant in air than on surfaces.

In one recent study, they used air filters to sample airborne particles and microbes in a classroom during 4 days during which students were present and 4 days during which the room was vacant. They measured the abundance and type of fungal and bacterial genomes present and estimated the microbes’ concentrations in the entire room. By accounting for bacteria entering and leaving the room through ventilation, they calculated that people shed or resuspended about 35 million bacterial cells per person per hour. That number is much higher than the several-hundred-thousand maximum previously estimated to be present in indoor air, Peccia reported last fall at the American Association for Aerosol Research Conference in Orlando, Florida.
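The article does not give the details of the calculation, so the following is only a minimal sketch of the general idea of a ventilation mass balance, assuming a well-mixed room at steady state and ignoring deposition onto surfaces. The function name and every input value are hypothetical and are not taken from Peccia's study.

```python
def per_person_emission(flow_m3_per_h, room_cells_per_m3,
                        supply_cells_per_m3, occupants):
    """Steady-state, well-mixed estimate of cells emitted per person per hour.

    Assumption: whatever leaves the room in excess of what the supply air
    brings in must have been shed or resuspended by the occupants.
    """
    if occupants <= 0:
        raise ValueError("occupants must be positive")
    net_export = flow_m3_per_h * (room_cells_per_m3 - supply_cells_per_m3)
    return net_export / occupants


# Hypothetical inputs: 500 m^3/h of ventilation air, 2.0e5 cells/m^3 in the
# room, 5.0e4 cells/m^3 in the supply air, 30 occupants.
rate = per_person_emission(500.0, 2.0e5, 5.0e4, 30)
print(f"{rate:.2e} cells per person per hour")
```

Comparing occupied and vacant days, as the study did, is what lets the occupant contribution be separated from the background carried in by the supply air.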

His group’s data also suggest that rooms have “memories” of past human inhabitants. By kicking into the air settled microbes from the floor, occupants expose themselves not just to the microbes of a person coughing next to them, but also possibly to those from a person who coughed in the room a few hours or even days ago.

Peccia hopes to come up with ways to describe the distribution of bacteria indoors that can be used in conjunction with existing knowledge about particulate matter and chemicals in designing healthier buildings. “My hope is that we can bring this enough to the forefront that people who do aerosol science will find it as important to know biology as to know physics and chemistry,” he says.

Still, even though he’s a willing participant in indoor microbial ecology research, Peccia thinks that the field has yet to gel. And the Sloan Foundation’s Olsiewski shares some of his concern. “Everybody’s generating vast amounts of data,” she says, but looking across data sets can be difficult because groups choose different analytical tools. With Sloan support, though, a data archive and integrated analytical tools are in the works.

To foster collaborations between microbiologists, architects, and building scientists, the foundation also sponsored a symposium on the microbiome of the built environment at the 2011 Indoor Air conference in Austin, Texas, and launched a Web site, MicroBE.net, that’s a clearinghouse of information on the field. Although Olsiewski won’t say how long the foundation will fund its indoor microbial ecology program, she says Sloan is committed to supporting all of the current projects for the next few years. The program’s ultimate goal, she says, is to create a new field of scientific inquiry that eventually will be funded by traditional government funding agencies focused on basic biology and environmental policy.

Matthew Kane, a microbial ecologist and program director at the U.S. National Science Foundation (NSF), says that although there was interest in these questions prior to the Sloan program, the Sloan Foundation has taken a directed approach to funding the research, and “I have no doubt that their investment is going to reap great returns.” So far, though, NSF has funded only one study on indoor microbes: a study of Pseudomonas bacteria in human households.

As studies like Green’s building ecology analysis progress, they should shed light on how indoor environments differ from those traditionally studied by microbial ecologists. “It’s important to have a quantitative understanding of how building design impacts microbial communities indoors, and how these communities impact human health,” Green says. But it remains to be seen whether we’ll someday design and maintain our buildings with microbes in mind.

–COURTNEY HUMPHRIES

Courtney Humphries is a freelance writer in Boston and author of Superdove.

[Chart: average contribution (%, 0 to 100) of each source (soil, water, mouth, urine, gut, skin) to the bacterial communities on door in, door out, stall in, stall out, faucet handles, soap dispenser, toilet seat, toilet flush handle, toilet floor and sink floor.]

Outside influence. Students prepare to sample air outside a classroom in China as part of an indoor ecology study.

Bathroom biogeography. By swabbing different surfaces in public restrooms, researchers determined that microbes vary in where they come from depending on the surface (chart).

Published by AAAS



Page 54: Eisen Lecture for Ian Korf genomics course

Era III: Genome Sequencing


Page 55: Eisen Lecture for Ian Korf genomics course

1st Genome Sequence

Fleischmann et al. 1995

Page 56: Eisen Lecture for Ian Korf genomics course

Genomes Revolutionized Microbiology

• Predictions of metabolic processes

• Better vaccine and drug design

• New insights into mechanisms of evolution

• Genomes serve as template for functional studies

• New enzymes and materials for engineering and synthetic biology


Page 57: Eisen Lecture for Ian Korf genomics course


Page 58: Eisen Lecture for Ian Korf genomics course

Metabolic Predictions


Page 59: Eisen Lecture for Ian Korf genomics course

Lateral Gene Transfer

Perna et al. 2003

Page 60: Eisen Lecture for Ian Korf genomics course

Network of Life

Figure from Barton, Eisen et al. “Evolution”, CSHL Press.

Based on tree from Pace NR, 2003.

Archaea

Eukaryotes

Bacteria


Page 61: Eisen Lecture for Ian Korf genomics course

Using the Core


Page 62: Eisen Lecture for Ian Korf genomics course

Whole Genome Phylogeny

Whole genome tree built using AMPHORA by Martin Wu and Dongying Wu

Page 63: Eisen Lecture for Ian Korf genomics course

Microbial genomes

From http://genomesonline.org

Page 64: Eisen Lecture for Ian Korf genomics course

GEBA as example


Page 65: Eisen Lecture for Ian Korf genomics course

Phylogenetic Diversity

• Phylogenetic diversity poorly sampled

• GEBA project at DOE-JGI correcting this


Page 66: Eisen Lecture for Ian Korf genomics course


Page 67: Eisen Lecture for Ian Korf genomics course

http://www.jgi.doe.gov/programs/GEBA/pilot.html

Page 68: Eisen Lecture for Ian Korf genomics course

GEBA Lesson 1: rRNA utility in IDing novel genomes

From Wu et al. 2009 Nature 462, 1056-1060

Page 70: Eisen Lecture for Ian Korf genomics course

GEBA Lesson 3: Phylogenetic sampling improves annotation

• Took 56 GEBA genomes and compared results vs. 56 randomly sampled new genomes

• Better definition of protein family sequence “patterns”

• Greatly improves “comparative” and “evolutionary” based predictions

• Conversion of hypothetical into conserved hypotheticals

• Linking distantly related members of protein families

• Improved non-homology prediction

Page 71: Eisen Lecture for Ian Korf genomics course

GEBA Lesson 4: Metadata Important

Page 72: Eisen Lecture for Ian Korf genomics course

GEBA Lesson 5: Improves discovering new genetic diversity

Page 73: Eisen Lecture for Ian Korf genomics course

Protein Family Rarefaction Curves

• Take data set of multiple complete genomes

• Identify all protein families using MCL

• Plot # of genomes vs. # of protein families
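The three steps above can be sketched in a few lines. This is only an illustration under stated assumptions: the protein families are taken as already computed (for example by MCL) and supplied as a mapping from genome name to a set of family identifiers, the genome and family names are made up, and the curve is averaged over random genome orderings, which is one common way to draw a rarefaction curve but not necessarily the one used for the slide.

```python
import random


def rarefaction_curve(genome_families, n_permutations=100, seed=0):
    """Mean number of distinct protein families after sampling 1..N genomes.

    genome_families maps each genome name to the set of protein family IDs
    found in that genome (family assignment, e.g. by MCL, is assumed done).
    """
    rng = random.Random(seed)
    genomes = list(genome_families)
    totals = [0.0] * len(genomes)
    for _ in range(n_permutations):
        rng.shuffle(genomes)          # random order of genome addition
        seen = set()
        for k, genome in enumerate(genomes):
            seen |= genome_families[genome]
            totals[k] += len(seen)    # families discovered after k+1 genomes
    return [total / n_permutations for total in totals]


# Hypothetical toy data: three genomes and their protein families.
toy = {
    "genome1": {"famA", "famB"},
    "genome2": {"famB", "famC"},
    "genome3": {"famA", "famB", "famC", "famD"},
}
for n, families in enumerate(rarefaction_curve(toy), start=1):
    print(n, round(families, 2))
```

Plotting the returned values against the number of genomes gives the curve; a curve that keeps rising indicates that additional genomes are still contributing new protein families.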


Page 79: Eisen Lecture for Ian Korf genomics course

Synapomorphies exist

Wu et al. 2009 Nature 462, 1056-1060

Page 80: Eisen Lecture for Ian Korf genomics course

III: Epidemiology & Forensics


Page 81: Eisen Lecture for Ian Korf genomics course

Era IV: Genomes in the environment


Page 82: Eisen Lecture for Ian Korf genomics course

Marine Microbe Background

• rRNA PCR studies of marine microbes have been extensive

• Comparative analysis had revealed many lineages, some very novel, some less so, that were dominant in many, if not all, open ocean samples

• Lineages given names based on specific clones: e.g., SAR11, SAR86, etc


Page 83: Eisen Lecture for Ian Korf genomics course

INSIGHT REVIEW NATURE|Vol 437|15 September 2005


community. As discussed below, some opportunistic strains that exploit transient conditions may fall into this category.

The interpretation of the Sargasso Sea environmental sequence data is already inspiring debate16. 16S rRNA gene sequence data from the Sargasso Sea WGS data set are shown in Fig. 2 without the sequences of Burkholderia and Shewanella, which are rare in the Sargasso Sea ecosystem and have been questioned as possible contaminants16. Naming microbial plankton by clade, as shown in Fig. 2, is a convention used by most oceanographers that is based on evolutionary principles. With few exceptions, the Sargasso Sea data fall into previously named microbial plankton clades. However, Venter has emphasized the new diversity shown by the data, concluding that 1,800 ‘genomic’ species of bacteria and 145 new ‘phylotypes’ inhabited the samples recovered from the Sargasso Sea18. To reach this conclusion he applied a rule-of-thumb which assumes that 16S rRNA gene sequences that are less than 97% similar originate from different species. As we discuss further below, the origins of 16S rRNA sequence diversity within the named microbial plankton clades is a hot issue. But, however they are interpreted, the high reliability of raw WGS sequence data will be very useful for understanding the mechanisms of microbial evolution in the oceans.

Although most of the major microbial plankton clades have cosmopolitan distributions, new marine microbial plankton clades continue to emerge from studies that focus on unique hydrographic features. For example, the new Archaea Crenarchaeota and Euryarchaeota were discovered at the brine–seawater interface of the Shaban Deep, in the Red Sea19.

Patterns in time and space

Molecular biology has filled in some of the blanks about the natural history of marine microbial plankton. As genetic markers became available for ecological studies, it soon emerged that some of the dominant microbial plankton clades are vertically stratified (Fig. 1). Early indications of these patterns came from the distributions of rRNA gene clones among libraries collected from different depths20–23. Although the study of microbial community stratification is far from complete, in many cases (marine unicellular cyanobacteria, SAR11, SAR202, SAR406, SAR324, group I marine Archaea) the vertical stratification of populations has been confirmed by alternative experimental approaches21–27. The obvious interpretation is that many of these groups are specialized to exploit vertical patterns in physical, chemical and biological factors. A clear example is the unicellular marine cyanobacteria. As obligate phototrophs, these cyanobacteria are confined to the photic zone. A similar pattern is found in the SAR86 clade of γ-Proteobacteria. Proteorhodopsin genes have been found in fragments of SAR86 genomes, suggesting that this clade has the potential for phototrophic metabolism.

Many of the enigmatic microbial groups for which no metabolic strategy has been identified are also stratified. The boundary between the photic zone and the dark upper mesopelagic is particularly striking: below the photic zone the abundance of picophytoplankton and SAR86 declines sharply, and marine group I Archaea, SAR202, SAR406 and SAR324 all assume a prevalent status21–24,26,27. The implications of these observations are clear: the upper mesopelagic community is almost certainly specialized to harvest resources descending from the photic zone. However, with the exception of the marine group I Archaea, very little specific information is available about the individual activities of the upper-mesopelagic groups.

There are also significant differences between coastal and ocean gyre microbial plankton populations (Fig. 1; ref. 28). Typically, continental shelves are far more productive than ocean gyres because physical processes such as upwelling and mixing bring nutrients to the surface. As a result eukaryotic phytoplankton make up a larger fraction of the biomass in coastal seas, and species differ between coastal and ocean populations. Most of the bacterial groups found in gyres also occur in large numbers in coastal seas, but a number of microbial plankton clades, particularly members of the β-Proteobacteria, have coastal ecotypes or appear to be predominantly confined to coastal seas28.

One of the most enigmatic microbial groups in the ocean is the marine group I Archaea. Tantalizing geochemical evidence suggests that these organisms are chemoautotrophs29. In the 1990s, DeLong and Fuhrman established that archaea are widely distributed and numerically significant in the marine water column11,20,30. The marine group I Archaea are Crenarchaeotes. They predominantly occur in the mesopelagic, but are found at the surface in the cold waters of the southern ocean during the winter. Fluorescence in situ hybridization technology was used to demonstrate that marine group I Archaea populations comprise about 40% of the mesopelagic microbial community over vast expanses of the ocean, making them one of the most abundant organisms on the planet24. All of the marine Archaea remain uncultured.

New data about microbial distributions has provided tantalizing hints about geochemical activity, but most progress on this question

Archaea

Crenarchaeota: Group I Archaea

Euryarchaeota: Group II Archaea, Group III Archaea, Group IV Archaea

Bacteria

α-Proteobacteria: * SAR11 - Pelagibacter ubique, * Roseobacter clade, OCS116

β-Proteobacteria: * OM43

γ-Proteobacteria: SAR86, * OMG Clade, * Vibrionaceae, * Pseudoalteromonas, * Marinomonas, * Halomonadaceae, * Colwellia, * Oceanospirillum

δ-Proteobacteria

Cyanobacteria: * Marine Cluster A (Synechococcus), * Prochlorococcus sp.

Lentisphaerae: * Lentisphaera araneosa

Bacteroidetes

Marine Actinobacteria

Fibrobacter: SAR406

Planctobacteria

Chloroflexi: SAR202

Figure 1 | Schematic illustration of the phylogeny of the major plankton clades. Black letters indicate microbial groups that seem to be ubiquitous in seawater. Gold indicates groups found in the photic zone. Blue indicates groups confined to the mesopelagic and surface waters during polar winters. Green indicates microbial groups associated with coastal ocean ecosystems.

© 2005 Nature Publishing Group

NATURE|Vol 437|15 September 2005|doi:10.1038/nature04158 INSIGHT REVIEW


Molecular diversity and ecology of microbial plankton

Stephen J. Giovannoni¹ & Ulrich Stingl¹

The history of microbial evolution in the oceans is probably as old as the history of life itself. In contrast to terrestrial ecosystems, microorganisms are the main form of biomass in the oceans, and form some of the largest populations on the planet. Theory predicts that selection should act more efficiently in large populations. But whether microbial plankton populations harbour organisms that are models of adaptive sophistication remains to be seen. Genome sequence data are piling up, but most of the key microbial plankton clades have no cultivated representatives, and information about their ecological activities is sparse.

cultivation of key organisms, metagenomics and ongoing biogeochemical studies. It seems very likely that the biology of the dominant microbial plankton groups will be unravelled in the years ahead.

Here we review current knowledge about marine bacterial and archaeal diversity, as inferred from phylogenies of genes recovered from the ocean water column, and consider the implications of microbial diversity for understanding the ecology of the oceans. Although we leave protists out of the discussion, many of the same issues apply to them. Some of the studies we refer to extend to the abyssal ocean, but we focus principally on the surface layer (0–300 m), the region of highest biological activity.

Phylogenetic diversity in the ocean

Small-subunit ribosomal (RNA) genes have become universal phylogenetic markers and are the main criteria by which microbial plankton groups are identified and named9. Most of the marine microbial groups were first identified by sequencing rRNA genes cloned from seawater10–14, and remain uncultured today. Soon after the first reports came in, it became apparent that less than 20 microbial clades accounted for most of the genes recovered15. Figure 1 is a schematic illustration of the phylogeny of these major plankton clades. The taxon names marked with asterisks represent groups for which cultured isolates are available.

The recent large-scale shotgun sequencing of seawater DNA is providing much higher resolution 16S rRNA gene phylogenies and biogeographical distributions for marine microbial plankton. Although the main purpose of Venter’s Sorcerer II expedition is to gather whole-genome shotgun sequence (WGS) data from planktonic microorganisms16, thousands of water-column rRNA genes are part of the by-catch. The first set of collections, from the Sargasso Sea, have yielded 1,184 16S rRNA gene fragments. These data are shown in Fig. 2, organized by clade structure. Such data are a rich scientific resource for two reasons. First, they are not tainted by polymerase chain reaction (PCR) artefacts; PCR artefacts rarely interfere with the correct placement of genes in phylogenetic categories, but they are a major problem for reconstructing evolutionary patterns at the population level17. Second, the enormous number of genes provided by the Sorcerer II expedition is revealing the distribution patterns and abundance of microbial groups that compose only a small fraction of the

Certain characteristics of the ocean environment (the prevailing low-nutrient state of the ocean surface, in particular) mean it is sometimes regarded as an extreme ecosystem. Fixed forms of nitrogen, phosphorus and iron are often at very low or undetectable levels in the ocean’s circulatory gyres, which occur in about 70% of the oceans1. Photosynthesis is the main source of metabolic energy and the basis of the food chain; ocean phytoplankton account for nearly 50% of global carbon fixation, and half of the carbon fixed into organic matter is rapidly respired by heterotrophic microorganisms. Most cells are freely suspended in the mainly oxic water column, but some attach to aggregates. In general, these cells survive either by photosynthesizing or by oxidizing dissolved organic matter (DOM) or inorganic compounds, using oxygen as an electron acceptor.

Microbial cell concentrations are typically about 10⁵ cells ml⁻¹ in the ocean surface layer (0–300 m); thymidine uptake into microbial DNA indicates average growth rates of about 0.15 divisions per day (ref. 2). Efficient nutrient recycling, in which there is intense competition for scarce resources, sustains this growth, with predation by viruses and protozoa keeping populations in check and driving high turnover rates3. Despite this competition, steady-state dissolved organic carbon (DOC) concentrations are many times higher than carbon sequestered in living microbial biomass4. However, the average age of the DOC pool in the deep ocean, of about 5,000 years5 (determined by isotopic dating), suggests that much of the DOM is refractory to degradation. Although DOM is a huge resource, rivalling atmospheric CO2 as a carbon pool6, chemists have been thwarted by the complexity of DOM and have characterized it only in broad terms7.

The paragraphs above capture prominent features of the ocean environment, but leave out the complex patterns of physical, chemical and biological variation that drive the evolution and diversification of microorganisms. For example, members of the genus Vibrio, which include some of the most common planktonic bacteria that can be isolated on nutrient agar plates, readily grow anaerobically by fermentation. The life cycles of some Vibrio species have been shown to include anoxic stages in association with animal hosts, but the broad picture of their ecology in the oceans has barely been characterized8. The story is similar for most of the microbial groups described below: the phylogenetic map is detailed, but the ecological panorama is thinly sketched. New information is rapidly flowing into the field from the

¹Department of Microbiology, Oregon State University, Corvallis, Oregon 97331, USA.


Page 84: Eisen Lecture for Ian Korf genomics course


near the tips of branches and could be attributed to neutral mutations accumulating in clonal populations. Thompson et al. went a step further by examining sequence divergence and genome variability in a set of 232 Vibrio splendidus isolates taken from the same coastal location at different times35. The isolates differed by less than 1% in rRNA gene sequences, but showed extensive variation in genome size and allelic diversity. These results could explain why Venter’s group found marine microbial genomes difficult to assemble from shotgun sequence data.

However, Kimura predicted that selection has more opportunity to act on small changes in fitness as population size increases, and therefore very large, stable populations should be more highly perfected by selection36. More specifically, Kimura coined the term ‘effective population size’ to refer to the minimal size reached by a population that is undergoing fluctuations. For these marine microbes, it may be that where large populations have not been through recent episodes of purifying selection, they are able to maintain very large reservoirs of neutral genetic variation. If this hypothesis is correct, then, at least within ecotypes of microbial plankton, one would expect to find a core set of genes conferring relatively conserved phenotype.

The SAR11 clade provided the earliest demonstration that the sub-clades of environmental gene clusters could be ecotypes26. Probing rRNA revealed the presence of a surface (IA) and a deep sub-clade (II), but failed to identify the niche of a third sub-clade (IB; ref. 26). More recently, the niche of SAR11 subclade IB emerged in a study of the transition between spring-bloom and summer-stratified conditions in the western Sargasso Sea27. Nonmetric multidimensional scaling revealed that the IB subclade occurs throughout the water column in the spring, apparently giving way to the more specialized IA and II subclades when the water column becomes thermally stratified.

The ecotype concept continues to expand with the recognition that many microbial groups can be subdivided according to their distributions in the water column. The unicellular cyanobacteria are by far the best example. Two ecotypes of Prochlorococcus can readily be differentiated by their chlorophyll b/chlorophyll a ratios: a high-light-adapted (high-b/a) lineage, and a low-light-adapted (low-b/a) lineage. Phylogenetic evidence from internal transcribed spacers (ITS) suggests that the high-b/a strains can be differentiated into four genetically distinct lineages. The ITS-based phylogenies indicate that Marine Cluster A Synechococcus can be subdivided into six clades, three of which can be associated with adaptively important phenotypic characteristics (motility, chromatic adaptation, and lack of phycourobilin)37,38. So far, the genome sequences available have provided ample support for the hypothesis that these ecotypes differ in characteristics that affect their ability to compete. Notably, the low-b/a strain SS120 has a much smaller genome than the others and can use only ammonium and amino acids for nitrogen sources39. At the other extreme, Synechococcus WH8102 can use ammonium, urea, nitrite, nitrate, cyanate, peptides and amino acids as sources of nitrogen. It is interesting to note that Marine Cluster A Synechococcus populations seem to prosper during periods of upwelling and vertical mixing, whereby nutrients are supplied but also cause chaotic, transitional conditions. Thus, as observed in the SAR11 clade, there seem to be seasonal specialists and stratification specialists in the marine unicellular cyanobacteria.

The observation that the major microbial plankton clades have diverged into ecotypes is powerful evidence that selection is creating functionally and genetically unique entities, despite the confounding influence of neutral variation, which causes relatively marked divergence in genome sequences. Although the unicellular marine cyanobacteria are a good model for what the future may hold, the debate about diversity is far from over.

Old paradigms challenged by new forms of phototrophy

The new millennium arrived in tandem with discoveries of new forms of phototrophy in the ocean surface, which in turn fundamentally changed perspectives on microbial food webs. Béjà et al. reported the

has come either from cultures or from approaches designed to yield information from experiments performed on native populations. Fluorescence-activated cell-sorting and in situ hybridization have both been used to separate populations and measure their uptake of radioactive substrates31–33. Testing hypotheses originating from genome sequences, oceanographers were surprised to find that the unicellular marine bacteria, particularly Prochlorococcus, can assimilate free amino acids; it had previously been thought that they rely solely on inorganic nitrogen32.

The species question

The question of how to name microbial plankton species is not a trivial matter. For oceanographers the issue is: where should the lines be drawn so that organisms with different properties relevant to geochemistry are given unique names? From an evolutionary perspective, the question might be phrased differently: how does one demarcate cell populations that use the same resources and possess the same suites of adaptations inherited from a common ancestor? Confusion arises from the fact that there is no general agreement about the definition of a microbial ‘species’. The ‘97% rule’ is simple to apply but does not take into account the complex structure of microbial clades. For example, the unicellular marine cyanobacteria form a shallow clade that would constitute a single species by the 97% rule, but all agree that this clade contains several species with distinct phenotypes. Clades such as SAR86 and SAR11 are far more diverse, but can clearly be divided into subclades. One theory is that some of these ‘bushy’ subclades are ecotypes: populations with shared characters and unique niches34.
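To make the ‘97% rule’ concrete, here is a minimal sketch that groups 16S rRNA sequences into clusters whenever they are at least 97% identical to a cluster's first member. It assumes the sequences are already aligned to equal length and uses a simple greedy, representative-based scheme; the function names and toy sequences are illustrative assumptions, not the procedure used by Venter or by the review's authors.

```python
def percent_identity(a, b):
    """Fraction of identical positions between two pre-aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / len(a)


def greedy_otus(sequences, threshold=0.97):
    """Assign each sequence to the first cluster whose representative
    (its first member) is at least `threshold` identical; otherwise the
    sequence founds a new cluster."""
    clusters = []  # list of (representative sequence, [member names])
    for name, seq in sequences.items():
        for rep_seq, members in clusters:
            if percent_identity(seq, rep_seq) >= threshold:
                members.append(name)
                break
        else:
            clusters.append((seq, [name]))
    return [members for _, members in clusters]


# Hypothetical 100-position toy alignment.
toy_seqs = {
    "a": "A" * 100,               # founds the first cluster
    "b": "A" * 98 + "CC",         # 98% identical to "a": same cluster
    "c": "A" * 90 + "C" * 10,     # 90% identical to "a": new cluster
}
print(greedy_otus(toy_seqs))
```

The sketch also shows why the rule is blunt: a single cutoff says nothing about the branching structure inside a cluster, so ecologically distinct lineages that differ by less than 3% end up under one name.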

Acinas and co-workers studied clade structure by using clone libraries prepared using PCR methods that reduced sequence artefacts17. They concluded that most sequence variation was clustered

[Figure 2: bar chart. x-axis: Phylogenetic clade; y-axis: % of 16S rRNA sequences (0–35). Clades labelled: SAR11 subgroups Ia + Ib; SAR86 (γ-Proteobacteria); SAR11 subgroup II; Marine Picophytoplankton; Uncultured α-Proteobacteria Clade; SAR406 (Fibrobacter); Bacteroidetes; SAR324 (δ-Proteobacteria); Marine Actinobacteria; Alteromonas/Pseudoalteromonas; SAR116 (α-Proteobacteria); Rheinheimera; Roseobacter clade; SAR202 (Chloroflexi).]

Figure 2 | 16S rRNA genes from the Sargasso Sea metagenome data set, organized by clades. The clades are shown in rank order according to gene abundance. To create the figure, 16S gene fragments were recovered from the data set by the BLAST program using 14 full sequences of different prokaryotic phyla as query. The resulting 934 sequences were phylogenetically analysed using the program package ARB. Venter et al. correctly reported more 16S rRNA gene fragments (1,184) because their analysis included smaller fragments that are excluded from the set of 934 sequences used in the analysis shown here. Genes belonging to the genera Burkholderia and Shewanella were omitted from the analysis because of suggestions that they are contaminants72.


Nature Publishing Group© 2005


Page 85: Eisen Lecture for Ian Korf genomics course

Delong Lab

Restriction mapping. Large genomic fragments isolated from fosmid clones were mapped by partial and double digestion with various restriction endonucleases. When the subclone sizes exceeded 10 kb, the F-factor-based vector pBAC108L (30) was used to accommodate the fosmid subfragments. Partial digestions were performed by adding 2.5 U of restriction enzyme to 1 μg of NotI-digested clone DNA in a 30-μl reaction mixture. The reaction mixture was incubated at 37°C, and 10-μl aliquots were removed at 10, 40, and 60 min. Restriction digestions were terminated by adding 1 μl of 0.5 M EDTA to the reaction mixtures and placing the tubes on ice. The partially digested DNA was separated by pulsed-field gel electrophoresis as described above except using a 1- to 3-s ramped switch time at 100 V for 16 h. The sizes of the separated fragments were determined relative to those of known standards. The distances of the restriction sites relative to the terminal T7 and SP6 promoter sites on the excised cassette were determined by end labeling 10 pmol of T7- or SP6-specific oligonucleotides with [γ-32P]ATP (7,000 Ci/mmol) and hybridizing with Southern blots of the gels.

Southern blots of agarose gels containing fosmid and pBAC clones digested with two or more restriction enzymes were probed with labeled T7 and SP6 oligonucleotides as well as random-prime-labeled subclones and PCR fragments carrying gene sequences identified from the shotgun sequencing described above. This information was correlated with the size estimates from the partial digestions to generate physical and genetic maps of the fosmids and their subclones.

Phylogenetic analysis. Sequence alignment and DeSoete distance (9) analyses were performed on a Sun Sparc 10 workstation using GDE 2.2 and Treetool 1.0, obtained from the ribosomal database project (RDP) (23). DeSoete least squares distance analyses (9) were performed by using pairwise evolutionary distances, calculated by using the correction of Olsen to account for empirical base frequencies (34). Reference sequences were obtained from the RDP, version 4.0 (23). Maximum likelihood analyses (10) of ssu rRNA sequences were performed by using fastDNAml 1.0 (25), obtained from the RDP. For distance analyses of the inferred amino acid sequence of EF2, evolutionary distances were estimated by using the Phylip program (12) Protdist, and tree topology was inferred by the Fitch-Margoliash method, using random taxon addition and global branch swapping. For maximum parsimony analyses of protein sequences, the Phylip program Protpars was used with random taxon addition and ordinary parsimony options.

Nucleotide sequence accession numbers. Partial sequences reported in Table 1 have been submitted to GenBank under the following accession numbers: U40238, U40239, U40240, U40241, U40242, U40243, U40244, and U40245. The nucleotide sequences encoding ssu rRNA and EF2 have been submitted to GenBank under accession numbers U39635 and U41261.

RESULTS

Figure 1 shows an overview of the procedures used to construct an environmental library from the mixed picoplankton sample. Our goal was to construct a stable, large insert DNA library representing picoplankton genomic DNA, in order to gain information about the genetic and physiological potential of one constituent group in this community, the planktonic marine Crenarchaeota. Agarose plugs containing high-molecular-weight picoplankton DNA were prepared by concentrating cells from 30 liters of seawater, using hollow fiber filtration. These agarose plugs, representing picoplankton collected from a variety of sites and depths in the eastern North Pacific, were screened for the presence of archaea by using both eubacterium-biased (to test for positive amplification) and archaeon-biased rDNA primers. PCR amplification results from several of the agarose plugs (data not shown) indicated the presence of significant amounts of archaeal DNA. Quantitative hybridization experiments using rRNA extracted from one sample, collected at a depth of 200 m off the Oregon coast, indicated that planktonic archaea in this assemblage comprised approximately 4.7% of the total picoplankton biomass (this sample corresponds to "PAC1"-200 m in Table 1 of reference 8). Results from archaeon-biased rDNA PCR amplification performed on agarose plug lysates confirmed the presence of relatively large amounts of archaeal DNA in this sample. Agarose plugs prepared from this picoplankton sample were chosen for subsequent fosmid library preparation. Each 1-ml agarose plug from this site contained approximately 7.5 × 10^9 cells; therefore, approximately 5.4 × 10^8 cells were present in the 72-μl slice used in the preparation of the partially digested DNA.

Recombinant fosmids, each containing ca. 40 kb of picoplankton DNA insert, yielded a library of 3,552 fosmid clones, containing approximately 1.4 × 10^8 bp of cloned DNA. All of the clones examined contained inserts ranging in size from 38 to 42 kbp (Fig. 2). Both the multiplex PCR (Fig. 3) and the hybridization experiments suggested that well B7 on microtiter

FIG. 1. Flowchart depicting the construction and screening of an environmental library from a mixed picoplankton sample. MW, molecular weight; PFGE, pulsed-field gel electrophoresis.

FIG. 2. Pulsed-field gel showing the separation of selected fosmid clones digested with NotI and BamHI. The pFOS1 vector band is at 7.2 kbp. The top two bands of clone 4B7 are doublets.

VOL. 178, 1996 GENOMIC FRAGMENTS FROM PLANKTONIC MARINE ARCHAEA 593


Page 86: Eisen Lecture for Ian Korf genomics course

Delong Lab

dish 4 contained a clone encoding a 16S rDNA gene specific to the archaea. This clone, designated 4B7, was selected for more detailed examination. Using the cloned 4B7 fragment as a probe, we did not detect other cloned fragments in the fosmid library that contained overlapping regions (Fig. 4). No other archaeal ssu rRNA genes were detected in the library.

Sequence analysis and identification of protein- and rRNA-encoding genes. Fragments of archaeal DNA contained in the 4B7 insert were subcloned by digesting the fosmid with either EcoRI, EcoRI plus BamHI, or SpeI. The resulting restriction fragments were recovered in Bluescript vector (Stratagene), and the distal 200 to 300 nucleotides of each subclone were sequenced by using M13 forward and reverse sequencing primers. Of a total of 18 subclones analyzed in this fashion, 6 (33%) contained nucleotide sequences with significant identity to previously characterized genes archived in the National Center for Biotechnology Information nucleic acid or protein nonredundant database. Table 1 shows the rank order similarity of subclone sequences which share significant similarity to known sequences, based on Poisson probabilities of random homology [P(n)] (1, 17). Putative genes contained on fosmid 4B7 identified in this fashion include EF2, glutamate 1-semialdehyde aminotransferase (GSAT), RNA helicase, DnaJ, ssu rRNA, and large subunit (lsu) rRNA. Three of these nucleotide sequences (EF2, lsu rRNA, and ssu rRNA) are most similar to known archaeal homologs. Relatively high P(n) values obtained with the DnaJ (subclone 22K-R) and lsu rRNA (Spe14-R) sequences are due to the fact that only a small portion of each of these subcloned fragments encodes the indicated homolog.

Gene organization. The results of the sequence analyses of subclones containing the 16S-23S rRNA operon, GSAT, EF2, and RNA helicase were used to confirm the locations of restriction sites determined by partial and double digestions. The sequenced genes mapped to the distal BamHI-NotI fragments of the 38.5-kbp insert in clone 4B7 (Fig. 5). The GSAT gene and the 16S-23S operon reside on one of these fragments and are transcribed in the same direction. We have not found evidence for a 5S rRNA gene encoded on the 4B7 clone. The genes encoding RNA helicase and EF2 were on the distal BamHI-NotI fragment. The distal location of the archaeoplankton EF2 gene relative to the rRNA operon on clone 4B7, as well as the striking similarities of fosmid-encoded EF2 and ssu and lsu rRNA genes to archaeal homologs (Table 1), provides evidence that the 4B7 fosmid insert represents a contiguous genomic fragment from a planktonic marine archaeon.

Phylogenetic analysis of fosmid-encoded ssu rRNA and EF2. Southern blot hybridization using probes produced from sub-


FIG. 3. Multiplex PCR analysis of the fosmid bacterioplankton DNA library. (A and B) Agarose gel electrophoreses of fosmid minipreparations pooled from each microtiter dish in the library (dishes 1 to 37) and amplified with archaeon-biased ssu rRNA-specific primers. Lane M, 1-kb molecular size marker (Bethesda Research Laboratories); lane +, positive control containing archaeal genomic DNA (Haloferax volcanii); lane −, negative control containing eubacterial genomic DNA (Shewanella putrefaciens). (C) Agarose gel electrophoresis of archaeon-biased ssu rRNA PCR amplifications of fosmid clones pooled from columns (1 to 12) and rows (A to H) of microtiter dish 4 (positive reaction in panel A). Positive reactions were detected in microtiter dish 4, row B, column 7 (clone 4B7).
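The pooled screen described in this caption (one PCR per dish pool, then one per row pool and column pool of the positive dish) can be sketched as below. This is a minimal illustration, not the authors' procedure in code: the function names are hypothetical, and the dish layout of 8 rows by 12 columns is inferred from the caption's "columns (1 to 12) and rows (A to H)".

```python
ROWS = "ABCDEFGH"               # 8 rows of a 96-well microtiter dish
COLUMNS = tuple(range(1, 13))   # 12 columns


def candidate_wells(positive_dishes, positive_rows, positive_columns):
    """Clone names (dish + row + column) consistent with the positive pools.

    With one positive dish, row and column the answer is unique; several
    positive rows or columns give every row/column combination as a candidate.
    """
    for row in positive_rows:
        if row not in ROWS:
            raise ValueError(f"unknown row: {row!r}")
    for column in positive_columns:
        if column not in COLUMNS:
            raise ValueError(f"unknown column: {column!r}")
    return [
        f"{dish}{row}{column}"
        for dish in sorted(positive_dishes)
        for row in sorted(positive_rows)
        for column in sorted(positive_columns)
    ]


def pooled_reactions(n_dishes):
    """PCRs needed when a single dish is positive: dish pools + row pools + column pools."""
    return n_dishes + len(ROWS) + len(COLUMNS)


def total_wells(n_dishes):
    """Number of individual clones that the pools stand in for."""
    return n_dishes * len(ROWS) * len(COLUMNS)


if __name__ == "__main__":
    print(candidate_wells([4], ["B"], [7]))
    print(pooled_reactions(37), "pooled reactions for", total_wells(37), "wells")
```

With 37 dishes, the pooling replaces a well-by-well screen with a few dozen reactions, at the cost of ambiguity whenever more than one row or column in a dish is positive.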

FIG. 4. High-density filter replica of 2,304 fosmid clones containing approximately 92 million bp of DNA cloned from the mixed picoplankton community. The filter was probed with the labeled insert from clone 4B7 (dark spot). The lack of other hybridizing clones suggests that contigs of 4B7 are absent from this portion of the library. Similar experiments with the remainder of the library yielded similar results.
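The library-size figures quoted in the text and in this caption follow from the clone counts and the mean insert size. The short check below assumes a flat 40,000 bp per clone (the text gives "ca. 40 kb", with inserts of 38 to 42 kbp), so the outputs are estimates rather than measured totals; the function name is illustrative.

```python
MEAN_INSERT_BP = 40_000  # "ca. 40 kb" per fosmid insert, treated here as exact


def cloned_dna_bp(n_clones, mean_insert_bp=MEAN_INSERT_BP):
    """Approximate total cloned DNA, in base pairs, for a set of fosmid clones."""
    return n_clones * mean_insert_bp


if __name__ == "__main__":
    for label, n_clones in (("whole library", 3552), ("filter in FIG. 4", 2304)):
        print(f"{label}: {n_clones} clones, about {cloned_dna_bp(n_clones):.2e} bp")
```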

594 STEIN ET AL. J. BACTERIOL.


Page 87: Eisen Lecture for Ian Korf genomics course

Delong Lab

tion with multiple sequence alignments, indicates that the majority of active site residues are well conserved between proteorhodopsin and archaeal bacteriorhodopsins (15).

A phylogenetic comparison with archaeal rhodopsins placed proteorhodopsin on an independent long branch, with moderate statistical support for an affiliation with sensory rhodopsins (16) (Fig. 1B). The finding of archaeal-like rhodopsins in organisms as diverse as marine proteobacteria and eukarya (6) suggests a potential role for lateral gene transfer in their dissemination. Available genome sequence data are insufficient to identify the evolutionary origins of the proteorhodopsin genes. The environments from which the archaeal and bacterial rhodopsins originate are, however, strikingly different. Proteorhodopsin is of marine origin, whereas the archaeal rhodopsins of extreme halophiles experience salinity 4 to 10 times greater than that in the sea (14).

Functional analysis. To determine whether proteorhodopsin binds retinal, we expressed the protein in Escherichia coli (17). After 3 hours of induction in the presence of retinal, cells expressing the protein acquired a reddish pigmentation (Fig. 3A). When retinal was added to the membranes of cells expressing the proteorhodopsin apoprotein, an absorbance peak at 520 nm was observed after 10 min of incubation (Fig. 3B). On further incubation, the peak at 520 nm increased and had a ~100-nm half-bandwidth. The 520-nm pigment was generated only in membranes containing proteorhodopsin apoprotein, and only in the presence of retinal, and its ~100-nm half-bandwidth is typical of retinylidene protein absorption spectra found in other rhodopsins. The red-shifted λmax of retinal (λmax = 370 nm in the free state) is indicative of a protonated Schiff base linkage of the retinal, presumably to the lysine residue in helix G (18).

Light-mediated proton translocation was determined by measuring pH changes in a cell suspension exposed to light. Net outward transport of protons was observed solely in proteorhodopsin-containing E. coli cells and only in the presence of retinal and light (Fig. 4A). Light-induced acidification of the medium was completely abolished by the presence of a 10 μM concentration of the protonophore carbonyl cyanide m-chlorophenylhydrazone (19). Illumination generated a membrane electrical potential in proteorhodopsin-containing right-side-out membrane vesicles, in the presence of retinal, reaching –90 mV 2 min after light onset (20) (Fig. 4B). These data indicate that proteorhodopsin translocates protons and is capable of generating membrane potential in a physiologically relevant range. Because these activities were observed in E. coli membranes containing overexpressed protein, the levels of proteorhodopsin activity in its native state remain to be determined. The ability of proteorhodopsin to generate a physiologically significant membrane potential, however, even when heterologously expressed in nonnative membranes, is consistent with a postulated proton-pumping function for proteorhodopsin.

Archaeal bacteriorhodopsin, and to a lesser extent sensory rhodopsins (21), can both mediate light-driven proton-pumping activity. However, sensory rhodopsins are generally cotranscribed with genes encoding their own transducer of light stimuli [for example, Htr (22, 23)]. Although sequence analysis of proteorhodopsin shows moderate statistical support for a specific relationship with sensory rhodopsins, there is no gene for an Htr-like regulator adjacent to the proteorhodopsin gene. The absence of an Htr-like gene in close proximity to the proteorhodopsin gene suggests that proteorhodopsin may function primarily as a light-driven proton pump. It is possible, however, that such a regulator might be encoded elsewhere in the proteobacterial genome.

To further verify a proton-pumping function for proteorhodopsin, we characterized the kinetics of its photochemical reaction cycle. The transport rhodopsins (bacteriorhodopsins and halorhodopsins) are characterized by cyclic photochemical reaction se-

Fig. 1. (A) Phylogenetic tree of bacterial 16S rRNA gene sequences, including that encoded on the 130-kb bacterioplankton BAC clone (EBAC31A08) (16). (B) Phylogenetic analysis of proteorhodopsin with archaeal (BR, HR, and SR prefixes) and Neurospora crassa (NOP1 prefix) rhodopsins (16). Nomenclature: Name_Species.abbreviation_Genbank.gi (HR, halorhodopsin; SR, sensory rhodopsin; BR, bacteriorhodopsin). Halsod, Halorubrum sodomense; Halhal, Halobacterium salinarum (halobium); Halval, Haloarcula vallismortis; Natpha, Natronomonas pharaonis; Halsp, Halobacterium sp; Neucra, Neurospora crassa.

R E S E A R C H A R T I C L E S

www.sciencemag.org SCIENCE VOL 289 15 SEPTEMBER 2000 1903


Page 88: Eisen Lecture for Ian Korf genomics course

Delong Lab


Page 89: Eisen Lecture for Ian Korf genomics course


Page 90: Eisen Lecture for Ian Korf genomics course

Figure 3. Phylogenetic tree based on the amino acid sequences of 25 archaeal rhodopsins. (a) NJ-tree. The numbers at each node are clustering probabilities generated by bootstrap resampling 1000 times. D1 and D2 represent gene duplication points. The four shaded rectangles indicate the speciation dates when halobacteria speciation occurred at the genus level. (b) ML-tree. Log likelihood value for ML-tree was −6579.02 (best score) and that for topology of the NJ-tree was −6583.43. The stippled bars indicate the 95% confidence limits. Both trees were tentatively rooted at the mid-point of the longest distance, although true root positions are unknown.

From Ihara et al. 1999

Page 91: Eisen Lecture for Ian Korf genomics course

quences (photocycles) that are typically <20 ms, whereas sensory rhodopsins are slow-cycling pigments with photocycle half-times >300 ms (3). This large kinetic difference is functionally important, because a rapid photocycling rate is advantageous for efficient ion pumping, whereas a slower cycle provides more efficient light detection because signaling states persist for longer times. To assess the photochemical reactivity of proteorhodopsin and its kinetics, we subjected membranes containing the pigment to a 532-nm laser flash and analyzed flash-induced absorption changes in the 50-μs to 10-s time window. We observed transient flash-induced absorption changes in the early times in this range (Fig. 5). Transient depletion occurred near the absorption maximum of the pigment (500-nm trace, Fig. 5, top panel), and transient absorption increase was detected at 400 nm and 590 nm, indicating a functional photocyclic reaction pathway. The absorption difference spectrum shows that within 0.5 ms, an intermediate with maximal absorption near 400 nm is produced (Fig. 5, bottom panel), which is typical of unprotonated Schiff base forms (M intermediates) of retinylidene pigments. The 5-ms minus 0.5-ms difference spectrum shows that after M decay, an intermediate species that is red-shifted from the unphotolyzed 520-nm state appears, which is analogous to the final intermediate (O) in bacteriorhodopsin. The decay of proteorhodopsin O is the rate-limiting step in the photocycle and is fit well by a single exponential process of 15 ms, with an upward baseline shift of 13% of the initial amplitude. A possible explanation is heterogeneity in the proteorhodopsin population, with 87% of the molecules exhibiting a 15-ms photocycle and 13% exhibiting a slower recovery. An alternative explanation is that photocycle complexity such as branching produces a biphasic O decay. Consistent with this alternative, the O recovery is fit equally well as a two-exponential process with a fast component, with a 9-ms half-time (61% of the total amplitude) and a slow component with a 45-ms half-time (39% of the total amplitude). In either case, the rapid photocycle rate, which is a distinguishing characteristic of ion pumps, provides additional strong evidence that proteorhodopsin functions as a transporter rather than as a sensory rhodopsin.
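To make the two descriptions of the O decay concrete, the sketch below writes both as functions giving the fraction of the initial signal remaining at time t. It is an illustration under stated assumptions, not the authors' fitting code: the 15-ms figure for the single-exponential fit is treated here as a half-time (the text does not say whether it is a half-time or a time constant), the 13% baseline is modelled as a constant offset, and the function names are hypothetical.

```python
import math


def single_exponential(t_ms, half_time_ms=15.0, baseline=0.13):
    """One decaying component plus a constant offset (fraction of initial signal).

    Assumes the quoted 15 ms is a half-time and that the 13% baseline shift
    does not decay on this time scale.
    """
    return (1.0 - baseline) * 0.5 ** (t_ms / half_time_ms) + baseline


def double_exponential(t_ms, fast_half_ms=9.0, fast_fraction=0.61,
                       slow_half_ms=45.0, slow_fraction=0.39):
    """Two decaying components with the half-times and amplitudes quoted in the text."""
    fast = fast_fraction * 0.5 ** (t_ms / fast_half_ms)
    slow = slow_fraction * 0.5 ** (t_ms / slow_half_ms)
    return fast + slow


if __name__ == "__main__":
    for t in (0, 5, 15, 45, 100):
        print(f"t = {t:>3} ms  single: {single_exponential(t):.3f}  "
              f"double: {double_exponential(t):.3f}")
```

Under these assumptions the two curves stay close over the first few tens of milliseconds and diverge only at long times, where the single-exponential form levels off at its baseline while the two-exponential form continues toward zero.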

Implications. The γ-proteobacteria that harbor the proteorhodopsin are widely distributed in the marine environment. These bacteria have been frequently detected in culture-independent surveys (24) in coastal and oceanic regions of the Atlantic and Pacific Oceans, as well as in the Mediterranean Sea (8, 25–29). In addition to its widespread distribution, preliminary data also suggest that this γ-proteobacterial group is abundant (30,

Fig. 2. Secondary structure of proteorhodopsin. Single-letter amino acid codes are used (33), and the numbering is as in bacteriorhodopsin. Predicted retinal binding pocket residues are marked in red.

Fig. 3. (A) Proteorhodopsin-expressing E. coli cell suspension (+) compared to control cells (−), both with all-trans retinal. (B) Absorption spectra of retinal-reconstituted proteorhodopsin in E. coli membranes (17). A time series of spectra is shown for reconstituted proteorhodopsin membranes (red) and a negative control (black). Time points for spectra after retinal addition, progressing from low to high absorbance values, are 10, 20, 30, and 40 min.

Fig. 4. (A) Light-driven transport of protons by a proteorhodopsin-expressing E. coli cell suspension. The beginning and cessation of illumination (with yellow light >485 nm) is indicated by arrows labeled ON and OFF, respectively. The cells were suspended in 10 mM NaCl, 10 mM MgSO4·7H2O, and 100 μM CaCl2. (B) Transport of 3H+-labeled tetraphenylphosphonium ([3H+]TPP) in E. coli right-side-out vesicles containing expressed proteorhodopsin, reconstituted with (squares) or without (circles) 10 μM retinal in the presence of light (open symbols) or in the dark (solid symbols) (20).


Page 92: Eisen Lecture for Ian Korf genomics course


Page 93: Eisen Lecture for Ian Korf genomics course

letters to nature

NATURE | VOL 411 | 14 JUNE 2001 | www.nature.com 787

proteorhodopsin molecules are present in the membranes of native marine bacterioplankton.

The flash-photolysis data permit estimation of the cellular concentration of proteorhodopsin. Assuming (1) the flash yield of the environmental pigment is similar to that of proteorhodopsin expressed in E. coli (that is, 10.5% absorption change at 500 nm per absorption unit of pigment at 527 nm), we calculate 0.033 absorption units of proteorhodopsin at 527 nm (0.35 μg of proteorhodopsin protein per litre of sea water). Flash spectroscopy was necessary to detect the proteorhodopsin pigments because there is much greater absorption in the visible range by other pigments in the sample, which show peaks of 1.6, 1.3, 0.52 and 0.73 absorption units, at 437, 468, 647 and 673 nm, respectively.

Assuming also (2) a molar absorption coefficient of 50,000 M^-1 cm^-1 at the absorption maximum, (3) fluorescence in situ hybridization counts of a total of 5.6 × 10^10 SAR86 cells in the concentrated sample (~10% of the total bacteria; see Methods), (4) 50% recovery of membranes from the cells, and (5) that the uncultivated SAR86 group is the principal bacterioplankton group using these pigments, we calculate that there are 2.4 × 10^4 proteorhodopsin molecules per SAR86 cell.

This value is on the order of the concentration of bacteriorhodopsin in a H. salinarum cell, in which substantial portions of the membrane surface area of the cell consist of bacteriorhodopsin in a tightly packed crystalline array8. For comparison, 2.4 × 10^4 bacteriorhodopsin molecules densely packed in the purple-membrane lattice would cover a 0.6-μm diameter, flat circular area of membrane, a significant portion of the surface of a single cell3. This number of molecules is sufficient to produce substantial amounts of ATP under illumination9. Therefore, the high density of proteorhodopsin in the SAR86 membrane indicated by our calculations strongly suggests that this protein has a significant role in the physiology of these bacteria in situ.

To explore the possible existence of other proteorhodopsins, we screened the same mixed-population bacterial artificial chromosome (BAC) library10 in which proteorhodopsin was initially discovered, with non-degenerate polymerase chain reaction (PCR) primers1. Several additional proteorhodopsin-containing BAC clones were found in the library. These proteorhodopsins were similar, but did differ in their amino-acid sequences when compared with the original proteorhodopsin (for example, see clones 31A8, 64A5 and 40E8; Figs 3 and 4). Using the same non-degenerate primers, we could also amplify by PCR proteorhodopsin genes from bacterioplankton DNA extracts, including those from Monterey Bay (MB clones; Fig. 3), the Southern ocean (Palmer station; PAL clones) and waters of the central North Pacific ocean (Hawaii Ocean Time series station11; HOT clones).

We detected 15 different variants of proteorhodopsin in the PCR-generated Monterey Bay proteorhodopsin gene library, falling into three clusters (Fig. 3) that share at least 97% identity over 248 amino acids (Fig. 4) and 93% identity at the DNA level. Two proteorhodopsin genes from Monterey, 40E8 and 64A5, were expressed in E. coli and produced absorption spectra very similar to the original proteorhodopsin1 (clone 31A8) isolated from the same waters (data not shown).

%7:36D3A@GL 3@@ 4<7 ;694796<9=9;>?2 H727> 4<34 0767 3:;@?T7= AGZ]% B69: "243684?8 :36?27 A38476?9;@32D492 0767 =?BB76724 B69:4<9>7 9B U924767G \3G IQ?H1 PJL ><36?2H 3 :3K?:5: 9B OSN ?=724?4G9C76 .+S 3:?29 38?=> 0?4< 4<7 U924767G 8@3=7 IQ?H1 +J1 #<7 8<32H7>?2 3:?29F38?= >7^57287> 0767 294 67>46?847= 49 4<7 <G=69;<?@?8

[Figure 2 panel labels: Laser flash; Untreated membranes; Hydroxylamine-treated membranes; Retinal-reconstituted membranes. Scale bars: 10⁻³ AU; 10⁻¹ s.]

Figure 2 Laser flash-induced transients at 500 nm of a Monterey Bay bacterioplankton membrane preparation. Top, before addition of hydroxylamine; middle, after 0.2 M hydroxylamine treatment at pH 7.0, 18 °C, with 500-nm illumination for 30 min; bottom, after centrifuging twice with resuspension in 100 mM phosphate buffer, pH 7.0, followed by addition of 5 µM all-trans retinal and incubation for 1 h.
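The packing comparison in the text on this page (2.4 × 10⁴ molecules covering a flat circular patch 0.6 µm in diameter) can be checked with a few lines of arithmetic. The sketch below is only a consistency check on those two quoted numbers; the function and variable names are illustrative and are not taken from the paper.

```python
import math

# Values quoted in the text; names are illustrative assumptions.
MOLECULES_PER_CELL = 2.4e4   # proteorhodopsin molecules per SAR86 cell
PATCH_DIAMETER_UM = 0.6      # diameter of the equivalent flat circular patch


def circle_area_nm2(diameter_um: float) -> float:
    """Area of a flat circular patch in nm^2, given its diameter in micrometres."""
    radius_nm = diameter_um * 1000.0 / 2.0
    return math.pi * radius_nm ** 2


def area_per_molecule_nm2(n_molecules: float, diameter_um: float) -> float:
    """Membrane area available to each molecule if the patch is fully packed."""
    return circle_area_nm2(diameter_um) / n_molecules


if __name__ == "__main__":
    area = area_per_molecule_nm2(MOLECULES_PER_CELL, PATCH_DIAMETER_UM)
    print(f"implied area per molecule: {area:.1f} nm^2")
```

The result is the area per molecule implied by the quoted figures, which is useful mainly as a sanity check that the molecule count and the patch size are mutually consistent.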

[Figure 3 tree labels. Clades: Monterey Bay and shallow HOT; Antarctica and deep HOT. Clones: MB 0m2, MB 0m1, MB 20m2, MB 20m5, MB 40m12, MB 100m9, HOT 75m3, HOT 75m8, HOT 75m1, PAL B1, PAL B5, PAL B6, PAL E7, PAL E1, PAL B7, PAL B2, PAL B8, MB 100m10, MB 100m5, MB 100m7, MB 20m12, MB 40m1, MB 40m5, BAC 40E8, BAC 31A8, PAL E6, BAC 64A5, HOT 0m1, HOT 75m4. Scale bar: 0.01.]

Figure 3 Phylogenetic analysis of the inferred amino-acid sequence of cloned proteorhodopsin genes. Distance analysis of 220 positions was used to calculate the tree by neighbour-joining using the PaupSearch program of the Wisconsin Package version 10.0 (Genetics Computer Group; Madison, Wisconsin). H. salinarum bacteriorhodopsin was used as an outgroup, and is not shown. Scale bar represents number of substitutions per site. Bold names indicate the proteorhodopsins that were spectrally characterized in this study.
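The identity figures quoted on this page (for example, 97% or 78% identity over 248 amino acids) and the distance analysis behind the tree both start from a pairwise comparison of aligned sequences. A minimal sketch of that comparison follows. It assumes the sequences are already aligned, uses `-` for gaps, and skips gapped columns; the paper does not state how gaps were treated, so that choice is an assumption, and the example strings are made up.

```python
def percent_identity(seq_a: str, seq_b: str) -> float:
    """Percent identity over aligned columns where neither sequence has a gap."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    matches = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == "-" or b == "-":
            continue  # assumption: gapped columns are ignored
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * matches / compared


def p_distance(seq_a: str, seq_b: str) -> float:
    """Uncorrected distance: fraction of compared sites that differ."""
    return 1.0 - percent_identity(seq_a, seq_b) / 100.0


if __name__ == "__main__":
    # Hypothetical toy alignment, not real proteorhodopsin sequence.
    print(percent_identity("MKLLA", "MKLIA"))
    print(p_distance("MKLLA", "MKLIA"))
```

A matrix of such pairwise distances is the usual input to neighbour-joining; tree-building packages typically apply a substitution-model correction on top of the raw fraction shown here.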

© 2001 Macmillan Magazines Ltd


Page 94: Eisen Lecture for Ian Korf genomics course

letters to nature

788 NATURE | VOL 411 | 14 JUNE 2001 | www.nature.com

loops but were spread over the entire protein, including changes near the retinal-binding domain (Fig. 4). Also, there was an insertion of one amino acid in the Antarctic proteorhodopsins, relative to those of the Monterey clade (Fig. 4).

We expressed the palE6 proteorhodopsin gene from Antarctica in E. coli cells. Addition of retinal to membranes of cells containing Antarctic proteorhodopsin apoprotein produced an absorbance peak at 490 nm (Fig. 5), a blue shift of 37 nm from the 527-nm peak observed for the Monterey Bay proteorhodopsins. The Antarctic proteorhodopsin spectrum exhibited vibrational fine structure, as observed previously in retinylidene pigments for archaeal sensory rhodopsin II, which has a very similar blue-shifted (487-nm) absorption maximum12. The nearly identical shapes of the structured spectra suggest similar mechanisms of wavelength regulation in the bacterial and archaeal rhodopsins. Furthermore, palE6 proteorhodopsin functions similarly to its Monterey Bay homologues1, exhibiting light-mediated transport of protons in right-side-out E. coli vesicles (E. N. Spudich et al., manuscript in preparation).

We screened a fosmid library from Antarctica (E. F. DeLong, unpublished data) with proteorhodopsin primers, and sequenced one clone containing the gene. Preliminary sequence analyses based on flanking gene order and identity indicate that, despite differences in amino-acid sequence and absorption spectra, the Antarctic proteorhodopsin is derived from a bacterium highly related to the proteorhodopsin-containing SAR86 bacteria of Monterey Bay10 (O. Béjà, unpublished data).

Thirty-two proteorhodopsin genes from the HOT station were cloned from surface (5-m) and 75-m samples. Most proteorhodopsin clones (80%) from the HOT surface waters belonged to the Monterey clade and were identical (on the amino-acid level) to the proteorhodopsin gene from BAC 40E8. In contrast, most of the clones from the 75-m sample (90%) fell within the Antarctic clade (Fig. 3). Two proteorhodopsin genes from the HOT station, HOT 0m1 and HOT 75m4, were expressed in E. coli and their absorption spectra were examined. As expected from their primary sequences, the HOT proteorhodopsin clustering with the Antarctic clade gave a blue (490-nm) absorption maximum, whereas the proteorhodopsin clustering with the Monterey clade gave a green (527-nm) absorption maximum (Fig. 5).

In the oligotrophic waters of the central North Pacific gyre, most of the light energy is in the blue range, with maximal intensity near 475 nm (ref. 13). This energy peak is maintained over depth, whereas the total energy decreases. At the surface, the energy peak is very broad with a half bandwidth between 400 and 650 nm. In deeper water below 50 m, the peak narrows and the half bandwidth is restricted to between 450 and 500 nm (Fig. 5). Considering the different wavelengths absorbed by members of the two different proteorhodopsin clades (527 nm versus 490 nm; Fig. 5), it seems that the blue-shifted proteorhodopsin variants are better adapted to the light available in their environment. Notably, rod and cone visual pigment rhodopsins from closely related fish species in Lake Baikal14 also exhibit a blue shift in deeper dwelling species. In addition, previous studies of cultivated Prochlorococcus15,16 and Synechococcus17, as well as natural cyanobacterial communities18, also indicate the presence of surface- and deep-water groups adapted to different depths.

Our data now show the presence of co-existing surface- and deep-water clades of proteorhodopsin-based phototrophs, whose energy-generating pigments are spectrally tuned to either shallow or deeper water light fields. Presumably, this provides subsets of proteorhodopsin genetic variants in mixed populations with a selective advantage at different points along the depth-dependent light

Figure 4 Multiple alignment of proteorhodopsin amino-acid sequences. The secondary structure shown (boxes for transmembrane helices) is derived from hydropathy plots. Residues predicted to form the retinal-binding pocket are marked in red.

[Figure 5 plot labels. x axis: Wavelength (nm), 350 to 650. y axes: Absorbance (0.01 to 0.06) and Relative irradiance (0 to 1.0). Spectra: Monterey Bay, HOT 0 m, HOT 75 m, Antarctica. Irradiance curves: 5 m, 75 m.]

Figure 5 Absorption spectra of retinal-reconstituted proteorhodopsins in E. coli membranes. All-trans retinal (2.5 µM) was added to membrane suspensions in 100 mM phosphate buffer, pH 7.0, and absorption spectra were recorded. Top, four spectra for palE6 (Antarctica), HOT 75m4, HOT 0m1, and BAC 31A8 (Monterey Bay) at 1 h after retinal addition. Bottom, downwelling irradiance from HOT station measured at six wavelengths (412, 443, 490, 510, 555 and 665 nm) and at two depths, for the same depths and date that the HOT samples were collected (0 and 75 m). Irradiance is plotted relative to irradiance at 490 nm.
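To put the 37-nm blue shift between the two proteorhodopsin clades (527 nm versus 490 nm) into energy terms, the standard relation E = hc/λ can be applied to each absorption maximum. The sketch below is an illustration added for this purpose, not part of the paper's analysis; the constant is the usual value of hc in eV·nm, and the function name is an assumption.

```python
HC_EV_NM = 1239.842  # Planck constant times speed of light, in eV*nm


def photon_energy_ev(wavelength_nm: float) -> float:
    """Energy of a single photon in electronvolts for a wavelength in nanometres."""
    if wavelength_nm <= 0:
        raise ValueError("wavelength must be positive")
    return HC_EV_NM / wavelength_nm


if __name__ == "__main__":
    green_nm = 527.0  # Monterey clade absorption maximum quoted in the text
    blue_nm = 490.0   # Antarctic clade absorption maximum quoted in the text
    print(f"shift: {green_nm - blue_nm:.0f} nm")
    print(f"527 nm photon: {photon_energy_ev(green_nm):.3f} eV")
    print(f"490 nm photon: {photon_energy_ev(blue_nm):.3f} eV")
```

Shorter wavelength means higher photon energy, so the blue-absorbing variants capture slightly more energetic photons, which are also the ones that dominate the light field at depth according to the text.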


Page 95: Eisen Lecture for Ian Korf genomics course


Page 96: Eisen Lecture for Ian Korf genomics course

salinixanthin functions as a light-harvesting antenna, transferring energy to the rhodopsin/retinal complex [21]. It was recently suggested that the ability of rhodopsins to bind salinixanthin depends on a single glycine amino acid [30], suggesting that other recently identified retinal proteins (e.g., proteorhodopsins) might also interact with carotenoid antennas, since some possess the identical homologous glycine residue.

What questions remain to be tackled for the second decade of research on proteorhodopsin photophysiology? Maintenance of energy charge during respiratory stress or starvation, the most likely physiological mechanism explaining the results of Gomez-Consarnau et al. [16], is just one example of a life history strategy benefitting from proteorhodopsin. As Martinez and colleagues pointed out [17], in different physiological, ecological, phylogenetic, and genomic contexts, proteorhodopsin activity may benefit microbes in a variety of ways. Besides producing ATP from the light-generated proton gradient, flagellar motility and active transport of solutes into or out of the cell can make use of the proton motive force generated by proteorhodopsins [17]. Heterotrophs adapted to either high or low nutrient concentrations are known to contain and express proteorhodopsins. Whether high versus low nutrient adapted bacteria exhibit life-style–specific patterns of proteorhodopsin photophysiology remains to be determined. Already it seems clear that two different high-nutrient–utilizing bacteria containing proteorhodopsin (vibrios and flavobacteria) exhibit fundamentally different light-dependent growth strategies [13,16]. Understanding the diversity of interactions among proteorhodopsin-containing organisms in natural communities represents yet another layer of complexity [22]. Finally, obtaining quantitative estimates of the total contribution of proteorhodopsin photosystems to the overall energy flux in microbial food webs is a worthy goal, but extremely challenging. For chlorophyll-based oxygenic photosynthesizers, fluorescence-based assays, carbon dioxide uptake experiments, and oxygen evolution measurements to constrain energy garnered from sunlight are readily available. In contrast, light-dependent activity assays are not simple, nor straightforwardly interpretable for proteorhodopsin-containing microorganisms. The dizzying array of phylogenetic and physiological contexts in which proteorhodopsins are found (Table 1) also confounds any simple, universal approaches for quantifying their energetic contributions in situ. Nevertheless, the future is bright for both basic understanding as well as technological applications of proteorhodopsins and the microbes that contain them. The increasing availability of cultivable and readily manipulated model systems, along with increasingly more sophisticated in situ studies in the environment, promise to shed further light on the structure, function, and ecological significance of these ubiquitous and fascinating photoproteins.

Acknowledgments

We wish to thank John Spudich for commenting on the manuscript and Jay McCarren for help in preparing Table 1.

Figure 2. An artist’s rendition of the fundamental arrangement of proteorhodopsin in the cell membrane. Left panel: a cartoon (not to scale) of planktonic bacteria in the ocean water column. Right panel: a simple view of one potential proteorhodopsin energy circuit. (1) Proteorhodopsin – uses light energy to translocate protons across the cell membrane. (2) Extracellular protons – the excess extracellular protons create a proton motive force, that can energetically drive flagellar motility, transport processes, or ATP synthesis in the cell. (3) Proton-translocating ATPase – a multi-protein membrane-bound complex that can utilize the proton motive force to synthesize (5) adenosine triphosphate (ATP, a central high energy biochemical intermediate for the cell) from (4) adenosine diphosphate (ADP, a lower energy biochemical intermediate). Illustration by Kirsten Carlson, © MBARI 2001.

doi:10.1371/journal.pbio.1000359.g002

PLoS Biology | www.plosbiology.org 4 April 2010 | Volume 8 | Issue 4 | e1000359


Page 97: Eisen Lecture for Ian Korf genomics course

Community structure and metabolism through reconstruction of microbial genomes from the environment

Gene W. Tyson1, Jarrod Chapman3,4, Philip Hugenholtz1, Eric E. Allen1, Rachna J. Ram1, Paul M. Richardson4, Victor V. Solovyev4, Edward M. Rubin4, Daniel S. Rokhsar3,4 & Jillian F. Banfield1,2

1Department of Environmental Science, Policy and Management, 2Department of Earth and Planetary Sciences, and 3Department of Physics, University of California, Berkeley, California 94720, USA
4Joint Genome Institute, Walnut Creek, California 94598, USA

Microbial communities are vital in the functioning of all ecosystems; however, most microorganisms are uncultivated, and their roles in natural systems are unclear. Here, using random shotgun sequencing of DNA from a natural acidophilic biofilm, we report reconstruction of near-complete genomes of Leptospirillum group II and Ferroplasma type II, and partial recovery of three other genomes. This was possible because the biofilm was dominated by a small number of species populations and the frequency of genomic rearrangements and gene insertions or deletions was relatively low. Because each sequence read came from a different individual, we could determine that single-nucleotide polymorphisms are the predominant form of heterogeneity at the strain level. The Leptospirillum group II genome had remarkably few nucleotide polymorphisms, despite the existence of low-abundance variants. The Ferroplasma type II genome seems to be a composite from three ancestral strains that have undergone homologous recombination to form a large population of mosaic genomes. Analysis of the gene complement for each organism revealed the pathways for carbon and nitrogen fixation and energy generation, and provided insights into survival strategies in an extreme environment.

The study of microbial evolution and ecology has been revolutionized by DNA sequencing and analysis1–3. However, isolates have been the main source of sequence data, and only a small fraction of microorganisms have been cultivated4–6. Consequently, focus has shifted towards the analysis of uncultivated microorganisms via cloning of conserved genes5 and genome fragments directly from the environment7–9. To date, only a small fraction of genes have been recovered from individual environments, limiting the analysis of microbial communities as networks characterized by symbioses, competition and partitioning of community-essential roles. Comprehensive genomic data would resolve organism-specific pathways and provide insights into population structure, speciation and evolution. So far, sequencing of whole communities has not been practical because most communities comprise hundreds to thousands of species10.

Acid mine drainage (AMD) is a worldwide environmental problem that arises largely from microbial activity11. Here, we focused on a low-complexity AMD microbial biofilm growing hundreds of feet underground within a pyrite (FeS2) ore body12–15. This represents a self-contained biogeochemical system characterized by tight coupling between microbial iron oxidation and acidification due to pyrite dissolution11,16,17. Random shotgun sequencing of DNA from entire microbial communities is one approach for the recovery of the gene complement of uncultivated organisms, and for determining the degree of variability within populations at the genome level. We used random shotgun sequencing of the biofilm to obtain the first reconstruction of multiple genomes directly from a natural sample. The results provide novel insights into community structure, and reveal the strategies that underpin microbial activity in this environment.

Initial characterization of the biofilm

Biofilms growing on the surface of flowing AMD in the five-way region of the Richmond mine at Iron Mountain, California12, were sampled in March 2000. Screening using group-specific18 fluorescence in situ hybridization (FISH) revealed that all biofilms contained mixtures of bacteria (Leptospirillum, Sulfobacillus and, in a few cases, Acidimicrobium) and archaea (Ferroplasma and other members of the Thermoplasmatales). The genome of one of these archaea, Ferroplasma acidarmanus fer1, isolated from the Richmond mine, has been sequenced previously (http://www.jgi.doe.gov/JGI_microbial/html/ferroplasma/ferro_homepage.html).

A pink biofilm (Fig. 1a) typical of AMD communities was selected for detailed genomic characterization (see Supplementary Information). The biofilm was dominated by Leptospirillum species and contained F. acidarmanus at a relatively low abundance (Fig. 1b, c). This biofilm was growing in pH 0.83, 42 °C, 317 mM Fe, 14 mM Zn, 4 mM Cu and 2 mM As solution, and was collected from a surface area of approximately 0.05 m².

A 16S ribosomal RNA gene clone library was constructed from DNA extracted from the pink biofilm, and 384 clones were end-sequenced (see Supplementary Information). Results indicated the presence of three bacterial and three archaeal lineages. The most abundant clones are close relatives of L. ferriphilum19 and belong to Leptospirillum group II (ref. 13). Although 94% of the Leptospirillum group II clones were identical, 17 minor variants were detected with up to 1.2% 16S rRNA gene-sequence divergence from the dominant type. Tightly defined groups (up to 1% sequence divergence) related to Leptospirillum group III (ref. 13), Sulfobacillus, Ferroplasma (some identical to fer1), ‘A-plasma’15 and ‘G-plasma’15 were also detected. Leptospirillum group III, G-plasma and A-plasma have only recently been detected in culture-independent molecular surveys. FISH-based quantification (Fig. 1c; see also Supplementary Information) confirmed the dominance of Leptospirillum group II in the biofilm.

Community genome sequencing and assembly

In conventional shotgun sequencing projects of microbial isolates, all shotgun fragments are derived from clones of the same genome. When using the shotgun sequencing approach on genomes from an

articles

NATURE | doi:10.1038/nature02340 | www.nature.com/nature 1

© 2004 Nature Publishing Group

Environmental Genome Shotgun Sequencing of the Sargasso Sea

J. Craig Venter,1* Karin Remington,1 John F. Heidelberg,3 Aaron L. Halpern,2 Doug Rusch,2 Jonathan A. Eisen,3 Dongying Wu,3 Ian Paulsen,3 Karen E. Nelson,3 William Nelson,3 Derrick E. Fouts,3 Samuel Levy,2 Anthony H. Knap,6 Michael W. Lomas,6 Ken Nealson,5 Owen White,3 Jeremy Peterson,3 Jeff Hoffman,1 Rachel Parsons,6 Holly Baden-Tillson,1 Cynthia Pfannkoch,1 Yu-Hui Rogers,4 Hamilton O. Smith1

We have applied “whole-genome shotgun sequencing” to microbial populations collected en masse on tangential flow and impact filters from seawater samples collected from the Sargasso Sea near Bermuda. A total of 1.045 billion base pairs of nonredundant sequence was generated, annotated, and analyzed to elucidate the gene content, diversity, and relative abundance of the organisms within these environmental samples. These data are estimated to derive from at least 1800 genomic species based on sequence relatedness, including 148 previously unknown bacterial phylotypes. We have identified over 1.2 million previously unknown genes represented in these samples, including more than 782 new rhodopsin-like photoreceptors. Variation in species present and stoichiometry suggests substantial oceanic microbial diversity.

Microorganisms are responsible for most of the biogeochemical cycles that shape the environment of Earth and its oceans. Yet, these organisms are the least well understood on Earth, as the ability to study and understand the metabolic potential of microorganisms has been hampered by the inability to generate pure cultures. Recent studies have begun to explore environmental bacteria in a culture-independent manner by isolating DNA from environmental samples and transforming it into large insert clones. For example, a previously unknown light-driven proton pump, proteorhodopsin, was discovered within a bacterial artificial chromosome (BAC) from the genome of a SAR86 ribotype (1), and soil microbial DNA libraries have been constructed and screened for specific activities (2).

Here we have applied whole-genome shotgun sequencing to environmental-pooled DNA samples to test whether new genomic approaches can be effectively applied to gene and species discovery and to overall environmental characterization. To help ensure a tractable pilot study, we sampled in the Sargasso Sea, a nutrient-limited, open ocean environment. Further, we concentrated on the genetic material captured on filters sized to isolate primarily microbial inhabitants of the environment, leaving detailed analysis of dissolved DNA and viral particles on one end of the size spectrum and eukaryotic inhabitants on the other, for subsequent studies.

The Sargasso Sea. The northwest Sargasso Sea, at the Bermuda Atlantic Time-series Study site (BATS), is one of the best-studied and arguably most well-characterized regions of the global ocean. The Gulf Stream represents the western and northern boundaries of this region and provides a strong physical boundary, separating the low nutrient, oligotrophic open ocean from the more nutrient-rich waters of the U.S. continental shelf. The Sargasso Sea has been intensively studied as part of the 50-year time series of ocean physics and biogeochemistry (3, 4) and provides an opportunity for interpretation of environmental genomic data in an oceanographic context. In this region, formation of subtropical mode water occurs each winter as the passage of cold fronts across the region erodes the seasonal thermocline and causes convective mixing, resulting in mixed layers of 150 to 300 m depth. The introduction of nutrient-rich deep water, following the breakdown of seasonal thermoclines into the brightly lit surface waters, leads to the blooming of single cell phytoplankton, including two cyanobacteria species, Synechococcus and Prochlorococcus, that numerically dominate the photosynthetic biomass in the Sargasso Sea.

Surface water samples (170 to 200 liters) were collected aboard the RV Weatherbird II from three sites off the coast of Bermuda in February 2003. Additional samples were collected aboard the SV Sorcerer II from “Hydrostation S” in May 2003. Sample site locations are indicated on Fig. 1 and described in table S1; sampling protocols were fine-tuned from one expedition to the next (5). Genomic DNA was extracted from filters of 0.1 to 3.0 µm, and genomic libraries with insert sizes ranging from 2 to 6 kb were made as described (5). The prepared plasmid clones were sequenced from both ends to provide paired-end reads at the J. Craig Venter Science Foundation Joint Technology Center on ABI 3730XL DNA sequencers (Applied Biosystems, Foster City, CA). Whole-genome random shotgun sequencing of the Weatherbird II samples (table S1, samples 1 to 4) produced 1.66 million reads averaging 818 bp in length, for a total of approximately 1.36 Gbp of microbial DNA sequence. An additional 325,561 sequences were generated from the Sorcerer II samples (table S1, samples 5 to 7), yielding approximately 265 Mbp of DNA sequence.

Environmental genome shotgun assembly. Whole-genome shotgun sequencing projects have traditionally been applied to identify the genome sequence(s) from one particular organism, whereas the approach taken here is intended to capture representative sequence from many diverse organisms simultaneously. Variation in genome size and relative abundance determines the depth of coverage of any particular organism in the sample at a given level of sequencing and has strong implications for both the application of assembly algorithms and for the metrics used in evaluating the resulting assembly. Although we would expect abundant species to be deeply covered and well assembled, species of lower abundance may be represented by only a few sequences. For a single genome analysis, assembly coverage depth in unique regions should approximate a Poisson distribution. The mean of this distribution can be estimated from the observed data, looking at the depth of coverage of contigs generated before any scaffolding. The assembler used in this study, the Celera Assembler (6), uses this value to heuristically identify clearly unique regions to form the backbone of the final assembly within the scaffolding phase. However, when the starting material consists of a mixture of genomes of varying abundance, a threshold estimated in this way would classify samples from the most abundant organism(s) as repetitive, due to their greater-than-average depth of coverage, paradoxically leaving the most abundant organisms poorly assembled. We therefore used manual curation of an initial
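The coverage argument in the assembly paragraph above can be made concrete with a small sketch. It treats depth at a unique position as Poisson-distributed and shows how two genomes in one sample end up with very different mean depths. The genome sizes and DNA fractions are hypothetical values chosen for illustration; only the read count and mean read length come from the text, and this is not the Celera Assembler's actual procedure.

```python
import math


def mean_coverage(total_bases: float, genome_size_bp: float, dna_fraction: float = 1.0) -> float:
    """Expected depth: bases sampled from this genome divided by its length."""
    return dna_fraction * total_bases / genome_size_bp


def poisson_pmf(depth: int, mean: float) -> float:
    """Probability of observing exactly `depth` reads over a position."""
    if mean == 0:
        return 1.0 if depth == 0 else 0.0
    # Work in log space so large means do not overflow.
    return math.exp(depth * math.log(mean) - mean - math.lgamma(depth + 1))


def fraction_uncovered(mean: float) -> float:
    """Expected fraction of positions with zero reads (Poisson probability of 0)."""
    return math.exp(-mean)


if __name__ == "__main__":
    # From the text: 1.66 million reads averaging 818 bp (about 1.36 Gbp).
    total_bases = 1.66e6 * 818
    # Hypothetical community members: (name, genome size in bp, fraction of sampled DNA).
    members = [("abundant", 2.0e6, 0.10), ("rare", 2.0e6, 0.001)]
    for name, size_bp, fraction in members:
        c = mean_coverage(total_bases, size_bp, fraction)
        print(f"{name}: mean depth {c:.2f}, uncovered fraction {fraction_uncovered(c):.3g}")
```

A single depth threshold tuned to the sample-wide average would sit far below the abundant genome's depth, which is the situation the text describes as misclassifying the most abundant organisms as repetitive.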

1The Institute for Biological Energy Alternatives, 2The Center for the Advancement of Genomics, 1901 Research Boulevard, Rockville, MD 20850, USA. 3The Institute for Genomic Research, 9712 Medical Center Drive, Rockville, MD 20850, USA. 4The J. Craig Venter Science Foundation Joint Technology Center, 5 Research Place, Rockville, MD 20850, USA. 5University of Southern California, 223 Science Hall, Los Angeles, CA 90089–0740, USA. 6Bermuda Biological Station for Research, Inc., 17 Biological Lane, St George GE 01, Bermuda.

*To whom correspondence should be addressed. E-mail: [email protected]

RESEARCH ARTICLE

66 2 APRIL 2004 VOL 304 SCIENCE www.sciencemag.org

Page 98: Eisen Lecture for Ian Korf genomics course

Shotgun metagenomics

shotgun sequence

Metagenomics

Page 99: Eisen Lecture for Ian Korf genomics course

Acid Mine Drainage 2004

environmental sample, however, variation within each species population might complicate assembly. If intraspecies variation is dominated by limited local polymorphism or homologous recombination, it should be possible to define a composite genome for each species population. Conversely, if the genomic heterogeneity within a species is dominated by large rearrangements, deletions, or insertions, it may be impossible to define composite genomes for species populations from natural communities.

A small insert plasmid library (average insert size 3.2 kilobases (kb)) was constructed from the biofilm DNA for random shotgun sequencing (see Supplementary Information). A total of 76.2 million base pairs (bp) of DNA sequence was generated from 103,462 high-quality reads (averaging 737 bp per read). Analysis of raw shotgun data (Supplementary Figs S1–5) indicated the presence of both bacterial and archaeal genomes at sequence coverages of up to 10×, which would be sufficient to produce a high-quality assembly from a conventional microbial genome project (refs 20, 21). The shotgun data set was assembled with JAZZ, a whole-genome shotgun assembler (ref. 22). Anticipating polymorphisms, we permitted alignment discrepancies beyond those expected from sequencing error if they were consistent with end-pairing constraints. Over 85% of the shotgun reads were assembled into scaffolds longer than 2 kb (a scaffold is a reconstructed genomic region that may contain gaps of a known size range). The combined length of the 1,183 scaffolds is 10.83 megabases (Mb). The assembly is internally self consistent, with 97.2% of end pairs from the same clone assembled with the appropriate orientation and separation, as expected for a low rate of mispairing error (tracking and chimaeric clones).

The first step in assignment of scaffolds to organism types was to separate the scaffolds by average G+C content. These were subsequently subdivided using read depth (coverage). Dinucleotide frequencies did not allow for further subdivision. Notably, separation of scaffolds into low G+C (<43.5%; Supplementary Fig. S3a) and high G+C (≥43.5%) content ‘bins’ was not significantly compromised by local heterogeneities in G+C content because the scaffolds were binned after assembly. As the scaffolds are typically tens of kilobases long, local fluctuations in G+C content are averaged over the length of each scaffold, allowing, in most cases (>99%), clear assignment to bins of high or low G+C content.

The high G+C scaffolds at approximately 10× coverage (70 scaffolds up to 137 kb in length, totalling 2.23 Mb) were identified by the presence of a single 16S rRNA gene as belonging to the genome of a Leptospirillum group II species. The average G+C content (55.8%) is comparable to the G+C content (54.9–58%) of L. ferriphilum (ref. 19). The total high G+C scaffold length is close to the estimated genome size of Leptospirillum ferrooxidans (ref. 23) (1.9 Mb). This suggests that essentially the entire Leptospirillum group II genome was recovered from the community DNA.

The low G+C scaffolds at approximately 10× coverage were assembled into 59 scaffolds of up to 138 kb in length, totalling 1.82 Mb. The single 16S rRNA gene identified in these scaffolds was 99% identical to that of the fer1 isolate; however, alignment of the scaffolds to the fer1 genome revealed an average of 22% divergence at the nucleotide level (Supplementary Fig. S6). The total scaffold length is close to the genome size of fer1 (1.9 Mb; Allen et al., unpublished data), and local gene order and content are highly conserved (Supplementary Fig. S7). Therefore, these 59 scaffolds represent a nearly complete genome of a previously unknown, uncultured Ferroplasma species distinct from fer1. We designate this as Ferroplasma type II. The dominance of this organism type was unexpected before the genomic analysis.

We assigned the roughly 3× coverage, high G+C scaffolds to Leptospirillum group III on the basis of rRNA markers (474 scaffolds up to 31 kb, totalling 2.66 Mb). Comparison of these scaffolds with those assigned to Leptospirillum group II indicates significant sequence divergence and only locally conserved gene order, confirming that the scaffolds belong to a relatively distant relative of Leptospirillum group II. A partial 16S rRNA gene sequence from Sulfobacillus thermosulfidooxidans was identified in the unassembled reads, suggesting very low coverage of this organism. If any Sulfobacillus scaffolds >2 kb were assembled, they would be grouped with the Leptospirillum group III scaffolds.

We compared the 3× coverage, low G+C scaffolds (580 scaffolds, 4.12 Mb) to the fer1 genome in order to assign them to organism types (Supplementary Fig. S6). Scaffolds with ≥96% nucleotide identity to fer1 were assigned to an environmental Ferroplasma type I genome (170 scaffolds up to 47 kb in length and comprising 1.48 Mb of sequence). The remaining low-coverage, low G+C scaffolds are tentatively assigned to G-plasma. The largest scaffold in this bin (62 kb) contains the G-plasma 16S rRNA gene. The 410 scaffolds assigned to G-plasma comprise 2.65 Mb of sequence. A partial 16S rRNA gene sequence from A-plasma was identified in the unassembled reads, suggesting low coverage of this organism. Any scaffolds from A-plasma >2 kb would be included in the G-plasma bin. Although eukaryotes are present in the AMD system, they were in low abundance in the biofilm studied. So far, no scaffolds from eukaryotes have been detected.

As independent evidence that the Leptospirillum group II and Ferroplasma type II genomes are nearly complete, we located a full complement of transfer RNA synthetases in each genome data set. An almost complete set of these genes was also recovered from Leptospirillum group III. The G-plasma bin contains more than a full set of tRNA synthetases, consistent with inclusion of some A-plasma scaffolds. In addition, we established that the Leptospirillum group II, Leptospirillum group III, Ferroplasma type I, Ferroplasma type II and G-plasma bins contained only one set of rRNA genes.
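The two-step binning described in the excerpt above (average G+C content first, then read depth) can be sketched in a few lines of Python. This is only an illustration, not the authors' pipeline: the 43.5% G+C cut comes from the text, while the `Scaffold` class, the function names, and the 6× coverage cut (any value between the roughly 3× and roughly 10× populations would do) are assumptions made for the example.

```python
from dataclasses import dataclass

GC_THRESHOLD = 0.435    # from the excerpt: low G+C < 43.5% <= high G+C
COVERAGE_SPLIT = 6.0    # assumption: any cut between the ~3x and ~10x groups


@dataclass
class Scaffold:
    """Hypothetical container for one assembled scaffold."""
    name: str
    sequence: str
    coverage: float  # mean read depth over the scaffold


def gc_fraction(seq: str) -> float:
    """Fraction of G and C among unambiguous bases (N and gaps are ignored)."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / acgt


def assign_bin(scaffold: Scaffold) -> str:
    """Label a scaffold by G+C class first, then by coverage class."""
    gc_label = "high_gc" if gc_fraction(scaffold.sequence) >= GC_THRESHOLD else "low_gc"
    cov_label = "high_cov" if scaffold.coverage >= COVERAGE_SPLIT else "low_cov"
    return f"{gc_label}/{cov_label}"
```

Because G+C is averaged over a whole scaffold (typically tens of kilobases), short local fluctuations have little effect on the label, which is the point the excerpt makes about binning after assembly.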

Figure 1 The pink biofilm. a, Photograph of the biofilm in the Richmond mine (hand included for scale). b, FISH image of a. Probes targeting bacteria (EUBmix; fluorescein isothiocyanate (green)) and archaea (ARC915; Cy5 (blue)) were used in combination with a probe targeting the Leptospirillum genus (LF655; Cy3 (red)). Overlap of red and green (yellow) indicates Leptospirillum cells and shows the dominance of Leptospirillum. c, Relative microbial abundances determined using quantitative FISH counts.

articles

NATURE | doi:10.1038/nature02340 | www.nature.com/nature 2 © 2004 Nature Publishing Group

Page 100: Eisen Lecture for Ian Korf genomics course

inputs of fixed carbon or nitrogen from external sources. As with Leptospirillum group I, both Leptospirillum group II and III have the genes needed to fix carbon by means of the Calvin–Benson–Bassham cycle (using type II ribulose 1,5-bisphosphate carboxylase–oxygenase). All genomes recovered from the AMD system contain formate hydrogenlyase complexes. These, in combination with carbon monoxide dehydrogenase, may be used for carbon fixation via the reductive acetyl coenzyme A (acetyl-CoA) pathway by some, or all, organisms. Given the large number of ABC-type sugar and amino acid transporters encoded in the Ferroplasma type

Figure 4 Cell metabolic cartoons constructed from the annotation of 2,180 ORFs identified in the Leptospirillum group II genome (63% with putative assigned function) and 1,931 ORFs in the Ferroplasma type II genome (58% with assigned function). The cell cartoons are shown within a biofilm that is attached to the surface of an acid mine drainage stream (viewed in cross-section). Tight coupling between ferrous iron oxidation, pyrite dissolution and acid generation is indicated. Rubisco, ribulose 1,5-bisphosphate carboxylase–oxygenase. THF, tetrahydrofolate.


Page 101: Eisen Lecture for Ian Korf genomics course

Environmental Genome Shotgun Sequencing of the Sargasso Sea

J. Craig Venter,1* Karin Remington,1 John F. Heidelberg,3 Aaron L. Halpern,2 Doug Rusch,2 Jonathan A. Eisen,3 Dongying Wu,3 Ian Paulsen,3 Karen E. Nelson,3 William Nelson,3 Derrick E. Fouts,3 Samuel Levy,2 Anthony H. Knap,6 Michael W. Lomas,6 Ken Nealson,5 Owen White,3 Jeremy Peterson,3 Jeff Hoffman,1 Rachel Parsons,6 Holly Baden-Tillson,1 Cynthia Pfannkoch,1 Yu-Hui Rogers,4 Hamilton O. Smith1

We have applied “whole-genome shotgun sequencing” to microbial populations collected en masse on tangential flow and impact filters from seawater samples collected from the Sargasso Sea near Bermuda. A total of 1.045 billion base pairs of nonredundant sequence was generated, annotated, and analyzed to elucidate the gene content, diversity, and relative abundance of the organisms within these environmental samples. These data are estimated to derive from at least 1800 genomic species based on sequence relatedness, including 148 previously unknown bacterial phylotypes. We have identified over 1.2 million previously unknown genes represented in these samples, including more than 782 new rhodopsin-like photoreceptors. Variation in species present and stoichiometry suggests substantial oceanic microbial diversity.

Microorganisms are responsible for most of the biogeochemical cycles that shape the environment of Earth and its oceans. Yet, these organisms are the least well understood on Earth, as the ability to study and understand the metabolic potential of microorganisms has been hampered by the inability to generate pure cultures. Recent studies have begun to explore environmental bacteria in a culture-independent manner by isolating DNA from environmental samples and transforming it into large insert clones. For example, a previously unknown light-driven proton pump, proteorhodopsin, was discovered within a bacterial artificial chromosome (BAC) from the genome of a SAR86 ribotype (1), and soil microbial DNA libraries have been constructed and screened for specific activities (2).

Here we have applied whole-genome shotgun sequencing to environmental-pooled DNA samples to test whether new genomic approaches can be effectively applied to gene and species discovery and to overall environmental characterization. To help ensure a tractable pilot study, we sampled in the Sargasso Sea, a nutrient-limited, open ocean environment. Further, we concentrated on the genetic material captured on filters sized to isolate primarily microbial inhabitants of the environment, leaving detailed analysis of dissolved DNA and viral particles on one end of the size spectrum and eukaryotic inhabitants on the other, for subsequent studies.

The Sargasso Sea. The northwest Sargasso Sea, at the Bermuda Atlantic Time-series Study site (BATS), is one of the best-studied and arguably most well-characterized regions of the global ocean. The Gulf Stream represents the western and northern boundaries of this region and provides a strong physical boundary, separating the low nutrient, oligotrophic open ocean from the more nutrient-rich waters of the U.S. continental shelf. The Sargasso Sea has been intensively studied as part of the 50-year time series of ocean physics and biogeochemistry (3, 4) and provides an opportunity for interpretation of environmental genomic data in an oceanographic context. In this region, formation of subtropical mode water occurs each winter as the passage of cold fronts across the region erodes the seasonal thermocline and causes convective mixing, resulting in mixed layers of 150 to 300 m depth. The introduction of nutrient-rich deep water, following the breakdown of seasonal thermoclines into the brightly lit surface waters, leads to the blooming of single cell phytoplankton, including two cyanobacteria species, Synechococcus and Prochlorococcus, that numerically dominate the photosynthetic biomass in the Sargasso Sea.

Surface water samples (170 to 200 liters) were collected aboard the RV Weatherbird II from three sites off the coast of Bermuda in February 2003. Additional samples were collected aboard the SV Sorcerer II from “Hydrostation S” in May 2003. Sample site locations are indicated on Fig. 1 and described in table S1; sampling protocols were fine-tuned from one expedition to the next (5). Genomic DNA was extracted from filters of 0.1 to 3.0 μm, and genomic libraries with insert sizes ranging from 2 to 6 kb were made as described (5). The prepared plasmid clones were sequenced from both ends to provide paired-end reads at the J. Craig Venter Science Foundation Joint Technology Center on ABI 3730XL DNA sequencers (Applied Biosystems, Foster City, CA). Whole-genome random shotgun sequencing of the Weatherbird II samples (table S1, samples 1 to 4) produced 1.66 million reads averaging 818 bp in length, for a total of approximately 1.36 Gbp of microbial DNA sequence. An additional 325,561 sequences were generated from the Sorcerer II samples (table S1, samples 5 to 7), yielding approximately 265 Mbp of DNA sequence.

Environmental genome shotgun assembly. Whole-genome shotgun sequencing projects have traditionally been applied to identify the genome sequence(s) from one particular organism, whereas the approach taken here is intended to capture representative sequence from many diverse organisms simultaneously. Variation in genome size and relative abundance determines the depth of coverage of any particular organism in the sample at a given level of sequencing and has strong implications for both the application of assembly algorithms and for the metrics used in evaluating the resulting assembly. Although we would expect abundant species to be deeply covered and well assembled, species of lower abundance may be represented by only a few sequences. For a single genome analysis, assembly coverage depth in unique regions should approximate a Poisson distribution. The mean of this distribution can be estimated from the observed data, looking at the depth of coverage of contigs generated before any scaffolding. The assembler used in this study, the Celera Assembler (6), uses this value to heuristically identify clearly unique regions to form the backbone of the final assembly within the scaffolding phase. However, when the starting material consists of a mixture of genomes of varying abundance, a threshold estimated in this way would classify samples from the most abundant organism(s) as repetitive, due to their greater-than-average depth of coverage, paradoxically leaving the most abundant organisms poorly assembled. We therefore used manual curation of an initial


Page 102: Eisen Lecture for Ian Korf genomics course

Sargasso Sea

assembly to identify a set of large, deeply assembling nonrepetitive contigs. This was used to set the expected coverage in unique regions (to 23×) for a final run of the assembler. This allowed the deep contigs to be treated as unique sequence when they would otherwise be labeled as repetitive. We evaluated our final assembly results in a tiered fashion, looking at well-sampled genomic regions separately from those barely sampled at our current level of sequencing.
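The effect of the expected-coverage setting can be illustrated with a small Poisson calculation. The sketch below is not the Celera Assembler's actual statistic; it only assumes, as the text does, that depth in unique regions is roughly Poisson, and it flags a contig as "repetitive" when its depth is improbably high for the expected unique depth. The function names, the `alpha` cutoff, and the sample-wide mean of 3× used in the usage note are hypothetical; only the 23× value comes from the paper.

```python
import math


def poisson_pmf(k: int, mean: float) -> float:
    """P(X = k) for a Poisson distribution with the given mean."""
    return math.exp(-mean) * mean ** k / math.factorial(k)


def poisson_upper_tail(k: int, mean: float) -> float:
    """P(X >= k): chance of seeing depth k or more in a unique region."""
    return 1.0 - sum(poisson_pmf(i, mean) for i in range(k))


def looks_repetitive(depth: int, expected_unique_depth: float,
                     alpha: float = 1e-3) -> bool:
    """Flag a contig whose depth is implausibly high for unique sequence.

    alpha is an illustrative cutoff, not a value taken from the paper.
    """
    return poisson_upper_tail(depth, expected_unique_depth) < alpha
```

With an expected depth estimated from a mixed sample (say 3×, an assumed value), a contig at 23× is flagged: `looks_repetitive(23, 3.0)` is `True`. After the expected coverage is reset to 23×, the same contig is treated as unique: `looks_repetitive(23, 23.0)` is `False`.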

The 1.66 million sequences from the Weatherbird II samples (table S1; samples 1 to 4; stations 3, 11, and 13), were pooled and assembled to provide a single master assembly for comparative purposes. The assembly generated 64,398 scaffolds ranging in size from 826 bp to 2.1 Mbp, containing 256 Mbp of unique sequence and spanning 400 Mbp. After assembly, there remained 217,015 paired-end reads, or “mini-scaffolds,” spanning 820.7 Mbp as well as an additional 215,038 unassembled singleton reads covering 169.9 Mbp (table S2, column 1). The Sorcerer II samples provided almost no assembly, so we consider for these samples only the 153,458 mini-scaffolds, spanning 518.4 Mbp, and the remaining 18,692 singleton reads (table S2, column 2). In total, 1.045 Gbp of nonredundant sequence was generated. The lack of overlapping reads within the unassembled set indicates that lack of additional assembly was not due to algorithmic limitations but to the relatively limited depth of sequencing coverage given the level of diversity within the sample.

The whole-genome shotgun (WGS) assembly has been deposited at DDBJ/EMBL/GenBank under the project accession AACY00000000, and all traces have been deposited in a corresponding TraceDB trace archive. The version described in this paper is the first version, AACY01000000. Unlike a conventional WGS entry, we have deposited not just contigs and scaffolds but the unassembled paired singletons and individual singletons in order to accurately reflect the diversity in the sample and allow searches across the entire sample within a single database.

Genomes and large assemblies. Our analysis first focused on the well-sampled genomes by characterizing scaffolds with at least 3× coverage depth. There were 333 scaffolds comprising 2226 contigs and spanning 30.9 Mbp that met this criterion (table S3), accounting for roughly 410,000 reads, or 25% of the pooled assembly data set. From this set of well-sampled material, we were able to cluster and classify assemblies by organism; from the rare species in our sample, we used sequence similarity based methods together with computational gene finding to obtain both qualitative and quantitative estimates of genomic and functional diversity within this particular marine environment.

We employed several criteria to sort the major assembly pieces into tentative organism “bins”; these include depth of coverage, oligonucleotide frequencies (7), and similarity to previously sequenced genomes (5). With these techniques, the majority of sequence assigned to the most abundant species (16.5 Mbp of the 30.9 Mb in the main scaffolds) could be separated based on several corroborating indicators. In particular, we identified a distinct group of scaffolds representing an abundant population clearly related to Burkholderia (fig. S2) and two groups of scaffolds representing two distinct strains closely related to the published Shewanella oneidensis genome (8) (fig. S3). There is a group of scaffolds assembling at over 6× coverage that appears to represent the genome of a SAR86 (table S3). Scaffold sets representing a conglomerate of Prochlorococcus strains (Fig. 2), as well as an uncultured marine archaeon, were also identified (table S3; Fig. 3). Additionally, 10 putative mega plasmids were found in the main scaffold set, covered at depths ranging from 4× to 36× (indicated with shading in table S3 with nine depicted in

Fig. 1. MODIS-Aqua satellite image of ocean chlorophyll in the Sargasso Sea grid about the BATS site from 22 February 2003. The station locations are overlain with their respective identifications. Note the elevated levels of chlorophyll (green color shades) around station 3, which are not present around stations 11 and 13.

Fig. 2. Gene conservation among closely related Prochlorococcus. The outermost concentric circle of the diagram depicts the completed genomic sequence of Prochlorococcus marinus MED4 (11). Fragments from environmental sequencing were compared to this completed Prochlorococcus genome and are shown in the inner concentric circles and were given boxed outlines. Genes for the outermost circle have been assigned pseudospectrum colors based on the position of those genes along the chromosome, where genes nearer to the start of the genome are colored in red, and genes nearer to the end of the genome are colored in blue. Fragments from environmental sequencing were subjected to an analysis that identifies conserved gene order between those fragments and the completed Prochlorococcus MED4 genome. Genes on the environmental genome segments that exhibited conserved gene order are colored with the same color assignments as the Prochlorococcus MED4 chromosome. Colored regions on the environmental segments exhibiting color differences from the adjacent outermost concentric circle are the result of conserved gene order with other MED4 regions and probably represent chromosomal rearrangements. Genes that did not exhibit conserved gene order are colored in black.

www.sciencemag.org SCIENCE VOL 304 2 APRIL 2004 67

Page 103: Eisen Lecture for Ian Korf genomics course

identified and curated genes. With the vast majority of the Sargasso sequence in short (less than 10 kb), unassociated scaffolds and singletons from hundreds of different organisms, it is impractical to apply this approach. Instead, we developed an evidence-based gene finder (5). Briefly, evidence in the form of protein alignments to sequences in the bacterial portion of the nonredundant amino acid (nraa) data set (13) was used to determine the most likely coding frame. Likewise, approximate start and stop positions were determined from the bounding coordinates of the alignments and refined to identify specific start and stop codons. This approach identified 1,214,207 genes covering over 700 MB of the total data set. This represents approximately an order of magnitude more sequences than currently archived in the curated SwissProt database (14), which contains 137,885 sequence entries at the time of writing; roughly the same number of sequences as have been deposited into the uncurated REM-TrEMBL database (14) since its inception in 1996. After excluding all intervals covered by previously identified genes, additional hypothetical genes were identified on the basis of the presence of conserved open reading frames (5). A total of 69,901 novel genes belonging to 15,601 single link clusters were identified. The predicted genes were categorized

Fig. 5. Prochlorococcus-related scaffold 2223290 illustrates the assembly of a broad community of closely related organisms, distinctly nonpunctate in nature. The image represents (A) global structure of Scaffold 2223290 with respect to assembly and (B) a sample of the multiple sequence alignment. Blue segments, contigs; green segments, fragments; and yellow segments, stages of the assembly of fragments into the resulting contigs. The yellow bars indicate that fragments were initially assembled in several different pieces, which in places collapsed to form the final contig structure. The multiple sequence alignment for this region shows a homogenous blend of haplotypes, none with sufficient depth of coverage to provide a separate assembly.
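The evidence-based gene finding described in the text on this slide (a protein alignment fixes the coding frame and gives approximate bounds, which are then refined to specific start and stop codons) can be sketched as follows. This is a simplified illustration under stated assumptions, not the published gene finder: it handles the forward strand only, takes the alignment bounds as already codon-aligned, and picks the most upstream in-frame start codon not separated from the alignment by a stop. The function name `refine_orf` and the start-codon set are assumptions.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODONS = {"ATG", "GTG", "TTG"}  # assumption: common bacterial starts


def refine_orf(seq: str, aln_start: int, aln_end: int):
    """Refine approximate alignment bounds to start/stop codon positions.

    seq        nucleotide fragment (forward strand only in this sketch)
    aln_start  0-based start of the protein alignment, on a codon boundary
    aln_end    0-based exclusive end, in the same frame as aln_start
    Returns (start, stop) as 0-based, end-exclusive coordinates; either
    value is None when no codon is found before the fragment edge.
    """
    if (aln_end - aln_start) % 3 != 0:
        raise ValueError("alignment bounds must be in the same frame")
    seq = seq.upper()

    # Walk downstream, codon by codon, to the first in-frame stop.
    stop = None
    for i in range(aln_end, len(seq) - 2, 3):
        if seq[i:i + 3] in STOP_CODONS:
            stop = i + 3
            break

    # Walk upstream in frame; keep the most upstream start codon seen
    # before an in-frame stop (or the fragment edge) ends the search.
    start = None
    i = aln_start
    while i >= 0:
        codon = seq[i:i + 3]
        if codon in STOP_CODONS:
            break
        if codon in START_CODONS:
            start = i
        i -= 3
    return start, stop
```

For short environmental fragments a `None` on either side is common, since many genes run off the end of a read; a fuller implementation would record such genes as partial rather than discard them.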

Table 1. Gene count breakdown by TIGR role category. Gene set includes those found on assemblies from samples 1 to 4 and fragment reads from samples 5 to 7. A more detailed table, separating Weatherbird II samples from the Sorcerer II samples is presented in the SOM (table S4). Note that there are 28,023 genes which were classified in more than one role category.

TIGR role category                                            Total genes
Amino acid biosynthesis                                            37,118
Biosynthesis of cofactors, prosthetic groups, and carriers         25,905
Cell envelope                                                      27,883
Cellular processes                                                 17,260
Central intermediary metabolism                                    13,639
DNA metabolism                                                     25,346
Energy metabolism                                                  69,718
Fatty acid and phospholipid metabolism                             18,558
Mobile and extrachromosomal element functions                       1,061
Protein fate                                                       28,768
Protein synthesis                                                  48,012
Purines, pyrimidines, nucleosides, and nucleotides                 19,912
Regulatory functions                                                8,392
Signal transduction                                                 4,817
Transcription                                                      12,756
Transport and binding proteins                                     49,185
Unknown function                                                   38,067
Miscellaneous                                                       1,864
Conserved hypothetical                                            794,061

Total number of roles assigned                                  1,242,230

Total number of genes                                           1,214,207
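The two totals in Table 1 can be related with a line of arithmetic. The sketch below only uses numbers printed in the table; the interpretation in the comments (that the surplus of role assignments over genes equals the 28,023 multiply-classified genes) holds only if each such gene received exactly two roles, which is an assumption and not something the caption states.

```python
# Values copied from Table 1.
TOTAL_ROLES_ASSIGNED = 1_242_230
TOTAL_GENES = 1_214_207
CONSERVED_HYPOTHETICAL = 794_061

# Surplus role assignments: one extra per gene placed in a second category
# (assuming no gene was placed in three or more categories).
extra_role_assignments = TOTAL_ROLES_ASSIGNED - TOTAL_GENES

# Share of all role assignments that fall under "Conserved hypothetical".
conserved_hypothetical_share = CONSERVED_HYPOTHETICAL / TOTAL_ROLES_ASSIGNED

print(extra_role_assignments)
print(round(conserved_hypothetical_share, 3))
```

The surplus matches the 28,023 figure in the caption, and the share shows that most assignments carry no specific functional role.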

www.sciencemag.org SCIENCE VOL 304 2 APRIL 2004 69

Fig. 4). Other organisms were not so readily separated, presumably reflecting some combination of shorter assemblies with less “taxonomic signal,” less distinctive sequence, and greater divergence from previously sequenced genomes (9).

Discrete species versus a population continuum. The most deeply covered of the scaffolds (21 scaffolds with over 14× coverage and 9.35 Mb of sequence), contain just over 1 single nucleotide polymorphism (SNP) per 10,000 base pairs, strongly supporting the presence of discrete species within the sample. In the remaining main scaffolds (table S3), the SNP rate ranges from 0 to 26 per 1000 bp, with a length-weighted average of 3.6 per 1000 bp. We closely examined the multiple sequence alignments of the contigs with high SNP rates and were able to classify these into two fairly distinct classes: regions where several closely related haplotypes have been collapsed, increasing the depth of coverage accordingly (10), and regions that appear to be a relatively homogenous blend of discrepancies from the consensus without any apparent separation into haplotypes, such as the Prochlorococcus scaffold region (Fig. 5). Indeed, the Prochlorococcus scaffolds display considerable heterogeneity not only at the nucleotide sequence level (Fig. 5) but also at the genomic level, where multiple scaffolds align with the same region of the MED4 (11) genome but differ due to gene or genomic island insertion, deletion, rearrangement events. This observation is consistent with previous findings (12). For instance, scaffolds 2221918 and 2223700 share gene synteny with each other and MED4 but differ by the insertion of 15 genes of probable phage origin, likely representing an integrated bacteriophage. These genomic differences are displayed graphically in Fig. 2, where it is evident that up to four conflicting scaffolds can align with the same region of the MED4 genome. More than 85% of the Prochlorococcus MED4 genome can be aligned with Sargasso Sea scaffolds greater than 10 kb; however, there appear to be a couple of regions of MED4 that are not represented in the 10-kb scaffolds (Fig. 2). The larger of these two regions (PMM1187 to PMM1277) consists primarily of a gene cluster coding for surface polysaccharide biosynthesis, which may represent a MED4-specific polysaccharide absent or highly diverged in our Sargasso Sea Prochlorococcus bacteria. The heterogeneity of the Prochlorococcus scaffolds suggests that the scaffolds are not derived from a single discrete strain, but instead probably represent a conglomerate assembled from a population of closely related Prochlorococcus biotypes.

The gene complement of the Sargasso. The heterogeneity of the Sargasso sequences complicates the identification of microbial genes. The typical approach for microbial annotation, model-based gene finding, relies entirely on training with a subset of manually

Fig. 3. Comparison of Sargasso Sea scaffolds to Crenarchaeal clone 4B7. Predicted proteins from 4B7 and the scaffolds showing significant homology to 4B7 by tBLASTx are arrayed in positional order along the x and y axes. Colored boxes represent BLASTp matches scoring at least 25% similarity and with an e value of better than 1e-5. Black vertical and horizontal lines delineate scaffold borders.

Fig. 4. Circular diagrams of nine complete megaplasmids. Genes encoded in the forward direction are shown in the outer concentric circle; reverse coding genes are shown in the inner concentric circle. The genes have been given role category assignment and colored accordingly: amino acid biosynthesis, violet; biosynthesis of cofactors, prosthetic groups, and carriers, light blue; cell envelope, light green; cellular processes, red; central intermediary metabolism, brown; DNA metabolism, gold; energy metabolism, light gray; fatty acid and phospholipid metabolism, magenta; protein fate and protein synthesis, pink; purines, pyrimidines, nucleosides, and nucleotides, orange; regulatory functions and signal transduction, olive; transcription, dark green; transport and binding proteins, blue-green; genes with no known homology to other proteins and genes with homology to genes with no known function, white; genes of unknown function, gray. Tick marks are placed on 10-kb intervals.
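The text on this slide quotes a length-weighted average SNP rate for the main scaffolds (3.6 per 1000 bp). A length-weighted mean lets long scaffolds count for more than short ones, and is equivalent to dividing total SNPs by total length. The sketch below shows the calculation with made-up numbers; the function name and the example scaffolds are hypothetical, not data from the paper.

```python
def length_weighted_mean(rates_per_kb, lengths_bp):
    """Mean of per-scaffold SNP rates, weighted by scaffold length."""
    total_length = sum(lengths_bp)
    if total_length == 0:
        raise ValueError("no sequence to average over")
    weighted_sum = sum(rate * length
                       for rate, length in zip(rates_per_kb, lengths_bp))
    return weighted_sum / total_length


# Hypothetical example: a short, clean scaffold and a long, variable one.
example_rates = [1.0, 5.0]          # SNPs per 1000 bp
example_lengths = [10_000, 30_000]  # bp

weighted = length_weighted_mean(example_rates, example_lengths)
unweighted = sum(example_rates) / len(example_rates)
```

In this toy case the weighted mean (4.0 per 1000 bp) exceeds the simple mean (3.0) because the longer scaffold carries the higher rate.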

2 APRIL 2004 VOL 304 SCIENCE www.sciencemag.org 68

assembly to identify a set of large, deeply as-sembling nonrepetitive contigs. This was used toset the expected coverage in unique regions (to23!) for a final run of the assembler. This al-lowed the deep contigs to be treated as uniquesequence when they would otherwise be labeledas repetitive. We evaluated our final assemblyresults in a tiered fashion, looking at well-sampledgenomic regions separately from those barelysampled at our current level of sequencing.

The 1.66 million sequences from theWeatherbird II samples (table S1; samples 1 to4; stations 3, 11, and 13), were pooled andassembled to provide a single master assemblyfor comparative purposes. The assembly gener-ated 64,398 scaffolds ranging in size from 826bp to 2.1 Mbp, containing 256 Mbp of uniquesequence and spanning 400 Mbp. After assem-bly, there remained 217,015 paired-end reads,or “mini-scaffolds,” spanning 820.7 Mbp aswell as an additional 215,038 unassembled sin-gleton reads covering 169.9 Mbp (table S2,column 1). The Sorcerer II samples providedalmost no assembly, so we consider for thesesamples only the 153,458 mini-scaffolds, span-ning 518.4 Mbp, and the remaining 18,692singleton reads (table S2, column 2). In total,1.045 Gbp of nonredundant sequence was gen-erated. The lack of overlapping reads within theunassembled set indicates that lack of addition-al assembly was not due to algorithmic limita-tions but to the relatively limited depth of se-quencing coverage given the level of diversitywithin the sample.

The whole-genome shotgun (WGS) assemblyhas been deposited at DDBJ/EMBL/GenBankunder the project accession AACY00000000,and all traces have been deposited in a corre-sponding TraceDB trace archive. The versiondescribed in this paper is the first version,AACY01000000. Unlike a conventional WGSentry, we have deposited not just contigs andscaffolds but the unassembled paired singletonsand individual singletons in order to accurate-ly reflect the diversity in the sample andallow searches across the entire sample with-in a single database.Genomes and large assemblies. Our

analysis first focused on the well-sampled ge-nomes by characterizing scaffolds with at least3! coverage depth. There were 333 scaffoldscomprising 2226 contigs and spanning 30.9Mbp that met this criterion (table S3), account-ing for roughly 410,000 reads, or 25% of thepooled assembly data set. From this set of well-sampled material, we were able to cluster andclassify assemblies by organism; from the rarespecies in our sample, we used sequence similar-ity based methods together with computationalgene finding to obtain both qualitative and quan-titative estimates of genomic and functional diver-sity within this particular marine environment.

We employed several criteria to sort themajor assembly pieces into tentative organism“bins”; these include depth of coverage, oligo-

nucleotide frequencies (7), and similarity topreviously sequenced genomes (5). With thesetechniques, the majority of sequence assignedto the most abundant species (16.5 Mbp of the30.9 Mb in the main scaffolds) could be sepa-rated based on several corroborating indicators.In particular, we identified a distinct group ofscaffolds representing an abundant populationclearly related to Burkholderia (fig. S2) andtwo groups of scaffolds representing two dis-tinct strains closely related to the published

Shewanella oneidensis genome (8) (fig. S3).There is a group of scaffolds assembling at over6! coverage that appears to represent the ge-nome of a SAR86 (table S3). Scaffold setsrepresenting a conglomerate of Prochlorococ-cus strains (Fig. 2), as well as an unculturedmarine archaeon, were also identified (table S3;Fig. 3). Additionally, 10 putative mega plasmidswere found in the main scaffold set, coveredat depths ranging from 4! to 36! (indicatedwith shading in table S3 with nine depicted in

Fig. 1. MODIS-Aqua satellite image ofocean chlorophyll in the Sargasso Sea gridabout the BATS site from 22 February2003. The station locations are overlainwith their respective identifications. Notethe elevated levels of chlorophyll (greencolor shades) around station 3, which arenot present around stations 11 and 13.

Fig. 2. Gene conservation among closely related Prochlorococcus. The outermost concentric circle of the diagram depicts the completed genomic sequence of Prochlorococcus marinus MED4 (11). Fragments from environmental sequencing were compared to this completed Prochlorococcus genome and are shown in the inner concentric circles and were given boxed outlines. Genes for the outermost circle have been assigned pseudospectrum colors based on the position of those genes along the chromosome, where genes nearer to the start of the genome are colored in red, and genes nearer to the end of the genome are colored in blue. Fragments from environmental sequencing were subjected to an analysis that identifies conserved gene order between those fragments and the completed Prochlorococcus MED4 genome. Genes on the environmental genome segments that exhibited conserved gene order are colored with the same color assignments as the Prochlorococcus MED4 chromosome. Colored regions on the environmental segments exhibiting color differences from the adjacent outermost concentric circle are the result of conserved gene order with other MED4 regions and probably represent chromosomal rearrangements. Genes that did not exhibit conserved gene order are colored in black.

Research Article. Science, vol. 304, 2 April 2004, p. 67 (www.sciencemag.org)

Page 104: Eisen Lecture for Ian Korf genomics course

rRNA phylotyping from metagenomics

Venter et al., 2004


Page 105: Eisen Lecture for Ian Korf genomics course

Shotgun Sequencing Allows Alternative Anchors (e.g., RecA)

Venter et al., 2004


Page 106: Eisen Lecture for Ian Korf genomics course

using the curated TIGR role categories (5). A breakdown of predicted genes by category is given in Table 1.

The samples analyzed here represent only specific size fractions of the sampled environment, dictated by the pore size of the collection filters. By our selection of filter pore sizes, we deliberately focused this initial study on the identification and analysis of microbial organisms. However, we did examine the data for the presence of eukaryotic content as well. Although the bulk of known protists are 10 µm and larger, there are some known in the range of 1 to 1.5 µm in diameter [for example, Ostreococcus tauri (15) and the Bolidomonas species (16)], and such organisms could potentially work their way through a 0.80 µm prefilter. An initial screening for 18S ribosomal RNA (rRNA), a commonly used eukaryotic marker, identified 69 18S rRNA genes, with 63 of these on singletons and the remaining 6 on very small, low-coverage assemblies. These 18S rRNAs are similar to uncultured marine eukaryotes and are indicative of a eukaryotic presence but inconclusive on their own. Because bacterial DNA contains a much greater density of genes than eukaryotic DNA, the relative proportion of gene content can be used as another indicator to distinguish eukaryotic material in our sample. An inverse relation was observed between the pore size of the pre-filters and collection filters and the fraction of sequence coding for genes (table S5). This relation, together with the presence of 18S rRNA genes in the samples, is strong evidence that eukaryotic material was indeed captured.

Diversity and species richness. Most phylogenetic surveys of uncultured organisms have been based on studies of rRNA genes using polymerase chain reaction (PCR) with primers for highly conserved positions in those genes. More than 60,000 small subunit rRNA sequences from a wide diversity of prokaryotic taxa have been reported (17). However, PCR-based studies are inherently biased, because not all rRNA genes amplify with the same “universal” primers. Within our shotgun sequence data and assemblies, we identified 1164 distinct small subunit rRNA genes or fragments of genes in the Weatherbird II assemblies and another 248 within the Sorcerer II reads (5). Using a 97% sequence similarity cutoff to distinguish unique phylotypes, we identified 148 previously unknown phylotypes in our sample when compared against the RDP II database (17). With a 99% similarity cutoff, this number increases to 643. Though sequence similarity is not necessarily an accurate predictor of functional conservation and sequence divergence does not universally correlate with the biological notion of “species,” defining species (also known as phylotypes) by sequence similarity within the rRNA genes is the accepted standard in studies of uncultured microbes. All sampled rRNAs were then assigned to taxonomic groups using an automated rRNA classification program (5). Our samples are dominated by rRNA genes from Proteobacteria (primarily members of the α, β, and γ subgroups) with moderate contributions from Firmicutes (low-GC Gram positive), Cyanobacteria, and species in the CFB phyla (Cytophaga, Flavobacterium, and Bacteroides) (fig. S4A; Fig. 6). The patterns we see are similar in broad outline to those observed by rRNA PCR studies from the Sargasso Sea (18), but with some quantitative differences that reflect either biases in PCR studies or differences in the species found in our sample versus those in other studies.
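The jump in phylotype counts between the 97% and 99% cutoffs can be illustrated with a minimal sketch of greedy threshold clustering. The toy sequences, the `identity` measure on pre-aligned sequences, and the greedy strategy are all assumptions made for illustration; they are not the procedure used by Venter et al.

```python
def identity(a, b):
    """Fraction of matching positions between two equal-length aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_phylotypes(seqs, cutoff):
    """Put each sequence in the first cluster whose representative it matches at >= cutoff."""
    reps, clusters = [], []
    for s in seqs:
        for i, r in enumerate(reps):
            if identity(s, r) >= cutoff:
                clusters[i].append(s)
                break
        else:
            # No representative was close enough: start a new phylotype.
            reps.append(s)
            clusters.append([s])
    return clusters


def mutate(seq, positions):
    """Return a copy of seq with a substitution at each listed position (toy helper)."""
    chars = list(seq)
    for p in positions:
        chars[p] = "A" if chars[p] != "A" else "C"
    return "".join(chars)


base = "ACGT" * 25                          # hypothetical 100-nt aligned fragment
close = mutate(base, [10, 50])              # 98% identical to base
far = mutate(base, [1, 20, 40, 60, 80])     # 95% identical to base
toy_seqs = [base, close, far]

n_at_97 = len(greedy_phylotypes(toy_seqs, 0.97))
n_at_99 = len(greedy_phylotypes(toy_seqs, 0.99))
```

In this toy set the stricter cutoff splits the 98%-identical pair into separate phylotypes, which is the same direction of change reported in the excerpt.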

An additional disadvantage associated with relying on rRNA for estimates of species diversity and abundance is the varying number of copies of rRNA genes between taxa (more than an order of magnitude among prokaryotes) (19). Therefore, we constructed phylogenetic trees (fig. S4, B to E) using other represented phylogenetic markers found in our data set, [RecA/RadA, heat shock protein 70 (HSP70), elongation factor Tu (EF-Tu), and elongation factor G (EF-G)]. Each marker gene interval in our data set (with a minimum length of 75 amino acids) was assigned to a putative taxonomic group using the phylogenetic analysis described for rRNA. For example, our data set contains over 600 recA homologs from throughout the bacterial phylogeny, including representatives of Proteobacteria, low- and high-GC Gram positives, Cyanobacteria, green sulfur and green nonsulfur bacteria, and other groups. Assignment to phylogenetic groups shows a broad consensus among the different phylogenetic markers. For most taxa, the rRNA-based proportion is the highest or lowest in comparison to the other markers. We believe this is due to the large amount of variation in copy number of rRNA genes between species. For example, the rRNA-based estimate of the proportion of γ-Proteobacteria is the highest, while the estimate for cyanobacteria is the lowest, which is consistent with the reports that members of the γ-Proteobacteria frequently have more than five rRNA operon copies, whereas cyanobacteria frequently have fewer than three (19).
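The copy-number effect described above can be made concrete with a small sketch. The counts and copy numbers below are hypothetical values chosen only to be in line with "more than five" and "fewer than three"; they are not data from the paper.

```python
def copy_number_corrected(counts, copies):
    """Convert per-taxon marker gene counts into estimated genome proportions."""
    genomes = {taxon: counts[taxon] / copies[taxon] for taxon in counts}
    total = sum(genomes.values())
    return {taxon: g / total for taxon, g in genomes.items()}


# Hypothetical rRNA gene counts and rRNA operon copy numbers per genome.
rrna_counts = {"Gammaproteobacteria": 60, "Cyanobacteria": 20}
rrna_copies = {"Gammaproteobacteria": 6, "Cyanobacteria": 2}

raw_total = sum(rrna_counts.values())
raw = {taxon: n / raw_total for taxon, n in rrna_counts.items()}
corrected = copy_number_corrected(rrna_counts, rrna_copies)
```

With these made-up numbers the raw rRNA proportions are 75% and 25%, while the copy-number-corrected estimate is an even split, which is why a single-copy marker such as RecA can give a different picture from rRNA.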

Just as phylogenetic classification is strengthened by a more comprehensive marker set, so too is the estimation of species richness. In this analysis, we define “genomic” species as a clustering of assemblies or unassembled reads more than 94% identical on the nucleotide level. This cutoff, adjusted for the protein-coding marker genes, is roughly comparable to the 97% cutoff traditionally used for rRNA. Thus

Fig. 6. Phylogenetic diversity of Sargasso Sea sequences using multiple phylogenetic markers. The relative contribution of organisms from different major phylogenetic groups (phylotypes) was measured using multiple phylogenetic markers that have been used previously in phylogenetic studies of prokaryotes: 16S rRNA, RecA, EF-Tu, EF-G, HSP70, and RNA polymerase B (RpoB). The relative proportion of different phylotypes for each sequence (weighted by the depth of coverage of the contigs from which those sequences came) is shown. The phylotype distribution was determined as follows: (i) Sequences in the Sargasso data set corresponding to each of these genes were identified using HMM and BLAST searches. (ii) Phylogenetic analysis was performed for each phylogenetic marker identified in the Sargasso data separately compared with all members of that gene family in all complete genome sequences (only complete genomes were used to control for the differential sampling of these markers in GenBank). (iii) The phylogenetic affinity of each sequence was assigned based on the classification of the nearest neighbor in the phylogenetic tree.
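The coverage weighting mentioned in the caption can be sketched as follows. This is a minimal illustration that assumes each marker sequence has already been given a phylotype (for example from its nearest neighbor in a tree) and a contig coverage depth; the input values are invented.

```python
from collections import defaultdict


def weighted_proportions(assignments):
    """Sum coverage depth per phylotype and normalize to proportions.

    assignments: iterable of (phylotype, coverage_depth) pairs, one per marker sequence.
    """
    totals = defaultdict(float)
    for phylotype, depth in assignments:
        totals[phylotype] += depth
    grand_total = sum(totals.values())
    return {p: t / grand_total for p, t in totals.items()}


# Hypothetical RecA hits: (assigned phylotype, coverage depth of the source contig).
reca_hits = [
    ("Alphaproteobacteria", 6.0),
    ("Cyanobacteria", 3.0),
    ("Alphaproteobacteria", 1.0),
]
proportions = weighted_proportions(reca_hits)
```

Weighting by depth means a sequence from a deeply covered contig counts for more than one from a singleton read, so the proportions approximate organism abundance rather than the number of distinct sequences.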


Page 107: Eisen Lecture for Ian Korf genomics course

Functional Inference from Metagenomics

• Can work well for individual genes


Page 108: Eisen Lecture for Ian Korf genomics course

Functional Diversity of Proteorhodopsins?

Venter et al., 2004


Page 109: Eisen Lecture for Ian Korf genomics course

Community Function?

• Many attempts to treat community as a bag of genes

• Then run pathway prediction tools on the entire data set to try to predict “community metabolism”

• Does not work very well


Page 110: Eisen Lecture for Ian Korf genomics course

ABCDEFG

TUVWXYZ

Binning challenge


Page 111: Eisen Lecture for Ian Korf genomics course

ABCDEFG

TUVWXYZ

Binning challenge

Best binning method: reference genomes
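When no reference genome is available, one of the composition signals named in the Venter et al. excerpt, oligonucleotide frequency, is a common fallback. The sketch below compares tetranucleotide profiles of toy fragments; the sequences, the choice of k = 4, and the Euclidean distance are illustrative assumptions rather than any specific published binning tool.

```python
import math
from collections import Counter
from itertools import product


def kmer_profile(seq, k=4):
    """Return the relative frequency of every possible k-mer over A, C, G, T."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[km] for km in kmers)
    if total == 0:
        raise ValueError("sequence contains no unambiguous k-mers")
    return [counts[km] / total for km in kmers]


def profile_distance(p, q):
    """Euclidean distance between two k-mer frequency profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


# Toy fragments: two from the same hypothetical AT-rich genome, one from a GC-rich genome.
unit = "ATTATAATTTAAATAT"
fragment_a1 = unit * 10
fragment_a2 = (unit[5:] + unit[:5]) * 10
fragment_b = "GCGGCCGCGGGCCCGC" * 10

d_same = profile_distance(kmer_profile(fragment_a1), kmer_profile(fragment_a2))
d_diff = profile_distance(kmer_profile(fragment_a1), kmer_profile(fragment_b))
```

Fragments from the same genome tend to have similar profiles (`d_same` is much smaller than `d_diff` here), which is the property composition-based binning relies on. Real fragments are far noisier than these toy repeats, which is one reason reference genomes bin better.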


Page 112: Eisen Lecture for Ian Korf genomics course

Metagenomics & Ecology


Page 113: Eisen Lecture for Ian Korf genomics course

(figs. S5 and S6). The general patterns of archaeal distribution we observed were consistent with previous field surveys (15, 25, 26). Recovery of ‘‘group II’’ planktonic Euryarchaeota genomic DNA was greatest in the upper water column and declined below the photic zone. This distribution corroborates recent observations of ion-translocating photoproteins (called proteorhodopsins), now known to occur in group II Euryarchaeota inhabiting the photic zone (27). ‘‘Group III’’ Euryarchaeota DNA was recovered at all depths, but at a much lower frequency (figs. S5 and S6). A novel crenarchaeal group, closely related to a putatively thermophilic Crenarchaeota (28), was observed at the greatest depths (fig. S6).

Vertically Distributed Genes and Metabolic Pathways

The depths sampled were specifically chosen to capture microbial sequences at discrete biogeochemical zones in the water column encompassing key physicochemical features (Tables 1 and 2, Fig. 1; figs. S1 and S2). To evaluate sequences from each depth, fosmid end sequences were compared against different databases including the Kyoto Encyclopedia of Genes and Genomes (KEGG) (29), National Center for Biotechnology Information (NCBI)’s Clusters of Orthologous Groups (COG) (30), and SEED subsystems (31). After categorizing sequences from each depth in BLAST searches (32) against each database, we identified protein categories that were more or less well represented in one sample versus another, using cluster analysis (33, 34) and bootstrap resampling methodologies (35).

Cluster analyses of predicted protein sequence representation identified specific genes and metabolic traits that were differentially distributed in the water column (fig. S7). In the photic zone (10, 70, and 130 m), these included a greater representation in sequences associated with photosynthesis; porphyrin and chlorophyll metabolism; type III secretion systems; and aminosugars, purine, propanoate, and vitamin B6 metabolism, relative to deep-water samples (fig. S7). Independent comparisons with well-annotated subsystems in the SEED database (31) also showed similar and overlapping trends (table S1), including greater representation in photic zone sequences associated with alanine and aspartate; metabolism of aminosugars; chlorophyll and carotenoid biosynthesis; maltose transport; lactose degradation; and heavy metal ion sensors and exporters. In contrast, samples from depths of 200 m and below (where there is no photosynthesis) were enriched in different sequences, including those associated with protein folding; processing and export; methionine metabolism; glyoxylate, dicarboxylate, and methane metabolism; thiamine metabolism; and type II secretion systems, relative to surface-water samples (fig. S7).

COG categories also provided insight into differentially distributed protein functions and categories. COGs more highly represented in photic zone included iron-transport membrane receptors, deoxyribopyrimidine photolyase, diaminopimelate decarboxylase, membrane guanosine triphosphatase (GTPase) with the lysyl endopeptidase gene product LepA, and branched-chain amino acid–transport system components (fig. S8). In contrast, COGs with greater representation in deep-water samples included transposases, several dehydrogenase categories, and integrases (fig. S8). Sequences more highly represented in the deep-water samples in SEED subsystem (31) comparisons included those associated with respiratory dehydrogenases, polyamine adenosine triphosphate (ATP)–binding cassette (ABC) transporters, polyamine metabolism, and alkylphosphonate transporters (table S1).

Habitat-enriched sequences. We estimated average protein sequence similarities between all depth bins from cumulative TBLASTX high-scoring sequence pair (HSP) bitscores, derived from BLAST searches of each depth against every other (Fig. 3). Neighbor-joining analyses of a normalized distance matrix derived from these cumulative bitscores joined photic zone and deeper samples together in separate clusters (Fig. 3). When we compared our HOT sequence datasets to previously reported Sargasso Sea microbial sequences (19), these datasets also clustered according to their depth and size fraction of origin (fig. S9). The clustering pattern in Fig. 3 is consistent with the expectation that randomly sampled photic zone microbial sequences will tend on average to be more similar to one another than to those from the deep sea, and vice versa.
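A rough sketch of how cumulative bitscores might be turned into a distance matrix is given below. The excerpt does not state the normalization used, so the formula here (shared score divided by the two self-comparison scores) is an assumption, and the scores are invented.

```python
def bitscore_distances(scores):
    """Build a symmetric distance table from cumulative bitscores.

    scores[a][b] is the cumulative bitscore of sample a searched against sample b.
    Assumed normalization: d(a, b) = 1 - (S_ab + S_ba) / (S_aa + S_bb).
    """
    samples = sorted(scores)
    dist = {}
    for a in samples:
        for b in samples:
            shared = scores[a][b] + scores[b][a]
            self_total = scores[a][a] + scores[b][b]
            dist[(a, b)] = 1.0 - shared / self_total
    return dist


# Hypothetical cumulative bitscores for three depths.
toy_scores = {
    "10m": {"10m": 1000, "70m": 800, "4000m": 300},
    "70m": {"10m": 780, "70m": 1000, "4000m": 320},
    "4000m": {"10m": 310, "70m": 330, "4000m": 1000},
}
dist = bitscore_distances(toy_scores)
```

A matrix like `dist` is the kind of input a neighbor-joining program takes; in this toy case the two photic zone samples are closer to each other than either is to the 4000 m sample, matching the clustering pattern described above.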

We also identified those sequences (some of which have no homologs in annotated databases) that track major depth-variable environmental features. Specifically, sequence homologs found only in the photic zone unique sequences (from 10, 70, and 130 m), or deepwater unique sequences (from 500, 770, and 4000 m) were identified (Fig. 3). To categorize potential functions encoded in these photic zone unique (PZ) or deep-water unique (DW) sequence bins, each was compared with KEGG, COG, and NCBI protein databases in separate analyses (29, 30, 36).

Some KEGG metabolic pathways appeared more highly represented in the PZ than in DW sequence bins, including those associated with photosynthesis; porphyrin and chlorophyll metabolism; propanoate, purine, and glycerophospholipid metabolism; bacterial chemotaxis; flagellar assembly; and type III secretion systems (Fig. 4A). All proteorhodopsin sequences (except one) were captured in the PZ bin. Well-represented photic zone KEGG pathway categories appeared to reflect potential pathway interdependencies. For example the PZ photosynthesis bin [3% of the total (Fig. 4A)] contained Prochlorococcus-like and Synechococcus-like photosystem I, photosystem II, and cytochrome genes. In tandem, PZ porphyrin and chlorophyll biosynthesis sequence bins [~3.9% of the total (Fig. 4A)] contained high representation of cyanobacteria-like cobalamin and chlorophyll biosynthesis genes, as well as photoheterotroph-like bacteriochlorophyll biosynthetic genes. Other probable functional interdependencies appear reflected in the corecovery of sequences associated with chemotaxis (mostly methyl-accepting chemotaxis proteins), flagellar biosynthesis (predominant-

Figure axes: potential temperature (°C) versus salinity; sample depths 10, 70, 130, 200, 500, 770, and 4000 m.

Fig. 1. Temperature versus salinity (T-S) relations for the North Pacific Subtropical Gyre at station ALOHA (22°45′N, 158°W). The blue circles indicate the positions, in T-S ‘‘hydrospace’’ of the seven water samples analyzed in this study. The data envelope shows the temperature and salinity conditions observed during the period October 1988 to December 2004 emphasizing both the temporal variability of near-surface waters and the relative constancy of deep waters.

Research Articles. Science, vol. 311, 27 January 2006, p. 498 (www.sciencemag.org)


Page 114: Eisen Lecture for Ian Korf genomics course

Field Diversity


Page 115: Eisen Lecture for Ian Korf genomics course

ARTICLES

A human gut microbial gene catalogue established by metagenomic sequencing

Junjie Qin1*, Ruiqiang Li1*, Jeroen Raes2,3, Manimozhiyan Arumugam2, Kristoffer Solvsten Burgdorf4, Chaysavanh Manichanh5, Trine Nielsen4, Nicolas Pons6, Florence Levenez6, Takuji Yamada2, Daniel R. Mende2, Junhua Li1,7, Junming Xu1, Shaochuan Li1, Dongfang Li1,8, Jianjun Cao1, Bo Wang1, Huiqing Liang1, Huisong Zheng1, Yinlong Xie1,7, Julien Tap6, Patricia Lepage6, Marcelo Bertalan9, Jean-Michel Batto6, Torben Hansen4, Denis Le Paslier10, Allan Linneberg11, H. Bjørn Nielsen9, Eric Pelletier10, Pierre Renault6, Thomas Sicheritz-Ponten9, Keith Turner12, Hongmei Zhu1, Chang Yu1, Shengting Li1, Min Jian1, Yan Zhou1, Yingrui Li1, Xiuqing Zhang1, Songgang Li1, Nan Qin1, Huanming Yang1, Jian Wang1, Søren Brunak9, Joel Dore6, Francisco Guarner5, Karsten Kristiansen13, Oluf Pedersen4,14, Julian Parkhill12, Jean Weissenbach10, MetaHIT Consortium†, Peer Bork2, S. Dusko Ehrlich6 & Jun Wang1,13

To understand the impact of gut microbes on human health and well-being it is crucial to assess their genetic potential. Here we describe the Illumina-based metagenomic sequencing, assembly and characterization of 3.3 million non-redundant microbial genes, derived from 576.7 gigabases of sequence, from faecal samples of 124 European individuals. The gene set, ~150 times larger than the human gene complement, contains an overwhelming majority of the prevalent (more frequent) microbial genes of the cohort and probably includes a large proportion of the prevalent human intestinal microbial genes. The genes are largely shared among individuals of the cohort. Over 99% of the genes are bacterial, indicating that the entire cohort harbours between 1,000 and 1,150 prevalent bacterial species and each individual at least 160 such species, which are also largely shared. We define and describe the minimal gut metagenome and the minimal gut bacterial genome in terms of functions present in all individuals and most bacteria, respectively.

It has been estimated that the microbes in our bodies collectively make up to 100 trillion cells, tenfold the number of human cells, and suggested that they encode 100-fold more unique genes than our own genome1. The majority of microbes reside in the gut, have a profound influence on human physiology and nutrition, and are crucial for human life2,3. Furthermore, the gut microbes contribute to energy harvest from food, and changes of gut microbiome may be associated with bowel diseases or obesity4–8.

To understand and exploit the impact of the gut microbes on human health and well-being it is necessary to decipher the content, diversity and functioning of the microbial gut community. 16S ribosomal RNA gene (rRNA) sequence-based methods9 revealed that two bacterial divisions, the Bacteroidetes and the Firmicutes, constitute over 90% of the known phylogenetic categories and dominate the distal gut microbiota10. Studies also showed substantial diversity of the gut microbiome between healthy individuals4,8,10,11. Although this difference is especially marked among infants12, later in life the gut microbiome converges to more similar phyla.

Metagenomic sequencing represents a powerful alternative to rRNA sequencing for analysing complex microbial communities13–15. Applied to the human gut, such studies have already generated some 3 gigabases (Gb) of microbial sequence from faecal samples of 33 individuals from the United States or Japan8,16,17. To get a broader overview of the human gut microbial genes we used the Illumina Genome Analyser (GA) technology to carry out deep sequencing of total DNA from faecal samples of 124 European adults. We generated 576.7 Gb of sequence, almost 200 times more than in all previous studies, assembled it into contigs and predicted 3.3 million unique open reading frames (ORFs). This gene catalogue contains virtually all of the prevalent gut microbial genes in our cohort, provides a broad view of the functions important for bacterial life in the gut and indicates that many bacterial species are shared by different individuals. Our results also show that short-read metagenomic sequencing can be used for global characterization of the genetic potential of ecologically complex environments.

Metagenomic sequencing of gut microbiomes

As part of the MetaHIT (Metagenomics of the Human Intestinal Tract) project, we collected faecal specimens from 124 healthy, overweight and obese individual human adults, as well as inflammatory bowel disease (IBD) patients, from Denmark and Spain (Supplementary Table 1). Total DNA was extracted from the faecal specimens18 and an average of 4.5 Gb (ranging between 2 and 7.3 Gb) of sequence was generated for each sample, allowing us to capture most of the

*These authors contributed equally to this work.
†Lists of authors and affiliations appear at the end of the paper.

1BGI-Shenzhen, Shenzhen 518083, China. 2European Molecular Biology Laboratory, 69117 Heidelberg, Germany. 3VIB-Vrije Universiteit Brussel, 1050 Brussels, Belgium. 4Hagedorn Research Institute, DK 2820 Copenhagen, Denmark. 5Hospital Universitari Val d’Hebron, Ciberehd, 08035 Barcelona, Spain. 6Institut National de la Recherche Agronomique, 78350 Jouy en Josas, France. 7School of Software Engineering, South China University of Technology, Guangzhou 510641, China. 8Genome Research Institute, Shenzhen University Medical School, Shenzhen 518000, China. 9Center for Biological Sequence Analysis, Technical University of Denmark, DK-2800 Kongens Lyngby, Denmark. 10Commissariat a l’Energie Atomique, Genoscope, 91000 Evry, France. 11Research Center for Prevention and Health, DK-2600 Glostrup, Denmark. 12The Wellcome Trust Sanger Institute, Hinxton, Cambridge CB10 1SA, UK. 13Department of Biology, University of Copenhagen, DK-2200 Copenhagen, Denmark. 14Institute of Biomedical Sciences, University of Copenhagen & Faculty of Health Science, University of Aarhus, 8000 Aarhus, Denmark.

Vol 464 | 4 March 2010 | doi:10.1038/nature08821


Page 116: Eisen Lecture for Ian Korf genomics course

Almost all (99.96%) of the phylogenetically assigned genes belonged to the Bacteria and Archaea, reflecting their predominance in the gut. Genes that were not mapped to orthologous groups were clustered into gene families (see Methods). To investigate the functional content of the prevalent gene set we computed the total number of orthologous groups and/or gene families present in any combination of n individuals (with n = 2–124; see Fig. 2c). This rarefaction analysis shows that the ‘known’ functions (annotated in eggNOG or KEGG) quickly saturate (a value of 5,569 groups was observed): when sampling any subset of 50 individuals, most have been detected. However, three-quarters of the prevalent gut functionalities consists of uncharacterized orthologous groups and/or completely novel gene families (Fig. 2c). When including these groups, the rarefaction curve only starts to plateau at the very end, at a much higher level (19,338 groups were detected), confirming that the extensive sampling of a large number of individuals was necessary to capture this considerable amount of novel/unknown functionality.
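The rarefaction idea can be sketched in a few lines. This version averages over random orderings of individuals rather than enumerating every combination of n individuals, which is an approximation assumed here for brevity, and the per-individual group sets are toy data.

```python
import random


def rarefaction_curve(samples, n_permutations=100, seed=0):
    """Mean number of distinct groups seen after 1..N individuals.

    samples: list of sets, each holding the functional groups found in one individual.
    """
    rng = random.Random(seed)
    totals = [0] * len(samples)
    for _ in range(n_permutations):
        order = samples[:]
        rng.shuffle(order)
        seen = set()
        for i, groups in enumerate(order):
            seen |= groups          # accumulate groups as individuals are added
            totals[i] += len(seen)
    return [t / n_permutations for t in totals]


# Hypothetical orthologous groups detected in four individuals.
individuals = [{"A", "B"}, {"A", "B", "C"}, {"A", "D"}, {"A", "B"}]
curve = rarefaction_curve(individuals)
```

A curve that flattens early means that adding individuals finds few new groups, as reported for the annotated functions; a curve that is still rising at the last individual corresponds to the behaviour described for the uncharacterized groups.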

Bacterial functions important for life in the gut

The extensive non-redundant catalogue of the bacterial genes from the human intestinal tract provides an opportunity to identify bacterial functions important for life in this environment. There are functions necessary for a bacterium to thrive in a gut context (that is, the ‘minimal gut genome’) and those involved in the homeostasis of the whole ecosystem, encoded across many species (the ‘minimal gut metagenome’). The first set of functions is expected to be present in most or all gut bacterial species; the second set in most or all individuals’ gut samples.

To identify the functions encoded by the minimal gut genome we use the fact that they should be present in most or all gut bacterial species and therefore appear in the gene catalogue at a frequency above that of the functions present in only some of the gut bacterial species. The relative frequency of different functions can be deduced from the number of genes recruited to different eggNOG clusters, after normalization for gene length and copy number (Supplementary Fig. 10a, b). We ranked all the clusters by gene frequencies and determined the range that included the clusters specifying well-known essential bacterial functions, such as those determined experimentally for a well-studied firmicute, Bacillus subtilis27, hypothesizing that additional clusters in this range are equally important. As expected, the range that included most of B. subtilis essential clusters (86%) was at the very top of the ranking order (Fig. 5). Some 76% of the clusters with essential genes of Escherichia coli28 were within this range, confirming the validity of our approach. This suggests that 1,244 metagenomic clusters found within the range (Supplementary Table 10; termed ‘range clusters’ hereafter) specify functions important for life in the gut.

We found two types of functions among the range clusters: those required in all bacteria (housekeeping) and those potentially specific for the gut. Among many examples of the first category are the functions that are part of main metabolic pathways (for example, central carbon metabolism, amino acid synthesis), and important protein complexes (RNA and DNA polymerase, ATP synthase, general secretory apparatus). Not surprisingly, projection of the range clusters on the KEGG metabolic pathways gives a highly integrated picture of the global gut cell metabolism (Fig. 6a).

The putative gut-specific functions include those involved in adhesion to the host proteins (collagen, fibrinogen, fibronectin) or in harvesting sugars of the globoseries glycolipids, which are carried on blood and epithelial cells. Furthermore, 15% of range clusters encode functions that are present in <10% of the eggNOG genomes (see Supplementary Fig. 11) and are largely (74.3%) not defined (Fig. 6b). Detailed studies of these should lead to a deeper comprehension of bacterial life in the gut.

To identify the functions encoded by the minimal gut metagenome, we computed the orthologous groups that are shared by individuals of our cohort. This minimal set, of 6,313 functions, is much larger than the one estimated in a previous study8. There are only 2,069 functionally annotated orthologous groups, showing that they gravely underestimate the true size of the common functional complement among individuals (Fig. 6c). The minimal gut metagenome includes a considerable fraction of functions (~45%) that are present in <10% of the sequenced bacterial genomes (Fig. 6c, inset). These otherwise rare functionalities that are found in each of the 124 individuals may be necessary for the gut ecosystem. Eighty per cent of these orthologous groups contain genes with at best poorly characterized function, underscoring our limited knowledge of gut functioning.

Of the known fraction, about 5% codes for (pro)phage-related proteins, implying a universal presence and possible important ecological role of bacteriophages in gut homeostasis. The most striking secondary metabolism that seems crucial for the minimal metagenome relates, not unexpectedly, to biodegradation of complex sugars and glycans harvested from the host diet and/or intestinal lining. Examples include degradation and uptake pathways for pectin (and its monomer, rhamnose) and sorbitol, sugars which are omnipresent in fruits and vegetables, but which are not or poorly absorbed by humans. As some gut microorganisms were found to degrade both of them29,30, this capacity seems to be selected for by the gut ecosystem as a non-competitive source of energy. Besides these, capacity to ferment, for example, mannose, fructose, cellulose and sucrose is also part of the minimal metagenome. Together, these emphasize the

Figure axes: cluster (%), 0 to 40, versus cluster rank, 1 to 10,001; the ‘Range’ region is marked.

Figure 5 | Clusters that contain the B. subtilis essential genes. The clusters were ranked by the number of genes they contain, normalized by average length and copy number (see Supplementary Fig. 10), and the proportion of clusters with the essential B. subtilis genes was determined for successive groups of 100 clusters. Range indicates the part of the cluster distribution that contains 86% of the B. subtilis essential genes.

Figure axes: PC1 versus PC2; point classes: Healthy, Crohn’s disease, Ulcerative colitis; P value: 0.031.

Figure 4 | Bacterial species abundance differentiates IBD patients and healthy individuals. Principal component analysis with health status as instrumental variables, based on the abundance of 155 species with ≥1% genome coverage by the Illumina reads in at least 1 individual of the cohort, was carried out with 14 healthy individuals and 25 IBD patients (21 ulcerative colitis and 4 Crohn’s disease) from Spain (Supplementary Table 1). Two first components (PC1 and PC2) were plotted and represented 7.3% of whole inertia. Individuals (represented by points) were clustered and centre of gravity computed for each class; P-value of the link between health status and species abundance was assessed using a Monte-Carlo test (999 replicates).


Page 117: Eisen Lecture for Ian Korf genomics course

Woese Tree of Life

adapted from Baldauf, et al., in Assembling the Tree of Life, 2004

??????


Page 118: Eisen Lecture for Ian Korf genomics course

GEBA Lesson 6: Improves analysis of metagenomic data


Page 119: Eisen Lecture for Ian Korf genomics course

Sargasso Phylotypes

Figure: weighted % of clones (axis 0 to 0.500) by major phylogenetic group (Alphaproteobacteria, Betaproteobacteria, Gammaproteobacteria, Epsilonproteobacteria, Deltaproteobacteria, Cyanobacteria, Firmicutes, Actinobacteria, Chlorobi, CFB, Chloroflexi, Spirochaetes, Fusobacteria, Deinococcus-Thermus, Euryarchaeota, Crenarchaeota) for the markers EFG, EFTu, HSP70, RecA, RpoB, and rRNA.

Other Markers

GEBA Project improves metagenomic analysis

Venter et al., Science 304: 66-74. 2004

Page 120: Eisen Lecture for Ian Korf genomics course

Sargasso Phylotypes

Figure: bar chart (axis 0 to 0.500) by major phylogenetic group (Alphaproteobacteria, Betaproteobacteria, Gammaproteobacteria, Epsilonproteobacteria, Deltaproteobacteria, Cyanobacteria, Firmicutes, Actinobacteria, Chlorobi, CFB, Chloroflexi, Spirochaetes, Fusobacteria, Deinococcus-Thermus, Euryarchaeota, Crenarchaeota) for the markers EFG, EFTu, HSP70, RecA, RpoB, and rRNA.

Other Markers

But not a lot

Venter et al., Science 304: 66-74. 2004

Page 121: Eisen Lecture for Ian Korf genomics course

rRNA Tree of Life

Figure from Barton, Eisen et al. “Evolution”, CSHL Press. 2007.

Based on tree from Pace 1997 Science 276:734-740

Archaea

Eukaryotes

Bacteria


Page 126: Eisen Lecture for Ian Korf genomics course

Uncultured Lineages:

• Get into culture

• Enrichment cultures

• If abundant in low diversity ecosystems

• Flow sorting

• Microbeads

• Microfluidic sorting

• Single cell amplification


Page 127: Eisen Lecture for Ian Korf genomics course


Number of SAGs from Candidate Phyla

Site                                    OD1   OP11   OP3   SAR406
Site A: Hydrothermal vent               4     1      -     -
Site B: Gold Mine                       6     13     2     -
Site C: Tropical gyres (Mesopelagic)    -     -      -     2
Site D: Tropical gyres (Photic zone)    1     -      -     -

Sample collections at 4 additional sites are underway.

Phil Hugenholtz

GEBA uncultured


Page 128: Eisen Lecture for Ian Korf genomics course

Example: Sharpshooter symbionts


Page 129: Eisen Lecture for Ian Korf genomics course

• Obligate xylem feeder

• Transmits Xylella between plants

• Much like mosquitoes transmit malarial pathogen

• Only animal listed as possible “bioterror” agent by US DHS

Glassy winged sharpshooter

[Brochure image: "Glassy-Winged Sharpshooter: A Serious Threat to California Agriculture", from the University of California’s Pierce’s Disease Research and Emergency Response Task Force]

Glassy-winged sharpshooter eggs are laid together on the underside of leaves, usually in groups of 10 to 12. The egg masses appear as small, greenish blisters. These blisters are easier to observe after the eggs hatch, when they appear as tan to brown scars on the leaves.

Parasitized egg masses are tan to brown in color with small, circular holes at one end of the eggs.

This informational brochure was produced by ANR Communication Services for the University of California Pierce’s Disease Research and Emergency Response Task Force. You may download a copy of the brochure from the Division of Agriculture and Natural Resources web site at http://danr.ucop.edu or from the Communication Services web site at http://danrcs.ucdavis.edu.

For local information, contact your UC Cooperative Extension farm advisor.

[Figure: "Glassy-winged Sharpshooter Generalized Lifecycle". Series: Adults, Egg masses. Vertical axis: 0 to 100. Horizontal axis: Jan. to Nov.]

Glassy-winged sharpshooters overwinter as adults and begin laying egg masses in late February through May. This first generation matures as adults in late May through late August. Second-generation egg masses are laid starting in mid-June through late September, which develop into over-wintering adults.

Page 130: Eisen Lecture for Ian Korf genomics course


Xylem feeding insects also very successful


Page 131: Eisen Lecture for Ian Korf genomics course

Xylem and Phloem

From Lodish et al. 2000


Page 132: Eisen Lecture for Ian Korf genomics course

Animal nutrition

• Xylem is frequently missing essential amino acids, vitamins and Co-Factors, and has only small amounts of carbon skeletons


Page 133: Eisen Lecture for Ian Korf genomics course

Plant response to sap feeders

• Possible solutions to no amino acids, vitamins, etc. in xylem:
  - Eat other things
  - Evolve metabolic pathways to synthesize missing nutrients
  - Find some poor sap to make the stuff for you

Page 134: Eisen Lecture for Ian Korf genomics course

Moran N. A. PNAS 2007;104:8627-8633

©2007 by National Academy of Sciences

Sharpshooter: Cuerna sayi

bacteriomes

Sharpshooters harbor two obligate symbionts in their bacteriomes

Moran et al. 2003 Environ. Microbiol.; Moran et al. 2005 Appl. Environ. Microbiol.

Candidatus “Baumannia cicadellinicola” (Gammaproteobacteria)

Candidatus “Sulcia muelleri” (Bacteroidetes)

D Takiya

0.1 mm

Bacteriome dissected from anterior abdomen of H. vitripennis

Orange-red portion: Baumannia only

Yellow portion: Baumannia and Sulcia

(Moran et al. 2003 Environmental Microbiology)

10 µm

“Candidatus Baumannia cicadellinicola” (Gammaproteobacteria) in “red” portion of bacteriome of Homalodisca vitripennis

N = host nucleus, B = bacteriocyte membrane, E = endosymbionts

Irregularly spherical, ~2 µm diameter

Phylogeny of Sulcia muelleri from Auchenorrhyncha (Hemiptera): the oldest insect symbiont

Moran et al. Appl Env Micro 2005

Permian age fossils (>270 myr)

• = 100% bootstrap support, all methods

Broad congruence with host relationships

Dates to the origins of vascular plant-feeding in insects

Symbionts derived from sharpshooters

Page 135: Eisen Lecture for Ian Korf genomics course

How to study microbes

• Key questions about microbes in environment:
  - Who are they? (i.e., what kinds of microbes are they)
  - What are they doing? (i.e., what functions and processes do they possess)

Page 136: Eisen Lecture for Ian Korf genomics course

Studying the microbe-like entities in the aphid gut

Field Observations: appearance of limited value

Page 137: Eisen Lecture for Ian Korf genomics course

Studying the microbe-like entities in the aphid gut

Field Observations: appearance of limited value

Culturing: key bacteria in sharpshooter gut have not been cultured

Page 138: Eisen Lecture for Ian Korf genomics course

Studying the microbe-like entities in the aphid gut

Field Observations: appearance of limited value

Culturing: key bacteria in sharpshooter gut have not been cultured

DNA

Page 139: Eisen Lecture for Ian Korf genomics course

• Who Are They?


Page 140: Eisen Lecture for Ian Korf genomics course

PCR and phylogenetic analysis of rRNA genes

DNA extraction → PCR → Sequence rRNA genes → Sequence alignment = Data matrix → Phylogenetic tree

PCR: Makes lots of copies of the rRNA genes in sample

rRNA1 5’ ...TACAGTATAGGTGGAGCTAGCGATCGATCGA... 3’

[Figure: data matrix aligning rRNA1 with the E. coli, Humans and Yeast sequences, and the phylogenetic tree built from it with tips rRNA1, E. coli, Humans and Yeast]

Page 141: Eisen Lecture for Ian Korf genomics course

Baumannia is a close relative of Buchnera symbionts of aphids

[Tree tip labels by host: Sharpshooters, Aphids, Aphids, Aphids, Ants, Flies]



Page 144: Eisen Lecture for Ian Korf genomics course

• What Are They Doing?



Page 146: Eisen Lecture for Ian Korf genomics course

DNA extraction

PCR


Genome sequencing

Sequence the whole genome

Predict functions by comparison to other organisms



Page 151: Eisen Lecture for Ian Korf genomics course

PCR and phylogenetic analysis of rRNA genes

DNA extraction → PCR → Sequence rRNA genes → Sequence alignment = Data matrix → Phylogenetic tree

PCR: Makes lots of copies of the rRNA genes in sample

rRNA1 5’ ...ACACACATAGGTGGAGCTAGCGATCGATCGA... 3’

rRNA2 5’ ...TACAGTATAGGTGGAGCTAGCGATCGATCGA... 3’

[Figure: data matrix aligning rRNA1 and rRNA2 with the E. coli and Humans sequences, and the phylogenetic tree built from it with tips rRNA1, rRNA2, E. coli and Humans]
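The "sequence alignment = data matrix" step on this slide can be illustrated with a small sketch. It takes the two rRNA fragments printed on the slide, treats them as already aligned (an assumption, since the slide shows only fragments), and computes the fraction of positions that differ. The function name `p_distance` is hypothetical, and a real analysis would use a multiple sequence alignment and a substitution model rather than raw mismatch counts.

```python
def p_distance(seq_a: str, seq_b: str) -> float:
    """Fraction of aligned positions at which two equal-length sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    if not seq_a:
        raise ValueError("sequences must not be empty")
    mismatches = sum(a != b for a, b in zip(seq_a, seq_b))
    return mismatches / len(seq_a)


# Fragments as printed on the slide, with the line break joined.
# Assumption: the two fragments are already aligned position by position.
rrna1 = "ACACACATAGGTGGAGCTAGCGATCGATCGA"
rrna2 = "TACAGTATAGGTGGAGCTAGCGATCGATCGA"

distance = p_distance(rrna1, rrna2)
print(f"p-distance between rRNA1 and rRNA2: {distance:.3f}")
```

Pairwise distances like this one, computed for every pair of sequences in the data matrix, are one common input to distance-based tree building.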


Page 154: Eisen Lecture for Ian Korf genomics course

DNA extraction

PCR


Genome sequencing

Sequence the whole genome

Predict functions by comparison to other organisms


Page 155: Eisen Lecture for Ian Korf genomics course

Sulcia makes essential amino acids


Page 156: Eisen Lecture for Ian Korf genomics course

Sulcia makes essential amino acids

ESSENTIAL AMINO ACID PRODUCING MACHINE


Page 157: Eisen Lecture for Ian Korf genomics course

Wu et al. 2006 PLoS Biology 4: e188.

Baumannia makes vitamins and cofactors

Sulcia makes essential amino acids


Page 158: Eisen Lecture for Ian Korf genomics course

Symbiosis between Buchnera and aphids

Class of symbiosis   Organism A   Organism B
Mutualism            +            +
Commensalism         +            0
Parasitism           +            -

Symbiosis between bacteria & sharpshooters?
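The classification in the table can be sketched as a lookup on the sign of the effect on each partner. This is only an illustration of the three rows shown on the slide; the names `SYMBIOSIS_CLASSES` and `classify_symbiosis` are hypothetical, and sign combinations the slide does not list return `None`.

```python
# Rows of the slide's table: (effect on organism A, effect on organism B).
SYMBIOSIS_CLASSES = {
    ("+", "+"): "mutualism",
    ("+", "0"): "commensalism",
    ("+", "-"): "parasitism",
}


def classify_symbiosis(effect_a: str, effect_b: str):
    """Return the class of symbiosis for a pair of effects, in either order.

    Effects are "+", "0" or "-". Combinations not in the slide's table
    (for example "-" and "-") return None.
    """
    for pair in ((effect_a, effect_b), (effect_b, effect_a)):
        if pair in SYMBIOSIS_CLASSES:
            return SYMBIOSIS_CLASSES[pair]
    return None


print(classify_symbiosis("+", "+"))
print(classify_symbiosis("-", "+"))
```

Under this scheme, a nutrient-provisioning symbiont and its host would be entered as "+" and "+", while a pathogen and its host would be entered as "+" and "-".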

Page 159: Eisen Lecture for Ian Korf genomics course

Symbiosis between Xylella and sharpshooters

Class of symbiosis   Organism A   Organism B
Mutualism            +            +
Commensalism         +            0
Parasitism           +            -

Symbiosis between bacteria & sharpshooters?

Page 160: Eisen Lecture for Ian Korf genomics course

Pierce’s Disease


Page 161: Eisen Lecture for Ian Korf genomics course

Bacteria and archaea are key commensals of many eukaryotes


Page 162: Eisen Lecture for Ian Korf genomics course

Sequencing Technology


Page 163: Eisen Lecture for Ian Korf genomics course

Surpassing Moore's Law

Page 164: Eisen Lecture for Ian Korf genomics course

We Will Sequence Everything


Page 165: Eisen Lecture for Ian Korf genomics course

Key Issues

• Cost / bp
• Read length
• Paired end
• Ease of feeding
• Error profiles
• Barcoding potential

Page 166: Eisen Lecture for Ian Korf genomics course

GEBA Now

• 300+ genomes
• Rich sampling of major groups of cultured organisms
• Zoomed in sampling of haloarchaea, cyanobacteria and more