CAMERA: A Metagenomics Resource for Marine Microbial Ecology
July 27, 2007
Paul Gilna, UCSD/Calit2
Saul A. Kravitz, J. Craig Venter Institute
Acknowledgements
• UCSD/Calit2
- Larry Smarr, PI; Paul Gilna, Executive Director
- Phil Papadopoulos, Technical Lead
- Weizhong Li
• JCVI
- Marv Frazier, co-PI
- Leonid Kagan, Architect; Jennifer Wortman, Bioinformatics
- Rekha Seshadri, Outreach and Training
- Doug Rusch, Shibu Yooseph, Aaron Halpern, Granger Sutton
• UC Davis
- Jonathan Eisen, co-investigator
• Gordon and Betty Moore Foundation
- David Kingsbury and Mary Maxon
Outline
• New Discipline of Metagenomics
• Global Ocean Sampling Expedition
• Challenges of Metagenomic Data
• CAMERA Features
• CAMERA Usage to Date
• Cyberinfrastructure
Genomics vs Metagenomics
• Genomics – ‘Old School’
- Study of an organism's genome
- Genome sequence determined using shotgun sequencing and assembly
- ~1300 microbes sequenced, first in 1995
- DNA usually obtained from pure cultures
• Metagenomics
- Application of genome sequencing methods to environmental samples (no culturing)
- Environmental shotgun sequencing is the most widely used approach
Metagenomic Questions
• Within an environment
- What biological functions are present (absent)?
- What organisms are present (absent)?
• Compare data from (dis)similar environments
- What are the fundamental rules of microbial ecology?
• Search for novel proteins and protein families
Metagenomics Applications
• Marine Ecology and Microbiology
- Hypersaline ponds, Oceans
• Alternative Energy and Industrial
- Termite Metabolism
• Medical Applications
- Microbial Ecology of Human body cavities and fluids
• Agricultural
- Disease Vector Metabolism (Glassy-winged Sharpshooter)
- Soil Ecology
• Environmental Remediation
- DOE: Acid Mine Drainage, Chemical and Radioactive Waste
Metadata
• Metagenomics
- Genomics + Metadata
• Environmental Metadata
- Time and location (lat, long, depth) of sample collection
- Correlate with remote sensing data
- Physico-chemical properties (e.g. temperature, salinity)
Figure: MODIS-Aqua satellite image of ocean chlorophyll in the Sargasso Sea grid about the BATS site from 22 February 2003
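The metadata items listed on this slide (collection time, latitude, longitude, depth, physico-chemical properties) can be pictured as one record per sample. The sketch below is only an illustration: the class name `SampleMetadata`, its field names, the helper `in_bounding_box`, and all values are assumptions, not CAMERA's actual schema or GOS data.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class SampleMetadata:
    """Environmental context for one metagenomic sample (hypothetical schema)."""
    sample_id: str
    collected_at: datetime   # time of sample collection
    latitude: float          # decimal degrees, north positive
    longitude: float         # decimal degrees, east positive
    depth_m: float           # depth below the surface, in metres
    properties: dict = field(default_factory=dict)  # e.g. temperature, salinity

    def __post_init__(self):
        # Reject coordinates that cannot be real sampling locations.
        if not -90.0 <= self.latitude <= 90.0:
            raise ValueError("latitude out of range")
        if not -180.0 <= self.longitude <= 180.0:
            raise ValueError("longitude out of range")
        if self.depth_m < 0:
            raise ValueError("depth must be non-negative")


def in_bounding_box(sample, south, north, west, east):
    """True if the sample lies inside a lat/long box, e.g. a satellite image tile."""
    return south <= sample.latitude <= north and west <= sample.longitude <= east


# Illustrative values only; not taken from the GOS dataset.
sample = SampleMetadata(
    sample_id="EXAMPLE-01",
    collected_at=datetime(2003, 2, 22, 12, 0, tzinfo=timezone.utc),
    latitude=31.7,
    longitude=-64.2,
    depth_m=5.0,
    properties={"temperature_c": 20.0, "salinity_psu": 36.6},
)
print(in_bounding_box(sample, south=30.0, north=33.0, west=-66.0, east=-62.0))
```

Keeping time and position as typed fields, rather than free text, is what would make a join against gridded remote sensing data straightforward.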
Global Ocean Sampling (GOS)
178 Total Sampling Locations
Phase 1: 41 samples, 7.7M reads, >6M proteins
Diverse Environments
Open ocean, estuary, embayment, upwelling, fringing reef, atoll, warm seep, mangrove, fresh water, biofilms, sediments, soils
GOS Protein Analysis
Yooseph et al (PLoS 2007)
• Novel clustering process
- Sequence similarity based
- Predict proteins and group into related clusters
- Include GOS and all known proteins
• Findings
- GOS proteins cover ~all existing prokaryotic families
- GOS expands diversity of known protein families
- 1700 large novel clusters with no homology to known protein families
- Higher than expected proportion of novel clusters are viral
- No saturation in the rate of novel protein family discovery
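To make "sequence similarity based" clustering concrete, here is a toy sketch. It is not the procedure of Yooseph et al.; it swaps in a much simpler technique, greedy clustering on k-mer Jaccard similarity, and the function names, the threshold, and the example sequences are all made up for illustration.

```python
def kmers(seq, k=3):
    """Set of all length-k substrings of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def similarity(a, b, k=3):
    """Jaccard similarity of the two k-mer sets (a crude stand-in for alignment)."""
    ka, kb = kmers(a, k), kmers(b, k)
    if not ka or not kb:
        return 0.0
    return len(ka & kb) / len(ka | kb)


def greedy_cluster(seqs, threshold=0.5, k=3):
    """Assign each sequence to the first cluster whose representative it resembles.

    Sequences are visited longest first; the first member of each cluster
    acts as its representative. Returns a list of clusters (lists of sequences).
    """
    clusters = []
    for s in sorted(seqs, key=len, reverse=True):
        for cluster in clusters:
            if similarity(s, cluster[0], k) >= threshold:
                cluster.append(s)
                break
        else:
            clusters.append([s])  # no match: s founds a new cluster
    return clusters


# Made-up peptide strings: the first two differ by one residue.
example = [
    "MKTAYIAKQRQISFVKSHFSRQ",
    "MKTAYIAKQRQISFVKSHFSRA",
    "GGGGSGGGGSGGGGSWWWW",
]
for cluster in greedy_cluster(example):
    print(len(cluster), cluster[0])
```

A real pipeline would replace `similarity` with alignment-based scores; the point here is only the grouping of predicted proteins into related clusters.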
Added Diversity
Figure panels: UVDE homologs; Rubisco homologs. Legend: GOS prokaryotes, GOS eukaryotes, GOS viral, known prokaryotes, known eukaryotes, known viral. Labeled organisms: H. marismortui, B. halodurans, T. thermophilus, B. anthracis, D. psychrophila, D. radiodurans.
Rate of Protein Discovery
Figure: rate of discovery. Number of clusters (thousands, 0 to 250) versus number of sequences (millions, 0 to 7), plotted for cluster size >=3, >=5, >=10, and >=20.
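A curve like the one on this slide can be computed by streaming through sequences and counting how many clusters have reached a minimum size so far. The sketch below assumes each sequence already carries a cluster label; the function name `discovery_curve` and the toy labels are illustrative and not from the original analysis.

```python
from collections import Counter


def discovery_curve(cluster_ids, min_size=3, step=1):
    """Count clusters with at least `min_size` members as sequences accumulate.

    `cluster_ids` is the cluster label of each sequence, in sampling order.
    Returns (sequences_seen, qualifying_clusters) pairs every `step` sequences.
    """
    sizes = Counter()
    qualifying = 0
    curve = []
    for n, cid in enumerate(cluster_ids, start=1):
        sizes[cid] += 1
        if sizes[cid] == min_size:  # the cluster crosses the threshold exactly once
            qualifying += 1
        if n % step == 0:
            curve.append((n, qualifying))
    return curve


# Toy labels for nine sequences falling into three clusters.
labels = ["a", "b", "a", "a", "b", "c", "b", "c", "c"]
for seen, clusters in discovery_curve(labels, min_size=3, step=3):
    print(seen, clusters)
```

If the curve is still rising at the right-hand edge, as the slide reports for GOS, new protein families are still being discovered at the current sampling depth.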
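Fragment Recruitment Viewer
Rusch et al, PLoS 3/2007
Figure: recruited reads plotted by reference genome coordinates (x-axis) against percent identity (y-axis; tick labels 100% and 55% in one panel, 100% and 50% in the other). Annotations: ribosomal operon; “core” genome, ~75% identical; sequence absent from most strains – phage/other lateral transfer?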
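The plot described above places each read at its position on a reference genome against its percent identity. A minimal sketch of preparing such points from alignment hits follows; the `Hit` record, the best-hit-per-read rule, and the 50% cutoff are assumptions for illustration, since the slides do not state the recruitment criteria used by Rusch et al.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    """One read-to-reference alignment (hypothetical record layout)."""
    read_id: str
    ref_start: int
    ref_end: int
    percent_identity: float


def recruit(hits, min_identity=50.0):
    """Keep the best hit per read above a cutoff; return sorted plot points.

    Each point is (midpoint on the reference, percent identity).
    """
    best = {}
    for h in hits:
        if h.percent_identity < min_identity:
            continue
        current = best.get(h.read_id)
        if current is None or h.percent_identity > current.percent_identity:
            best[h.read_id] = h
    return sorted(
        ((h.ref_start + h.ref_end) / 2, h.percent_identity) for h in best.values()
    )


# Made-up hits: r1 aligns twice, r3 falls below the cutoff.
hits = [
    Hit("r1", 100, 900, 98.0),
    Hit("r1", 5000, 5800, 71.0),
    Hit("r2", 2000, 2800, 76.5),
    Hit("r3", 3000, 3700, 42.0),
]
for position, identity in recruit(hits):
    print(position, identity)
```

Stretches of the reference with no points at all would correspond to the "sequence absent from most strains" regions annotated in the figure.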
Why CAMERA?
• Public repositories not focused on environmental metagenomics
- Sargasso Sea data underutilized by community
• Millions of dollars invested in sequencing and analysis, but results accessible only to the bioinformatics elite
• Release of GOS dataset in March 2007
• Comply with the Convention on Biological Diversity
CAMERA – http://camera.calit2.net
• “Convenient acronym for cumbersome name…”
- Henry Nichols, PLoS Biology
• Mission
- Enable Research in Marine Microbiology
• CAMERA Partners:
Challenges
• Enormous datasets with high gene density
- Large compute resources required
- 2 orders of magnitude jump
• Fragmentary data
- Inadequate bioinformatics tools for assembly, annotation, analysis, visualization
• Metadata standards non-existent
- Metadata absent from databases
- Lack of standards impedes collection of datasets
• Diversity of User Sophistication and Needs
CAMERA Services
• Maintain searchable sequence collections
- ALL metagenomic sequence reads, assemblies
- Non-identical amino acid collection (extended NRAA)
- Viral, Fungal, pico-Eukaryotes, Microbial
- CAMERA protein clusters
• Metagenomics data easily downloadable
• Interactive and Batch Search Facility
- Scalable parallel implementations of BLAST
- Integrated with associated metadata
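A "non-identical" collection keeps one copy of each distinct sequence while remembering where it came from. How CAMERA builds its extended NRAA is not described in these slides; the sketch below only shows the general idea, and the function name, the normalization rules (case folding, dropping a trailing `*` stop marker), and the records are assumptions.

```python
def non_identical(records):
    """Collapse exact duplicate sequences, keeping every source identifier.

    `records` is an iterable of (sequence_id, amino_acid_sequence) pairs.
    Returns a dict mapping each distinct sequence to the list of ids that had it.
    """
    ids_by_seq = {}
    for seq_id, seq in records:
        # Assumed normalization: ignore case, whitespace, and a trailing stop marker.
        key = seq.strip().upper().rstrip("*")
        ids_by_seq.setdefault(key, []).append(seq_id)
    return ids_by_seq


# Made-up records from two hypothetical source collections.
records = [
    ("gos_0001", "MKTAYIAKQR"),
    ("known_17", "mktayiakqr*"),
    ("gos_0002", "MSTNPKPQRK"),
]
collection = non_identical(records)
for seq, ids in collection.items():
    print(seq, ids)
```

Searching the collapsed collection means each distinct sequence is compared only once, while the id lists still let a hit be traced back to every source.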
Distinctive Features Set in Progress
• Graphical Tools for Visualizing Diversity
- Based on Rusch et al
- Fragment recruitment viewer
• CAMERA Protein Clusters
- Based on Yooseph et al
- Incremental version implemented in 2007
• Annotation
- Break through quadratic complexity via clusters
- Phyletic Classification
• Overviews of sequence collections
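The "quadratic complexity" point can be illustrated with simple arithmetic: comparing every protein against every other grows with the square of the collection size, while comparing each protein against one representative per cluster grows only linearly in the number of proteins. In the sketch below the 6 million figure echoes the ">6M proteins" on the GOS slide; the representative count of 250,000 is an arbitrary illustrative value, and both function names are hypothetical.

```python
def all_vs_all(n):
    """Number of unordered pairs among n sequences: n(n-1)/2."""
    return n * (n - 1) // 2


def against_representatives(n, n_reps):
    """Comparisons when each of n sequences is checked against n_reps cluster representatives."""
    return n * n_reps


n_proteins = 6_000_000   # order of magnitude from the GOS Phase 1 slide
n_reps = 250_000         # hypothetical number of cluster representatives

pairwise = all_vs_all(n_proteins)
via_clusters = against_representatives(n_proteins, n_reps)
print(f"all-vs-all comparisons:        {pairwise:,}")
print(f"vs cluster representatives:    {via_clusters:,}")
print(f"reduction factor (rounded):    {pairwise / via_clusters:.1f}")
```

The saving grows as the collection grows, because the pairwise count scales with the square of the collection size while the representative set typically grows far more slowly.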