Standardizing Metadata Associated with NIAID Genome Sequencing Center Projects Richard H. Scheuermann, Ph.D. Department of Pathology Division of Biomedical Informatics U.T. Southwestern Medical Center
Feb 23, 2016
NIAID Bioinformatics Resource Centers (www.pathogenportal.net)
Influenza Research Database (www.fludb.org)
NIAID Genome Sequencing Centers
Metadata Inconsistencies
• Each project was providing different types of metadata
• No consistent nomenclature being used
• Impossible to perform reliable comparative genomics analysis
Dengue Clinical Metadata
Complex Query Interface
Additional Clinical Characteristics
GSC-BRC Metadata Standards Working Group
• NIAID assembled a group of representatives from its three Genome Sequencing Centers for Infectious Diseases (Broad, JCVI, UMD) and its five Bioinformatics Resource Centers (EuPathDB, IRD, PATRIC, VectorBase, ViPR)
• Develop metadata standards for pathogen isolate sequencing projects
GSC-BRC Metadata Working Groups
Metadata Standards Process
• Divide into pathogen subgroups – viruses, bacteria, eukaryotic pathogens and vectors
• Collect example metadata sets from sequencing project white papers and other project sources (e.g. CEIRS)
• Identify data fields that appear to be common across projects within a pathogen subgroup (core) and data fields that appear to be project specific
• For each data field, provide:
– definitions
– synonyms
– allowed value sets, preferably using controlled vocabularies
– expected syntax
– examples
– data categories
– data providers
• Merge subgroup core elements into a common set of core metadata fields and attributes
• Assemble metadata fields into a semantic network
• Harmonize semantic network with the Ontology of Biomedical Investigation (OBI)
• Compare, harmonize, map to other relevant initiatives, including MIGS, MIMS, BioProjects, BioSamples
• Establish policies and procedures for metadata submission workflows and GenBank linkage
• Develop data submission spreadsheets to be used for all white paper and BRC-associated projects
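As a rough illustration of the per-field attributes listed above, a field specification could be captured as a small record with a validity check. This is a minimal sketch, not the working group's actual schema: the class name `FieldSpec`, the example field `specimen_collection_date`, and its date syntax are all hypothetical.

```python
import re
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class FieldSpec:
    """Hypothetical record holding the attributes listed for each data field."""
    name: str
    definition: str
    synonyms: list = field(default_factory=list)
    allowed_values: Optional[set] = None  # controlled vocabulary, if any
    syntax: Optional[str] = None          # expected syntax, written as a regex
    examples: list = field(default_factory=list)
    data_category: str = ""
    data_provider: str = ""

    def is_valid(self, value: str) -> bool:
        """Check a value against the allowed value set and the expected syntax."""
        if self.allowed_values is not None and value not in self.allowed_values:
            return False
        if self.syntax is not None and re.fullmatch(self.syntax, value) is None:
            return False
        return True


# Illustrative field only; name, syntax, category and provider are assumptions.
collection_date = FieldSpec(
    name="specimen_collection_date",
    definition="Date on which the specimen was collected",
    synonyms=["collection date"],
    syntax=r"\d{4}-\d{2}-\d{2}",
    examples=["2011-03-15"],
    data_category="specimen",
    data_provider="specimen collector",
)
```

Calling `collection_date.is_valid(...)` on a submitted value then gives a simple pass/fail check that a submission spreadsheet could apply per column.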
Core Sample Metadata
30 Core Sample Metadata Fields
Core Project Metadata
16 Core Project Metadata Fields
Metadata Standards Process
• Divide into pathogen subgroups – viruses, bacteria, eukaryotic pathogens and vectors
• Collect example metadata sets from sequencing project white papers and other project sources (e.g. CEIRS)
• Identify data fields that appear to be common across projects within a pathogen subgroup (core) and data fields that appear to be project specific
• For each data field, provide:
– definitions
– synonyms
– allowed value sets, preferably using controlled vocabularies
– expected syntax
– examples
– data categories
– data providers
• Merge subgroup core elements into a common set of core metadata fields and attributes
• Assemble metadata fields into a semantic network (Scheuermann)
• Harmonize semantic network with the Ontology of Biomedical Investigation (OBI) (Stoeckert, Zheng)
• Compare, harmonize, map to other relevant initiatives, including MIGS, MIMS, BioProjects, BioSamples
• Establish policies and procedures for metadata submission workflows and GenBank linkage
• Develop data submission spreadsheets to be used for all white paper and BRC-associated projects
Specimen Isolation
[Figure: semantic network for specimen isolation. Entities: organism, organism part, environmental material, environment, equipment, person, specimen X, specimen isolation procedure X, isolation protocol, specimen type, specimen isolation procedure type, hypothesis, IRB/IACUC approval, affiliation, name, ID, comments, and temporal-spatial region (spatial region, temporal interval, GPS location, date/time, geographic location). Roles: specimen source role, specimen capture role, specimen collector role. Relations: has_input, has_output, plays, has_specification, has_part, has_quality, has_affiliation, has_authorization, denotes, located_in, instance_of, is_about. Nodes carry field labels v2 to v9, v15 to v19, b18, and b22 to b30.]
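The Specimen Isolation network can be read as a set of subject-relation-object statements. The sketch below assumes a plain list of tuples rather than any OBI tooling. The relation and node names are taken from the figure, but which nodes each relation connects is an assumption, since the figure layout did not survive extraction.

```python
# Each triple is (subject, relation, object). The pairings are assumed.
TRIPLES = [
    ("specimen isolation procedure X", "has_input", "organism"),
    ("specimen isolation procedure X", "has_output", "specimen X"),
    ("specimen isolation procedure X", "has_specification", "isolation protocol"),
    ("specimen isolation procedure X", "located_in", "temporal-spatial region"),
    ("person", "plays", "specimen collector role"),
    ("organism", "plays", "specimen source role"),
    ("ID", "denotes", "specimen X"),
    ("specimen X", "instance_of", "specimen type"),
]


def objects(triples, subject, relation):
    """Return every object linked to `subject` by `relation`, in list order."""
    return [o for s, r, o in triples if s == subject and r == relation]


def subjects(triples, relation, obj):
    """Return every subject linked to `obj` by `relation`, in list order."""
    return [s for s, r, o in triples if r == relation and o == obj]
```

For example, `objects(TRIPLES, "specimen isolation procedure X", "has_output")` looks up what the procedure produces, and `subjects(TRIPLES, "denotes", "specimen X")` looks up the identifiers attached to the specimen.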
Metadata Processes
[Figure: overview of linked processes, grouped as Investigation, Specimen Isolation, Material Processing, Sequencing Assay, and Data Processing, with Core-Project and Core-Specimen field groups. Entities: specimen source (organism or environmental), specimen collector, specimen, microorganism, microorganism genomic NA, enriched NA sample, input sample, reagents, technician, equipment (type, ID, qualities), isolation protocol, primary data, sequence data, genotype/serotype/gene data, sequence data record, GenBank ID, temporal-spatial region. Processes: specimen isolation process, sample processing, sequencing assay, data transformations (image processing, assembly, variant detection, serotype marker detection, gene detection), data archiving process. Relations: has_input, has_output, has_specification, has_part, has_quality, is_about, denotes, located_in, instance_of.]
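The overview chains processes so that the output of one step serves as the input of the next. A minimal sketch of that chaining follows. The step order and the input/output pairings are one interpretation of the figure, not a confirmed specification, and the function name `provenance` is hypothetical.

```python
# Each step is (process, has_input, has_output). Pairings are assumed.
STEPS = [
    ("specimen isolation process", "specimen source", "specimen"),
    ("sample processing", "specimen", "enriched NA sample"),
    ("sequencing assay", "enriched NA sample", "primary data"),
    ("data transformation", "primary data", "sequence data"),
    ("data archiving process", "sequence data", "sequence data record"),
]


def provenance(artifact, steps=STEPS):
    """Follow has_output/has_input links backwards from an artifact.

    Returns the processes in the order they were run, plus the original
    source that no recorded process produced.
    """
    produced_by = {out: (proc, inp) for proc, inp, out in steps}
    chain = []
    while artifact in produced_by:
        proc, inp = produced_by[artifact]
        chain.append(proc)
        artifact = inp
    return list(reversed(chain)), artifact
```

Tracing `provenance("sequence data record")` walks back through every step to the specimen source, which is the kind of question a linked metadata record is meant to answer.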
Generic Assay
[Figure: generic assay template. Entities: assay X, assay protocol, objectives, sample material X, input sample material X, material X, analyte X, person X, equipment X, primary data, quality x, and temporal-spatial region (spatial region, temporal interval, GPS location, date/time, geographic location). Identifiers: run ID, sample ID, lot #, serial #, name. Types: assay type, sample type, reagent type, species, equipment type. Roles: target role, reagent role, technician role, signal detection role. Relations: has_input, has_output, has_specification, has_part, has_quality, plays, denotes, located_in, instance_of, is_about.]
Generic Material Transformation
[Figure: generic material transformation template. Entities: material transformation X, material transformation protocol, objectives, sample material X, material X, output material X, person X, equipment X, quality x, and temporal-spatial region (spatial region, temporal interval, GPS location, date/time, geographic location). Identifiers: run ID, sample ID, lot #, serial #, name. Types: material transformation type, sample type, material type, reagent type, species, equipment type. Roles: target role, reagent role, technician role, signal detection role. Relations: has_input, has_output, has_specification, has_part, has_quality, plays, denotes, located_in, instance_of.]
Generic Data Transformation
[Figure: generic data transformation template. Entities: data transformation X, input data, output data, material X, algorithm, software, person X, and temporal-spatial region (spatial region, temporal interval, GPS location, date/time, geographic location). Identifiers: run ID, name. Types: data transformation type. Roles: data analyst role. Relations: has_input, has_output, has_specification, has_part, is_about, plays, denotes, located_in, instance_of.]
Generic Material (IC)
[Figure: generic material template. Entities: material X with parts material Y and material Z, material type, ID, quality x, quality y, and two temporal-spatial regions (each with spatial region, temporal interval, GPS location, date/time, geographic location). Relations: has_part, has_quality, denotes, instance_of, located_in.]
Conclusions
• Utility of semantic representation
– Identified gaps in data field list (e.g. temporal components)
– Identified gaps in ontology data standards (use case-driven standard development)
– Identified commonalities in data structures (reusable)
– Support for semantic queries and inferential analysis in future
• Two flavors of MIBBI
– Distinguish between minimum information to reproduce an experiment and the minimum information to structure in a database for query and analysis
• OBI-based framework is re-usable
– Sequencing => “omics”
• Practical issues about implementation strategies
– Challenge of using ontologies for preferred value sets
  • Can be large
  • May not directly match common language
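One common way to soften the mismatch between ontology labels and everyday language is a synonym table that maps submitted values onto a preferred term. The sketch below assumes a small hand-built table; the entries are illustrative only and are not drawn from any particular ontology, and the function names are hypothetical.

```python
# Preferred term -> common-language synonyms. Illustrative entries only.
PREFERRED_TERMS = {
    "Homo sapiens": ["human", "humans"],
    "Gallus gallus": ["chicken"],
}


def build_lookup(preferred_terms):
    """Map every label (preferred or synonym, case-insensitive) to its preferred term."""
    lookup = {}
    for term, synonyms in preferred_terms.items():
        for label in [term, *synonyms]:
            lookup[label.casefold()] = term
    return lookup


def normalize(value, lookup):
    """Return the preferred term for a submitted value, or None if unrecognized."""
    return lookup.get(value.strip().casefold())


LOOKUP = build_lookup(PREFERRED_TERMS)
```

Values that return `None` would need manual curation. For a large ontology the table itself becomes large, which is the first of the two practical issues noted above.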