Standardizing Metadata Associated with NIAID Genome Sequencing Center Projects

Richard H. Scheuermann, Ph.D.
Department of Pathology
Division of Biomedical Informatics
U.T. Southwestern Medical Center

NIAID Bioinformatics Resource Centers
www.pathogenportal.net

Influenza Research Database
www.fludb.org

NIAID Genome Sequencing Centers

Metadata Inconsistencies

• Each project was providing different types of metadata

• No consistent nomenclature being used

• Impossible to perform reliable comparative genomics analysis

Dengue Clinical Metadata

Complex Query Interface

Additional Clinical Characteristics

GSC-BRC Metadata Standards Working Group

• NIAID assembled a group of representatives from its three Genome Sequencing Centers for Infectious Diseases (Broad, JCVI, UMD) and five Bioinformatics Resource Centers (EuPathDB, IRD, PATRIC, VectorBase, ViPR)

• Develop metadata standards for pathogen isolate sequencing projects

GSC-BRC Metadata Working Groups

Metadata Standards Process

• Divide into pathogen subgroups – viruses, bacteria, eukaryotic pathogens and vectors
• Collect example metadata sets from sequencing project white papers and other project sources (e.g. CEIRS)
• Identify data fields that appear to be common across projects within a pathogen subgroup (core) and data fields that appear to be project specific
• For each data field, provide:
  – definitions
  – synonyms
  – allowed value sets, preferably using controlled vocabularies
  – expected syntax
  – examples
  – data categories
  – data providers
• Merge subgroup core elements into a common set of core metadata fields and attributes
• Assemble metadata fields into a semantic network
• Harmonize semantic network with the Ontology of Biomedical Investigation (OBI)
• Compare, harmonize, map to other relevant initiatives, including MIGS, MIMS, BioProjects, BioSamples
• Establish policies and procedures for metadata submission workflows and GenBank linkage
• Develop data submission spreadsheets to be used for all white paper and BRC-associated projects
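The per-field attributes listed above (definition, synonyms, allowed values, expected syntax, example, data category, data provider) can be sketched as a small record type with a validation method. This is only an illustrative sketch: the class name `MetadataField`, the two example fields, their regular expression, and their value sets are assumptions made here and are not taken from the working group's actual field list.

```python
import re
from dataclasses import dataclass, field
from typing import FrozenSet, List, Optional


@dataclass
class MetadataField:
    """One metadata field, described by the attributes named in the process list."""
    name: str
    definition: str
    synonyms: List[str] = field(default_factory=list)
    allowed_values: Optional[FrozenSet[str]] = None  # controlled vocabulary, if any
    syntax: Optional[str] = None                     # expected syntax as a regex
    example: str = ""
    category: str = ""                               # data category
    provider: str = ""                               # data provider

    def validate(self, value: str) -> List[str]:
        """Return a list of problems with value; an empty list means it passed."""
        problems = []
        if self.syntax is not None and re.fullmatch(self.syntax, value) is None:
            problems.append(f"{self.name}: {value!r} does not match the expected syntax")
        if self.allowed_values is not None and value not in self.allowed_values:
            problems.append(f"{self.name}: {value!r} is not in the allowed value set")
        return problems


# Hypothetical field definitions, for illustration only.
collection_date = MetadataField(
    name="collection_date",
    definition="Date on which the specimen was collected",
    synonyms=["sampling date"],
    syntax=r"\d{4}-\d{2}-\d{2}",
    example="2011-03-15",
    category="specimen",
    provider="specimen collector",
)

host_sex = MetadataField(
    name="host_sex",
    definition="Sex of the organism the specimen was taken from",
    allowed_values=frozenset({"male", "female", "unknown"}),
    example="female",
    category="specimen source",
    provider="specimen collector",
)

print(collection_date.validate("March 2011"))
print(host_sex.validate("female"))
```

Keeping the syntax rule and the value set on the field definition itself means a submission spreadsheet can be checked row by row against the same definitions that document the standard.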

Core Sample Metadata

30 Core Sample Metadata Fields

Core Project Metadata

16 Core Project Metadata Fields

Metadata Standards Process

• Divide into pathogen subgroups – viruses, bacteria, eukaryotic pathogens and vectors
• Collect example metadata sets from sequencing project white papers and other project sources (e.g. CEIRS)
• Identify data fields that appear to be common across projects within a pathogen subgroup (core) and data fields that appear to be project specific
• For each data field, provide:
  – definitions
  – synonyms
  – allowed value sets, preferably using controlled vocabularies
  – expected syntax
  – examples
  – data categories
  – data providers
• Merge subgroup core elements into a common set of core metadata fields and attributes
• Assemble metadata fields into a semantic network (Scheuermann)
• Harmonize semantic network with the Ontology of Biomedical Investigation (OBI) (Stoeckert, Zheng)
• Compare, harmonize, map to other relevant initiatives, including MIGS, MIMS, BioProjects, BioSamples
• Establish policies and procedures for metadata submission workflows and GenBank linkage
• Develop data submission spreadsheets to be used for all white paper and BRC-associated projects

Specimen Isolation

[Figure: semantic network for specimen isolation. Nodes: organism, environmental material, equipment, person, specimen source role, specimen capture role, specimen collector role, temporal-spatial region, spatial region, temporal interval, GPS location, date/time, geographic location, specimen X, specimen type, specimen isolation procedure X, specimen isolation procedure type, isolation protocol, name, affiliation, ID, organism part hypothesis, IRB/IACUC approval, environment, comments. Relations: has_input, has_output, plays, has_specification, has_part, denotes, located_in, has_affiliation, instance_of, is_about, has_authorization, has_quality. Node annotations: v2, v3-4, v5-6, v7, v8, v9, v15, v16, v17, v18, v19, b18, b22, b23, b24, b25, b26, b27, b28, b29, b30.]
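A semantic network of this kind can be held as a list of subject-relation-object triples and queried by relation. The sketch below reuses node and relation names from the Specimen Isolation network (has_input, has_output, plays, denotes, and so on), but the specific edges are assumptions chosen for illustration, not a transcription of the diagram; the helper names `objects` and `subjects` are likewise hypothetical.

```python
# Each edge is a (subject, relation, object) triple.
# The edges below are assumed for illustration only.
TRIPLES = [
    ("specimen isolation procedure X", "instance_of", "specimen isolation procedure type"),
    ("specimen isolation procedure X", "has_specification", "isolation protocol"),
    ("specimen isolation procedure X", "has_input", "organism"),
    ("specimen isolation procedure X", "has_output", "specimen X"),
    ("specimen isolation procedure X", "located_in", "temporal-spatial region"),
    ("temporal-spatial region", "has_part", "spatial region"),
    ("temporal-spatial region", "has_part", "temporal interval"),
    ("organism", "plays", "specimen source role"),
    ("person", "plays", "specimen collector role"),
    ("ID", "denotes", "specimen X"),
    ("specimen X", "instance_of", "specimen type"),
]


def objects(subject, relation):
    """All nodes that subject points to through relation, in insertion order."""
    return [o for s, r, o in TRIPLES if s == subject and r == relation]


def subjects(relation, obj):
    """All nodes that point to obj through relation, in insertion order."""
    return [s for s, r, o in TRIPLES if r == relation and o == obj]


# What does the isolation procedure produce, and what identifies that product?
for product in objects("specimen isolation procedure X", "has_output"):
    print(product, "is denoted by", subjects("denotes", product))
```

Because every metadata field becomes a node attached by a named relation, a field that has no place to attach (a missing date, for example) shows up as a gap in the network.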

Metadata Processes

[Figure: chain of processes in a sequencing investigation. Panels: Investigation, Specimen Isolation, Material Processing, Sequencing Assay, Data Processing. Field groups: Core-Project, Core-Specimen. Nodes: specimen source (organism or environmental), specimen collector, specimen isolation process, isolation protocol, specimen, microorganism, sample processing, enriched NA sample, microorganism genomic NA, input sample, reagents, technician, equipment, type, ID, qualities, sequencing assay, primary data, data transformations (image processing, assembly, variant detection, serotype marker detection, gene detection), sequence data, genotype/serotype/gene data, data archiving process, sequence data record, GenBank ID. Relations: has_input, has_output, has_specification, has_part, is_about, denotes, located_in, has_quality, instance_of.]
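Chaining processes through has_input and has_output makes it possible to trace a data record back to the specimen it came from. The following is a minimal sketch under stated assumptions: the process and material names are taken from the Metadata Processes slide, but which output feeds which input is assumed here, and `trace_back` is a hypothetical helper.

```python
# Each process lists what it consumes and what it produces.
# The input/output pairings are assumed for illustration only.
PROCESSES = [
    {"name": "specimen isolation process",
     "has_input": ["specimen source"], "has_output": ["specimen"]},
    {"name": "sample processing",
     "has_input": ["specimen"], "has_output": ["enriched NA sample"]},
    {"name": "sequencing assay",
     "has_input": ["enriched NA sample"], "has_output": ["primary data"]},
    {"name": "data transformations",
     "has_input": ["primary data"], "has_output": ["sequence data"]},
    {"name": "data archiving process",
     "has_input": ["sequence data"], "has_output": ["sequence data record"]},
]


def trace_back(item):
    """Return the names of the processes that led to item, most recent first."""
    chain = []
    seen = set()
    frontier = [item]
    while frontier:
        current = frontier.pop()
        for proc in PROCESSES:
            if current in proc["has_output"] and proc["name"] not in seen:
                seen.add(proc["name"])
                chain.append(proc["name"])
                # Continue the walk from everything this process consumed.
                frontier.extend(proc["has_input"])
    return chain


print(trace_back("sequence data record"))
```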

Generic Assay

[Figure: generic assay pattern. Nodes: assay X, assay type, assay protocol, run ID, objectives, sample material X, input sample material X, sample ID, sample type, target role, analyte X, material X, lot #, reagent type, reagent role, person X, name, species, technician role, equipment X, serial #, equipment type, signal detection role, quality x, primary data, spatial region, temporal interval, GPS location, date/time, geographic location. Relations: has_input, has_output, has_specification, has_part, has_quality, plays, denotes, instance_of, is_about, located_in.]

Generic Material Transformation

[Figure: generic material transformation pattern. Nodes: material transformation X, material transformation type, material transformation protocol, run ID, objectives, sample material X, sample ID, sample type, target role, material X, lot #, reagent type, reagent role, person X, name, species, technician role, equipment X, serial #, equipment type, signal detection role, output material X, material type, quality x, spatial region, temporal interval, geographic location. Relations: has_input, has_output, has_specification, has_part, has_quality, plays, denotes, instance_of, located_in.]

Generic Data Transformation

[Figure: generic data transformation pattern. Nodes: data transformation X, data transformation type, input data, output data, material X, algorithm, software, person X, name, data analyst role, run ID, spatial region, temporal interval, geographic location. Relations: has_input, has_output, has_specification, has_part, is_about, plays, denotes, instance_of, located_in.]

Generic Material (IC)

[Figure: generic material pattern. Nodes: material X, ID, material type, quality x, material Y, material Z, quality y, spatial region, temporal interval, GPS location, date/time, geographic location. Relations: has_quality, has_part, denotes, instance_of, located_in.]

Conclusions

• Utility of semantic representation
  – Identified gaps in data field list (e.g. temporal components)
  – Identified gaps in ontology data standards (use case-driven standard development)
  – Identified commonalities in data structures (reusable)
  – Support for semantic queries and inferential analysis in future
• Two flavors of MIBBI
  – Distinguish between minimum information to reproduce an experiment and the minimum information to structure in a database for query and analysis
• OBI-based framework is re-usable
  – Sequencing => “omics”
• Practical issues about implementation strategies
  – Challenge of using ontologies for preferred value sets
    • Can be large
    • May not directly match common language
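One possible way to approach the mismatch between ontology terms and common language is a synonym lookup that maps what submitters type onto a preferred term. The sketch below is an assumption-laden illustration: the identifiers `EX:0001` and `EX:0002` are placeholders rather than real ontology IDs, and the labels, synonyms, and function names are hypothetical.

```python
# Hypothetical preferred terms with placeholder identifiers (not real ontology IDs).
PREFERRED_TERMS = {
    "EX:0001": {"label": "whole blood specimen", "synonyms": ["whole blood", "blood"]},
    "EX:0002": {"label": "blood serum specimen", "synonyms": ["serum", "sera"]},
}


def build_lookup(terms):
    """Map every label and synonym (case-insensitive) to its term identifier."""
    lookup = {}
    for term_id, entry in terms.items():
        for text in [entry["label"], *entry["synonyms"]]:
            lookup[text.strip().lower()] = term_id
    return lookup


LOOKUP = build_lookup(PREFERRED_TERMS)


def to_preferred(value):
    """Return the identifier for a submitted value, or None if it is unrecognized."""
    return LOOKUP.get(value.strip().lower())


for submitted in ["Serum", " blood ", "plasma"]:
    print(repr(submitted), "->", to_preferred(submitted))
```

Unrecognized values come back as `None` rather than being guessed at, so they can be routed to a curator and, where appropriate, added to the synonym list.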
