The importance of meta data capture – problems and solutions Helen Parkinson Microarray Informatics Team European Bioinformatics Institute NERC Meta Data Meeting Bristol, April 2003

Dec 25, 2015

Roger Wilson

The importance of meta data capture – problems and solutions

Helen Parkinson

Microarray Informatics Team

European Bioinformatics Institute

NERC Meta Data Meeting

Bristol, April 2003

Talk structure

Metadata

Annotation problems

What is an ontology?

The MGED ontology

Annotation and standards for microarray data

Annotation tools

The meta data challenge

Meta data is data about data

Microarray experiments (should) have lots of meta data of different types

Meta data varies by experiment and by community

Meta data is often free text

Free text is evil in a database context

Informatics resources for biologists

Over 500 databanks and analysis tools that work over various resources

Repositories of knowledge and data at various levels, primary and secondary databases, and interfaces, e.g. EMBL, Swiss-Prot, Ensembl

Knowledge often held as free text; limited use made of controlled vocabularies

Enormous amount of semantic heterogeneity and poor query facilities

Lots of effort goes into filtering databases to add value, e.g. RefSeq from GenBank

What are the problems with free text?

It’s slow to search

It’s written by humans, and humans are error prone (typos, synonyms)

Humans have limited processing power

Meaning is not explicit in free text

Two options: develop tools for NLP (~40% efficient)

Or eliminate free text
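
The typo and synonym problem can be made concrete with a small sketch. The sample annotations and the synonym table below are invented for illustration; they are not taken from any database, and a real controlled vocabulary would be far larger.

```python
# Hypothetical free-text organism annotations, as different submitters might type them.
free_text_annotations = [
    "Mus musculus",
    "mouse",
    "Mus muscullus",  # typo
    "house mouse",
    "Homo sapiens",
]

# A tiny controlled vocabulary: each accepted synonym maps to one preferred term.
# The entries are illustrative assumptions, not an official list.
SYNONYMS = {
    "mus musculus": "Mus musculus",
    "mouse": "Mus musculus",
    "house mouse": "Mus musculus",
    "homo sapiens": "Homo sapiens",
    "human": "Homo sapiens",
}


def exact_search(records, query):
    """Naive free-text search: only finds records spelled exactly like the query."""
    return [r for r in records if r == query]


def normalise(record):
    """Map a free-text value to its preferred term, or None if it is not recognised."""
    return SYNONYMS.get(record.strip().lower())


def controlled_search(records, preferred_term):
    """Find every record whose normalised form equals the preferred term."""
    return [r for r in records if normalise(r) == preferred_term]


exact_hits = exact_search(free_text_annotations, "Mus musculus")
controlled_hits = controlled_search(free_text_annotations, "Mus musculus")
unrecognised = [r for r in free_text_annotations if normalise(r) is None]
```

In this toy data the exact search finds one of the four mouse records, the synonym lookup finds three, and the misspelled record ends up in `unrecognised`, where a curator can see it instead of it being silently missed.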

Why is there so much free text?

Historically there was a small amount of data and a limited number of databases that held the data

Humans like free text and are resistant to change

There was nothing better than free text, though use of simple lists helped (e.g. SP keywords)

Gene names and free text

The EMBL flat file controls only species nomenclature; everything else is free text: genes, keywords (KW), descriptions etc.

Searching primary databases which are redundant and annotated with free text is hard work and relies upon the user inferring information from what is supplied

Reproducing this for new databases isn’t desirable

Without resources to help them, curators cannot control free text efficiently

Search for “Ssp1” gene in DDBJ/EMBL/GenBank using SRS

gene=“Ssp1” Species=“S.pombe”

1: AB027913 Schizosaccharomyces pombe gene for Ser/Thr protein kinase, partial cds, clone:TA76
2: AL441624 S.pombe chromosome I cosmid c110
3: AL159180 S.pombe chromosome I P1 p14E8
4: AL049609 S.pombe chromosome III cosmid c297
5: AL136235 S.pombe chromosome I cosmid c664
6: D45882 Yeast ssp1 gene for protein kinase, complete cds
7: X59987 S.pombe SSP1 gene for mitochondrial Hsp70 protein (Ssp1)
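
A minimal sketch of why these hits are hard to interpret. The accession numbers and description lines are copied from the search result on this slide; the word-matching helper is a hypothetical way to inspect them and says nothing about how SRS itself works.

```python
import re

# The seven hits from the search, as (accession, description) pairs.
hits = [
    ("AB027913", "Schizosaccharomyces pombe gene for Ser/Thr protein kinase, partial cds, clone:TA76"),
    ("AL441624", "S.pombe chromosome I cosmid c110"),
    ("AL159180", "S.pombe chromosome I P1 p14E8"),
    ("AL049609", "S.pombe chromosome III cosmid c297"),
    ("AL136235", "S.pombe chromosome I cosmid c664"),
    ("D45882", "Yeast ssp1 gene for protein kinase, complete cds"),
    ("X59987", "S.pombe SSP1 gene for mitochondrial Hsp70 protein (Ssp1)"),
]


def mentions(description, name):
    """True if the description contains the gene name as a whole word, ignoring case."""
    pattern = rf"\b{re.escape(name)}\b"
    return re.search(pattern, description, flags=re.IGNORECASE) is not None


# Hits whose description line names the gene, and hits that leave the user guessing.
named = [acc for acc, desc in hits if mentions(desc, "ssp1")]
unnamed = [acc for acc, desc in hits if not mentions(desc, "ssp1")]
```

Only two of the seven description lines mention the name at all, and those two describe different proteins (a protein kinase and a mitochondrial Hsp70), so the user has to infer which record is the gene they meant.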

Whose responsibility?

Many important projects have no coordination of standards, e.g. for gene nomenclature, disease state, developmental stage

There are competing resources in some domains, and none in others

Successful projects are built by a community or with community input, e.g. GO

What is an ontology?

Captures knowledge for both humans and computer applications

Has a set of vocabulary definitions that capture a community’s knowledge of a domain

‘An ontology may take a variety of forms, but necessarily it will include a vocabulary of terms, and some specification of their meaning. This includes definitions and an indication of how concepts are inter-related which collectively impose a structure on the domain and constrain the possible interpretations of terms.’

It is more than a controlled vocabulary: it has structure (but a CV is a good place to start)

Advantages of using ontologies

Meaning is explicit

Meaning is human and computer readable

Ease of updating, no need to find terms in free text and change them

Data transfer possible without loss of meaning

Reasoning to aid queries, annotation etc.

Building Ontologies

Simple lists work well but adding structure adds reasoning power

Class: a container for information; it has a definition and a relationship to other classes (is-a, part-of, kind-of)

Instances: terms that are contained within a class

Example of a class, subclass relationship and a constraint

Class Domestic Cat

sub-class of Pet

slot constraint cleans itself, eats stinky food

slot has filler

slot has instance Suki

Just a formalised way to say that a domestic cat is a pet that you don’t have to clean, but this is machine readable, and I can use it to classify
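
A minimal sketch of what "machine readable, and I can use it to classify" could look like in code. The Domestic Cat class, its constraints and the instance Suki come from the slide; the Dog class, the dictionary layout and the function names are assumptions made for illustration, and a real ontology language is far richer than this.

```python
# is-a links: class name -> parent class name (None marks a root class).
PARENT = {
    "Pet": None,
    "Domestic Cat": "Pet",
    "Dog": "Pet",  # hypothetical second class, added so classification has a choice
}

# Slot constraints: properties every member of the class must have.
SLOT_CONSTRAINTS = {
    "Domestic Cat": {"cleans itself", "eats stinky food"},
    "Dog": {"needs walking"},  # hypothetical
}

# Instances: terms contained within a class.
INSTANCES = {"Domestic Cat": ["Suki"]}


def is_a(cls, ancestor):
    """Follow is-a links upwards; True if cls is ancestor or one of its subclasses."""
    while cls is not None:
        if cls == ancestor:
            return True
        cls = PARENT[cls]
    return False


def instances_of(cls):
    """All instances of a class, including instances of its subclasses."""
    return sorted(
        instance
        for member_class, members in INSTANCES.items()
        if is_a(member_class, cls)
        for instance in members
    )


def classify(observed_properties):
    """Return the classes whose slot constraints are all met by the observed properties."""
    observed = set(observed_properties)
    return sorted(name for name, required in SLOT_CONSTRAINTS.items() if required <= observed)
```

With these definitions, asking for instances of Pet returns Suki even though Suki was only ever recorded as a Domestic Cat, and an unnamed animal that cleans itself and eats stinky food is classified as a Domestic Cat. That is the reasoning power that a flat list cannot provide.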

MIAME and ArrayExpress

MIAME

• Minimum Information About a Microarray Experiment.
• MIAME is a guideline for microarray experimenters to describe their data so that:
  • Sufficient information is recorded to:
    • Correctly interpret & verify their experiments.
    • Be able to replicate the experiments.
  • Structured information is recorded to:
    • Query and correctly retrieve the data.
    • Analyse the data.

MIAME

6 parts of a microarray experiment: Experiment, Sample, Array, Hybridisation, Normalisation, Data

[Diagram: details attached to the parts]

• Sample source, sample treatments, extraction protocol, labeling protocol
• Array design information, location of each element, description of each element
• Control array elements
• Statistical treatment
• Image, scanning protocol, software specifications
• Quantification matrix, analysis protocol, software specifications
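
One way to make a "minimum information" guideline checkable is to test a submission for the six parts before accepting it. The sketch below is only that: the part names come from the slide, while the dictionary layout, the field names and the sample submission are invented, and the real MIAME checklist is much more detailed.

```python
# The six parts of a microarray experiment named on the slide.
MIAME_PARTS = ("experiment", "sample", "array", "hybridisation", "normalisation", "data")


def missing_parts(submission):
    """Return the parts that are absent from the submission or present but empty."""
    return [part for part in MIAME_PARTS if not submission.get(part)]


# Hypothetical, heavily abbreviated submission.
submission = {
    "experiment": {"title": "Fenofibrate treatment of mouse liver"},
    "sample": {"source": "liver", "treatments": ["fenofibrate"]},
    "array": {"design": "example array design"},
    "hybridisation": {},  # present but empty
    "data": {"image": "scan01.tiff"},
    # "normalisation" is missing entirely
}

gaps = missing_parts(submission)
```

Here `gaps` lists the hybridisation and normalisation parts, so the submitter can be asked for them before the data reach the repository.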

ArrayExpress

A public repository for microarray data, MIAME compliant

Uses the MAGE model (designed to hold MIAME compliant data)

Holds public and private data

Uses controlled vocabulary where possible

Can represent complex metadata and reference external resources, e.g. databases and ontologies

Infrastructure at the EBI

[Diagram components: ArrayExpress (Oracle); MIAMExpress (MySQL) and local MIAMExpress installations; submissions as MAGE-ML files and through MAGE-ML pipelines; MAGE-ML export from array manufacturers, LIMS, microarray software and data analysis software; data analysis with Expression Profiler; queries via www; external bioinformatic databases; other public microarray databases (GEO, CIBEX)]

Desirable Queries

Show me all experiments where gene x is on the array

Show me all the experiments where organism x is treated by compound Y

Return all experiments using developmental stage X, disease stage Y

– Sort by platform type
– Which are untreated? Treated?
  • Treated by what
  • How comparable are these?
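
Queries like these only become simple once each experiment carries structured fields instead of a free-text description. The sketch below assumes a hypothetical record layout; the accessions, gene names and platform labels are invented and do not correspond to ArrayExpress entries or to the MAGE model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Experiment:
    accession: str
    organism: str
    compound: Optional[str]  # None means untreated
    platform: str
    genes_on_array: frozenset


# Invented records, for illustration only.
experiments = [
    Experiment("EXP-1", "Mus musculus", "fenofibrate", "spotted cDNA", frozenset({"geneA", "geneB"})),
    Experiment("EXP-2", "Mus musculus", None, "oligonucleotide", frozenset({"geneA"})),
    Experiment("EXP-3", "Homo sapiens", "fenofibrate", "spotted cDNA", frozenset({"geneB", "geneC"})),
]


def with_gene(exps, gene):
    """Show me all experiments where gene x is on the array."""
    return [e.accession for e in exps if gene in e.genes_on_array]


def treated_with(exps, organism, compound):
    """Show me all experiments where organism x is treated by compound Y."""
    return [e.accession for e in exps if e.organism == organism and e.compound == compound]


def untreated(exps):
    """Which are untreated?"""
    return [e.accession for e in exps if e.compound is None]


def by_platform(exps):
    """Sort by platform type (stable, so ties keep their original order)."""
    return [e.accession for e in sorted(exps, key=lambda e: e.platform)]
```

Each query is a one-line filter because the organism, compound and platform are separate, controlled fields. Against free-text descriptions the same questions would need text mining.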

MAGE-OM and Ontology Entries

ArrayExpress uses the MAGE-OM

Requires OntologyEntries of 3 types

– Simple lists, e.g. image format: GIF, TIFF etc.
– Infinite, e.g. anything that could be meta data relating to a sample: species, disease state
– In between: types of protocols, types of data transformation, types of BioSequence etc.
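
The difference between a "simple list" entry and an "infinite" entry can be sketched as follows. The `OntologyEntry` class here is a simplified stand-in with assumed field names, not the MAGE-OM class definition: a simple-list value is validated against a local enumeration, while an open-ended value such as a species is accepted only together with a pointer to the external resource that defines it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class OntologyEntry:
    category: str                 # what kind of thing is annotated, e.g. "ImageFormat"
    value: str                    # the term itself
    source: Optional[str] = None  # external resource defining the term, if any


# Simple list: the full set of allowed values is short and can be held locally.
IMAGE_FORMATS = {"GIF", "TIFF"}


def make_list_entry(category, value, allowed):
    """Create an entry for a simple-list category, rejecting values not in the list."""
    if value not in allowed:
        raise ValueError(f"{value!r} is not an allowed {category}")
    return OntologyEntry(category, value)


def make_external_entry(category, value, source):
    """Create an entry for an open-ended category by pointing at an external resource."""
    return OntologyEntry(category, value, source)


tiff = make_list_entry("ImageFormat", "TIFF", IMAGE_FORMATS)
species = make_external_entry("Organism", "Mus musculus", "NCBI Taxonomy")

try:
    make_list_entry("ImageFormat", "BMP", IMAGE_FORMATS)
    bmp_accepted = True
except ValueError:
    bmp_accepted = False
```

The "in between" categories could use either approach, depending on whether the list of terms is stable enough to maintain locally.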

ArrayExpress Conceptual Model

[Diagram: Experiment, Sample, Hybridisation, Array, Normalisation and Data, with external links to Publication, Source (e.g., Taxonomy) and Gene (e.g., EMBL)]

MGED

MGED is the Microarray Gene Expression Database group

It’s a group who develop resources, including the object model and ontologies, for the community

It includes SMD, TIGR, Affymetrix, Agilent etc.

Conference in Aix-en-Provence, Sep 2003

www.mged.org

The MGED Ontology: A framework for describing functional genomics experiments

The MGED Ontology

An ontology for microarray experiments

– Parts are applicable to describing experiments in general, with a focus on microarray
– Built by people with legacy data problems: EBI, U.Penn, TIGR, SMD, UC Berkeley, NIH, plus contributions from the mailing list
– Supports the ontology requirements of the MAGE model

Our approach to interfacing with other ontologies is “experimental”

– Provide a framework to point to other ontologies
  • Know where to find different types of annotation
  • How to interpret that annotation

The MGED ontology is not

Limited to any one domain or species

Modelling the real world or reinventing the MAGE model

Mapping terms from external non orthogonal ontologies

Recommending one ontology over another (though some are not freely available)

Just for microarrays: the same concepts, like ‘assay’, apply to phenotype, and we want to make a reusable resource

Current MGED Ontology

MGED Ontology: BioSequence

Organising by sub-classing

In MAGE, Spots relate to BioSequences; these have types: gene, intergenic sequence, clone, PCR product, EST

BioSequenceType

MGED Ontology: limiting redundancy

MGED Ontology: OntologyEntry

Example of External Terms

Ontology in Browseable Form

[Diagram: MGED Ontology; ArrayExpress, MIAMExpress and RAD; MAGE-ML data exchange]

Ontology instances propagated to submission/annotation web forms

User defined terms collected via forms

Curation of user defined terms, before inclusion in the ontology

Classes shown: BiomaterialDescription, Sex, Gender

documentation: Subclass of sex applicable to heterogametic species (i.e., those in which the sexes produce gametes of markedly different size). Males produce small numerous gametes. Females produce small numbers of large gametes. Hermaphrodites are individuals with both male and female characteristics. Mixed refers to a population of individuals with more than one type of gender.

used in individuals: female, hermaphrodite, male, mixed_sex, unknown_sex
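
The loop described on this slide (ontology instances offered in web forms, user defined terms collected and curated before inclusion) can be sketched in a few lines. The instance list is the one given for the Gender class above; the function names and the curation queue are hypothetical and do not describe the MIAMExpress implementation.

```python
# Instances listed for the Gender class on the slide.
GENDER_INSTANCES = ("female", "hermaphrodite", "male", "mixed_sex", "unknown_sex")


def form_choices():
    """Values a submission form would offer in its drop-down list."""
    return list(GENDER_INSTANCES)


def submit_term(value, pending_curation):
    """Accept a known instance; otherwise queue the user defined term for a curator.

    Returns the accepted ontology term, or None if the value was queued.
    """
    term = value.strip().lower().replace(" ", "_")
    if term in GENDER_INSTANCES:
        return term
    pending_curation.append(value)
    return None


curation_queue = []
accepted = submit_term("Female", curation_queue)
also_accepted = submit_term("mixed sex", curation_queue)
queued = submit_term("F", curation_queue)
```

The value "F" is not rejected outright: it waits in `curation_queue` until a curator either maps it to an existing instance or adds a new term to the ontology.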

Turning free text into controlled annotation

Sample source and treatment description, and its correct annotation using the MGED ontology

“Seven week old C57BL/6N mice were treated with fenofibrate. Liver was dissected out, RNA prepared”

MGED BioMaterial Ontology classes, with ontology instances and external references

BioMaterialDescription
BiosourceProperty
Organism: Mus musculus musculus, id: 39442 (external reference: NCBI Taxonomy)
Age: 7 weeks after birth
DevelopmentStage: Stage 28 (external reference: Mouse Anatomical Dictionary)
Sex: Female
StrainOrLine: C57BL/6 (external reference: International Committee on Standardized Genetic Nomenclature for Mice)
BiosourceProvider: Charles River, Japan
OrganismPart: Liver (external reference: Mouse Anatomical Dictionary)
BioMaterialManipulation
EnvironmentalHistory
CultureCondition
Temperature: 22 ± 2 °C
Humidity: 55 ± 5%
Light: 12 hours light/dark cycle
PathogenTests: Specified pathogen free conditions
Water: ad libitum
Nutrients: MF, Oriental Yeast, Tokyo, Japan
Treatment
CompoundBasedTreatment
(Compound): Fenofibrate, CAS 49562-28-9 (external reference: ChemIDplus)
(Treatment_application): in vivo, oral gavage
(Measurement): 100mg/kg body weight
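
A sketch of how part of this annotation might look as a data structure, next to the original free-text sentence. The values and resource names are taken from the slide, and the pairing of classes with values follows their order there; the dictionary layout and key names are assumptions, not the MAGE-ML representation.

```python
free_text = (
    "Seven week old C57BL/6N mice were treated with fenofibrate. "
    "Liver was dissected out, RNA prepared"
)

# Structured annotation: class name -> value, plus the external resource if one is referenced.
annotation = {
    "Organism": {"value": "Mus musculus musculus", "source": "NCBI Taxonomy", "id": "39442"},
    "Age": {"value": "7 weeks after birth"},
    "StrainOrLine": {
        "value": "C57BL/6",
        "source": "International Committee on Standardized Genetic Nomenclature for Mice",
    },
    "OrganismPart": {"value": "Liver", "source": "Mouse Anatomical Dictionary"},
    "Compound": {"value": "Fenofibrate", "source": "ChemIDplus", "id": "CAS 49562-28-9"},
}

# A search by scientific name misses the free text, which only says "mice" ...
found_in_free_text = "Mus musculus" in free_text

# ... but succeeds against the structured Organism field.
found_in_annotation = annotation["Organism"]["value"].startswith("Mus musculus")

# Classes whose values can be resolved in an external resource.
externally_referenced = sorted(name for name, entry in annotation.items() if "source" in entry)
```

The free-text sentence never contains the species name, so the query for the scientific name fails there and succeeds on the annotation.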

Care when using ontologies

Ontologies are rarely complete

Ontologies are fit for the purpose for which they are designed, not off the shelf solutions to your problem

Building ontologies is hard: it needs both domain experts and tools

A simple list is an excellent start

Possible solutions

Think about what you want to query – this determines what you will annotate

Look for existing resources and use them if they are appropriate

Do the doable first

Share your resources

Quote

‘Most biologists would rather share their toothbrush than share a gene name’

Michael Ashburner

Acknowledgements

Chris Stoeckert, Trish Whetzel, Joe White, Cathy Ball, Paul Spellman - MGED ontology

Microarray Informatics Team, EBI

Robert Stevens, University of Manchester

Funding: EMBL, EU, ILSI