Ontologies and data integration in biomedicine

Olivier Bodenreider

Lister Hill National Center for Biomedical Communications

Bethesda, Maryland - USA

Kno.e.sis, Wright State University, Dayton, Ohio

May 27, 2009

Lister Hill National Center for Biomedical Communications 2

Outline

• Why integrate data?
• Ontologies and data integration
• Examples
• Challenging issues

Why integrate data?

Why integrate data?

• Sources of information
  • Created by
    • Independent researchers
    • Separate workflows
  • Heterogeneous
  • Scattered
  • “Silos”
• To identify patterns in integrated datasets
  • Hypothesis generation
  • Knowledge discovery

Motivation: Translational research

• “Bench to Bedside”
• Integration of clinical and research activities and results
• Supported by research programs
  • NIH Roadmap
  • Clinical and Translational Science Awards (CTSA)
• Requires the effective integration and exchange of information between
  • Basic research
  • Clinical research

Genotype and phenotype [Goh, PNAS 2007]

• OMIM
• [HPO]

Genes and environmental factors [Liu, BMC Bioinf. 2008]

• MEDLINE (MeSH index terms)
• Genetic Association Database

Integrating drugs and targets [Yildirim, Nature Biot. 2007]

• DrugBank
• ATC
• Gene Ontology

Why ontologies?

Uses of biomedical ontologies

• Knowledge management
  • Annotating data and resources
  • Accessing biomedical information
  • Mapping across biomedical ontologies
• Data integration, exchange and semantic interoperability
• Decision support
  • Data selection and aggregation
  • Decision support
  • NLP applications
  • Knowledge discovery

[Bodenreider, YBMI 2008]

Terminology and translational research

[Figure: cancer basic research (NCI Thesaurus) and EHR data on cancer patients (SNOMED CT)]

Approaches to data integration (1)

• Warehousing
  • Sources to be integrated are transformed into a common format and converted to a common vocabulary
• Mediation
  • Local schema (of the sources)
  • Global schema (in reference to which the queries are made)

[Stein, Nature Rev. Gen. 2003] [Hernandez, SIGMOD Rec. 2004] [Goble, J. Biomedical Informatics 2008]
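The warehousing idea (common format plus common vocabulary) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two sources, their field names, the local codes and the `COMMON_VOCABULARY` table are all hypothetical and do not come from any system named in the talk.

```python
# Hypothetical lookup from (source, local code) to a shared concept identifier.
COMMON_VOCABULARY = {
    ("lab_a", "ADD"): "disease:addison",
    ("lab_b", "17"): "disease:addison",
    ("lab_b", "42"): "disease:asthma",
}


def to_warehouse_row(source, record, id_field, code_field):
    """Transform one source record into the common format and vocabulary."""
    return {
        "source": source,
        "subject": record[id_field],
        "concept": COMMON_VOCABULARY[(source, record[code_field])],
    }


# Two hypothetical sources with different schemas and local codes.
lab_a = [{"pt": "a-001", "dx": "ADD"}]
lab_b = [
    {"patient_id": "b-007", "diagnosis": "17"},
    {"patient_id": "b-008", "diagnosis": "42"},
]

# Load everything into a single list that plays the role of the warehouse.
warehouse = [to_warehouse_row("lab_a", r, "pt", "dx") for r in lab_a]
warehouse += [to_warehouse_row("lab_b", r, "patient_id", "diagnosis") for r in lab_b]

# One query now spans both sources.
addison_subjects = sorted(
    row["subject"] for row in warehouse if row["concept"] == "disease:addison"
)
print(addison_subjects)
```

Once the rows share a format and a vocabulary, a single filter on `concept` covers both sources, which is the point of the transformation step.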

Approaches to data integration (2)

• Linked data
  • Links among data elements
  • Enable navigation by humans

[Stein, Nature Rev. Gen. 2003] [Hernandez, SIGMOD Rec. 2004] [Goble, J. Biomedical Informatics 2008]

Ontologies and warehousing

• Role
  • Provide a conceptualization of the domain
    • Help define the schema
    • Information model vs. ontology
  • Provide value sets for data elements
  • Enable standardization and sharing of data
• Examples
  • Annotations to the Gene Ontology
  • BioWarehouse (http://biowarehouse.ai.sri.com/)
  • Clinical information systems

Ontologies and mediation

• Role
  • Reference for defining the global schema
  • Map between local and global schemas
    • Query reformulation
    • Local-as-view vs. Global-as-view
• Examples
  • TAMBIS [Stevens, Bioinformatics 2000]
  • BioMediator [Louie, AMIA 2005]
  • OntoFusion [Perez-Rey, Comput Biol Med 2006]
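A minimal sketch of mediation in the global-as-view style, in Python. It is not how TAMBIS, BioMediator or OntoFusion are implemented. The two local sources, their field names and the second record are hypothetical; only the gene symbol LARGE and the MIM number 608840 appear elsewhere in this talk.

```python
# Hypothetical local sources, each with its own schema.
SOURCE_X = [{"gene_symbol": "LARGE", "omim": "608840"}]
SOURCE_Y = [{"symbol": "GENE2", "mim_number": "000000"}]  # placeholder record


def gene_phenotype_view():
    """Global relation gene_phenotype(gene, phenotype), defined over the sources.

    In global-as-view, each global relation is written as a view over the
    local schemas, so the mapping lives inside this function.
    """
    for row in SOURCE_X:
        yield {"gene": row["gene_symbol"], "phenotype": "MIM:" + row["omim"]}
    for row in SOURCE_Y:
        yield {"gene": row["symbol"], "phenotype": "MIM:" + row["mim_number"]}


def query_global(gene):
    """Answer a query phrased against the global schema by unfolding the view."""
    return [r["phenotype"] for r in gene_phenotype_view() if r["gene"] == gene]


print(query_global("LARGE"))
```

The data stay in the sources. Only the query is reformulated, which is the main contrast with the warehousing approach.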

Ontologies and linked data

• Role
  • Explicit conceptualization of the domain
  • Semantic normalization of data elements
• Examples
  • Entrez [http://www.ncbi.nlm.nih.gov/]
  • Semantic Web mashups [J. Biomedical Informatics 41(5) 2008]
  • Bio2RDF [http://bio2rdf.org/]

Ontologies and data integration

• Warehouse approaches
  • Source of identifiers for biomedical entities
  • Semantic normalization
• Mediator-based approaches
  • Source of reference relations for the global schema
  • Mapping between local and global schemas
• Linked data approaches
  • Source of identifiers for biomedical entities
  • Semantic normalization
  • Explicit conceptualization of the domain

Ontologies and data aggregation

• Source of hierarchical relations
• Aggregate data into coarser categories
• Abstract away from low-frequency, fine-grained data points
• Increase power
• Improve visualization
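Aggregation along hierarchical relations can be sketched as a roll-up of counts from fine-grained terms to their ancestors. The small is_a hierarchy and the counts below are hypothetical, chosen only to show the mechanics.

```python
from collections import Counter

# Hypothetical is_a hierarchy (child -> parent); the terms are illustrative.
PARENT = {
    "viral pneumonia": "pneumonia",
    "bacterial pneumonia": "pneumonia",
    "pneumonia": "lung disease",
    "asthma": "lung disease",
}


def self_and_ancestors(term):
    """Yield the term itself, then each ancestor, nearest first."""
    while term is not None:
        yield term
        term = PARENT.get(term)


def roll_up(counts):
    """Propagate every term's count to all of its ancestors."""
    totals = Counter()
    for term, n in counts.items():
        for t in self_and_ancestors(term):
            totals[t] += n
    return totals


observed = {"viral pneumonia": 2, "bacterial pneumonia": 3, "asthma": 4}
aggregated = roll_up(observed)
print(aggregated["pneumonia"], aggregated["lung disease"])  # 5 9
```

Sparse counts on the leaves become larger counts on coarser categories, which is what gives the aggregated data more statistical power and a simpler picture.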

Examples

Gene Ontology
http://www.geneontology.org/

Annotating data

• Gene Ontology
  • Functional annotation of gene products in several dozen model organisms
• Various communities use the same controlled vocabularies
  • Enabling comparisons across model organisms
• Annotations
  • Assigned manually by curators
  • Inferred automatically (e.g., from sequence similarity)

GO Annotations for Aldh2 (mouse)

http://www.informatics.jax.org/

GO ALD4 in Yeast

http://db.yeastgenome.org/

GO Annotations for ALDH2 (Human)

http://www.ebi.ac.uk/GOA/

Integration applications

• Based on shared annotations
  • Enrichment analysis (within/across species)
  • Clustering (co-clustering with gene expression data)
• Based on the structure of GO
  • Closely related annotations
  • Semantic similarity
• Based on associations between gene products and annotations
  • Leveraging reasoning

[Bodenreider, PSB 2005] [Sahoo, Medinfo 2007] [Lord, PSB 2003]
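Enrichment analysis asks whether a GO term is annotated to more genes in a list of interest than chance would predict. One common way to score this, shown here as a sketch and not necessarily the method used in the cited work, is a one-sided hypergeometric test. The function name and the numbers in the example are illustrative assumptions.

```python
from math import comb


def hypergeom_enrichment_p(population, annotated, sample, hits):
    """P(X >= hits) for a hypergeometric draw.

    population: number of genes in the background
    annotated:  background genes annotated with the GO term
    sample:     number of genes in the list of interest
    hits:       genes in the list annotated with the GO term
    """
    upper = min(annotated, sample)
    favourable = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(hits, upper + 1)
    )
    return favourable / comb(population, sample)


# Illustrative numbers: 20 genes, 5 carry the term, 4 selected, 3 of them hit.
p_value = hypergeom_enrichment_p(population=20, annotated=5, sample=4, hits=3)
print(round(p_value, 4))
```

The same test applies within or across species as long as the annotations use the same GO terms, which is what shared controlled vocabularies make possible.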

Integration Entrez Gene + GO

[Figure: an Entrez Gene record (gene name, sequence, interactions, PubMed, OMIM, GO) linked to the Gene Ontology, connecting glycosyltransferase to congenital muscular dystrophy]

[Sahoo, Medinfo 2007]

From glycosyltransferase to congenital muscular dystrophy

[Figure: path through the integrated graph]
• EG:9215 (LARGE) has_molecular_function GO:0008375 (acetylglucosaminyltransferase)
• EG:9215 (LARGE) has_associated_phenotype MIM:608840 (Muscular dystrophy, congenital, type 1D)
• isa links connect GO:0008375 through GO:0008194 and GO:0016758 to GO:0016757 (glycosyltransferase)
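The path on this slide can be followed programmatically once the annotations and the GO hierarchy sit in one graph. The sketch below uses the identifiers shown on the slide. The exact isa edges between the four GO terms are an assumption reconstructed from the figure, and the function names are illustrative.

```python
# Triples from the slide; the individual isa edges are an assumption.
TRIPLES = [
    ("EG:9215", "has_molecular_function", "GO:0008375"),
    ("EG:9215", "has_associated_phenotype", "MIM:608840"),
    ("GO:0008375", "isa", "GO:0008194"),
    ("GO:0008375", "isa", "GO:0016758"),
    ("GO:0008194", "isa", "GO:0016757"),
    ("GO:0016758", "isa", "GO:0016757"),
]


def self_and_descendants(term):
    """All terms that reach `term` through zero or more isa edges."""
    found, frontier = {term}, [term]
    while frontier:
        current = frontier.pop()
        for s, p, o in TRIPLES:
            if p == "isa" and o == current and s not in found:
                found.add(s)
                frontier.append(s)
    return found


def phenotypes_for_function(go_term):
    """Phenotypes of genes annotated with `go_term` or any of its descendants."""
    functions = self_and_descendants(go_term)
    genes = {
        s for s, p, o in TRIPLES
        if p == "has_molecular_function" and o in functions
    }
    return sorted(
        o for s, p, o in TRIPLES
        if p == "has_associated_phenotype" and s in genes
    )


# Start from the general term "glycosyltransferase" (GO:0016757).
print(phenotypes_for_function("GO:0016757"))
```

The gene is annotated with the specific term GO:0008375 only. The query on the general term succeeds because the isa closure is computed first.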

Examples

caBIG
http://cabig.nci.nih.gov/

Cancer Biomedical Informatics Grid

• US National Cancer Institute
• Common infrastructure used to share data and applications across institutions to support cancer research efforts in a grid environment
• Service-oriented architecture
  • Data and application services available on the grid
• Supported by ontological resources

caBIG services

• caArray
  • Microarray data repository
• caTissue
  • Biospecimen repository
• caFE (Cancer Function Express)
  • Annotations on microarray data
• caTRIP
  • Cancer Translational Research Informatics Platform
  • Integrates data services

Ontological resources

• NCI Thesaurus
  • Reference terminology for the cancer domain
  • ~ 60,000 concepts
  • OWL Lite
• Cancer Data Standards Repository (caDSR)
  • Metadata repository
  • Used to bridge across UML models through Common Data Elements
  • Links to concepts in ontologies

Examples

Semantic Web for Health Care and Life Sciences

http://www.w3.org/2001/sw/hcls/

Semantic Web layer cake

Linked data
linkeddata.org

Linked data

Linked biomedical data
[Tim Berners-Lee, TED 2009 conference]
http://www.w3.org/2009/Talks/0204-ted-tbl/#(1)

W3C Health Care and Life Sciences IG

Biomedical Semantic Web

• Integration
  • Data/Information
  • E.g., translational research
• Hypothesis generation
• Knowledge discovery

[Ruttenberg, BMC Bioinf. 2007]

HCLS mashup of biomedical sources

NeuronDB

BAMS

NC Annotations

Homologene

SWAN

Entrez Gene

Gene Ontology

Mammalian Phenotype

PDSPki

BrainPharm

AlzGene

Antibodies

PubChem

MeSH

Reactome

Allen Brain Atlas

Publications

http://esw.w3.org/topic/HCLS/HCLSIG_DemoHomePage_HCLSIG_Demo

Shared identifiers Example

GO

HCLS mashup

[Figure: data sources in the mashup and the kinds of entities each one contributes]
• NeuronDB: Protein (channels/receptors), Neurotransmitters, Neuroanatomy, Cell, Compartments, Currents
• BAMS: Protein, Neuroanatomy, Cells, Metabolites (channels), PubMedID
• NC Annotations: Genes/Proteins, Processes, Cells (maybe), PubMed ID
• Allen Brain Atlas: Genes, Brain images, Gross anatomy -> neuroanatomy
• Homologene: Genes, Species, Orthologies, Proofs
• SWAN: PubMedID, Hypothesis, Questions, Evidence, Genes
• Entrez Gene: Genes, Protein, GO, PubMedID, Interaction (g/p), Chromosome, C. location
• GO: Molecular function, Cell components, Biological process, Annotation gene, PubMedID
• Mammalian Phenotype: Genes, Phenotypes, Disease, PubMedID
• PDSPki: Proteins, Chemicals, Neurotransmitters
• BrainPharm: Drug, Drug effect, Pathological agent, Phenotype, Receptors, Channels, Cell types, PubMedID, Disease
• AlzGene: Gene, Polymorphism, Population, Alz Diagnosis
• Antibodies: Genes, Antibodies
• PubChem: Name, Structure, Properties, MeSH term
• MeSH: Drugs, Anatomy, Phenotypes, Compounds, Chemicals, PubMedID, PubChem
• Reactome: Genes/proteins, Interactions, Cellular location, Processes (GO)

HCLS mashups

• Based on RDF/OWL
• Based on shared identifiers
  • “Recombinant data” (E. Neumann)
• Ontologies used in some cases
• Support applications (SWAN, SenseLab, etc.)
• Journal of Biomedical Informatics special issue on Semantic bio-mashups [J. Biomedical Informatics 41(5) 2008]
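Why shared identifiers let data "recombine" can be shown with plain sets of triples, without any RDF library. The URIs and the third source below are hypothetical placeholders. Only the GO and MIM identifiers are taken from earlier slides.

```python
# Hypothetical URI used by two independent sources for the same gene.
GENE = "http://example.org/gene/9215"

source_1 = {(GENE, "has_molecular_function", "GO:0008375")}
source_2 = {(GENE, "has_associated_phenotype", "MIM:608840")}

# A third hypothetical source names the same gene with a different URI.
source_3 = {("http://example.org/other/LARGE", "mentioned_in", "PMID:0000000")}


def describe(subject, graph):
    """Collect every predicate/object pair stated about one subject."""
    return {p: o for s, p, o in graph if s == subject}


# Merging graphs is a set union; no schema alignment step is needed.
merged = source_1 | source_2 | source_3
print(describe(GENE, merged))
```

Statements from the first two sources end up on the same node because they use the same identifier. The third source stays disconnected until a mapping between the two URIs is supplied.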

Semantic bio-mashups

• Bio2RDF: Towards a mashup to build bioinformatics knowledge systems
• Identifying disease-causal genes using Semantic Web-based representation of integrated genomic and phenomic knowledge
• Schema driven assignment and implementation of life science identifiers (LSIDs)
• The SWAN biomedical discourse ontology
• An ontology-driven semantic mashup of gene and biological pathway information: Application to the domain of nicotine dependence
• Towards an ontology for sharing medical images and regions of interest in neuroimaging
• yOWL: An ontology-driven knowledge base for yeast biologists
• Dynamic sub-ontology evolution for traditional Chinese medicine web ontology
• Ontology-centric integration and navigation of the dengue literature
• Infrastructure for dynamic knowledge integration: Automated biomedical ontology extension using textual resources
• An ontological knowledge framework for adaptive medical workflow
• Semi-automatic web service composition for the life sciences using the BioMoby semantic web framework
• Combining Semantic Web technologies with Multi-Agent Systems for integrated access to biological resources

[J. Biomedical Informatics 41(5) 2008]

Challenging issues

Challenging issues

• Bridges across ontologies
• Permanent identifiers for biomedical entities
• Other issues

Challenging issues

Bridges across ontologies

Trans-namespace integration

[Figure: the same disease identified differently in three terminologies]
• Biomedical literature (MeSH): Addison Disease (D000224)
• Clinical repositories (SNOMED CT): Addison's disease (363732003)
• ICD 10: Primary adrenocortical insufficiency (E27.1)

(Integrated) concept repositories

• Unified Medical Language System: http://umlsks.nlm.nih.gov
• NCBO’s BioPortal: http://www.bioontology.org/tools/portal/bioportal.html
• caDSR: http://ncicb.nci.nih.gov/NCICB/infrastructure/cacore_overview/cadsr
• Open Biomedical Ontologies (OBO): http://obofoundry.org/

Integrating subdomains

[Figure: subdomains and their terminologies, integrated through the UMLS]
• Biomedical literature: MeSH
• Genome annotations: GO
• Model organisms: NCBI Taxonomy
• Genetic knowledge bases: OMIM
• Clinical repositories: SNOMED CT
• Anatomy: FMA
• Other subdomains

Trans-namespace integration

[Figure: the subdomain terminologies (GO, NCBI Taxonomy, OMIM, FMA, MeSH, SNOMED CT) integrated through the UMLS]
• Biomedical literature (MeSH): Addison Disease (D000224)
• Clinical repositories (SNOMED CT): Addison's disease (363732003)
• UMLS: C0001403
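The slide's example can be read as a small lookup: two source identifiers resolve to one UMLS concept, and that shared concept is the bridge between namespaces. The table below holds only the identifiers shown on the slide. The function names are illustrative, and a real application would load the mappings from the UMLS Metathesaurus files instead of hard-coding them.

```python
# (vocabulary, code) -> UMLS concept unique identifier, as shown on the slide.
TO_CUI = {
    ("MeSH", "D000224"): "C0001403",
    ("SNOMED CT", "363732003"): "C0001403",
}


def same_concept(code_a, code_b):
    """True when both source codes are known and map to the same concept."""
    cui_a, cui_b = TO_CUI.get(code_a), TO_CUI.get(code_b)
    return cui_a is not None and cui_a == cui_b


def translate(code, target_vocabulary):
    """Codes in the target vocabulary that share a concept with `code`."""
    cui = TO_CUI.get(code)
    if cui is None:
        return []
    return [
        c for (vocabulary, c), mapped in TO_CUI.items()
        if mapped == cui and vocabulary == target_vocabulary
    ]


print(translate(("MeSH", "D000224"), "SNOMED CT"))  # ['363732003']
```

Literature indexed with the MeSH descriptor and clinical records coded in SNOMED CT can then be retrieved together through the shared concept.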

Mappings

• Created manually (e.g., UMLS)
  • Purpose
  • Directionality
• Created automatically (e.g., BioPortal)
  • Lexically: ambiguity, normalization
  • Semantically: lack of / incomplete formal definitions
• Key to enabling semantic interoperability
• Enabling resource for the Semantic Web
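A rough sketch of lexical mapping with normalization, and of the ambiguity it can produce. This is not the algorithm used by the UMLS or BioPortal. The normalization rules are a simplified assumption, and the identifiers X1, Y1 and Y2 are hypothetical stand-ins for an ambiguous label.

```python
import re


def normalize(label):
    """Lowercase, drop possessive 's, strip punctuation, sort the words."""
    label = label.lower()
    label = re.sub(r"['’]s\b", "", label)
    label = re.sub(r"[^a-z0-9 ]+", " ", label)
    return " ".join(sorted(label.split()))


def lexical_mappings(terms_a, terms_b):
    """Map each identifier in terms_a to identifiers in terms_b whose
    labels normalize to the same string."""
    index = {}
    for ident, label in terms_b.items():
        index.setdefault(normalize(label), []).append(ident)
    return {
        ident: index.get(normalize(label), [])
        for ident, label in terms_a.items()
    }


mesh = {"D000224": "Addison Disease", "X1": "Cold"}
snomed = {
    "363732003": "Addison's disease",
    "Y1": "cold",   # hypothetical: the temperature sense
    "Y2": "Cold",   # hypothetical: the illness sense
}
mappings = lexical_mappings(mesh, snomed)
print(mappings["D000224"])  # ['363732003']
print(mappings["X1"])
```

Normalization recovers the Addison mapping despite the possessive, but the label "Cold" matches two candidates with different meanings. Resolving that kind of ambiguity is where formal definitions would help.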

Challenging issues

Permanent identifiers for biomedical entities

Identifying biomedical entities

• Multiple identifiers for the same entity in different ontologies
• Barrier to data integration in general
  • Data annotated to different ontologies cannot “recombine”
  • Need for mappings across ontologies
• Barrier to data integration in the Semantic Web
  • Multiple possible identifiers for the same entity
    • Depending on the underlying representational scheme (URI vs. LSID)
    • Depending on who creates the URI

Possible solutions

• PURL http://purl.org
  • One level of indirection between developers and users
  • Independence from local constraints at the developer’s end
• The institution creating a resource is also responsible for minting URIs
  • E.g., URI for genes in Entrez Gene
• Guidelines: “URI note”
  • W3C Health Care and Life Sciences Interest Group
• Shared names initiative [http://sharedname.org/]
  • Identify resources vs. entities
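The "one level of indirection" behind persistent URLs can be sketched as a lookup table that users go through and developers update. This toy class is not the PURL service or its API. The class name and all URLs are hypothetical placeholders.

```python
class PersistentResolver:
    """Maps stable identifiers to whatever location currently serves them."""

    def __init__(self):
        self._targets = {}

    def register(self, persistent_url, current_url):
        # Called by the developer whenever the resource moves.
        self._targets[persistent_url] = current_url

    def resolve(self, persistent_url):
        # Called on behalf of users, who only ever see the persistent URL.
        return self._targets[persistent_url]


resolver = PersistentResolver()
STABLE = "http://purl.example.org/onto/term_1"  # hypothetical identifier

resolver.register(STABLE, "http://lab-server.example.org/v1/term_1")
first = resolver.resolve(STABLE)

# The developer changes hosting; annotations that cite STABLE are unaffected.
resolver.register(STABLE, "http://new-host.example.org/ontology/term_1")
second = resolver.resolve(STABLE)

print(first)
print(second)
```

Data annotated with the stable identifier never need to be rewritten, which is the independence from local constraints that the slide refers to.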

Challenging issues

Other issues

Availability

• Many ontologies are freely available
• The UMLS is freely available for research purposes
  • Cost-free license required
  • Licensing issues can be tricky
• SNOMED CT is freely available in member countries of the IHTSDO
• Being freely available
  • Is a requirement for the Open Biomedical Ontologies (OBO)
  • Is a de facto prerequisite for Semantic Web applications

Discoverability

• Ontology repositories
  • UMLS: 152 source vocabularies (biased towards healthcare applications)
  • NCBO BioPortal: ~141 ontologies (biased towards biological applications)
  • Limited overlap between the two repositories
• Need for discovery services
  • Metadata for ontologies

Formalism

• Several major formalisms
  • Web Ontology Language (OWL) – NCI Thesaurus
  • OBO format – most OBO ontologies
  • UMLS Rich Release Format (RRF) – UMLS, RxNorm
• Conversion mechanisms
  • OBO to OWL
  • LexGrid (import/export to LexGrid internal format)
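To give a feel for what a format conversion involves, here is a deliberately small sketch that reads `[Term]` stanzas in OBO format and emits subclass statements of the kind an OWL rendering would contain. It is not the official OBO to OWL mapping, which also covers names, definitions, other relationships and URI construction. The `EX:` identifiers and term names are hypothetical.

```python
OBO_TEXT = """\
format-version: 1.2

[Term]
id: EX:0000002
name: child term
is_a: EX:0000001 ! parent term

[Term]
id: EX:0000001
name: parent term
"""


def parse_obo_terms(text):
    """Parse [Term] stanzas into dicts; header lines and other stanza types are skipped."""
    terms, current = [], None
    for line in text.splitlines():
        line = line.strip()
        if line == "[Term]":
            current = {"is_a": []}
            terms.append(current)
        elif line.startswith("["):
            current = None  # e.g. [Typedef] stanzas are ignored in this sketch
        elif current is not None and ": " in line:
            tag, value = line.split(": ", 1)
            value = value.split(" ! ", 1)[0]  # drop the trailing comment
            if tag == "is_a":
                current["is_a"].append(value)
            else:
                current[tag] = value
    return terms


def subclass_axioms(terms):
    """Render each is_a link as a (child, rdfs:subClassOf, parent) triple."""
    return [
        (term["id"], "rdfs:subClassOf", parent)
        for term in terms
        for parent in term["is_a"]
    ]


terms = parse_obo_terms(OBO_TEXT)
print(subclass_axioms(terms))
```

Only the `is_a` tag is given a special meaning here. Every other tag is stored as a plain string, which is where a full converter would do most of its work.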

Ontology integration

• Post hoc integration, from the bottom up
  • UMLS approach
  • Integrates ontologies “as is”, including legacy ontologies
  • Facilitates the integration of the corresponding datasets
• Coordinated development of ontologies
  • OBO Foundry approach
  • Ensures consistency ab initio
  • Excludes legacy ontologies

Quality

• Quality assurance in ontologies is still imperfectly defined
  • Difficult to define outside a use case or application
• Several approaches to evaluating quality
  • Collaboratively, by users (Web 2.0 approach)
    • Marginal notes enabled by BioPortal
  • Centrally, by experts
    • OBO Foundry approach
• Important factors besides quality
  • Governance
  • Installed base / Community of practice

Conclusions

• Ontologies are enabling resources for data integration
• Standardization works
  • Grass roots effort (GO)
  • Regulatory context (ICD 9-CM)
• Bridging across resources is crucial
  • Ontology integration resources / strategies (UMLS, BioPortal / OBO Foundry)
• Massive amounts of imperfect data integrated with rough methods might still be useful

Medical Ontology Research

Olivier Bodenreider

Lister Hill National Center for Biomedical Communications
Bethesda, Maryland - USA

Contact: olivier@nlm.nih.gov
Web: mor.nlm.nih.gov
