Apr 15, 2017

Page 1: 2016 bmdid-mappings

ISWC2016:::BMDID::Dumontier

ONTOLOGY MAPPING FOR LIFE SCIENCE LINKED DATA

Amrapali Zaveri and Michel Dumontier

Stanford Center for Biomedical Informatics Research
Stanford University

Page 2: 2016 bmdid-mappings

Large and growing network of Linked Data

Linking Open Data cloud diagram 2014, by Max Schmachtenberg, Christian Bizer, Anja Jentzsch and Richard Cyganiak. http://lod-cloud.net/

Page 3: 2016 bmdid-mappings

Linked Data for the Life Sciences

Bio2RDF is an open source project to unify the representation and interlinking of biological data using RDF.

chemicals/drugs/formulations, genomes/genes/proteins, domains
interactions, complexes & pathways
animal models and phenotypes
disease, genetic markers, treatments
terminologies & publications

• 11B+ interlinked statements from 35 biomedical datasets and 400+ ontologies
• dataset description, provenance & statistics
• A growing interoperable ecosystem with the EBI, NCBI, DBCLS, NCBO, OpenPHACTS, and commercial tool providers

Page 4: 2016 bmdid-mappings

Biomedical Linked Data

Page 5: 2016 bmdid-mappings

The lack of coordination around a global schema makes Linked Data chaotic and unwieldy

Page 6: 2016 bmdid-mappings

Federated queries require intimate knowledge of each dataset schema

Get all protein catabolic processes (and more specific GO terms) in biomodels

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?go ?label (COUNT(DISTINCT ?x) AS ?count)
WHERE {
  # GO terms below "protein catabolic process" (BioPortal endpoint)
  SERVICE <http://bioportal.bio2rdf.org/sparql> {
    ?go rdfs:label ?label .
    ?go rdfs:subClassOf+ ?tgo .
    ?tgo rdfs:label ?tlabel .
    FILTER regex(?tlabel, "^protein catabolic process")
  }
  # BioModels biochemical reactions annotated with those GO terms
  SERVICE <http://biomodels.bio2rdf.org/sparql> {
    ?x <http://bio2rdf.org/biopax_vocabulary:identical-to> ?go .
    ?x a <http://www.biopax.org/release/biopax-level3.owl#BiochemicalReaction> .
  }
}
GROUP BY ?go ?label

Page 7: 2016 bmdid-mappings

Previous work involved manual mappings between Bio2RDF types and relations and the Semanticscience Integrated Ontology (SIO)

[Figure: a diagram with dataset, ontology and knowledge base layers, connecting uniprot:P05067, pharmgkb:PA30917 and omim:189931 via "is a" edges to uniprot:Protein, refseq:Protein, omim:Gene, pharmgkb:Gene and sio:gene.]

Querying Bio2RDF Linked Open Data with a Global Schema. Alison Callahan, José Cruz-Toledo and Michel Dumontier. Bio-ontologies 2012.

Page 8: 2016 bmdid-mappings

Semanticscience Integrated Ontology (SIO)

An effective upper level ontology.
1500+ classes
207 object properties (inc. inverses)
1 datatype property

Page 9: 2016 bmdid-mappings

Bio2RDF and SIO powered SPARQL federated query: Find chemicals (from CTD) and proteins (from SGD) that participate in the same process (from GOA)

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX sio: <http://semanticscience.org/resource/>

SELECT ?chem ?prot ?proc
FROM <http://bio2rdf.org/ctd>
WHERE {
  # Chemicals and the GO processes they participate in (CTD)
  SERVICE <http://ctd.bio2rdf.org/sparql> {
    ?chemical a sio:chemical-entity .
    ?chemical rdfs:label ?chem .
    ?chemical sio:is-participant-in ?process .
    ?process rdfs:label ?proc .
    FILTER regex(str(?process), "http://bio2rdf.org/go:")
  }
  # Proteins that participate in the same processes (SGD)
  SERVICE <http://sgd.bio2rdf.org/sparql> {
    ?protein a sio:protein .
    ?protein sio:is-participant-in ?process .
    ?protein rdfs:label ?prot .
  }
}
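A query of this kind is typically submitted to an endpoint over the SPARQL 1.1 Protocol, which accepts the query text in a `query` URL parameter. The sketch below only builds such a URL and sends nothing; the helper name `build_query_url` and the short placeholder query are assumptions for illustration, and the endpoint shown on the slide may no longer be online.

```python
from urllib.parse import urlencode

def build_query_url(endpoint: str, query: str) -> str:
    """Build a SPARQL 1.1 Protocol GET URL carrying the query text."""
    # urlencode percent-encodes the query so it is safe inside a URL
    return f"{endpoint}?{urlencode({'query': query})}"

# Placeholder query (an assumption); substitute the federated query text
query = "SELECT ?s WHERE { ?s ?p ?o } LIMIT 1"
url = build_query_url("http://ctd.bio2rdf.org/sparql", query)
print(url)
```

The resulting URL can be fetched with any HTTP client; most endpoints choose the result format from the `Accept` header.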

Page 10: 2016 bmdid-mappings

Many vocabularies, ontologies and community-based standards are now available

Page 11: 2016 bmdid-mappings

PubChem uses multiple terminologies

Page 12: 2016 bmdid-mappings

Existing limitations with Bio2RDF mappings

• New datasets have been added
• Existing datasets have changed
• The target ontology (SIO) has changed
• The target ontology (SIO) is incomplete and there may be better ontologies to use
• These ontologies are evolving; today's mappings may be invalid or imprecise tomorrow
• Manual process -> not easy and not reproducible -> must automate

Page 13: 2016 bmdid-mappings

Goal

Develop a semi-automated procedure to generate high quality mappings between Bio2RDF and SIO.

Page 14: 2016 bmdid-mappings

Approach

[Figure: mapping approaches (distance metrics, graph-based, instance-based, BioPortal, crowdsourcing) placed on a spectrum from Automated to Manual, with "previous work*" and "Our work" marked.]

Page 15: 2016 bmdid-mappings

Idea: Create mappings between SIO and Bio2RDF using ontologies in BioPortal

[Figure: Bio2RDF, NCBO Annotator/Recommender, SIO]

Page 16: 2016 bmdid-mappings

Bio2RDF-SIO mappings via transitive closure through BioPortal ontologies

[Figure: Bio2RDF and SIO connected through a BioPortal ontology; labels: Mapped Class, Super Class, match]

Page 17: 2016 bmdid-mappings

Results

Bio2RDF: 319 (of 6093) classes after pruning (remove blank nodes, general resources, OWL vocabulary & non-Bio2RDF types/relations)

1. NCBO Annotator: 174 Bio2RDF classes matched directly and exactly to SIO
2. NCBO Recommender: 94 Bio2RDF classes matched to BioPortal ontologies
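The pruning step can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' code: the `is_candidate` and `prune` helpers, the namespace check, and the rule that a "general resource" is a type whose name ends in `:Resource` are all assumptions made for the example.

```python
# Hypothetical sketch of the pruning step; helper names and rules are assumptions.
BIO2RDF_NS = "http://bio2rdf.org/"

def is_candidate(type_iri: str) -> bool:
    """Return True if a type should be kept for mapping."""
    if type_iri.startswith("_:"):            # blank node
        return False
    if not type_iri.startswith(BIO2RDF_NS):  # OWL/RDF(S) vocabulary, other non-Bio2RDF types
        return False
    if type_iri.endswith(":Resource"):       # assumed marker for general resources
        return False
    return True

def prune(type_iris):
    """Deduplicate the input and keep only candidate types."""
    return sorted({t for t in type_iris if is_candidate(t)})

# Sample input, chosen for illustration only
sample = [
    "_:b0",
    "http://www.w3.org/2002/07/owl#Class",
    "http://bio2rdf.org/drugbank_vocabulary:Resource",
    "http://bio2rdf.org/drugbank_vocabulary:Drug",
    "http://bio2rdf.org/drugbank_vocabulary:Drug",
]
kept = prune(sample)
print(kept)
```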

Page 18: 2016 bmdid-mappings

Results

SIO: 1500 classes

3. 475 BioPortal ontologies
393 BioPortal ontologies matched to SIO

Page 19: 2016 bmdid-mappings

Results

Bio2RDF: 319 classes
SIO: 1500 classes

393 BioPortal ontologies matched to SIO
94 Bio2RDF classes matched to BioPortal ontologies

4. Traverse hierarchy

Page 20: 2016 bmdid-mappings

Results

Bio2RDF: 319 classes
SIO: 1500 classes

393 BioPortal ontologies matched to SIO
94 Bio2RDF classes matched to BioPortal ontologies

4. Traverse hierarchy (mapped class, super class): 71 matches

Page 21: 2016 bmdid-mappings

Results: Example

Bio2RDF class: clinicaltrials:Clinical-Study
Mapped class: edda:clinical_trial
Super class: edda:Study_Design
SIO class: sio:001041 (study design)
Relation shown: skos:broader
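The hierarchy traversal in this example can be traced with a small Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the dictionaries hold only the classes from the example slide, the function name `map_to_sio` is hypothetical, and each class is assumed to have a single parent, whereas real BioPortal ontologies may have several.

```python
# Hypothetical sketch: walk up a BioPortal class hierarchy until a class
# with a known SIO match is found. All data below is toy data.
from typing import Optional

# Bio2RDF class -> BioPortal class (the "mapped class")
BIOPORTAL_MATCH = {"clinicaltrials:Clinical-Study": "edda:clinical_trial"}

# BioPortal class -> its parent (the "super class"); single parent assumed
SUPERCLASS = {"edda:clinical_trial": "edda:Study_Design"}

# BioPortal classes already matched to SIO
SIO_MATCH = {"edda:Study_Design": "sio:001041"}

def map_to_sio(bio2rdf_class: str) -> Optional[str]:
    """Return the first SIO class reachable via the mapped class or its ancestors."""
    current = BIOPORTAL_MATCH.get(bio2rdf_class)
    seen = set()  # guards against cycles in the hierarchy
    while current is not None and current not in seen:
        if current in SIO_MATCH:
            return SIO_MATCH[current]
        seen.add(current)
        current = SUPERCLASS.get(current)
    return None

print(map_to_sio("clinicaltrials:Clinical-Study"))  # sio:001041
```

Because the match is found on an ancestor rather than on the mapped class itself, the resulting SIO class is more general than the Bio2RDF class.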

Page 22: 2016 bmdid-mappings

Mappings often occurred to more than one class

sider:Drug-Indication-Association
• sio:010038 (drug)
• sio:010299 (disease)
• sio:000897 (association)

Page 23: 2016 bmdid-mappings

Manual validation of mappings

Bio2RDF Class                   SIO Class                    Annotation
drugbank:Biotech                -                            no match
clinicaltrials:Organization     sio:00012 (organization)     exact
drugbank:toxicity               sio:001008 (toxicity)        exact
sgd:GlycineCount                sio:000794 (count)           partial – is-a
wormbase:Genetic-Interaction    sio:010035 (gene)            partial – part-of
clinicaltrials:Serious-Event    sio:000614 (attribute)       incorrect
drugbank:Source                 sio:000510 (model)           incorrect

All results available at https://goo.gl/eiijmQ

Page 24: 2016 bmdid-mappings

Conclusion

• Developed a semi-automated methodology to map Bio2RDF classes to SIO via BioPortal ontologies

• 245 of 319 Bio2RDF classes matched to SIO
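The 245 figure is consistent with the counts on the results slides, assuming it is the sum of the direct matches and the hierarchy matches (the slides do not state this explicitly). A quick check, with variable names chosen for illustration:

```python
# Figures reported on the results slides
total_classes = 319      # Bio2RDF classes after pruning
direct_matches = 174     # matched directly and exactly to SIO (NCBO Annotator)
hierarchy_matches = 71   # matched by traversing BioPortal hierarchies

# Assumption: the two groups do not overlap
matched = direct_matches + hierarchy_matches
coverage = matched / total_classes
print(matched, f"{coverage:.1%}")  # 245 76.8%
```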

Page 25: 2016 bmdid-mappings

Limitations

• Unmatched classes: neither SIO nor other ontologies have complete coverage

• Overly general concepts: Semantically incompatible classes

• Incorrect mappings: Matches to part of the class

• Mappings are insufficient to precisely retrieve data across different datasets

Page 26: 2016 bmdid-mappings

Future Work

• Extend SIO to include classes that are ultimately not found

• Explore mid-level portion of SIO to eliminate root level mappings

• Scalable validation via crowdsourcing
• Pursue query rewriting

Page 27: 2016 bmdid-mappings

Website: http://dumontierlab.com