2016 bmdid-mappings

Post on 15-Apr-2017

ISWC2016:::BMDID::Dumontier

ONTOLOGY MAPPING FOR LIFE SCIENCE LINKED DATA

Amrapali Zaveri and Michel Dumontier

Stanford Center for Biomedical Informatics Research
Stanford University


Large and growing network of Linked Data

Linking Open Data cloud diagram 2014, by Max Schmachtenberg, Christian Bizer, Anja Jentzsch and Richard Cyganiak. http://lod-cloud.net/


Linked Data for the Life Sciences


Bio2RDF is an open source project to unify the representation and interlinking of biological data using RDF.

Chemicals/drugs/formulations
Genomes/genes/proteins, domains
Interactions, complexes & pathways
Animal models and phenotypes
Disease, genetic markers, treatments
Terminologies & publications

• 11B+ interlinked statements from 35 biomedical datasets and 400+ ontologies
• Dataset description, provenance & statistics
• A growing interoperable ecosystem with the EBI, NCBI, DBCLS, NCBO, OpenPHACTS, and commercial tool providers


Biomedical Linked Data


The lack of coordination around a global schema makes Linked Data chaotic and unwieldy


Federated queries require intimate knowledge of each dataset schema

Get all protein catabolic processes (and more specific GO terms) in biomodels

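PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?go ?label (COUNT(DISTINCT ?x) AS ?count)
WHERE {
  # GO terms that are subclasses of "protein catabolic process" (from BioPortal)
  SERVICE <http://bioportal.bio2rdf.org/sparql> {
    ?go rdfs:label ?label .
    ?go rdfs:subClassOf+ ?tgo .
    ?tgo rdfs:label ?tlabel .
    FILTER regex(?tlabel, "^protein catabolic process")
  }
  # Biochemical reactions in BioModels that are annotated with those GO terms
  SERVICE <http://biomodels.bio2rdf.org/sparql> {
    ?x <http://bio2rdf.org/biopax_vocabulary:identical-to> ?go .
    ?x a <http://www.biopax.org/release/biopax-level3.owl#BiochemicalReaction> .
  }
}
GROUP BY ?go ?label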


Previous work involved manual mappings between Bio2RDF types and relations and the Semanticscience Integrated Ontology (SIO)

[Figure: knowledge-base instances (uniprot:P05067, pharmgkb:PA30917, omim:189931) linked by "is a" edges to dataset types (uniprot:Protein, refseq:Protein, omim:Gene, pharmgkb:Gene) and to the ontology class sio:gene. Legend: dataset, ontology, Knowledge Base.]

Querying Bio2RDF Linked Open Data with a Global Schema. Alison Callahan, José Cruz-Toledo and Michel Dumontier. Bio-ontologies 2012.


Semanticscience Integrated Ontology (SIO)

An effective upper level ontology.
1500+ classes
207 object properties (inc. inverses)
1 datatype property


Bio2RDF and SIO powered SPARQL federated query: Find chemicals (from CTD) and proteins (from SGD) that participate in the same process (from GOA)

# The sio: prefix is assumed to be predefined by the query endpoint.
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?chem ?prot ?proc
FROM <http://bio2rdf.org/ctd>
WHERE {
  # Chemicals and the GO processes they participate in (from CTD)
  SERVICE <http://ctd.bio2rdf.org/sparql> {
    ?chemical a sio:chemical-entity .
    ?chemical rdfs:label ?chem .
    ?chemical sio:is-participant-in ?process .
    ?process rdfs:label ?proc .
    FILTER regex(str(?process), "http://bio2rdf.org/go:")
  }
  # Proteins that participate in the same process (from SGD)
  SERVICE <http://sgd.bio2rdf.org/sparql> {
    ?protein a sio:protein .
    ?protein sio:is-participant-in ?process .
    ?protein rdfs:label ?prot .
  }
}


Many vocabularies, ontologies and community-based standards are now available


PubChem uses multiple terminologies


Existing limitations with Bio2RDF mappings

• New datasets have been added
• Existing datasets have changed
• The target ontology (SIO) has changed
• The target ontology (SIO) is incomplete and there may be better ontologies to use
• These ontologies are evolving; today’s mappings may be invalid or imprecise tomorrow
• Manual process -> not easy and not reproducible -> must automate


Goal

Develop a semi-automated procedure to generate high quality mappings between Bio2RDF and SIO.


Approach

[Figure: mapping approaches arranged along an axis from Automated to Manual: distance metrics, graph-based, instance-based, BioPortal, crowdsourcing. Markers distinguish "previous work" from "* Our work".]


Idea: Create mappings between SIO and Bio2RDF using ontologies in BioPortal


Bio2RDF

NCBO Annotator/Recommender

SIO


Bio2RDF-SIO mappings via transitive closure through BioPortal ontologies


Bio2RDF

SIO

Super Class

Mapped Class

match
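
The transitive-closure idea on this slide can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the dictionaries `bioportal_match`, `superclass_of` and `sio_match` and the function `map_to_sio` are hypothetical stand-ins for the NCBO Recommender output, the BioPortal class hierarchy and the BioPortal-to-SIO matches. The sample values are taken from the clinical-trials example in these slides.

```python
from typing import Optional, Tuple

# Hypothetical sample data, modeled on the clinical-trials example in the slides.
bioportal_match = {"clinicaltrials:Clinical-Study": "edda:clinical_trial"}  # Bio2RDF class -> mapped BioPortal class
superclass_of = {"edda:clinical_trial": "edda:Study_Design"}                # BioPortal class -> its super class
sio_match = {"edda:Study_Design": "sio:001041"}                             # BioPortal class -> matched SIO class


def map_to_sio(bio2rdf_class: str) -> Optional[Tuple[str, str]]:
    """Walk up the BioPortal hierarchy from the mapped class until a SIO match is found.

    Returns (sio_class, relation), where relation is "exact" if the mapped class
    itself matches SIO and "broader" if the match was found on an ancestor.
    Returns None when no class on the path matches.
    """
    current = bioportal_match.get(bio2rdf_class)
    seen = set()
    relation = "exact"
    while current is not None and current not in seen:
        seen.add(current)  # guard against cycles in the hierarchy
        if current in sio_match:
            return sio_match[current], relation
        current = superclass_of.get(current)
        relation = "broader"
    return None


print(map_to_sio("clinicaltrials:Clinical-Study"))
print(map_to_sio("drugbank:Biotech"))
```

Recording whether the match came from the mapped class or from an ancestor keeps the distinction between exact and broader mappings available for later validation.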


Results


Bio2RDF: 319 (of 6093) classes kept after pruning (remove blank nodes, general resources, OWL vocabulary & non-Bio2RDF types/relations).

1 NCBO Annotator: 174 Bio2RDF classes matched directly and exactly to SIO

2 NCBO Recommender: 94 Bio2RDF classes matched to BioPortal ontologies
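
The pruning rule on this slide (drop blank nodes, general resources, OWL vocabulary and non-Bio2RDF types) might look roughly like the filter below. It is only a sketch: the namespace list, the `:Resource` suffix used to detect "general resources", the function name `keep_class` and the sample URIs are all assumptions made for illustration, not details given in the slides.

```python
# Namespaces treated as generic vocabulary rather than dataset types (assumed list).
GENERIC_PREFIXES = (
    "http://www.w3.org/2002/07/owl#",
    "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "http://www.w3.org/2000/01/rdf-schema#",
)
BIO2RDF_PREFIX = "http://bio2rdf.org/"


def keep_class(uri: str) -> bool:
    """Return True if the class URI should survive pruning."""
    if uri.startswith("_:"):               # blank node identifier
        return False
    if uri.startswith(GENERIC_PREFIXES):   # OWL / RDF / RDFS vocabulary
        return False
    if not uri.startswith(BIO2RDF_PREFIX): # non-Bio2RDF type
        return False
    if uri.endswith(":Resource"):          # assumed marker of a "general resource" type
        return False
    return True


# Illustrative candidates only; not taken from the actual Bio2RDF type list.
candidates = [
    "_:b0",
    "http://www.w3.org/2002/07/owl#Class",
    "http://xmlns.com/foaf/0.1/Person",
    "http://bio2rdf.org/drugbank_vocabulary:Resource",
    "http://bio2rdf.org/drugbank_vocabulary:Drug",
]
pruned = [uri for uri in candidates if keep_class(uri)]
print(pruned)
```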


Results


SIO: 1500 classes

3 BioPortal: 393 of 475 BioPortal ontologies matched to SIO


Results


Bio2RDF: 319 classes

SIO: 1500 classes

393 BioPortal ontologies matched to SIO

94 Bio2RDF classes matched to BioPortal ontologies

4 Traverse hierarchy (mapped class to super class): 71 matches


Results — Example


Bio2RDF class: clinicaltrials:Clinical-Study

Mapped class: edda:clinical_trial

Super class: edda:Study_Design

SIO class: sio:001041 (study design)

Relation: skos:broader


Mappings often occurred to more than one class


sider:Drug-Indication-Association

sio:010038 (drug)

sio:010299 (disease)

sio:000897 (association)


Manual validation of mappings


Bio2RDF Class                   SIO Class                   Annotation
drugbank:Biotech                (none)                      no match
clinicaltrials:Organization     sio:00012 (organization)    exact
drugbank:toxicity               sio:001008 (toxicity)       exact
sgd:GlycineCount                sio:000794 (count)          partial – is-a
wormbase:Genetic-Interaction    sio:010035 (gene)           partial – part-of
clinicaltrials:Serious-Event    sio:000614 (attribute)      incorrect
drugbank:Source                 sio:000510 (model)          incorrect

All results available at https://goo.gl/eiijmQ
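
As a small worked example, the annotation labels in the sample above can be tallied as shown below. The rows are copied from this slide only, so the counts describe the seven sample rows and not the full result set; the names `rows` and `category` are illustrative.

```python
from collections import Counter

# (Bio2RDF class, SIO class or None, annotation) rows copied from the sample on this slide.
rows = [
    ("drugbank:Biotech", None, "no match"),
    ("clinicaltrials:Organization", "sio:00012", "exact"),
    ("drugbank:toxicity", "sio:001008", "exact"),
    ("sgd:GlycineCount", "sio:000794", "partial - is-a"),
    ("wormbase:Genetic-Interaction", "sio:010035", "partial - part-of"),
    ("clinicaltrials:Serious-Event", "sio:000614", "incorrect"),
    ("drugbank:Source", "sio:000510", "incorrect"),
]


def category(annotation: str) -> str:
    """Collapse 'partial - is-a' and 'partial - part-of' into a single 'partial' bucket."""
    return "partial" if annotation.startswith("partial") else annotation


counts = Counter(category(annotation) for _, _, annotation in rows)
for label, n in counts.most_common():
    print(f"{label}: {n}")
```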


Conclusion

• Developed a semi-automated methodology to map Bio2RDF classes to SIO via BioPortal ontologies

• 245 of 319 Bio2RDF classes matched to SIO (174 directly via the NCBO Annotator, 71 by traversing BioPortal ontology hierarchies)


Limitations

• Unmatched classes: neither SIO nor other ontologies have complete coverage

• Overly general concepts: Semantically incompatible classes

• Incorrect mappings: Matches to part of the class

• Mappings are insufficient to precisely retrieve data across different datasets


Future Work

• Extend SIO to include classes that are ultimately not found

• Explore mid-level portion of SIO to eliminate root level mappings

• Scalable validation via crowdsourcing

• Pursue query rewriting
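
Query rewriting, the last item above, could use validated mappings to expand a SIO class in a user query into the dataset-specific Bio2RDF classes. The sketch below is one possible reading of that idea, not a method described in the slides: the `MAPPINGS` dictionary and the helper `rewrite_type_pattern` are hypothetical, and the class names are borrowed from the earlier figure purely for illustration.

```python
# Hypothetical mapping from a SIO class to the Bio2RDF classes mapped to it.
MAPPINGS = {
    "sio:gene": ["omim:Gene", "pharmgkb:Gene"],
}


def rewrite_type_pattern(var: str, sio_class: str) -> str:
    """Return a SPARQL fragment matching instances of any class mapped to sio_class.

    Falls back to the SIO class itself when no mapping is known.
    """
    targets = MAPPINGS.get(sio_class, [sio_class])
    values = " ".join(targets)
    return f"VALUES ?type {{ {values} }} ?{var} a ?type ."


print(rewrite_type_pattern("x", "sio:gene"))
print(rewrite_type_pattern("x", "sio:protein"))
```

Using a `VALUES` block keeps the rewritten pattern compact when one SIO class maps to many dataset classes.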


michel.dumontier@stanford.edu

Website: http://dumontierlab.com
