Data integration with identifiers and ontologies. Why are names and graphs not enough? Egon Willighagen, http://chem-bla-ics.blogspot.com/, @egonwillighagen, ORCID: 0000-0001-7542-0286. OpenTox Euro 2016, Rheinfelden/DE, 2016-10-28

Data integration with identifiers and ontologies

Jan 08, 2017

Transcript
Page 1: Data integration with identifiers and ontologies

Data integration with identifiers and ontologies

Why are names and graphs not enough?

Egon Willighagen

http://chem-bla-ics.blogspot.com/

@egonwillighagen

ORCID: 0000-0001-7542-0286

OpenTox Euro 2016, Rheinfelden/DE

2016-10-28

Page 2: Data integration with identifiers and ontologies

Acknowledgements

● WikiPathways and PathVisio projects
– Prof. Alex Pico's team, UCSF
– Current and past members of BiGCaT (Prof. Chris Evelo): Marloes Poort
– Pathway Providers: Pieter Giesbertz (TUM), Kozo Nishida (RIKEN)

● Maastricht University
– Toxicology: Rianne Fijten
– MaCSBio team
– Maastricht Science Programme (VOC project)

● Open PHACTS
– Manchester University: Prof. Carole Goble, Christian Brenninkmeijer, Stian Soiland-Reyes
– Heriot-Watt University: Alasdair Gray
– Royal Society of Chemistry: Colin Batchelor

● Others
– Bioclipse: Ola Spjuth (Uppsala University), Bioclipse-OpenTox: Nina Jeliazkova
– MetaboLights collaboration: Reza Salek, Chandu Venkata, Garima Thakur
– ChEBI collaboration: Christoph Steinbeck, Gareth Owen
– PubChem collaboration: Evan Bolton, Gang Fu
– HMDB, Wikidata teams

Page 3: Data integration with identifiers and ontologies

Asthma: Detecting and Understanding

Smolinska et al. PLOS ONE. 2014 9:e105447. doi:10.1371/journal.pone.0105447

Page 4: Data integration with identifiers and ontologies

Systems Biology: pathways

Andón FT, Fadeel B; "Programmed Cell Death: Molecular Mechanisms and Implications for Safety Assessment of Nanomaterials."; Acc Chem Res, 2012

Page 5: Data integration with identifiers and ontologies

Dopamine metabolism

Marloes Poort

Page 6: Data integration with identifiers and ontologies

The effect of troglitazone on heme biosynthesis

Page 7: Data integration with identifiers and ontologies
Page 8: Data integration with identifiers and ontologies
Page 9: Data integration with identifiers and ontologies

PathVisio: pathway enrichment (etc)

Van Iersel, M.P., et al. "Presenting and exploring biological pathways with PathVisio." BMC bioinformatics 9.1 (2008): 399. http://pathvisio.org/ → Martina Kutmon

Page 10: Data integration with identifiers and ontologies

We see a lot? But what is it?

● Current techniques can see up to 1000 metabolites in one analysis
– Only part of all 40k metabolites

● Only 10% we can identify
– The other 90% is unknown

Page 11: Data integration with identifiers and ontologies

Databases & identifiers

● HMDB: Human Metabolome Database
● ChEBI: Chemical Entities of Biological Interest
● ChemSpider, PubChem
● CAS: Chemical Abstracts Service
● InChI: International Chemical Identifier

Page 12: Data integration with identifiers and ontologies

Acid/Base conjugates

CHEBI:15361 (Pyruvate) -> Ce:CHEBI:32816 (conjugate) -> Ck:C00022 -> [WP2456 HIF1A and PPARG regulation of glycolysis, WP2453 TCA Cycle and PDHc]
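
The chain on this slide can be read as a lookup that follows the conjugate link when the original identifier itself does not lead to a pathway. The sketch below is only an illustration under stated assumptions: the three dictionaries are toy tables that hold nothing but the identifiers shown above, and the function name `pathways_for` is hypothetical. A real analysis would query an identifier mapping service rather than hard-coded tables.

```python
# Toy tables holding only the identifiers shown on this slide (assumption:
# a real run would query a mapping database instead of these dictionaries).
CONJUGATES = {"CHEBI:15361": "CHEBI:32816"}      # pyruvate -> its conjugate
XREFS = {"CHEBI:32816": ["Ck:C00022"]}           # ChEBI -> KEGG Compound
PATHWAYS = {"Ck:C00022": ["WP2456", "WP2453"]}   # identifier -> pathways


def pathways_for(chebi_id):
    """Return pathway IDs reachable directly or via the acid/base conjugate."""
    candidates = [chebi_id]
    if chebi_id in CONJUGATES:
        candidates.append(CONJUGATES[chebi_id])

    found = []
    for cid in candidates:
        # Check the identifier itself and every cross-reference it maps to.
        for xref in [cid] + XREFS.get(cid, []):
            for wp in PATHWAYS.get(xref, []):
                if wp not in found:
                    found.append(wp)
    return found


print(pathways_for("CHEBI:15361"))
```

Without the conjugate step, the pyruvate identifier in this toy setup would match no pathway at all, which is the point the slide makes.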

Page 13: Data integration with identifiers and ontologies

Switching identities: Glucose

Page 14: Data integration with identifiers and ontologies

Switching identities: Warfarin

Porter, W. (2010). Warfarin: history, tautomerism and activity. Journal of Computer-Aided Molecular Design, 24 (6-7), 553-573. DOI: 10.1007/s10822-010-9335-7

Page 15: Data integration with identifiers and ontologies

Bridging: identifiers

Page 16: Data integration with identifiers and ontologies

So, what IDs are used in WikiPathways?

Curated Collection subset

Page 17: Data integration with identifiers and ontologies

BridgeDb

Van Iersel, M.P., et al. "The BridgeDb framework: standardized access to gene, protein and metabolite identifier mapping services." BMC Bioinformatics 11.1 (2010): 5.

New tools

● Open PHACTS' Identifier Mapping Service
● R package
● Bioclipse

Page 18: Data integration with identifiers and ontologies

Metabolite ID Mapping database

● HMDB, ChEBI, Wikidata

Page 19: Data integration with identifiers and ontologies

BridgeDb: scientific lenses

● Gene
– gene-protein
– gene-probe

● Metabolite
– Tautomers
– Compound class
– Charge (acid/ate)

Brenninkmeijer, CYA, et al. "Scientific Lenses over Linked Data: An approach to support task specific views of the data. A vision." Proceedings of 2nd International Workshop on Linked Science. 2012.

Page 20: Data integration with identifiers and ontologies
Page 21: Data integration with identifiers and ontologies

#1: The breath data set

CAS numbers: 1843

CAS numbers (unique): 1733

CAS numbers with mappings: 718

CAS numbers matches: 54

Pathways found: 76

Matches via CAS: 9

Matches via mapping: 29

Matches via ChEBI super class: 35

Matches via ChEBI charged species: 3

Matches via ChEBI tautomers: 0

CAS: 544-63-8 (myristic acid) → Ce:28875 → Ce:15904 (long-chain fatty acid) → [WP368 Mitochondrial LC-Fatty Acid Beta-Oxidation, WP357 Fatty Acid Biosynthesis]
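
The match categories above suggest trying progressively broader lenses until one yields a pathway. The following is a minimal sketch of that idea, not the actual analysis code: the tables contain only the myristic acid example, the function name `match` is hypothetical, and the order in which the lenses are tried is an assumption.

```python
# Toy data limited to the myristic acid example above (assumption: a real
# run would use an identifier mapping database and the ChEBI ontology).
CAS_TO_CHEBI = {"544-63-8": "Ce:28875"}          # CAS -> ChEBI mapping
SUPER_CLASS = {"Ce:28875": "Ce:15904"}           # -> long-chain fatty acid
PATHWAY_IDS = {"Ce:15904": ["WP368", "WP357"]}   # identifiers used in pathways


def match(cas):
    """Return (lens, pathways) for the first lens that yields a match."""
    # Lens 1: the CAS number itself is used in a pathway.
    if cas in PATHWAY_IDS:
        return "CAS", PATHWAY_IDS[cas]
    # Lens 2: a mapped identifier is used in a pathway.
    chebi = CAS_TO_CHEBI.get(cas)
    if chebi in PATHWAY_IDS:
        return "mapping", PATHWAY_IDS[chebi]
    # Lens 3: the ChEBI super class is used in a pathway.
    parent = SUPER_CLASS.get(chebi)
    if parent in PATHWAY_IDS:
        return "ChEBI super class", PATHWAY_IDS[parent]
    return None, []


print(match("544-63-8"))
```

Recording which lens produced each match is what allows a tally such as "Matches via ChEBI super class" to be reported separately from direct matches.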

Page 22: Data integration with identifiers and ontologies

What if we add more CAS ID mappings? (e.g. from Wikidata)

INFO: Number of ids in Ch (HMDB): 41514 (changed +0.0%)
INFO: Number of ids in Ce (ChEBI): 64222 (changed +0.0%)
INFO: Number of ids in Kd (KEGG Drug): 2406 (changed +23960.0%)
INFO: Number of ids in Ca (CAS): 38621 (changed +30.5%)
INFO: Number of ids in Wi (Wikipedia): 3991 (changed +0.0%)
INFO: Number of ids in Ck (KEGG Compound): 15896 (changed +0.0%)
INFO: Number of ids in Cpc (PubChem-compound): 29170 (changed +72.5%)
INFO: Number of ids in Wd: 18237
INFO: Number of ids in Cs (Chemspider): 23981 (changed +49.4%)

- 30% more CAS numbers (294 unique IDs in WikiPathways)
- 73% more PubChem compound identifiers (217 unique IDs in WP)
- 50% more Chemspider identifiers (157 unique IDs in WP)
- a lot more KEGG Drug identifiers
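
The percentage changes in the log can be inverted to estimate how many identifiers each source had before the update. This is a small arithmetic sketch, with the helper name `previous_count` being hypothetical. Because the log rounds percentages to one decimal, the results are estimates rather than exact earlier counts.

```python
def previous_count(current, pct_change):
    """Estimate the earlier count from the current count and the % change."""
    return current / (1 + pct_change / 100)


# Values copied from the log lines above: (current count, reported % change).
log = {
    "Kd (KEGG Drug)": (2406, 23960.0),
    "Ca (CAS)": (38621, 30.5),
    "Cpc (PubChem-compound)": (29170, 72.5),
    "Cs (Chemspider)": (23981, 49.4),
}

for source, (current, pct) in log.items():
    print(f"{source}: roughly {round(previous_count(current, pct))} before")
```

For KEGG Drug the estimate comes out at about 10 identifiers, which explains the unusually large percentage in the log.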

Page 23: Data integration with identifiers and ontologies

#1: The breath data set

CAS numbers: 1843

CAS numbers (unique): 1733

CAS numbers with mappings: 978

CAS numbers matches: 116

Pathways found: 158 (unique: 62)

Matches via CAS: 9

Matches via mapping: 28

Matches via ChEBI super class: 108

Matches via ChEBI charged species: 9

Matches via ChEBI tautomers: 0

Matches via ChEBI roles: 4

CAS: 544-63-8 (myristic acid) → Ce:28875 → Ce:15904 (long-chain fatty acid) → [WP368 Mitochondrial LC-Fatty Acid Beta-Oxidation, WP357 Fatty Acid Biosynthesis]
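
To put the two runs of the breath data set side by side, the counts can be expressed as shares of the 1733 unique CAS numbers. The sketch below uses only the numbers reported on the two slides. The labels "before" and "after" and the helper `share` are illustrative names, not part of the original analysis.

```python
UNIQUE_CAS = 1733  # unique CAS numbers in the breath data set

# Counts reported before and after adding the extra ID mappings.
runs = {
    "before": {"with mappings": 718, "matches": 54},
    "after": {"with mappings": 978, "matches": 116},
}


def share(part, whole):
    """Percentage of `whole` covered by `part`, rounded to one decimal."""
    return round(100 * part / whole, 1)


for label, counts in runs.items():
    for name, value in counts.items():
        print(f"{label}: {name} = {share(value, UNIQUE_CAS)}% of unique CAS numbers")
```

The share of CAS numbers with a mapping rises from about 41% to about 56%, and the share that matches a pathway roughly doubles.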

Page 24: Data integration with identifiers and ontologies

Wikidata

Mietchen, D. et al. Enabling open science: Wikidata for research (Wiki4R). Research Ideas and Outcomes 1, e7573+ (2015)

Page 25: Data integration with identifiers and ontologies

Wikidata: identifiers

Page 26: Data integration with identifiers and ontologies

Which API approach? REST, SADI, XMPP†, SOAP†?

Willighagen et al. "Computational toxicology using the OpenTox application programming interface and Bioclipse." BMC Research Notes 4.1 (2011): 1.

Wagener et al. "XMPP for cloud computing in bioinformatics supporting discovery and invocation of asynchronous web services." BMC Bioinformatics 10.1 (2009): 279.

Page 27: Data integration with identifiers and ontologies

Interactive API Docs with OpenAPI (Swagger)

Page 28: Data integration with identifiers and ontologies

Application Programming Interfaces

Page 29: Data integration with identifiers and ontologies

Conclusions

● Updated metabolite ID database
– HMDB: still a major workhorse
– ChEBI: charged species, compound classes
– Wikidata: CAS numbers, other missing

● Pathway Analysis
– Mapping with Bioclipse and PathVisio
– Scientific lenses improve mappings
– Better annotation