UZH BIO390 Semantic web, RDF, Ontologies and Knowledge Graphs in biomedical sciences Ahmad Aghaebrahimian Zurich University of Applied Sciences [email protected]

Jun 04, 2022

Page 1: UZH BIO390

UZH BIO390

Semantic web, RDF, Ontologies and Knowledge Graphs in biomedical sciences

Ahmad Aghaebrahimian

Zurich University of Applied Sciences

[email protected]

Page 2: UZH BIO390

- Ahmad Aghaebrahimian

- Research Associate at ZHAW

- Ph.D. Computer Sciences focusing on Computational Linguistics

- Areas of interest: Machine Learning, Deep Neural Networks, Biomedical Text Analytics, Natural Language Processing, Semantic Web

Email: [email protected]

Introduction

Ahmad Aghaebrahimian ([email protected]) BIO 390 – UZH ©

Page 3: UZH BIO390

- Introduction

- Stack of standards (URI, XML, RDF, SPARQL, OWL, …)

- RDF: Entities and Relationships

- Ontology

- Knowledge graphs

Session Content


Page 8: UZH BIO390

The linked open data

• Linked open data example

[Figure: linked open data example graph. Node labels: ART1, Melanoma, Melanoma Tumors, UV, Cancer, ADP-ribosyltransferase 1. Edge labels: LOF, Same_as, Caused_by, Type, Disable.]

Page 9: UZH BIO390

The linked open data cloud


Page 10: UZH BIO390

The life sciences data cloud


Page 13: UZH BIO390

Basics of the web

- Web structure:

Server vs. client

- Web components:

Uniform Resource Locator (URL): identifies and locates a document

Hypertext Markup Language (HTML): describes the structure and content of a document

Hypertext Transfer Protocol (HTTP): transfers a document between server and client

- Moving from pages to resources:

Interactive web (Web 2.0) or semantic web

Page 21: UZH BIO390

Semantic Web

What?

The Semantic Web (SW) is an extension of the World Wide Web that uses the Resource Description Framework (RDF) and the Web Ontology Language (OWL), among other standards, to make data on the web machine-readable.

Why?

- Presenting knowledge about data

- Allowing data integration across data silos

- Introducing intelligence to systems

Page 23: UZH BIO390

Semantic Web Standards

URI:

What is a resource?

URL → URI → IRI

Physically located → conceptually identified → conceptually identified in all languages

XML:

An open family of languages that represents structured data as tagged text

Rules:

- Exactly one root element: <root> </root>

- Every opening tag needs a closing tag: <Gene></Gene>

- Tag names must not begin with a number or with "xml"

- Tag names are case-sensitive: <Gene> != <gene>

- Nesting order matters, elements must be properly nested: <Gene> <nucl> </nucl></Gene>

- Tags may have attributes: <Gene inherited='true' />
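The XML rules above can be checked with any XML parser. A minimal sketch, assuming Python's standard `xml.etree.ElementTree` parser; the helper name `is_well_formed` and the example strings are illustrative and not part of the lecture material:

```python
import xml.etree.ElementTree as ET


def is_well_formed(document: str) -> bool:
    """Return True if the parser accepts the string as well-formed XML."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False


# One small example per rule; the tag names follow the slide.
examples = {
    "single root, properly nested": "<Gene><nucl></nucl></Gene>",
    "attribute on an empty tag": "<Gene inherited='true' />",
    "two root elements": "<Gene></Gene><Gene></Gene>",
    "missing closing tag": "<Gene><nucl></Gene>",
    "case mismatch": "<Gene></gene>",
    "name starts with a digit": "<1Gene></1Gene>",
}

results = {name: is_well_formed(doc) for name, doc in examples.items()}
for name, ok in results.items():
    print(f"{name}: {'well-formed' if ok else 'rejected'}")
```

The first two examples are accepted and the other four are rejected. The parser only checks well-formedness; it does not know what a Gene is, which is the gap RDF and OWL are meant to fill.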

Page 24: UZH BIO390

Semantic Web Standards

OWL:

OWL provides a rich vocabulary for adding semantics and context to data, which allows reasoning and inference.

Page 27: UZH BIO390

Ontology

• An ontology is

• A model of a domain

• A vocabulary consisting of classes and properties

• A machine-readable knowledge representation

• How to build an ontology?

• Define a domain

• Define the classes and properties

• Extend an existing ontology (RDF Schema, DBpedia, ...)

• Benefits of an ontology in biomedical research (and why they are important):

• Data integration

• Language processing via a domain vocabulary

• Defining the precise meaning of classes

• Automated processing

Page 30: UZH BIO390

Ontology Continued

• Ontology as a set of:

• Definitions

• Terms and their synonyms

• Relationships

• OBO: ChEBI, access via 'https://github.zhaw.ch/agha/D-Heath'

• UMLS:

• Metathesaurus

• Semantic Network

• SPECIALIST Lexicon

[Term]
id: CHEBI:60871
name: selenium(2+)
def: "The selenium ion with two positive charges." []
synonym: "Se(2+)" RELATED [UniProt:]
synonym: "selenium dication" RELATED [ChEBI:]
synonym: "Se2+" RELATED [SUBMITTER:]
synonym: "Se" RELATED FORMULA [ChEBI:]
synonym: "[Se++]" RELATED SMILES [ChEBI:]
synonym: "InChI=1S/Se/q+2" RELATED InChI [ChEBI:]
synonym: "InChIKey=MFSBVGSNNPNWMD-UHFFFAOYSA-N" RELATED InChIKey [ChEBI:]
is_a: CHEBI:60250
is_a: CHEBI:30412

[Term]
id: CHEBI:60250
name: selenium ion
def: "A selenium atom having a net electric charge." []
is_a: CHEBI:36904
is_a: CHEBI:36914
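The two ChEBI stanzas above show the OBO flat-file layout: one `tag: value` pair per line, grouped under `[Term]`. A minimal sketch of reading such stanzas and following the `is_a` links, assuming this simplified layout; `parse_obo` and `ancestors` are hypothetical helper names, and a real project would more likely use a dedicated OBO library:

```python
from collections import defaultdict

# Shortened copy of the two stanzas from the slide.
OBO_TEXT = """\
[Term]
id: CHEBI:60871
name: selenium(2+)
synonym: "Se(2+)" RELATED [UniProt:]
synonym: "selenium dication" RELATED [ChEBI:]
is_a: CHEBI:60250
is_a: CHEBI:30412

[Term]
id: CHEBI:60250
name: selenium ion
is_a: CHEBI:36904
is_a: CHEBI:36914
"""


def parse_obo(text):
    """Parse [Term] stanzas into a dict: term id -> {tag: [values]}."""
    terms = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("[") and line.endswith("]"):
            # A new stanza starts; only [Term] stanzas are collected.
            current = defaultdict(list) if line == "[Term]" else None
        elif current is not None and ": " in line:
            tag, value = line.split(": ", 1)
            current[tag].append(value)
            if tag == "id":
                terms[value] = current
    return terms


def ancestors(terms, term_id):
    """Follow is_a links breadth-first; parents not in the file are leaves."""
    seen = []
    queue = list(terms[term_id]["is_a"])
    while queue:
        parent = queue.pop(0)
        if parent not in seen:
            seen.append(parent)
            if parent in terms:
                queue.extend(terms[parent]["is_a"])
    return seen


terms = parse_obo(OBO_TEXT)
print(terms["CHEBI:60871"]["name"])
print(ancestors(terms, "CHEBI:60871"))
```

The `is_a` lines are what turn a flat term list into a hierarchy: selenium(2+) inherits from selenium ion, which in turn has its own parents.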

Page 33: UZH BIO390

Semantic Web Standards

RDF:

RDF is a graph-based data model, together with a set of syntaxes, that lets us write descriptions of resources on the web and exchange them. It represents data as triples and gives the data structure and unique identifiers so that it can be linked easily.

Principles:

Triple structure: (subject, predicate, object)

- subject → a URI resource

- predicate → a URI naming a binary relation

- object → a URI resource or a literal

Predicates are labeled

Predicates are directed

RDF is a graph model

RDF serializations:

RDF/XML, N-Triples, Turtle, TriG, JSON-LD
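To make the triple structure concrete, here is a minimal sketch that writes two triples in N-Triples, the simplest of the serializations listed above (one triple per line, URIs in angle brackets, literals in quotes). The namespace `http://example.org/` and the `label` predicate are assumptions made for illustration; in practice a library such as rdflib would handle serialization.

```python
BASE = "http://example.org/"  # hypothetical namespace, not from the lecture


def uri(local_name):
    """Wrap a local name as a full URI in N-Triples angle-bracket syntax."""
    return f"<{BASE}{local_name}>"


def literal(text):
    """Write a plain string literal, escaping backslashes and quotes."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def to_ntriples(triples):
    """Serialize (subject, predicate, object) terms, one triple per line."""
    return "\n".join(f"{s} {p} {o} ." for s, p, o in triples)


triples = [
    # object is a URI resource
    (uri("ART1"), uri("LOF"), uri("Melanoma_Tumors")),
    # object is a literal
    (uri("ART1"), uri("label"), literal("ADP-ribosyltransferase 1")),
]

document = to_ntriples(triples)
print(document)
```

The subject and predicate are always URIs here, while the object is either a URI or a literal, matching the principles on the slide.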

Page 36: UZH BIO390

The Graph data model

• Storing data in the form of triples: (Subject, Predicate, Object)

e.g. (ART, LOF, Melanoma_Tumors)

Subject and Predicate must be in URI form

• Triples follow the RDF standard.

• Triples are easily expanded and interlinked.

• Triples can be queried via SPARQL:

SELECT ?gene ?relation
WHERE {
?gene ?relation Melanoma_Tumors .
}
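The SPARQL query above is a triple pattern in which `?gene` and `?relation` are variables and the object is fixed. A minimal in-memory sketch of the same idea in plain Python, assuming the triples are held as tuples; `match` is a hypothetical helper, not a SPARQL engine. In a real query the object would be written as a full IRI or a prefixed name rather than a bare word.

```python
# Example triples taken from the lecture's knowledge-graph slides.
triples = [
    ("ART", "LOF", "Melanoma_Tumors"),
    ("ego-3", "REG", "paralysis"),
    ("ego-3", "REG", "sterility"),
]


def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples that fit the pattern; None plays the role of a variable."""
    return [
        (s, p, o)
        for s, p, o in triples
        if subject in (None, s) and predicate in (None, p) and obj in (None, o)
    ]


# Equivalent of: SELECT ?gene ?relation WHERE { ?gene ?relation Melanoma_Tumors . }
rows = [(s, p) for s, p, _ in match(triples, obj="Melanoma_Tumors")]
print(rows)
```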

Page 37: UZH BIO390

Subjects, Predicates, Objects


Named Entity

Relationship

NLP components:

• Named Entity Recognition (NER)

• Named Entity Disambiguation (NED)

• Relation Extraction (RE)

Page 38: UZH BIO390

Named Entity Recognition (NER)

Input: Atorvastatin lowers LDL and triglycerides and raises HDL in the blood.

Sequence taggers: Conditional Random Fields, Long Short-Term Memory, Convolutional Neural Network

Output tags (B = beginning of an entity, O = outside any entity): B O B O B O O B O O O
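Once a tagger has produced one tag per token, the entity mentions are read off the tag sequence. A minimal sketch, assuming whitespace tokenization and the B/O tags from the slide; the `I` (inside) tag is included as an assumption for multi-token entities and does not occur in this example:

```python
def extract_entities(tokens, tags):
    """Collect entity mentions from B/I/O tags aligned with the tokens."""
    entities, current = [], []
    for token, tag in zip(tokens, tags):
        if tag == "B":
            # A new entity starts; close the previous one if still open.
            if current:
                entities.append(" ".join(current))
            current = [token]
        elif tag == "I" and current:
            current.append(token)
        else:
            if current:
                entities.append(" ".join(current))
            current = []
    if current:
        entities.append(" ".join(current))
    return entities


sentence = "Atorvastatin lowers LDL and triglycerides and raises HDL in the blood."
tokens = sentence.rstrip(".").split()
tags = "B O B O B O O B O O O".split()

entities = extract_entities(tokens, tags)
print(entities)
```

With the slide's tags this yields the four mentions Atorvastatin, LDL, triglycerides and HDL.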

Page 42: UZH BIO390

NER Evaluation:

Accuracy: (TP + TN) / (TP + TN + FP + FN)

F1 score: 2 * Precision * Recall / (Precision + Recall), where Precision = TP / (TP + FP) and Recall = TP / (TP + FN)

Example: Cancer Diagnostics

                       True: Cancer   True: No-cancer
Predicted: Cancer      TP             FP
Predicted: No-cancer   FN             TN
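The two metrics can be computed directly from the four cells of the confusion matrix. A minimal sketch using the standard definitions; the counts below are made-up numbers for illustration, not results from the lecture:

```python
def accuracy(tp, fp, fn, tn):
    """Share of all predictions that are correct."""
    return (tp + tn) / (tp + fp + fn + tn)


def precision(tp, fp):
    """Share of predicted positives that are truly positive."""
    return tp / (tp + fp) if tp + fp else 0.0


def recall(tp, fn):
    """Share of true positives that were found."""
    return tp / (tp + fn) if tp + fn else 0.0


def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall."""
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r) if p + r else 0.0


# Hypothetical screening of 100 patients, 9 of whom have cancer.
model = dict(tp=8, fp=2, fn=1, tn=89)
# A useless baseline that always predicts "No-cancer".
baseline = dict(tp=0, fp=0, fn=9, tn=91)

for name, c in (("model", model), ("baseline", baseline)):
    acc = accuracy(c["tp"], c["fp"], c["fn"], c["tn"])
    f1 = f1_score(c["tp"], c["fp"], c["fn"])
    print(f"{name}: accuracy={acc:.2f} F1={f1:.2f}")
```

The baseline still reaches 0.91 accuracy because most patients are healthy, but its F1 is 0. This is why F1 is usually preferred for NER, where most tokens are O.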

Page 45: UZH BIO390

Named Entity Disambiguation (NED)

Atorvastatin lowers LDL and triglycerides and raises HDL in the blood.

CHEBI:39548 lowers CHEBI:47774 and IUPAC:46823 and raises CHEBI:47775 in the blood.

Problem:

- Different order

- Morphological forms

- Synonymous names

- Abbreviation

[Figure: model architecture. Three columns labeled True match, Entity and False match, each with an LSTM layer and an Attention layer, joined by a Hinge Loss.]

Aghaebrahimian, A., Cieliebak, M. (2020), Named Entity Disambiguation at Scale, ANNPR, Winterthur, Switzerland
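The figure labels (true match, entity, false match, hinge loss) suggest a ranking objective: the encoded mention should be closer to its correct identifier than to a wrong one. A minimal sketch of a generic margin-based hinge loss over cosine similarities, assuming the three encoders have already produced vectors; the margin value and the toy vectors are illustrative, and this is not claimed to be the exact loss of the cited paper:

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def hinge_loss(entity, true_match, false_match, margin=0.5):
    """Zero once the true match beats the false match by at least the margin."""
    return max(0.0, margin - cosine(entity, true_match) + cosine(entity, false_match))


# Toy 2-d encodings (hypothetical): the mention points the same way as its true match.
mention = [1.0, 0.0]
correct_id = [1.0, 0.0]
wrong_id = [0.0, 1.0]

good = hinge_loss(mention, correct_id, wrong_id)  # ranking already correct
bad = hinge_loss(mention, wrong_id, correct_id)   # roles swapped: penalized
print(good, bad)
```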

Page 47: UZH BIO390

Relation Extraction (RE)

Atorvastatin lowers LDL and triglycerides and raises HDL in the blood.

Single-hop RE:

- Atorvastatin, LDL => lowers

- Atorvastatin, triglycerides => lowers

- Atorvastatin, HDL => raises

- LDL, triglycerides => None

- LDL, HDL => None

[Figure: model pipeline. Labels: Embedding, Embedding, Encode(.), CNN, Classifier.]

Aghaebrahimian, A. and Jurcicek, F. (2016), Open-domain Factoid Question Answering via Knowledge Graph Search, NAACL, San Diego, USA
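Single-hop relation extraction is usually framed as classification over entity pairs: every pair of recognized entities in the sentence becomes a candidate, and the classifier assigns either a relation or None. A minimal sketch of building those candidates, with the slide's labels standing in for classifier output; the `gold` dictionary is an illustrative stand-in, not a trained model:

```python
from itertools import combinations

# Entities found by NER in the example sentence.
entities = ["Atorvastatin", "LDL", "triglycerides", "HDL"]

# Relations stated on the slide; any pair not listed is labeled "None".
gold = {
    ("Atorvastatin", "LDL"): "lowers",
    ("Atorvastatin", "triglycerides"): "lowers",
    ("Atorvastatin", "HDL"): "raises",
}

# Every unordered pair of entities is one classification instance.
pairs = list(combinations(entities, 2))
labels = {pair: gold.get(pair, "None") for pair in pairs}

for (head, tail), relation in labels.items():
    print(f"{head}, {tail} => {relation}")
```

Four entities give six candidate pairs, so the full candidate set also contains (triglycerides, HDL), which the slide does not list; most candidates in real text carry the None label.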

Page 51: UZH BIO390

Knowledge Graph

A collection of billions of triples, known as assertions, modeled in RDF

(ART, LOF, Melanoma Tumors)
(ego-3, REG, paralysis)
(ego-3, REG, sterility)
(STAT1, GOF, immunodeficiency and autoimmunity)

Which genes are related to paralysis?

How does STAT1 impact the immune system?

What proteins are associated with adverse events caused by Fulvestrant?

[Figure: query path. Fulvestrant → causes → events → associated → Protein A, Protein B.]
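The questions above map to graph lookups: a one-hop lookup for "Which genes are related to paralysis?" and a two-hop path for the Fulvestrant question. A minimal sketch over an in-memory list of triples. The first four triples come from the slide; the Fulvestrant triples, including the node name "Event X", are hypothetical placeholders that follow the figure's path.

```python
triples = [
    ("ART", "LOF", "Melanoma Tumors"),
    ("ego-3", "REG", "paralysis"),
    ("ego-3", "REG", "sterility"),
    ("STAT1", "GOF", "immunodeficiency and autoimmunity"),
    # Hypothetical triples that mirror the figure's query path.
    ("Fulvestrant", "causes", "Event X"),
    ("Event X", "associated", "Protein A"),
    ("Event X", "associated", "Protein B"),
]


def subjects_of(obj):
    """One hop backwards: all subjects that point at the given object."""
    return sorted({s for s, _, o in triples if o == obj})


def objects_of(subject, predicate):
    """One hop forwards along a given predicate."""
    return sorted({o for s, p, o in triples if s == subject and p == predicate})


# Which genes are related to paralysis?
genes = subjects_of("paralysis")

# What proteins are associated with adverse events caused by Fulvestrant?
proteins = sorted(
    protein
    for event in objects_of("Fulvestrant", "causes")
    for protein in objects_of(event, "associated")
)

print(genes)
print(proteins)
```

The second question needs a join over an intermediate node, which is the kind of multi-hop query that is awkward in flat tables and natural in a graph.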

Page 53: UZH BIO390

The linked open data

• Linked open data example

[Figure: linked open data example graph, with some entities drawn dotted. Node labels: ART1, Melanoma, Melanoma Tumors, UV, Cancer, ADP-ribosyltransferase 1. Edge labels: LOF, Same_as, Caused_by, Type, Disable.]

Question: How do we know that the dotted entities are the same entities?

Page 54: UZH BIO390

Semantic Web tools

- RDFa:

Embedding triples in HTML pages via markup attributes, so that they can be extracted

https://rdfa.info/play/

- Gleaning Resource Descriptions from Dialects of Languages (GRDDL):

Transformation algorithms instead of markup

<link rel="transformation" href="http://www.w3.org/2000/06/dc-extract/dc-extract.xsl" />

- JSON for Linked Data (JSON-LD):

Attaching context to JSON files

- R2RML:

Transforming relational tables to RDF
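For the JSON-LD item above, "attaching context" means adding an `@context` that maps plain JSON keys to URIs, so that two files using the same key can be recognized as meaning the same thing. A minimal sketch with the standard `json` module. The `http://example.org/` URIs are hypothetical, and `expand_keys` only imitates the key-expansion step; it is not a JSON-LD processor.

```python
import json

# An ordinary JSON record: the keys mean nothing outside this file.
plain = {"name": "ART1", "fullName": "ADP-ribosyltransferase 1"}

# The context maps each key to a URI.
context = {
    "name": "http://www.w3.org/2000/01/rdf-schema#label",
    "fullName": "http://example.org/vocab/fullName",  # hypothetical vocabulary
}

document = {
    "@context": context,
    "@id": "http://example.org/gene/ART1",  # hypothetical identifier
    **plain,
}


def expand_keys(doc):
    """Replace every key that the context defines by its URI."""
    ctx = doc["@context"]
    return {ctx.get(key, key): value for key, value in doc.items() if key != "@context"}


expanded = expand_keys(document)
print(json.dumps(document, indent=2))
print(json.dumps(expanded, indent=2))
```

The original record stays valid JSON for existing consumers, while the context lets a linked-data tool read it as triples about the resource named by `@id`.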

Page 55: UZH BIO390

SPARQL


W3C standard

SPARQL Protocol And RDF Query Language

Lab work: https://bit.ly/3wjyHpf

Page 56: UZH BIO390

Life Sciences RDF data and SPARQL Endpoints

A SPARQL endpoint receives queries and returns their results over HTTP.

• Generic

- http://sparql.org/sparql.html

- http://demo.openlinksw.com/sparql

• Specific

• DBpedia

- https://dbpedia.org/sparql

• SIB Swiss Institute of Bioinformatics

- UniProt: http://sparql.uniprot.org

- neXtProt: http://snorql.nextprot.org

• EBI European Bioinformatics Institute:

- BioSamples, BioModels, ChEMBL, Expression Atlas, Reactome, Ensembl

- https://www.ebi.ac.uk/rdf/services/sparql

• NCBI National Center for Biotechnology Information:

- PubChemRDF (RDF only, no SPARQL endpoint)

- https://pubchem.ncbi.nlm.nih.gov/rdf/

• http://sparql-playground.sib.swiss/
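Since an endpoint is just an HTTP service, a query can be sent with any HTTP client. A minimal sketch that builds a GET request as described by the SPARQL 1.1 Protocol (the query travels in the `query` parameter, and the `Accept` header asks for JSON results), using the DBpedia endpoint listed above. The query itself is a generic placeholder, and the network call is left commented out because availability and response details vary by endpoint.

```python
from urllib.parse import urlencode
from urllib.request import Request

ENDPOINT = "https://dbpedia.org/sparql"

# Placeholder query: fetch any five triples.
QUERY = """\
SELECT ?s ?p ?o
WHERE { ?s ?p ?o . }
LIMIT 5
"""


def build_request(endpoint, query):
    """Build a SPARQL Protocol GET request that asks for JSON results."""
    url = endpoint + "?" + urlencode({"query": query})
    return Request(url, headers={"Accept": "application/sparql-results+json"})


request = build_request(ENDPOINT, QUERY)
print(request.full_url)

# Sending the request needs network access:
# import json
# from urllib.request import urlopen
# with urlopen(request) as response:
#     results = json.load(response)
#     for row in results["results"]["bindings"]:
#         print(row["s"]["value"], row["p"]["value"], row["o"]["value"])
```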