12th July, 2016 Connecting life sciences data at the European Bioinformatics Institute Tony Burdett Technical Co-ordinator Samples, Phenotypes and Ontologies Team www.ebi.ac.uk

Connecting life sciences data at the European Bioinformatics Institute

Jan 12, 2017

Transcript
Page 1: Connecting life sciences data at the European Bioinformatics Institute

12th July, 2016

Connecting life sciences data at the European Bioinformatics Institute

Tony Burdett

Technical Co-ordinator – Samples, Phenotypes and Ontologies Team

www.ebi.ac.uk

Page 2: Connecting life sciences data at the European Bioinformatics Institute

Bioinformatics is the science of storing, retrieving and analysing large amounts of biological information.

Page 3: Connecting life sciences data at the European Bioinformatics Institute

What is EMBL-EBI?

• Europe’s home for biological data services, research and training
• A trusted data provider for the life sciences
• Part of the European Molecular Biology Laboratory, an intergovernmental research organisation
• International: 570 members of staff from 57 nations
• Home of the ELIXIR Technical hub

Page 4: Connecting life sciences data at the European Bioinformatics Institute

OUR MISSION

To provide freely available data and bioinformatics services to all facets of the scientific community in ways that promote scientific progress

Page 5: Connecting life sciences data at the European Bioinformatics Institute

Big data, big demand

• ~18.5 million requests to EMBL-EBI websites every day
• 60 petabytes of EMBL-EBI storage capacity
• EMBL-EBI handles 9.2 million jobs on average per month
• Scientists at over 5 million unique sites use EMBL-EBI websites

Page 6: Connecting life sciences data at the European Bioinformatics Institute

From molecules to medicine

[Figure label: Atlas, what happens where]

Biology is changing:

• Lower-cost sequencing

• More data produced

• New types of data

• Emphasis on systems biology

Bioinformatics enables new applications:

• molecular medicine

• agriculture

• food

• environmental sciences

Page 7: Connecting life sciences data at the European Bioinformatics Institute

Data resources at EMBL-EBI

• Genes, genomes & variation: Ensembl, Ensembl Genomes, GWAS Catalog, Metagenomics portal, RNA Central, European Nucleotide Archive, European Variation Archive, European Genome-phenome Archive
• Gene, protein & metabolite expression: ArrayExpress, Expression Atlas, MetaboLights, PRIDE
• Protein sequences, families & motifs: InterPro, Pfam, UniProt
• Molecular structures: Protein Data Bank in Europe, Electron Microscopy Data Bank
• Chemical biology: ChEMBL, SureChEMBL, ChEBI
• Reactions, interactions & pathways: IntAct, Reactome, MetaboLights
• Systems: BioModels, Enzyme Portal, BioSamples
• Literature & ontologies: Europe PubMed Central, BioStudies, Gene Ontology, Experimental Factor Ontology

Page 8: Connecting life sciences data at the European Bioinformatics Institute

Database interactions

• Collaborative community facilitates social, scientific and technical interactions
• Right: internal interactions between data resources as determined by the exchange of data
• Width of each internal arc weighted according to the number of different data types exchanged
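The arc weighting described above can be sketched in a few lines. This is only an illustration: the resource names `A`, `B`, `C` and the exchange records are hypothetical, and treating each arc as undirected is an assumption, since the slide does not say how direction is handled.

```python
from collections import defaultdict

# Hypothetical exchange records: (source resource, target resource, data type).
exchanges = [
    ("A", "B", "gene"),
    ("A", "B", "protein"),
    ("B", "A", "gene"),       # same pair, same data type: counted once
    ("B", "C", "structure"),
]

def arc_weights(records):
    """Return {(resource, resource): number of distinct data types exchanged}."""
    kinds_by_pair = defaultdict(set)
    for source, target, data_type in records:
        # Assumption: arcs are undirected, so (A, B) and (B, A) share one arc.
        kinds_by_pair[frozenset((source, target))].add(data_type)
    return {tuple(sorted(pair)): len(kinds) for pair, kinds in kinds_by_pair.items()}

weights = arc_weights(exchanges)
print(weights)
```

The arc width would then be drawn in proportion to each value in `weights`.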

Page 9: Connecting life sciences data at the European Bioinformatics Institute

Biology 101 – Central Dogma

Dhorspool at en.wikipedia [CC BY-SA 3.0 (http://creativecommons.org/licenses/by-sa/3.0) or GFDL (http://www.gnu.org/copyleft/fdl.html)], via Wikimedia Commons

Page 10: Connecting life sciences data at the European Bioinformatics Institute

Sadly, it’s not *quite* that simple…

User:Dhorspool [CC BY-SA 3.0 (http://creativecommons.org/licenses/by-sa/3.0) or GFDL (http://www.gnu.org/copyleft/fdl.html)], via Wikimedia Commons

Page 11: Connecting life sciences data at the European Bioinformatics Institute

Nope, not that simple either…

[Figure: genome, proteome and metabolome data derived from tissue. Techniques shown: mRNA array, miRNA array, antibody array, CE-MS and LC-MS/MS, with an example mass spectrum (intensity against m/z, 600 to 1600, with labelled b and y ion peaks). Downstream: pathways, protein interaction, drug targets.]

Page 12: Connecting life sciences data at the European Bioinformatics Institute

Connections between Databases

[Figure: schema of the links between datasets. Recoverable labels follow.]

Datasets and entity types:
• Gene Expression Atlas: discretized differential gene expression ratio (sio:SIO_001078)
• Ensembl: Gene, RNA transcript and Protein (via identifiers.org/ensembl)
• UniProt: uniprot:Protein
• Gene Ontology: GO BP, GO MF, GO CC
• Organisms: Organism/taxon
• ChEMBL: Assay (?), Target (?)
• BioModels: SBMLModel, Reaction, Species, Compartment
• Reactome: Pathway
• ChEBI
• SBO
• PDB: Structure

Linking predicates:
• sio:'is attribute of' (sio:SIO_000011)
• uniprot:classifiedWith, uniprot:organism, uniprot:transcribedFrom, uniprot:translatedTo
• bq:is, bq:isVersionOf, bq:isHomologTo, bq:hasPart, bq:occursIn
• chembl:hasTarget
• rdfs:seeAlso (not currently linking to identifiers.org but soon)

Relationships within BioModels can be found at https://github.com/sarala/ricordo-rdfconverter/wiki/SBML-RDF-Schema

Page 13: Connecting life sciences data at the European Bioinformatics Institute

We get REALLY good at doing this…

Page 14: Connecting life sciences data at the European Bioinformatics Institute

We get REALLY good at doing this…

Page 15: Connecting life sciences data at the European Bioinformatics Institute

http://www.ebi.ac.uk/rdf

Page 16: Connecting life sciences data at the European Bioinformatics Institute

How do we turn data into Linked Data

(Example from the Gene Expression Atlas)

Relational Data to RDF graph conversion

• Give “things” URIs

• Type “things” with ontologies

• Link “things” to other related “things”
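The three steps above can be sketched with plain Python that turns one relational row into N-Triples. This is a minimal illustration, not the platform's actual converter: the `identifiers.org/ensembl` URI scheme is taken from the schema slide, the Sequence Ontology class IRIs follow the OBO PURL pattern, and the `TRANSCRIBED_FROM` predicate IRI is a hypothetical placeholder.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
ENSEMBL = "http://identifiers.org/ensembl/"        # assumed URI scheme
OBO = "http://purl.obolibrary.org/obo/"            # OBO PURL base
TRANSCRIBED_FROM = "http://example.org/so/transcribed_from"  # hypothetical IRI

# One illustrative row standing in for a relational table.
rows = [{"gene": "ENSMUSG00000001467", "transcript": "ENSMUST00000001507"}]

def row_to_triples(row):
    # Step 1: give "things" URIs.
    gene = ENSEMBL + row["gene"]
    transcript = ENSEMBL + row["transcript"]
    return [
        # Step 2: type "things" with ontology classes.
        (gene, RDF_TYPE, OBO + "SO_0001217"),        # protein_coding_gene
        (transcript, RDF_TYPE, OBO + "SO_0000673"),  # transcript
        # Step 3: link "things" to other related "things".
        (transcript, TRANSCRIBED_FROM, gene),
    ]

def to_ntriples(triples):
    # Every term here is an IRI, so each one is wrapped in angle brackets.
    return "\n".join(f"<{s}> <{p}> <{o}> ." for s, p, o in triples)

triples = [t for row in rows for t in row_to_triples(row)]
print(to_ntriples(triples))
```

A real conversion would more likely use an RDF library and handle literals and datatypes; string formatting is used here only to keep the sketch self-contained.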

Page 17: Connecting life sciences data at the European Bioinformatics Institute

Modeling data vs biology

• Typing and semantics are the main strength of RDF, so we focused on this aspect
• A lot of ontologies for the life sciences
• However, most of them model biology rather than database records
• What does an Ensembl entry represent? Is an Ensembl identifier really an instance of a Sequence Ontology Gene class?

ensembl:ENSMUSG00000001467 rdf:type so:'protein coding gene'

Codiad

Page 18: Connecting life sciences data at the European Bioinformatics Institute

Database Entry or Real World Entity?

• Practically it makes sense to treat database entries as proxies for the real world entity they represent
• The alternative introduces a layer of indirection that would only make linking resources harder
• It means we can use biologically meaningful relationships
• But this may or may not work for all use cases

ensembl:ENSMUSG00000001467 rdf:type so:'protein coding gene'
ensembl:ENSMUST00000001507 rdf:type so:'transcript'
ensembl:ENSMUST00000001507 so:'transcribed from' ensembl:ENSMUSG00000001467
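To make the trade-off concrete, here is a small sketch comparing the two modelling choices on in-memory triples. The `ex:describes` predicate and the `ex:gene1` / `ex:transcript1` entity nodes are hypothetical, introduced only to show what the extra layer of indirection would look like.

```python
def objects(triples, subject, predicate):
    """Return all objects for a given subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]

# Proxy model: the database entry stands in for the real-world entity.
direct = [
    ("ensembl:ENSMUSG00000001467", "rdf:type", "so:protein_coding_gene"),
    ("ensembl:ENSMUST00000001507", "rdf:type", "so:transcript"),
    ("ensembl:ENSMUST00000001507", "so:transcribed_from", "ensembl:ENSMUSG00000001467"),
]

# Indirect model: each record "describes" a separate entity node (hypothetical).
indirect = [
    ("ensembl:ENSMUSG00000001467", "ex:describes", "ex:gene1"),
    ("ensembl:ENSMUST00000001507", "ex:describes", "ex:transcript1"),
    ("ex:gene1", "rdf:type", "so:protein_coding_gene"),
    ("ex:transcript1", "rdf:type", "so:transcript"),
    ("ex:transcript1", "so:transcribed_from", "ex:gene1"),
]

def gene_type_direct(transcript):
    # Two lookups: transcript -> gene -> type.
    gene = objects(direct, transcript, "so:transcribed_from")[0]
    return objects(direct, gene, "rdf:type")[0]

def gene_type_indirect(transcript):
    # Three lookups: record -> entity -> gene entity -> type.
    entity = objects(indirect, transcript, "ex:describes")[0]
    gene = objects(indirect, entity, "so:transcribed_from")[0]
    return objects(indirect, gene, "rdf:type")[0]

print(gene_type_direct("ensembl:ENSMUST00000001507"))
print(gene_type_indirect("ensembl:ENSMUST00000001507"))
```

Both functions give the same answer, but the indirect model needs an extra hop for every record it touches, which is the cost the slide refers to.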

Page 19: Connecting life sciences data at the European Bioinformatics Institute

Knowledge representation challenges

• The semantics of our data is complex
• The provenance models are even more complex
• The relationships are hard to define
• Balancing use-cases with representation is a major challenge
• The harder you try to get representation correct, the harder it is for users to query
• Performance drops off for simple queries

Page 20: Connecting life sciences data at the European Bioinformatics Institute

Connecting Gene and Protein in EBI RDF

Page 21: Connecting life sciences data at the European Bioinformatics Institute

EBI RDF Platform

Successes
• Novel queries possible over EBI datasets
• Production quality RDF releases
• Community of users
• Highly available public SPARQL endpoints
• 500+ users (10-50 million hits per month)
• A lot of interest from industry
• Catalyst for new RDF efforts

Lessons
• Public SPARQL endpoints are problematic
• Query federation is not performant
• Inference support is limited
• Not scalable for all EBI data, e.g. Variation, ENA
• Lack of expertise in service teams
• Too much overhead to get started quickly in this space

Page 22: Connecting life sciences data at the European Bioinformatics Institute

Ontologies for life sciences

Genotype Phenotype
Sequence
Proteins
Gene products Transcript
Pathways
Cell type
BRENDA tissue / enzyme source
Development
Anatomy
Phenotype
Plasmodium life cycle

- Sequence types and features
- Genetic Context
- Molecule role
- Molecular Function
- Biological process
- Cellular component
- Protein covalent bond
- Protein domain
- UniProt taxonomy
- Pathway ontology
- Event (INOH pathway ontology)
- Systems Biology
- Protein-protein interaction
- Arabidopsis development
- Cereal plant development
- Plant growth and developmental stage
- C. elegans development
- Drosophila development (FBdv)
- Human developmental anatomy, abstract version
- Human developmental anatomy, timed version
- Mosquito gross anatomy
- Mouse adult gross anatomy
- Mouse gross anatomy and development
- C. elegans gross anatomy
- Arabidopsis gross anatomy
- Cereal plant gross anatomy
- Drosophila gross anatomy
- Dictyostelium discoideum anatomy
- Fungal gross anatomy (FAO)
- Plant structure
- Maize gross anatomy
- Medaka fish anatomy and development
- Zebrafish anatomy and development
- NCI Thesaurus
- Mouse pathology
- Human disease
- Cereal plant trait
- PATO (attribute and value)
- Mammalian phenotype
- Human phenotype
- Habronattus courtship
- Loggerhead nesting
- Animal natural history and life history

eVOC (Expressed Sequence Annotation for Humans)

Page 23: Connecting life sciences data at the European Bioinformatics Institute

Ontologies as Graphs

• OWL ontologies aren’t graphs, but…

… can be represented as an RDF graph

… people want to use them as graphs

• Plenty of RDF databases around

• But incomplete w.r.t. OWL semantics

• SPARQL is an acquired taste

Page 24: Connecting life sciences data at the European Bioinformatics Institute

Ontology repository use-cases

• Search for ontology terms
  • labels, synonyms, descriptions
• Querying the structure
  • Get parent/child terms
• Querying transitive closure
  • Get ancestor/descendant terms
• Querying across relations
  • Partonomy or development stages
• We can satisfy these requirements with Neo4j
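The use-cases above map onto a handful of graph operations. The sketch below shows them over a toy in-memory graph rather than Neo4j; the `EX:` term identifiers, labels and edges are hypothetical and are not taken from any real ontology.

```python
# Hypothetical toy ontology: term id -> label and synonyms.
terms = {
    "EX:1": {"label": "anatomical structure", "synonyms": []},
    "EX:2": {"label": "organ", "synonyms": []},
    "EX:3": {"label": "heart", "synonyms": ["cardiac organ"]},
    "EX:4": {"label": "circulatory system", "synonyms": []},
}
# Edges as (child, relation, parent).
edges = [
    ("EX:2", "is_a", "EX:1"),
    ("EX:3", "is_a", "EX:2"),
    ("EX:4", "is_a", "EX:1"),
    ("EX:3", "part_of", "EX:4"),
]

def search(text):
    """Search labels and synonyms (case-insensitive substring match)."""
    text = text.lower()
    return sorted(
        term for term, data in terms.items()
        if text in data["label"].lower()
        or any(text in synonym.lower() for synonym in data["synonyms"])
    )

def parents(term, relations=("is_a",)):
    return sorted(o for s, r, o in edges if s == term and r in relations)

def children(term, relations=("is_a",)):
    return sorted(s for s, r, o in edges if o == term and r in relations)

def ancestors(term, relations=("is_a",)):
    """Transitive closure over the chosen relations."""
    seen, stack = set(), [term]
    while stack:
        for parent in parents(stack.pop(), relations):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return sorted(seen)

print(ancestors("EX:3"))                       # subclass hierarchy only
print(ancestors("EX:3", ("is_a", "part_of")))  # also following partonomy
```

Passing a different `relations` tuple is what "querying across relations" amounts to here: the same traversal, restricted to a different set of edge types.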

Page 25: Connecting life sciences data at the European Bioinformatics Institute

OWL to Neo4j schema

• Label every node by type (e.g. class, property or individual) and ontology id
• Label every relation by name
• Include an additional index for “special relations” like partonomy and subsets
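One way to read this schema is sketched below as Python that emits Cypher `CREATE` statements. The function names, the example IRIs and the choice of `part_of` as the only "special relation" are assumptions for illustration; the `RelatedTree` relationship type is borrowed from the query shown on the next slide, on the guess that it plays the role of the additional index.

```python
SPECIAL_RELATIONS = {"part_of"}  # assumed set of "special relations"

def node_statement(iri, kind, ontology):
    # Two labels per node: the entity type and the ontology id.
    return f"CREATE (:{kind}:{ontology} {{iri: '{iri}'}})"

def relation_statements(source, name, target):
    # Every relation is typed by its own name; special relations also get
    # a second, generic relationship so they can be traversed together.
    types = [name.upper()]
    if name in SPECIAL_RELATIONS:
        types.append("RelatedTree")
    return [
        f"MATCH (a {{iri: '{source}'}}), (b {{iri: '{target}'}}) "
        f"CREATE (a)-[:{rel_type}]->(b)"
        for rel_type in types
    ]

heart = "http://example.org/heart"                # hypothetical IRI
system = "http://example.org/circulatory_system"  # hypothetical IRI

statements = [
    node_statement(heart, "Class", "uberon"),
    node_statement(system, "Class", "uberon"),
    *relation_statements(heart, "part_of", system),
]
print("\n".join(statements))
```

String interpolation is used only to keep the sketch readable; a real loader would pass values as query parameters rather than building them into the statement text.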

Page 26: Connecting life sciences data at the European Bioinformatics Institute

Powerful yet simple queries

• Get the transitive closure for “heart” following parent and partonomy relations from the UBERON anatomy ontology

MATCH path = (n:Class)-[r:SUBCLASSOF|RelatedTree*]->(parent)<-[r2:SUBCLASSOF|RelatedTree]-(sibling:Class)
WHERE n.ontology_name = {0}
  AND n.iri = {1}
RETURN path  // assumed: the slide shows only the MATCH and WHERE clauses

Page 27: Connecting life sciences data at the European Bioinformatics Institute

Final thoughts – Neo4j and JSON-LD?

• A lot of frameworks now make it trivial to produce good APIs
• What’s currently missing is how to integrate data from two or more independent APIs
• Hard to crawl independent datasets for connections without a human to interpret semantics
• Still a need to express a schema alongside the data
• W3C standards such as RDF/RDFS/SKOS/OWL provide the basic vocabularies and semantics for expressing data schemas
• JSON-LD is bridging the gap from JSON to RDF
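As a rough illustration of that last point, the sketch below reads an ordinary JSON record with a JSON-LD `@context` and turns it into triples. It is not a JSON-LD processor: it handles only flat documents with simple term definitions, and the record contents, label text and `example.org` link are hypothetical.

```python
import json

# An ordinary-looking JSON record; the @context maps its keys to RDF IRIs.
doc = json.loads("""
{
  "@context": {
    "label": "http://www.w3.org/2000/01/rdf-schema#label",
    "seeAlso": {
      "@id": "http://www.w3.org/2000/01/rdf-schema#seeAlso",
      "@type": "@id"
    }
  },
  "@id": "http://identifiers.org/ensembl/ENSMUSG00000001467",
  "label": "example gene record",
  "seeAlso": "http://example.org/related-record"
}
""")

def to_triples(document):
    """Expand a flat JSON-LD document into N-Triples lines (simplified)."""
    context = document["@context"]
    subject = document["@id"]
    triples = []
    for key, value in document.items():
        if key.startswith("@"):
            continue  # keywords such as @context and @id are not properties
        term = context[key]
        if isinstance(term, dict):
            predicate = term["@id"]
            is_iri = term.get("@type") == "@id"
        else:
            predicate, is_iri = term, False
        # IRIs go in angle brackets; everything else becomes a quoted literal.
        obj = f"<{value}>" if is_iri else json.dumps(value)
        triples.append(f"<{subject}> <{predicate}> {obj} .")
    return triples

for line in to_triples(doc):
    print(line)
```

A client that ignores `@context` still sees plain JSON, while a client that understands it gets the schema alongside the data.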

Page 28: Connecting life sciences data at the European Bioinformatics Institute

Acknowledgements

• Samples, Phenotypes and Ontologies
  • Simon Jupp, Olga Vrousgou, Thomas Liener, Dani Welter, Catherine Leroy, Sira Sarntivijai, Ilinca Tudose, Helen Parkinson
• Funding
  • European Molecular Biology Laboratory (EMBL)
  • European Union projects: DIACHRON, BioMedBridges, CORBEL and Excelerate

Page 29: Connecting life sciences data at the European Bioinformatics Institute

Questions?