Jan 12, 2017
12th July, 2016
Connecting life sciences data at the
European Bioinformatics Institute
Tony Burdett
Technical Co-ordinator –
Samples, Phenotypes and
Ontologies Team
www.ebi.ac.uk
Bioinformatics is
the science of storing,
retrieving and analysing
large amounts of
biological information.
What is EMBL-EBI?
• Europe’s home for biological data services, research
and training
• A trusted data provider for the life sciences
• Part of the European Molecular Biology Laboratory,
an intergovernmental research organisation
• International: 570 members of staff from 57 nations
• Home of the ELIXIR Technical hub.
OUR MISSION
To provide freely
available data and
bioinformatics services
to all facets of the
scientific community in
ways that promote
scientific progress
Big data, big demand
~18.5 million requests to EMBL-EBI
websites every day
60 petabytes of EMBL-EBI storage capacity
EMBL-EBI handles
9.2 million jobs on average per
month
Scientists at over
5 million unique sites use
EMBL-EBI websites
From molecules to medicine
Biology is changing:
• Lower-cost sequencing
• More data produced
• New types of data
• Emphasis on systems biology
Bioinformatics enables new
applications:
• molecular medicine
• agriculture
• food
• environmental sciences
Data resources at EMBL-EBI
• Genes, genomes & variation: European Nucleotide Archive, European Variation Archive, European Genome-phenome Archive, Ensembl, Ensembl Genomes, GWAS Catalog, Metagenomics portal
• Gene, protein & metabolite expression: RNAcentral, ArrayExpress, Expression Atlas, MetaboLights, PRIDE
• Protein sequences, families & motifs: InterPro, Pfam, UniProt
• Chemical biology: ChEMBL, SureChEMBL, ChEBI
• Molecular structures: Protein Data Bank in Europe, Electron Microscopy Data Bank
• Reactions, interactions & pathways: IntAct, Reactome, MetaboLights
• Systems: BioModels, Enzyme Portal, BioSamples
• Literature & ontologies: Europe PubMed Central, BioStudies, Gene Ontology, Experimental Factor Ontology
Database interactions
• Collaborative community
facilitates social,
scientific and technical
interactions
• Right: internal
interactions between
data resources as
determined by the
exchange of data.
• Width of each internal
arc weighted according
to the number of different
data types exchanged.
Biology 101 – Central Dogma
Dhorspool at en.wikipedia [CC BY-SA 3.0 (http://creativecommons.org/licenses/by-sa/3.0)
or GFDL (http://www.gnu.org/copyleft/fdl.html)], via Wikimedia Commons
Sadly, it’s not *quite* that simple…
User:Dhorspool [CC BY-SA 3.0 (http://creativecommons.org/licenses/by-sa/3.0)
or GFDL (http://www.gnu.org/copyleft/fdl.html)], via Wikimedia Commons
Nope, not that simple either…
[Figure: genome, proteome and metabolome measurements on tissue. Labels: mRNA array, miRNA array, antibody array, LC-MS/MS (example mass spectrum, intensity vs m/z), CE-MS, pathways, protein interaction, drug targets]
Connections between Databases
[Figure: schema diagram of the links between the EBI RDF datasets. Recoverable labels:]
• Datasets: Gene Expression Atlas, Ensembl, UniProt, Gene Ontology (GO BP, GO MF, GO CC), Organisms, ChEMBL, BioModels, ChEBI, Reactome, PDB
• Ensembl entities: Gene, RNA transcript and Protein (all via identifiers.org/ensembl)
• Other entities: uniprot:Protein, Organism/taxon, Assay (?), Target (?), Structure, Pathway, SBO, and the BioModels types SBMLModel, Reaction, Species and Compartment
• Gene Expression Atlas: discretized differential gene expression ratio (sio:SIO_001078), attached with sio:'is attribute of' (sio:SIO_000011)
• Link predicates: rdfs:seeAlso (not currently linking to identifiers.org but soon), uniprot:transcribedFrom, uniprot:translatedTo, uniprot:classifiedWith, uniprot:organism, chembl:hasTarget, bq:is, bq:isVersionOf, bq:isHomologTo, bq:hasPart, bq:occursIn
• Edges are annotated with 1 and * cardinalities
• Relationships within BioModels can be found at https://github.com/sarala/ricordo-rdfconverter/wiki/SBML-RDF-Schema
How do we turn data into Linked Data?
(Example from the Gene Expression Atlas)
Relational Data to RDF graph conversion
• Give “things” URIs
• Type “things” with ontologies
• Link “things” to other related “things”
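The three steps above can be sketched in a few lines of Python. This is only an illustration, not the Atlas pipeline: the row layout, the `http://example.org/vocab/` vocabulary and the helper names are assumptions, and only the identifiers.org/ensembl URI prefix and the two Ensembl identifiers come from these slides.

```python
# Assumed URI prefixes; the example.org vocabulary is a placeholder, not a real ontology.
ENSEMBL = "http://identifiers.org/ensembl/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
EX = "http://example.org/vocab/"


def row_to_triples(row):
    """Convert one relational row (a dict) into (subject, predicate, object) URI triples."""
    # Step 1: give "things" URIs
    gene = ENSEMBL + row["gene_id"]
    transcript = ENSEMBL + row["transcript_id"]
    return [
        # Step 2: type "things" with ontology classes
        (gene, RDF_TYPE, EX + "ProteinCodingGene"),
        (transcript, RDF_TYPE, EX + "Transcript"),
        # Step 3: link "things" to other related "things"
        (transcript, EX + "transcribedFrom", gene),
    ]


def to_ntriples(triples):
    """Serialise URI-only triples as N-Triples lines."""
    return "\n".join(f"<{s}> <{p}> <{o}> ." for s, p, o in triples)


# A hypothetical row as it might come out of a relational table.
row = {"gene_id": "ENSMUSG00000001467", "transcript_id": "ENSMUST00000001507"}
triples = row_to_triples(row)
print(to_ntriples(triples))
```

In a real conversion the placeholder classes would be replaced by terms from an ontology such as the Sequence Ontology.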
Modeling data vs biology
• Typing and semantics are the main strengths of RDF, so we
focused on this aspect
• There are a lot of ontologies for the life sciences
• However, most of them model biology rather than data
• What does an Ensembl entry represent? Is an Ensembl
identifier really an instance of a Sequence Ontology Gene
class?
ensembl:ENSMUSG00000001467
rdf:type
so:’protein coding gene’
Database Entry or Real World Entity?
• Practically it makes sense to treat database entries as
proxies for the real world entity they represent
• Alternative introduces a layer of indirection that would only
make linking resources harder
• It means we can use biologically meaningful relationships
• But this may or may not work for all use cases
ensembl:ENSMUSG00000001467
rdf:type
so:’protein coding gene’
ensembl:ENSMUST00000001507
rdf:type
so:’transcript’
so:’transcribed from’
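A minimal sketch of why the proxy approach keeps queries simple: with the two entries above typed directly as a gene and a transcript, "which transcripts come from this gene?" is a single pattern match. The tiny in-memory store, the `match` helper and the underscore spellings of the SO terms are assumptions made for illustration.

```python
# Prefixed names as plain strings; the so:* spellings stand in for the SO terms on the slide.
TRIPLES = {
    ("ensembl:ENSMUSG00000001467", "rdf:type", "so:protein_coding_gene"),
    ("ensembl:ENSMUST00000001507", "rdf:type", "so:transcript"),
    ("ensembl:ENSMUST00000001507", "so:transcribed_from", "ensembl:ENSMUSG00000001467"),
}


def match(s=None, p=None, o=None):
    """Return all triples matching the pattern; None acts as a wildcard."""
    return [
        t for t in TRIPLES
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


def transcripts_of(gene):
    """One hop over a biologically meaningful relation, with no indirection layer."""
    return sorted(s for s, _, _ in match(p="so:transcribed_from", o=gene))


print(transcripts_of("ensembl:ENSMUSG00000001467"))
```

If database entries were modelled separately from the entities they describe, the same question would need an extra hop from each record to its real-world entity.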
Knowledge representation challenges
• The semantics of our data is complex
• The provenance models are even more complex
• The relationships are hard to define
• Balancing use-cases with representation is a major
challenge
• The harder you try to get representation correct, the harder it
is for users to query
• Performance drops off for simple queries
EBI RDF Platform
Successes
• Novel queries possible over
EBI datasets
• Production quality RDF
releases
• Community of users
• Highly available public
SPARQL endpoints
• 500+ users (10-50 million
hits per month)
• A lot of interest from industry
• Catalyst for new RDF efforts
Lessons
● Public SPARQL endpoints
problematic
● Query federation not
performant
● Inference support limited
● Not scalable for all EBI data
e.g. Variation, ENA
● Lack of expertise in service
teams
● Too much overhead to get
started quickly in this space
Ontologies for life sciences
[Figure: ontologies for the life sciences, arranged along an axis from genotype to phenotype]
• Domains: Sequence, Transcript, Proteins, Gene products, Pathways, Cell type, BRENDA tissue / enzyme source, Development, Anatomy, Phenotype, Plasmodium life cycle, eVOC (Expressed Sequence Annotation for Humans)
• Sequence and molecular ontologies: Sequence types and features; Genetic context; Molecule role; Molecular function; Biological process; Cellular component; Protein covalent bond; Protein domain; UniProt taxonomy; Pathway ontology; Event (INOH pathway ontology); Systems biology; Protein-protein interaction
• Development ontologies: Arabidopsis development; Cereal plant development; Plant growth and developmental stage; C. elegans development; Drosophila development (FBdv)
• Anatomy ontologies: Human developmental anatomy, abstract version; Human developmental anatomy, timed version; Mosquito gross anatomy; Mouse adult gross anatomy; Mouse gross anatomy and development; C. elegans gross anatomy; Arabidopsis gross anatomy; Cereal plant gross anatomy; Drosophila gross anatomy; Dictyostelium discoideum anatomy; Fungal gross anatomy (FAO); Plant structure; Maize gross anatomy; Medaka fish anatomy and development; Zebrafish anatomy and development
• Disease, trait and behaviour ontologies: NCI Thesaurus; Mouse pathology; Human disease; Cereal plant trait; PATO (attribute and value); Mammalian phenotype; Human phenotype; Habronattus courtship; Loggerhead nesting; Animal natural history and life history
Ontologies as Graphs
• OWL ontologies aren’t graphs, but…
… can be represented as an RDF graph
… people want to use them as graphs
• Plenty of RDF databases around
• But incomplete w.r.t. OWL semantics
• SPARQL is an acquired taste
Ontology repository use-cases
• Search for ontology terms
• labels, synonyms, descriptions
• Querying the structure
• Get parent/child terms
• Querying transitive closure
• Get ancestor/descendant terms
• Querying across relations
• Partonomy or development stages
• We can satisfy these requirements with Neo4j
OWL to Neo4j schema
• Label every node by type (e.g. class, property or individual) and ontology id
• Label every relation by name
• Include an additional index for “special relations” like partonomy and subsets
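The labelling scheme above can be sketched with plain Python data structures, without a database. Everything here is an assumption made for illustration: the `Node` class, the example.org IRIs, the choice of `part_of` as the only special relation, and the idea that a special relation gets a second generic edge. The name `RelatedTree` is borrowed from the query on the next slide; how it is really populated is not stated in the slides.

```python
from dataclasses import dataclass, field

# Assumption: relations that get the extra generic edge (the slide mentions partonomy and subsets).
SPECIAL_RELATIONS = {"part_of"}


@dataclass
class Node:
    iri: str
    labels: frozenset
    properties: dict = field(default_factory=dict)


def make_node(iri, kind, ontology_id, label):
    """Label a node by its type (Class, Property or Individual) and by its ontology id."""
    return Node(iri, frozenset({kind, ontology_id}), {"label": label})


def make_relations(source, name, target):
    """Label the relation by name; special relations also get a generic 'RelatedTree' edge."""
    edges = [(source.iri, name, target.iri)]
    if name in SPECIAL_RELATIONS:
        edges.append((source.iri, "RelatedTree", target.iri))
    return edges


# Placeholder IRIs, not real UBERON identifiers.
heart = make_node("http://example.org/onto/heart", "Class", "uberon", "heart")
organ = make_node("http://example.org/onto/organ", "Class", "uberon", "organ")
system = make_node("http://example.org/onto/cardiovascular_system", "Class", "uberon",
                   "cardiovascular system")

edges = make_relations(heart, "SUBCLASSOF", organ) + make_relations(heart, "part_of", system)
print(sorted(heart.labels))
print(edges)
```

Having one generic edge type for all special relations is what would let a single query follow both parent and partonomy links.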
Powerful yet simple queries
• Get the transitive closure for “heart” following parent and
partonomy relations from the UBERON anatomy ontology
// {0} is the ontology name and {1} the IRI of the start term
MATCH path =
  (n:Class)-[r:SUBCLASSOF|RelatedTree*]->(parent)
  <-[r2:SUBCLASSOF|RelatedTree]-(sibling:Class)
WHERE n.ontology_name = {0}
  AND n.iri = {1}
RETURN path
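For readers unfamiliar with Cypher, the ancestor half of this traversal amounts to a breadth-first walk over the selected relation types. The sketch below shows that idea in Python on a made-up edge list; the terms, the `ex:` prefix and the `ancestors` helper are hypothetical and are not taken from UBERON.

```python
from collections import deque

# Made-up edges (child, relation, parent); not real UBERON content.
EDGES = [
    ("ex:heart", "SUBCLASSOF", "ex:organ"),
    ("ex:heart", "part_of", "ex:cardiovascular_system"),
    ("ex:cardiovascular_system", "part_of", "ex:organism"),
    ("ex:organ", "SUBCLASSOF", "ex:anatomical_structure"),
    ("ex:heart", "develops_from", "ex:heart_primordium"),
]


def ancestors(start, relations):
    """Return every term reachable from `start` by following only the given relation types."""
    parents = {}
    for child, relation, parent in EDGES:
        if relation in relations:
            parents.setdefault(child, []).append(parent)

    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for parent in parents.get(node, []):
            if parent not in seen:  # guard against revisiting shared ancestors
                seen.add(parent)
                queue.append(parent)
    return seen


print(sorted(ancestors("ex:heart", {"SUBCLASSOF", "part_of"})))
```

Restricting `relations` to `{"SUBCLASSOF"}` gives the plain class hierarchy, while adding `part_of` mixes in partonomy, which is the choice the relationship types in the Cypher pattern express.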
Final thoughts – Neo4j and JSON-LD?
• A lot of frameworks now make it trivial to produce good
APIs
• What’s currently missing is how to integrate data from two or
more independent APIs
• Hard to crawl independent datasets for connections without
a human to interpret semantics
• Still a need to express a schema alongside the data
• W3C standards like RDF/RDFS/SKOS/OWL provide the
basic vocabularies and semantics for expressing data
schemas
• JSON-LD is bridging the gap from JSON to RDF
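To make the last point concrete, here is a rough sketch of how an `@context` lets an ordinary JSON record be read as RDF. The record, its label value and the example.org link are invented for illustration, and the hand-rolled `to_triples` function is not the JSON-LD expansion algorithm: it only handles flat term mappings of the two shapes shown.

```python
import json

# A hypothetical API response; only the @context turns its keys into RDF predicates.
doc = {
    "@context": {
        "label": "http://www.w3.org/2000/01/rdf-schema#label",
        "seeAlso": {
            "@id": "http://www.w3.org/2000/01/rdf-schema#seeAlso",
            "@type": "@id",  # the value is an IRI, not a string literal
        },
    },
    "@id": "http://identifiers.org/ensembl/ENSMUSG00000001467",
    "label": "example gene",
    "seeAlso": "http://example.org/other-api/genes/1",
}


def to_triples(document):
    """Map each non-keyword key to its predicate IRI and emit N-Triples-style lines."""
    context = document["@context"]
    subject = document["@id"]
    lines = []
    for key, value in document.items():
        if key.startswith("@"):
            continue  # skip JSON-LD keywords such as @context and @id
        term = context[key]
        if isinstance(term, dict):
            predicate, is_iri = term["@id"], term.get("@type") == "@id"
        else:
            predicate, is_iri = term, False
        obj = f"<{value}>" if is_iri else json.dumps(value)
        lines.append(f"<{subject}> <{predicate}> {obj} .")
    return lines


for line in to_triples(doc):
    print(line)
```

A client that ignores `@context` still sees plain JSON, which is why the schema can travel alongside the data without changing the API.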
Acknowledgements
• Samples, Phenotypes and Ontologies Team
• Simon Jupp, Olga Vrousgou, Thomas Liener, Dani Welter,
Catherine Leroy, Sira Sarntivijai, Ilinca Tudose, Helen
Parkinson
• Funding
• European Molecular Biology Laboratory (EMBL)
• European Union projects: DIACHRON, BioMedBridges,
CORBEL and Excelerate