What developers need to know about ontologies?
Barry Smith
http://ontology.buffalo.edu/smith
Jan 19, 2016
1
HL7 Watch (blog)
Microsoft Healthvault:
Allergic Episode is_a Health Record Item,
Health Record Item =def. A single piece of data in a health record that is accessible through the HealthVault service
2
3
Problem of ensuring sensible cooperation in a massively interdisciplinary community
concept, type, instance, model, representation, data
4
What do these mean?
‘conceptual data model’
‘semantic knowledge model’
‘reference information model’
You’re interested in which genes control heart muscle development
17,536 results
5
[Figure: microarray gene tree comparing attacked and control samples over time (gene list: all genes, 14010), with clusters labeled puparial adhesion, molting cycle, hemocyanin; defense response, immune response, response to stimulus, Toll-regulated genes, JAK-STAT-regulated genes; amino acid catabolism, lipid metabolism; peptidase activity, protein catabolism]
Microarray data shows changed expression of thousands of genes.
How will you spot the patterns?
6
Lab / pathology data
EHR data
Clinical trial data
Family history data
Medical image data
Microarray data
Model organism data
Flow cytometry
Mass spec
Genotype / SNP data
How will you find the data you need?
7
− Human
− Mouse
− Rat
− Fish
− Yeast
− E. coli
How will you compare the data? How will you integrate the data?
8
The GO Idea
Databases: MouseEcotope, GlyProt, DiabetInGene, GluChem
Shared annotation term: sphingolipid transporter activity
annotation using common ontologies yields integration of databases
Databases: MouseEcotope, GlyProt, DiabetInGene, GluChem
Shared annotation term: Holliday junction helicase complex
• For this to work, ontologies cannot be allowed to proliferate uncontrollably
• Rather, we need as far as possible non-overlapping ontology modules (OBO Foundry)
• How should we build these modules in such a way as to ensure glue-ability of annotations?
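The integration idea above can be sketched in a few lines of Python. This is a minimal illustration, not the GO tooling itself: the database names and the two annotation terms come from the slides, while the record IDs and the `integrate` helper are hypothetical.

```python
from collections import defaultdict

# Hypothetical annotation records: (database, record ID, ontology term).
# Record IDs are invented for illustration; terms are taken from the slides.
ANNOTATIONS = [
    ("MouseEcotope", "rec-1", "sphingolipid transporter activity"),
    ("GlyProt", "rec-7", "sphingolipid transporter activity"),
    ("DiabetInGene", "rec-3", "Holliday junction helicase complex"),
    ("GluChem", "rec-9", "sphingolipid transporter activity"),
    ("GlyProt", "rec-8", "Holliday junction helicase complex"),
]


def integrate(annotations):
    """Group (database, record) pairs under the ontology term they share."""
    index = defaultdict(list)
    for database, record_id, term in annotations:
        index[term].append((database, record_id))
    return dict(index)


index = integrate(ANNOTATIONS)
databases = sorted({db for db, _ in index["sphingolipid transporter activity"]})
print(databases)
```

Because every database uses the same term for the same type, a single lookup retrieves records across all of them without any pairwise mapping.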
12
Glue-ability / integration
• rests on the existence of a common benchmark called ‘reality’
• the ontologies we want to glue together are representations of what exists in the world
• not of what exists in the heads of different groups of people
13
two kinds of annotations
14
names of types
15
names of instances
16
First basic distinction
type vs. instance
(science text vs. diary)
(human being vs. Tom Cruise)
17
For ontologies
it is generalizations that are important = ontologies are
about types, kinds, universals
18
Ontology types Instances
19
Ontology = A Representation of types
20
An ontology is a representation of types
We learn about types in reality from looking at the results of scientific experiments in the form of scientific theories
experiments relate to what is particular; science describes what is general
21
Inventory vs. Catalog
Two kinds of representational artifact
Very roughly:
Databases represent instances
Ontologies represent types
22
A 515287 DC3300 Dust Collector Fan
B 521683 Gilmer Belt
C 521682 Motor Drive Belt
Catalog vs. inventory
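The catalog/inventory contrast can be made concrete with a small Python sketch. The part numbers and names are the three catalog lines above; the serial numbers, class names and `count_in_stock` helper are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CatalogEntry:
    """A type of part: what a catalog (or an ontology) lists."""
    part_number: str
    name: str


@dataclass(frozen=True)
class InventoryItem:
    """A particular part on a shelf: what an inventory (or a database) lists."""
    serial: str          # hypothetical serial number
    entry: CatalogEntry  # the type this particular item instantiates


CATALOG = {
    "515287": CatalogEntry("515287", "DC3300 Dust Collector Fan"),
    "521683": CatalogEntry("521683", "Gilmer Belt"),
    "521682": CatalogEntry("521682", "Motor Drive Belt"),
}

# Hypothetical stock: two Gilmer belts and one fan.
INVENTORY = [
    InventoryItem("SN-0001", CATALOG["521683"]),
    InventoryItem("SN-0002", CATALOG["521683"]),
    InventoryItem("SN-0003", CATALOG["515287"]),
]


def count_in_stock(part_number):
    """Count the inventory items that instantiate a given catalog entry."""
    return sum(1 for item in INVENTORY if item.entry.part_number == part_number)


print(count_in_stock("521683"))
```

A catalog entry exists whether or not any item of that type is in stock, which is why the catalog, not the inventory, is the analogue of an ontology.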
23
Catalog vs. inventory
24
Catalog of types/Types
25
[Figure: hierarchy of types (object, organism, animal, mammal, cat, siamese; frog), with instances beneath the types]
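A minimal Python sketch of such a hierarchy follows. The type names come from the figure; the placement of frog directly under animal, the instance names, and the helper functions are assumptions made for illustration.

```python
# Each type points to its parent type (is_a). Type names are from the figure;
# the exact parent of "frog" is an assumption.
IS_A = {
    "siamese": "cat",
    "cat": "mammal",
    "mammal": "animal",
    "frog": "animal",
    "animal": "organism",
    "organism": "object",
}

# Hypothetical particulars and the most specific type each one instantiates.
INSTANCE_OF = {"tibbles": "siamese", "frog_17": "frog"}


def ancestors(type_name):
    """Return every type reachable from type_name by following is_a upward."""
    chain = []
    while type_name in IS_A:
        type_name = IS_A[type_name]
        chain.append(type_name)
    return chain


def is_instance_of(particular, type_name):
    """A particular instantiates its direct type and all of that type's ancestors."""
    direct = INSTANCE_OF[particular]
    return type_name == direct or type_name in ancestors(direct)


print(ancestors("siamese"))
print(is_instance_of("tibbles", "mammal"), is_instance_of("frog_17", "mammal"))
```

The ontology is the `IS_A` part; the `INSTANCE_OF` part is what a database would hold.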
26
Ontologies are here
27
or here
28
ontologies represent general structures in reality (leg)
29
Ontologies do not represent concepts in people’s heads
30
They represent types in reality
31
which provide the benchmark for integration
32
Entity =def
anything which exists, including things and processes, functions and qualities, beliefs and actions, documents and software (Levels 1, 2 and 3)
33
what are the kinds of entity?
34
First basic distinction
type vs. instance
(science text vs. diary)
(human being vs. Tom Cruise)
35
Ontology Types Instances
36
Ontology = A Representation of types
37
Domain =def
a portion of reality that forms the subject-matter of a single science or technology or mode of study or administrative practice ...
Examples: proteomics, HIV, epidemiology
38
Representation =def
an image, idea, map, picture, name or description ... of some entity or entities.
39
Ontologies are representational artifacts
comparable to science texts and subject to the same sorts of constraints (including need for update)
40
Representational units =def
terms, icons, alphanumeric identifiers ... which refer, or are intended to refer, to entities
and which are minimal (atoms)
41
Composite representation =def
representation
(1) built out of representational units which
(2) form a structure that mirrors, or is intended to mirror, the entities in some domain
42
Analogue representations
no representational units, no ‘atoms’
43
The Periodic Table
44
Class =def
a maximal collection of particulars determined by a general term (‘cell’, ‘electron’, but also: ‘restaurant in Palo Alto’, ‘Italian’)
the class A = the collection of all particulars x for which ‘x is A’ is true
45
types vs. their extensions
types
{a,b,c,...} collections of particulars
46
Extension =def
The extension of a type A is the class: instance of the type A
(it is the class of A’s instances)
(the class of all entities to which the term ‘A’ applies)
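The definition of an extension can be illustrated as a set comprehension in Python. This is a sketch under stated assumptions: the particulars and the `extension` helper are hypothetical, and the general terms ‘cell’ and ‘electron’ are borrowed from the definition of class.

```python
# Hypothetical particulars, each tagged with the general terms true of it.
PARTICULARS = {
    "cell_1": {"cell"},
    "cell_2": {"cell"},
    "electron_1": {"electron"},
}


def extension(general_term):
    """The class of all particulars x for which 'x is <general_term>' is true."""
    return {x for x, terms in PARTICULARS.items() if general_term in terms}


print(sorted(extension("cell")))

# The extension changes as particulars come and go; the type does not.
PARTICULARS["cell_3"] = {"cell"}
print(len(extension("cell")))
```

The second call shows why a type should not be identified with its extension: the collection grew, but ‘cell’ still designates the same type.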
47
Problem
The same general term can be used to refer both to types and to collections of particulars. Consider:
HIV is an infectious retrovirus
HIV is spreading very rapidly through Asia
48
types vs. classes
types
{c,d,e,...} classes
49
types vs. classes
types
~ defined classes
50
types vs. classes
types
e.g. populations, ...
51
Defined class =def
a class defined by a general term which does not designate a type
the class of all diabetic patients in Leipzig on 4 June 1952
52
OWL is a good representation of defined classes
• sibling of Finnish spy
• member of Abba aged > 50 years
• pizza with > 4 different toppings
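A defined class is fixed by a membership condition rather than by a type in reality. The sketch below is a Python analogy only, not OWL syntax; the pizzas, their toppings and the helper name are hypothetical.

```python
# Hypothetical pizzas and their toppings.
PIZZAS = {
    "pizza_a": ["tomato", "mozzarella", "basil"],
    "pizza_b": ["tomato", "mozzarella", "ham", "olive", "mushroom"],
    "pizza_c": ["tomato", "tomato", "ham", "olive", "onion"],
}


def has_more_than_four_different_toppings(toppings):
    """Membership condition for the defined class from the slide."""
    return len(set(toppings)) > 4


defined_class = {
    name
    for name, toppings in PIZZAS.items()
    if has_more_than_four_different_toppings(toppings)
}
print(defined_class)
```

Here `pizza_c` lists five toppings but only four different ones, so only `pizza_b` falls in the class.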
53
Terminology =def.
a representational artifact whose representational units are natural language terms (with IDs, synonyms, comments, etc.) which are intended to designate types together with defined classes, with no particular attention to composite representations
54
types, classes, concepts
types
defined classes
‘concepts’ ?
55
types < defined classes < ‘concepts’
‘concepts’ which do not correspond to defined classes:
‘Surgical or other procedure not carried out because of patient's decision’
‘Congenital absent nipple’
because they do not correspond to anything
Gene Ontology: The Very Top
cellular component
molecular function
biological process
56
Gene Ontology: The Very Top
continuant
  cellular component
  molecular function
occurrent
  biological process
57
BFO: The Very Top
continuant
  independent continuant
    cellular component
  dependent continuant
    molecular function
occurrent
  biological processes
58
Basic Formal Ontology
continuant
  independent continuant
    organism
  dependent continuant
occurrent
59
Basic Formal Ontology
continuant
  independent continuant
    anatomical structure
  dependent continuant
occurrent
60
Continuants
• continue to exist through time, preserving their identity while undergoing different sorts of changes
• independent continuants – objects, things, ...
• dependent continuants – qualities, attributes, shapes, potentialities ...
61
Qualities
temperature
blood pressure
mass
...
are continuants
they exist through time while undergoing changes
62
Qualities
temperature / blood pressure / mass ...
are dimensions of variation within the structure of the entity; a quality is something which can change while its bearer remains one and the same
63
A Chart representing how John’s temperature
changes
65
John’s temperature, the temperature he has throughout his entire life, cycles through different determinate temperatures from one time to the next
John’s temperature is a physiological variable which, in thus changing, exerts an influence on other physiological variables through time
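One way to picture this is a single quality instance that keeps its identity while instantiating different determinate temperatures at different times. The Python sketch below is illustrative only: the class and method names are hypothetical, and the pairing of values with times t1 to t6 is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class QualityInstance:
    """One quality (e.g. John's temperature) that persists through change."""
    type_name: str
    bearer: str
    determinates: dict = field(default_factory=dict)  # time -> determinate value

    def instantiate(self, time, determinate):
        """Record which determinate the quality instantiates at a given time."""
        self.determinates[time] = determinate

    def at(self, time):
        return self.determinates[time]


johns_temperature = QualityInstance("temperature", "John")
values = ["37ºC", "37.1ºC", "37.5ºC", "37.2ºC", "37.3ºC", "37.4ºC"]
for i, value in enumerate(values, start=1):
    johns_temperature.instantiate(f"t{i}", value)

print(johns_temperature.at("t3"))
```

There is one `QualityInstance` object throughout; only the determinate it instantiates varies with time.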
66
BFO: The Very Top
continuant
  independent continuant
  dependent continuant
    quality
      temperature
occurrent
67
Blinding Flash of the Obvious
types:
  independent continuant
    organism
  dependent continuant
    quality
      temperature
instances:
  John
  John’s temperature
68
69
Blinding Flash of the Obvious
types:
  organism
  temperature
instances:
  John
  John’s temperature
70
inheres_in
types:
  temperature
instances:
  John’s temperature
71
[Figure: determinate temperature types 37ºC, 37.1ºC, 37.5ºC, 37.2ºC, 37.3ºC, 37.4ºC, instantiated at times t1 to t6]
types:
  human
instances:
  John
72
[Figure: the types embryo, fetus, neonate, infant, child and adult, instantiated at successive times t1 to t6]
• lower level of types does not ‘carry identity’ in OntoClean terms
• are threshold divisions (hence we do not have sharp boundaries, and we have a certain degree of choice, e.g. in how many subtypes to distinguish, though not in their ordering)
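The notion of threshold divisions can be sketched as follows. The stage names come from the slide; the cut-off values are arbitrary placeholders in arbitrary time units (not biological data), and the `stage_at` helper is hypothetical. The point is that the cut-offs are a matter of choice while the ordering of the stages is not.

```python
from bisect import bisect_right

# Stage names from the slide, in their fixed developmental order.
STAGES = ["embryo", "fetus", "neonate", "infant", "child", "adult"]

# Arbitrary placeholder cut-offs (arbitrary units): THRESHOLDS[i] marks
# where STAGES[i + 1] begins. These numbers are NOT biological facts.
THRESHOLDS = [10, 20, 30, 40, 50]


def stage_at(time, stages=STAGES, thresholds=THRESHOLDS):
    """Return the stage instantiated at a given time."""
    if len(thresholds) != len(stages) - 1 or sorted(thresholds) != list(thresholds):
        raise ValueError("need one ascending threshold between each pair of stages")
    return stages[bisect_right(thresholds, time)]


print(stage_at(9), stage_at(10), stage_at(50))
```

Moving a cut-off changes where one stage hands over to the next, but no choice of thresholds can put fetus before embryo.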
73
types:
  independent continuant
    organism
  dependent continuant
    quality
      temperature
instances:
  John
  John’s temperature
74
types:
  independent continuant
    organism
  dependent continuant
    quality
      temperature
  occurrent
    process
      course of temperature changes
instances:
  John
  John’s temperature
  John’s temperature history
75
types:
  independent continuant
    organism
  dependent continuant
    quality
      temperature
  occurrent
    process
      life of an organism
instances:
  John
  John’s temperature
  John’s life
76
BFO/GO: The Very Top
continuant
  independent continuant
    cellular component
  dependent continuant
    molecular function
occurrent
  biological processes
77
BFO: The Very Top
continuant
  independent continuant
  dependent continuant
    quality
    function
    role
    disposition
occurrent
78
Function
- of liver: to store glycogen
- of birth canal: to enable transport
- of eye: to see
- of mitochondrion: to produce ATP
not optional; reflection of physical makeup of bearer; can malfunction
79
Role
optional: exists because the bearer is in some special natural, social, or institutional set of circumstances in which the bearer does not have to be
80
Role
- bearers can have more than one role
  person as student / as staff member
- roles often form systems of mutual dependence
  husband / wife
  first in queue / last in queue
  doctor / patient
  host / pathogen
81
Role
- of some chemical compound: to serve as analyte in an experiment
- of a dose of penicillin in this human child: to treat a disease
- of this bacterium in a primary host: to cause infection
82
Qualities are categorical features of reality – you just have them
Functions, roles and dispositions are potential features of reality: they are realizable dependent continuants, realized in certain associated processes
83
types:
  independent continuant
    portion of chemical compound
  dependent continuant
    role
      drug role
  occurrent
    process
      process of drug administration
instances:
  this portion of aspirin
  role of this portion of aspirin
  John’s taking this portion of aspirin
84
85
inheres_in
realized_in
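The aspirin example can be written down as a handful of subject/relation/object triples. This is a minimal sketch, assuming the usual BFO reading in which the role inheres_in its bearer and is realized_in a process; the `instance_of` relation name and the `objects` helper are hypothetical labels introduced for illustration.

```python
ROLE = "role of this portion of aspirin"
BEARER = "this portion of aspirin"
PROCESS = "John's taking this portion of aspirin"

# (subject, relation, object) triples for the aspirin example.
TRIPLES = [
    (BEARER, "instance_of", "portion of chemical compound"),
    (ROLE, "instance_of", "drug role"),
    (PROCESS, "instance_of", "process of drug administration"),
    (ROLE, "inheres_in", BEARER),
    (ROLE, "realized_in", PROCESS),
]


def objects(subject, relation):
    """Return every object linked to subject by relation."""
    return [o for s, r, o in TRIPLES if s == subject and r == relation]


print(objects(ROLE, "inheres_in"))
print(objects(ROLE, "realized_in"))
```

Before John takes the aspirin, the last triple would simply be absent: the role exists as a potential feature whether or not it is ever realized.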
The Open Biomedical Ontologies (OBO) Foundry
Columns (RELATION TO TIME): CONTINUANT (INDEPENDENT, DEPENDENT), OCCURRENT. Rows: GRANULARITY.
ORGAN AND ORGANISM
  independent continuant: Organism (NCBI Taxonomy); Anatomical Entity (FMA, CARO)
  dependent continuant: Organ Function (FMP, CPRO); Phenotypic Quality (PaTO)
  occurrent: Biological Process (GO)
CELL AND CELLULAR COMPONENT
  independent continuant: Cell (CL); Cellular Component (FMA, GO)
  dependent continuant: Cellular Function (GO)
MOLECULE
  independent continuant: Molecule (ChEBI, SO, RnaO, PrO)
  dependent continuant: Molecular Function (GO)
  occurrent: Molecular Process (GO)
86
• The Road to Convergence
All ontologies for each given domain (anatomy, chemistry…) should be part of a single suite of interoperable ontologies
should use a common top-level core
for subdomains with many variants, should follow the strategy of canonical ontologies with extensions
should require acceptance of common, tested guidelines on all subscribing ontology developers
87
Columns (RELATION TO TIME): CONTINUANT (INDEPENDENT, DEPENDENT), OCCURRENT. Rows: GRANULARITY.
ORGAN AND ORGANISM
  independent continuant: Organism (NCBI Taxonomy); Anatomical Entity (FMA, CARO)
  dependent continuant: Organ Function (FMP, CPRO); Phenotypic Quality (PaTO)
  occurrent: Organism-Level Process (GO)
CELL AND CELLULAR COMPONENT
  independent continuant: Cell (CL); Cellular Component (FMA, GO)
  dependent continuant: Cellular Function (GO)
  occurrent: Cellular Process (GO)
MOLECULE
  independent continuant: Molecule (ChEBI, SO, RnaO, PrO)
  dependent continuant: Molecular Function (GO)
  occurrent: Molecular Process (GO)
initial OBO Foundry coverage, ontologies automatically semantically coupled
88
Disposition (Internally-Grounded Realizable Entity)
disposition =def.
a realizable entity which if it ceases to exist, then its bearer is physically changed, and whose realization occurs when this bearer is in some special physical circumstances, in virtue of the bearer’s physical make-up
89
Function
• A Disposition (Internally-Grounded Realizable Entity) that is designed or selected for
90
OGMS
• Ontology for General Medical Science
http://code.google.com/p/ogms
91
Physical Disorder
– independent continuant
fiat object part
92
Big Picture
93
A disease is a disposition rooted in a physical disorder in the organism and realized in pathological processes.
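etiological process produces disorder
disorder bears disposition
disposition realized_in pathological process
pathological process produces abnormal bodily features
abnormal bodily features recognized_as signs & symptoms
signs & symptoms used_in interpretive process
interpretive process produces diagnosis
94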
Elucidation of Primitive Terms
• ‘bodily feature’ - an abbreviation for a physical component, a bodily quality, or a bodily process.
• disposition - an attribute describing the propensity to initiate certain specific sorts of processes when certain conditions are satisfied.
• clinically abnormal - some bodily feature that
– (1) is not part of the life plan for an organism of the relevant type (unlike aging or pregnancy),
– (2) is causally linked to an elevated risk either of pain or other feelings of illness, or of death or dysfunction, and
– (3) is such that the elevated risk exceeds a certain threshold level.*
*Compare: baldness
95
Definitions - Foundational Terms
• Disorder =def. – A causally linked combination of physical components that is clinically abnormal.
• Pathological Process =def. – A bodily process that is a manifestation of a disorder and is clinically abnormal.
• Disease =def. – A disposition (i) to undergo pathological processes that (ii) exists in an organism because of one or more disorders in that organism.
96
Dispositions and Predispositions
• All diseases are dispositions; not all dispositions are diseases.
• A predisposition is a disposition.
• Predisposition to Disease of Type X =def. – A disposition in an organism that constitutes an increased risk of the organism’s subsequently developing the disease X.
• HNPCC is caused by a
– disorder (mutation) in a DNA mismatch repair gene that
– disposes to the acquisition of additional mutations from defective DNA repair processes, and thus is a
– predisposition to the development of colon cancer.
97
Cirrhosis - environmental exposure
• Etiological process - phenobarbital-induced hepatic cell death
– produces
• Disorder - necrotic liver
– bears
• Disposition (disease) - cirrhosis
– realized_in
• Pathological process - abnormal tissue repair with cell proliferation and fibrosis that exceed a certain threshold; hypoxia-induced cell death
– produces
• Abnormal bodily features
– recognized_as
• Symptoms - fatigue, anorexia
• Signs - jaundice, splenomegaly
Symptoms & Signs used_in
Interpretive process produces
Hypothesis - rule out cirrhosis suggests
Laboratory tests produces
Test results - elevated liver enzymes in serum used_in
Interpretive process produces
Result - diagnosis that patient X has a disorder that bears the disease cirrhosis
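The cirrhosis case can be encoded as a chain of triples and walked step by step. This is a sketch, not an OGMS implementation: the entity labels are shortened from the slide, the chain stops at the first hypothesis (the laboratory-test loop is omitted), and the `trace` helper is hypothetical.

```python
# (subject, relation, object) triples following the cirrhosis slide.
CIRRHOSIS = [
    ("phenobarbital-induced hepatic cell death", "produces", "necrotic liver"),
    ("necrotic liver", "bears", "cirrhosis"),
    ("cirrhosis", "realized_in", "abnormal tissue repair"),
    ("abnormal tissue repair", "produces", "abnormal bodily features"),
    ("abnormal bodily features", "recognized_as", "symptoms and signs"),
    ("symptoms and signs", "used_in", "interpretive process"),
    ("interpretive process", "produces", "hypothesis: rule out cirrhosis"),
]


def trace(start, triples=CIRRHOSIS):
    """Follow the chain from start, returning entities and relations in order."""
    step = {s: (r, o) for s, r, o in triples}
    path = [start]
    while path[-1] in step:
        relation, nxt = step[path[-1]]
        path.extend([relation, nxt])
    return path


path = trace("phenobarbital-induced hepatic cell death")
print(" -> ".join(path))
```

Note where the disease sits in the chain: it is the disposition borne by the disorder, distinct from both the disorder itself and the pathological process that realizes it.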
98
Influenza - infectious
• Etiological process - infection of airway epithelial cells with influenza virus
– produces
• Disorder - viable cells with influenza virus
– bears
• Disposition (disease) - flu
– realized_in
• Pathological process - acute inflammation
– produces
• Abnormal bodily features
– recognized_as
• Symptoms - weakness, dizziness
• Signs - fever
Symptoms & Signs used_in
Interpretive process produces
Hypothesis - rule out influenza suggests
Laboratory tests produces
Test results - elevated serum antibody titers used_in
Interpretive process produces
Result - diagnosis that patient X has a disorder that bears the disease flu
But the disorder also induces normal physiological processes (immune response) that can result in the elimination of the disorder (transient disease course).
99
Huntington’s Disease - genetic
• Etiological process - inheritance of >39 CAG repeats in the HTT gene
– produces
• Disorder - chromosome 4 with abnormal mHTT
– bears
• Disposition (disease) - Huntington’s disease
– realized_in
• Pathological process - accumulation of mHTT protein fragments, abnormal transcription regulation, neuronal cell death in striatum
– produces
• Abnormal bodily features
– recognized_as
• Symptoms - anxiety, depression
• Signs - difficulties in speaking and swallowing
Symptoms & Signs used_in
Interpretive process produces
Hypothesis - rule out Huntington’s suggests
Laboratory tests produces
Test results - molecular detection of the HTT gene with >39CAG repeats used_in
Interpretive process produces
Result - diagnosis that patient X has a disorder that bears the disease Huntington’s disease
100
HNPCC - genetic pre-disposition
• Etiological process - inheritance of a mutant mismatch repair gene
– produces
• Disorder - chromosome 3 with abnormal hMLH1
– bears
• Disposition (disease) - Lynch syndrome
– realized_in
• Pathological process - abnormal repair of DNA mismatches
– produces
• Disorder - mutations in proto-oncogenes and tumor suppressor genes with microsatellite repeats (e.g. TGF-beta R2)
– bears
• Disposition (disease) - non-polyposis colon cancer
– realized_in
• Symptoms (including pain)
101
The OBO Foundry Initiative
102
A good solution to the data integration problem must be:
• modular
• incremental
• bottom-up
• evidence-based
• revisable
• incorporate a strategy for motivating potential developers and users
103
GO is amazingly successful – but covers only three sorts of biological entities:
– cellular components
– molecular functions
– biological processes
and does not provide representations of disease-related phenomena
104
The Open Biomedical Ontologies (OBO) Foundry
Columns (RELATION TO TIME): CONTINUANT (INDEPENDENT, DEPENDENT), OCCURRENT. Rows: GRANULARITY.
ORGAN AND ORGANISM
  independent continuant: Organism (NCBI Taxonomy); Anatomical Entity (FMA, CARO)
  dependent continuant: Organ Function (FMP, CPRO); Phenotypic Quality (PaTO)
  occurrent: Biological Process (GO)
CELL AND CELLULAR COMPONENT
  independent continuant: Cell (CL); Cellular Component (FMA, GO)
  dependent continuant: Cellular Function (GO)
MOLECULE
  independent continuant: Molecule (ChEBI, SO, RnaO, PrO)
  dependent continuant: Molecular Function (GO)
  occurrent: Molecular Process (GO)
105
OBO Foundry provides
• tested guidelines enabling new groups to develop the ontologies they need in ways which counteract forking and dispersion of effort
• an incremental bottom-up approach to evidence-based terminology practices in medicine that is rooted in basic biology
• automatic web-based linkage between medical terminologies and biological knowledge resources
• traffic laws and traffic police
106
the strategy
establish common rules governing best practices for creating ontologies in coordinated fashion, with an evidence-based pathway to incremental improvement
107
The methodology of cross-products
compound terms in ontologies to be defined as cross-products of simpler terms: e.g. elevated blood glucose is a cross-product of PATO: increased concentration with FMA: blood and ChEBI: glucose.
= factoring out of ontologies into discipline-specific modules (orthogonality)
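A cross-product definition can be sketched as a compound term assembled from terms owned by different ontologies. The three component terms come from the example above; the class names, the field names (`quality`, `location`, `substance`) and the wording produced by `definition` are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Term:
    ontology: str
    label: str

    def __str__(self):
        return f"{self.ontology}: {self.label}"


@dataclass(frozen=True)
class CrossProduct:
    """A compound term defined from simpler terms (field names are hypothetical)."""
    label: str
    quality: Term
    location: Term
    substance: Term

    def ontologies(self):
        """The discipline-specific modules this compound term draws on."""
        return {self.quality.ontology, self.location.ontology, self.substance.ontology}

    def definition(self):
        return f"{self.quality} of {self.substance} in {self.location}"


elevated_blood_glucose = CrossProduct(
    label="elevated blood glucose",
    quality=Term("PATO", "increased concentration"),
    location=Term("FMA", "blood"),
    substance=Term("ChEBI", "glucose"),
)
print(elevated_blood_glucose.definition())
```

The compound term introduces no new primitive: each component stays in the ontology of the discipline responsible for it.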
108
The methodology of cross-products
enforcing use of common relations in linking terms drawn from Foundry ontologies serves
• to ensure that the ontologies are maintained and revised in tandem
• logically defined relations serve to bind terms in different ontologies together to create a network
109
CRITERIA
openness
common formal language
collaborative development
evidence-based maintenance
identifiers
versioning
textual and formal definitions
110
Orthogonality = modularity
• one ontology for each domain
• no need for mappings (which are in any case too expensive, too fragile, too difficult to keep up-to-date as mapped ontologies change)
• everyone knows where to look to find out how to annotate each kind of data
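Orthogonality can be checked mechanically: no term should be claimed by more than one ontology. The sketch below is a simplified illustration, not a Foundry tool; the term sets are small samples taken from the slides, and `MyCellOnt` and the `overlaps` helper are hypothetical.

```python
from collections import defaultdict

# Small sample of ontologies and the terms each one covers.
ONTOLOGIES = {
    "GO": {"biological process", "molecular function", "cellular component"},
    "CL": {"cell"},
    "PATO": {"increased concentration"},
}


def overlaps(ontologies):
    """Return each term claimed by more than one ontology, with its claimants."""
    owners = defaultdict(set)
    for name, terms in ontologies.items():
        for term in terms:
            owners[term].add(name)
    return {term: sorted(names) for term, names in owners.items() if len(names) > 1}


print(overlaps(ONTOLOGIES))

# A hypothetical forked ontology that redefines 'cell' breaks orthogonality.
forked = {**ONTOLOGIES, "MyCellOnt": {"cell"}}
print(overlaps(forked))
```

With an empty overlap report, annotators always know which ontology to consult for a given kind of data, and no mappings between ontologies are needed.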
111
Ontologies and research groups using BFO and RO
– OBO Foundry (60 biomedical ontologies, including GO, OBI, Protein Ontology, Cell Ontology, IDO …)
– National Cancer Institute (BiomedGT)
– NIF (NIH Neuroscience Information Framework)
– Cleveland Clinic Semantic Database
– Siemens
– AstraZeneca
– EU (ACGT Cancer Ontology, RAPS, …)
112
Because the ontologies in the Foundry
are built as orthogonal modules which form an incrementally evolving network
• scientists are motivated to commit to developing ontologies because they will need in their own work ontologies that fit into this network
• users are motivated by the assurance that the ontologies they turn to are maintained by experts
113
More benefits of orthogonality
• helps those new to ontology to find what they need
• to find models of good practice
• ensures mutual consistency of ontologies (trivially)
• and thereby ensures additivity of annotations
114
More benefits of orthogonality
• it rules out the sorts of simplification and partiality which may be acceptable under more pluralistic regimes
• thereby brings an obligation on the part of ontology developers to commit to scientific accuracy and domain-completeness
115
More criteria of a successful standard
1. intelligibility to users (consistent use of terms like ‘term’, ‘class’, ‘entity’, ‘object’ …)
2. track record of lessons learned (GO has 10 years of hard user testing)
3. lots of existing users (ontologies are like telephone networks)
116
The ontology uses relations which are unambiguously defined following the pattern of definitions laid down in the Basic Formal Ontology (BFO) including the Relation Ontology (RO)
http://ifomis.org/bfo
http://www.obofoundry.org/ro/
COMMON ARCHITECTURE
117
OBO Foundry Modular Organization
top level: Basic Formal Ontology (BFO)
mid-level: Information Artifact Ontology (IAO); Ontology for Biomedical Investigations (OBI); Spatial Ontology (BSPO)
domain level: Anatomy Ontology (FMA*, CARO); Environment Ontology (EnvO); Infectious Disease Ontology (IDO*); Biological Process Ontology (GO*); Cell Ontology (CL); Cellular Component Ontology (FMA*, GO*); Phenotypic Quality Ontology (PaTO); Subcellular Anatomy Ontology (SAO); Sequence Ontology (SO*); Molecular Function (GO*); Protein Ontology (PRO*)
118
BFO:continuant
BFO:occurrent
Example: The Cell Ontology
[Figure: the OBO Foundry Modular Organization diagram (top level, mid-level, domain level), repeated]
122