Ontologies in Biomedicine Mark A. Musen Stanford University
Dec 22, 2015
What Is An Ontology?
• The study of being
• A discipline co-opted by Computer Science to enable the “explicit specification of the conceptualization” of application domains:
  – Entities
  – Properties and attributes of entities
  – Constraints on properties and attributes
  – Individuals (often, but not always)
• A theory that provides
  – a common vocabulary
  – a shared understanding of the entities in an application area
Why Develop an Ontology?
• To share common understanding of the structure of descriptive information
  – among people
  – among software agents
  – between people and software
• To enable reuse of domain knowledge
  – to avoid “re-inventing the wheel”
  – to introduce standards to allow interoperability
Ontologies are just the beginning
[Diagram. Nodes: Ontologies, Software agents, Problem-solving methods, Annotated data, Databases, Knowledge bases. Link labels: “Declare structure”, “Provide domain descriptions”, “Enumerate domain terms”.]
Supreme genus: SUBSTANCE
Differentiae: material immaterial
Subordinate genera: BODY SPIRIT
Differentiae: animate inanimate
Subordinate genera: LIVING MINERAL
Differentiae: sensitive insensitive
Proximate genera: ANIMAL PLANT
Differentiae: rational irrational
Species: HUMAN BEAST
Individuals: Socrates Plato Aristotle …
Porphyry’s depiction of Aristotle’s Categories
Foundational Model of Anatomy
• Long-term project at University of Washington to create a comprehensive ontology of human anatomy
• 72K concepts, 1.9M relationships
• One of the largest and best developed ontologies in biomedicine
[Diagram nodes: Physical Anatomical Entity; Anatomical Spatial Entity; Anatomical Structure; Body Substance; Body Part; Organ System; Organism; The Body; Organ Part; Organ; Cell; Organ Subdivision; Organ Component; Tissue]
Top level of the Foundational Model of Anatomy
[Diagram: two linked hierarchies connected by Is-a and Part-of relations]
Parts of the heart: Heart; Cavity of Heart; Wall of Heart; Right Atrium; Cavity of Right Atrium; Wall of Right Atrium; Fossa Ovalis; Myocardium; Sinus Venarum; SA Node; Myocardium of Right Atrium
Classes of anatomical structures: Cardiac Chamber; Hollow Viscus; Internal Feature; Organ Cavity; Organ Cavity Subdivision; Anatomical Spatial Entity; Anatomical Feature; Body Space; Organ Component; Organ Subdivision; Viscus; Organ Part; Organ; Anatomical Structure
But we really want ontologies in electronic form
• Ontology contents can be processed and interpreted by computers
• Interactive tools can assist developers in ontology authoring
The FMA demonstrates that distinctions are not universal
• Blood is not a tissue, but rather a body substance (like saliva or sweat)
• The pericardium is not part of the heart, but rather an organ in and of itself
• Each joint, each tendon, each piece of fascia is a separate organ
These views are not shared by many anatomists!
Ontologies are cropping up everywhere!
• Indexing of online information for access by humans or search engines
• Product catalogs for e-commerce
• Reference terminologies for machine translation and data interchange
• Standard terms for describing experimental data
• Frameworks for structuring knowledge for decision support
The New Philosophers
• Categorizing “what exists” in machine-understandable form
• Providing a structure that enables
  – Developers to locate and update relevant descriptions
  – Computers to infer relationships and properties
• Creating new abstractions about the world to facilitate the creation of this structure
Lots of ontology builders are not very good philosophers
• Nearly always, ontologies are created to address pressing professional needs
• The people who have the most insight into professional knowledge may have little appreciation for metaphysics, principles of knowledge representation, or computational logic
• There simply aren’t enough good philosophers to go around
A case in point: The International Classification of Diseases
• An enumeration of diseases that forms the basis for all medical claims and reimbursements in the United States
• A “legacy” terminology that has its roots in 19th century epidemiology
• Created initially by biostatisticians with a pressing need to compare death statistics in different European countries
• A system that won’t go away—and yet we would never create anything like it again
A Small Portion of ICD9-CM
724 Unspecified disorders of the back
  724.0 Spinal stenosis, other than cervical
    724.00 Spinal stenosis, unspecified region
    724.01 Spinal stenosis, thoracic region
    724.02 Spinal stenosis, lumbar region
    724.09 Spinal stenosis, other
  724.1 Pain in thoracic spine
  724.2 Lumbago
  724.3 Sciatica
  724.4 Thoracic or lumbosacral neuritis
  724.5 Backache, unspecified
  724.6 Disorders of sacrum
  724.7 Disorders of coccyx
    724.70 Unspecified disorder of coccyx
    724.71 Hypermobility of coccyx
    724.79 Coccygodynia
  724.8 Other symptoms referable to back
  724.9 Other unspecified back disorders
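The only structure in this fragment is the one implied by the digits of each code: dropping the last digit gives the broader category. A minimal Python sketch of that implicit hierarchy follows. The function names are illustrative and are not part of any ICD tooling.

```python
from typing import List, Optional


def icd9_parent(code: str) -> Optional[str]:
    """Return the broader code implied by the digits, or None for a 3-digit category."""
    if "." not in code:
        return None                      # e.g. "724" has no parent in this fragment
    head, tail = code.split(".")
    if len(tail) == 1:
        return head                      # "724.0" -> "724"
    return f"{head}.{tail[:-1]}"         # "724.02" -> "724.0"


def icd9_ancestors(code: str) -> List[str]:
    """Walk upward from a code to its 3-digit category."""
    chain = []
    parent = icd9_parent(code)
    while parent is not None:
        chain.append(parent)
        parent = icd9_parent(parent)
    return chain


print(icd9_ancestors("724.02"))  # ['724.0', '724']
```

The sketch recovers nothing beyond the numbering itself. The codes carry no machine-readable definitions and no relation other than this digit-prefix nesting.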
ICD9 (1977): A Handful of Codes for Traffic Accidents
ICD10 (1999): 587 codes for such accidents
• V31.22 Occupant of three-wheeled motor vehicle injured in collision with pedal cycle, person on outside of vehicle, nontraffic accident, while working for income
• W65.40 Drowning and submersion while in bath-tub, street and highway, while engaged in sports activity
• X35.44 Victim of volcanic eruption, street and highway, while resting, sleeping, eating or engaging in other vital activities
ICD is used for lots of (too many?) things!
• ICD is used to code all patient encounters with the health-care system for purposes of
  – Billing and reimbursement
  – Institutional planning
  – Disease surveillance and public health
  – Quality assurance
  – Economic modeling by third-party payors
• ICD was never intended to make the distinctions relevant to all these tasks!
• When patient encounters are encoded with ICD, it is impossible to keep all these uses in mind
If real ontologists could build the ICD from scratch …
• Diseases would be organized with well-defined relationships
• Diseases would be associated with computer-understandable definitions
• There would be well-defined rules to enable aggregation of primitive concepts into complex descriptions—and for ensuring that those descriptions were sensible
• There would be well-defined mechanisms for creating use-specific views of the ICD
The components of ontologies
• Classes: The primary entities in the world being modeled (e.g., “organ”)
• Attributes: The properties of classes (e.g., “shape”, “location”)
• Relations: Statements regarding how one class may relate to others (e.g., “the heart” is-a “organ”)
• Axioms: More complex logical statements (e.g., “only paired organs can be left-sided or right-sided”)
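As a rough illustration, the four components can be written down as a small Python data structure. This is a sketch under assumptions: the attribute names `paired` and `laterality`, and the deliberately wrong class "Left spleen", are invented for the example and do not come from the FMA.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class OntologyClass:
    name: str
    attributes: Dict[str, object] = field(default_factory=dict)   # e.g. {"shape": "conical"}
    relations: List[Tuple[str, str]] = field(default_factory=list)  # (relation, target class)


def violates_laterality_axiom(cls: OntologyClass) -> bool:
    """Axiom from the slide: only paired organs can be left-sided or right-sided."""
    has_side = cls.attributes.get("laterality") in ("left", "right")
    is_paired = bool(cls.attributes.get("paired", False))
    return has_side and not is_paired


heart = OntologyClass("Heart", {"paired": False}, [("is-a", "Organ")])
left_kidney = OntologyClass(
    "Left kidney", {"paired": True, "laterality": "left"}, [("is-a", "Kidney")]
)
# Hypothetical modeling error, included only to trigger the axiom.
left_spleen = OntologyClass("Left spleen", {"paired": False, "laterality": "left"})

for c in (heart, left_kidney, left_spleen):
    print(c.name, "violates axiom:", violates_laterality_axiom(c))
```

Classes, attributes, and relations are plain data here, while the axiom is a check that runs over that data.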
Classes and attributes in the FMA
Attributes of a class (e.g., “Esophagus”)
“is-a” is a special relation
If a sub-class is-a member of a super-class, then
  – every instance of the sub-class is also an instance of the super-class (e.g., every member of the set aorta is necessarily a member of the set artery)
  – Values of attributes of the super-class are inherited by every instance of the sub-class (e.g., if arteries have cylindrical shape, then aorta has cylindrical shape)
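The two consequences of is-a can be sketched in a few lines of Python. The `Frame` class below is a toy written for this example, not the API of any real frame system, and it assumes each class has at most one parent.

```python
class Frame:
    """A toy class frame: a name, an optional is-a parent, and local attribute values."""

    def __init__(self, name, parent=None, **attributes):
        self.name = name
        self.parent = parent
        self.attributes = attributes

    def is_a(self, other):
        """True if `other` is this frame or one of its ancestors along is-a."""
        frame = self
        while frame is not None:
            if frame is other:
                return True
            frame = frame.parent
        return False

    def get(self, attribute):
        """Look up a value locally, then inherit it from the nearest ancestor."""
        frame = self
        while frame is not None:
            if attribute in frame.attributes:
                return frame.attributes[attribute]
            frame = frame.parent
        raise KeyError(attribute)


artery = Frame("Artery", shape="cylindrical")
aorta = Frame("Aorta", parent=artery)

print(aorta.is_a(artery))   # True: every aorta is an artery
print(aorta.get("shape"))   # cylindrical, inherited from Artery
```

The aorta frame stores no shape of its own. The value is found by walking up the is-a link, which is what inheritance means here.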
“Frame-based” knowledge-representation systems
• Allow developers to encode
  – Taxonomic hierarchies of classes
  – Other relations among classes (e.g., “part-of”) in addition to the is-a hierarchy
  – Attributes of classes that take on particular values to define instances of the classes
• Support inheritance of attributes and values along taxonomic relations
Distinctions about ontologies
• “Light” versus “heavy”: Is the ontology a simple taxonomy or does the ontology include additional detail regarding the nature of classes?
• “Upper-level” versus “domain-oriented”: Does the ontology try to describe general, abstract concepts or concepts tied to a particular application area?
Suggested Upper Merged Ontology (SUMO)
Part of the CYC Upper Ontology
The story so far …
• Ontologies define the entities—and relationships among entities—in some application area
• The authors’ point of view determines which distinctions are appropriate in a particular ontology
• Ontologies often use frame-based representations (including classes, attributes, relationships, and axioms) to encode knowledge
• People are building ontologies for nearly every niche of biomedicine
The pressing need to standardize the names of human genes
But the human genome is only part of the problem …
• Scientists maintain huge databases of gene sequences and gene expression for a wide range of “model organisms” (e.g., mouse, rat, yeast, fruit fly, round worm, slime mold)
• Database entries are annotated with entries such as the name of a gene, the function of the gene, and so on
• How do you ensure uniformity in the nature of these annotations?
Gene Ontology Consortium
• Founded in 1998 as a collaboration among scientists responsible for developing different databases of genomic data for model organisms (fruit fly, yeast, mouse)
• Now, essentially all developers of all model-organism databases participate
• Goal: To produce a dynamic, controlled vocabulary that can be applied to all organism databases even as knowledge of gene and protein roles in cells is accumulating and changing
Gene Ontology (GO)
• Comprises three independent “ontologies”
  – molecular function of gene products
  – cellular component of gene products
  – biological process representing the gene product’s higher-order role
• Uses these terms as attributes of gene products in the collaborating databases (gene product associations)
• Allows queries across databases using GO terms, providing linkage of biological information across species
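A cross-database query of this kind might look like the Python sketch below. Everything in it is a placeholder invented for illustration: the term identifiers, the database names, and the gene products are not real GO terms or real annotations.

```python
from collections import defaultdict

# Hypothetical term hierarchy (child -> parent terms); identifiers are placeholders.
PARENTS = {
    "T:kinase_activity": ["T:catalytic_activity"],
    "T:catalytic_activity": [],
}

# Hypothetical gene product associations: (database, gene product, annotated term).
ANNOTATIONS = [
    ("fly_db", "fly_gene_1", "T:kinase_activity"),
    ("yeast_db", "yeast_gene_7", "T:kinase_activity"),
    ("mouse_db", "mouse_gene_3", "T:catalytic_activity"),
]


def ancestors_or_self(term):
    """Collect a term together with every term reachable through parent links."""
    seen, stack = set(), [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(PARENTS.get(current, []))
    return seen


def query(term):
    """Gene products, grouped by database, annotated with `term` or a more specific term."""
    hits = defaultdict(list)
    for database, product, annotated in ANNOTATIONS:
        if term in ancestors_or_self(annotated):
            hits[database].append(product)
    return dict(hits)


print(query("T:kinase_activity"))
print(query("T:catalytic_activity"))
```

Because all three databases annotate with the same vocabulary, one query on the broader term returns products from every species, including those annotated with the more specific term.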
GO has been wildly successful!!
• Dozens of biologists around the world contribute to GO on a regular basis
• The ontology is updated every 30 minutes!
• It’s now impossible to work in most areas of computational biology without making use of GO terms
But GO has had real problems …
• Ontologies initially were represented in an idiosyncratic format that was not compatible with standard knowledge-representation systems (DAG-Edit)
• The format was based on directed acyclic graphs of concepts, without the general ability to specify machine interpretable properties of entities or definitions of entities
• Because of the informal knowledge-representation system, lots of errors crept into GO
  – Terms that were duplicated in different places
  – Terms with no superclasses
  – Uncertain relationships between terms
• The GO consortium is working hard to rectify these problems by means of a new representation (OBO-Edit) and enhanced quality control
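Two of the errors listed above, duplicated terms and terms with no superclasses, are the kind that a simple automated audit can catch. The Python sketch below assumes a toy representation of terms as (identifier, label, parent identifiers). It is not the DAG-Edit or OBO-Edit file format, and the terms are invented.

```python
from collections import Counter


def audit(terms, roots):
    """Report duplicated labels and non-root terms that have no superclass.

    terms: list of (term_id, label, parent_ids)
    roots: set of term_ids that are allowed to have no parent
    """
    label_counts = Counter(label for _, label, _ in terms)
    duplicates = sorted(label for label, n in label_counts.items() if n > 1)
    orphans = sorted(
        term_id for term_id, _, parents in terms
        if not parents and term_id not in roots
    )
    return duplicates, orphans


# Invented toy terms for illustration only.
TERMS = [
    ("T1", "binding", []),
    ("T2", "kinase activity", ["T1"]),
    ("T3", "kinase activity", ["T1"]),   # same label entered twice
    ("T4", "transport", []),             # no superclass and not a declared root
]

print(audit(TERMS, roots={"T1"}))  # (['kinase activity'], ['T4'])
```

Checks like these need nothing more than the graph itself. The third problem, uncertain relationships between terms, cannot be found this way because it depends on what the relations are supposed to mean.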
Creating ontologies has become a widespread cottage industry
• Professional Societies
  – HL7: Reference Information Model
  – MGED: Microarray Gene Expression Data Society Ontology
  – HUPO: Human Proteome Organization Ontology
• Government
  – NCI Thesaurus
  – NIST: Process Specification Language
• Open Biological Ontologies
  – GO
  – Three dozen (and growing) other ontologies
  – Mostly in DAG-Edit, some in Protégé format
A Portion of the OBO Library
HL-7 Reference Information Model (RIM)
HL7 RIM
• Provides a uniform framework for specification of information required by health-care information systems
• Based on six top-level, very general classes: Act, Entity, Role, Participation, Act_relationship, and Role_link
• Designed to facilitate information exchange among distributed elements of clinical information systems
• Has the same limitations that all “upper level” ontologies share:
  – Abstract entities are hard to define
  – It’s hard to know what should be “in” and what should be “out”
Description Logic (DL)
• A subset of logic designed to focus on categories and their definitions in terms of existing relations
• More expressive than frame-based representations systems (as in FMA) but less expressive than first-order logic (as in CYC)
• Major inference tasks:
  – Subsumption: Is category C1 a subset of C2?
  – Classification: Does Object O belong to C?
Kinds of classes
• Defined
  – Have explicit necessary and sufficient properties (roles)
  – Often are specializations of primitive concepts
• Primitive
  – Have no sufficient properties
  – May have other, necessary properties
  – Correspond to natural kinds
A simple network of Generic Concepts
[Diagram nodes: THING, ANIMAL, PLANT, MINERAL, MAMMAL, FISH, FEMALE-ANIMAL, MALE-ANIMAL, HUMAN, HORSE, WOMAN, MAN]
Defined concepts are in yellow; primitive concepts are in green.
A classifier is a program that can use DL to conclude:
• All WOMEN are FEMALE ANIMALS
• A HORSE may not also be a PLANT
• HUMAN subsumes MAN and WOMAN
• A MAN may not also be a WOMAN
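Conclusions like these can be reproduced with a very small reasoner. The Python sketch below is a simplification, not a full description-logic classifier. The taxonomy, the definitions of WOMAN and MAN, and the disjointness declarations are all assumptions, since the links in the original figure are not recoverable.

```python
# Assumed is-a links among primitive concepts (child -> parents).
PRIMITIVE_PARENTS = {
    "THING": [],
    "ANIMAL": ["THING"],
    "PLANT": ["THING"],
    "MINERAL": ["THING"],
    "MAMMAL": ["ANIMAL"],
    "FISH": ["ANIMAL"],
    "FEMALE-ANIMAL": ["ANIMAL"],
    "MALE-ANIMAL": ["ANIMAL"],
    "HUMAN": ["MAMMAL"],
    "HORSE": ["MAMMAL"],
}

# Assumed definitions: a defined concept is exactly the conjunction of its parts.
DEFINED = {
    "WOMAN": ["HUMAN", "FEMALE-ANIMAL"],
    "MAN": ["HUMAN", "MALE-ANIMAL"],
}

# Assumed disjointness declarations between primitive concepts.
DISJOINT = [("ANIMAL", "PLANT"), ("FEMALE-ANIMAL", "MALE-ANIMAL")]


def expand(concept):
    """Reduce a concept to the set of primitive concepts it entails."""
    if concept in DEFINED:
        parts, result = DEFINED[concept], set()
    else:
        parts, result = PRIMITIVE_PARENTS[concept], {concept}
    for part in parts:
        result |= expand(part)
    return result


def subsumes(general, specific):
    """`general` subsumes `specific` if everything it requires is entailed by `specific`."""
    return expand(general) <= expand(specific)


def disjoint(a, b):
    """Two concepts cannot overlap if they entail a declared disjoint pair."""
    ea, eb = expand(a), expand(b)
    return any((p in ea and q in eb) or (q in ea and p in eb) for p, q in DISJOINT)


print(subsumes("FEMALE-ANIMAL", "WOMAN"))                 # True
print(disjoint("HORSE", "PLANT"))                         # True
print(subsumes("HUMAN", "MAN"), subsumes("HUMAN", "WOMAN"))  # True True
print(disjoint("MAN", "WOMAN"))                           # True
```

A primitive concept contributes its own name to the expansion because nothing is sufficient to recognize it. A defined concept contributes only its parts, so anything that entails those parts is recognized as an instance of it.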
The Primitive Concept MESSAGE
[Diagram: MESSAGE as a primitive concept beneath THING, with roles whose value restrictions (v/r) point to PERSON, TEXT, and DATE]
A MESSAGE is, among other things, a THING with at least one Sender, all of which are PERSONs, at least one Recipient, all of which are PERSONs, a Body, which is a TEXT, a SendDate, which is a DATE, and a ReceivedDate, which is a DATE.
Roles with (min, max) number restrictions: Sender(1,NIL), Recipient(1,NIL), Body(1,1), SendDate(1,1), ReceivedDate(1,1)
Defined concepts are derived from primitive concepts
[Diagram: the MESSAGE network extended with STARFLEET-MESSAGE, which restricts the value restriction (v/r) of Sender to STARFLEET-COMMANDER]
A STARFLEET-MESSAGE is a MESSAGE, all of whose Senders are STARFLEET-COMMANDERS.
Roles with (min, max) number restrictions: Sender(1,NIL), Recipient(1,NIL), Body(1,1), SendDate(1,1), ReceivedDate(1,1)
A DL Classifier
• Takes a new Concept and automatically determines all subsumption relations between it and all other Concepts in the network
• Adds new links when new subsumption relations are discovered
• Automates the placement of new Concepts in the taxonomy
Before Classifying the Concept X
[Diagram: the MESSAGE network with STARFLEET-MESSAGE and a new concept X attached directly to MESSAGE. X restricts Recipient to (1,1) and restricts the value restriction (v/r) of Sender to STARFLEET-COMMANDER.]
A MESSAGE with exactly one Recipient, and all of whose Senders are STARFLEET-COMMANDERs.
Roles with (min, max) number restrictions: Sender(1,NIL), Recipient(1,NIL), Body(1,1), SendDate(1,1), ReceivedDate(1,1)
After Classifying the Concept X
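[Diagram: the same network after classification. X now sits beneath STARFLEET-MESSAGE and only restricts Recipient to (1,1); the Sender restriction is inherited from STARFLEET-MESSAGE.]
Roles with (min, max) number restrictions: Sender(1,NIL), Recipient(1,NIL), Body(1,1), SendDate(1,1), ReceivedDate(1,1)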
X IS-A STARFLEET MESSAGE!
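The step from "before" to "after" can be sketched in Python as structural subsumption over role restrictions. This is a simplified illustration, not the algorithm of any particular DL system. The class and function names are invented, `None` stands in for NIL (no upper bound), and the link STARFLEET-COMMANDER is-a PERSON is an assumption.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Optional

# Assumed is-a links among role fillers (child -> parent).
FILLER_PARENT = {"STARFLEET-COMMANDER": "PERSON"}


def is_kind_of(child: Optional[str], ancestor: str) -> bool:
    while child is not None:
        if child == ancestor:
            return True
        child = FILLER_PARENT.get(child)
    return False


@dataclass(frozen=True)
class Role:
    lo: int                 # minimum number of fillers
    hi: Optional[int]       # maximum number of fillers; None means NIL (unbounded)
    filler: str             # value restriction (v/r)


@dataclass(frozen=True)
class Concept:
    primitives: FrozenSet[str]
    roles: Dict[str, Role]


def at_least_as_tight(spec: Role, gen: Role) -> bool:
    hi_ok = gen.hi is None or (spec.hi is not None and spec.hi <= gen.hi)
    return spec.lo >= gen.lo and hi_ok and is_kind_of(spec.filler, gen.filler)


def subsumes(general: Concept, specific: Concept) -> bool:
    """True if every requirement of `general` is met at least as tightly by `specific`."""
    if not general.primitives <= specific.primitives:
        return False
    return all(
        name in specific.roles and at_least_as_tight(specific.roles[name], role)
        for name, role in general.roles.items()
    )


def classify(new: Concept, known: Dict[str, Concept]):
    """Return the most specific known concepts that subsume `new`."""
    subsumers = [n for n, c in known.items() if subsumes(c, new)]
    return [
        n for n in subsumers
        if not any(
            m != n and subsumes(known[n], known[m]) and not subsumes(known[m], known[n])
            for m in subsumers
        )
    ]


MESSAGE = Concept(frozenset({"MESSAGE"}), {
    "Sender": Role(1, None, "PERSON"),
    "Recipient": Role(1, None, "PERSON"),
    "Body": Role(1, 1, "TEXT"),
    "SendDate": Role(1, 1, "DATE"),
    "ReceivedDate": Role(1, 1, "DATE"),
})
STARFLEET_MESSAGE = Concept(MESSAGE.primitives, {
    **MESSAGE.roles, "Sender": Role(1, None, "STARFLEET-COMMANDER"),
})
# X is described only in terms of MESSAGE; it is never told about STARFLEET-MESSAGE.
X = Concept(MESSAGE.primitives, {
    **MESSAGE.roles,
    "Sender": Role(1, None, "STARFLEET-COMMANDER"),
    "Recipient": Role(1, 1, "PERSON"),
})

KNOWN = {"MESSAGE": MESSAGE, "STARFLEET-MESSAGE": STARFLEET_MESSAGE}
print(classify(X, KNOWN))  # ['STARFLEET-MESSAGE']
```

X is found to be a STARFLEET-MESSAGE purely by comparing restrictions. MESSAGE also subsumes X, but it is dropped from the result because STARFLEET-MESSAGE is more specific.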
The Beauty of Classification for Ontologies
• The classifier takes care of where to place a new concept in the hierarchy
• All inheritance relationships are automatically propagated to the new concept
• Relationships among a new concept and other entities are automatically simplified by classifying the new concept as a specialization of existing concepts
Classification generates a new, inferred hierarchy
The Web Ontology Language (OWL)
• Comes in three flavors:
  – OWL Lite (frame-based)
  – OWL DL (description logic)
  – OWL Full (first-order logic and then some)
• Rapidly being adopted for use in biomedical ontologies, including:
  – NCI Thesaurus (cancer biology and oncology)
  – MGED Ontology (DNA micro-array experiments)
  – BioPAX (metabolic pathways)
• The new editor and representation system for OBO ontologies (OBO-Edit) uses a subset of OWL
DL and Ontologies
• There is not just one “description logic”; DLs come in different varieties with different expressivity
• DLs are of value primarily to ontology developers, to see the implications of modeling decisions
• DLs also can be used by end users, when reasoning about systems that ontologies model
A thousand flowers are blooming!
• Ontologies are being developed by interested groups from every sector of academia, industry, and government
• Many of these ontologies have proven to be extraordinarily useful to wide communities
• We finally have tools and representation languages that can enable us to create durable and maintainable ontologies with rich semantic content