GO terms implicitly refer to other terms
• cysteine biosynthesis
• myoblast fusion
• hydrogen ion transporter activity
• snoRNA catabolism
• wing disc pattern formation
• epidermal cell differentiation
• regulation of flower development
• interleukin-18 receptor complex
• B-cell differentiation
• dorsal ectoderm
biosynthesis is_a metabolism
cysteine is_a serine family amino acid is_a amino acid is_a amine
Composed terms currently cause problems
– No link to external ontology term
– Redundancy
– Inconsistency
– Extra work
– Annotation bottleneck
– Tangled DAGs and confusing displays
• we have no way to disentangle
• Solution so far:
– fix errors based on results of term name parsing (Obol)
• reactive, not proactive
Solution: actively manage composed terms
• Explicit pre-coordination
– Composed terms should now/soon be coordinated using oboedit plugin
• building block terms are recorded in ontology along with composite term
• Benefits:
– Correct DAG structure can be inferred from external ontologies
• e.g. make sure GO + CHEBI “align”
– placement & consistency checking automated
– additional work can be automated
• synonyms, text definitions
How will terms be pre-coordinated by oboedit?
• How do we record a definition for a composite term?
– using a logical definition (computational essence)
• A logical definition consists of:
– a generic term (aka genus)
– relationships to other terms which serve to discriminate this specific term from other is_a children of the generic term (aka differentiae)
• Can be written in natural language as:
– A <generic term> which <discriminating characteristics>
Example of pre-coordination
• cysteine biosynthesis
• generic term:
– biosynthesis
• discriminating characteristics:
– outputs cysteine
– natural language (Aristotelian style):
• a biosynthesis process which outputs cysteine
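The genus-plus-differentiae structure above can be sketched as a small data type. This is a minimal illustration, not oboedit's internal model: the class names `Differentia` and `LogicalDefinition` and the method `to_text` are hypothetical, and labels are used in place of IDs for readability.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Differentia:
    """One discriminating characteristic: a relation plus the term it points to."""
    relation: str  # e.g. "outputs"
    filler: str    # e.g. "cysteine"


@dataclass(frozen=True)
class LogicalDefinition:
    """A generic term (genus) narrowed by one or more differentiae."""
    genus: str
    differentiae: Tuple[Differentia, ...]

    def to_text(self) -> str:
        """Render in the Aristotelian pattern 'a <genus> which <differentiae>'."""
        clauses = " and ".join(f"{d.relation} {d.filler}" for d in self.differentiae)
        return f"a {self.genus} which {clauses}"


cysteine_biosynthesis = LogicalDefinition(
    genus="biosynthesis",
    differentiae=(Differentia("outputs", "cysteine"),),
)
print(cysteine_biosynthesis.to_text())
```

Because the rendering is mechanical, the same structure could in principle seed text definitions and synonyms, which is the kind of automation listed among the benefits.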
Example in Obo format
[Term]
id: GO:0019344
name: cysteine biosynthesis
intersection_of: GO:0009058 ! biosynthesis
intersection_of: outputs CHEBI:15356 ! cysteine
is_a: GO:0009070 ! serine family amino acid biosynthesis
is_a: GO:0006534 ! cysteine metabolism
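To show how a tool might read the logical definition back out of such a stanza, here is a rough Python sketch. It assumes the convention used in this example: an `intersection_of` tag with one ID names the genus, and one with a relation followed by an ID names a differentia. The function name `parse_logical_definition` is hypothetical, and this is not a full OBO parser.

```python
STANZA = """\
[Term]
id: GO:0019344
name: cysteine biosynthesis
intersection_of: GO:0009058 ! biosynthesis
intersection_of: outputs CHEBI:15356 ! cysteine
is_a: GO:0009070 ! serine family amino acid biosynthesis
is_a: GO:0006534 ! cysteine metabolism
"""


def parse_logical_definition(stanza):
    """Return (genus_id, [(relation, filler_id), ...]) from intersection_of tags."""
    genus = None
    differentiae = []
    for line in stanza.splitlines():
        line = line.split("!", 1)[0].strip()  # drop the trailing '! label' comment
        if not line.startswith("intersection_of:"):
            continue  # is_a, name, id etc. are not part of the logical definition
        parts = line[len("intersection_of:"):].split()
        if len(parts) == 1:
            genus = parts[0]
        elif len(parts) == 2:
            differentiae.append((parts[0], parts[1]))
    return genus, differentiae


genus, differentiae = parse_logical_definition(STANZA)
print(genus, differentiae)
```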
Alternate syntax
• used in pheno-syntax
• more compact
• similar to OWL abstract syntax
• I use Obo1.2 format or natural language in the rest of this presentation
GO:cysteine_biosynthesis == GO:biosynthesis ∏ outputs(CHEBI:cysteine)
This allows us to dynamically untangle
• Process axis view (primary is_as, via generic term):
– biological_process
• metabolism
– biosynthesis
» cysteine biosynthesis
• Process participant axis view:
– amine
• amino acid
– serine family amino acid
» cysteine
• Combined view
– (same as current tangled diamond lattice)
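The two single-axis views can be derived from the logical definition by following is_a links separately in each ontology. The sketch below assumes, for simplicity, that every term has a single is_a parent; the dictionaries and function names are illustrative only and use labels rather than IDs.

```python
# Toy is_a links, child -> parent, taken from the views listed above.
PROCESS_IS_A = {
    "biosynthesis": "metabolism",
    "metabolism": "biological_process",
}
CHEMICAL_IS_A = {
    "cysteine": "serine family amino acid",
    "serine family amino acid": "amino acid",
    "amino acid": "amine",
}


def path_from_root(term, is_a):
    """Follow single-parent is_a links upward and return the path root-first."""
    path = [term]
    while path[-1] in is_a:
        path.append(is_a[path[-1]])
    return list(reversed(path))


def axis_views(name, genus, filler):
    """Return (process axis, participant axis) for a composite term."""
    process_axis = path_from_root(genus, PROCESS_IS_A) + [name]
    participant_axis = path_from_root(filler, CHEMICAL_IS_A)
    return process_axis, participant_axis


process_axis, participant_axis = axis_views(
    "cysteine biosynthesis", genus="biosynthesis", filler="cysteine"
)
print(" > ".join(process_axis))
print(" > ".join(participant_axis))
```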
Recording the relationship is important
• Why not just a simple cross-product?
– e.g. biosynthesis x cysteine
• Relationships are important for reasoning and querying
– Consider:
• cysteine biosynthesis from serine
• mRNA export from nucleus during heat stress
• Without the relations, the logical definition is not specific enough
– the essence is not captured
• Relations should come from RO
– more required
Multiple discriminating characteristics are allowed
• Cysteine biosynthesis from serine
– Generic term:
• biosynthesis
– Discriminating characteristics:
• output cysteine
• input serine
[Term]
name: cysteine biosynthesis from serine
intersection_of: GO:0009058 ! biosynthesis
intersection_of: outputs CHEBI:15356 ! cysteine
intersection_of: input CHEBI:17822 ! serine
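One consequence of recording several differentiae is that placement can be checked mechanically: a term whose definition adds characteristics to another term's definition must sit beneath it. The sketch below is a deliberately simplified check, assuming identical genus IDs and exact matching of differentiae. A real reasoner would also follow is_a links on the genus and on the fillers. The function name `is_subsumed_by` is hypothetical.

```python
def is_subsumed_by(child, parent):
    """True if child has the same genus and at least all of parent's differentiae."""
    child_genus, child_diffs = child
    parent_genus, parent_diffs = parent
    return child_genus == parent_genus and set(parent_diffs) <= set(child_diffs)


cysteine_biosynthesis = (
    "GO:0009058",
    {("outputs", "CHEBI:15356")},
)
cysteine_biosynthesis_from_serine = (
    "GO:0009058",
    {("outputs", "CHEBI:15356"), ("input", "CHEBI:17822")},
)

# The more specific term is inferred to be a child of the more general one.
print(is_subsumed_by(cysteine_biosynthesis_from_serine, cysteine_biosynthesis))
print(is_subsumed_by(cysteine_biosynthesis, cysteine_biosynthesis_from_serine))
```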
Composite terms can be nested
[Term]
id: GO:xxxxxxx
name: regulation of cysteine biosynthesis
intersection_of: GO:0050789 ! regulation of biological process
intersection_of: regulates GO:0019344 ! cysteine biosynthesis

[Term]
id: GO:0019344
name: cysteine biosynthesis
intersection_of: GO:0009058 ! biosynthesis
intersection_of: outputs CHEBI:15356 ! cysteine
YES: regulation^regulates(biosynthesis^outputs(cysteine))
NO: regulation^regulates(biosynthesis)^outputs(cysteine)
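The difference between the two bracketings is easier to see when the expression is held as a nested structure rather than a string. Below is a small sketch that renders nested (genus, differentiae) tuples in the compact `^` syntax. The tuple layout and the `render` function are assumptions made for illustration.

```python
def render(expr):
    """Render a term label or a nested (genus, [(relation, filler), ...]) tuple."""
    if isinstance(expr, str):
        return expr
    genus, differentiae = expr
    parts = [render(genus)]
    for relation, filler in differentiae:
        # The filler may itself be a composite expression, hence the recursion.
        parts.append(f"{relation}({render(filler)})")
    return "^".join(parts)


cysteine_biosynthesis = ("biosynthesis", [("outputs", "cysteine")])

# Intended reading: the thing regulated is the whole composite term.
nested = ("regulation", [("regulates", cysteine_biosynthesis)])

# Unintended reading: 'outputs cysteine' is attached to the regulation itself.
flat = ("regulation", [("regulates", "biosynthesis"), ("outputs", "cysteine")])

print(render(nested))
print(render(flat))
```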
Composite terms can optionally be manufactured in bulk
• Generic term: {metabolism, biosynthesis}
• Differentia: has_output {serine, cysteine, …}
• With caution…
– Sparse vs dense matrices
– not all combinations are types
On the importance of necessary and sufficient conditions
• Why intersection_of?
• Why not just make normal links in the GO DAG?
– normal relationships are for necessary conditions only
– we want both necessary and sufficient conditions
• captures the essence of the term
Normal DAG links only capture necessary conditions, not essence
[Figure: macrophage activation is_a immune cell activation, with a part_of link to inflammatory response]
text def: A change in morphology and behavior of a macrophage resulting from exposure to a cytokine, chemokine, cellular ligand, pathogen, or soluble factor
Indistinguishable by DAG
[Figure: monocyte activation is_a immune cell activation, with a part_of link to inflammatory response]
text def: A change in morphology and behavior of a monocyte resulting from exposure to a cytokine, chemokine, cellular ligand, pathogen, or soluble factor
essence captured by genus-differentia
[Figure: macrophage activation is_a immune cell activation, with a part_of link to inflammatory response]
id: GO:macrophage_activation
intersection_of: GO:cell_activation
intersection_of: activates CL:macrophage
[Figure annotations: genus = cell activation (is_a); differentia = activates CL:macrophage]
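The contrast between the two representations can be made concrete with a toy comparison. The sketch assumes that both terms carry the same is_a and part_of links, as the figures suggest. `CL:monocyte` is an assumed identifier written by analogy with `CL:macrophage`, since only the macrophage definition is given here.

```python
# Plain DAG links (necessary conditions only): identical for both terms.
dag_links = {
    "macrophage activation": {
        ("is_a", "immune cell activation"),
        ("part_of", "inflammatory response"),
    },
    "monocyte activation": {
        ("is_a", "immune cell activation"),
        ("part_of", "inflammatory response"),
    },
}

# Logical definitions (genus + differentiae): these differ.
logical_defs = {
    "macrophage activation": ("GO:cell_activation", frozenset({("activates", "CL:macrophage")})),
    # CL:monocyte is an assumption, by analogy with the macrophage stanza.
    "monocyte activation": ("GO:cell_activation", frozenset({("activates", "CL:monocyte")})),
}

same_by_dag = dag_links["macrophage activation"] == dag_links["monocyte activation"]
same_by_definition = logical_defs["macrophage activation"] == logical_defs["monocyte activation"]
print("indistinguishable by DAG:", same_by_dag)
print("indistinguishable by logical definition:", same_by_definition)
```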
Current status of pre-coordinated terms
• SO already contains composite terms
– 46 pre-coordinated terms
– A silenced gene is a gene which has the quality of being silenced
• GO-BP/CL integration underway
– retrospectively pre-coordinated terms
• Obol page has pre-coordinated terms from automatic parsing
– http://www.fruitfly.org/~cjm/obol
Pre- vs post- coordinated
• Pre-coordination
– terms are in ontology with IDs and computable definitions
– increases complexity of ontology
– complexity can be managed by tools
• e.g. new oboedit features
• Post-coordination
– terms are combined in the database
– forces more complexity in database schema and database applications
Pre-coordination is useful in moderation
• Commonly used terms should be pre-coordinated
• e.g. cysteine biosynthesis; oocyte differentiation; pectoral fin
• Avoid taking to extremes
• cf. ICD-9
• Where do we draw the line?
– ontologies should be built around one or a few axes of classification
• term ‘explosion’ typically gets large when multiple axes are combined
– we can change our minds later
• pre- and post-coordination are commensurable
Commensurability
• Annotator annotates to
– nucleus^part_of(astrocyte)
• Anatomy editor creates new term
– uses oboedit cross-product plugin
– astrocyte_nucleus = nucleus^part_of(astrocyte)
• Annotation can be dynamically ‘promoted’ to new term in answer to queries
– various software techniques for achieving this
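One simple way promotion could work is a lookup from class expressions to named terms, consulted at query time. This is only one of the possible techniques and is not taken from any existing tool. The helper names `expression` and `promote` are hypothetical, and expressions are matched by exact equality rather than by reasoning.

```python
def expression(genus, *differentiae):
    """Build a hashable class expression: (genus, {(relation, filler), ...})."""
    return (genus, frozenset(differentiae))


def promote(annotated, precoordinated):
    """Return the named term defined by this expression, else the expression itself."""
    return precoordinated.get(annotated, annotated)


# Annotator uses a post-coordinated expression: nucleus^part_of(astrocyte).
annotation = expression("nucleus", ("part_of", "astrocyte"))

precoordinated = {}  # logical definition -> term name
before = promote(annotation, precoordinated)

# Anatomy editor later adds a pre-coordinated term with the same definition.
precoordinated[expression("nucleus", ("part_of", "astrocyte"))] = "astrocyte_nucleus"
after = promote(annotation, precoordinated)

print(before)
print(after)
```

The stored annotation is never rewritten. Only the answer to the query changes once the new term exists.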
Post-coordination in GO annotations
• Pre- and post-coordination are compatible and commensurable
• We should extend the annotation format to allow denoting more specific classes
– e.g.
• cholesterol transport in liver
– advanced applications can query this
– standard applications suffer no loss
– extended annotations can be used to help seed new terms in the ontology
• This is already being done (MGI, Dicty)
– we just want to capture this in an interoperable way
Post-composition in gene association files
• New column in GA file format

Gene Product   Term ID                                …   Properties
AABC1          GO:0030301 (cholesterol transport)         located_in(MA:liver)
AABC2          GO:0048663 (neuron fate development)       has_participant(FBbt:Y_neuron)
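A consumer of the extended file would need to split each Properties value into a relation and a term ID. The sketch below handles only the single `relation(ID)` form shown in the two example rows. The function name `parse_property` is hypothetical, and how multiple properties would be separated is not specified here, so that case is not covered.

```python
import re

# relation name, then one term ID in parentheses, e.g. located_in(MA:liver)
PROPERTY_RE = re.compile(r"^\s*(\w+)\(\s*([^()\s]+)\s*\)\s*$")


def parse_property(cell):
    """Split a Properties cell such as 'located_in(MA:liver)' into (relation, term_id)."""
    match = PROPERTY_RE.match(cell)
    if match is None:
        raise ValueError(f"unrecognised property: {cell!r}")
    return match.group(1), match.group(2)


rows = [
    ("AABC1", "GO:0030301", "located_in(MA:liver)"),
    ("AABC2", "GO:0048663", "has_participant(FBbt:Y_neuron)"),
]
for gene_product, term_id, properties in rows:
    relation, filler = parse_property(properties)
    # Same compact syntax as used elsewhere: term^relation(filler)
    print(f"{gene_product}: {term_id}^{relation}({filler})")
```

Applications that ignore the new column still see an ordinary annotation to the Term ID, which matches the point that standard applications suffer no loss.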
Database issues
• Chado and GO DB can handle pre- and post-coordination
– in theory anyway
• not yet fully tested
• How does it work?
– ‘anonymous term’ created for coordinated term
– documentation in chado cvs
• chado/modules/cv/doc/