The real world of ontologies and phenotype representation: perspectives from the Neuroscience Information Framework
Maryann Martone, Ph.D., University of California, San Diego
Jun 25, 2015
“Neural Choreography”
“A grand challenge in neuroscience is to elucidate brain function in relation to its multiple layers of organization that operate at different spatial and temporal scales. Central to this effort is tackling “neural choreography” -- the integrated functioning of neurons into brain circuits-- Neural choreography cannot be understood via a purely reductionist approach. Rather, it entails the convergent use of analytical and synthetic tools to gather, analyze and mine information from each level of analysis, and capture the emergence of new layers of function (or dysfunction) as we move from studying genes and proteins, to cells, circuits, thought, and behavior....
However, the neuroscience community is not yet fully engaged in exploiting the rich array of data currently available, nor is it adequately poised to capitalize on the forthcoming data explosion.”
Akil et al., Science, Feb 11, 2011
“Data choreography”
In that same issue, Science asked its peer reviewers from the previous year about the availability and use of data:
• About half of those polled store their data only in their laboratories, which is not an ideal long-term solution.
• Many bemoaned the lack of common metadata and archives as a main impediment to using and storing data.
• Most of the respondents have no funding to support archiving.
• Even where accessible, much data in many fields is too poorly organized to be used efficiently.
“...it is a growing challenge to ensure that data produced during the course of reported research are appropriately described, standardized, archived, and available to all.” Lead Science editorial (Science 11 February 2011: Vol. 331 no. 6018 p. 649 )
NIF is an initiative of the NIH Blueprint consortium of institutes
• What types of resources (data, tools, materials, services) are available to the neuroscience community?
• How many are there?
• What domains do they cover? What domains do they not cover?
• Where are they?
  • Web sites
  • Databases
  • Literature
  • Supplementary material
  • PDF files
  • Desk drawers
• Who uses them?
• Who creates them?
• How can we find them?
• How can we make them better in the future?
http://neuinfo.org
In an ideal world...
We’d like to be able to find:
What is known:
• What is the average diameter of a Purkinje neuron?
• Is GRM1 expressed in cerebral cortex?
• What are the projections of hippocampus?
• What genes have been found to be upregulated in chronic drug abuse in adults?
• Is alpha synuclein in the striatum?
• What studies used my polyclonal antibody against GABA in humans?
• What rat strains have been used most extensively in research during the last 20 years?
What is not known:
• Connections among data
• Gaps in knowledge
Without some sort of framework, this is very difficult to do.
Required Components:
– Query interface
– Search strategies
– Data sources
– Infrastructure
– Results display
– Why did I get this result?
– Analysis tools
The Neuroscience Information Framework: Discovery and utilization of web-based
resources for neuroscience
A portal for finding and using neuroscience resources
A consistent framework for describing resources
Provides simultaneous search of multiple types of information, organized by category
Supported by an expansive ontology for neuroscience
Utilizes advanced technologies to search the “hidden web”
http://neuinfo.org
UCSD, Yale, Cal Tech, George Mason, Washington Univ
Supported by NIH Blueprint
Literature
Database Federation
Registry
We need more databases!?
•NIF Registry: A catalog of neuroscience-relevant resources
• > 5000 currently listed
• > 2000 databases
•And we are finding more every day
NIF must work with ecosystem as it is today
NIF was one of the first projects to attempt data integration in the neurosciences on a large scale
• NIF is supported by a contract that specified the number of resources to be added per year
  • Designed to be populated rapidly; set up process for progressive refinement
  • No budget was allocated to retrofit existing resources; had to work with them in their current state
  • We designed a system that required little to no cooperation or work from providers
• NIF was required to assemble (not create) ontologies very fast and to provide a platform through which the community could view, comment and add
• NIF is enriched by ontologies but does not depend on them
  • Took advantage of community ontologies
  • But needed to take a very pragmatic and aggressive approach to incorporating and using them
  • Neurolex semantic wiki
What are the connections of the hippocampus?
Hippocampus OR “Cornu Ammonis” OR “Ammon’s horn”
• Query expansion: synonyms and related concepts
• Boolean queries
• Data sources categorized by “data type” and level of nervous system
• Common views across multiple sources
• Tutorials for using full resource when getting there from NIF
• Link back to record in original source
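The query expansion step above can be sketched in a few lines of Python. This is only an illustration, not NIF's implementation: the `SYNONYMS` table and the `expand_query` helper are hypothetical, and a real system would draw synonyms from an ontology such as NIFSTD rather than a hard-coded dictionary.

```python
# Hypothetical synonym table (assumption); real synonyms would come from an ontology.
SYNONYMS = {
    "hippocampus": ["Cornu Ammonis", "Ammon's horn"],
}


def quote(term):
    """Wrap multi-word terms in double quotes so they are searched as phrases."""
    return f'"{term}"' if " " in term else term


def expand_query(term, synonyms=SYNONYMS):
    """Return the term OR-ed together with its known synonyms."""
    variants = [term] + synonyms.get(term.lower(), [])
    return " OR ".join(quote(v) for v in variants)


print(expand_query("Hippocampus"))
# prints: Hippocampus OR "Cornu Ammonis" OR "Ammon's horn"
```

A term with no entry in the table is passed through unchanged, so the expansion never narrows a search.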
Imminent: NIF 5.0
NIF 5.0 about to be released
• New design
• New query features
• New analytics
What do you mean by data?
Databases come in many shapes and sizes
• Primary data: data available for reanalysis, e.g., microarray data sets from GEO; brain images from XNAT; microscopic images (CCDB/CIL)
• Secondary data: data features extracted through data processing and sometimes normalization, e.g., brain structure volumes (IBVD); gene expression levels (Allen Brain Atlas); brain connectivity statements (BAMS)
• Tertiary data: claims and assertions about the meaning of data, e.g., gene upregulation/downregulation, brain activation as a function of task
• Registries: metadata; pointers to data sets or materials stored elsewhere
• Data aggregators: aggregate data of the same type from multiple sources, e.g., Cell Image Library, SUMSdb, Brede
• Single source: data acquired within a single context, e.g., Allen Brain Atlas
Researchers are producing a variety of information resources using a multitude of technologies
Exploration: Where is alpha synuclein?
• Spatially:
  • Gene
  • Protein
  • Subcellular
  • Cellular
  • Regional
  • Organism
• Semantically:
  • Gene regulation networks
  • Protein pathways
  • Cellular local connectivity
  • Regional connectivity
  • Who is studying it?
  • Who is funding its study?
Networks exist across scales; all important in the nervous system
• Set of modular ontologies
• 86,000+ distinct concepts + synonyms
• Bridge files between modules
• Expressed in OWL-DL language
• Currently supports OWL 2
• Tries to follow OBO community best practices
• Standardized to the same upper level ontologies, e.g., Basic Formal Ontology (BFO), OBO Relations Ontology (OBO-RO)
• Imports existing community ontologies, e.g., CHEBI, GO, PRO, DOID, OBI, etc.
• Retains identifiers in most recent additions but reflects history
• Covers major domains of neuroscience: Organisms, Brain Regions, Cells, Molecules, Subcellular parts, Diseases, Nervous system functions, Techniques
NIFSTD Ontologies
Fahim Imam, William Bug
“Search computing”: Query by concept
What genes are upregulated by drugs of abuse in the adult mouse? (show me the data!)
Morphine; Increased expression; Adult Mouse
Reasonable standards make it easy to search for and compare results
Diseases of nervous system
New: Data analytics
NIF is in a unique position to answer questions about the neuroscience ecosystem using new analytics tools
Chart labels: Neurodegenerative; Seizure disorders; Neoplastic disease of nervous system. Series: NIH Reporter; NIF data federated sources
Results are organized within a common framework
Connects to
Synapsed with
Synapsed by
Input region
innervates
Axon innervates
Projects to
Cellular contact
Subcellular contact
Source site
Target site
Each resource implements a different, though related model; systems are complex and difficult to learn, in many cases
NIF Concept Mapper
The scourge of neuroanatomical nomenclature: Importance of NIF semantic framework
• NIF Connectivity: 7 databases containing connectivity primary data or claims from literature on connectivity between brain regions
  • Brain Architecture Management System (rodent)
  • Temporal lobe.com (rodent)
  • Connectome Wiki (human)
  • Brain Maps (various)
  • CoCoMac (primate cortex)
  • UCLA Multimodal database (Human fMRI)
  • Avian Brain Connectivity Database (Bird)
• Total: 1800 unique brain terms (excluding Avian)
• Number of exact terms used in > 1 database: 42
• Number of synonym matches: 99
• Number of 1st order partonomy matches: 385
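A count like "exact terms used in > 1 database" can be obtained by normalizing each source's vocabulary and counting how many sources use each term. The sketch below is illustrative only: the three vocabularies are toy data, the source names are placeholders, and the normalization (lower-casing, collapsing whitespace) is an assumption about what counts as an exact match.

```python
from collections import Counter


def normalize(term):
    """Lower-case a term and collapse runs of whitespace."""
    return " ".join(term.lower().split())


def terms_shared_across_sources(vocabularies):
    """Return the normalized terms that appear in more than one source."""
    counts = Counter()
    for terms in vocabularies.values():
        # Use a set so a term repeated within one source is counted once.
        counts.update({normalize(t) for t in terms})
    return {term for term, n in counts.items() if n > 1}


# Toy vocabularies (assumption); not taken from the seven databases listed above.
vocabularies = {
    "source_a": ["Hippocampus", "Substantia nigra", "CA1"],
    "source_b": ["hippocampus", "Ammon's horn", "Substantia  nigra"],
    "source_c": ["Hippocampal formation", "Striatum"],
}

shared = terms_shared_across_sources(vocabularies)
print(sorted(shared))
```

Exact matching misses "Ammon's horn" and "Hippocampal formation" even though they are related to "Hippocampus", which is why synonym and partonomy matches are counted separately.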
Why so many names? The brain is perhaps unique among major organ
systems in the multiplicity of naming schemes for its major and minor regions.
The brain has been divided based on topology of major features, cyto- and myelo-architecture, developmental boundaries, supposed evolutionary origins, histochemistry, gene expression and functional criteria.
The gross anatomy of the brain reflects the underlying networks only superficially, and thus any parcellation reflects a somewhat arbitrary division based on one or more of these criteria.
The “activation map” images that commonly accompany brain imaging papers can be misleading to inexperienced readers, by seeming to suggest that the boundaries between “activated” and “unactivated” patches of cortex are unambiguous and sharp. Instead, as most researchers are aware, the apparent sharp boundaries are subject to the choice of threshold applied to the statistical tests that generate the image. What, then, justifies dividing the cortex into regions with boundaries based on this fuzzy, mutable measure of functional profile? (Saxe et al., 2010, p. 39)
Brainmaps.org
Program on Ontologies for Neural Structures
International Neuroinformatics Coordinating Facility (INCF)
• Structural Lexicon Task Force
  • Defining brain structures
  • Translate among terminologies
• Neuronal Registry Task Force
  • Consistent naming scheme for neurons
  • Knowledge base of neuron properties
• Representation and Deployment Task Force
  • Formal representation
• Also interacts with Digital Atlasing Task Force
http://incf.org
NeuroLex Wiki
http://neurolex.org Stephen Larson
• Provide a simple framework for defining the concepts required
  • Light weight semantics
  • Good teaching tool for learning about semantic integration and the benefits of a consistent semantic framework
• Community based:
  • Anyone can contribute their terms, concepts, things
  • Anyone can edit
  • Anyone can link
• Accessible: searched by Google
• Building an extensive cross-disciplinary knowledge base for neuroscience
Demo D03
Defining nervous system structures
Parcellation scheme: Set of parcels occupying part or all of an anatomical entity that has been delineated using a common approach or set of criteria, often in a single study. A parcellation scheme for any given individual entity may include gaps, transitional zones, or regions of uncertainty. A parcellation scheme derived from a set of individuals registered to a common target (atlas) may be probabilistic and include overlap of parcels in regions that reflect individual variability or imperfections in alignment.
• 14 parcellation schemes currently represented in Neurolex
• Documentation available
• INCF task force on ontologies
Basic model: do not conflate conceptual structures with parcels
Regional part of nervous system
Functional part of nervous system
Diagram labels: Parcel; Parcel; Parcel. Relation: overlaps (shown three times)
Neuroscientists have a lot of different parcellation schemes because they have a lot of different ways of classifying brain structures, and techniques to match them are imperfect
Linking semantics to space: INCF Atlasing
www.neurolex.org
Link to spatial representation
in scalable brain atlas
Waxholm space
Seth Ruffins, Alan Ruttenberg, Rembrandt Bakker
Neurons in Neurolex
International Neuroinformatics Coordinating Facility (INCF) building a knowledge base of neurons and their properties via the Neurolex Wiki
Led by Dr. Gordon Shepherd
Consistent and parseable naming scheme
Knowledge is readily accessible, editable and computable
While structure is imposed, don’t worry too much about the upper level classes of the ontology
Stephen Larson
A KNOWLEDGE BASE OF NEURONAL PROPERTIES
Additional semantics added in NIFSTD by ontology engineer
Concept-based search: search by meaning
Search Google: GABAergic neuron Search NIF: GABAergic neuron
NIF automatically searches for types of GABAergic neurons
Types of GABAergic neurons
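Searching "by meaning" amounts to expanding a concept to all of its subclasses before querying. The sketch below shows one way this could work; it is not NIF's code. The `IS_A` table is a toy hierarchy written for illustration (the class names are not copied from NIFSTD), and `expand_concept` is a hypothetical helper.

```python
# Toy is_a hierarchy, child -> parent (assumption; not the NIFSTD class tree).
IS_A = {
    "Cerebellum Purkinje cell": "GABAergic neuron",
    "Neostriatum medium spiny neuron": "GABAergic neuron",
    "GABAergic neuron": "Neuron",
    "Neocortex pyramidal neuron": "Glutamatergic neuron",
    "Glutamatergic neuron": "Neuron",
}


def descendants(concept, is_a=IS_A):
    """Collect every class that is a direct or indirect subclass of the concept."""
    found = set()
    frontier = [concept]
    while frontier:
        current = frontier.pop()
        for child, parent in is_a.items():
            if parent == current and child not in found:
                found.add(child)
                frontier.append(child)
    return found


def expand_concept(concept):
    """Return the concept followed by its subclasses in alphabetical order."""
    return [concept] + sorted(descendants(concept))


print(expand_concept("GABAergic neuron"))
```

A plain keyword search for "GABAergic neuron" would miss a record that only mentions Purkinje cells; the expanded term list catches it.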
Challenges of multiscale neurodegenerative disease phenotypes
• Neurodegenerative diseases target very specific cell populations
• Model systems only replicate a subset of features of the disease
• Related phenotypes occur across anatomical scales
• Different vocabularies are used by different communities
Example phenotype descriptions:
• Midbrain degenerated
• Substantia nigra decreased in volume
• Substantia nigra pars compacta atrophied
• Loss of SNpc dopaminergic neurons
• Degeneration of nigrostriatal terminals
• Tyrosine-hydroxylase containing neurons degenerate
Approach: Use ontologies to provide necessary knowledge for matching
related phenotypes
Sarah Maynard, Chris Mungall, Suzie Lewis, Fahim Imam
Midbrain
Substantia nigra
Substantia nigra pars compacta
Substantia nigra pars compacta dopamine cell
Dopamine
Neuron cell soma
Neuron (CL)
Part of neuron (GO)
Small molecule (Chebi)
Atrophied
Decreased volume
Fewer in number
Degenerate
Decreased in magnitude relative to
some normal
Has part
Has part
Is part of
Has part
Has part
Is a
Is a Is a
Is a
Entities
Qualities
NIFSTD/PKBOBO ontology
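The part-of links in the diagram above are what allow a phenotype about the midbrain to be related to one about dopamine cells of the substantia nigra pars compacta. A minimal sketch of that idea follows, assuming a single-parent part-of chain built from the entity labels on this slide; the `PART_OF` table and both helper functions are hypothetical and much simpler than a real reasoner.

```python
# Part-of chain using the entity labels from the slide (child -> containing structure).
PART_OF = {
    "Substantia nigra pars compacta dopamine cell": "Substantia nigra pars compacta",
    "Substantia nigra pars compacta": "Substantia nigra",
    "Substantia nigra": "Midbrain",
}


def containing_structures(entity):
    """Return the entity followed by every structure that contains it."""
    chain = [entity]
    while chain[-1] in PART_OF:
        chain.append(PART_OF[chain[-1]])
    return chain


def anatomically_related(a, b):
    """True if one entity is (transitively) part of the other, or they are the same."""
    return a in containing_structures(b) or b in containing_structures(a)


print(containing_structures("Substantia nigra pars compacta dopamine cell"))
print(anatomically_related("Midbrain", "Substantia nigra pars compacta dopamine cell"))
```

String comparison alone would never link "Midbrain degenerated" to "Loss of SNpc dopaminergic neurons"; walking the part-of chain does.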
Alzheimer’s disease
Human(birnlex_516)
Neocortex pyramidal
neuron
Increased number of
Lipofuscin
has part
inheres in
inheres in
towards
EQ Representation of Phenotypes in Neurodegenerative Disease: PATO and
NIFSTD
Instance: Human with Alzheimer’s disease 050
Phenotype birnlex_2087_56
inheres in
about
Chris Mungall, Suzanna Lewis
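An EQ (entity-quality) statement pairs the entity a phenotype inheres in with a quality, plus an optional "towards" entity for relational qualities. The sketch below encodes the example on this slide as a small data structure. The `EQPhenotype` class and its field names are hypothetical; they are not the PATO, NIFSTD or OBD data model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class EQPhenotype:
    """One entity-quality statement (illustrative structure, not an official schema)."""
    entity: str                    # what the quality inheres in
    quality: str                   # the quality term
    towards: Optional[str] = None  # related entity for relational qualities

    def describe(self):
        text = f"{self.entity}: {self.quality}"
        return f"{text} {self.towards}" if self.towards else text


# The example from the slide: increased number of lipofuscin in neocortex pyramidal neurons.
phenotype = EQPhenotype(
    entity="Neocortex pyramidal neuron",
    quality="increased number of",
    towards="Lipofuscin",
)
print(phenotype.describe())
```

Because each slot holds an ontology term rather than free text, two phenotypes can be compared slot by slot.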
Structured annotation model implemented in WIB
OBD: Ontology based database
• Provides a user interface for matching organisms based on similarity of phenotypes
• Based on EQ model
• Uses knowledge in the ontology to compute similarity scores and other statistical measures like information content
http://www.berkeleybop.org/pkb/
Chris Mungall, Suzanna Lewis, Lawrence Berkeley Labs
Thalamus
Cellular inclusion
Midline nuclear group
Lewy Body
Paracentral nucleus
Cellular inclusion
Computes common subsumers and information content among
phenotypes
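Information content is commonly defined as IC(t) = -log2 p(t), where p(t) is the fraction of annotations at or below term t, so rarer terms carry more information. The sketch below computes it over a toy hierarchy and picks the most informative common subsumer of two terms. It is not the OBD implementation: the `PARENT` table is simplified (is_a and part_of collapsed into one link), "Putamen" and "Brain region" are added only to give the toy hierarchy a root, and the annotation counts are invented for illustration.

```python
import math

# Toy hierarchy, child -> parent (assumption; is_a and part_of are not distinguished).
PARENT = {
    "Paracentral nucleus": "Thalamus",
    "Midline nuclear group": "Thalamus",
    "Thalamus": "Brain region",
    "Putamen": "Brain region",
}

# Hypothetical annotation counts per term; not real data.
ANNOTATION_COUNTS = {"Paracentral nucleus": 1, "Midline nuclear group": 1, "Putamen": 2}


def ancestors(term):
    """Return the term together with everything that subsumes it."""
    seen = {term}
    stack = [term]
    while stack:
        parent = PARENT.get(stack.pop())
        if parent is not None and parent not in seen:
            seen.add(parent)
            stack.append(parent)
    return seen


def information_content(term, counts=ANNOTATION_COUNTS):
    """IC(t) = -log2(fraction of annotations at or below t)."""
    total = sum(counts.values())
    covered = sum(n for t, n in counts.items() if term in ancestors(t))
    return math.inf if covered == 0 else -math.log2(covered / total)


def most_informative_common_subsumer(a, b):
    """Among the shared subsumers of a and b, return the one with the highest IC."""
    common = ancestors(a) & ancestors(b)
    return max(common, key=information_content) if common else None


print(most_informative_common_subsumer("Paracentral nucleus", "Midline nuclear group"))
```

With these toy counts the root term has an information content of zero, so a match at "Thalamus" scores higher than a match at "Brain region".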
*B6CBA-TgN (HDexon1)62) that express exon1 of the human mutant HD gene- Li et al., J Neurosci, 21(21):8473-8481
PhenoSim: What organism is most similar to a human with Huntington’s disease?
• Putamen atrophied
• Globus pallidus neuropil degenerate
• Part of basal ganglia decreased in magnitude
• Fewer neostriatum medium spiny neurons in putamen
• Neurons in striatum degenerate
• Neuron in striatum decreased in magnitude
• Increased number of astrocytes in caudate nucleus
• Neurons in striatum degenerate
• Nervous system cell change in number in striatum
Progressive enrichment
Understanding and comparing phenotypes will be enriched through community knowledge bases like Neurolex
Looking forward to continuing this as part of the Monarch project with Melissa Haendel, Chris Mungall and Suzie Lewis
Top Down vs Bottom up
Top-down ontology construction
• A select few authors have write privileges
• Maximizes consistency of terms with each other (automated consistency checking)
• Making changes requires approval and re-publishing
• Works best when domain to be organized has: small corpus, formal categories, stable entities, restricted entities, clear edges
• Works best with participants who are: expert catalogers, coordinated users, expert users, people with authoritative source of judgment
Bottom-up ontology construction
• Multiple participants can edit the ontology instantly (many eyes to correct errors)
• Semantics are limited to what is convenient for the domain
• Not a replacement for top-down construction; sometimes necessary to increase flexibility
• Necessary when domain has: large corpus, no formal categories, no clear edges
• Necessary when participants are: uncoordinated users, amateur users, naïve catalogers
• Neuroscience is a domain that is less formal and neuroscientists are more uncoordinated
NIFSTD
NEUROLEX
Important for Ontologists to define community contribution model
It’s a messy ecosystem (and that’s OK)
NIF favors a hybrid, tiered, federated system
• Domain knowledge: Ontologies
• Claims about results: Virtuoso RDF triples
• Data: Data federation; Workflows
• Narrative: Full text access
Neuron
Brain part
Disease
Organism
Gene
• Caudate projects to SNpc
• Grm1 is upregulated in chronic cocaine
• Betz cells degenerate in ALS
Musings from the NIF
• No one can be stopped from doing what they need to do
  • Every resource is resource limited: few have enough time, money, staff or expertise required to do everything they would like
  • If the market can support 11 MRI databases, fine
  • Some consolidation, coordination is warranted though
• Big, broad and messy beats small, narrow and neat
  • Without trying to integrate a lot of data, we will not know what needs to be done
  • A lot can be done with messy data; neatness helps though
  • Progressive refinement; addition of complexity through layers
• Be flexible and opportunistic
  • A single optimal technology/container for all types of scientific data and information does not exist; technology is changing
• Think globally; act locally
  • No source, not even NIF, is THE source; we are all a source
Grabbing the long tail of small data
Analysis of NIF shows multiple databases with similar scope and content
Many contain partially overlapping data
Data “flows” from one resource to the next
Data is reinterpreted, reanalyzed or added to
Is duplication good or bad?
Same data: different analysis
Chronic vs acute morphine in striatum
• Drug Related Gene database: extracted statements from figures, tables and supplementary data from published article
• Gemma: Reanalyzed microarray results from GEO using different algorithms
• Both provide results of increased or decreased expression as a function of experimental paradigm
  • 4 strains of mice
  • 3 conditions: chronic morphine, acute morphine, saline
Mined NIF for all references to GEO IDs: found small number where the same dataset was represented in two or more databases
http://www.chibi.ubc.ca/Gemma/home.html
How easy was it to compare?
• Gemma: Gene ID + Gene Symbol
• DRG: Gene name + Probe ID
• Gemma: Increased expression/decreased expression
• DRG: Increased expression/decreased expression
But... Gemma presented results relative to baseline chronic morphine; DRG with respect to saline, so direction of change is opposite in the 2 databases
Analysis:
• 1370 statements from Gemma regarding gene expression as a function of chronic morphine
• 617 were consistent with DRG; over half of the claims of the paper were not confirmed in this analysis
• Results for 1 gene were opposite in DRG and Gemma
• 45 did not have enough information provided in the paper to make a judgment
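Because the two databases report change against opposite baselines, a direction has to be re-expressed against a common reference before claims can be compared. The sketch below shows that normalization for a two-condition comparison (chronic morphine vs saline). It is an illustration rather than the procedure actually used: the function names and the tuple layout are hypothetical, and no gene-level data is reproduced. Only the counts 617 and 1370 come from the slide.

```python
FLIP = {"increased": "decreased", "decreased": "increased"}


def normalize_direction(direction, baseline, reference_baseline="saline"):
    """Re-express a direction of change relative to the reference baseline.

    Assumes a two-condition comparison, so switching baseline flips the direction.
    """
    return direction if baseline == reference_baseline else FLIP[direction]


def consistent(claim_a, claim_b):
    """Each claim is (direction, baseline); compare them after normalization."""
    return normalize_direction(*claim_a) == normalize_direction(*claim_b)


# The same underlying result as each database would report it (illustrative values).
gemma_claim = ("decreased", "chronic morphine")  # relative to chronic morphine
drg_claim = ("increased", "saline")              # relative to saline

print(consistent(gemma_claim, drg_claim))

# Counts from the slide: 617 of 1370 Gemma statements were consistent with DRG.
fraction_consistent = 617 / 1370
print(f"{fraction_consistent:.0%} consistent")
```

Without the normalization step, the two claims above would look contradictory even though they describe the same change.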
NIF annotation standard
Beware of False Dichotomies
Top-down vs bottom up
Light weight vs heavy weight
“Chaotic Nihilists and Semantic Idealists”
Text mining vs annotation
Curators vs scientists
Human vs machine
DOI’s vs URI’s
http://www.datanami.com/datanami/2013-02-05/chaotic_nihilists_and_semantic_idealists.html
NIF team (past and present)
Jeff Grethe, UCSD, Co Investigator, Interim PI
Amarnath Gupta, UCSD, Co Investigator
Anita Bandrowski, NIF Project Leader
Gordon Shepherd, Yale University
Perry Miller
Luis Marenco
Rixin Wang
David Van Essen, Washington University
Erin Reid
Paul Sternberg, Cal Tech
Arun Rangarajan
Hans Michael Muller
Yuling Li
Giorgio Ascoli, George Mason University
Sridevi Polavarum
Fahim Imam, NIF Ontology Engineer
Larry Lui
Andrea Arnaud Stagg
Jonathan Cachat
Jennifer Lawrence
Lee Hornbrook
Binh Ngo
Vadim Astakhov
Xufei Qian
Chris Condit
Mark Ellisman
Stephen Larson
Willie Wong
Tim Clark, Harvard University
Paolo Ciccarese
Karen Skinner, NIH, Program Officer