Maryann E. Martone, Ph.D., University of California, San Diego. INCF Neuroinformatics Short Course, Stockholm, August 2013
Jan 27, 2015
• Introduction
• Introduction to the Neuroscience Information Framework
• Structured information: data, databases
• Federating neuroscience-relevant databases
• Information frameworks
• Ontologies
• What can we do with information in the NIF?
• Conclusions
[Figure: the scholarly communication ecosystem. FORCE11.org: Future of research communications and e-scholarship. Participants: scholar, library, publisher, consumer, libraries, data repositories, code repositories, community databases/platforms, OA, curators, social networks, peer reviewers. Research objects: narrative, workflows, data, models, multimedia, nanopublications, code.]
http://neuinfo.org
• NIF’s mission is to maximize the awareness of, access to, and utility of research resources produced worldwide, to enable better science and promote efficient use
– NIF unites neuroscience information without respect to domain, funding agency, institute or community
– NIF is like a “PubMed” for all biomedical resources and a “PubMed Central” for databases
– Makes them searchable from a single interface
– Practical and cost-effective; tries to be sensible
– Learned a lot about effective data sharing
The Neuroscience Information Framework is an initiative of the NIH Blueprint consortium of institutes. http://neuinfo.org
We’d like to be able to find:
• What is known:
– What are the projections of hippocampus?
– Is GRM1 expressed in cerebral cortex?
– What genes have been found to be upregulated in chronic drug abuse in adults?
– What animal models have similar phenotypes to Parkinson’s disease?
– What studies used my polyclonal antibody against GABA in humans?
• What is not known:
– Connections among data
– Gaps in knowledge
A framework makes it easier to address these questions
Neuroscience is unlikely to be served by a few large databases, as the genomics and proteomics communities are
Whole brain data (20 µm microscopic MRI)
Mosaic LM images (1 GB+)
Conventional LM images
Individual cell morphologies
EM volumes & reconstructions
Solved molecular structures
No single technology serves these all equally well. Multiple data types; multiple scales; multiple databases
• Data warehouse: May contain data from diverse sources; schemas are integrated. Data are “cleaned” to fit unified data model. One database to rule them all...
• Data federation: a virtual database that stores data definitions and not the data itself. The virtual database will have information about the location of the data. When a single call is made to a virtual database, the technology ensures multiple calls to underlying databases and is also responsible for meaningfully aggregating the returned result sets.
From Wikipedia and http://www.infosysblogs.com/oracle/2010/01/data_federation_a_potent_subst_1.html
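The federation idea above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `FederatedView` class, the source names, and the toy records are all hypothetical, and nothing here reflects how NIF itself is implemented. The view holds only references to sources, fans one call out to each of them, and aggregates the returned result sets.

```python
from typing import Callable, Dict, Iterable, List

Record = Dict[str, str]
# A source is any callable that takes a query term and returns records.
Source = Callable[[str], Iterable[Record]]


def make_source(rows: List[Record]) -> Source:
    """Wrap an in-memory list as a stand-in for a remote database."""
    def search(term: str) -> Iterable[Record]:
        needle = term.lower()
        return [r for r in rows if needle in r["region"].lower()]
    return search


class FederatedView:
    """Hypothetical virtual database: stores where data live, not the data."""

    def __init__(self) -> None:
        self.sources: Dict[str, Source] = {}

    def register(self, name: str, source: Source) -> None:
        self.sources[name] = source

    def query(self, term: str) -> List[Record]:
        results: List[Record] = []
        # One call to the view becomes one call per underlying source.
        for name, source in self.sources.items():
            for record in source(term):
                # Tag each hit with its origin so a user can link back to it.
                results.append({**record, "source": name})
        return results


# Toy, made-up records standing in for two independent databases.
view = FederatedView()
view.register("images_db", make_source([
    {"region": "Hippocampus", "item": "image 12"},
]))
view.register("connectivity_db", make_source([
    {"region": "hippocampus", "item": "connectivity statement 7"},
    {"region": "thalamus", "item": "connectivity statement 8"},
]))
hits = view.query("hippocampus")
```

A warehouse, by contrast, would copy and clean the rows into one schema up front; here each source keeps its own rows and only the aggregation is shared.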
Subject 473
• Species: mouse (string) • Age: 50 days (integer)
• Age category: adult
• Protocol: 2
Relational Database
“Mice (aged 50 days) were perfused with 4% paraformaldehyde and brains were sectioned at a thickness of 50 µm. Sections were labeled using antibodies against calbindin and imaged on a Zeiss confocal microscope.”
Data model; data types, formal query language
Free text
Entity recognition; Natural language processing
∞
What is easily machine processable and accessible
What is potentially knowable
What is known: Literature, images, human knowledge
Unstructured; Natural language processing, entity recognition, image processing and analysis; paywalls; communication
Abstracts vs full text vs tables etc.
http://neuinfo.org June 10, 2013, dkCOIN Investigator's Retreat
• A portal for finding and using neuroscience resources
A consistent framework for describing resources
Provides simultaneous search of multiple types of information, organized by category
Supported by an expansive ontology for neuroscience
Utilizes advanced technologies to search the “hidden web”
UCSD, Yale, Cal Tech, George Mason, Washington Univ
Literature
Database Federation
Registry
With the thousands of databases and other information sources available, simple descriptive metadata will not suffice
• NIF curators • Nomination by the community • Semi-automated text mining pipelines
• NIF Registry • Requires no special skills • Site map available for local hosting
• NIF Data Federation • DISCO interop • Requires some programming skill • Open Source Brain < 2 hr
Low barrier to entry; incremental refinement
NIF was designed to be populated rapidly with progressive refinement
Databases come in many shapes and sizes
• Primary data:
– Data available for reanalysis, e.g., microarray data sets from GEO; brain images from XNAT; microscopic images (CCDB/CIL)
• Secondary data
– Data features extracted through data processing and sometimes normalization, e.g., brain structure volumes (IBVD), gene expression levels (Allen Brain Atlas); brain connectivity statements (BAMS)
• Tertiary data
– Claims and assertions about the meaning of data
• E.g., gene upregulation/downregulation, brain activation as a function of task
• Registries:
– Metadata
– Pointers to data sets or materials stored elsewhere
• Data aggregators
– Aggregate data of the same type from multiple sources, e.g., Cell Image Library, SUMSdb, Brede
• Single source
– Data acquired within a single context, e.g., Allen Brain Atlas
Researchers are producing a variety of information artifacts using a multitude of technologies
• Data: values of qualitative or quantitative variables, belonging to a set of items... often the results of measurements (Wikipedia)
• Metadata: “Data about data”
• Structural metadata:
• the design and specification of data structures; more properly called “data about the containers of data” (Wikipedia)
• e.g., image size, bit depth, integer vs string
• Descriptive metadata:
• individual instances of application data, the data content: “data about data content”
• e.g., creator, subject
• Data type: the form of the data for the purposes of data operations
• Data integration: combining data residing in different sources and providing users with a unified view of these data
“Metadata are data” - Wikipedia
[Figure: growth of the NIF data federation over time. Axes: Number of Federated Databases; Number of Federated Records (Millions). Axis ranges: 0 to 250 (linear) and 0.01 to 1000 (logarithmic). Time axis ticks: 6-12 to 5-17.]
NIF searches the largest collation of neuroscience-relevant data on the web
DISCO
• Long tail data: large numbers of small data sets
http://en.wikipedia.org/wiki/Long_tail
Hippocampus OR “Cornu Ammonis” OR “Ammon’s horn”
Query expansion: Synonyms and related concepts
Boolean queries
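As a small illustration of synonym-based query expansion, the sketch below turns one search term into a Boolean OR query. The `SYNONYMS` table and the `expand_query` function are hypothetical; a real system would draw synonyms from an ontology rather than a hand-written dictionary.

```python
from typing import Dict, List

# Toy synonym table (assumption); keyed by the lower-cased preferred term.
SYNONYMS: Dict[str, List[str]] = {
    "hippocampus": ["Cornu Ammonis", "Ammon's horn"],
}


def expand_query(term: str, synonyms: Dict[str, List[str]] = SYNONYMS) -> str:
    """Return a Boolean OR query covering the term and its known synonyms."""
    variants = [term] + synonyms.get(term.lower(), [])
    # Quote multi-word variants so they are matched as phrases.
    quoted = [f'"{v}"' if " " in v else v for v in variants]
    return " OR ".join(quoted)


expanded = expand_query("Hippocampus")
unchanged = expand_query("thalamus")  # no synonyms known, so nothing is added
```

Terms with no entry in the table pass through unchanged, so the expansion never narrows a search.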
Data sources categorized by “data type” and level of nervous system
Common views across multiple sources
Tutorials for using full resource when getting there from NIF
Link back to record in original source
Connects to
Synapsed with
Synapsed by
Input region
innervates
Axon innervates Projects to Cellular contact
Subcellular contact
Source site
Target site
Each resource implements a different, though related model; systems are complex and difficult to learn, in many cases
• Current web is designed to share documents – Documents are unstructured data
• Much of the content of digital resources is part of the “hidden web”
• Wikipedia: The Deep Web (also called Deepnet, the invisible Web, DarkNet, Undernet or the hidden Web) refers to World Wide Web content that is not part of the Surface Web, which is indexed by standard search engines.
Even Google needs a knowledge framework
Knowledge in space and spatial relationships (the “where”)
Knowledge in words, terminologies and logical relationships (the “what”)
Purkinje Cell
Axon Terminal
Axon
Dendritic Tree
Dendritic Spine
Dendrite
Cell body
Cerebellar cortex
There is little obvious connection between data sets taken at different scales using different microscopies without an explicit representation of the biological objects that the data represent
• NIF covers multiple structural scales and domains of relevance to neuroscience
• Aggregate of community ontologies with some extensions for neuroscience, e.g., Gene Ontology, ChEBI, Protein Ontology
NIFSTD
Organism
NS Function Molecule Investigation Subcellular structure
Macromolecule Gene
Molecule Descriptors
Techniques
Reagent Protocols
Cell
Resource Instrument
Dysfunction Quality Anatomical Structure
Brain
Cerebellum
Purkinje Cell Layer
Purkinje cell
neuron
has a
has a
has a
is a
• Ontology: an explicit, formal representation of concepts and the relationships among them within a particular domain that expresses human knowledge in a machine-readable form
• Branch of philosophy: a theory of what is
• e.g., Gene ontologies
• Express neuroscience concepts in a way that is machine readable
– Synonyms, lexical variants
– Definitions
• Provide means of disambiguation of strings
– Nucleus part of cell; nucleus part of brain; nucleus part of atom
• Rules by which a class is defined, e.g., a GABAergic neuron is a neuron that releases GABA as a neurotransmitter
• Properties
– Support reasoning
• Provide universals for navigating across different data sources
– Semantic “index”
– Link data through relationships, not just one-to-one mappings
• Provide the basis for concept-based queries to probe and mine data
• Establish a semantic framework for landscape analysis
Mathematics, Computer code or Esperanto
Aligns sources to the NIF semantic framework
birnlex_1741 Brodmann.10
Explicit mapping of database content helps disambiguate non-unique and custom terminology
birnlex_1204 CA3
• Search Google: GABAergic neuron
• Search NIF: GABAergic neuron
– NIF automatically searches for types of GABAergic neurons
Types of GABAergic neurons
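One way to picture how a concept-based search can pick up "types of GABAergic neurons" is to treat the class as a rule (a neuron that releases GABA) and collect every neuron whose recorded properties satisfy it. The sketch below assumes a toy property table and hypothetical names (`NEURON_PROPERTIES`, `CLASS_DEFINITIONS`, `expand_to_types`); it is not NIF's query engine.

```python
from typing import Dict, List

# Toy property table (assumption); entries are illustrative, not NIF data.
NEURON_PROPERTIES: Dict[str, Dict[str, str]] = {
    "Purkinje cell": {"neurotransmitter": "GABA"},
    "Cerebellum granule cell": {"neurotransmitter": "glutamate"},
}

# A defined class is expressed as a set of required property values.
CLASS_DEFINITIONS: Dict[str, Dict[str, str]] = {
    "GABAergic neuron": {"neurotransmitter": "GABA"},
}


def expand_to_types(term: str) -> List[str]:
    """Return the search term plus every neuron that satisfies its definition."""
    definition = CLASS_DEFINITIONS.get(term)
    if definition is None:
        return [term]  # not a defined class, so search the string as given
    members = sorted(
        name
        for name, props in NEURON_PROPERTIES.items()
        if all(props.get(key) == value for key, value in definition.items())
    )
    return [term] + members


search_terms = expand_to_types("GABAergic neuron")
```

A plain string search would only match records that literally contain "GABAergic neuron"; the expanded list also reaches records that mention a member class by name.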
Neuroscience Information Framework – http://neuinfo.org
Equivalence classes; restrictions
Arbitrary but defensible
• Neurons classified by
• Circuit role: principal neuron vs interneuron
• Molecular constituent: Parvalbumin-neurons, calbindin-neurons
• Brain region: Cerebellar neuron
• Morphology: Spiny neuron
• Molecule Roles: Drug of abuse, anterograde tracer, retrograde tracer
• Brain parts: Circumventricular organ
• Organisms: Non-human primate, non-human vertebrate
• Qualities: Expression level
• Techniques: Neuroimaging
What genes are upregulated by drugs of abuse in the adult mouse? (show me the data!)
Morphine Increased expression
Adult Mouse
• NIF Connectivity: 7 databases containing connectivity primary data or claims from literature on connectivity between brain regions
• Brain Architecture Management System (rodent)
• Temporal lobe.com (rodent)
• Connectome Wiki (human)
• Brain Maps (various)
• CoCoMac (primate cortex)
• UCLA Multimodal database (Human fMRI)
• Avian Brain Connectivity Database (Bird)
• Total: 1800 unique brain terms (excluding Avian)
• Number of exact terms used in > 1 database: 42
• Number of synonym matches: 99
• Number of 1st order partonomy matches: 385
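A count like "exact terms used in more than one database" can be produced by indexing each term by the databases that use it. The sketch below is illustrative only: the database names and term lists are made up, and the normalization (trimming whitespace and ignoring case) is an assumption about what "exact" might mean, not the procedure used for the figures above.

```python
from collections import defaultdict
from typing import Dict, List, Set

# Hypothetical term lists; database names and contents are invented.
TERMS_BY_DATABASE: Dict[str, List[str]] = {
    "db_a": ["Hippocampus", "CA3", "Cerebellum"],
    "db_b": ["hippocampus", "Ammon's horn", "Thalamus"],
    "db_c": ["Cerebellum ", "Striatum"],
}


def shared_exact_terms(terms_by_db: Dict[str, List[str]]) -> Dict[str, Set[str]]:
    """Map each normalized term to the databases using it, keeping shared ones."""
    seen: Dict[str, Set[str]] = defaultdict(set)
    for db, terms in terms_by_db.items():
        for term in terms:
            # Assumed normalization: trim whitespace and ignore case.
            seen[term.strip().lower()].add(db)
    return {term: dbs for term, dbs in seen.items() if len(dbs) > 1}


shared = shared_exact_terms(TERMS_BY_DATABASE)
```

Note that "Ammon's horn" and "Hippocampus" are not linked by this string comparison; catching that pair requires a synonym table or ontology, which is the gap the synonym and partonomy counts above speak to.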
• Realism vs conceptualism
• Controlled vocabularies vs taxonomies vs ontology?
• How do I name classes?
• Shared vs custom ontologies
• Single vs multiple inheritance
• RDF vs OWL?
• Top down vs bottom up: heavy weight vs light weight ontologies
• Should I encode everything in my ontology?
Many schools of thought about ontologies: their construction and use
• Controlled vocabularies: prescribed list of terms or headings, each one having an assigned meaning
• Lexicon/Thesaurus: Vocabularies + their lexical properties, e.g., synonyms, lexical variants
• Taxonomy: monohierarchical classification of concepts, as used, for example, in the classification of biological organisms, built on the “is a” relationship
• Ontology: specification of the concepts of a domain and their relationships, structured to allow computer processing and reasoning
http://www.willpowerinfo.co.uk/glossary.htm Mike Bergman
• Identity:
– Entities are uniquely identifiable
– Name is a meaningless numerical identifier (URI: Uniform Resource Identifier)
– Any number of human-readable labels can be assigned to it
• Definition:
– Genera: is a type of (cell, anatomical structure, cell part)
– Differentia: “has a”: a set of properties that distinguish among members of that class
– Can include necessary and sufficient conditions
• Implementation: How is this definition expressed?
– Depending on the nature of the concept or entity and the needs of the information system, we can say more or fewer things
– Different languages can express different things about the concept that can be computed upon
• OWL W3C standard, RDF
birnlex_1362 CA2
CHEBI_29108 CA2
NIF follows OBO Foundry best practices for naming and defining classes
• XML: Extensible Markup Language: markup language for data. XML itself is not very much concerned with meaning. XML nodes don't need to be associated with particular concepts, and the XML standard doesn't indicate how to derive a fact from a document.
• RDF: Resource Description Framework: a general method to decompose knowledge into small pieces, with some rules about the semantics, or meaning, of those pieces. What sets RDF apart from XML is that RDF is designed to represent knowledge in a distributed world. That RDF is designed for knowledge, and not data, means RDF is particularly concerned with meaning.
– Small pieces are called “triples”: Subject predicate object
– Purkinje neuron (S) has neurotransmitter (P) GABA (O)
• RDFS: a method of specifying metadata about properties/characteristics of things and classes of things such that inference can be carried out (conceptualized in RDF)
• OWL (Web Ontology Language): a more complex (and more powerful) extension of RDFS
• SPARQL: a query language designed for RDF (similar to how SQL was designed for relational databases)
http://answers.semanticweb.com/questions/15215/whats-the-difference-between-using-rdfsowl-versus-xml http://www.rdfabout.com/intro/#Introducing%20RDF
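To make the triple idea concrete, here is a minimal sketch in plain Python rather than an RDF library. Statements are (subject, predicate, object) tuples and a query is a pattern in which `None` is a wildcard, loosely in the spirit of a SPARQL triple pattern. The `match` helper is hypothetical, and plain strings stand in for the URIs that real RDF would use.

```python
from typing import List, Optional, Tuple

Triple = Tuple[str, str, str]

# Statements taken from the examples in this talk; strings stand in for URIs.
TRIPLES: List[Triple] = [
    ("Purkinje neuron", "has neurotransmitter", "GABA"),
    ("Mouse", "has age", "50 days"),
    ("Mouse", "is a", "rodent"),
]


def match(
    triples: List[Triple],
    subject: Optional[str] = None,
    predicate: Optional[str] = None,
    obj: Optional[str] = None,
) -> List[Triple]:
    """Return the triples that fit the pattern; None matches anything."""
    return [
        (s, p, o)
        for (s, p, o) in triples
        if (subject is None or s == subject)
        and (predicate is None or p == predicate)
        and (obj is None or o == obj)
    ]


# "Everything known about Mouse" and "what releases GABA?"
about_mouse = match(TRIPLES, subject="Mouse")
gaba_releasers = [
    s for s, _, _ in match(TRIPLES, predicate="has neurotransmitter", obj="GABA")
]
```

The code never interprets what "has neurotransmitter" means; it only compares names, which is all it needs to answer the pattern.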
Relational model
• Mouse has age 50 days
• Protocol uses instrument confocal microscope
• A confocal imaging protocol is a protocol that uses instrument confocal microscope
RDF: The computer doesn't need to know what “has” actually means in English for this to be useful. It is left up to the application writer to choose appropriate names for things (confocal microscope) and to use the right predicates (uses, has). RDF tools are ignorant of what these names mean, but they can still usefully process the information. - http://www.rdfabout.com/intro/#Introducing%20RDF
May link to other information, e.g., mouse is a rodent
The thalamus projects to the cortex in mammals
• Universal: allValuesFrom: If a mammal has a cortex and a thalamus, then the thalamus must project to the cortex
• Existential: someValuesFrom: The thalamus projects to the cortex in at least one member of the class mammal
• Disjointness: owl:disjointWith: a member of one class cannot simultaneously be an instance of a specified other class: Reptiles are disjoint from mammals
W3C OWL guide: www.w3.org/TR/2004/REC-owl-guide-20040210/
Restrictions placed on classes allow us to reason over the ontology and check for consistency
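A consistency check of the disjointness kind can be sketched without an OWL reasoner. The example below only tests asserted class memberships against declared disjoint pairs; a real reasoner would also use inferred memberships. The individuals (`animal_1`, `animal_2`) and the function name are hypothetical.

```python
from typing import Dict, List, Set, Tuple

# Declared disjoint class pairs, following the example above.
DISJOINT: List[Tuple[str, str]] = [("Reptile", "Mammal")]

# Hypothetical individuals with their asserted classes.
ASSERTED_TYPES: Dict[str, Set[str]] = {
    "animal_1": {"Mammal"},
    "animal_2": {"Mammal", "Reptile"},
}


def disjointness_violations(
    types: Dict[str, Set[str]], disjoint: List[Tuple[str, str]]
) -> List[Tuple[str, str, str]]:
    """List (individual, class_a, class_b) where both disjoint classes are asserted."""
    violations = []
    for individual, classes in sorted(types.items()):
        for class_a, class_b in disjoint:
            if class_a in classes and class_b in classes:
                violations.append((individual, class_a, class_b))
    return violations


violations = disjointness_violations(ASSERTED_TYPES, DISJOINT)
```

An empty result means no asserted memberships contradict the declared disjointness; it does not prove the ontology as a whole is consistent.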
1. Look brain region up in NeuroLex
2. Look up cells contained in the brain region
3. Find those cells that are known to project out of that brain region
4. Look up the neurotransmitters for those cells
5. Determine whether those neurotransmitters are known to be excitatory or inhibitory
6. Report the projection as excitatory or inhibitory, and report the entire chain of logic with links back to the wiki pages where the statements were made
7. Make sure the user can get back to each statement in the logic chain to edit it if they think it is wrong
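The steps above can be sketched as a chain of dictionary lookups that records each statement it relied on. This is a toy under stated assumptions: the four lookup tables are hand-written stand-ins for NeuroLex pages, the function name is hypothetical, and the chain holds plain strings where the real workflow would hold links back to editable wiki pages.

```python
from typing import Dict, List, Tuple

# Hand-written stand-ins for knowledge that would come from NeuroLex (assumption).
CELLS_IN_REGION: Dict[str, List[str]] = {
    "Cerebellar cortex": ["Purkinje cell", "Cerebellum granule cell"],
}
PROJECTS_OUT: Dict[str, bool] = {
    "Purkinje cell": True,
    "Cerebellum granule cell": False,
}
NEUROTRANSMITTER: Dict[str, str] = {
    "Purkinje cell": "GABA",
    "Cerebellum granule cell": "glutamate",
}
EFFECT: Dict[str, str] = {"GABA": "inhibitory", "glutamate": "excitatory"}


def classify_projections(region: str) -> List[Tuple[str, str, List[str]]]:
    """Return (cell, effect, chain of statements) for each projecting cell."""
    results = []
    for cell in CELLS_IN_REGION.get(region, []):           # steps 1 and 2
        if not PROJECTS_OUT.get(cell, False):               # step 3
            continue
        transmitter = NEUROTRANSMITTER[cell]                # step 4
        effect = EFFECT[transmitter]                        # step 5
        chain = [                                           # step 6
            f"{region} contains {cell}",
            f"{cell} projects out of {region}",
            f"{cell} releases {transmitter}",
            f"{transmitter} is {effect}",
        ]
        results.append((cell, effect, chain))
    return results


report = classify_projections("Cerebellar cortex")
```

Keeping the chain alongside the verdict is what makes step 7 possible: a user who disagrees with the result can see exactly which statement to correct.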
Stephen Larson CHEBI:18243
Brain
Cerebellum
Cortex
Cerebellar Purkinje cell
Purkinje neuron
Purkinje cell soma
Purkinje cell layer
Cerebellar cortex
IP3
Cerebellum
• To create the linkages requires mapping • Mapping is usually incomplete and not always possible • Can’t take advantage of others’ work
Gross anatomy ontology Cell centered anatomy ontology
Reuse identifiers rather than recreate them
• “The trouble is that if I make up all of my own URIs, my RDF document has no meaning to anyone else unless I explain what each URI is intended to denote or mean. Two RDF documents with no URIs in common have no information that can be interrelated.”
• NIF favors reuse of identifiers rather than mapping
• Creating ontologies to be used as common building blocks is important: modularity, low semantic overhead
http://www.rdfabout.com/intro/#Introducing%20RDF
Cerebellum Purkinje cell soma
Cerebellum Purkinje cell dendrite
Cerebellum Purkinje cell axon
(Cell part ontology)
Cerebellum granule cell layer (Anatomy ontology)
Cerebellum Purkinje cell layer
Cerebellum molecular layer
Has part
Has part
Has part
Is part of
Is part of
Is part of
Calbindin IP3 (CHEBI:16595)
Cerebellum Purkinje neuron (Cell Ontology)
Cerebellar cortex
Has part Has part
Has part
• Neuroscience Information Framework
– NIFSTD available for download
– Ontoquest web services
– NIF annotation services and mapping tools available
– Neurolex available via SPARQL endpoint
• Bioportal: Collection of > 300 ontologies covering many domains
– Automated mapping between ontologies
– Annotation services
– Web services for access
• OBO Foundry: http://www.obofoundry.org/
– Collection of community ontologies designed according to OBO Foundry principles
• Protégé Ontology editor: Editing tool for constructing ontologies. Excellent short course available for Protégé/OWL.
• Program on Ontologies of Neural Structures (INCF): CUMBO, Neurolex Wiki, Scalable Brain Atlas
You can enhance your tools and annotation with community ontologies
http://neurolex.org Larson et al, Frontiers in Neuroinformatics, in press
• Semantic MediaWiki
• Provide a simple interface for defining the concepts required
• Light weight semantics
• Good teaching tool for learning about semantic integration and the benefits of a consistent semantic framework
• Community based:
• Anyone can contribute their terms, concepts, things
• Anyone can edit
• Anyone can link
• Accessible: searched by Google
• Growing into a significant knowledge base for neuroscience
Demo D03
200,000 edits 150 contributors
Red Links: Information is missing (or misspelled)
• Neurolex provides an on-line computable index for expressing models in semantic terms, and linking to other knowledge and data
• INCF task forces are contributing knowledge
• Neuroscience knowledge in the web
Builds a knowledge base by cross-modular relations and links to data
Once terms have been proposed and vetted by the neuroscience community, NIF feeds them back to general ontologies to enrich coverage of neuroscience
Because their pages have static URLs, wikis are searchable by Google
• INCF Project – Neuron Registry – > 30 experts worldwide
– Fill out neuron pages in Neurolex Wiki
– Led by Dr. Gordon Shepherd
[Figure: bar chart of red links by property (soma location, dendrite location, axon location). Y-axis: number, 0 to 300. Series: total redlinks, easy fixes, hard fixes.]
Social networks and community sites let us learn things from the collective behavior of contributors
INCF Knowledge Space
• Of the ~4000 columns that NIF queries, ~1300 map to one of our core categories:
– Organism
– Anatomical structure
– Cell
– Molecule
– Function
– Dysfunction
– Technique
• 30-50% of NIF’s queries autocomplete
• When NIF combines multiple sources, a set of common fields emerges
– Basic information models/semantic models exist for certain types of entities
Biomedical science does have a conceptual framework, but we don’t place undue importance on it; it must tie to data
• NIF can be used to survey the data landscape
• Analysis of NIF shows multiple databases with similar scope and content
• Many contain partially overlapping data
• Data “flows” from one resource to the next
– Data is reinterpreted, reanalyzed or added to
• Is duplication good or bad?
NIF is trying to make it easier to work with diverse data
NIF is in a unique position to answer questions about the neuroscience landscape
Where are the data?
Striatum Hypothalamus Olfactory bulb
Cerebral cortex
Brain
Brain region
Data source
∞
What is easily machine processable and accessible
What is potentially knowable
What is known: Literature, images, human knowledge
Unstructured; Natural language processing, entity recognition, image processing and analysis; communication
“Known unknowns vs unknown unknowns”
Open world meets closed world
Comprehensive and unbiased?
We know a lot about some things and less about others; some of NIF’s sources are comprehensive; others are highly biased
But... NIF has > 2M antibodies, 338,000 model organisms, and 3 million microarray records
Neocortex
Olfactory bulb
Neostriatum
Cochlear nucleus
All neurons with cell bodies in the same brain region are grouped together
Properties in Neurolex
NIF is in a unique position to answer questions about the neuroscience landscape
Where are the data?
Striatum Hypothalamus Olfactory bulb
Cerebral cortex
Brain
Brain region
Data source Funding
• Requires account in MyNIF
• Still a work in progress, i.e., it breaks a lot
• If you are interested, contact us!
Vadim Astakhov, Kepler Workflow Engine
• Gemma: Gene ID + Gene Symbol • DRG: Gene name + Probe ID
• Gemma presented results relative to baseline chronic morphine; DRG with respect to saline, so direction of change is opposite in the 2 databases
• Analysis:
• 1370 statements from Gemma regarding gene expression as a function of chronic morphine
• 617 were consistent with DRG; over half of the claims of the paper were not confirmed in this analysis
• Results for 1 gene were opposite in DRG and Gemma
• 45 did not have enough information provided in the paper to make a judgment
Relatively simple standards would make life easier
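The baseline mismatch described above (one source reporting change relative to chronic morphine, the other relative to saline) is the kind of thing a small normalization step can handle before claims are compared. The sketch below is hypothetical throughout: the gene names, the claims, and the function names are invented for illustration, and it is not the analysis that produced the counts above.

```python
from typing import Dict, Tuple

FLIP = {"up": "down", "down": "up"}

Claim = Tuple[str, str]  # (direction of change, baseline it is relative to)


def relative_to_saline(direction: str, baseline: str) -> str:
    """Express a direction of change relative to the saline condition."""
    if baseline == "saline":
        return direction
    if baseline == "chronic morphine":
        # Swapping which condition is the baseline reverses the direction.
        return FLIP[direction]
    raise ValueError(f"unknown baseline: {baseline}")


def compare(source_a: Dict[str, Claim], source_b: Dict[str, Claim]) -> Dict[str, str]:
    """Label each gene present in both sources as consistent or opposite."""
    outcome = {}
    for gene in sorted(set(source_a) & set(source_b)):
        same = relative_to_saline(*source_a[gene]) == relative_to_saline(*source_b[gene])
        outcome[gene] = "consistent" if same else "opposite"
    return outcome


# Invented claims: gene -> (direction, baseline).
morphine_baseline_claims = {
    "gene_a": ("down", "chronic morphine"),
    "gene_b": ("up", "chronic morphine"),
}
saline_baseline_claims = {
    "gene_a": ("up", "saline"),
    "gene_b": ("up", "saline"),
}
comparison = compare(morphine_baseline_claims, saline_baseline_claims)
```

The normalization is only possible because each claim carries its baseline; if a source omitted that piece of metadata, the comparison could not be made at all.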
47/50 major preclinical published cancer studies could not be replicated
• “The scientific community assumes that the claims in a preclinical study can be taken at face value: that although there might be some errors in detail, the main message of the paper can be relied on and the data will, for the most part, stand the test of time. Unfortunately, this is not always the case.”
• Getting data out sooner in a form where they can be exposed to many eyes and many analyses may allow us to expose errors and develop better metrics to evaluate the validity of data
Begley and Ellis, 29 MARCH 2012 | VOL 483 | NATURE | 531
NIF favors a hybrid, tiered, federated system
• Domain knowledge
– Ontologies
• Claims, models and observations
– Virtuoso RDF triples
– Model repositories
• Data
– Data federation
– Spatial data
– Workflows
• Narrative
– Full text access
Neuron Brain part Disease Organism Gene
Caudate projects to Snpc Grm1 is upregulated in
chronic cocaine Betz cells
degenerate in ALS
NIF provides the tentacles that connect the pieces: a new type of entity for 21st century science
Technique People
• Several powerful trends should change the way we think about our data: from one to many
– Many data
• Generation of data is getting easier: shared data
• Data space is getting richer: more -omes every day
• But... compared to the biological space, still sparse
– Many eyes
• Wisdom of crowds
• More than one way to interpret data
– Many algorithms
• Not a single way to analyze data
– Many analytics
• “Signatures” in data may not be directly related to the question for which they were acquired but tell us something really interesting
Are you exposing or burying your work?
• You (and the machine) have to be able to find it
– Accessible through the web
– Structured or semi-structured
– Annotations
• You (and the machine) have to be able to use it
– Data type specified and in an actionable form
• You (and the machine) have to know what the data mean
• Semantics
• Context: Experimental metadata
• Provenance: where did they come from
Reporting neuroscience data within a consistent framework helps enormously, but the frameworks need not be onerous
A data sharing snafu in 3 acts
http://force11.org
Jeff Grethe, UCSD, Co-Investigator, Interim PI
Amarnath Gupta, UCSD, Co-Investigator
Anita Bandrowski, NIF Project Leader
Gordon Shepherd, Yale University
Perry Miller
Luis Marenco
Rixin Wang
David Van Essen, Washington University
Erin Reid
Paul Sternberg, Cal Tech
Arun Rangarajan
Hans Michael Muller
Yuling Li
Giorgio Ascoli, George Mason University
Sridevi Polavarum
Fahim Imam
Larry Lui
Andrea Arnaud Stagg
Jonathan Cachat
Jennifer Lawrence
Svetlana Sulima
Davis Banks
Vadim Astakhov
Xufei Qian
Chris Condit
Mark Ellisman
Stephen Larson
Willie Wong
Tim Clark, Harvard University
Paolo Ciccarese
Karen Skinner, NIH, Program Officer (retired)
Jonathan Pollock, NIH, Program Officer
And my colleagues in Monarch, dkNet, 3DVC, Force 11