Creating an Urban Legend: A System for Electrophysiology Data Management and Exploration
Anita de Waard, VP Research Data Collaborations
May 10, 2015
Outline:
• Life is complicated
• A small pilot
• Context and next steps
Life is complicated!
http://en.wikipedia.org/wiki/File:Duck_of_Vaucanson.jpg
1. Interspecies variability > A specimen is not a species!
2. Gene expression variability > Knowing genes is not knowing how they are expressed!
3. Microbiome > An animal is an ecosystem!
4. Systems biology > Whole is more than the sum of its parts!
5. Models vs. experiment > Are we talking about the same things? In a way we can all use?
6. Dynamics > Life is not in equilibrium!
=> Reductionism doesn’t work for living systems!
Statistics could help…
With enough observations, trends and anomalies can be detected:
• “Here we present resources from a population of 242 healthy adults sampled at 15 or 18 body sites up to three times, which have generated 5,177 microbial taxonomic profiles from 16S ribosomal RNA genes and over 3.5 terabases of metagenomic sequence so far.”
The Human Microbiome Project Consortium, Structure, function and diversity of the healthy human microbiome, Nature 486, 207–214 (14 June 2012) doi:10.1038/nature11234
• “The large sample size — 4,298 North Americans of European descent and 2,217 African Americans — has enabled the researchers to mine down into the human genome.”
Nidhi Subbaraman, Nature News, 28 November 2012, High-resolution sequencing study emphasizes importance of rare variants in disease.
…but biological research is insular.
• Biology is small: size 10^-5 – 10^2 m, scientist can work alone (‘King’ and ‘subjects’).
• Biology is messy: it doesn’t happen behind a terminal.
• Biology is competitive: many people with similar skill sets, vying for the same grants.
• In summary: the structure of biological research does not inherently promote collaboration (vs., for instance, high-energy physics or astronomy, and they’re not all they’re cracked up to be, either…).
[Diagram: Prepare, Observe, Analyze, Ponder, Communicate]
What if we could connect experiments?
[Diagram: several Prepare / Analyze / Communicate workflows, each with its Observations, shown side by side]
• Across labs, experiments: track reagents and how they are used
• Compare outcome of interactions with these entities
• Build a ‘virtual reagent spectrogram’ by comparing how different entities interacted in different experiments
• Think
• Reason collectively!
Research Data Management today:
Using antibodies and squishy bits, grad students experiment and enter details into their lab notebook. The PI then tries to make sense of their slides, and writes a paper. End of story.
An Urban Legend is born:
• How can we make a standard neuroscience wet lab more data-sharing savvy?
• Incorporate structured workflows into the daily practice of a typical electrophysiology lab (the Urban Lab at CMU)
  – What does it take?
  – Where are points of conflict?
• 1-year pilot, funded by Elsevier RDS:
  – CMU: Shreejoy Tripathy, manage/user test
  – Elsevier: development, UI, project management
Goal: Enable effective data sharing:
• Effective data sharing = “someone who is not the person who collected the data can understand the experiment and data” (Shreejoy’s definition)
  – So datasets should be more or less self-describing
  – > 90% of data sharing use cases are an experimentalist sharing data with a future version of herself or with a labmate
• Not just the experimental data file, but also the experimental metadata:
  – What was done? What does this variable mean?
  – This is usually stored in paper lab notebooks, understandable by only the experimenter
Main Assumptions:
[Image label: SDB_MC_12_voltages.mat]
1. Effective data sharing includes raw data files + experimental metadata (typically stored in a lab notebook)
2. You know most about an experiment while you’re performing it
3. Improved data practices can make labs more productive and more creative
Components:
Metadata App:
Data integration:
• Syncing of metadata app and electrophysiology data acquisition via server
• Each trace of experimental data annotated with metadata
• Currently specific to IGOR Pro; support for pClamp and other acquisition packages to be added later as needed
Electrophysiology Data Looks like this:
Semantic Integration:
The entity table uses a scope field and an attributes field to create a NoSQL-like, hierarchical key/value structure in PostgreSQL, using the hstore extension that ships with it.
Ontology tables (normalized SQL tables) map keys, values and scopes to ontology information.
Entity
ID : UUID
Investigator : references investigators table
created : timestamp
last_modified : timestamp
scope : string ~ /[A-Z]\d+(::[A-Z]\d+)*/
attributes : hstore (string → string mapping)
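The entity record above can be mirrored in application code. The following is a minimal Python sketch, not the project's actual code: the class name `Entity`, the use of a plain string for the investigator reference, and the choice to validate on construction are all assumptions made for illustration. The scope pattern is the one shown on the slide, applied with `fullmatch` so the whole string has to conform.

```python
import re
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Dict

# Scope pattern as given on the slide; fullmatch() makes it apply to the whole string.
SCOPE_RE = re.compile(r"[A-Z]\d+(::[A-Z]\d+)*")


def _now() -> datetime:
    """Current UTC time, used for both timestamps."""
    return datetime.now(timezone.utc)


@dataclass
class Entity:
    """Illustrative stand-in for one row of the entity table."""
    investigator: str  # assumption: a plain string standing in for the foreign key
    scope: str
    attributes: Dict[str, str] = field(default_factory=dict)  # hstore: string -> string
    id: uuid.UUID = field(default_factory=uuid.uuid4)
    created: datetime = field(default_factory=_now)
    last_modified: datetime = field(default_factory=_now)

    def __post_init__(self) -> None:
        # Reject scopes that do not follow the L#::L#::... format.
        if SCOPE_RE.fullmatch(self.scope) is None:
            raise ValueError(f"invalid scope: {self.scope!r}")
        # hstore only stores strings, so enforce that on both keys and values.
        for key, value in self.attributes.items():
            if not isinstance(key, str) or not isinstance(value, str):
                raise TypeError("attributes must map strings to strings")


# Hypothetical usage: investigator name and attribute value are made up.
run = Entity(investigator="example_user", scope="P1::S1::R3",
             attributes={"electrode_name": "left"})
```

Keeping every attribute value a string matches what hstore can hold; any typing of values (numbers, units) would have to come from the ontology tables.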
Data dashboard (planned):
• Use collected metadata to sort experiments: organize by mouse strain, neuron type, animal age
• Enable in-browser analyses: track provenance of analyzed data back to raw data: “what was that outlier?”
• Simple link in to publishing/data sharing tools: “we can publish papers no one else can”
Next steps Urban Legend Project:
• Populate data server with many experiments:
  – Are people using it? Why/why not?
  – What questions can we answer now that we couldn’t before?
• Export data to neuroscience databases: NIF, INCF Dataspace, neuroelectro.org
• How adaptable is this solution for use in other labs?
• Can we scale this up and make it sustainable?
• Software is available! Ready to swap this simple system for something better: point is process!
• How does it fit into a larger data infrastructure within the institution/nationally/internationally?
Elsevier Research Data Services:
• Main goal: make research data optimally available, discoverable and reusable
• Collaboration is tailored to partner’s unique needs:
  – Working with a few domain-specific and institutional repositories and institutions
  – Aspects where collaboration is needed are discussed
  – Collaboration plan is drawn up using SLA: agree on time, conditions, etc.
• 2013/2014: series of pilots, studies and reports to enable feasibility study:
  – What are key needs?
  – Can Elsevier play a role: skillsets, partnerships?
  – Is there a (transparent) business model for this?
Institutional Context:
[Diagram of institutional data flow. Labels: Researchers; Funding Agencies; Institution; Library; Research Office; Institutional Repository; Research Data Repositories; Generic Data Storage (such as Dropbox); Electronic Lab Notebooks; Unified Metadata Layer; Data Flow; Deposit / Store; Curation; Indexing & Search; Integrated Data Search; Integrated Performance; Query; Usage/Citation reporting; Performance Reporting]
Data Initiatives:
• Data Citation group:
  – Synthesize principles of proper data citation
  – ‘Declaration of Data Citation Principles’, 8 principles of successful data citation - http://www.force11.org/datacitation
• Resource Identification Initiative:
  – Promote research resource identification, discovery, and reuse
  – Resource Identification Portal http://scicrunch.com/resources
  – Central location for obtaining research resource identifiers (RRIDs) for materials and software used in biomedical research
    • Antibody: Abgent Cat# AP7251E, ABR:AB_2140114
    • Tool: CellProfiler Image Analysis Software, NIFRegistry:nif-0000-00280
    • Organism: MGI:MGI:3840442
Summary:
• Life is complicated: knowledge needs to be connected!
• A small pilot: “Urban Legend”
• Context and next steps:
  – Working with institutions and databases to piece together this puzzle
  – Force11 is contributing some pieces
Thank you!
Collaborations and discussions gratefully acknowledged:
• CMU: Nathan Urban, Shreejoy Tripathy, Shawn Burton, Rick Gerkin, Santosh Chandrasekaran, Matthew Geramita, Eduard Hovy
• UCSD: Phil Bourne, Brian Shoettlander, David Minor, Declan Fleming, Ilya Zaslavsky
• NIF/Force11: Maryann Martone, Anita Bandrowski
• OHSU: Melissa Haendel, Nicole Vasilevsky
• California Digital Library: Carly Strasser, John Kunze, Stephen Abrams
• Elsevier: Mark Harviston, Jez Alder, David Marques
Questions?
Anita de Waard, VP Research Data Collaborations
http://researchdata.elsevier.com/
Scopes
Follows the format L#::L#::L#... where L is a letter identifier and # is any number of decimal digits.
Example: P1::S1::R3 = Animal Prep 1, Slice 1, Run 3
The letter need not be globally unique, only unique within its chain.
Example: P1::S1::E1 (Electrode) is different from P1::S1::R1::E1 (Run-Electrode)
Scopes are 1-indexed.
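As a rough illustration of the scope format, the Python sketch below splits a scope string into (letter, number) segments and returns the parent chain. The function names `parse_scope` and `parent_scope` are hypothetical, and rejecting index 0 is an assumption drawn from the note that scopes are 1-indexed.

```python
import re
from typing import List, Optional, Tuple

# One segment: a single uppercase letter followed by decimal digits.
SEGMENT_RE = re.compile(r"([A-Z])(\d+)")


def parse_scope(scope: str) -> List[Tuple[str, int]]:
    """Split 'P1::S1::R3' into [('P', 1), ('S', 1), ('R', 3)]."""
    segments = []
    for part in scope.split("::"):
        match = SEGMENT_RE.fullmatch(part)
        if match is None:
            raise ValueError(f"bad scope segment {part!r} in {scope!r}")
        index = int(match.group(2))
        if index < 1:
            # Assumption: 1-indexed scopes means 0 is never a valid index.
            raise ValueError(f"scope indices start at 1, got {part!r}")
        segments.append((match.group(1), index))
    return segments


def parent_scope(scope: str) -> Optional[str]:
    """Return the enclosing chain, or None for a top-level scope."""
    head, separator, _ = scope.rpartition("::")
    return head if separator else None
```

Because a letter is only unique within its chain, the meaning of a segment depends on its parent: `parent_scope("P1::S1::E1")` is `"P1::S1"`, while `parent_scope("P1::S1::R1::E1")` is `"P1::S1::R1"`, so the two `E1` segments refer to different things.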
Attributes
Each scope has an attributes field that consists of multiple key, value pairs.
The keys are unique and not tied to scope (e.g. electrode_name instead of name).
A key can be a choice, a scalar (with units), or a free-text field; which one is determined by the ontology tables.
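The slides do not show the ontology tables, so the following Python sketch only illustrates the idea of checking attribute values by key type. The `ONTOLOGY` dictionary, its field names, the example keys other than `electrode_name`, and the listed choices and units are all invented for the example.

```python
from typing import Dict, List

# Hypothetical ontology entries; the real ontology tables are not shown in the slides.
ONTOLOGY = {
    "electrode_name": {"kind": "free_text"},
    "neuron_type": {"kind": "choice", "choices": ["mitral", "tufted"]},
    "animal_age": {"kind": "scalar", "unit": "days"},
}


def validate_attributes(attributes: Dict[str, str]) -> List[str]:
    """Return a list of problems; an empty list means every value fits its key type."""
    problems = []
    for key, value in attributes.items():
        spec = ONTOLOGY.get(key)
        if spec is None:
            problems.append(f"{key}: unknown key")
        elif spec["kind"] == "choice":
            if value not in spec["choices"]:
                problems.append(f"{key}: {value!r} is not an allowed choice")
        elif spec["kind"] == "scalar":
            # Values are stored as strings, so a scalar must parse as a number;
            # the unit comes from the ontology entry, not from the value.
            try:
                float(value)
            except ValueError:
                problems.append(f"{key}: {value!r} is not a number ({spec['unit']})")
        # free_text values are accepted as-is
    return problems
```

For instance, `validate_attributes({"animal_age": "21", "neuron_type": "mitral"})` returns an empty list, while a non-numeric `animal_age` is reported as a problem.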
Downsides to Flexible Schema
Converting between the flat scopes and a true hierarchy (say in JSON) is rather complicated and led to many errors in the App.
Very easy to get corrupted data in the App.
Schema is closely aligned to the way the Lua App did things.
A flexible schema was a good choice, but scopes were not a good way to represent hierarchies.
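To make the conversion problem concrete, here is a minimal Python sketch of going from flat scope keys to a nested, JSON-serializable structure and back. It is not the App's code: the function names and the node layout (`attributes` plus `children`) are assumptions chosen for the example.

```python
import json
from typing import Any, Dict

Flat = Dict[str, Dict[str, str]]  # scope string -> attributes


def build_hierarchy(flat: Flat) -> Dict[str, Any]:
    """Nest records by scope: 'P1::S1' becomes tree['P1']['children']['S1']."""
    root: Dict[str, Any] = {"attributes": {}, "children": {}}
    for scope, attributes in flat.items():
        node = root
        for segment in scope.split("::"):
            # Create intermediate nodes on demand, even if no record exists for them.
            node = node["children"].setdefault(
                segment, {"attributes": {}, "children": {}})
        node["attributes"] = dict(attributes)
    return root["children"]


def flatten_hierarchy(tree: Dict[str, Any], prefix: str = "") -> Flat:
    """Inverse direction: walk the tree and rebuild the flat scope keys."""
    flat: Flat = {}
    for segment, node in tree.items():
        scope = f"{prefix}::{segment}" if prefix else segment
        flat[scope] = dict(node["attributes"])
        flat.update(flatten_hierarchy(node["children"], scope))
    return flat


# Hypothetical records: keys and values are made up.
records = {"P1": {"animal_age": "21"}, "P1::S1": {}, "P1::S1::R3": {"electrode_name": "left"}}
tree = build_hierarchy(records)
as_json = json.dumps(tree, indent=2)
```

Even this small version shows one source of trouble: if a flat record set contains `P1::S1` but no `P1`, the round trip returns an extra, empty `P1` record, because the tree cannot distinguish a missing parent from a parent with no attributes.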
Raw Data
For use in data-dashboard.
Standardized on HDF5.
Files uploaded via FTP.
Username, filename, and metadata within the HDF5 file used to identify associated metadata records.
Batch or individually uploaded.