Creating an Urban Legend: A System for Electrophysiology Data Management and Exploration


Anita de Waard, VP Research Data Collaborations

[email protected]


Outline:

• Life is complicated
• A small pilot
• Context and next steps

Life is complicated!

http://en.wikipedia.org/wiki/File:Duck_of_Vaucanson.jpg

1. Interspecies variability > A specimen is not a species!
2. Gene expression variability > Knowing genes is not knowing how they are expressed!
3. Microbiome > An animal is an ecosystem!
4. Systems biology > Whole is more than the sum of its parts!
5. Models vs. experiment > Are we talking about the same things? In a way we can all use?
6. Dynamics > Life is not in equilibrium!

=> Reductionism doesn’t work for living systems!

Statistics could help… With enough observations, trends and anomalies can be detected:

• “Here we present resources from a population of 242 healthy adults sampled at 15 or 18 body sites up to three times, which have generated 5,177 microbial taxonomic profiles from 16S ribosomal RNA genes and over 3.5 terabases of metagenomic sequence so far.”

The Human Microbiome Project Consortium, Structure, function and diversity of the healthy human microbiome, Nature 486, 207–214 (14 June 2012) doi:10.1038/nature11234

• “The large sample size — 4,298 North Americans of European descent and 2,217 African Americans — has enabled the researchers to mine down into the human genome.”

Nidhi Subbaraman, Nature News, 28 November 2012, High-resolution sequencing study emphasizes importance of rare variants in disease.

…but biological research is insular.

• Biology is small: size 10^-5 – 10^2 m, scientist can work alone (‘King’ and ‘subjects’).
• Biology is messy: it doesn’t happen behind a terminal.
• Biology is competitive: many people with similar skill sets, vying for the same grants.
• In summary: the structure of biological research does not inherently promote collaboration (vs., for instance, HE physics or astronomy (and they’re not all they’re cracked up to be, either…)).

[Diagram: Prepare > Observe > Analyze > Ponder > Communicate]

What if we could connect experiments?

[Diagram: several experiments, each with Prepare, Analyze and Communicate steps, linked through their Observations]

Across labs, experiments: track reagents and how they are used


Compare outcome of interactions with these entities



Build a ‘virtual reagent spectrogram’ by comparing how different entities interacted in different experiments

Think

Reason collectively!


Research Data Management today:

Using antibodies and squishy bits, grad students experiment and enter details into their lab notebook. The PI then tries to make sense of their slides, and writes a paper. End of story.

An Urban Legend is born:

• How can we make a standard neuroscience wet lab more data-sharing savvy?

• Incorporate structured workflows into the daily practice of a typical electrophysiology lab (the Urban Lab at CMU)
– What does it take?
– Where are points of conflict?

• 1-year pilot, funded by Elsevier RDS:
– CMU: Shreejoy Tripathy, manage/user test
– Elsevier: development, UI, project management

Goal: Enable effective data sharing:

• Effective data sharing = “someone who is not the person who collected the data can understand the experiment and data” (Shreejoy’s definition)
– So datasets should be more or less self-describing
– > 90% of data sharing use cases are an experimentalist sharing data with a future version of herself or with a labmate

• Not just the experimental data file, but also the experimental metadata:
– What was done? What does this variable mean?
– This is usually stored in paper lab notebooks, understandable by only the experimenter

Main Assumptions:

SDB_MC_12_voltages.mat

1. Effective data sharing includes raw data files + experimental metadata (typically stored in a lab notebook)

2. You know most about an experiment while you’re performing it

3. Improved data practices can make labs more productive and more creative

Components:

Metadata App:

http://researchdata.elsevier.com/urbanlegend

Data integration:

• Syncing of metadata app and electrophysiology data acquisition via server

• Each trace of experimental data annotated with metadata

• Currently IGOR Pro specific; support for pClamp and other acquisition packages to be added later as needed

Electrophysiology Data Looks like this:

Semantic Integration:

The entity table uses a scope and an attributes field to create a NoSQL-like, hierarchical key/value structure in PostgreSQL with the built-in hstore extension.

Ontology information (in normalized SQL tables) maps keys, values & scopes to ontology terms.

Entity

ID : UUID

Investigator : references investigators table

created : timestamp

last_modified : timestamp

scope : string ~ /[A-Z]\d+(::[A-Z]\d+)*/

attributes : hstore (string → string mapping)
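
As a rough illustration of the entity record listed above, here is a minimal in-memory sketch in Python. It is not the project's code: the class name `Entity`, the `set_attribute` helper and the use of a plain string for the investigator reference are assumptions made for the example. Only the field names and the scope pattern come from the slide.

```python
import re
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Scope pattern as given on the slide: a letter plus digits, chained with "::".
SCOPE_RE = re.compile(r"[A-Z]\d+(::[A-Z]\d+)*")


def _now():
    return datetime.now(timezone.utc)


@dataclass
class Entity:
    """Hypothetical in-memory stand-in for one row of the entity table."""
    investigator: str                                # stands in for the foreign key
    scope: str
    attributes: dict = field(default_factory=dict)   # hstore-like: str -> str
    id: uuid.UUID = field(default_factory=uuid.uuid4)
    created: datetime = field(default_factory=_now)
    last_modified: datetime = field(default_factory=_now)

    def __post_init__(self):
        if not SCOPE_RE.fullmatch(self.scope):
            raise ValueError(f"invalid scope: {self.scope!r}")
        # hstore maps strings to strings, so coerce keys and values.
        self.attributes = {str(k): str(v) for k, v in self.attributes.items()}

    def set_attribute(self, key, value):
        """Store one key/value pair and bump the modification time."""
        self.attributes[str(key)] = str(value)
        self.last_modified = _now()
```

Every value ends up as a string, which mirrors the string-to-string mapping of hstore; typing information would have to come from elsewhere, here presumably the ontology tables.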

Data dashboard (planned):

• Use collected metadata to sort experiments: organize by mouse strain, neuron type, animal age

• Enable in-browser analyses: track provenance of analyzed data back to raw data: “what was that outlier?”

• Simple link in to publishing/data sharing tools: “we can publish papers no one else can”
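
The sorting idea in the first bullet can be sketched in a few lines of Python. This is only an illustration of grouping metadata records by one key; the record layout, the `group_by` function and the key names (`mouse_strain`, `neuron_type`) are assumptions, not the dashboard's actual design.

```python
from collections import defaultdict


def group_by(records, key):
    """Group metadata records (plain dicts) by the value of one key."""
    groups = defaultdict(list)
    for record in records:
        # Records that lack the key are collected under "unknown".
        groups[record.get(key, "unknown")].append(record)
    return dict(groups)


# Hypothetical records, one per experiment.
records = [
    {"scope": "P1", "mouse_strain": "strain A", "neuron_type": "mitral"},
    {"scope": "P2", "mouse_strain": "strain B", "neuron_type": "mitral"},
    {"scope": "P3", "neuron_type": "tufted"},
]
by_strain = group_by(records, "mouse_strain")
```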

Next steps Urban Legend Project:

• Populate data server with many experiments:
– Are people using it? Why/why not?
– What questions can we answer now that we couldn’t before?
• Export data to neuroscience databases: NIF, INCF Dataspace, neuroelectro.org
• How adaptable is this solution for use in other labs?
• Can we scale this up and make it sustainable?
• Software is available! Ready to swap this simple system for something better: point is process!
• How does it fit into a larger data infrastructure within the institution/nationally/internationally?

Elsevier Research Data Services:

• Main goal: make research data optimally available, discoverable and reusable
• Collaboration is tailored to partner’s unique needs:
– Working with a few domain-specific and institutional repositories and institutions
– Aspects where collaboration is needed are discussed
– Collaboration plan is drawn up using SLA: agree on time, conditions, etc.
• 2013/2014: series of pilots, studies and reports to enable feasibility study:
– What are key needs?
– Can Elsevier play a role: skillsets, partnerships?
– Is there a (transparent) business model for this?

Institutional Context:

[Diagram: data flow between Researchers, Funding Agencies, Institution, Library and Research Office. Institutional Repository, Research Data Repositories, Generic Data Storage (such as Dropbox) and Electronic Lab Notebooks feed a Unified Metadata Layer through Deposit / Store, Curation, Indexing and Usage/Citation reporting; the layer supports Indexing & Search, Integrated Data Search, Performance Reporting and Integrated Performance Query.]

Data Initiatives:

• Data Citation group:
– Synthesize principles of proper data citation
– ‘Declaration of Data Citation Principles’, 8 principles of successful data citation: http://www.force11.org/datacitation

• Resource Identification Initiative:
– Promote research resource identification, discovery, and reuse
– Resource Identification Portal: http://scicrunch.com/resources
– Central location for obtaining research resource identifiers (RRIDs) for materials and software used in biomedical research
  • Antibody: Abgent Cat# AP7251E, ABR:AB_2140114
  • Tool: CellProfiler Image Analysis Software, NIFRegistry:nif-0000-00280
  • Organism: MGI:MGI:3840442





http://www.force11.org/node/4463


Summary:

• Life is complicated: knowledge needs to be connected!

• A small pilot: “Urban Legend”

• Context and next steps:
– Working with institutions and databases to piece together this puzzle
– Force11 is contributing some pieces

Thank you!

Collaborations and discussions gratefully acknowledged:

• CMU: Nathan Urban, Shreejoy Tripathy, Shawn Burton, Rick Gerkin, Santosh Chandrasekaran, Matthew Geramita, Eduard Hovy
• UCSD: Phil Bourne, Brian Shoettlander, David Minor, Declan Fleming, Ilya Zaslavsky
• NIF/Force11: Maryann Martone, Anita Bandrowski
• OHSU: Melissa Haendel, Nicole Vasilevsky
• California Digital Library: Carly Strasser, John Kunze, Stephen Abrams
• Elsevier: Mark Harviston, Jez Alder, David Marques

Questions?

Anita de Waard, VP Research Data Collaborations

[email protected]

http://researchdata.elsevier.com/


Scopes

Follows the format L#::L#::L#... where L is a letter identifier and # is any number of decimal digits.

Example: P1::S1::R3 = Animal Prep 1, Slice 1, Run 3

The letter need not be globally unique, only chain unique.

Example: P1::S1::E1 (Electrode) is different from P1::S1::R1::E1 (Run-Electrode)

Scopes are 1-indexed.
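
The scope rules above can be made concrete with a small parser. This is a sketch under stated assumptions rather than the app's own code: the function names `parse_scope` and `chain_key` are invented here, and treating the chain of letters as the thing that gives a letter its meaning is one reading of "chain unique".

```python
import re

# One scope segment: a single uppercase letter followed by decimal digits.
SEGMENT_RE = re.compile(r"([A-Z])(\d+)")


def parse_scope(scope):
    """Split a scope such as 'P1::S1::R3' into (letter, number) pairs."""
    parts = []
    for segment in scope.split("::"):
        match = SEGMENT_RE.fullmatch(segment)
        if match is None:
            raise ValueError(f"bad scope segment: {segment!r}")
        letter, number = match.group(1), int(match.group(2))
        if number < 1:
            raise ValueError("scopes are 1-indexed")
        parts.append((letter, number))
    return parts


def chain_key(scope):
    """Return the chain of letters, which is what gives a letter its meaning."""
    return "::".join(letter for letter, _ in parse_scope(scope))
```

With this reading, `chain_key("P1::S1::E1")` and `chain_key("P1::S1::R1::E1")` differ, so the two `E` letters can name different kinds of thing (Electrode versus Run-Electrode).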

Attributes

Each scope has an attributes field that consists of multiple key, value pairs.

The keys are unique and not tied to scope (e.g. electrode_name instead of name).

A key can be a choice, a scalar (with units), or a free-text field; which one it is, is determined by the ontology tables.
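
To illustrate how a key's kind could drive validation, here is a minimal Python sketch. The `ONTOLOGY` dictionary stands in for the ontology tables, and its keys, choices and units are made up for the example; only the three kinds (choice, scalar with units, free text) come from the slide.

```python
# Hypothetical stand-in for the ontology tables: each key declares its kind.
ONTOLOGY = {
    "mouse_strain": {"kind": "choice", "choices": {"strain A", "strain B"}},
    "animal_age": {"kind": "scalar", "unit": "days"},
    "electrode_name": {"kind": "text"},
}


def validate_attribute(key, value, ontology=ONTOLOGY):
    """Check a string value against the kind declared for its key."""
    entry = ontology.get(key)
    if entry is None:
        raise KeyError(f"unknown attribute key: {key!r}")
    kind = entry["kind"]
    if kind == "choice":
        if value not in entry["choices"]:
            raise ValueError(f"{value!r} is not an allowed choice for {key!r}")
        return value
    if kind == "scalar":
        number = float(value)  # raises ValueError for non-numeric input
        return f"{number:g} {entry['unit']}"
    return value  # free text is stored as given
```

Because keys are unique across scopes, a single lookup table like this is enough; no scope needs to be passed in.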

Downsides to Flexible Schema

Converting to/from the flat scopes to a true hierarchy (say in JSON) is rather complicated and led to many errors in the App.

Very easy to get corrupted data in the App.

Schema is closely aligned to the way the Lua App did things.

A flexible schema was a good choice, but not scopes for hierarchies.
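
The conversion mentioned above can be sketched as follows. This is an illustrative Python version, not the app's Lua code; the node layout (`attributes` plus `children`) and the function names are assumptions chosen for the example.

```python
import json


def nest_scopes(flat):
    """Turn {'P1::S1': {...}, ...} into a tree keyed by scope segment."""
    root = {"attributes": {}, "children": {}}
    for scope, attributes in flat.items():
        node = root
        for segment in scope.split("::"):
            node = node["children"].setdefault(
                segment, {"attributes": {}, "children": {}}
            )
        node["attributes"].update(attributes)
    return root["children"]


def flatten_scopes(nested, prefix=""):
    """Inverse of nest_scopes for nodes that carry at least one attribute."""
    flat = {}
    for segment, node in nested.items():
        scope = f"{prefix}::{segment}" if prefix else segment
        if node["attributes"]:
            flat[scope] = dict(node["attributes"])
        flat.update(flatten_scopes(node["children"], scope))
    return flat


flat = {
    "P1": {"mouse_strain": "strain A"},
    "P1::S1::R3": {"temperature_c": "32"},
}
nested = nest_scopes(flat)
as_json = json.dumps(nested, indent=2)
```

Even in this small sketch the round trip is lossy at the edges: the intermediate scope `P1::S1` exists in the tree without a record of its own, and a record with no attributes would vanish when flattened. That is one plausible way such conversions produce errors.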

Raw Data

For use in data dashboard.

Standardized on HDF5.

Files uploaded via FTP.

Username, filename, and metadata within the HDF5 file used to identify associated metadata records.

Batch or individually uploaded.
