Small Data: Bridging the Gap Between Generic and Specific Repositories

Small Data, or: Bridging the Gap Between Specific and Generic Research Repositories

April 11, 2013 Anita de Waard

VP Research Data Collaborations [email protected]

http://researchdata.elsevier.com/

There are many efforts to enhance data storing and sharing...

•  Many different research databases, both generic (Dryad, Dataverse, …) and specific (NIF, IEDA, PDB, …)

•  Many systems for creating/sharing workflows (Taverna, MyExperiment, Vistrails, Workflow4Ever, etc.)

•  Many e-lab notebooks (LabGuru, LabArchives, LaBlog, etc.)

•  Scores of projects, committees, standards, bodies, grants, initiatives, conferences for discussing and connecting all of this (KEfED, Pegasus, PROV, RDA, Science Gateways, Codata, BRDI, Earthcube, etc.)

•  You can make a living out of this ;-)! (and many of us do…)

…but this is what scientists do:

Using antibodies and squishy bits, grad students experiment and enter details into their lab notebook. The PI then tries to make sense of this, and writes a paper. End of story.

Why save research data?

A.  Data Preservation:
–  Preserve record of scientific process, provenance
–  Enable reproducible research

B.  Data Use:
–  Use results obtained by others
–  Do better science!
–  Improve interdisciplinary work

> 50 M papers, 2 M scientists, 2 M papers/year

Where the data goes now:

•  Majority of data (90%?) is stored on local hard drives

•  Some data (8%?) stored in large, generic data repositories: Dryad: 7,631 files; Dataverse: 0.6 M; Datacite: 1.5 M

•  A small portion of data (1-2%?) stored in small, topic-focused data repositories: MiRB: 25 k; PetDB: 1.5 k; TAIR: 72.1 k; PDB: 88.3 k; SedDB: 0.6 k


So this needs to happen:

•  Majority of data (90%?) is stored on local hard drives

•  Some data (8%?) stored in large, generic data repositories: Dryad: 7,631 files; Dataverse: 0.6 M; Datacite: 1.5 M

•  A small portion of data (1-2%?) stored in small, topic-focused data repositories: MiRB: 25 k; PetDB: 1.5 k; TAIR: 72.1 k; PDB: 88.3 k; SedDB: 0.6 k

INCREASE DATA PRESERVATION

Data Preservation Issues:

•  Objection: “Our lab notebooks are all on paper – it’s how we do things”

•  Response: Graft tools closely onto scientists’ daily practice

•  Example: create tailored metadata collection tools on mini-tablets in labs to replace paper notebooks

•  Objection: “I need to see a direct benefit of any effort I put in.”

•  Response: Create tools to allow better insight into own and others’ results.

•  Example: ‘PI-Dashboard’: allow immediate access/analysis of shared data: new science!

Data Preservation Issues:

•  Objection: “I don’t really trust anyone else’s data – and don’t think they’ll trust mine”

•  Response: Create social networking context; allow data owner to provide granular access control.

•  Example:
–  In Urban Lab app, data stored by researcher name.
–  PI decides who gets to see which data
–  Match up with NIF and Eagle-I ontologies on back end so export of (part of) data is possible at any time.
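A minimal sketch of what "PI decides who gets to see which data" could look like in code. This is an illustration only, not the Urban Lab app's actual design: the class names `Dataset` and `LabStore`, the method names, and the sample identifiers are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Dataset:
    name: str
    owner: str                                      # data is filed under the researcher's name
    readers: set[str] = field(default_factory=set)  # people the PI has granted access to


class LabStore:
    """Toy in-memory store (hypothetical; assumes a single PI per lab)."""

    def __init__(self, pi: str) -> None:
        self.pi = pi
        self.datasets: dict[str, Dataset] = {}

    def add(self, name: str, owner: str) -> None:
        self.datasets[name] = Dataset(name=name, owner=owner)

    def grant(self, granted_by: str, name: str, reader: str) -> None:
        # Only the PI decides who gets to see which data.
        if granted_by != self.pi:
            raise PermissionError("only the PI may grant access")
        self.datasets[name].readers.add(reader)

    def can_read(self, person: str, name: str) -> bool:
        ds = self.datasets[name]
        return person == self.pi or person == ds.owner or person in ds.readers


# Example with made-up names: the PI shares one dataset with one collaborator.
store = LabStore(pi="pi")
store.add("blot_01", owner="student_a")
store.grant("pi", "blot_01", "collaborator_x")
```

Access is granted per dataset rather than per lab, which is one way to read "granular": the owner and the PI always see the data, everyone else only after an explicit grant.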


Data Use Issues:

•  Objection: “I am afraid other people might scoop my discoveries”

•  Response: Reward system needs to move from direct competition to a ‘shared mission’ approach (cf. Mars)

•  Example: Data Rescue Challenge in the geosciences: collect and reward stories/practices of data preservation, enable cross-disciplinary access and use of all data.

The 2013 International Data Rescue Award in the Geosciences, organised by IEDA and Elsevier Research Data Services: http://researchdata.elsevier.com/datachallenge

Data Use Issues:

Data Preservation and Annotation: Fine, I’ll do it – but where the hell do I put it?

Funding Agency: University: Collaborators: Domain of study:

•  Domain-Specific Data Repository

•  Local Data Repository

•  Institutional Data Repository

•  Generic Data Repository

AND THEY ALL WANT DIFFERENT METADATA!!!!

Comparing Repository Types:

Repository | Advantages | Disadvantages
Local data repository | Easy! No one steals your data. | No one sees it. Not compliant with requirements.
Institutional repository | Not very difficult. Administrators are happy. | Data can’t easily be reused. Credit?
Generic data repository | Not very hard to do. Have complied! | Data can’t be easily reused. Credit…
Domain-specific data repository | Data can be reused. Credit! | Lot of work – for curators

Effort, Reuse, Credit, Compliance

Habit, Ease, Privacy, Control

MORE ANNOTATION

Conclusions for data annotation:

“Instead of building newer and larger weapons of mass destruction, I think mankind should try to get more use out of the ones we have”

Deep Thoughts by Jack Handey

•  Let’s use the data standards we already have – and agree on using the same ones

•  Work with existing data repositories in a field to come to a lowest common denominator of metadata

•  Tailor the systems to be optimally easy to use for scientists in terms of metadata: add as little as you have to, as few times as you can.
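One way to picture the "lowest common denominator of metadata" is as the set of fields that every repository asks for. The sketch below assumes each repository's requirements can be written down as a flat set of field names; the repository labels and field names are invented for illustration and do not describe any real repository's schema.

```python
def common_core(schemas: dict[str, set[str]]) -> set[str]:
    """Fields that every repository in `schemas` asks for."""
    if not schemas:
        return set()
    return set.intersection(*schemas.values())


def extra_fields(schemas: dict[str, set[str]]) -> dict[str, set[str]]:
    """Per repository, the fields a scientist must add beyond the common core."""
    core = common_core(schemas)
    return {repo: fields - core for repo, fields in schemas.items()}


# Hypothetical metadata requirements for the four repository types.
schemas = {
    "local_repo": {"title", "creator", "date"},
    "institutional_repo": {"title", "creator", "date", "department"},
    "generic_repo": {"title", "creator", "date", "license"},
    "domain_repo": {"title", "creator", "date", "sample_id", "method"},
}

core = common_core(schemas)     # entered once, reused everywhere
extras = extra_fields(schemas)  # entered only for the repository that needs it
```

Under these assumptions the core is captured once at the bench, and only the per-repository extras are asked for at deposit time, which is the "add as little as you have to, as few times as you can" idea.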

Summary:

•  Data Preservation:
–  Tailor tools to fit scientists’ workflow – follow the experiment!
–  We are creating repositories of shared experiments: enable demonstrably better science!

•  Data Use:
–  Allow owner full control over who sees which data; create social networking context
–  Collectively pioneer long-term funding options; support/develop ‘shared mission’ funding challenges

•  How annotation can help reuse:
–  Collaborate between (generic/specific, institutional, cross-national) data facilities to integrate repositories, enable cross-repository usage and reuse existing metadata.

Questions?

Anita de Waard, VP Research Data Collaborations

[email protected]

http://researchdata.elsevier.com/

Elsevier Research Data Services Goals:

1.  Increase Data Preservation: Help increase the amount and quality of data preserved and shared

2.  Improve Data Use: Help increase the value and usability of the data shared by increasing annotation, normalization, and provenance, enabling enhanced interoperability

3.  Develop Sustainable Models: Help measure and deliver credit for shared data to the researchers, the institute, and the funding body, enabling more sustainable platforms.

Guiding Principles of RDS:

•  In principle, all open data stays open, and URLs, front end, etc. stay where they are (i.e. with the repository)

•  Collaboration is tailored to data repositories’ unique needs/interests (‘service-model’ type):
–  Aspects where collaboration is needed are discussed
–  A collaboration plan is drawn up using a Service-Level Agreement: agree on time, conditions, etc.

•  Transparent business model

•  Very small (2/3 people) department; immediate communication; instant deployment of ideas.

“But aren’t you guys in it for the money?”

•  Yes, we are, like most businesses…

•  Is your real question perhaps: ‘Does no one want to work with you anymore because of the Open Access debate?’

•  The OA debate focuses on three issues:
–  IPR and access issues. E.g. BY-NC-SA? Github? ..?
–  Opaque business models. E.g. Gold Open Access? Shared funding model? Commercial analytics with shared royalties?
–  Lack of perceived added value. We offer a service: only use it if it’s any good!
