Small Data, or: Bridging the Gap Between Specific and Generic Research Repositories
April 11, 2013
Anita de Waard, VP Research Data Collaborations
[email protected]
http://researchdata.elsevier.com/
May 10, 2015
There are many efforts to enhance data storing and sharing...
• Many different research databases, both generic (Dryad, Dataverse, …) and specific (NIF, IEDA, PDB, …)
• Many systems for creating/sharing workflows (Taverna, MyExperiment, Vistrails, Workflow4Ever etc)
• Many e-lab notebooks (LabGuru, LabArchives, LaBlog, etc)
• Scores of projects, committees, standards, bodies, grants, initiatives, conferences for discussing and connecting all of this (KEfED, Pegasus, PROV, RDA, Science Gateways, Codata, BRDI, Earthcube, etc. etc)
• You can make a living out of this ;-)! (and many of us do…)
…but this is what scientists do:
Using antibodies and squishy bits, Grad Students experiment and enter details into their lab notebook. The PI then tries to make sense of this, and writes a paper. End of story.
Why save research data?
A. Data Preservation:
– Preserve record of scientific process, provenance
– Enable reproducible research
B. Data Use:
– Use results obtained by others
– Do better science!
– Improve interdisciplinary work
Where the data goes now:
> 50 M papers; 2 M scientists; 2 M papers/year
• Majority of data (90%?) is stored on local hard drives
• Some data (8%?) is stored in large, generic data repositories (Dryad: 7,631 files; Dataverse: 0.6 M; Datacite: 1.5 M)
• A small portion of data (1-2%?) is stored in small, topic-focused data repositories (MiRB: 25 k; PetDB: 1.5 k; TAIR: 72.1 k; PDB: 88.3 k; SedDB: 0.6 k)
So this needs to happen:
INCREASE DATA PRESERVATION
Data Preservation Issues:
Objection: “Our lab notebooks are all on paper – it’s how we do things”
Response: Graft tools closely on scientists’ daily practice
Example: create tailored metadata collection tools on mini-tablets in labs to replace paper notebooks
Objection: “I need to see a direct benefit of any effort I put in.”
Response: Create tools to allow better insight into one's own and others’ results.
Example: ‘PI-Dashboard’: allow immediate access/analysis of shared data: new science!
Data Preservation Issues:
Objection: “I don’t really trust anyone else’s data – and don’t think they’ll trust mine”
Response: Create social networking context; allow data owner to provide granular access control.
Example:
• In Urban Lab app, data stored by researcher name.
• PI decides who gets to see which data
• Match up with NIF and Eagle-I ontologies on back end so export of (part of) data is possible at any time.
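The access-control idea in this example can be sketched in a few lines. This is a minimal illustration only, not the Urban Lab app's actual design: the class names `Dataset` and `LabDataStore`, the method names, and the rule that researchers can always see their own data are all assumptions made for the sketch.

```python
from dataclasses import dataclass, field


@dataclass
class Dataset:
    """A dataset stored under the name of the researcher who produced it."""
    name: str
    researcher: str
    allowed_viewers: set[str] = field(default_factory=set)


class LabDataStore:
    """Hypothetical store in which only the PI can grant read access."""

    def __init__(self, pi: str) -> None:
        self.pi = pi
        self._datasets: dict[str, Dataset] = {}

    def add(self, name: str, researcher: str) -> Dataset:
        """Register a dataset under the researcher's name."""
        dataset = Dataset(name=name, researcher=researcher)
        self._datasets[name] = dataset
        return dataset

    def grant(self, granted_by: str, name: str, viewer: str) -> None:
        """Let `viewer` see dataset `name`; only the PI may do this."""
        if granted_by != self.pi:
            raise PermissionError("only the PI can grant access")
        self._datasets[name].allowed_viewers.add(viewer)

    def can_view(self, user: str, name: str) -> bool:
        """Assumed rule: the PI, the owning researcher, and granted viewers."""
        dataset = self._datasets[name]
        return user in {self.pi, dataset.researcher} or user in dataset.allowed_viewers

    def visible_to(self, user: str) -> list[str]:
        """Names of all datasets the user may see, sorted for stable output."""
        return sorted(n for n in self._datasets if self.can_view(user, n))
```

Access is decided per dataset rather than per lab, which is what lets the PI share part of the data while keeping the rest private.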
Data Use Issues:
• Objection: “I am afraid other people might scoop my discoveries”
• Response: Reward system needs to move from direct competition to a ‘shared mission’ approach (cf. Mars)
• Example: Data Rescue Challenge in the geosciences: collect and reward stories/practices of data preservation, enable cross-disciplinary access and use of all data.
The 2013 International Data Rescue Award in the Geosciences
Organised by IEDA and Elsevier Research Data Services
http://researchdata.elsevier.com/datachallenge
Data Use Issues:
Data Preservation and Annotation: Fine, I’ll do it – but where the hell do I put it?
Funding Agency: University: Collaborators: Domain of study:
• Domain-Specific Data Repository
• Local Data Repository
• Institutional Data Repository
• Generic Data Repository
AND THEY ALL WANT DIFFERENT METADATA!!!!
Comparing Repository Types:
Repository | Advantages | Disadvantages
Local data repository | Easy! No one steals your data. | No one sees it. Not compliant with requirements
Institutional repository | Not very difficult. Administrators are happy. | Data can’t easily be reused. Credit?
Generic data repository | Not very hard to do. Have complied! | Data can’t be easily reused. Credit…
Domain-specific data repository | Data can be reused. Credit! | Lot of work – for curators
Effort, Reuse, Credit, Compliance
Habit, Ease, Privacy, Control
MORE ANNOTATION
Conclusions for data annotation:
“Instead of building newer and larger weapons of mass destruction, I think mankind should try to get more use out of the ones we have”
Deep Thoughts by Jack Handey
• Let’s use the data standards we already have – and agree on using the same ones
• Work with existing data repositories in a field to come to a lowest common denominator of metadata
• Tailor the systems to be optimally easy to use for scientists in terms of metadata: add as little as you have to, as few times as you can.
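One way to read "lowest common denominator of metadata" is as the set of fields that every participating repository already asks for. The sketch below illustrates that reading only. The function names are hypothetical, and any repository names or field names passed in are made up by the caller rather than taken from a real repository schema.

```python
def common_metadata_fields(schemas: dict[str, set[str]]) -> set[str]:
    """Return the fields required by every repository (set intersection)."""
    if not schemas:
        return set()
    return set.intersection(*schemas.values())


def to_common_record(record: dict[str, str], common: set[str]) -> dict[str, str]:
    """Keep only the shared fields, so one record can go to every repository."""
    return {key: record[key] for key in sorted(common) if key in record}


def missing_fields(record: dict[str, str], common: set[str]) -> list[str]:
    """List the shared fields the scientist still has to fill in."""
    return sorted(common - set(record))
```

Given a mapping from repository name to its set of required fields, `common_metadata_fields` gives the smallest set a scientist would have to enter once, and `missing_fields` shows what is still outstanding for a particular record.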
Summary:
• Data Preservation:
– Tailor tools to fit scientists’ workflow – follow the experiment!
– We are creating repositories of shared experiments: Enable demonstrably better science!
• Data Use:
– Allow owner full control over who sees which data - create social networking context
– Collectively pioneer long-term funding options; support/develop ‘shared mission’ funding challenges
• How annotation can help reuse:
– Collaborate between (generic/specific, institutional, cross-national) data facilities to integrate repositories, enable cross-repository usage and reuse existing metadata.
Questions?
Anita de Waard, VP Research Data Collaborations
http://researchdata.elsevier.com/
Elsevier Research Data Services Goals:
1. Increase Data Preservation: Help increase the amount and quality of data preserved and shared
2. Improve Data Use: Help increase the value and usability of the data shared by increasing annotation, normalization, provenance, enabling enhanced interoperability
3. Develop Sustainable Models: Help measure and deliver credit for shared data, the researchers, the institute, and the funding body, enabling more sustainable platforms.
Guiding Principles of RDS:
• In principle, all open data stays open and URLs, front end etc. stay where they are (i.e. with repository)
• Collaboration is tailored to data repositories’ unique needs/interests - ‘service-model’ type:
– Aspects where collaboration is needed are discussed
– A collaboration plan is drawn up using a Service-Level Agreement: agree on time, conditions, etc.
• Transparent business model
• Very small (2/3 people) department; immediate communication; instant deployment of ideas.
“But aren’t you guys in it for the money?”
• Yes, we are - like most businesses…
• Is your real question perhaps: ‘Does no one want to work with you anymore because of the Open Access debate?’
• The OA debate focuses on three issues:
– IPR and Access issues
– Opaque business models
– Lack of perceived added value
E.g. BY-NC-SA? Github? ..?
E.g. Gold Open Access? Shared funding model? Commercial analytics with shared royalties?
We offer a service: only use it if it’s any good!