Opportunities for Data Exchange: optimising the conditions for data sharing Susan Reilly LERU Doctoral Summer School, 9th Jul, 2012
May 10, 2015
Thank you!
LIBER (Association of European Research Libraries) projects:
• Content: Europeana Libraries, Europeana Newspapers
• Policy: MEDOANET
• Infrastructure: APARSEN, AAA Study, ODE
LIBER & the European Research Infrastructure
Ready to ride the wave… ?
Rule #11: Don’t Publicize!
Unless the break is a well known spot, like for e.g. Lahinch, Bundoran, or Strandhill, taking photos and posting them on the Internet is regarded as unacceptable in the surfing community. If you publicize a break in this manner you draw attention to it, which in turn draws more people to it, which means a place gets more crowded and there is more aggro in the water. The more you talk about a break to those who haven’t surfed it the more damage you do to it, and yourself in the long run because the more people there are in the water the fewer waves there are for you. Think about it.
http://www.boards.ie/vbulletin/showthread.php?s=fc082712ef1354ecf7cb0e53dc71d519&t=2055828999
Reasons not to share surf info
• Other people will steal my wave
• Unethical to share, e.g. inexperienced surfers on dangerous breaks get hurt
• We won’t get recognition, e.g. local surfers lose out to visiting pros
• .............
15 petabytes (15 million gigabytes) of data annually – enough to fill more than 1.7 million dual-layer DVDs a year!
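The DVD comparison above can be checked with a quick calculation. This is only a sketch: the 8.5 GB capacity of a dual-layer DVD is an assumption not stated on the slide, and the variable names are illustrative.

```python
# Rough check of the "1.7 million DVDs" figure.
# Assumption (not from the slide): a dual-layer DVD holds about 8.5 GB.
DATA_PER_YEAR_GB = 15_000_000  # 15 petabytes, as stated on the slide
DVD_CAPACITY_GB = 8.5          # assumed nominal dual-layer capacity


def dvds_needed(data_gb: float, capacity_gb: float) -> float:
    """Return how many discs of the given capacity the data would fill."""
    return data_gb / capacity_gb


dvds = dvds_needed(DATA_PER_YEAR_GB, DVD_CAPACITY_GB)
print(f"{dvds / 1e6:.2f} million DVDs per year")
```

Under that assumption the result is a little over 1.7 million discs, consistent with the slide.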
The Vision
“With a proper scientific e-infrastructure, researchers in different domains can collaborate on the same data set, finding new insights. They can share a data set easily across the globe, but also protect its integrity and ownership. They can use, re-use and combine data, increasing productivity. They can more easily solve today’s Grand Challenges, such as climate change and energy supply. Indeed, they can engage in whole new forms of scientific inquiry, made possible by the unimaginable power of the e-infrastructure to find correlations, draw inferences and trade ideas and information at a scale we are only beginning to see.”
Now and Next
• Authentication & authorisation
• New skills
The Opportunities for Data Exchange Project
• Identify, collate, interpret and deliver evidence of emerging best practices in sharing, re-using, preserving and citing data, the drivers for these changes and barriers impeding progress, in forms suited to each audience
• Audiences: policy makers, funders, infrastructure operators, data centres, data providers and users, libraries and publishers
Steps to creating the conditions for data sharing
• Understand data sharing today
  • Collection of “success stories”, “near misses” and “honourable failures” in data sharing, re-use and preservation
• Data & scholarly communications
  • Integrating data and publications
  • Best practice in data citation
  • New roles
• Identify drivers and barriers
  • Interviews with stakeholders to seek consensus
Foto "Bell", Noordewierweg 116, Amersfoort.
Tales of Data sharing
• 21 stories from:
  • scientific communities
  • infrastructure initiatives
  • management
  • other relevant stakeholders
The Astronomical Importance of Discoverability
• Galaxy Zoo (Carolin Liefke)
• Pre-processed data shared with the public to carry out specific tasks (e.g. classifying galaxies)
• Discoverability is a major challenge in data sharing - easier, more sophisticated data mining, more complex automated processing
Hypotheses
“Without the infrastructure that helps scientists manage their data in a convenient and efficient way, no culture of data sharing will evolve.”
Stefan Winkler-Nees (German Research Foundation, DFG)
Hypotheses Expected
Category: Infrastructure
“An international research community needs an international data infrastructure and international support.”
“After decades of reports with data in their titles the community found inadequate services, almost no international support and few solutions.”
Tension between hypotheses
Category: Legislation, Education, Behaviour
“Premature data releases should not be enforced, but the mere possibility of data misinterpretation is no reason for not sharing data.”
“To avoid misuse and lack of acknowledgement of very special data, access should be restricted to skilled persons trained by the data creator.”
Hypotheses by Category
4. Attitudes
6. Policies
8. Infrastructure
10. DMPs, Citability
11. Dependency on discipline
Barriers & Drivers
[Diagram: barriers and drivers of data sharing, by theme: education; legislation; funding; culture & attitude; quality; policies; cooperation; infrastructure; publishing & visibility; data flow improvements; disciplines; accreditation & certification; career; efficiency]
Integrating Data & Publications
• 3 stakeholder groups
  • Publishers
  • Researchers
  • Libraries & data centres
How stakeholders interact
(1) Data contained and explained within the article
(2) Further data explanations in any kind of supplementary files to articles
(3) Data referenced from the article and held in data centers and repositories
(4) Data publications, describing available datasets
(5) Data in drawers and on disks at the institute
The Data Publication Pyramid
Where do you currently store your research data? (multiple answers possible)
Source: PARSE.Insight survey 2009, N = 1202
The Pyramid’s likely short-term reality:
(1) Top of the pyramid is stable but small
(2) Risk that supplements to articles turn into data dumping places
(3) Too many disciplines lack a community-endorsed data archive
(4) Estimates are that at least 75% of research data is never made openly available
The Ideal Pyramid
(1) More integration of text and data, viewers and seamless links to interactive datasets
(2) Only if data cannot be integrated in article, and only relevant extra explanations
(3) Seamless links (bi-directional) between publications and data, interactive viewers within the articles
(4) More data journals that describe datasets, data management plans and data methods
A famous paper in Nature: DNA structure - 1953
• 1 page
• 2 authors
• 1 figure
• no data
Source: V. Kiermer, Nature Publishing Group, 2011
Nature in 2001: The human genome issue
• 62 pages, 49 figures, 27 tables
Source: V. Kiermer, Nature Publishing Group, 2011
A thousand genomes – 2010
http://www.nature.com/nature/journal/v467/n7319/full/nature09534.html
Raw data: 12,145 SRA run ids submitted to Short Read Archive
Source: V. Kiermer, Nature Publishing Group, 2011
Elsevier offers gene and protein viewers from within the article, linking to data stored elsewhere
Articles: the currency of Science
Issues for researchers
• Researchers need somewhere to put data and make it safe for reuse
• Researchers need to control its sharing and access
• Researchers need the ability to integrate data and publication
• Researchers need to get credit for data as a first-class research object
• Researchers need someone to pay for the costs of data availability and re-use
Library support for the researcher
Libraries and data centres must support…
• data as first class research object: publishing, persistent identification/citation of datasets
• data description, metadata, standards documentation and retrieval
• proper documentation of data
• long-term data archiving including data curation and preservation
Availability
Findability
Interpretability
Re-usability
7 Areas of Opportunity
• Availability
• Findability
• Interpretability
• Reusability
• Citability
• Curation
• Preservation
Researcher Opportunities
Data issue: researchers’ opportunities
Availability
• Researchers demand their data be treated as first-class research objects
• Researchers loosen control over data
• Define roles of responsibility and control
Findability
• Agree convention to propose to publishers regarding data citation
• Use of persistent identifiers such as DOIs
• Ensure common citation practices
Interpretability
• Recognize that data require metadata and work towards community best practice in metadata development
Re-usability
• Be concerned about the long-term ability for secondary use and consider or seek out responsible preservation actions
Citability
• Agree a convention for data citation
• Follow metadata standards for datasets
• Use of persistent identifiers such as DOIs
Curation
• Develop sustainable and realistic data management plans
• Collaboration with public data archives
Preservation
• Develop sustainable, realistic preservation plans
• Active engagement with public data archives
Publishers’ Opportunities
Data issue: publishers’ opportunities (Chapter 3)
Availability
• Articles with data provide richer content and higher usage
• Impose stricter editorial policies about availability of underlying data, in line with the general trend among funders
• Ensure data is stored in a safe place, preferably a public repository
• Be transparent about curation and preservation of submitted data
Findability
• Ensure bi-directional links between data and publications
• Ensure common citation practices
Interpretability
• Provide services around data such as viewer apps for underlying data from within the article or interactive graphs, tables and images
• Data publications
Re-usability
• Interactive data from within articles
• Links to the relevant datasets, not just to the database
• Data publications
Citability
• Establish uniform data citation standards
• Follow metadata standards for datasets
• Use of persistent identifiers such as DOIs
• Data publications
Curation
• Transparency about curation of submitted data
• Collaboration with public data archives
Preservation
• Transparency about preservation of submitted data
• Collaboration with public data archives
Libraries’ Opportunities
Data issue: libraries’ and data centres’ opportunities (Chapter 4)
Availability
• Lower barriers to researchers to make their data available
• Integrate data sets into retrieval services
Findability
• Support of persistent identifiers
• Engage in developing common metadescription schemas and common citation practices
• Promote use of common standards and tools among researchers
Interpretability
• Support crosslinks between publications and datasets
• Provide and help researchers understand metadescriptions of datasets
• Establish and maintain knowledge base about data and their context
Re-usability
• Curate and preserve datasets
• Archive software needed for re-analysis of data
• Be transparent about conditions under which data sets can be re-used (expert knowledge needed, software needed)
Citability
• Engage in establishing uniform data citation standards
• Support and promote persistent identifiers
Curation/Preservation
• Transparency about curation of submitted data
• Promote good data management practice
• Collaborate with data creators
• Instruct researchers on discipline-specific best practices in data creation (preservation formats, documentation of experiment, …)
Q. What exactly should the role of the library be and what are the skills we need?
Data Citation: Getting Credit!
• Challenges:
  • granularity: which bits inside the dataset are being referred to
  • versioning: in case of dynamic or regularly updated data, which version is cited
  • retrievability: indicate via DOIs or accession numbers where the data are retrievable
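As an illustration of the three challenges above, here is a minimal sketch of a citation record that carries a subset (granularity), a version (versioning) and a persistent identifier (retrievability). The class, its field names, the output layout and the example DOI are all hypothetical; they are not a convention recommended by ODE or by any publisher.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DataCitation:
    """Hypothetical record for citing a dataset (illustrative only)."""
    creators: str
    year: int
    title: str
    repository: str
    identifier: str                 # retrievability: DOI or accession number
    version: Optional[str] = None   # versioning: which release is cited
    subset: Optional[str] = None    # granularity: which part is referred to

    def format(self) -> str:
        """Build a single-line citation string in an assumed layout."""
        parts = [f"{self.creators} ({self.year}). {self.title}."]
        if self.version:
            parts.append(f"Version {self.version}.")
        parts.append(f"{self.repository}.")
        if self.subset:
            parts.append(f"Subset: {self.subset}.")
        parts.append(self.identifier)
        return " ".join(parts)


# All values below are made up for the example.
example = DataCitation(
    creators="Example, A.",
    year=2012,
    title="Hypothetical survey dataset",
    repository="Example Data Archive",
    identifier="https://doi.org/10.1234/example",
    version="2.1",
    subset="Table 3, rows 1-50",
)
print(example.format())
```

Keeping version and subset as separate optional fields means a citation to a whole, static dataset stays short, while a citation to part of a regularly updated dataset remains unambiguous.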
Overview of best practices reported in literature and through interviews with experts
Some Findings
• Citations with persistent identifiers should be listed in the references/bibliography to enable tracking of citation metrics.
• Publishers need to provide guidance for authors and referees on citation of data.
• Researchers need to nurture awareness in their community of the benefits of data citation, and follow citation guidelines given by publishers and data centres.
• Many researchers do not appear to see the value and benefits of data citation. How different communities can work together to promote this activity and the status of datasets as primary research outputs and publishable works in their own right, is an issue that still needs to be addressed.
Our Relationship
Many researchers do not appear to see the value and benefits of data citation. There is a gap, which could be filled by libraries, in advocacy for data sharing, the use of subject-specific repositories, and best practice in data citation. Filling this gap would increase the number of researchers sharing and reusing data.
The issue still to be addressed is how different communities can work together to promote this activity and the status of datasets as primary research outputs and publishable works in their own right.
Now & Next
• For ODE:
  • Verify hypotheses as drivers and barriers
  • Translate findings for various target groups
• For LIBER:
  • Continue to find ways of supporting data sharing
  • Return to the framework for the collaborative data infrastructure
Now and Next
• Authentication & authorisation
• New skills
Addressing Trust and Data Curation
• AAA Study
  • Authentication and authorisation infrastructure for European researchers
  • On the Riding the Wave wish list: “Distributed and collaborative authentication, authorisation and accounting”
• Safe depositing of data
• Authenticity and provenance
• Ensure recognition
• Safe environments for collaboration
Addressing Trust and Data Curation
• Alliance for Permanent Access to the Record of Science in Europe Network (APARSEN)
• Aims to look across the excellent work in digital preservation carried out in Europe and to bring it together under a common vision
• Trust, Sustainability, Usability, Access
Back to surfing…
What was the result of all this sharing?
http://www.brain-cloud.net/wp-content/uploads/2011/05/fergal-smith.jpg
It has enabled surfers to do things they had only dreamed about
• Big wave hunters….
http://theweek.com/article/index/227955/the-biggest-wave-ever-surfed-the-mind-blowing-video
Further Reading
Riding the Wave (2011)
http://www.cordis.europa.eu/fp7/ict/e.../hlg-sdi-report.pdf
ODE/APARSEN Publications
http://www.alliancepermanentaccess.org/index.php/community/current-projects/ode/
AAA Study
https://confluence.terena.org/display/aaastudy/AAA+Study+Home+Page
Credits
Slides reused from presentations by:
Salvatore Mele (CERN)
Eefke Smit (STM)
Hans Pfeiffenberger (Helmholtz)
Most images sourced through The European Library
Thank you again!