Why Data Citation Currently Misses the Point

Mark A. Parsons and Peter A. Fox

19 December 2014


References:
[1] Joint Declaration of Data Citation Principles. https://www.force11.org/datacitation
[2] Chawla D S. 2014. Could digital badges clarify the roles of co-authors? ScienceInsider. http://news.sciencemag.org/scientific-community/2014/11/could-digital-badges-clarify-roles-co-authors. See also http://projectcredit.net.
[3] ESIP Data Stewardship Committee. http://wiki.esipfed.org/index.php/Preservation_and_Stewardship
[4] Donovan C and S Hanney. 2011. The payback framework explained. Research Evaluation 20 (3): 181-183. http://dx.doi.org/10.3152/095820211X13118583635756

What’s the use case?

In recent years, the data management community has begun to codify data citation practices and associated technologies, especially persistent identifiers. Nonetheless, digital data sets are rarely cited formally or specifically. Moreover, citation has done little to open up data in the unindexed deep Web. There are many reasons for this, but we believe one problem is that data managers expect too much from classic bibliographic-style data citation and assume false parallels between data and literature.

The idea of formalizing data citation emerged in the 1990s, primarily as a mechanism to encourage and reward data sharing by giving credit to data “authors”. The approach seemed a logical extension of the publishing incentive, but it has not provided a strong incentive to share or done much to expose hidden data. We need to reconsider the use case. Instead, workaround efforts such as “data journals” have emerged to try to make data publishing more akin to literature publishing. Meanwhile, the community is recognizing other purposes of citation, notably to help ensure that a scientific result can be verified. After much discussion among competing views, the community converged on a core set of data citation principles [1]. These principles are an important step forward, but they are primarily oriented toward formal, scholarly citation. They hint at, but do not fully consider, the myriad ways data are used.

In this poster, we take a broader view of what we are trying to accomplish with data citation by exploring several use cases around attribution, provenance, and impact. We seek to start a conversation on how we can robustly address the myriad use cases and begin to uncover the deep Web.

We suggest that we need more sophisticated, diverse, and nuanced approaches to actually address the many use cases of identifying, tracking, and enhancing data use.

1. Attribution and Credit

Provide fair and recognized attribution for all personnel involved in creating a data set.

Some concerns:
• Who is the “author” of a data set?
• What is the appropriate credit mechanism for all involved?
• Who gets credit for what?

Some ideas: Project CRediT has defined a taxonomy of contributor roles for publications and suggests using digital badges that detail what each author did for the work and link to their profiles elsewhere on the Web [2]. Can we do this for data?
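One way to picture role-based credit for data is a small record that maps each contributor to the roles they played. The sketch below is only an illustration under stated assumptions: the `DataSetCredit` class, its methods, the identifier, and the names are hypothetical and are not part of Project CRediT. Only the 14 role names come from the taxonomy shown on this poster.

```python
from __future__ import annotations

from dataclasses import dataclass, field

# The 14 contributor roles listed on this poster for the Project CRediT taxonomy.
CREDIT_ROLES = frozenset({
    "Conceptualization", "Methodology", "Software", "Validation",
    "Formal analysis", "Investigation", "Resources", "Data curation",
    "Writing – original draft", "Writing – review & editing",
    "Visualization", "Supervision", "Project administration",
    "Funding acquisition",
})


@dataclass
class DataSetCredit:
    """Hypothetical record of who did what for one data set."""
    dataset_id: str  # assumed to be a persistent identifier
    roles: dict[str, set[str]] = field(default_factory=dict)

    def add(self, person: str, role: str) -> None:
        # Reject roles outside the taxonomy so records stay comparable.
        if role not in CREDIT_ROLES:
            raise ValueError(f"unknown contributor role: {role!r}")
        self.roles.setdefault(person, set()).add(role)

    def contributors_for(self, role: str) -> list[str]:
        # Everyone credited with the given role, sorted for stable output.
        return sorted(p for p, held in self.roles.items() if role in held)


# Placeholder identifier and names, for illustration only.
credit = DataSetCredit("doi:10.0000/example")
credit.add("A. Researcher", "Investigation")
credit.add("B. Curator", "Data curation")
credit.add("A. Researcher", "Data curation")
print(credit.contributors_for("Data curation"))
```

Whether a taxonomy designed for publications covers the roles that matter for data (instrument operation or archive stewardship, for example) is exactly the open question raised above.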

Project CRediT (Contributor Roles Taxonomy)


2. Tracking and Provenance

Identify and trace all observations used in forcing and constraining a model run.

Some concerns:
• How to capture the purpose of the data in the model, e.g. forcing, assimilation, boundary conditions?
• How to reference the precise version and subset used?
• How far back does one need to go (see figure below)?
• What are references (PIDs) pointing to?

Some ideas: This is really an issue of provenance, not just reference. Full reproducibility requires being able to trace data, processes, and tools. While persistent identifiers are crucial, they are insufficient. A fuller semantic description of the provenance is required, as well as a richer description of context. See the provenance work of ESIP [3].
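To make the concerns above concrete, the following sketch records, for each model input, its identifier, version, subset, and purpose, and walks back through the data it was derived from. It is a minimal illustration under stated assumptions, not an implementation of any provenance standard: the `DataRef` and `ModelInput` classes, the `lineage` function, and the identifiers are all hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Iterator


@dataclass(frozen=True)
class DataRef:
    """Hypothetical reference to one precise version and subset of a data set."""
    pid: str                                # persistent identifier
    version: str                            # exact version used
    subset: str = "all"                     # temporal, spatial, or variable subset
    derived_from: tuple[DataRef, ...] = ()  # upstream sources, if known


@dataclass(frozen=True)
class ModelInput:
    data: DataRef
    purpose: str  # e.g. "forcing", "assimilation", "boundary conditions"


def lineage(ref: DataRef, depth: int = 0) -> Iterator[tuple[int, DataRef]]:
    """Yield (depth, reference) for a reference and everything upstream of it."""
    yield depth, ref
    for parent in ref.derived_from:
        yield from lineage(parent, depth + 1)


# Placeholder identifiers, for illustration only.
raw = DataRef("hdl:0000/raw-obs", "v1")
gridded = DataRef("hdl:0000/gridded-obs", "v2.1",
                  subset="2010-01", derived_from=(raw,))
run_inputs = [ModelInput(gridded, "forcing")]

for item in run_inputs:
    print(item.purpose)
    for depth, ref in lineage(item.data):
        print("  " * (depth + 1) + f"{ref.pid} {ref.version} [{ref.subset}]")
```

Even this toy record shows why an identifier alone falls short: the purpose, the subset, and the chain of derivation all live outside the identifier, and how far back the chain should go remains a judgment call.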

3. Impact and Return on Investment

Provide a means to track the use, impact, and value of a data set.

Some concerns:
• Data are used in many contexts that do not result in a formal article, e.g. land use planning, disaster response, agricultural prediction, policy analysis, education, etc.
• How to attribute a particular outcome to a particular person.
• Qualitative impact may be as important as quantitative, but it is hard to measure. Consider the impact of this Apollo 8 image on public consciousness.

Some ideas: In the health and social sciences, researchers have developed a “Payback Framework” with a logical model of the complete research process and categories of payback from research [4]. Can we extend this and apply it to data? If we assign credit badges as suggested in use case 1, can we aggregate the links to those badges through search engines rather than relying on constrained citation indices?

“Payback Categories”
1. Knowledge
2. Benefits to future research and research use
3. Benefits from informing policy and product development
4. Environmental and public sector benefits
5. Broader economic benefits

Rensselaer Polytechnic Institute — rpi.edu

Data citation in theory and practice

The Data Citation Principles cover purpose, function and attributes of citations. These principles recognize the dual necessity of creating citation practices that are both human understandable and machine-actionable.

1. Importance
Data should be considered legitimate, citable products of research. Data citations should be accorded the same importance in the scholarly record as citations of other research objects, such as publications.

2. Credit and Attribution
Data citations should facilitate giving scholarly credit and normative and legal attribution to all contributors to the data, recognizing that a single style or mechanism of attribution may not be applicable to all data.

3. Evidence
In scholarly literature, whenever and wherever a claim relies upon data, the corresponding data should be cited.

4. Unique Identification
A data citation should include a persistent method for identification that is machine actionable, globally unique, and widely used by a community.

5. Access
Data citations should facilitate access to the data themselves and to such associated metadata, documentation, code, and other materials, as are necessary for both humans and machines to make informed use of the referenced data.

6. Persistence
Unique identifiers, and metadata describing the data, and its disposition, should persist -- even beyond the lifespan of the data they describe.

7. Specificity and Verifiability
Data citations should facilitate identification of, access to, and verification of the specific data that support a claim. Citations or citation metadata should include information about provenance and fixity sufficient to facilitate verifying that the specific timeslice, version and/or granular portion of data retrieved subsequently is the same as was originally cited.

8. Interoperability and Flexibility
Data citation methods should be sufficiently flexible to accommodate the variant practices among communities, but should not differ so much that they compromise interoperability of data citation practices across communities.

Joint Declaration of Data Citation Principles [1]
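The fixity requirement in principle 7 can be illustrated with a checksum comparison. This is a minimal sketch under stated assumptions: the principles do not prescribe a hash function, so the use of SHA-256 here is our choice, and the function names and sample bytes are hypothetical.

```python
import hashlib


def fixity(data: bytes) -> str:
    """Return a SHA-256 checksum (an assumed choice of fixity measure)."""
    return hashlib.sha256(data).hexdigest()


def matches_citation(retrieved: bytes, cited_checksum: str) -> bool:
    """True if the retrieved bytes are identical to the bytes originally cited."""
    return fixity(retrieved) == cited_checksum


# Made-up sample content, for illustration only.
cited_bytes = b"time,value\n2014-12-19,1.0\n"
cited_checksum = fixity(cited_bytes)  # would be stored in the citation metadata

later_bytes = cited_bytes + b"2014-12-20,2.0\n"  # the data set has since grown

print(matches_citation(cited_bytes, cited_checksum))
print(matches_citation(later_bytes, cited_checksum))
```

A byte-level checksum confirms only that a file is unchanged. It says nothing about whether a reformatted or re-subsetted copy is scientifically equivalent, which is one reason fixity has to be paired with provenance.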

Figure courtesy Curt Tilmes, NASA

1. Conceptualization
2. Methodology
3. Software
4. Validation
5. Formal analysis
6. Investigation
7. Resources
8. Data curation
9. Writing – original draft
10. Writing – review & editing
11. Visualization
12. Supervision
13. Project administration
14. Funding acquisition

Initial Conclusions
• Much data use and production occur outside of the regular scholarly discourse (i.e. the literature).
• The principles of data citation are strong, and the increasing use of persistent identifiers is a significant advance, but we must think beyond bibliographic-style citation.
• It is important to have a citation approach that can readily be accepted by scholarly publishers, but we should not assume that that approach addresses other concerns. We must separate the various concerns around citation by considering multiple use cases.
• Indeed we must consider use cases in the first place! What problem are we truly trying to solve?
• Other disciplines are taking a more nuanced look at these issues outside the realm of publication. Geosciences should too.

Figure: The logical model of the Payback Framework. Stages shown: Stage 0, topic identification; Stage 1, inputs; Stage 2, research process; Stage 3, primary outputs; Stage 4, secondary outputs (policy, products); Stage 5, adoption; Stage 6, final outcomes. Other labels: Interface A (project specification); Interface B (dissemination); stock or reservoir of knowledge; the political, professional, and industrial environment and wider society; direct feedback paths; direct impact from processes and outputs to adoption.

PROJECTCREDITNET (CONTRIBUTOR TAXONOMY): J. SCOTT, L. ALLEN, A. BRAND ET AL.; BIOMED CENTRAL DESIGN (BADGE DESIGNS)/CREATIVE COMMONS 4.0
